RSelenium text extraction not working in a loop

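I am trying to extract text from the case database on this government website, https://www.te.gob.mx/buscador, using RSelenium.

I have managed to get RSelenium to extract the text I am interested in when I run the steps manually and store it in a data frame. However, I would like to do this with a for loop.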

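The browser opens a page that looks like this: Sample Page from MX Website

It then clicks the first link under "Resumen" on the site, which opens a page that looks like this: Sample Resumen Subpage

I extract some text from each "Resumen" subpage and store it in a data frame.

This is what my code looks like: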

setwd("C:/Users/ohenr/Dropbox/10-19 Research Projects/16 R")
getwd()
pacman::p_load(rvest, tidyverse, stringr, RSelenium, data.table) #loads all the packages in one command

url <- "https://www.te.gob.mx/buscador"


# Setting up the remote driver

remDr <- remoteDriver(remoteServerAddr = "192.168.99.100", port = 4445L,
                       browserName = "firefox")

# Input this into the terminal to start the firefox image in docker
# docker run -d -p 4445:4444 selenium/standalone-firefox:2.53.0

# Open the remote Driver (open firefox in R Selenium)
remDr$open()

# Navigating through the MX resumen website
remDr$navigate(url)

# Click the regions on the left side of the webpage

region_lists <- remDr$findElements(using = "css selector", ".salas-tree")

region_lists[[1]]$clickElement()

#List resumen elements from the first page
res <- remDr$findElements("css selector", "#resumenResultados")

# number of resumen on the first page
res_n <- length(res)

#build a dataframe that has that same number of observations
resumen.df <- data.frame(expediente = character(res_n),
                          entidad = character(res_n),
                          turno = character(res_n),
                          res_text = character(res_n),
                          stringsAsFactors = F)

for (j in 1:res_n) {

    res[[j]]$clickElement() # click on the jth resumen

    elements <- remDr$findElements(using = "css selector", "h4") #extract the h4 elements from the resumen subpage

    expediente <- unlist(elements[[1]]$getElementText())

    entidad <- unlist(elements[[8]]$getElementText())

    turno <- unlist(elements[[5]]$getElementText())

    res_text <- remDr$findElement("css selector", "#swal2-content > div > div > p")

    res_text <- unlist(res_text$getElementText())

    resumen.df$expediente[j] <- expediente

    resumen.df$entidad[j] <- entidad

    resumen.df$turno[j] <- turno

    resumen.df$res_text[j] <- res_text

#click the okay button on the page to exit the resumen subpage
    button <- remDr$findElement("css selector", "body > div.swal2-container.swal2-center.swal2-fade.swal2-shown > div > div.swal2-actions > button.swal2-confirm.swal2-styled")

    button$clickElement()

  }

However, as soon as I run the loop, I get this error:

Error in elements[[1]] : subscript out of bounds

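I think the problem must be related to the indexing in the loop, because I can fill the data frame one row at a time when I run the steps by hand. Is there a way to iterate over this process correctly?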

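Solution:

The suggestion from @SlowLearning in the comments ultimately solved the problem, but I had to add Sys.sleep(2) in more places to make it work. The script was running faster than the website could load.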
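The "subscript out of bounds" error is consistent with findElements() returning an empty list because the subpage had not rendered yet, which leaves elements[[1]] with nothing to index. As an alternative to fixed pauses, here is a minimal sketch of a polling helper. The name wait_for, its arguments, and the timeout values are my own assumptions and not part of RSelenium. It retries a lookup function until the result is non-empty or a time limit passes, and it is shown with a fake lookup so it runs without a browser.

```r
# Hypothetical helper (not part of RSelenium): retry `find` until it returns
# a non-empty result, or give up after `timeout` seconds.
wait_for <- function(find, timeout = 10, interval = 0.5) {
  deadline <- Sys.time() + timeout
  repeat {
    # Treat any error raised by the lookup as "not ready yet"
    result <- tryCatch(find(), error = function(e) NULL)
    if (!is.null(result) && length(result) > 0) {
      return(result)
    }
    if (Sys.time() >= deadline) {
      stop("Timed out after ", timeout, " seconds waiting for the page")
    }
    Sys.sleep(interval)
  }
}

# Stand-in for a slow page: returns an empty list twice, then one "element"
calls <- 0
fake_find <- function() {
  calls <<- calls + 1
  if (calls < 3) list() else list("h4 element")
}

found <- wait_for(fake_find, timeout = 5, interval = 0.1)
```

In the scraping loop this could stand in for the Sys.sleep(2) after the click, for example elements <- wait_for(function() remDr$findElements("css selector", "#swal2-content h4")), assuming remDr is the open driver from the script above.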

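# Read the number of result pages from the last pagination link
n <- remDr$findElement(using = "css selector", "#resultadosgsa_paginate > span > a:nth-child(7)")

n <- as.numeric(unlist(n$getElementText()))

n

# global.df must exist before the loop appends to it
# (assumed here to start empty; rbindlist() skips NULL entries)
global.df <- NULL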

for (i in 1:n) {

  # click through each page in the region, collecting the text
    res <- remDr$findElements("css selector", "#resumenResultados")

    res_n <- length(res)

    resumen.df <- data.frame(expediente = character(res_n),
                          entidad = character(res_n),
                          turno = character(res_n),
                          res_text = character(res_n),
                          stringsAsFactors = F)

    for (j in seq_len(res_n)) { # seq_len() skips the loop when a page has no results
      Sys.sleep(2)

      res[[j]]$clickElement()

      Sys.sleep(2)

      ex_location <- remDr$findElement("css selector", "#swal2-content > div > div > h4:nth-child(1)")

      expediente <- unlist(ex_location$getElementText())

      en_location <- remDr$findElement("css selector", "#swal2-content > div > div > h4:nth-child(8)")

      entidad <- unlist(en_location$getElementText())

      tu_location <- remDr$findElement("css selector", "#swal2-content > div > div > h4:nth-child(5)")

      turno <- unlist(tu_location$getElementText())

      te_location <- remDr$findElement("css selector", "#swal2-content > div > div > p")

      res_text <- unlist(te_location$getElementText())

      resumen.df$expediente[j] <- expediente

      resumen.df$entidad[j] <- entidad

      resumen.df$turno[j] <- turno

      resumen.df$res_text[j] <- res_text

      Sys.sleep(2)

      # close out the subpage and wait before opening the next one

      button <- remDr$findElement("css selector", "body > div.swal2-container.swal2-center.swal2-fade.swal2-shown > div > div.swal2-actions > button.swal2-confirm.swal2-styled")

      button$clickElement()

    }

  global.list <- list(global.df, resumen.df)
  global.df <- rbindlist(global.list)

  Sys.sleep(2)

  next.page.button <- remDr$findElement("css selector", "#resultadosgsa_next")

  next.page.button$clickElement()

  Sys.sleep(2)
}
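
The loop above rebinds global.df on every page. A sketch of an alternative is to store each page's data frame in a list and bind once at the end. It uses base R only and made-up sample rows, so it runs without a browser. The names pages and make_page are hypothetical. With data.table loaded, rbindlist(pages) would do the same job as do.call(rbind, pages).

```r
# Hypothetical stand-in for scraping one results page: returns a data frame
# with the same columns as resumen.df, filled with made-up values.
make_page <- function(i, rows = 2) {
  data.frame(expediente = paste0("EXP-", i, "-", seq_len(rows)),
             entidad = rep("entidad", rows),
             turno = rep("turno", rows),
             res_text = rep("texto", rows),
             stringsAsFactors = FALSE)
}

n <- 3                                  # number of result pages (assumed)
pages <- vector("list", n)              # one slot per page

for (i in seq_len(n)) {
  pages[[i]] <- make_page(i)            # in the real loop: the filled resumen.df
}

# Bind all pages in a single step after the loop
global.df <- do.call(rbind, pages)
```

Collecting into a preallocated list also means the result does not depend on global.df existing before the loop starts.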
