r - Using rvest to fill out search form and download attachments
I'm trying to scrape Department of Labor data using rvest. I have a list of EINs and PNs (parameters in the web search form) that I want to search by. Here's what I have so far:
    library(rvest)
    library(magrittr)

    ## URL of the page with the search form to be populated
    site <- "http://www.efast.dol.gov/portal/app/disseminate?execution=e1s1"
    session <- html_session(site)

    form <- session %>%
      html_nodes("form") %>%
      extract2(1) %>%
      html_form() %>%
      set_values(`ein` = "060646973", # example EIN
                 `pn` = "001")        # example PN

    result <- submit_form(session, form)
This leads to a page with a list of plans. However, I'm not familiar enough with rvest to know how to navigate the result page and download the attachments. It's easily accomplished in a browser, but I want to write a script to automate the task.
Any help on navigating the resulting webpage and downloading the attachments using rvest or another package in R would be appreciated. Thank you very much!
This doesn't solve your problem (there are plenty of RSelenium responses and blog posts on how to use RSelenium), but it explains "why" you have such an ugly site (and provides a pointer to where you have to start URL-wise for an RSelenium approach to work).
The site uses "Java Server Faces" on the server side along with JavaScript to maintain state and augment navigation. You'll have to start at https://www.efast.dol.gov/portal/app/disseminate on your end so it can start the session correctly.
Once you fill in the two fields, it makes a POST request that looks like this (in "Copy as cURL" format):
    curl -i -s -k -X 'POST' \
      -H 'User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10.12; rv:43.0) Gecko/20100101 Firefox/43.0' \
      -H 'Content-Type: application/x-www-form-urlencoded; charset=utf-8' \
      -H 'Faces-Request: partial/ajax' \
      -H 'X-Requested-With: XMLHttpRequest' \
      -H 'Referer: https://www.efast.dol.gov/portal/app/disseminate?execution=e1s1' \
      -b 'JSESSIONID=0000ug27gxfj4svgfvxnui3ix9c:18fl2akcj' \
      --data-binary $'javax.faces.partial.ajax=true&javax.faces.source=form%3anextbtn&javax.faces.partial.execute=%40all&javax.faces.partial.render=form&form%3anextbtn=form%3anextbtn&form=form&planname=&sponsorname=&administratorname=&filingid=&ackid=&ein=060646973&pn=001&form%3aj_idt939%3apybcalendar_input=&form%3aj_idt942%3apyecalendar_input=&formyear=&form%3anumresults_input=100&form%3anumresults_editableinput=100&javax.faces.viewstate=e1s1' \
      'https://www.efast.dol.gov/portal/app/disseminate?execution=e1s1'
I post that to let you see some of the additional fields it submits that aren't directly in the <form> initially.
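For reference, those fields could be assembled in R as a named list before handing them to an HTTP client. This is only a sketch: `build_search_body()` is a hypothetical helper, the field names are copied from the captured request exactly as it appears in this post (their original letter case may differ, and the `j_idt...` ids are generated by the server and may change), and the view state is assumed to match the `execution` value in the URL.

```r
# Hypothetical helper: assemble the form fields seen in the captured POST.
# Field names are taken as-is from the request shown in this post.
build_search_body <- function(ein, pn, view_state = "e1s1", num_results = "100") {
  list(
    "javax.faces.partial.ajax"        = "true",
    "javax.faces.source"              = "form:nextbtn",
    "javax.faces.partial.execute"     = "@all",
    "javax.faces.partial.render"      = "form",
    "form:nextbtn"                    = "form:nextbtn",
    "form"                            = "form",
    "planname"                        = "",
    "sponsorname"                     = "",
    "administratorname"               = "",
    "filingid"                        = "",
    "ackid"                           = "",
    "ein"                             = ein,
    "pn"                              = pn,
    "form:j_idt939:pybcalendar_input" = "",
    "form:j_idt942:pyecalendar_input" = "",
    "formyear"                        = "",
    "form:numresults_input"           = num_results,
    "form:numresults_editableinput"   = num_results,
    "javax.faces.viewstate"           = view_state
  )
}

body <- build_search_body(ein = "060646973", pn = "001")
```

A list like this could be passed to `httr::POST()` with `encode = "form"`, together with the `Faces-Request: partial/ajax` header from the captured request. Whether the server accepts that outside a browser session is untested here.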
The response to that POST looks like:
    HTTP/1.1 200 OK
    X-Powered-By: Servlet/3.0
    Pragma: no-cache
    Expires: Thu, 01 Jan 1970 00:00:00 GMT
    Cache-Control: no-cache
    Cache-Control: no-store
    X-Powered-By: JSF/2.0
    X-Powered-By: JSF/2.0
    X-UA-Compatible: IE=EmulateIE7
    Content-Type: application/xml; charset=UTF-8
    Content-Language: en-US
    Date: Fri, 23 Dec 2016 13:10:26 GMT
    Content-Length: 142
    Connection: Keep-Alive

    <?xml version='1.0' encoding='utf-8'?>
    <partial-response><redirect url="/portal/app/disseminate?execution=e1s2"></redirect></partial-response>
That's a Java Server Faces AJAX redirect response, which causes you to be redirected to the results page with the actual results in a <table role="treegrid"> (provided you can target that table in the horrible HTML it returns).
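If you end up handling that partial response yourself rather than letting a browser follow it, the redirect target can be read out of the XML. A minimal sketch using xml2 (which rvest already depends on); `extract_redirect()` is a hypothetical helper name, and the sample body is the one from the response shown in this post:

```r
library(xml2)

# Hypothetical helper: return the redirect target from a JSF
# partial-response body, or NA if there is no <redirect> element.
extract_redirect <- function(xml_text) {
  doc <- read_xml(xml_text)
  xml_attr(xml_find_first(doc, "//redirect"), "url")
}

# Response body copied from the captured response.
response_body <- paste0(
  "<?xml version='1.0' encoding='utf-8'?>",
  "<partial-response>",
  "<redirect url=\"/portal/app/disseminate?execution=e1s2\"></redirect>",
  "</partial-response>"
)

redirect_path <- extract_redirect(response_body)
```

The value is a path relative to the host, so it would still need to be joined to `https://www.efast.dol.gov` before requesting it within the same session.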
You'll need to figure out how to ensure you can click on the checkboxes and download the info.
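Before dealing with the checkboxes, it may help to confirm that the results table can be targeted at all. The sketch below selects a table by its `role="treegrid"` attribute with rvest. The HTML is a made-up stand-in (the real page's markup, column names and row contents are not reproduced here), so treat the selector as a starting point only.

```r
library(rvest)

# Made-up stand-in for the results page; the real markup is far messier.
results_html <- '
<html><body>
  <table role="presentation"><tr><td>layout only</td></tr></table>
  <table role="treegrid">
    <tr><th>Plan</th><th>Year</th></tr>
    <tr><td>Example Plan A</td><td>2014</td></tr>
    <tr><td>Example Plan B</td><td>2015</td></tr>
  </table>
</body></html>'

page <- read_html(results_html)

# Select the table by its role attribute rather than by position,
# so layout tables elsewhere on the page are ignored.
grid <- html_node(page, "table[role='treegrid']")
plans <- html_table(grid)
```

With RSelenium the same CSS selector could be used to locate the table in the live page, since the results only exist after the JavaScript-driven redirect.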
Any misstep in the automated navigation will result in breaking the session. So, you may be in for some tedious trial and error to ensure your target selection actions are correct.