r - Using rvest to fill out search form and download attachments


I'm trying to scrape Department of Labor data using rvest. I have a list of EINs and PNs (parameters in the web search form) that I want to search by. Here's what I have so far:

    library(rvest)
    library(magrittr)

    ## URL of the page whose search form is to be populated
    site <- "http://www.efast.dol.gov/portal/app/disseminate?execution=e1s1"

    session <- html_session(site)

    form <- session %>%
      html_nodes("form") %>%
      extract2(1) %>%
      html_form() %>%
      set_values(`ein` = "060646973", # example EIN
                 `pn` = "001")        # example PN

    result <- submit_form(session, form)

This leads to a page where there is a list of plans. However, I'm not familiar enough with rvest to know how to navigate the result page and download the attachments. It's easily accomplished in a browser, but I want to write a script to automate the task.

Any help on navigating the resulting webpage and downloading the attachments using rvest or any other package in R would be appreciated. Thank you very much!

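This doesn't solve your problem (there are plenty of RSelenium responses and blog posts that show how to use RSelenium), but it does explain "why" you have such an ugly site to deal with (and provides a pointer to where you'll have to start, URL-wise, for an RSelenium approach to work).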

The site uses "Java Server Faces" on the server side along with JavaScript to maintain state and augment navigation. You'll have to start at https://www.efast.dol.gov/portal/app/disseminate so that the session is started correctly.
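A minimal sketch of that first step, assuming the httr package is used rather than rvest's session helpers. The helper names `execution_key` and `start_session` are hypothetical, and it is an assumption that the server redirects the base URL to one carrying an `execution` query parameter (like the `e1s1` in the question's URL).

```r
library(httr)

# Hypothetical helper: pull the "execution" flow key (e.g. "e1s1") out of a URL.
# Returns NULL when the URL has no such query parameter.
execution_key <- function(url) {
  parse_url(url)$query$execution
}

# Hypothetical helper: open the portal at its base URL so the server issues a
# fresh session cookie. httr re-uses cookies for later calls to the same host.
start_session <- function(base = "https://www.efast.dol.gov/portal/app/disseminate") {
  resp <- GET(base)
  stop_for_status(resp)
  # resp$url is the final URL after redirects (assumed to carry ?execution=...)
  list(response = resp, execution = execution_key(resp$url))
}

# Offline check of the URL parsing only; no request is made here.
key <- execution_key("https://www.efast.dol.gov/portal/app/disseminate?execution=e1s1")
```

Calling `start_session()` performs a real request, so it is left uncalled here; whether the returned key matches what the server expects for the next step would need to be confirmed against the live site.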

Once you fill in the two fields, it makes a POST request that looks like this (in "Copy as cURL" format):

    curl -i -s -k \
      -X 'POST' \
      -H 'User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10.12; rv:43.0) Gecko/20100101 Firefox/43.0' \
      -H 'Content-Type: application/x-www-form-urlencoded; charset=utf-8' \
      -H 'Faces-Request: partial/ajax' \
      -H 'X-Requested-With: XMLHttpRequest' \
      -H 'Referer: https://www.efast.dol.gov/portal/app/disseminate?execution=e1s1' \
      -b 'jsessionid=0000ug27gxfj4svgfvxnui3ix9c:18fl2akcj' \
      --data-binary $'javax.faces.partial.ajax=true&javax.faces.source=form%3anextbtn&javax.faces.partial.execute=%40all&javax.faces.partial.render=form&form%3anextbtn=form%3anextbtn&form=form&planname=&sponsorname=&administratorname=&filingid=&ackid=&ein=060646973&pn=001&form%3aj_idt939%3apybcalendar_input=&form%3aj_idt942%3apyecalendar_input=&formyear=&form%3anumresults_input=100&form%3anumresults_editableinput=100&javax.faces.viewstate=e1s1' \
      'https://www.efast.dol.gov/portal/app/disseminate?execution=e1s1'

I'm posting that so you can see some of the additional fields it submits that aren't directly in the <form> initially.
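For illustration, the same request could be assembled with httr roughly as follows. This is a sketch under assumptions, not a tested client: `build_search_fields` and `submit_search` are hypothetical names, the field names are copied as they appear in the capture (whose letter case may not match what the server expects), and the `j_idt939`/`j_idt942` ids are server-generated and likely to change.

```r
library(httr)

# Hypothetical helper: the form fields from the captured request, URL-decoded
# (%3a is ":" and %40 is "@"). Empty fields are kept to mirror the capture.
build_search_fields <- function(ein, pn, view_state = "e1s1", n_results = 100) {
  list(
    "javax.faces.partial.ajax"         = "true",
    "javax.faces.source"               = "form:nextbtn",
    "javax.faces.partial.execute"      = "@all",
    "javax.faces.partial.render"       = "form",
    "form:nextbtn"                     = "form:nextbtn",
    "form"                             = "form",
    "planname"                         = "",
    "sponsorname"                      = "",
    "administratorname"                = "",
    "filingid"                         = "",
    "ackid"                            = "",
    "ein"                              = ein,
    "pn"                               = pn,
    "form:j_idt939:pybcalendar_input"  = "",
    "form:j_idt942:pyecalendar_input"  = "",
    "formyear"                         = "",
    "form:numresults_input"            = as.character(n_results),
    "form:numresults_editableinput"    = as.character(n_results),
    "javax.faces.viewstate"            = view_state
  )
}

# Hypothetical helper: send the AJAX-style POST with the headers from the capture.
# It relies on httr re-using the session cookie from an earlier request to the host.
submit_search <- function(ein, pn, execution = "e1s1") {
  url <- "https://www.efast.dol.gov/portal/app/disseminate"
  POST(
    url,
    query  = list(execution = execution),
    body   = build_search_fields(ein, pn, view_state = execution),
    encode = "form",
    add_headers(
      "Faces-Request"    = "partial/ajax",
      "X-Requested-With" = "XMLHttpRequest",
      "Referer"          = paste0(url, "?execution=", execution)
    )
  )
}

fields <- build_search_fields("060646973", "001")
```

Only `build_search_fields()` is run here; `submit_search()` would hit the live site and needs a session that was started from the base URL first.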

The response to that POST looks like this:

    HTTP/1.1 200 OK
    X-Powered-By: Servlet/3.0
    Pragma: no-cache
    Expires: Thu, 01 Jan 1970 00:00:00 GMT
    Cache-Control: no-cache
    Cache-Control: no-store
    X-Powered-By: JSF/2.0
    X-Powered-By: JSF/2.0
    X-UA-Compatible: IE=EmulateIE7
    Content-Type: application/xml; charset=utf-8
    Content-Language: en-US
    Date: Fri, 23 Dec 2016 13:10:26 GMT
    Content-Length: 142
    Connection: keep-alive

    <?xml version='1.0' encoding='utf-8'?>
    <partial-response><redirect url="/portal/app/disseminate?execution=e1s2"></redirect></partial-response>
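A script has to follow that redirect itself, since no browser is there to do it. A small sketch with xml2, where `redirect_target` is a hypothetical helper name and the site root is taken from the URLs above:

```r
library(xml2)

# Hypothetical helper: return the redirect target from a JSF partial response,
# resolved against the site root, or NA if the response holds no <redirect>.
redirect_target <- function(xml_text, root = "https://www.efast.dol.gov") {
  doc  <- read_xml(xml_text)
  node <- xml_find_first(doc, "//redirect")
  if (inherits(node, "xml_missing")) return(NA_character_)
  url_absolute(xml_attr(node, "url"), root)
}

# The response body shown above, as a literal string.
body <- "<?xml version='1.0' encoding='utf-8'?>
<partial-response><redirect url=\"/portal/app/disseminate?execution=e1s2\"></redirect></partial-response>"

next_url <- redirect_target(body)
```

The resulting URL would then be fetched with a plain GET in the same session to reach the results page.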

That's a Java Server Faces AJAX redirect response, which causes you to be redirected to the results page, with the actual results in a <table role="treegrid"> (provided you can target that table in the horrible HTML it returns).
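Targeting that table by its `role` attribute might look like the sketch below. The HTML here is a stand-in invented for illustration (the real page is far messier and its column names are not known from this post), and `extract_results` is a hypothetical helper name.

```r
library(rvest)

# Hypothetical helper: pull the results grid out of a parsed results page.
extract_results <- function(page) {
  grid <- html_node(page, "table[role='treegrid']")
  if (inherits(grid, "xml_missing")) stop("no treegrid table found")
  html_table(grid)
}

# Stand-in HTML: one layout table to skip and one treegrid table to keep.
sample_page <- read_html('
  <html><body>
    <table role="presentation"><tr><td>layout junk</td></tr></table>
    <table role="treegrid">
      <tr><th>Plan Name</th><th>EIN</th></tr>
      <tr><td>Example Plan</td><td>060646973</td></tr>
    </table>
  </body></html>')

results <- extract_results(sample_page)
```

Note that `html_table()` converts column types, so identifiers with leading zeros such as EINs may come back as numbers; reading the cell text directly with `html_text()` avoids that.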

You'll then need to figure out how to ensure you can click the checkboxes and download the info.

Any misstep in the automated navigation will result in breaking the session. So, you may be in for some tedious trial and error to ensure your target selection actions are correct.

