
Scraping with JS support:

You can also use the Python library dryscrape to scrape JavaScript-driven websites.

To give an example, I created a sample page with the following HTML code (link):

<!DOCTYPE html>
<html>
<head>
  <meta charset="utf-8">
  <title>Javascript scraping test</title>
</head>
<body>
  <p id='intro-text'>No javascript support</p>
  <script>
     document.getElementById('intro-text').innerHTML = 'Yay! Supports javascript';
  </script> 
</body>
</html>
Without JavaScript support the page reads "No javascript support"; with JavaScript support it reads "Yay! Supports javascript".

First, scraping with requests alone (no JavaScript support):
>>> import requests
>>> from bs4 import BeautifulSoup
>>> response = requests.get(my_url)
>>> soup = BeautifulSoup(response.text, 'html.parser')
>>> soup.find(id="intro-text")
<p id="intro-text">No javascript support</p>
>>> import dryscrape
>>> from bs4 import BeautifulSoup
>>> session = dryscrape.Session()
>>> session.visit(my_url)
>>> response = session.body()
>>> soup = BeautifulSoup(response, 'html.parser')
>>> soup.find(id="intro-text")
<p id="intro-text">Yay! Supports javascript</p>

Is the performance with JS support much worse?

Any alternatives for those of us programming within Windows?

Web-scraping JavaScript page with Python - Stack Overflow

python web-scraping urlopen

Instead of trying to reverse engineer it, you can use ghost.py to directly interact with JavaScript on the page.

If you run the following query in a Chrome console, you'll see it returns everything you want.

document.getElementsByClassName('inline-text-org');
[<div class="inline-text-org" title="University of Manchester">University of Manchester</div>, 
 <div class="inline-text-org" title="University of California Irvine">University of California ...</div>
  etc...

You can run JavaScript through Python against a real, live DOM using ghost.py.

from ghost import Ghost
ghost = Ghost()
page, resources = ghost.open('http://academic.research.microsoft.com/Search?query=lander')
result, resources = ghost.evaluate(
    "document.getElementsByClassName('inline-text-org');")

web scraping dynamic content with python - Stack Overflow

python web-scraping screen-scraping

First of all, scraping and parsing JS from pages is not trivial. It can, however, be vastly simplified by using a headless web client instead, which will parse everything for you just like a regular browser would. The only difference is that its main interface is not a GUI/HMI but an API.

An example of this is Ghost.py, a WebKit web client written in Python.

There are of course other alternatives. You can use Qt's QWebKit for the same purpose, as shown in this example.

You can find a more complete list of headless browsers here.
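
If what you need is the whole rendered page rather than a single evaluated expression, here is a minimal sketch; the URL is a placeholder, and the content attribute is an assumption on my part (in the Ghost.py versions I've seen, it holds the current frame's HTML after JavaScript has run):

from ghost import Ghost

ghost = Ghost()
# load the page; Ghost's WebKit engine executes its JavaScript
page, resources = ghost.open('http://example.com/js-heavy-page')  # placeholder URL
html = ghost.content  # rendered HTML of the current frame (assumed attribute)
print(html)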

I am able to get Ghost to work and load the page, but what should I do to get the whole webpage out of it? The documentation describes a function get_page, but it is not there, even in the code itself.

python - Scraping HTML and JavaScript - Stack Overflow

javascript python parsing web-scraping web-crawler

You are going to have to make the same request (using the Requests library) that the JavaScript is making. You can use any number of tools (including those built into Chrome and Firefox) to inspect the HTTP request that is coming from JavaScript, and simply make that request yourself from Python.
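
As a rough sketch of that idea (the endpoint, parameters, and header here are made up for illustration; copy the real ones from the browser's network inspector):

import requests

# hypothetical endpoint copied from the browser's network tab
url = 'https://example.com/api/data'
params = {'page': 1}
headers = {'X-Requested-With': 'XMLHttpRequest'}  # some endpoints expect this

response = requests.get(url, params=params, headers=headers)
print(response.json())  # many AJAX endpoints return JSON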

No, Requests is an HTTP library. It cannot run JavaScript.

Where did you look, and what tool did you use to find it, Ben? I'm trying this right now and I'm stuck.

Ben, can you post your solution, please?

web scraping - Using python Requests with javascript pages - Stack Overflow

python web-scraping python-requests

There's also dryscrape (a library written by me, so the recommendation is a bit biased, obviously :), which uses a fast WebKit-based in-memory browser to navigate around. It understands JavaScript too, but is a lot more lightweight than Selenium.

Thanks a lot. I'll try it.

Scraping javascript-generated data using Python - Stack Overflow

javascript python screen-scraping web-scraping

It sounds like the data you're really looking for can be accessed via a secondary URL called by some JavaScript on the primary page.

While you could try running the JavaScript on the server to handle this, a simpler approach might be to load the page in Firefox and use a tool like Charles or Firebug to identify exactly what that secondary URL is. Then you can just query that URL directly for the data you are interested in.
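
As a rough sketch of that second step (the secondary URL is hypothetical; substitute whatever Charles or Firebug shows you):

import requests

# hypothetical secondary URL discovered with Charles/Firebug
secondary_url = 'http://example.com/ajax/data?id=42'
response = requests.get(secondary_url)
data = response.json()  # assuming the endpoint returns JSON
print(data)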

Hi Stephen, would you be able to explain this in a little more detail, maybe using a simple example?

Web-scraping JavaScript page with Python - Stack Overflow

python web-scraping urlopen

Scraping JavaScript-based webpages is possible with Selenium. In particular, try the Selenium WebDriver.
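
A minimal sketch of that approach (the URL and element id are placeholders; assumes the Firefox driver is installed):

from selenium import webdriver
from bs4 import BeautifulSoup

driver = webdriver.Firefox()
driver.get('http://example.com/js-page')  # placeholder URL

# at this point Selenium has already executed the page's JavaScript
soup = BeautifulSoup(driver.page_source, 'html.parser')
print(soup.find(id='some-element'))  # hypothetical element id

driver.quit()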

I tried Selenium. I do not want to mimic user actions. As I see it from running a sample program, it opens a browser window and mimics the actions. I do not want that; I want to extract the data from the webpage into my code.

You don't have to mimic user actions if you don't need to. Just download the page and parse it; the point of using Selenium is that it processes the JavaScript for you.

Screen Scraping a Javascript based webpage in Python - Stack Overflow

python screen-scraping beautifulsoup web-scraping

Use Python Selenium; it's pretty easy. If you don't like the graphical browser it opens, use PhantomJS with Selenium. I have a video on YouTube showing a usage case: search "Scraping javascript forms with Python", channel "KeyStrokes".
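
A minimal sketch of the PhantomJS variant (the URL is a placeholder; assumes the phantomjs binary is installed and on your PATH):

from selenium import webdriver

# PhantomJS is headless, so no browser window opens
driver = webdriver.PhantomJS()
driver.get('http://example.com')  # placeholder URL
html = driver.page_source  # HTML after the page's JavaScript has executed
driver.quit()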

web scraping - Using python Requests with javascript pages - Stack Overflow

python web-scraping python-requests

import urllib.parse
import requests
import json

url = "https://daphnecaruanagalizia.com/2017/10/crook-schembri-court-today-pleading-not-crook/"

encoded = urllib.parse.quote_plus(url)
# encoded = urllib.quote_plus(url)  # for Python 2, use this (with plain 'import urllib') instead of the previous line
j = requests.get('https://count-server.sharethis.com/v2.0/get_counts?url=%s' % encoded).text
obj = json.loads(j)
print(obj['clicks']['twitter'] + obj['shares']['twitter'])

# => 5008

Inspecting the webpage, you can see that it makes a request to this:

https://count-server.sharethis.com/v2.0/get_counts?url=https%3A%2F%2Fdaphnecaruanagalizia.com%2F2017%2F10%2Fcrook-schembri-court-today-pleading-not-crook%2F&cb=stButtons.processCB&wd=true

If you paste it into your browser, you'll see all your answers. Playing a bit with the URL, you can see that removing the extra parameters gives you clean JSON.

So, as you can see, you just have to replace the url parameter of the request with the URL of the page whose Twitter counts you want.

Thanks a lot. I can confirm that this worked. Could you please give me some more information on how you found the request? I would like to do the same thing for the number of comments. I am using Chrome and was looking for the request through the Network tab. Are there any tricks that help in identifying the desired request?

You could filter on "javascript", "XHR", and "WS". I just opened the "response" tab and scrolled through the requests until I found "twitter".

For the number of comments, you should check Disqus.

Also, on the previous page (daphnecaruanagalizia.com) you can see the number of comments per post, and it's not JavaScript, so it's quite easy to get. Check BeautifulSoup.

Scraping elements generated by javascript queries using python - Stack Overflow

javascript python html web-scraping

You'll want to use urllib, requests, BeautifulSoup and the Selenium web driver in your script for different parts of the page (to name a few). Sometimes you'll get what you need with just one of these modules; sometimes you'll need two, three, or all of them. Sometimes you'll need to switch off the JS in your browser, and sometimes you'll need header info in your script. No two websites can be scraped the same way, and no website can be scraped the same way forever without your having to modify the crawler, usually after a few months. But they can all be scraped! Where there's a will, there's a way. If you need scraped data continuously into the future, just scrape everything you need and store it in .dat files with pickle (see the sketch below). Keep searching for how to try what with these modules, and keep pasting your errors into Google.
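
A minimal sketch of that pickle-based storage (file name and data are placeholders):

import pickle

# placeholder for whatever your crawler collected
scraped = {'title': 'Example page', 'links': ['http://example.com/a']}

# persist the results so future runs don't have to re-scrape
with open('scraped.dat', 'wb') as f:
    pickle.dump(scraped, f)

# later, load them back
with open('scraped.dat', 'rb') as f:
    scraped = pickle.load(f)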

Web-scraping JavaScript page with Python - Stack Overflow

python web-scraping urlopen

You could do something similar to the following after launching a Selenium web browser, passing driver.page_source to the BeautifulSoup library (unfortunately I cannot test this at work with firewalls in place):

soup = BeautifulSoup(driver.page_source, 'html.parser')

shares = soup.find('span', {'class': 'st_twitter_hcount'}).find('span', {'class': 'stBubble_hcount'})

Scraping elements generated by javascript queries using python - Stack Overflow

javascript python html web-scraping

#!/usr/bin/env python
from contextlib import closing
from selenium.webdriver import Firefox # pip install selenium
from selenium.webdriver.support.ui import WebDriverWait

# use firefox to get page with javascript generated content
with closing(Firefox()) as browser:
    browser.get(url)  # url is the address of the page to fetch
    button = browser.find_element_by_name('button')
    button.click()
    # wait for the page to load
    WebDriverWait(browser, timeout=10).until(
        lambda x: x.find_element_by_id('someId_that_must_be_on_new_page'))
    # store it to string variable
    page_source = browser.page_source
print(page_source)

Is the WebDriverWait with someId_that_must_be_on_new_page necessary? Could it be done with just a sleep or delay function? And is it possible to set the user-agent string?

One problem remains: the web page has a select element, and something has to be selected; if nothing is selected, the button won't work. Also, is it necessary to open and close Firefox? Won't this work without a GUI?

You could use any condition you like, e.g., x.title == 'New Title'. You could probably modify the user agent by using an appropriate Firefox profile.

Here's an example of how to select an option. .quit() is not necessary.

The method select_option(self, selector, value) takes a selector parameter. I'm not sure what this parameter should be. Let's say I want to click the option with value = 100 of the select with id = 'sel_id' and name = 'sel_name'. Could this be expressed in CSS?

Get page generated with Javascript in Python - Stack Overflow

javascript python html download urllib2

If a lot of dynamic, JavaScript-driven loading is involved in rendering the page, things get more complicated.

Basically, you have three ways to crawl the data from the website:

  • using browser developer tools, see what AJAX requests go out during a page load, then simulate these requests in your crawler; you will probably need the json and requests modules (see the sketch after this list)
  • use tools that drive a real browser, like Selenium; in this case you don't care how the page is loaded, you'll get what a real user sees (note: you can use a headless browser too)
  • see if the website provides an API (e.g. the Walmart API)

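A rough sketch of the first option (the endpoint and payload are hypothetical; copy the real ones from your browser's developer tools):

import requests

# replay an AJAX call observed in the browser's developer tools
# (endpoint and payload are made up for illustration)
response = requests.post(
    'https://example.com/api/search',
    json={'query': 'laptops', 'page': 1},
)
data = response.json()  # assuming the endpoint returns JSON
print(data)
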
Also take a look at the Scrapy web-scraping framework. It doesn't handle AJAX calls either, but it is really the best tool in the web-scraping world I've ever worked with.

python - Scraping HTML and JavaScript - Stack Overflow

javascript python parsing web-scraping web-crawler

The issue here is that the value in the textbox is added by JavaScript. When the page loads, the value in the text field is 0. So even if you scrape, you won't get the computed value; the scraped content contains this:

<input class="distanceinput2" id="totaldistancemiles" name="totaldistancemiles" readonly="readonly" size="5" title="Distance in miles" type="text" value="0"/>
<input class="distanceinput2" id="totaldistancekm" name="totaldistancekm" readonly="readonly" size="5" title="Distance in kilometers" type="text" value="0"/>
<input class="distanceinput2" id="nauticalmiles" name="nauticalmiles" readonly="readonly" size="5" title="Distance in nautical miles" type="text" value="0"/>

So, if you want to get the value as shown on the website, it is not possible by plain scraping.

You could try PhantomJS, which acts like a headless browser. I haven't experimented with it, but it looks promising. Here is a link that could help.
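
A hedged sketch of what that could look like through Selenium's PhantomJS driver (the element ids come from the HTML above; the URL is a placeholder):

from selenium import webdriver

driver = webdriver.PhantomJS()
driver.get('http://example.com/distance-page')  # placeholder URL

# once PhantomJS has run the page's JavaScript, the inputs hold real values
miles = driver.find_element_by_id('totaldistancemiles').get_attribute('value')
km = driver.find_element_by_id('totaldistancekm').get_attribute('value')
print(miles, km)

driver.quit()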

Yeah, I just figured that out. The data is coming from an AJAX request. Any idea how to go about this? Any other library that I could use?

python - Scraping dynamic html fields with lxml - Stack Overflow

python html web-scraping lxml lxml.html

While Selenium might seem tempting and useful, it has one main problem that can't be fixed: performance. By computing every single thing a browser does, it needs a lot more power. Even PhantomJS does not compete with a simple request. I recommend using Selenium only when you really need to click buttons. If you only need JavaScript, I recommend PyQt (check https://www.youtube.com/watch?v=FSH77vnOGqU to learn it).
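
For reference, the classic PyQt pattern looks roughly like this. This is my own reconstruction using PyQt4's QtWebKit, not code from the video; treat the class and signal names as assumptions if your PyQt version differs:

import sys
from PyQt4.QtGui import QApplication
from PyQt4.QtCore import QUrl
from PyQt4.QtWebKit import QWebPage

class Render(QWebPage):
    """Load a URL, let its JavaScript run, and keep the rendered HTML."""
    def __init__(self, url):
        self.app = QApplication(sys.argv)
        QWebPage.__init__(self)
        self.loadFinished.connect(self._load_finished)
        self.mainFrame().load(QUrl(url))
        self.app.exec_()  # block until loadFinished fires

    def _load_finished(self, result):
        self.html = self.mainFrame().toHtml()
        self.app.quit()

html = Render('http://example.com').html  # placeholder URL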

However, if you want to use Selenium, I recommend Chrome over PhantomJS. Many users have problems with PhantomJS where a website simply does not work in Phantom. Chrome can be headless (non-graphical) too!

First, make sure you have installed ChromeDriver, which Selenium depends on for using Google Chrome.

Then, make sure you have Google Chrome version 60 or higher by checking chrome://settings/help.

Now, all you need to do is the following code:

from selenium.webdriver.chrome.options import Options
from selenium import webdriver

chrome_options = Options()
chrome_options.add_argument("--headless")

driver = webdriver.Chrome(chrome_options=chrome_options)

If you do not know how to use Selenium, here is a quick overview:

driver.get("https://www.google.com") #Browser goes to google.com

Finding elements: Use either the ELEMENTS or ELEMENT method. Examples:

driver.find_element_by_css_selector("div.logo-subtext") #Find your country in Google. (singular)
  • driver.find_element(s)_by_css_selector(css_selector) # Every element that matches this CSS selector
  • driver.find_element(s)_by_class_name(class_name) # Every element with the following class
  • driver.find_element(s)_by_id(id) # Every element with the following ID
  • driver.find_element(s)_by_link_text(link_text) # Every <a> element with the full link text
  • driver.find_element(s)_by_partial_link_text(partial_link_text) # Every <a> element with partial link text
  • driver.find_element(s)_by_tag_name(tag_name) # Every element with the tag name argument

OK! I found an element (or a list of elements). But what do I do now?

Here are the methods you can do on an element elem:

  • elem.tag_name # Could return 'button' for a <button> element.
  • elem.get_attribute("id") # Returns the ID of an element.
  • elem.text # The inner text of an element.
  • elem.clear() # Clears a text input.
  • elem.is_selected() # Is this radio button or checkbox element selected?
  • elem.location # A dictionary representing the X and Y location of an element on the screen.
  • elem.submit() # Submit the form in which elem takes part.
  • driver.back() # Click the Back button.
  • driver.forward() # Click the Forward button.
  • driver.refresh() # Refresh the page.
  • driver.quit() # Close the browser including all the tabs.

web scraping - Using python Requests with javascript pages - Stack Overflow

python web-scraping python-requests

In Python, I think Selenium 1.0 is the way to go. It's a library that allows you to control a real web browser from your language of choice.

You need to have the web browser in question installed on the machine your script runs on, but it looks like the most reliable way to programmatically interrogate websites that use a lot of JavaScript.

Is there a way to do it with requests and Beautiful Soup itself? I have been using requests, and it works fine in every other case but this. Please let me know if requests can also solve this.

@Shaardool: solve what thing? Scraping HTML that's generated in the browser by JavaScript? No, for that you need something that runs the JavaScript so that it can produce the HTML. Beautiful Soup doesn't run JavaScript.

Thanks for the insight. Can the Requests library do it? It works well with AJAX requests to the server, but I want to know if it can work with JavaScript that creates HTML too. I didn't find any such thing in its documentation, though.

@Shaardool I'm not familiar with the Requests library. You'll likely get an answer quicker by asking a new question specifically about that library.

scrape html generated by javascript with python - Stack Overflow

javascript python browser screen-scraping

To scrape JS-rendered pages, we need a browser that has a JavaScript engine (i.e., supports JavaScript rendering).

Options like Mechanize and urllib2 will not work, since they DO NOT support JavaScript.

Set up PhantomJS to run with Selenium. After installing the dependencies for both of them (refer to this), you can use the following code as an example to fetch the fully rendered website:

from selenium import webdriver
from bs4 import BeautifulSoup

driver = webdriver.PhantomJS()
driver.get('http://jokes.cc.com/')
soupFromJokesCC = BeautifulSoup(driver.page_source, 'html.parser') # page_source fetches page after rendering is complete
driver.save_screenshot('screen.png') # save a screenshot to disk

driver.quit()

scrape html generated by javascript with python - Stack Overflow

javascript python browser screen-scraping