Interesting. The problem isn't a redirect: the page modifies its content using JavaScript, and urllib2 doesn't have a JS engine, it just GETs the data. If you disable JavaScript in your browser, you will see that it loads basically the same content as what urllib2 returns.

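import urllib2
from BeautifulSoup import BeautifulSoup

# Fetch the raw HTML; urllib2 does not execute any JavaScript
bostonPage = urllib2.urlopen("http://www.tripadvisor.com/HACSearch?geo=34438#02,1342106684473,rad:S0,sponsors:ABEST_WESTERN,style:Szff_6")
html = bostonPage.read()  # the response can only be read once, so keep the string
soup = BeautifulSoup(html)
# Save exactly what the server returned (a soup object has no read() method)
open('test.html', 'w').write(html)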

Opening test.html and loading the page with JavaScript disabled in your browser (easiest in Firefox: Content -> uncheck "Enable JavaScript") gives identical result sets.
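One thing that may help explain this: everything after the `#` in that URL is a fragment, and HTTP clients do not send the fragment to the server. So the server only ever sees `/HACSearch?geo=34438`, and anything encoded in the fragment has to be interpreted by scripts in the browser. That last part is my assumption about this site, I haven't read its scripts. A small sketch that splits the URL to show which part is which:

```python
try:
    from urllib.parse import urlsplit  # Python 3
except ImportError:
    from urlparse import urlsplit  # Python 2

url = ("http://www.tripadvisor.com/HACSearch?geo=34438"
       "#02,1342106684473,rad:S0,sponsors:ABEST_WESTERN,style:Szff_6")

parts = urlsplit(url)

# This is the part that is actually sent to the server in the request
request_target = parts.path + "?" + parts.query

# This part stays on the client; only code running in the browser can use it
fragment = parts.fragment

print("sent to server:", request_target)
print("client only:   ", fragment)
```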

But if we still need to scrape it with the JavaScript applied, we can use Selenium (http://seleniumhq.org/). It's mainly used for testing, but it's easy to use and has fairly good docs.
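A minimal sketch of what that could look like with Selenium's Python bindings, assuming the `selenium` package and Firefox with its driver are installed. The function name `fetch_rendered_html` and its `driver` parameter are my own, not part of Selenium. Note that `get()` returns once the page has loaded, so content fetched later by scripts may need an explicit wait that this sketch does not include.

```python
def fetch_rendered_html(url, driver=None):
    """Load url in a real browser and return the DOM as HTML after scripts ran.

    `driver` is a hypothetical hook so another WebDriver instance can be
    passed in; by default a Firefox instance is started.
    """
    if driver is None:
        # Imported here so the module loads even where Selenium is missing
        from selenium import webdriver
        driver = webdriver.Firefox()
    try:
        driver.get(url)            # navigates and waits for the page to load
        return driver.page_source  # current DOM serialized, not the raw response
    finally:
        driver.quit()              # always close the browser, even on errors
```

Calling `fetch_rendered_html(...)` with the URL from the question and writing the result to a file gives you something to compare against test.html from the urllib2 version.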

As a side note, the same fetch with bs4 (BeautifulSoup 4), reading the response once and reusing the string:

>>> import urllib2
>>> from bs4 import BeautifulSoup
>>> 
>>> bostonPage = urllib2.urlopen("http://www.tripadvisor.com/HACSearch?geo=34438#02,1342106684473,rad:S0,sponsors:ABEST_WESTERN,style:Szff_6")
>>> value = bostonPage.read()
>>> soup = BeautifulSoup(value)
>>> open('test.html', 'w').write(value)

Thanks for your answer. Let me try to reiterate some of that: so when you click on the different categories like "Luxury" or "Families", the changes you see on the page are generated solely through JavaScript? (i.e. the code for the page never changes?) And what I need to do is find a tool that will run the JS and then return that content? What is easiest/best from what you recommended? I feel an API is not appropriate for what I'm trying to do in this case.

Selenium may be the best way to do this. It drives an actual browser, fully automated, so it needs a browser installed along with at least a virtual frame buffer or a desktop environment, since it will launch one ...

python urllib2 - wait for page to finish loading/redirecting before sc...

python urllib2