
It seems the spider in the tutorial is outdated. The website has changed a bit, so all of the XPaths now capture nothing. This is easily fixable:

def parse(self, response):
    # each result is an <a> inside the "title-and-desc" container
    sites = response.xpath('//div[@class="title-and-desc"]/a')
    for site in sites:
        item = dict()
        item['name'] = site.xpath("text()").extract_first()
        item['url'] = site.xpath("@href").extract_first()
        # the description lives in a sibling <div>; default to '' so .strip() never hits None
        item['description'] = site.xpath("following-sibling::div/text()").extract_first('').strip()
        yield item

For future reference, you can always test whether a specific XPath works with the scrapy shell command, e.g. this is what I did to test this one:

$ scrapy shell "http://www.dmoz.org/Computers/Programming/Languages/Python/Books/"
# test sites xpath
response.xpath('//ul[@class="directory-url"]/li') 
[]
# ok it doesn't work, check out page in web browser
view(response)
# find correct xpath and test that:
response.xpath('//div[@class="title-and-desc"]/a')
# 21 result nodes printed
# it works!

There are a few PRs for updating this, but none of them have made it to master yet since the DMOZ layout change was quite recent. Related issue: github.com/scrapy/scrapy/issues/2090

Thank you so much. Just wanted to make sure I was following it correctly and didn't goof it up on my end. Again, thank you; now I can move forward :)

@user1901959 great, could you accept the answer so other people can navigate this more easily if they pop in here?

python - Scrapy Tutorial Example - Stack Overflow

python web-scraping scrapy web-crawler

Here is the corrected Scrapy code to extract details from DMOZ:

import scrapy

class MozSpider(scrapy.Spider):
    name = "moz"
    allowed_domains = ["www.dmoz.org"]
    start_urls = ['http://www.dmoz.org/Computers/Programming/Languages/Python/Books/',
                  'http://www.dmoz.org/Computers/Programming/Languages/Python/Resources/']

    def parse(self, response):
        sites = response.xpath('//div[@class="title-and-desc"]')
        for site in sites:
            name = site.xpath('a/div[@class="site-title"]/text()').extract_first()
            url = site.xpath('a/@href').extract_first()
            description = site.xpath('div[@class="site-descr "]/text()').extract_first('').strip()

            yield {'Name': name, 'URL': url, 'Description': description}

To export it into CSV, open the spider's folder in your Terminal/CMD and run:

scrapy crawl moz -o result.csv

Here is another basic Scrapy example, extracting company details from Yelp:

import scrapy

class YlpSpider(scrapy.Spider):
    name = "yelp"
    allowed_domains = ["www.yelp.com"]
    start_urls = ['https://www.yelp.com/search?find_desc=Java+Developer&find_loc=Denver,+CO']


    def parse(self, response):
        companies = response.xpath('//*[@class="biz-listing-large"]')

        for company in companies:
            name = company.xpath('.//span[@class="indexed-biz-name"]/a/span/text()').extract_first()
            address1 = company.xpath('.//address/text()').extract_first('').strip()
            address2 = company.xpath('.//address/text()[2]').extract_first('').strip()  # '' is the default value when nothing is found, so None is never added.
            address = address1 + " - " + address2
            phone = company.xpath('.//*[@class="biz-phone"]/text()').extract_first('').strip()
            website = "https://www.yelp.com" + company.xpath('.//@href').extract_first()

            yield {'Name': name, 'Address': address, 'Phone': phone, 'Website': website}

To export it into CSV, open the spider folder in your Terminal/CMD and type:

scrapy crawl yelp -o result.csv

python - Scrapy Tutorial Example - Stack Overflow

python web-scraping scrapy web-crawler

The Scrapy open source framework will help you scrape the web in Python. It is an open source and collaborative framework for extracting the data you need from websites.

Web scraping is closely related to web indexing, which indexes information on the web using a bot or web crawler and is a universal technique adopted by most search engines.

Web scraping with Python - Stack Overflow

python screen-scraping

First, make sure you are not violating the web-site's Terms of Use by taking the web-scraping approach. Be a good web-scraping citizen.

Next, you can set the User-Agent header to pretend to be a browser. Either provide a User-Agent in the DEFAULT_REQUEST_HEADERS setting:

DEFAULT_REQUEST_HEADERS = {
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2272.104 Safari/537.36'
}

or, you can rotate User-Agents with a middleware; I've implemented one based on the fake-useragent package.
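
A minimal sketch of such a rotating User-Agent middleware (class and module names here are illustrative, and it assumes the fake-useragent package is installed):

# middlewares.py (illustrative) -- pick a random real-browser User-Agent per request
from fake_useragent import UserAgent

class RandomUserAgentMiddleware(object):
    def __init__(self):
        self.ua = UserAgent()

    def process_request(self, request, spider):
        request.headers['User-Agent'] = self.ua.random

# settings.py (illustrative) -- enable it instead of the built-in UserAgentMiddleware
DOWNLOADER_MIDDLEWARES = {
    'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware': None,  # path on Scrapy 1.x+
    'myproject.middlewares.RandomUserAgentMiddleware': 400,
}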

Another possible problem could be that you are hitting the web-site too often; consider tweaking the DOWNLOAD_DELAY setting:

The amount of time (in secs) that the downloader should wait before downloading consecutive pages from the same website. This can be used to throttle the crawling speed to avoid hitting servers too hard.

There is another relevant setting that can have a positive impact: CONCURRENT_REQUESTS:

The maximum number of concurrent (ie. simultaneous) requests that will be performed by the Scrapy downloader.
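
For example, both can be tweaked in settings.py (the values below are just a starting point to tune):

# settings.py -- be gentler with the target site
DOWNLOAD_DELAY = 2        # seconds to wait between requests to the same site
CONCURRENT_REQUESTS = 4   # the default is 16; lower it to reduce load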

Thanks a lot! It seems that the website I want to scrape forbids fetching too often, so after I set the DOWNLOAD_DELAY to 2, it works well.

python - Twist failure when using scrapy to crawl a bbs - Stack Overfl...

python web-scraping scrapy scrapy-spider twist

I agree with 0xc0de and Joddy. PyCurl and HTTrack can do what you want. If you're using a 'Nix OS, you can also use wget.

Yes, it's possible. As a matter of fact, I finished writing a script like the one you've described a few days ago. ;) I won't post the script here, but I'll give you some hints based on what I've done.

  • Download the webpage. You can use urllib2.urlopen (Python 2.x) or urllib.request.urlopen (Python 3) for that.
  • Then after downloading the page, parse the source code of the downloaded page (well, you could also parse the source code online, but that would mean another call to urllib2.urlopen/urllib.request.urlopen) and get all the links you need. You can use BeautifulSoup for this. Then download all the content you need, using the same code you used to download the webpage in step 1 (see the sketch after this list).
  • Update the local page by changing all the href/src to the local path of your css/image/js files. You can use fileinput for inplace text replacements. Refer to this SO post for further details.

That's it. Optional stuff you may have to worry about includes connecting/downloading through a proxy (if you're behind one), creating folders, and logging.
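
A rough sketch of steps 1 and 2 above, assuming Python 3 and the bs4 package (the URL is a placeholder):

import urllib.request
from urllib.parse import urljoin
from bs4 import BeautifulSoup

url = 'http://example.com/'                      # placeholder
html = urllib.request.urlopen(url).read()        # step 1: download the page
soup = BeautifulSoup(html, 'html.parser')        # step 2: parse it

# collect the links and the css/js/image references you will need to download
links = [urljoin(url, a['href']) for a in soup.find_all('a', href=True)]
assets = [urljoin(url, tag.get('href') or tag.get('src'))
          for tag in soup.find_all(['link', 'script', 'img'])
          if tag.get('href') or tag.get('src')]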

You could also use Scrapy. Check this blog post on how to crawl the website using Scrapy.

Is it possible to get complete source code of a website including css ...

python

Scrapy is a fast high-level screen scraping and web crawling framework, used to crawl websites and extract structured data from their pages. It can be used for a wide range of purposes, from data mining to monitoring and automated testing.

web crawler - How to crawl various websites to find specific departmen...

python web-crawler

Here's the python program that worked for me:

from scrapy.selector import HtmlXPathSelector
from scrapy.spider import BaseSpider
from scrapy.http import Request

DOMAIN = 'example.com'
URL = 'http://%s' % DOMAIN

class MySpider(BaseSpider):
    name = DOMAIN
    allowed_domains = [DOMAIN]
    start_urls = [
        URL
    ]

    def parse(self, response):
        hxs = HtmlXPathSelector(response)
        for url in hxs.select('//a/@href').extract():
            if not url.startswith('http://'):
                url = URL + url
            print url
            yield Request(url, callback=self.parse)

Save this in a file called spider.py.

You can then use a shell pipeline to post process this text:

bash$ scrapy runspider spider.py > urls.out
bash$ cat urls.out| grep 'example.com' |sort |uniq |grep -v '#' |grep -v 'mailto' > example.urls

This gives me a list of all the unique urls in my site.
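
Note that BaseSpider and HtmlXPathSelector come from older Scrapy releases; on current versions the same idea could be sketched roughly like this (not the original code above):

import scrapy

DOMAIN = 'example.com'
URL = 'http://%s' % DOMAIN

class MySpider(scrapy.Spider):
    name = DOMAIN
    allowed_domains = [DOMAIN]
    start_urls = [URL]

    def parse(self, response):
        for url in response.xpath('//a/@href').extract():
            url = response.urljoin(url)   # make relative links absolute
            print(url)
            yield scrapy.Request(url, callback=self.parse)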

That's cool. You have got the answer. Now go ahead and accept the answer... and, oh yeah, there might be a "Self Learner" badge waiting for you. :)

web crawler - How do I use the Python Scrapy module to list all the UR...

python web-crawler scrapy

from scrapy.selector import HtmlXPathSelector
from scrapy.spider import BaseSpider
from scrapy.http import Request
# DmovieItem should be imported from your project's items module


class DmovieSpider(BaseSpider):
    name = "dmovie"
    allowed_domains = ["movie.douban.com"]
    start_urls = ['http://movie.douban.com/']

    def parse(self, response):
        req = []

        hxl = HtmlXPathSelector(response)
        # extract the href strings, not the selector objects
        urls = hxl.select("//a/@href").extract()

        for url in urls:
            r = Request(url, callback=self.parse_detail)
            req.append(r)

        return req

    def parse_detail(self, response):
        hxl = HtmlXPathSelector(response)
        title = hxl.select("//span[@property='v:itemreviewed']/text()").extract()
        item = DmovieItem()
        item['title'] = title[0].strip()
        return item

web crawler - how to use scrapy to crawl all items in a website - Stac...

scrapy web-crawler

You need to add item.yhd.com to the allowed_domains. The requests are getting filtered as being offsite by the OffsiteMiddleware middleware which is enabled by default.

'offsite/domains': 1,
'offsite/filtered': 2,

This middleware filters out every request whose host name isn't in the spider's allowed_domains attribute.

You have a couple of choices. If the spider doesn't define an allowed_domains attribute, or the attribute is empty, the offsite middleware will allow all requests.

If the request has the dont_filter attribute set, the offsite middleware will allow the request even if its domain is not listed in allowed domains.
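
For example (the callback name here is illustrative):

# in the spider -- note that allowed_domains is a list of strings
allowed_domains = ["yhd.com", "item.yhd.com"]

# or, to bypass the offsite filter for an individual request:
yield scrapy.Request(url, callback=self.parse_item, dont_filter=True)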

My code is: allowed_domains = "item.yhd.com"

web scraping - My scrapy does not crawl the websites, can anyone help ...

web-scraping scrapy scraper

Longer answer: what you could do is write the article ID or the article URL to a file and, during the scraping, match the ID or URL against the records in the file.

Remember to load your file only once and assign it to a variable. Don't load it during your iteration when scraping.
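
A minimal sketch of that idea inside a spider (the file name, XPath and item fields are illustrative):

import os
import scrapy

class NewsSpider(scrapy.Spider):
    name = 'news'
    start_urls = ['http://example.com/news/']
    seen_file = 'seen_urls.txt'

    def __init__(self, *args, **kwargs):
        super(NewsSpider, self).__init__(*args, **kwargs)
        # load the file once, not on every parsed page
        self.seen = set()
        if os.path.exists(self.seen_file):
            with open(self.seen_file) as f:
                self.seen = set(line.strip() for line in f)

    def parse(self, response):
        for href in response.xpath('//a[@class="article"]/@href').extract():
            url = response.urljoin(href)
            if url in self.seen:
                continue                      # already scraped in a previous run
            self.seen.add(url)
            with open(self.seen_file, 'a') as f:
                f.write(url + '\n')
            yield scrapy.Request(url, callback=self.parse_article)

    def parse_article(self, response):
        yield {'url': response.url,
               'title': response.xpath('//title/text()').extract_first()}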

Is this the common practice for incremental crawling? I thought this was a common task for most, if not all, web crawlers. In such a case, every time you have to repeat visiting all pages that have been visited before. Does Google also do it this way? It sounds like a terrible job given that the whole web is so huge.

Google usually gets a sitemap from the owner of the website. What most crawlers do, is basically go through all the links it finds on a site. Doesn't matter if it was already crawled. If the site is done correctly, an article page would have micro data snippets (vcard or something it was called) with author, published timestamp, ratings etc. Which helps the google bot a lot

De-duplication happens as post-processing step on those large companies... Not at the crawler level. This is how they attribute and penalise duplicate content. They also have refresh frequencies for each URL/domain depending on how quickly content changes on sites. They also don't care about sitemaps :-) but they respect robots.txt. Annotations are nice and I guess they might have been promoted for a while in an effort to move the industry forward to better quality markup and pave the way to more semantic content but they aren't essential neither for search nor for identifying unique content.

web crawler - Incrementally crawl a website with Scrapy - Stack Overfl...

scrapy web-crawler

Yes you can, and it's actually quite easy. Every news website has a few very important index pages, like the homepage and the category pages (e.g. politics, entertainment, etc.). There is no article that doesn't go through these pages for at least a few minutes. Scan those pages every minute or so and save just the links. Then do a diff with what you already have in your database, and a few times a day issue a crawl to scrape all the missing links. Very standard practice.
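
In code, the diff step can be as simple as a set difference (a sketch; schedule_crawl stands in for however you queue URLs for your spider):

# links_on_page: URLs just scraped from the index/category pages
# known_links:   URLs already stored in your database
new_links = set(links_on_page) - set(known_links)
for url in new_links:
    schedule_crawl(url)   # hypothetical helper: e.g. add to start_urls or a queue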

web crawler - Incrementally crawl a website with Scrapy - Stack Overfl...

scrapy web-crawler

When you have trouble replicating browser behavior with Scrapy, you generally want to look at what is being communicated differently when your browser talks to the website compared with when your spider does. Remember that a website is (almost always) not designed to be nice to web crawlers, but to interact with web browsers.

In [1]: request.headers
Out[1]:
{'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
 'Accept-Encoding': 'gzip,deflate',
 'Accept-Language': 'en',
 'User-Agent': 'Scrapy/0.24.6 (+http://scrapy.org)'}

If you examine the headers sent by a request for the same page by your web browser, you might see something like:

Request Headers

GET /blog/page/10/ HTTP/1.1    
Host: www.bornfitness.com    
Connection: keep-alive    
Accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8
User-Agent: Mozilla/5.0 (Windows NT 6.3; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/43.0.2357.124 Safari/537.36
DNT: 1    
Referer: http://www.bornfitness.com/blog/page/11/
Accept-Encoding: gzip, deflate, sdch    
Accept-Language: en-US,en;q=0.8
Cookie: fealty_segment_registeronce=1; ... ... ...

Try changing the User-Agent in your request. This should allow you to get around the redirect.
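
For example, either project-wide in settings.py or on an individual request (the values are illustrative):

# settings.py
USER_AGENT = 'Mozilla/5.0 (Windows NT 6.3; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/43.0.2357.124 Safari/537.36'

# or per request, inside the spider
yield scrapy.Request(url, headers={'User-Agent': 'my-custom-agent'}, callback=self.parse)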

Thanks, changing USER_AGENT from the default 'Scrapy/0.24.6 (+http://scrapy.org)' to 'born_fitness' (or anything) resolved the issue. Any idea why this is happening only for some URLs (/page/10/ but not /page/8/) and only for the USER_AGENT 'Scrapy/0.24.6 (+http://scrapy.org)'?

Scrapy redirects to homepage for some urls - Stack Overflow

scrapy scrapy-shell

Providing a few examples would help to make a better answer, but the general idea could be to:

  • find the "Contact Us" link
  • follow the link and extract the address

assuming you don't have any information about the web-sites you'll be given.

Let's focus on the first problem.

The main problem here is that the web-sites are structured differently and, strictly speaking, you cannot build a 100% reliable way to find the "Contact Us" page. But, you can "cover" the most common cases:

  • follow the a tag with the text "Contact Us", "Contact", "About Us", "About", etc.
  • follow links pointing to common paths like /about or /contact_us
  • follow all links that have contact, about, etc. in their text

From these you can build a set of Rules for your CrawlSpider.
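
Such rules might be sketched like this (a rough example; the patterns only cover the common cases listed above):

from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

class ContactSpider(CrawlSpider):
    name = 'contacts'
    start_urls = ['http://example.com/']      # the sites you are given

    rules = (
        # follow links whose URL looks like a contact/about page
        Rule(LinkExtractor(allow=(r'contact', r'about')),
             callback='parse_contact', follow=True),
    )

    def parse_contact(self, response):
        # address extraction from the page text would happen here
        yield {'url': response.url,
               'text': ' '.join(response.xpath('//body//text()').extract())}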

The second problem is no easier: you don't know where on the page an address is located (and maybe it doesn't exist on the page at all), and you don't know the address format. You may need to dive into Natural Language Processing and Machine Learning.

So you suggest first tracking down the Contact Us page and then looking for the address on that page. Do you think a regex to locate the pincode would be a good idea?

@DharmanshuKamra it is possible, but might not be easy to write an expression that would support all the possible address formats. Difficult to tell more. Hope that helps.

web scraping - How to scrape address from websites using Scrapy? - Sta...

web-scraping scrapy scrape

Well, your question is not well-framed. How you can use Scrapy is up to you.

1) Websites have a tree structure a->b, a->c, a->d, b->e, c->f .....etc

2) Scrapy helps you crawl through the tree recursively

3) While crawling, Scrapy lets you 'mine' for information. For that you need to learn XPaths to locate and parse the DOM values in the page

4) Parse the values and store them in your database (see the sketch below).
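
Putting points 2)-4) together, a minimal sketch (the URL and XPaths are placeholders; storing to a database would normally go into an item pipeline):

import scrapy

class TreeSpider(scrapy.Spider):
    name = 'tree'
    start_urls = ['http://example.com/']       # placeholder

    def parse(self, response):
        # 3) mine information from the current page with XPath
        yield {'url': response.url,
               'title': response.xpath('//title/text()').extract_first()}
        # 2) recurse through the tree by following links
        for href in response.xpath('//a/@href').extract():
            yield scrapy.Request(response.urljoin(href), callback=self.parse)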

Let us know exactly what you are crawling for. If you're just crawling and saving the web pages, you might as well go for software like HTTrack (http://www.httrack.com).

web crawler - How can I crawl a website using scrapy? - Stack Overflow

web-crawler web-scraping scrapy

You would need a hosting service where you could install the scrapyd service so that you can automate your screen scraping. I've never done it as I am just getting started playing around with Scrapy, but here is the information on scrapyd: http://readthedocs.org/docs/scrapy/en/latest/topics/scrapyd.html

Best practice to web hosting a website with Scrapy Spiders running in ...

web-hosting scrapy

You should consider using Scrapy instead of working directly with lxml and urllib. Scrapy is "a fast high-level screen scraping and web crawling framework, used to crawl websites and extract structured data from their pages." It's built on top of Twisted so that it can be inherently asynchronous, and as a result it is very very FAST.

I can't give you any specific numbers on how much faster your scraping will go, but imagine that your requests are happening in parallel instead of serially. You'll still need to write the code to extract the information that you want, using xpath or Beautiful Soup, but you won't have to work out the fetching of pages.

Though parallel requests are obviously faster, do keep in mind that different scraping targets will have different reactions to aggressive scraping. A scraping target can make your life quite difficult if they care to (if you don't cover your tracks, possibly even legally), so be sure to respect their wishes to the extent possible.

Python scraping data paralel requests with urllib2 - Stack Overflow

python

Title can be extracted using //title/text(), video source link via //video/source/@src:

import re
from scrapy.selector import Selector

# `response` is assumed to be available, e.g. inside a spider callback or the scrapy shell
selector = Selector(response=response)

title = selector.xpath('//title/text()').extract()[0]
description = selector.xpath('//edindex/text()').extract()
video_sources = selector.xpath('//video/source/@src').extract()[0]

# the YouTube code is embedded in the EdImage thumbnail URL
code_url = selector.xpath('//meta[@name="EdImage"]/@content').extract()[0]
code = re.search(r'(\w+)-play-small.jpg$', code_url).group(1)

print title
print description
print video_sources
print code
Best Babies Laughing Video Compilation 2012 [HD] - Guardalo
[u'Best Babies Laughing Video Compilation 2012 [HD]', u"Ciao a tutti amici di guardalo,quello che propongo oggi \xe8 un video sui neonati buffi con risate travolgenti, facce molto buffe,iniziamo con una coppia di gemellini che se la ridono fra loro,per passare subito con una biondina che si squaqqera dalle risate al suono dello strappo della carta ed \xe8 solo l'inizio.", u'\r\nBuone risate a tutti', u'Elia ride', u'Funny Triplet Babies Laughing Compilation 2014 [NEW HD]', u'Real Talent Little girl Singing Listen by Beyonce .', u'Bimbo Napoletano alle Prese con il Distributore di Benzina', u'Telecamera nascosta al figlio guardate che fa,video bambini divertenti,video bambini divertentissimi']
http://static.guardalo.org/video_image/pre-roll-guardalo.mp4
L49VXZwfup8

For the video I need to capture this code: L49VXZwfup8; it is the code of the YouTube video.

@pythoncoder okay, updated the answer, is this what you were asking about? Thanks.

@pythoncoder also note that Alex Martelli has a valid point here - if you are using Scrapy to extract the data from this single URL - then this is a huge overhead. I'm assuming you are going to extend the solution to multiple URLs of this kind.

web scraping - python scrapy extract data from website - Stack Overflo...

python web-scraping scrapy

No need for scrapy for a single-URL fetch -- just get that single page's HTML with a simpler tool (even the simplest urllib.urlopen(theurl).read()!) then analyze the HTML, e.g. with BeautifulSoup. From a simple "view source" it looks like you're looking for:

<title>Best Babies Laughing Video Compilation 2012 [HD] - Guardalo</title>
<source src="http://static.guardalo.org/video_image/pre-roll-guardalo.mp4" type='video/mp4'>
<source src="http://static.guardalo.org/video_image/pre-roll-guardalo.webm" type='video/webm'>
<source src="http://static.guardalo.org/video_image/pre-roll-guardalo.ogv" type='video/ogg'>

(the video linkS, plural, and I can't pick one because you don't tell us which format[s] you prefer!-), and

<meta name="description" content="Ciao a tutti amici di guardalo,quello che propongo oggi  un video sui neonati buffi con risate" />

(the description). BeautifulSoup makes it pretty trivial to get each one, e.g. after the needed imports

html = urllib.urlopen('http://www.guardalo.org/99407/').read()
soup = BeautifulSoup(html)
title = soup.find('title').text

etc etc (but you'll have to pick one video link -- and I see in their sources they're mentioned as "pre-rolls" so it may be that the links to actual non-ads videos are in fact not on the page but only accessible after a log-in or whatever).
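
Grabbing the description and the <source> links is just as trivial from the same soup object (assuming bs4 for find_all):

description = soup.find('meta', attrs={'name': 'description'})['content']
video_sources = [s['src'] for s in soup.find_all('source')]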

I need to capture this code for the video: L49VXZwfup8, because this is the code of the YouTube video.

web scraping - python scrapy extract data from website - Stack Overflo...

python web-scraping scrapy

Try Scrapy. It is a fast high-level screen scraping and web crawling framework, used to crawl websites and extract structured data from their pages. It can be used for a wide range of purposes, from data mining to monitoring and automated testing.

OK thanks for the responses. I am using OS 10.7.5 and having issues installing Scrapy. Will go back over the instructions though.

Nilesh - do you have any helpful tips on how to install Scrapy on a Mac? I am completely useless at understanding how to do it. I have no idea where to put the "pip install". I'm sorry, I really have no knowledge when it comes to this thing :(

screen scraping - Code for web crawling with Python 2.7.3 in mac termi...

python screen-scraping wget web-crawler