
The RSS feed itself probably contains just the first page of data. You can access the original article from the 'link' attribute in each RSS item (at least that's what it's called in feedparser). Something like:

import feedparser
from urllib2 import urlopen

feed = feedparser.parse('http://reddit.com/.rss')
for entry in feed['entries']:
    content = urlopen(entry['link']).read()
    # Do something with content

python - How to read all articles from a RSS feed? - Stack Overflow

python rss feed feedparser rss-reader

Ian's answer carries a lot of weight. You could buy all those books and read them all and still know nothing about web development. What you really need to do is start with something that is not nearly as big as Stack Overflow. Start with your personal site. Read some web dev/CSS articles on A List Apart. Learn about doctypes and why to use them. Add some CSS and change the colors around. Go over to QuirksMode and peruse the site. Add some JS. Follow some links on Crockford's site; you will probably stumble across his awesome video lectures, which you should watch. Then after that, go back to all the JS that you wrote and rewrite it. Then pick a server-side language that you want to learn. Python is pretty easy, but it really doesn't matter what you pick. Then come back and integrate all of those together in your site. At this point you will at least be getting started with web development and will have worked with several different technologies.

Many developers that I have worked with in the past have gone through their careers without really advancing past a certain point. I could be totally wrong, but I attribute it to not reading enough books and relying on the same bad code over and over.

language agnostic - Resources for getting started with web development...

language-agnostic

The problem is that there are multiple div tags with class="bd" on the page. It looks like you need the one that contains the actual article - it is inside the article tag:

import urllib2
from bs4 import BeautifulSoup

# Ask user to enter URL
url = raw_input("Please enter a valid URL: ")

soup = BeautifulSoup(urllib2.urlopen(url))

# retrieve all of the paragraph tags
paragraphs = soup.find('article').find("div", {'class': 'bd'}).find_all('p')
for paragraph in paragraphs:
    print paragraph.text
Libyan government forces on Monday seized a North Korea-flagged tanker after...
...

+1: Almost posted my variation but you got it first. Though, in mine, I had to add encode("utf-8") to my print line. Only difference is I used requests instead of urllib2.

@Nanashi thanks, usually I prefer requests too, but the OP uses urllib2 - decided to make the code close to what the OP provided.

Thanks this worked great! Out of curiosity why do you prefer requests? I am new to python so I'm looking to learn everything I can atm.

for humans

lol I made the change. It stopped the code from breaking on quotes as well.
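For reference, the requests-based variation discussed in the comments might look roughly like this (same parsing as the answer; only the fetching and the utf-8 encoding of the output differ, and both are sketches rather than the exact code the commenter used):

import requests
from bs4 import BeautifulSoup

# Ask user to enter URL
url = raw_input("Please enter a valid URL: ")

# requests instead of urllib2; everything else is the same idea as above.
soup = BeautifulSoup(requests.get(url).text)

paragraphs = soup.find('article').find("div", {'class': 'bd'}).find_all('p')
for paragraph in paragraphs:
    print paragraph.text.encode("utf-8")   # encoding avoids breaking on non-ASCII quotes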

python 2.7 - beautiful soup article scraping - Stack Overflow

python-2.7 beautifulsoup

You won't find the URL of a page in the page itself, but that's not a problem since you must have known the URL before you fetched the page.

Scraping is at its most powerful when it's site-specific: You need to examine the format of (say) the CNN site's pages, determine where they put the article date, find your way in the document hierarchy by examining the html source, and then design a way to extract it.

In the most general case you can at best recognize generic types of information: you can write a script that extracts all dates from a page (or as many as your criteria can match), but there's no general way to know which one represents the date of publication. Similarly, extracting the title and text in a really general way is at best guesswork, since there are so many ways to embed this information in a web page (and so many other things the site could be mixing in with it).
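As a sketch of that "extract all dates" idea (the two formats below are only examples, real pages use many more, and nothing here tells you which match is the publication date; the URL is a placeholder):

import re
import urllib2

html = urllib2.urlopen('http://example.com/some-article').read()

# Look for ISO dates (2014-03-10) and long-form dates (March 10, 2014).
date_pattern = re.compile(
    r'\b\d{4}-\d{2}-\d{2}\b'
    r'|\b(?:January|February|March|April|May|June|July|August'
    r'|September|October|November|December) \d{1,2}, \d{4}\b')

for match in date_pattern.findall(html):
    print match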

Finally, don't forget that many sites (though not all) will send you a bare-bones html page and use javascript to populate it with content. Unless you use something like webkit to interpret the javascript before you try to scrape the page, your script will see something very different from what your browser shows you.

python - Web Scraping news article and exporting to csv file - Stack O...

python web-scraping beautifulsoup

BeautifulSoup, by itself, is incapable of extracting the text of the "article", since what an article is, HTML-wise, is entirely subjective and will change from one site to the next. You need to write a different parser for each site.

from bs4 import BeautifulSoup

class Webpage(object):
    def __init__(self, html_string):
        self.html = BeautifulSoup(html_string)

    def getArticleText(self):
        raise NotImplementedError

class NewYorkTimesPage(Webpage):
    def getArticleText(self):
        return self.html.find(...)
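How a URL gets routed to the right parser is up to you; one hypothetical way to wire it up (the registry and helper below are made up for illustration, building on the classes above):

import urlparse

# Hypothetical registry: map a site's domain to the subclass that knows its markup.
PARSERS = {
    'www.nytimes.com': NewYorkTimesPage,
}

def parse_article(url, html_string):
    domain = urlparse.urlparse(url).netloc
    page_class = PARSERS.get(domain, Webpage)   # falls back to the unimplemented base
    return page_class(html_string).getArticleText()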

python - Get the text from an article BeautifulSoup - Stack Overflow

python beautifulsoup

You can search for text in just the body text with BeautifulSoup, by converting sys.argv[2] into a regular expression:

import sys
from bs4 import BeautifulSoup
import urllib2
import re

response = urllib2.urlopen(sys.argv[1])
soup = BeautifulSoup(response.read(), from_encoding=response.info().getparam('charset'))
text_pattern = re.compile(re.escape(sys.argv[2]))

if soup.find('body').find(text=text_pattern):
    print 'Found the text in the page'

However, to narrow this down further to exclude navigation, footers, etc., you'll need to apply some heuristics. Each site is different and detecting what part of the page makes up the main text is not a straightforward task.
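One very rough illustration of such a heuristic, continuing the snippet above (which tags to drop is pure guesswork and differs per site; nav, footer and aside are just common candidates):

# Drop elements that usually hold navigation or footer chrome, then search again.
for tag in soup.find_all(['nav', 'footer', 'aside']):
    tag.decompose()

if soup.find('body').find(text=text_pattern):
    print 'Found the text outside the obvious page chrome'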

Instead of re-inventing that wheel, you may want to look at the Readability API instead; they've already built a large library of heuristics to parse out the 'main' part of a site for you.

python - Get the text from an article BeautifulSoup - Stack Overflow

python beautifulsoup

(?:\n^[ ]{5}[A-Za-z--0-9_\-:,\. ]+)*

At the end of the second capture group (just before its closing parenthesis), as in:

^([A-Za-z--0-9_\-:,\. ]+)\n{2}^[ ]{5}([A-Za-z--0-9_\-:,\. ]+(?:\n^[ ]{5}[A-Za-z--0-9_\-:,\. ]+)*)$
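For example, applied with re.MULTILINE to text in that shape (this is just a sketch; the character class is simplified here to plain ASCII, and the sample text is made up):

import re

pattern = re.compile(
    r'^([A-Za-z0-9_\-:,\. ]+)\n{2}'
    r'^[ ]{5}([A-Za-z0-9_\-:,\. ]+(?:\n^[ ]{5}[A-Za-z0-9_\-:,\. ]+)*)$',
    re.MULTILINE)

sample = ("Some Article Title\n"
          "\n"
          "     First line of the article body.\n"
          "     Second line, still the same article.\n")

# Each match is a (title, body) pair; the body keeps its indented extra rows.
print pattern.findall(sample)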

Thanks, it works. As I understand it, this non-capturing group matches only the last row of the article, and that extends the range of the group.

@Dimm It matches each additional row following the first article row. The way it works is that it tries to find the 5-space indent, and if there's no match, it stops there. That's how it can match the last row of the article :)

find article text Regex Python - Stack Overflow

python regex parsing

Bruce Eckel wrote a nice article that points out some of the weird names Twisted uses: Grokking Twisted. According to that article, there are some good examples in The Python Cookbook, 2nd Ed (O'Reilly).

Where can I find good python Twisted framework documentation, blog ent...

python twisted

import urllib2

# UTF-8 encoded title, percent-encoded so it is safe to put in a request URL
utf8_url = 'Escola Superior de Ci\xc3\xaancias Empresariais (Set\xc3\xbabal)'
percent_url = urllib2.quote(utf8_url)
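Continuing from those two lines, the encoded title can go straight into an API request URL; a minimal sketch (this particular parameter combination fetches the raw wikitext and is only one of several options):

api_url = ('http://en.wikipedia.org/w/api.php?action=query&prop=revisions'
           '&rvprop=content&format=xml&titles=' + percent_url)
xml_response = urllib2.urlopen(api_url).read()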

python - Wikipedia API: getting articles with unicoded titles - Stack ...

python xml unicode wikipedia wikipedia-api

I wrote a Python library that aims to make this very easy. Check it out on GitHub.

$ pip install wikipedia

Then to get the first paragraph of an article, just use the wikipedia.summary function.

>>> import wikipedia
>>> print wikipedia.summary("Albert Einstein", sentences=2)

Albert Einstein (/ˈælbərt ˈaɪnstaɪn/; German: [ˈalbɛʁt ˈaɪnʃtaɪn]; 14 March 1879 – 18 April 1955) was a German-born theoretical physicist who developed the general theory of relativity, one of the two pillars of modern physics (alongside quantum mechanics). While best known for his mass–energy equivalence formula E = mc² (which has been dubbed "the world's most famous equation"), he received the 1921 Nobel Prize in Physics "for his services to theoretical physics, and especially for his discovery of the law of the photoelectric effect".

As far as how it works, wikipedia makes a request to the Mobile Frontend Extension of the MediaWiki API, which returns mobile friendly versions of Wikipedia articles. To be specific, by passing the parameters prop=extracts&exsectionformat=plain, the MediaWiki servers will parse the Wikitext and return a plain text summary of the article you are requesting, up to and including the entire page text. It also accepts the parameters exchars and exsentences, which, not surprisingly, limit the number of characters and sentences returned by the API.
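If you'd rather call the API yourself instead of using the library, a rough equivalent request looks like this (a sketch only; the parameter combination below is one plausible way to use the parameters the answer names, not the library's exact request):

import json
import urllib
import urllib2

params = urllib.urlencode({
    'action': 'query',
    'prop': 'extracts',
    'exsentences': 2,
    'exsectionformat': 'plain',
    'explaintext': '',
    'format': 'json',
    'titles': 'Albert Einstein',
})
response = urllib2.urlopen('http://en.wikipedia.org/w/api.php?' + params)
data = json.loads(response.read())

# The result is keyed by page id; grab the single page and print its extract.
page = data['query']['pages'].values()[0]
print page['extract'].encode('utf-8')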

The library is very well designed, and pretty easy to use! Good job. :)

Extract the first paragraph from a Wikipedia article (Python) - Stack ...

python wikipedia

As others have said, one approach is to use the MediaWiki API and urllib or urllib2. The code fragments below are part of what I used to extract what is called the "lead" section, which has the article abstract and the infobox. They will check whether the returned text is a redirect instead of actual content, and also let you skip the infobox if present (in my case I used different code to pull out and format the infobox).

import urllib

contentBaseURL = 'http://en.wikipedia.org/w/index.php?title='

def getContent(title):
    URL = contentBaseURL + title + '&action=raw&section=0'
    f = urllib.urlopen(URL)
    rawContent = f.read()
    return rawContent

# The fragments below assume rawContent holds the result of getContent(title).
infoboxPresent = 0

# Check if a redirect was returned.  If so, go to the redirection target
if rawContent.find('#REDIRECT') == 0:
    rawContent = getFullContent(title)   # helper not shown in this fragment
    # Extract the redirection title: the text between the [[ and ]] brackets
    redirectStart = rawContent.find('#REDIRECT[[') + 11
    redirectEnd = rawContent.find(']]', redirectStart)
    redirectTitle = rawContent[redirectStart:redirectEnd]
    print 'redirectTitle is: ', redirectTitle
    rawContent = getContent(redirectTitle)

# Skip the Infobox
infoboxStart = rawContent.find("{{Infobox")   # Actually starts at the double {'s before "Infobox"
count = 0
infoboxEnd = 0
for i, char in enumerate(rawContent[infoboxStart:-1]):
    if char == "{": count += 1
    if char == "}":
        count -= 1
        if count == 0:
            infoboxEnd = i + infoboxStart + 1
            break

if infoboxEnd != 0:
    rawContent = rawContent[infoboxEnd:]

You'll be getting back the raw text including wiki markup, so you'll need to do some cleanup. If you just want the first paragraph, not the whole first section, look for the first newline character.
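For example, a crude version of that last step might be (a sketch assuming rawContent has already had the markup cleaned out):

# Everything up to the first newline is treated as the first paragraph.
firstParagraph = rawContent.strip().split('\n', 1)[0]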

Extract the first paragraph from a Wikipedia article (Python) - Stack ...

python wikipedia

You can use the pattern library:

pip install pattern

from pattern.web import Wikipedia
article = Wikipedia(language="af").search('Kaapstad', throttle=10)
print article.string

Extract the first paragraph from a Wikipedia article (Python) - Stack ...

python wikipedia

Rather than trying to trick Wikipedia, you should consider using their High-Level API.

Which will, in turn, still block requests from urllib using the library's default User-Agent header. So the OP will still have the very same problem, although the API may be an easier way to interface with the wiki content, depending on what the OP's goals are.
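Either way, the usual cure for that 403 is to send a non-default User-Agent; a minimal urllib2 sketch (the URL and the header value are only examples):

import urllib2

request = urllib2.Request('http://en.wikipedia.org/wiki/Albert_Einstein',
                          headers={'User-Agent': 'MyScript/0.1 (you@example.com)'})
html = urllib2.urlopen(request).read()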

urllib2 - Fetch a Wikipedia article with Python - Stack Overflow

python urllib2 user-agent wikipedia http-status-code-403

It is not a solution to the specific problem, but it might be interesting for you to use the mwclient library (http://botwiki.sno.cc/wiki/Python:Mwclient) instead. That would be so much easier, especially since you will directly get the article contents, which removes the need for you to parse the HTML.

I have used it myself for two projects, and it works very well.
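A minimal sketch of what that looks like (assuming an mwclient version where page text is exposed via .text(); older releases returned the wikitext from .edit() instead, and the article title is just an example):

import mwclient

site = mwclient.Site('en.wikipedia.org')
page = site.Pages['Albert Einstein']   # any article title
print page.text()                      # raw wikitext, no surrounding HTML or menus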

Using third-party libraries for what can easily be done with built-in libraries in a couple of lines of code isn't good advice.

Since mwclient uses the MediaWiki API, it requires no parsing of the content. And I am guessing the original poster wants the content, not the raw HTML with menus and all.

urllib2 - Fetch a Wikipedia article with Python - Stack Overflow

python urllib2 user-agent wikipedia http-status-code-403

You can set the template for an individual page or article by using the Template metadata in that file. This will override the default that you set in your configuration file.

For Markdown you would include this in your header:

Template: template_name
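So a Markdown article using a custom template might start with a header like this (the title, date and template name are placeholders):

Title: My Article
Date: 2014-01-01
Template: template_name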

python - How can I override the default template used by a page or art...

python pelican

It's inside an iframe. Check for a frame with id="dsq2".

Now, the iframe has a src attribute which is a link to the actual page that has the comments.

So in Beautiful Soup: css_soup.select("#dsq2"), then get the URL from the src attribute. It will lead you to a page that has only the comments.

To get the actual comments, after you fetch the page from that src you can use this CSS selector: .post-message p
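Putting those two steps together, a rough sketch (the article URL is a placeholder, and the selectors are the ones described above):

import urllib2
from bs4 import BeautifulSoup

# Step 1: find the Disqus iframe on the article page and read its src URL.
article_soup = BeautifulSoup(urllib2.urlopen('http://example.com/news-article').read())
comments_url = article_soup.select('#dsq2')[0]['src']

# Step 2: fetch that page and pull out just the comment text.
comments_soup = BeautifulSoup(urllib2.urlopen(comments_url).read())
for paragraph in comments_soup.select('.post-message p'):
    print paragraph.text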

And if you want to load more comments: when you click the "more comments" button, it seems to send this request:

http://disqus.com/api/3.0/threads/listPostsThreaded?limit=50&thread=1660715220&forum=cnn&order=popular&cursor=2%3A0%3A0&api_key=E8Uh5l5fHZ6gD8U3KycjAIAk46f68Zw7C6eW8WSjZvCLXebZ7p0r1yrYDrLilk2F

python - Extracting comments from news articles - Stack Overflow

python comments web-scraping beautifulsoup

# In pelicanconf.py; the second form needs "import os" at the top of the settings file.
ARTICLE_ORDER_BY = 'filename'
ARTICLE_ORDER_BY = lambda x: os.path.basename(x.source_path or '')

thanks. the second one works. I don't know why the first one doesn't, and it doesn't with 'basename' either.

python - How to sort articles by filename in pelican? - Stack Overflow

python pelican

  • You may want to check mwlib to parse the wikipedia source
  • Alternatively, use the wikidump lib

Ah, there is a question already on SO on this topic:

html content extraction - Extracting the introduction part of a Wikipe...

python html-content-extraction

Whenever I'm browsing a new subject on Wikipedia, I typically perform a "breadth-first" search; I refuse to move on to another topic until I've scanned each and every link that the page connects to (which introduces a topic I'm not already familiar with). I read the first sentence of each paragraph, and if I see something in that article that appears to relate to the original topic, I repeat the process.

If I were to design the interface for a Wikipedia "summarizer", I would:

1. Always print the entire introductory paragraph.

2. For the rest of the article, print any sentence that has a link in it.

   2a. Print any comma-separated lists of links as a bullet-pointed list.

3. If the link to the article is "expanded", print the first paragraph for that article.

4. If that introductory paragraph is expanded, repeat the listing of sentences with links.

What I'm saying is that summarizing Wikipedia articles isn't the same as summarizing an article from a magazine, or a posting on a blog. The act of crawling is an important part of learning introductory concepts quickly via Wikipedia, and I feel it's for the best. Typically, the bottom half of articles is where the citation needed tags start popping up, but the first half of any given article is considered given knowledge by the community.

python - Summarizing a Wikipedia Article - Stack Overflow

python statistics machine-learning wikipedia summarization

Given the number of articles on Wikipedia, it would take an unaffordable amount of time to compute THE shortest path (my assumption - I haven't tried).

The real problem is to find an acceptable and efficient short path between two articles.

Algorithms that deal with this kind of problem are related to the travelling salesman problem. It could be a good point to start from.

I'm personally also fond of the genetic algorithms approach to finding an acceptable optimum in a certain amount of time.

I have just looked at that image and it sets the number of articles to 4,000,000 for en.wikipedia.com in 2013. Much less than I thought, indeed.

EDIT: I first stated that this was an NP-hard problem and commenters explained that it's not.

Why would this be NP-hard?

Yeah, this is just graph search (of course realistically you need to do this carefully, so you don't run out of memory), which has a poly-time solution in the number of vertices and edges in the graph. What NP-hard reduction are you referring to? I don't think this is TSP actually, since that asks for the shortest path through every node, not the shortest path between two nodes.

@tigger that's right. Re-reading the article on TSP and NP-hardness, our case is not NP-hard and not a case of TSP. But do you mean that something with complexity bigger than O(n), like graph walking, is a possible approach for browsing the Wikipedia graph? (I find myself confused by NP-hard and NP-complete.)

@StephaneRolland NP-hard >= NP-complete. Problem X that is at least as hard as the most complex problem Y in NP is NP-hard; X does not have to be a member of NP. NP-complete means "NP-hard and in NP", so those are the most complex problems that are in NP. Also, I'm curious why you mention the number of articles on wikipedia in the context of complexity.
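To make the thread concrete, a plain breadth-first search is the kind of poly-time graph search being described; this sketch assumes you can obtain each article's outgoing links from somewhere (the API, a database dump, or scraping) and uses a toy dictionary as a stand-in:

from collections import deque

def shortest_path(graph, start, goal):
    """Breadth-first search over an adjacency dict; returns a list of nodes or None."""
    queue = deque([[start]])
    visited = set([start])
    while queue:
        path = queue.popleft()
        node = path[-1]
        if node == goal:
            return path
        for neighbour in graph.get(node, []):
            if neighbour not in visited:
                visited.add(neighbour)
                queue.append(path + [neighbour])
    return None

# Toy stand-in for "article -> articles it links to".
links = {
    'Python (programming language)': ['Guido van Rossum', 'Algorithm'],
    'Guido van Rossum': ['Netherlands'],
    'Algorithm': ['Graph theory'],
}
print shortest_path(links, 'Python (programming language)', 'Graph theory')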

algorithm - Find shortest path between two articles in english Wikiped...

python algorithm dijkstra