
Take a look at Scrapy. It is a Python framework built specifically for scraping. It makes it very easy to extract information by writing an XPath expression for the element you want. It also has some very useful capabilities, such as defining models for the scraped data (so it can be exported in different formats), handling authentication, and recursively following links.

Parsing HTML with Python 2.7 - HTMLParser, SGMLParser, or Beautiful So...


Since it's an XML file, you can use an XPath query to extract the URLs. In the XML file, it looks like the RSS feed URLs are stored in xmlUrl attributes. The XPath expression //@xmlUrl will select all values of that attribute.
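A stdlib-only sketch of the idea (the OPML snippet and its URLs are made up). Note that the limited XPath in xml.etree can't select bare attributes like //@xmlUrl, so it selects the elements carrying the attribute and reads it off instead:

```python
import xml.etree.ElementTree as ET

# hypothetical OPML content for illustration
opml = """<opml version="1.0">
  <body>
    <outline text="Example" xmlUrl="http://example.com/rss.xml"/>
    <outline text="Other" xmlUrl="http://example.org/feed"/>
  </body>
</opml>"""

root = ET.fromstring(opml)
# select every <outline> that has an xmlUrl attribute, then read it
urls = [el.get("xmlUrl") for el in root.iterfind(".//outline[@xmlUrl]")]
print(urls)  # ['http://example.com/rss.xml', 'http://example.org/feed']
```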

If you want to test this out in your web-browser, you can use an online XPath tester. If you want to perform this XPath query in Python, this question explains how to use XPath in Python. Additionally, the lxml docs have a page on using XPath in lxml that might be helpful.
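With lxml, the //@xmlUrl expression from above works directly, since lxml supports full XPath 1.0 (the OPML content here is again made up):

```python
from lxml import etree

# hypothetical OPML content for illustration
opml = b"""<opml version="1.0">
  <body>
    <outline text="Example" xmlUrl="http://example.com/rss.xml"/>
    <outline text="Other" xmlUrl="http://example.org/feed"/>
  </body>
</opml>"""

root = etree.fromstring(opml)
# //@xmlUrl selects the attribute values themselves
urls = root.xpath("//@xmlUrl")
print(urls)  # ['http://example.com/rss.xml', 'http://example.org/feed']
```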

... and you can use XPath in a bash pipeline with xmlstarlet.

python - How do you extract feed urls from an OPML file exported from ...


I'd use XPath; see here for a question about which package would be appropriate in Python.

Just curious as to why XPath over BeautifulSoup? I've never used XPath, but BeautifulSoup seems to be the standard answer for HTML/XML parsing, as per the OP's question.

I tend to work with more rigid formats and prefer plainer errors. BeautifulSoup could indeed be a better answer if the HTML isn't to be trusted, but I'd still be tempted to use XPath because of its portability.
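For contrast, here are both approaches on the same snippet (the HTML and URLs are made up). The BeautifulSoup version tolerates messy markup; the XPath expression is terser and the same string works in other languages and tools:

```python
from bs4 import BeautifulSoup
from lxml import etree

html = ('<p>See <a href="http://example.com/a">one</a> '
        'and <a href="http://example.com/b">two</a>.</p>')

# BeautifulSoup: find every <a> that actually has an href
soup = BeautifulSoup(html, "html.parser")
bs_urls = [a["href"] for a in soup.find_all("a", href=True)]

# lxml + XPath: one expression selects the attribute values directly
lxml_urls = etree.fromstring(html, etree.HTMLParser()).xpath("//a/@href")

print(bs_urls)  # ['http://example.com/a', 'http://example.com/b']
```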

Thanks, I'll have to take a look at XPath.

Extract URLs from specific tags in python - Stack Overflow
