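You can get a string from the element and then write that out, as in the lxml tutorial:
s = etree.tostring(root, pretty_print=True)
Or write directly from the element tree:
et = etree.ElementTree(root)
et.write(sys.stdout, pretty_print=True)
Note that write also accepts pretty_print=True. To save the pretty-printed string s to a file, open the file in binary mode:
with open('pretty.html', 'wb') as file:
    file.write(s)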
As of Python 3, you need to use sys.stdout.buffer instead of sys.stdout, because the serialized output is bytes. This is essentially what @laviex pointed out, applied to the special case of sys.stdout.
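The bytes-versus-text point can be checked without touching a real file. Here is a minimal sketch, not the answer's own code: it uses the standard library's xml.etree.ElementTree as a stand-in for lxml.etree (both return bytes from tostring by default), and the element names and helper names are made up for illustration.

```python
import io
import xml.etree.ElementTree as ET


def serialize(root):
    """Return the element serialized as bytes (the default for tostring)."""
    return ET.tostring(root)


def write_tree(root, binary_stream):
    """Write the tree to a stream opened in binary mode."""
    ET.ElementTree(root).write(binary_stream)


# Hypothetical document, only for demonstration.
root = ET.Element("root")
ET.SubElement(root, "child").text = "hello"

data = serialize(root)

# io.BytesIO plays the role of a file opened with 'wb' or of sys.stdout.buffer.
buffer = io.BytesIO()
write_tree(root, buffer)
```

With lxml the calls look the same, plus pretty_print=True as in the answer above. The point is only that the target has to accept bytes, which is why 'wb' and sys.stdout.buffer are needed.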
Write xml file using lxml library in Python - Stack Overflow
To get to your particular troubles, as the comments point out, you're missing GCC. On OS X, Xcode Command Line Tools provides GCC, as well as many other programs necessary for building software on OS X. For OS X 10.9 (Mavericks) and newer, either install Xcode through the App Store, or alternatively, install only the Xcode Command Line Tools with
xcode-select --install
For more details, please see the Apple Developer FAQ or search the web for "install Xcode Command Line Tools".
For older versions of OS X, you can get Xcode Command Line Tools from the downloads page of the Apple Developer website (free registration required).
Once you have GCC installed, you may still encounter errors during compilation if the C/C++ library dependencies are not installed on your system. On OS X, the Homebrew project is the easiest way to install and manage such dependencies. Follow the instructions on the Homebrew website to install Homebrew on your system, then issue
brew update
brew install libxml2 libxslt
Possibly causing further trouble in your case, you placed the downloaded setuptools in /Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6/site-packages/. Please do not download any files to this location. Instead, I suggest you download the file to your home directory, or your usual Downloads directory. After downloading it, you're supposed to run sh setuptools-X.Y.Z.egg, which will then install it properly into the appropriate site-packages and put the executable easy_install on your path.
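Before retrying the build, it can help to confirm that the tools are actually on your PATH. This is a rough sketch under assumptions: the tool list (gcc plus the xml2-config and xslt-config helpers shipped with libxml2 and libxslt) is my guess at what a source build needs, and the function name is made up.

```python
import shutil

# Executables a source build of lxml typically relies on. The list itself is
# an assumption for illustration; adjust it to your platform.
REQUIRED_TOOLS = ("gcc", "xml2-config", "xslt-config")


def missing_build_tools(tools=REQUIRED_TOOLS):
    """Return the names from `tools` that cannot be found on PATH."""
    return [name for name in tools if shutil.which(name) is None]


if __name__ == "__main__":
    missing = missing_build_tools()
    if missing:
        print("Not found on PATH:", ", ".join(missing))
    else:
        print("All listed build tools were found.")
```

An empty result does not guarantee a successful build, but a non-empty one tells you which of the steps above to revisit.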
+1 for the great graphic
Above pip link is broken. I think it is this: pip.pypa.io/en/latest
python - Installing easy_install... to get to installing lxml - Stack ...
I had the same problem. If you installed it with pip as follows:
pip install lxml
then try reinstalling it with statically built dependencies instead:
STATIC_DEPS=true pip install lxml
thanks, i have already installed that package after dozens of tests. but your solution seems good.
get errors when import lxml.etree to python - Stack Overflow
If you've installed libxml2, then it's possible that it's just not picking up the right version (there's a version installed with OS X by default). In particular, suppose you've installed libxml2 to /usr/local. You can check what shared libraries etree.so references:
$> otool -L /Library/Python/2.7/site-packages/lxml-3.2.1-py2.7-macosx-10.7-intel.egg/lxml/etree.so
/Library/Python/2.7/site-packages/lxml-3.2.1-py2.7-macosx-10.7-intel.egg/lxml/etree.so:
    /usr/lib/libxslt.1.dylib (compatibility version 3.0.0, current version 3.24.0)
    /usr/local/lib/libexslt.0.dylib (compatibility version 9.0.0, current version 9.17.0)
    /usr/lib/libxml2.2.dylib (compatibility version 10.0.0, current version 10.3.0)
    /usr/lib/libz.1.dylib (compatibility version 1.0.0, current version 1.2.5)
    /usr/lib/libSystem.B.dylib (compatibility version 1.0.0, current version 159.1.0)
Checking for that symbol in the system-installed version:
$> nm /usr/lib/libxml2.2.dylib | grep ___xmlStructuredErrorContext
For me, it's not present in the system-installed library. In the version I installed, however:
$> nm /usr/local/lib/libxml2.2.dylib | grep ___xmlStructuredErrorContext
000000000007dec0 T ___xmlStructuredErrorContext
To make the loader pick up the version I installed, I set DYLD_LIBRARY_PATH so that /usr/local/lib is searched first:
$> export DYLD_LIBRARY_PATH=/usr/local/lib
$> python
>>> from lxml import etree
# Success!
This solved my problem after an hour of trying... What is the permanent solution to this problem?
On Linux, a permanent fix is to list the library directory in a file such as /etc/ld.so.conf.d/lxml.conf and then run ldconfig.
as for me, doing the opposite solved my problem (because the symbol was actually in the system's libxml2 version), so I had to put /usr/lib as the first entry of DYLD_LIBRARY_PATH
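Since the fix depends on which directory each library is loaded from, a small helper can make the otool output easier to read. This is only a sketch: the function names are made up, and it assumes the `otool -L` layout shown above (one header line naming the inspected file, then one indented library per line).

```python
def linked_libraries(otool_output):
    """Return the library paths listed in `otool -L` output."""
    paths = []
    # The first line names the inspected file, so skip it.
    for line in otool_output.splitlines()[1:]:
        line = line.strip()
        if line:
            # Drop the trailing "(compatibility version ...)" part.
            paths.append(line.split(" (", 1)[0])
    return paths


def group_by_directory(paths):
    """Map each directory to the library file names found in it."""
    groups = {}
    for path in paths:
        directory, _, name = path.rpartition("/")
        groups.setdefault(directory, []).append(name)
    return groups


# Output copied from the answer above.
SAMPLE = """\
/Library/Python/2.7/site-packages/lxml-3.2.1-py2.7-macosx-10.7-intel.egg/lxml/etree.so:
    /usr/lib/libxslt.1.dylib (compatibility version 3.0.0, current version 3.24.0)
    /usr/local/lib/libexslt.0.dylib (compatibility version 9.0.0, current version 9.17.0)
    /usr/lib/libxml2.2.dylib (compatibility version 10.0.0, current version 10.3.0)
    /usr/lib/libz.1.dylib (compatibility version 1.0.0, current version 1.2.5)
    /usr/lib/libSystem.B.dylib (compatibility version 1.0.0, current version 159.1.0)
"""

groups = group_by_directory(linked_libraries(SAMPLE))
```

For the sample, libexslt comes from /usr/local/lib while libxml2 and libxslt come from /usr/lib, which is exactly the kind of mix that leads to a missing symbol.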
get errors when import lxml.etree to python - Stack Overflow
sudo apt-get install python-lxml
Better to use pip. It will give you a more recent version than your package maintainer. Sooner or later you'll be glad you used it.
python - Installing easy_install... to get to installing lxml - Stack ...
How to get content of page dynamically modified within browser?
Here is snippet from the page:
<form id="vCSS_mainform" method="post" name="MainForm" action="/ProductDetails.asp?ProductCode=MCFFGB" onsubmit="javascript:return QtyEnabledAddToCart_SuppressFormIE();">
<img src="/v/vspfiles/templates/MAKO/images/clear1x1.gif" width="5" height="5" alt="" /><br />
<table width="100%" cellpadding="0" cellspacing="0" border="0" id="v65-product-parent">
  <tr>
    <td colspan="2" class="vCSS_breadcrumb_td"><b>
      <a href="http://www.makospearguns.com/">Home</a> >
The element with id "v65-product-parent" is of type table and has subelement tr. There can be only one element with such an id (otherwise it would be broken XML).
The xpath expects tbody as a child of the given element (table), and there is none in the whole page. This can be tested:
>>> "tbody" in page.text
False
You can also download the page with
$ wget http://www.makospearguns.com/product-p/mcffgb.htm
and review its content: it does not contain a single element named tbody.
This often happens when JavaScript comes into play and generates some of the page content in the browser. But as LegoStormtroopr noted, that is not the case here: this time it is the browser itself that modifies the document to make it correct.
You have to give some sort of browser a chance. E.g. if you use selenium, you would get it.
from selenium import webdriver
from lxml import html

url = "http://www.makospearguns.com/product-p/mcffgb.htm"
xpath = '//*[@id="v65-product-parent"]/tbody/tr[2]/td[2]/table[1]/tbody/tr/td/table/tbody/tr[2]/td[2]/table/tbody/tr[1]/td[1]/div/table/tbody/tr/td/font/div/b/span/text()'

browser = webdriver.Firefox()
browser.get(url)
# page source as held by the browser after it has processed the document
html_source = browser.page_source
browser.quit()
print("test tbody", "tbody" in html_source)

tree = html.fromstring(html_source)
text = tree.xpath(xpath)
print(text)
Running it prints:
$ python byselenimum.py
test tbody True
['$149.95']
Selenium is great when it comes to changes made within the browser. However, it is a rather heavy tool, and if you can do it in a simpler way, do it that way. LegoStormtroopr has proposed such a simpler solution, which works on the plainly fetched web page.
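One common lightweight workaround, assumed here and not spelled out in the text above, is to drop the tbody steps from the XPath the browser gives you, since the raw HTML has no tbody elements. The helper below is a sketch with a made-up name. It only rewrites the string, so whether the result matches a particular page still has to be checked against the fetched source.

```python
import re


def strip_tbody(xpath):
    """Remove every `tbody` location step (with an optional [n] predicate)."""
    return re.sub(r"/tbody(\[\d+\])?(?=/|$)", "", xpath)


# XPath copied from the browser, as in the Selenium example above.
browser_xpath = (
    '//*[@id="v65-product-parent"]/tbody/tr[2]/td[2]/table[1]/tbody/tr/td'
    '/table/tbody/tr[2]/td[2]/table/tbody/tr[1]/td[1]/div/table/tbody/tr/td'
    '/font/div/b/span/text()'
)

# Candidate XPath for the page as fetched, before the browser adds tbody.
source_xpath = strip_tbody(browser_xpath)
```

The rewritten path can then be passed to tree.xpath() on a tree built from the downloaded HTML instead of from browser.page_source.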
I just now went to the page and inspected it. When I right click on the span with the price and select "Copy XPath", this is exactly what it gives me. And when I plug that copied xpath into firepath, it shows me the correct part of the page. If the path is simply wrong, then why did that work?
-1 because this "It gets generated dynamically by JavaScript after it is loaded into browser" is wrong.
@JanVlcinsky Check my answer. The page is altered by the browser to massage it into the DOM, before any Javascript is called.
xml - Why does this xpath fail using lxml in python? - Stack Overflow
Parameters in a URL (e.g. key=listOfUsers/user1) are GET parameters and you shouldn't be using them for POST requests. A quick explanation of the difference between GET and POST can be found here.
In your case, to make use of REST principles, you should probably have:
http://ip:5000/users
http://ip:5000/users/<user_id>
Then, on each URL, you can define the behaviour of different HTTP methods (GET, POST, PUT, DELETE). For example, on /users/<user_id>, you want the following:
GET /users/<user_id> - return the information for <user_id>
POST /users/<user_id> - modify/update the information for <user_id> by providing the data
PUT - I will omit this for now as it is similar enough to `POST` at this level of depth
DELETE /users/<user_id> - delete user with ID <user_id>
So, in your example, you want to do a POST to /users/user_1 with the POST data being "John". Then the XPath expression, or whatever other way you want to access your data, should be hidden from the user and not tightly coupled to the URL. This way, if you decide to change the way you store and access data, you will simply have to change the code on the server side instead of changing all your URLs.
Now, the answer to your question: Below is a basic semi-pseudocode of how you can achieve what I mentioned above:
from flask import Flask, request

app = Flask(__name__)

@app.route('/users/<user_id>', methods=['GET', 'POST', 'DELETE'])
def user(user_id):
    if request.method == 'GET':
        # return the information for <user_id>
        ...
    elif request.method == 'POST':
        # modify/update the information for <user_id>
        # you can use <user_id>, which is a str but could be
        # changed to be int or whatever you want, along
        # with your lxml knowledge to make the required
        # changes
        data = request.form  # a multidict containing POST data
        ...
    elif request.method == 'DELETE':
        # delete user with ID <user_id>
        ...
    else:
        # Error 405 Method Not Allowed
        ...
There are a lot of other things to consider like the POST request content-type but I think what I've said so far should be a reasonable starting point. I know I haven't directly answered the exact question you were asking but I hope this helps you. I will make some edits/additions later as well.
Thanks and I hope this is helpful. Please do let me know if I have gotten something wrong.
do you have to do something special for the POST to get routed back correctly? I have /competitions/<int: id> set up but when the POST occurs, it posts to /competitions instead so my post handling logic is never reached.
python - Flask example with POST - Stack Overflow
I had exactly this problem. Turned out to be a memory problem - I was installing reporter.py, which depends on lxml, on a server with only 500MB RAM, of which only 150MB was free. I killed off a few things to get up to ~300MB free, and just managed to squeeze out the installation of lxml. (Watching TOP showed available memory going down to 4MB at one point!)
I had the same problem and my solution was based on yours. Actually, I'm using Vagrant and my VM had only 512 MB RAM, so I raised this limit to 1 GB. The installation then completed without error.
I have solved this by adding swap file with 500M size because there was no ability to increase RAM.
Same for me. Vagrant default box. Bumping the memory up to 1024 with
config.vm.provider "virtualbox" do |vb|
  vb.memory = 1024
end
fixes this hellish error.
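If you want to check the numbers before starting the build, the figures that top shows can also be read from /proc/meminfo. The following is a sketch under assumptions: it is Linux-only, the function names are made up, and the sample text is invented to mirror the "150MB free" situation described above. It does not decide how much memory is enough.

```python
def parse_meminfo(text):
    """Parse /proc/meminfo-style text into a dict of {field: kilobytes}."""
    values = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        parts = rest.split()
        if sep and parts and parts[0].isdigit():
            values[name.strip()] = int(parts[0])
    return values


def free_megabytes(meminfo):
    """Free RAM plus free swap, in whole megabytes."""
    return (meminfo.get("MemFree", 0) + meminfo.get("SwapFree", 0)) // 1024


# Invented sample resembling a small server with about 150 MB free and no swap.
SAMPLE = """\
MemTotal:         501760 kB
MemFree:          153600 kB
SwapTotal:             0 kB
SwapFree:              0 kB
"""

info = parse_meminfo(SAMPLE)
available = free_megabytes(info)
```

On a real Linux machine you would pass the contents of /proc/meminfo instead of SAMPLE. Adding a swap file, as suggested above, raises SwapFree and therefore the reported total.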
python - can't installing lxml on Ubuntu 12.04 - Stack Overflow
import gzip
from lxml import etree

# f is the path to the gzipped XML file
name = None
level = 0
for event, element in etree.iterparse(gzip.GzipFile(f), events=('end', 'start'), tag='label'):
    # Update current level
    if event == 'start':
        level += 1
    elif event == 'end':
        level -= 1
    # Get name for top level label
    if level == 0:
        name = element.xpath('name/text()')
As an alternate solution, parse the whole file and use xpath to get the top label name:
import gzip
from lxml import html

# f is the path to the gzipped file
with gzip.open(f, 'rb') as gz:
    file_content = gz.read()

tree = html.fromstring(file_content)
name = tree.xpath('//label/name/text()')
The file is huge. Parsing the whole thing at once is not an option.
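For a file that cannot be loaded whole, the depth-tracking idea from the first snippet can be tried on a small input first. This is a sketch, not the answer's code: it uses the standard library's xml.etree.ElementTree.iterparse, which has no tag= argument, so the filter on label is done by hand. The function name and the sample document are made up.

```python
import io
import xml.etree.ElementTree as ET


def top_level_label_names(stream):
    """Collect the <name> text of every <label> that is not nested in another <label>."""
    names = []
    depth = 0  # number of <label> elements currently open
    for event, element in ET.iterparse(stream, events=("start", "end")):
        if element.tag != "label":
            continue
        if event == "start":
            depth += 1
        else:
            depth -= 1
            if depth == 0:
                # Only direct <name> children are considered here.
                names.append(element.findtext("name"))
                # Release the finished subtree to keep memory use down.
                element.clear()
    return names


# Hypothetical document with one nested label.
SAMPLE = (
    b"<root>"
    b"<label><name>outer</name><label><name>inner</name></label></label>"
    b"<label><name>second</name></label>"
    b"</root>"
)

names = top_level_label_names(io.BytesIO(SAMPLE))
```

For the real file, the stream would be gzip.GzipFile(f) as in the first snippet. Nested labels are only counted, never reported, and each top-level label is cleared once its name has been read.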
python - lxml eTree iterparse depth - Stack Overflow
import csv
import xml.etree.ElementTree as et

xmltext = """
<dicts>
    <key>1375</key>
    <dict>
        <key>Key 1</key><integer>1375</integer>
        <key>Key 2</key><string>Some String</string>
        <key>Key 3</key><string>Another string</string>
        <key>Key 4</key><string>Yet another string</string>
        <key>Key 5</key><string>Strings anyone?</string>
    </dict>
</dicts>
"""

tree = et.fromstring(xmltext)

# newline='' lets the csv module control line endings (Python 3)
with open('output.txt', 'w', newline='') as f:
    writer = csv.writer(f, quoting=csv.QUOTE_NONNUMERIC)
    # iterate over the dict elements
    for dict_el in tree.iterfind('dict'):
        data = []
        # get the text contents of each non-key element
        for el in dict_el:
            if el.tag == 'string':
                data.append(el.text)
            # if it's an integer element convert to int so csv won't quote it
            elif el.tag == 'integer':
                data.append(int(el.text))
        writer.writerow(data)
Thanks for posting so soon. The problem is, I cannot get lxml to run on my machine. I have python 2.7 and have made several attempts to get that module installed, but have failed. I was hoping there was another way that doesn't involve lxml.
I'm running Ubuntu Maverick Meerkat Netbook edition...
How are you trying to install it? have you tried installing it with PIP?
Python XML Parsing - Stack Overflow
Since it seems that /usr/include/libxml2 is being included, I think the most probable reason is that you don't have libxml2 installed on your system. This is most likely due to missing "command line tools". Get them here: https://developer.apple.com/downloads/index.action?=command%20line%20tools
Can also be solved by installing libxml2 via macports or brew (But don't do this other than as last resort). Using system libraries instead of homebrew or macports whenever possible can save you from a lot of incompatibility pitfalls.
Hi i reinstalled libxml2 via homebrew. maybe the upgrade to mavericks 10.9.1 overwrote the prior install. Will get back here.
I did "brew install libxml2" which was successful. But "pip install --upgrade lxml" still fails with same error
Have you checked that /usr/include/libxml2 exists? I bet it doesn't, in which case my recommendation is to install command line tools from Apple. Preferring system libs when both system and homebrew are available will almost always save you some headache. PS. The reason why homebrew-libxml didn't solve it is that /usr/local/include probably isn't included by pip while building. Before trying to hack that, I'd really suggest checking /usr/include and installing command line tools - that will probably solve it all :)
You are correct - the /usr/include/libxml2 does not exist. I just upgraded to 10.9.1 - did that wipe out the command line tools??
Yep. The "command line tools" are OS-specific and need to be re-installed each time you upgrade to a new "major version" of OSX.
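The "does /usr/include/libxml2 exist?" check is easy to script, which is handy after an OS upgrade. This is a sketch under assumptions: the candidate list is only illustrative (the /usr/local path is my guess at where a non-system install might put headers), and the function name is made up.

```python
import os

# Candidate locations for the libxml2 headers. Which of these a given build
# actually searches is an assumption that depends on the toolchain.
CANDIDATE_DIRS = ("/usr/include/libxml2", "/usr/local/include/libxml2")


def existing_header_dirs(candidates=CANDIDATE_DIRS):
    """Return the candidate directories that contain libxml/xmlversion.h."""
    return [
        directory
        for directory in candidates
        if os.path.isfile(os.path.join(directory, "libxml", "xmlversion.h"))
    ]


if __name__ == "__main__":
    found = existing_header_dirs()
    print("libxml2 headers found in:", found if found else "none of the candidates")
```

If the system location is missing from the result, reinstalling the command line tools as described above is the first thing to try.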
Error installing python module lxml on osx mavericks - Stack Overflow
There is no silver bullet. Different HTML parsers behave differently and you should pick the one that works for your particular page. Works in this case basically means, that you can get to your desired data.
lxml parser is generally faster, html5lib is the most lenient one - this kind of difference would be relevant if you have a broken or non-well-formed HTML to parse. html.parser is built-in and can help to avoid extra dependencies, if this is a problem. Here is a related table that highlights the differences.
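To illustrate the no-extra-dependency route, here is a sketch that collects links with the standard library's html.parser module used directly, rather than through BeautifulSoup. The class name, function name and sample markup are made up. It makes no claim about how lxml or html5lib would treat the same input.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect the href value of every <a> start tag that has one."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # html.parser lowercases tag and attribute names before calling this.
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value is not None:
                    self.links.append(value)


def extract_links(markup):
    """Return all href values found in `markup`, in document order."""
    collector = LinkCollector()
    collector.feed(markup)
    collector.close()
    return collector.links


# Deliberately sloppy sample: mixed case, an unclosed <a>, an <a> without href.
SAMPLE = '<p><a href="/one">One</a> <A HREF="/two">Two <a>no href</a></p>'

links = extract_links(SAMPLE)
```

Because this only reacts to start tags, unclosed or badly nested anchors still yield their href, which is usually what you want when the goal is just to list links.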
So to be sure to get all the links, I must use several methods, several parsers?
@Anonymus nope, usually you just pick a parser and stick to it. But, I can imagine a page being non well-formed and parsing it with different parsers might get a bigger picture than with a single one. Though, I haven't been in that situation ever yet. Thanks.
python beautifulsoup : lxml html.parser - Stack Overflow
It looks like lxml wants to build an extension that requires access to a C compiler. You will need gcc for that. Try running sudo apt-get install build-essential and that should fix this particular issue.
sudo apt-get install gcc
sudo: apt-get: command not found
@John The better command for Debian/Ubuntu is sudo apt-get install build-essential because it includes tools like make and a few other friends that are usually used in concert with gcc/g++.
Ah. OSX doesn't install the gcc compiler. Get Homebrew (github.com/mxcl/homebrew) or its less-intelligent cousin, ports, and then install gcc through them instead. Most of your pain is happening because there isn't an official, sensible packager on OS X. Sorry. =/