Rectangle 27 0

dom Why does my XPath query (scraping HTML tables) only work in Firebug, but not the application I'm developing?


JavaScript
    var html = '<html><table><tr><td>foo</td><td>bar</td></tr></table></html>';
    html.replace(/(<table([^>]+)?>([^<>]+)?)(?!<tbody([^>]+)?>)/g,"$1<tbody>").replace(/(<(?!(\/tbody))([^>]+)?>)(<\/table([^>]+)?>)/g,"$1</tbody>$4");

PHP
    $html = $dom->saveHTML();
    $html = preg_replace(array('/(<table([^>]+)?>([^<>]+)?)(?!<tbody([^>]+)?>)/','/(<(?!(\/tbody))([^>]+)?>)(<\/table([^>]+)?>)/'),array('$1<tbody>','$1</tbody>$4'),$html);
    $dom->loadHTML($html);
matches `<table>` tag with whatever else junk inside the tag and between this and the next tag if the next tag is NOT `<tbody>` also with stuff inside the tag

    /(<table([^>]+)?>([^<>]+)?)(?!<tbody([^>]+)?>)/

replace with

    $1<tbody>

the $1 referencing the captured `<table>` tag with contents.
Do the same for the closing tag like this:

    /(<(?!(\/tbody))([^>]+)?>)(<\/table([^>]+)?>)/

replace with

    $1</tbody>$4

@Alex That is a great post. But does it really apply to this case? The regex is used on a string to insert a possibly missing part before it is being parsed by an actual xml parser. The same could be done on the object it is parsed into by searching, checking, inserting, moving and iterating, but this seemed faster based on the computer not caring about what the bits represent. Please note that I may be completely wrong here and if anyone can show this method resulting in a security problem, unintended behavior, etc., please provide an explanation or example, so we can all learn from it.

Before parsing, get the html as a string. Insert missing <tbody> and </tbody> tags with regex, then load it back into your DOMDocument object.

Jens Erat gives a good explanation, but here is

Just came across the same problem. I almost wrote a recursive funtion to check for every tbody tag if it exists and traverse the dom that way, then I remembered I know regex. :)

Just the regex:

Personally, I couldn't think of an example where using a regex here could cause unintended behaviour, otherwise I would've mentioned it. This wasn't meant to unnecessary criticize your post, I just wanted to remind that using regexes could (in theory, at least) bring some fallbacks with it.

This way the dom will ALWAYS have the <tbody> tags where necessary.

Note
Rectangle 27 0

dom Why does my XPath query (scraping HTML tables) only work in Firebug, but not the application I'm developing?


//table[@id="example"]/(tbody, .)/tr[2]/td[1]
//table[@id="example"]//tr[2]/td[1]
//table[@id="example"]/tr[2]/td[1] | //table[@id="example"]/tbody/tr[2]/td[1]
/tbody

Check if the table you're stuck at really does not contain a <tbody/> element (see last paragraph). If it does, you've probably got another kind of problem.

Compare the source shown by Firebug (or Chrome's Dev Tools) with the one you get by right-clicking and selecting "Show Page Source" (or whatever it's called in your browsers) -- or by using curl http://your.example.org on the command line. Latter will probably not contain any <tbody/> elements (they're rarely used), Firebug will always show them.

Excluding JavaScript, most XPath processors work on raw XML, not the DOM, thus do not add <tbody/> tags. Also HTML parser libraries like tag-soup and htmltidy only output XHTML, not "DOM-HTML".

Firebug, Chrome's Developer Tool, XPath functions in JavaScript and others work on the DOM, not the basic HTML source code.

I've been searching for a solution for 4 hours, because the data I wanted from a site didn't want to be mine. All the values were easy by their xpaths, however one of the tables returned an error, and the solution was to delete tbody and replace it with an extra /.

If you're not sure in advance that your table or use the query in both "HTML source" and DOM context; and don't want/cannot use the hack from solution 2, provide an alternative query (for XPath 1.0) or use an "optional" axis step (XPath 2.0 and higher).

In addition to what was stated above, for my scraper on these scenarios, I have a flag for "skipFirstRow" which actually works perfectly (for the pages I'm scraping).

Replace the /tbody axis step by a descendant-or-self step:

The DOM for HTML requires that all table rows not contained in a table header of footer (<thead/>, <tfoot/>) are included in table body tags <tbody/>. Thus, browsers add this tag if it's missing while parsing (X)HTML. For example, Microsoft's DOM documentation says

The TBODY start tag is always required except when the table contains only one table body and no table head or foot sections.

The tbody element is exposed for all tables, even if the table does not explicitly define a tbody element.

This is a common problem posted on Stackoverflow for PHP, Ruby, Python, Java, C#, Google Docs (Spreadsheets) and lots of others. Selenium runs inside the browser and works on the DOM -- so it is not affected!

This is a rather dirty solution and likely to fail for nested tables (can jump into inner tables). I would only recommend to to this in very rare cases.

Note
Rectangle 27 0

dom Why does my XPath query (scraping HTML tables) only work in Firebug, but not the application I'm developing?


JavaScript
    var html = '<html><table><tr><td>foo</td><td>bar</td></tr></table></html>';
    html.replace(/(<table([^>]+)?>([^<>]+)?)(?!<tbody([^>]+)?>)/g,"$1<tbody>").replace(/(<(?!(\/tbody))([^>]+)?>)(<\/table([^>]+)?>)/g,"$1</tbody>$4");

PHP
    $html = $dom->saveHTML();
    $html = preg_replace(array('/(<table([^>]+)?>([^<>]+)?)(?!<tbody([^>]+)?>)/','/(<(?!(\/tbody))([^>]+)?>)(<\/table([^>]+)?>)/'),array('$1<tbody>','$1</tbody>$4'),$html);
    $dom->loadHTML($html);
matches `<table>` tag with whatever else junk inside the tag and between this and the next tag if the next tag is NOT `<tbody>` also with stuff inside the tag

    /(<table([^>]+)?>([^<>]+)?)(?!<tbody([^>]+)?>)/

replace with

    $1<tbody>

the $1 referencing the captured `<table>` tag with contents.
Do the same for the closing tag like this:

    /(<(?!(\/tbody))([^>]+)?>)(<\/table([^>]+)?>)/

replace with

    $1</tbody>$4

Before parsing, get the html as a string. Insert missing <tbody> and </tbody> tags with regex, then load it back into your DOMDocument object.

Jens Erat gives a good explanation, but here is

Just came across the same problem. I almost wrote a recursive funtion to check for every tbody tag if it exists and traverse the dom that way, then I remembered I know regex. :)

Just the regex:

This way the dom will ALWAYS have the <tbody> tags where necessary.

Note
Rectangle 27 0

dom Why does my XPath query (scraping HTML tables) only work in Firebug, but not the application I'm developing?


JavaScript
    var html = '<html><table><tr><td>foo</td><td>bar</td></tr></table></html>';
    html.replace(/(<table([^>]+)?>([^<>]+)?)(?!<tbody([^>]+)?>)/g,"$1<tbody>").replace(/(<(?!(\/tbody))([^>]+)?>)(<\/table([^>]+)?>)/g,"$1</tbody>$4");

PHP
    $html = $dom->saveHTML();
    $html = preg_replace(array('/(<table([^>]+)?>([^<>]+)?)(?!<tbody([^>]+)?>)/','/(<(?!(\/tbody))([^>]+)?>)(<\/table([^>]+)?>)/'),array('$1<tbody>','$1</tbody>$4'),$html);
    $dom->loadHTML($html);
matches `<table>` tag with whatever else junk inside the tag and between this and the next tag if the next tag is NOT `<tbody>` also with stuff inside the tag

    /(<table([^>]+)?>([^<>]+)?)(?!<tbody([^>]+)?>)/

replace with

    $1<tbody>

the $1 referencing the captured `<table>` tag with contents.
Do the same for the closing tag like this:

    /(<(?!(\/tbody))([^>]+)?>)(<\/table([^>]+)?>)/

replace with

    $1</tbody>$4

Before parsing, get the html as a string. Insert missing <tbody> and </tbody> tags with regex, then load it back into your DOMDocument object.

Jens Erat gives a good explanation, but here is

Just came across the same problem. I almost wrote a recursive funtion to check for every tbody tag if it exists and traverse the dom that way, then I remembered I know regex. :)

Just the regex:

This way the dom will ALWAYS have the <tbody> tags where necessary.

Note