
Another library that might be useful for HTML processing is jsoup. Jsoup tries to clean malformed HTML and allows HTML parsing in Java using jQuery-like selector syntax.
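
For instance, a minimal sketch of that cleanup-plus-selector style (the HTML string and class name here are only illustrative):

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class JsoupSelectorDemo {
    public static void main(String[] args) {
        // Malformed input: unclosed <li> and <p> tags
        String bad = "<ul><li>one<li>two</ul><p>trailing";
        Document doc = Jsoup.parse(bad);              // jsoup repairs the tree
        for (Element item : doc.select("ul > li")) {  // jQuery-like CSS selector
            System.out.println(item.text());
        }
    }
}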

Java HTML Parsing - Stack Overflow

java html parsing web-scraping

I'd use a decent HTML parser like Jsoup. It's then as easy as:

String html = Jsoup.connect("http://stackoverflow.com").get().html();

It handles GZIP and chunked responses and character encoding fully transparently. It offers more advantages as well, like HTML traversing and manipulation by CSS selectors, much as jQuery can do. You only have to grab it as a Document, not as a String.

Document document = Jsoup.connect("http://google.com").get();
;)
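
As a quick illustration of that selector-based traversal (a sketch; the selector and class name are only examples):

import java.io.IOException;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class LinkDump {
    public static void main(String[] args) throws IOException {
        Document document = Jsoup.connect("http://stackoverflow.com").get();
        // jQuery-style CSS selector: every anchor carrying an href attribute
        for (Element link : document.select("a[href]")) {
            System.out.println(link.attr("abs:href") + " -> " + link.text());
        }
    }
}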

Why did no one tell me about .html() before? I looked so hard into how to easily store the HTML fetched by Jsoup, and this helps a lot.

http - How do you Programmatically Download a Webpage in Java - Stack ...

java http compression

I have tried most of the parsers here for a crawler / data extraction framework I have been developing. I use HTMLCleaner for the bulk of the data extraction work. This is because it supports a reasonably modern dialect of HTML, XHTML, and HTML 5, with namespaces, and it supports DOM, so it is possible to use it with Java's built-in XPath implementation.

It's a lot easier to do this with HTMLCleaner than with some of the other parsers: JSoup, for example, supports a DOM-like interface rather than DOM, so some assembly is required. Jericho has a SAX-like interface, so it again requires some work; Sujit Pal has a good description of how to do this, but in the end HTMLCleaner just worked better.
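
A minimal sketch of that HTMLCleaner-plus-JDK-XPath combination (the HTML string, class name, and XPath expression are only examples):

import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;

import org.htmlcleaner.CleanerProperties;
import org.htmlcleaner.DomSerializer;
import org.htmlcleaner.HtmlCleaner;
import org.htmlcleaner.TagNode;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;

public class CleanerXPathDemo {
    public static void main(String[] args) throws Exception {
        // Malformed input: the body and html tags are never closed
        String html = "<html><body><table><tr><td>cell text</td></tr></table>";

        // Clean the HTML into HTMLCleaner's tag tree
        TagNode root = new HtmlCleaner().clean(html);

        // Serialize to a W3C DOM so the JDK's built-in XPath engine can query it
        Document dom = new DomSerializer(new CleanerProperties()).createDOM(root);

        XPath xpath = XPathFactory.newInstance().newXPath();
        NodeList cells = (NodeList) xpath.evaluate("//table//td", dom, XPathConstants.NODESET);
        for (int i = 0; i < cells.getLength(); i++) {
            System.out.println(cells.item(i).getTextContent());
        }
    }
}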

I also use HTMLParser and Jericho for a table extraction task, which replaced some code written using Perl's libhtml-tableextract-perl. I use HTMLParser to filter the HTML for the table, then use Jericho to parse it. I agree with MJB's and Adam's comments that Jericho is good in some cases because it preserves the underlying HTML. It has a kind of non-standard SAX interface, so for XPath processing HTMLCleaner is better.
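
A rough sketch of the Jericho side of such a table-extraction pipeline (the input string is a stand-in for the table HTML that HTMLParser would have filtered out):

import java.util.List;

import net.htmlparser.jericho.Element;
import net.htmlparser.jericho.HTMLElementName;
import net.htmlparser.jericho.Source;

public class JerichoTableDemo {
    public static void main(String[] args) {
        String tableHtml = "<table><tr><td>a</td><td>b</td></tr><tr><td>c</td><td>d</td></tr></table>";
        Source source = new Source(tableHtml);
        // Walk rows, then cells, extracting the text content of each cell
        for (Element row : source.getAllElements(HTMLElementName.TR)) {
            List<Element> cells = row.getAllElements(HTMLElementName.TD);
            for (Element cell : cells) {
                System.out.print(cell.getTextExtractor().toString() + "\t");
            }
            System.out.println();
        }
    }
}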

Parsing HTML in Java is a surprisingly hard problem as all the parsers seem to struggle on certain types of malformed HTML content.

What are the pros and cons of the leading Java HTML parsers? - Stack O...

java html parsing

I have solved(ish) the issue by preprocessing the XML with JSoup (which is a nod to @Ian Roberts's comment about parsing the XML with a non-XML tool). JSoup is (or was) designed for HTML documents, but it works well in this context.

My code is as follows:

import java.io.IOException;
import java.util.List;

import org.apache.commons.io.FileUtils;
import org.apache.commons.lang3.StringUtils;
import org.jsoup.nodes.Attribute;
import org.jsoup.nodes.Node;
import org.jsoup.parser.Parser;
import org.junit.Test;

@Test
public void verifyNewlineEscaping() throws IOException {
    // sourcePath (defined elsewhere in the test class) points at the XML file under test
    final List<Node> nodes = Parser.parseXmlFragment(FileUtils.readFileToString(sourcePath.toFile(), "UTF-8"), "");

    fixAttributeNewlines(nodes);

    // Reconstruct XML
    StringBuilder output = new StringBuilder();
    for (Node node : nodes) {
        output.append(node.toString());
    }

    // Print cleansed output to stdout
    System.out.println(output);
}

/**
 * Replace newlines and surrounding whitespace in XML attributes with an alternative delimiter in
 * order to avoid whitespace normalisation converting newlines to a single space.
 * 
 * <p>
 * This is useful if newlines which have semantic value have been incorrectly inserted into
 * attribute values.
 * </p>
 * 
 * @param nodes nodes to update
 */
private static void fixAttributeNewlines(final List<Node> nodes) {

    /*
     * Recursively iterate over all attributes in all nodes in the XML document, performing
     * attribute string replacement
     */
    for (final Node node : nodes) {
        final List<Attribute> attributes = node.attributes().asList();

        for (final Attribute attribute : attributes) {

            // JSoup reports whitespace as attributes
            if (!StringUtils.isWhitespace(attribute.getValue())) {
                attribute.setValue(attribute.getValue().replaceAll("\\s*\r?\n\\s*", "|"));
            }
        }

        // Recursively process child nodes
        if (!node.childNodes().isEmpty()) {
            fixAttributeNewlines(node.childNodes());
        }
    }
}

For the sample XML in my question, the output of this method is:

<sample> 
    <p att="John|Paul|Ringo"></p> 
</sample>

Note that JSoup is rather vigilant in its character escaping and escapes everything in attribute values. It also replaces existing numeric entity references with their UTF-8 equivalents, so time will tell whether or not this is a passable solution.

Note that the downside of using JSoup is that it currently converts attribute names to lowercase. There is an open bug detailing this.

Replacing newlines in XML attributes with XSLT - Stack Overflow

xml xslt xslt-2.0

That "XML" is worse than invalid -- it's not well-formed; see Well Formed vs Valid XML.

An informal assessment of the predictability of the transgressions does not help. That textual data is not XML. No conformant XML tools or libraries can help you process it.

  • Have the provider fix the problem on their end. Demand well-formed XML. (Technically the phrase well-formed XML is redundant but may be useful for emphasis.)

  • Use a tolerant markup parser to clean up the problem ahead of parsing as XML:

      • Standalone: xmlstarlet, e.g.

            xmlstarlet fo -o -R -H -D bad.xml 2>/dev/null

      • Standalone and C: tidy works with XML too.
      • Python: Beautiful Soup is Python-based. See notes in the Differences between parsers section. See also answers to this question for more suggestions for dealing with not-well-formed markup in Python. See also this answer for how to use codecs.EncodedFile() to clean up illegal characters.
      • Java: JSoup focuses on HTML; FilterInputStream can be used for preprocessing cleanup. (A sketch follows this list.)
      • R: See htmlTreeParse() for fault-tolerant markup parsing in R.
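
For the Java route, a minimal sketch of coercing not-well-formed input into well-formed XML with JSoup's XML parser (the input string and class name are only examples):

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.parser.Parser;

public class JsoupXmlCleanup {
    public static void main(String[] args) {
        // Not well-formed: an unclosed element and a bare ampersand
        String bad = "<root><item>fish & chips<item>second</root>";
        Document doc = Jsoup.parse(bad, "", Parser.xmlParser());
        // Re-serializes with tags balanced and the ampersand escaped
        System.out.println(doc.outerHtml());
    }
}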

  • Process the data as text manually using a text editor or programmatically using character/string functions. Doing this programmatically can range from tricky to impossible, as what appears to be predictable often is not -- rule breaking is rarely bound by rules.

For invalid-character errors, a regex can strip or replace characters outside the legal XML character ranges:

PHP:        preg_replace('/[^\x{0009}\x{000a}\x{000d}\x{0020}-\x{D7FF}\x{E000}-\x{FFFD}]+/u', ' ', $s);
Ruby:       string.tr("^\u{0009}\u{000a}\u{000d}\u{0020}-\u{D7FF}\u{E000}-\u{FFFD}", ' ')
JavaScript: inputStr.replace(/[^\x09\x0A\x0D\x20-\xFF\x85\xA0-\uD7FF\uE000-\uFDCF\uFDE0-\uFFFD]/gm, '')
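
Since this question is tagged java, a rough equivalent (a sketch; it covers the Basic Multilingual Plane only, so legal supplementary characters encoded as surrogate pairs would be stripped too):

Java:       inputStr.replaceAll("[^\\x09\\x0A\\x0D\\x20-\\uD7FF\\uE000-\\uFFFD]", " ")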

Yeah, I was thinking these might be the only options, but I wanted to see if there were any other solutions first. Thanks for the input.

java - How to parse invalid (bad / not well-formed) XML? - Stack Overf...

java xml xml-parsing xml-validation