Another library that may be useful for HTML processing is jsoup. It tries to clean up malformed HTML and lets you parse HTML in Java using a jQuery-like selector syntax.

Java HTML Parsing - Stack Overflow

java html parsing web-scraping

I'd use a decent HTML parser like Jsoup. It's then as easy as:

String html = Jsoup.connect("http://stackoverflow.com").get().html();

It handles GZIP and chunked responses and character encoding fully transparently. It offers more advantages as well, such as HTML traversal and manipulation with CSS selectors, as jQuery does. You only have to grab it as a Document, not as a String.

Document document = Jsoup.connect("http://google.com").get();
;)
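As a rough sketch of the selector-based traversal mentioned above: the HTML snippet, the class name SelectorSketch, the helper navLinks and the ".nav" selector are made up for illustration; only Jsoup.parse, select and attr are jsoup calls. It parses an in-memory string rather than a live page so it runs without network access, and it assumes jsoup is on the classpath.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

import java.util.ArrayList;
import java.util.List;

final class SelectorSketch {

    // Collects the href of every <a> nested inside an element with class "nav".
    // The selector is an illustrative choice, not something from the answer.
    static List<String> navLinks(String html) {
        Document document = Jsoup.parse(html);
        List<String> hrefs = new ArrayList<>();
        for (Element link : document.select(".nav a[href]")) {
            hrefs.add(link.attr("href"));
        }
        return hrefs;
    }

    public static void main(String[] args) {
        // Hypothetical markup; the last link sits outside .nav and is skipped.
        String html = "<div class=nav><a href='/questions'>Questions</a>"
                + "<a href='/tags'>Tags</a></div><a href='/other'>Other</a>";
        System.out.println(navLinks(html)); // [/questions, /tags]
    }
}
```

The same select call should work on a Document obtained from Jsoup.connect(...).get(), which is why it pays to keep the Document rather than the String.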

Why did no one tell me about .html() before? I looked so hard for an easy way to store the HTML fetched by Jsoup, and this helps a lot.

http - How do you Programmatically Download a Webpage in Java - Stack ...

java http compression

I have solved(ish) the issue by preprocessing the XML with JSoup (which is a nod to @Ian Roberts's comment about parsing the XML with a non-XML tool). JSoup is (or was) designed for HTML documents, but it works well in this context.

My code is as follows:

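// Needs jsoup, Apache Commons IO (FileUtils) and Commons Lang (StringUtils) on the classpath
@Test
public void verifyNewlineEscaping() throws IOException {
    // Parse the file leniently as an XML fragment; the base URI is left empty
    final List<Node> nodes = Parser.parseXmlFragment(FileUtils.readFileToString(sourcePath.toFile(), "UTF-8"), "");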

    fixAttributeNewlines(nodes);

    // Reconstruct XML
    StringBuilder output = new StringBuilder();
    for (Node node : nodes) {
        output.append(node.toString());
    }

    // Print cleansed output to stdout
    System.out.println(output);
}

/**
 * Replace newlines and surrounding whitespace in XML attributes with an alternative delimiter in
 * order to avoid whitespace normalisation converting newlines to a single space.
 * 
 * <p>
 * This is useful if newlines which have semantic value have been incorrectly inserted into
 * attribute values.
 * </p>
 * 
 * @param nodes nodes to update
 */
private static void fixAttributeNewlines(final List<Node> nodes) {

    /*
     * Recursively iterate over all attributes in all nodes in the XML document, performing
     * attribute string replacement
     */
    for (final Node node : nodes) {
        final List<Attribute> attributes = node.attributes().asList();

        for (final Attribute attribute : attributes) {

            // JSoup reports whitespace as attributes
            if (!StringUtils.isWhitespace(attribute.getValue())) {
                attribute.setValue(attribute.getValue().replaceAll("\\s*\r?\n\\s*", "|"));
            }
        }

        // Recursively process child nodes
        if (!node.childNodes().isEmpty()) {
            fixAttributeNewlines(node.childNodes());
        }
    }
}

For the sample XML in my question, the output of this method is:

<sample> 
    <p att="John|Paul|Ringo"></p> 
</sample>
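The replacement pattern can also be tried on its own, without JSoup. This is only a sketch: the class name AttributeNewlineDemo, the helper collapseNewlines and the raw input string are assumptions made for illustration (the exact attribute value from the question is not reproduced here); the regular expression is the one used in fixAttributeNewlines above.

```java
import java.util.regex.Matcher;

final class AttributeNewlineDemo {

    // Replaces each line break, together with the whitespace around it,
    // with the given delimiter. quoteReplacement keeps characters such as
    // '$' or '\' in the delimiter from being read as group references.
    static String collapseNewlines(String attributeValue, String delimiter) {
        return attributeValue.replaceAll("\\s*\r?\n\\s*", Matcher.quoteReplacement(delimiter));
    }

    public static void main(String[] args) {
        // Hypothetical attribute value mixing Windows and Unix line endings
        String raw = "John\r\n        Paul\n        Ringo";
        System.out.println(collapseNewlines(raw, "|")); // John|Paul|Ringo
    }
}
```

Because the leading and trailing \s* also match line breaks, a run of several blank lines collapses into a single delimiter rather than one per line.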

Note that I am not using because JSoup is rather vigilant in its character escaping and escapes everything in attribute values. It also replaces existing numeric entity references with their UTF-8 equivalent, so time will tell whether or not this is a passable solution.

Another downside of using JSoup is that it currently converts attribute names to lowercase. There is an open bug detailing this.

Replacing newlines in XML attributes with XSLT - Stack Overflow

xml xslt xslt-2.0