
Can I give jsoup a fallback character encoding to use when meta tags aren't found?


I decided to use Apache Tika. Its HtmlEncodingDetector class looks for a charset declaration in the HTML meta tags. When that fails because no such meta tag exists, I fall back to Tika's UniversalEncodingDetector. (The latter is a wrapper for juniversalchardet. I use the wrapper rather than calling juniversalchardet directly because it's handy for both detectors to share the same Java interface.)

The only caveat is that Tika is quite a large project and adding it pulled in a large number of irrelevant dependencies.
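
The fallback chain described above can be sketched without Tika on the classpath. Everything in this block is a hypothetical stand-in, not Tika's API: the `Detector` interface, the regex-based `META_TAG` detector, and the `VALID_UTF8` check are simplified placeholders for HtmlEncodingDetector and UniversalEncodingDetector. It relies on the same convention that, as far as I know, Tika's detectors follow: return null when undecided and leave the stream positioned where it started. It needs Java 11 or later for `readNBytes`.

```java
import java.io.BufferedInputStream;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class FallbackCharsetDetection {

    /** Hypothetical stand-in for a shared detector interface: null means "could not decide". */
    interface Detector {
        Charset detect(InputStream in) throws IOException;
    }

    private static final int PEEK = 8192;
    private static final Pattern META =
            Pattern.compile("(?i)<meta[^>]*charset\\s*=\\s*[\"']?([\\w.:-]+)");

    /** Reads up to PEEK bytes, then rewinds so the next detector (and the parser) sees them too. */
    private static byte[] peek(InputStream in) throws IOException {
        in.mark(PEEK);
        byte[] head = in.readNBytes(PEEK);
        in.reset();
        return head;
    }

    /** First choice: a charset declared in a meta tag (simplified placeholder logic). */
    static final Detector META_TAG = in -> {
        String head = new String(peek(in), StandardCharsets.ISO_8859_1);
        Matcher m = META.matcher(head);
        if (!m.find()) return null;
        try {
            return Charset.forName(m.group(1));
        } catch (IllegalArgumentException e) {
            return null; // illegal or unsupported charset name: let the next detector try
        }
    };

    /** Fallback: accept UTF-8 only if the sampled bytes contain no malformed sequence. */
    static final Detector VALID_UTF8 = in -> {
        byte[] head = peek(in);
        // endOfInput=false, so a multi-byte character cut off by the sample is not an error.
        boolean malformed = StandardCharsets.UTF_8.newDecoder()
                .decode(ByteBuffer.wrap(head), CharBuffer.allocate(head.length), false)
                .isError();
        return malformed ? null : StandardCharsets.UTF_8;
    };

    /** Tries each detector in order and returns the first non-null answer. */
    static Charset detect(InputStream in, List<Detector> chain, Charset lastResort)
            throws IOException {
        for (Detector d : chain) {
            Charset cs = d.detect(in);
            if (cs != null) return cs;
        }
        return lastResort;
    }

    /** Convenience wrapper; BufferedInputStream provides the mark/reset support used by peek(). */
    static Charset detect(byte[] html, Charset lastResort) throws IOException {
        InputStream in = new BufferedInputStream(new ByteArrayInputStream(html));
        return detect(in, List.of(META_TAG, VALID_UTF8), lastResort);
    }
}
```

With Tika, the two placeholder detectors would be replaced by instances of HtmlEncodingDetector and UniversalEncodingDetector. The resulting charset name can then be passed to jsoup's `Jsoup.parse(InputStream in, String charsetName, String baseUri)` overload.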
