
java How do I improve the accuracy of the OCR text from Tesseract?


Further, being unable to detect more than 4 words depends on many factors: the kind of test image (and how many features it has), the size of the image, the platform, and so on.

I am developing with Tess4J, which is a Java JNA wrapper for tesseract-ocr, and in my testing it gives quite good results.

Inaccurate results might be due to the text size. To quote: "Accuracy drops off below 10pt x 300dpi, rapidly below 8pt x 300dpi."
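To make that threshold concrete: a typographic point is 1/72 inch, so 10pt text scanned at 300 dpi has a nominal height of about 41.7 pixels. The sketch below is only an illustration of that arithmetic; the class and method names are made up for this example, and it assumes you know (or can estimate) the point size and DPI of your source image.

```java
public class TextSizeCheck {

    // Nominal text height in pixels: 1 typographic point = 1/72 inch.
    static double nominalTextHeightPx(double pointSize, double dpi) {
        return pointSize / 72.0 * dpi;
    }

    // How much to enlarge an image so its text reaches the target size.
    // Never returns less than 1.0, i.e. it does not suggest shrinking.
    static double scaleFactorFor(double pointSize, double dpi,
                                 double targetPointSize, double targetDpi) {
        double have = nominalTextHeightPx(pointSize, dpi);
        double want = nominalTextHeightPx(targetPointSize, targetDpi);
        return Math.max(1.0, want / have);
    }

    public static void main(String[] args) {
        // Example: 10pt text in a 96 dpi screenshot, aiming for 10pt at 300 dpi.
        System.out.println(nominalTextHeightPx(10, 96));
        System.out.println(scaleFactorFor(10, 96, 10, 300));
    }
}
```

A screenshot at 96 dpi therefore needs to be enlarged roughly threefold before its 10pt text reaches the size the quote recommends.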

Of course, image pre-processing would increase the accuracy of the OCR engine, but at an additional cost in time. For pre-processing you can increase the DPI of the image, resize the image, and also try blurring/sharpening. High contrast between text and background is recognized much better. After that, try de-noising and binarizing the image. This improves the accuracy considerably.
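Two of the steps above (resizing and binarizing) can be sketched with nothing but the standard java.awt classes. This is a minimal sketch, not the method used by any particular project: the class name, the bicubic interpolation, and the fixed threshold are all assumptions chosen for illustration, and de-noising is left out. A fixed threshold only works when the lighting is even; an adaptive method may suit photos better.

```java
import java.awt.Graphics2D;
import java.awt.RenderingHints;
import java.awt.image.BufferedImage;

public class OcrPreprocess {

    // Enlarge the image by the given factor using bicubic interpolation.
    static BufferedImage upscale(BufferedImage src, double factor) {
        int w = (int) Math.round(src.getWidth() * factor);
        int h = (int) Math.round(src.getHeight() * factor);
        BufferedImage dst = new BufferedImage(w, h, BufferedImage.TYPE_INT_RGB);
        Graphics2D g = dst.createGraphics();
        g.setRenderingHint(RenderingHints.KEY_INTERPOLATION,
                RenderingHints.VALUE_INTERPOLATION_BICUBIC);
        g.drawImage(src, 0, 0, w, h, null);
        g.dispose();
        return dst;
    }

    // Convert to pure black and white: pixels darker than the threshold
    // (0-255 luminance) become black, everything else becomes white.
    static BufferedImage binarize(BufferedImage src, int threshold) {
        int w = src.getWidth();
        int h = src.getHeight();
        BufferedImage dst = new BufferedImage(w, h, BufferedImage.TYPE_BYTE_BINARY);
        for (int y = 0; y < h; y++) {
            for (int x = 0; x < w; x++) {
                int rgb = src.getRGB(x, y);
                int r = (rgb >> 16) & 0xFF;
                int g = (rgb >> 8) & 0xFF;
                int b = rgb & 0xFF;
                // Integer approximation of perceived brightness.
                int luminance = (299 * r + 587 * g + 114 * b) / 1000;
                dst.setRGB(x, y, luminance < threshold ? 0xFF000000 : 0xFFFFFFFF);
            }
        }
        return dst;
    }
}
```

A typical call order would be upscale first, then binarize, and then hand the resulting BufferedImage to the OCR engine.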

The Tesseract API class provides an isValidWord method to check whether a string is a valid word. You can use it to check the recognized characters, which can help increase the accuracy of the output.
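One way to use such a check is to flag the words that fail it, so they can be re-examined or corrected. The exact signature of isValidWord in your wrapper is not shown here, so the sketch below takes a Predicate<String> as a stand-in for it; WordFilter and flagSuspectWords are hypothetical names for this example only.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Predicate;

public class WordFilter {

    // Returns the words in the OCR output that the validity check rejects.
    // isValidWord is a stand-in for whatever word check your wrapper exposes.
    static List<String> flagSuspectWords(String ocrText, Predicate<String> isValidWord) {
        List<String> suspects = new ArrayList<>();
        for (String token : ocrText.trim().split("\\s+")) {
            // Strip leading and trailing punctuation before checking.
            String word = token.replaceAll("^\\p{Punct}+|\\p{Punct}+$", "");
            if (!word.isEmpty() && !isValidWord.test(word)) {
                suspects.add(word);
            }
        }
        return suspects;
    }
}
```

Note that this only identifies doubtful words; it does not fix them. Whether to drop them, re-run OCR on that region, or try a spelling correction is a separate decision.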

Thanks, but I wanted to know how we can improve the recognition. For instance, if you look at the project uploaded by Robert Theis at github.com/rmtheis/android-ocr, you can see he has used image enhancement algorithms, and even though he uses the same Tesseract API as I do, his recognition rate is higher.
