java How do I improve the accuracy of the OCR text from Tesseract?

Further, whether you can detect more than 4 words depends on a lot of factors: the kind of test image (and how many features it has), the size of the image, the platform, etc.

I am developing with Tess4J, which is a Java JNA wrapper for tesseract-ocr, and in my tests it gives quite good results.

Inaccurate results might be due to the text size. Check this out; it says: "Accuracy drops off below 10pt x 300dpi, rapidly below 8pt x 300dpi."
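
To put that quote into pixels, here is a rough sketch of the arithmetic, assuming the usual convention of 72 points per inch. The class and method names are made up for illustration and are not part of Tesseract or Tess4J.

```java
final class TextSize {
    /** Nominal height in pixels of text at the given point size and DPI (1 pt = 1/72 inch). */
    static double pointsToPixels(double points, double dpi) {
        return points / 72.0 * dpi;
    }

    /**
     * Factor by which an image would have to be enlarged so that text of
     * size `points` scanned at `dpi` ends up as tall as `targetPoints`
     * text at `targetDpi`.
     */
    static double scaleToReach(double points, double dpi,
                               double targetPoints, double targetDpi) {
        return pointsToPixels(targetPoints, targetDpi) / pointsToPixels(points, dpi);
    }
}
```

By this estimate, 10pt at 300dpi is about 42 pixels of nominal font height, so 10pt text captured at 96dpi would need to be enlarged by a factor of about 3.1 to reach the same pixel size.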

Oh, of course, image pre-processing would increase the accuracy of the OCR engine, but at an additional cost in time. For pre-processing you can increase the DPI of the image, resize the image, and also try blurring or sharpening. High contrast between text and background is recognized much better. After that, try de-noising and binarizing the image. It improves the accuracy quite a lot.
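
Two of those steps (resizing and binarizing) can be sketched with plain java.awt, without any extra library. This is only a minimal example under my own assumptions: the class name is hypothetical, the scale factor and the fixed threshold are values you would have to tune for your images, and de-noising is not covered here.

```java
import java.awt.Graphics2D;
import java.awt.RenderingHints;
import java.awt.image.BufferedImage;

final class OcrPreprocess {
    /** Resizes the image by the given factor using bicubic interpolation. */
    static BufferedImage scale(BufferedImage src, double factor) {
        int w = Math.max(1, (int) Math.round(src.getWidth() * factor));
        int h = Math.max(1, (int) Math.round(src.getHeight() * factor));
        BufferedImage out = new BufferedImage(w, h, BufferedImage.TYPE_INT_RGB);
        Graphics2D g = out.createGraphics();
        g.setRenderingHint(RenderingHints.KEY_INTERPOLATION,
                RenderingHints.VALUE_INTERPOLATION_BICUBIC);
        g.drawImage(src, 0, 0, w, h, null);
        g.dispose();
        return out;
    }

    /**
     * Converts the image to pure black and white: pixels whose luminance is
     * at or above the threshold (0..255) become white, the rest black.
     * A fixed global threshold is an assumption; unevenly lit images may
     * need an adaptive method instead.
     */
    static BufferedImage binarize(BufferedImage src, int threshold) {
        int w = src.getWidth();
        int h = src.getHeight();
        BufferedImage out = new BufferedImage(w, h, BufferedImage.TYPE_BYTE_BINARY);
        for (int y = 0; y < h; y++) {
            for (int x = 0; x < w; x++) {
                int rgb = src.getRGB(x, y);
                int r = (rgb >> 16) & 0xFF;
                int g = (rgb >> 8) & 0xFF;
                int b = rgb & 0xFF;
                // Integer approximation of the common 0.299/0.587/0.114 luma weights.
                int luma = (299 * r + 587 * g + 114 * b) / 1000;
                out.setRGB(x, y, luma >= threshold ? 0xFFFFFFFF : 0xFF000000);
            }
        }
        return out;
    }
}
```

You would then pass something like `OcrPreprocess.binarize(OcrPreprocess.scale(image, 2.0), 128)` to the OCR engine instead of the original image, and compare the results.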

The Tesseract API class provides an isValidWord method to check whether a string is a valid word. You can use this to check the recognized characters, which can help increase the accuracy of the output.
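
As a rough idea of how such a check could be used, the sketch below flags the words in the OCR output that a validator rejects. The `Predicate<String>` is only a stand-in for whatever validity check you have available (for example a call to isValidWord in your wrapper); the class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Predicate;

final class WordCheck {
    /** Returns the words of ocrText that the given validator rejects. */
    static List<String> suspectWords(String ocrText, Predicate<String> isValid) {
        List<String> suspects = new ArrayList<>();
        for (String token : ocrText.trim().split("\\s+")) {
            // Strip leading and trailing punctuation before checking the word.
            String word = token.replaceAll("^\\p{Punct}+|\\p{Punct}+$", "");
            if (!word.isEmpty() && !isValid.test(word)) {
                suspects.add(word);
            }
        }
        return suspects;
    }
}
```

The flagged words are the ones worth re-running with different pre-processing or correcting by hand.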

Thanks, but I wanted to know how we can improve the recognition. For instance, if you look at the project uploaded by Robert Theis, you can see he has used image enhancement algorithms, and even though he uses the same Tesseract API as I do, his recognition rate is higher.