Comparing Optical Character Recognition Tools for Text-Dense Documents vs. Scene Text

By: Sayema Mashhadi, Statistics Canada

Optical Character Recognition (OCR) is used to capture text from photos that contain textual information, as well as from scanned documents such as forms, invoices or reports. The text can be any combination of printed or handwritten characters, including symbols, numbers and punctuation marks. Many open-source OCR tools, available in various programming languages, can be used without any data collection or machine learning training beforehand. This article compares two popular open-source OCR tools used to recognize printed text in information extraction projects at Statistics Canada's Data Science Division: Google's OCR engine Tesseract and Clova AI's OCR tool CRAFT. Tesseract was developed by Hewlett-Packard (HP) in the 1980s as proprietary software and was open-sourced in 2005. Since then, Google has continued its development and releases improved versions every few years at no cost. Thanks to regular, meaningful upgrades to the source code, Tesseract is very popular in the open-source community. CRAFT is a state-of-the-art OCR tool for scene text detection (images with regions of text at various angles against a complex background), with the highest Harmonic Mean (H-mean) score across multiple public datasets compared with other open-source scene text detection tools. For more H-mean scores, see "Character Region Awareness for Text Detection".

Motivation

Information extraction (IE) is the process of extracting useful, structured information from unstructured data such as text files, images, videos and audio clips. The extracted information is used to prepare data for analysis. This information is usually obtained by manually combing through the unstructured data, which is time-consuming and error-prone. IE tools can reduce manual effort, save time and reduce the risk of human error.

OCR tools are required in IE when dealing with images or scanned documents. In past projects, we worked with scanned images of natural health products, scanned forms and scanned financial statements, all of which require text detection and recognition, among other steps, to convert unstructured data into usable structured data. To read about one of our document intelligence projects, see our recent Data Science Network article Document Intelligence: The art of PDF information extraction.

Image quality

OCR performance, regardless of the algorithm or technology behind it, depends heavily on image quality. If an image is crisp and clear to the human eye, an OCR tool is more likely to convert it to a text string with high accuracy. Poor quality images, on the other hand, can quickly degrade the performance of OCR tools. Image pre-processing techniques such as binarization, noise removal, and increasing contrast and sharpness can be applied to improve the visibility of textual information in images or scanned data. Scanned documents can also contain alignment issues that should be fixed by de-skewing and perspective or curvature correction. Read more about typical pre-processing techniques applied before OCR in the Survey on Image Preprocessing Techniques to Improve OCR Accuracy.
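As an illustration, the following is a minimal sketch of a few of these pre-processing steps using OpenCV before an image is handed to an OCR engine. The file names and parameter values are placeholders chosen for the example and are not taken from our projects; de-skewing and perspective correction are left out but would follow the same pattern.

```python
import cv2

def preprocess_for_ocr(path: str):
    """Basic clean-up before OCR: grayscale, denoise and binarize a scanned page."""
    image = cv2.imread(path)
    gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)

    # Suppress the speckle noise typical of scanned pages
    denoised = cv2.fastNlMeansDenoising(gray, h=10)

    # Otsu's method picks the binarization threshold automatically,
    # giving dark text on a clean white background
    _, binary = cv2.threshold(denoised, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)
    return binary

cleaned = preprocess_for_ocr("scanned_page.png")  # placeholder file name
cv2.imwrite("cleaned_page.png", cleaned)
```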

Text Detection and Recognition

Recognition is only part of the solution. Though often referred to simply as OCR, OCR comprises two parts: detection and recognition. The tool must first detect where text is positioned in the image and then recognize that text, that is, convert it to a plain string. An image can contain empty spaces, objects, figures, drawings, graphs and other elements that cannot be converted to a plain string, so a detection model first identifies and isolates regions of text and then feeds them to the recognition model for conversion.
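Tesseract exposes both stages through the pytesseract wrapper: image_to_data returns the detected word boxes (detection) along with the recognized strings (recognition). The sketch below assumes pytesseract and the Tesseract binary are installed; the file name is a placeholder.

```python
import pytesseract
from PIL import Image

image = Image.open("scanned_page.png")  # placeholder file name

# One call returns both stages: bounding boxes (detection) and strings (recognition)
data = pytesseract.image_to_data(image, output_type=pytesseract.Output.DICT)

for text, left, top, width, height, conf in zip(
        data["text"], data["left"], data["top"],
        data["width"], data["height"], data["conf"]):
    if text.strip():  # skip the empty entries Tesseract emits for whitespace
        print(f"({left}, {top}, {width}, {height}) conf={conf}: {text}")
```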

There are different OCR tools available for specific kinds of images. Two widely popular categories are scene text detection/recognition and document detection/recognition. OCR tools trained on scene imagery are better at detecting text regions in a complex background and at recognizing curved text than OCR tools trained on scanned documents that contain text on plain backgrounds. It's important to choose models trained for the task closest to our use case. Additionally, we have observed that using a detection model from one tool and a recognition model from another can also lead to improved performance.

Use Case: Text-dense documents

Documents are usually dense with text regions, and the OCR tool of choice should be trained on similar data: dark text on a light background, paragraphs, different font sizes, often with white space, or with figures, tables, graphs and diagrams occupying part of the page.

When comparing the performance of Tesseract and CRAFT, we observed that Tesseract is better at recognizing the complete text in an image, while CRAFT makes a few mistakes. Our experiments also showed that Tesseract is better at recognizing letters and punctuation. However, CRAFT performed better than Tesseract at recognizing numbers and codes (combinations of numbers and letters), even in text-dense documents. During our tests, we noted that CRAFT mistakes the character 'O' for a zero when used on an image in the English language. This indicates that the CRAFT model is not biased toward predicting letters when dealing with paragraphs of text.
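When the target fields are codes or numbers, Tesseract's character whitelist can reduce confusions such as 'O' versus zero. The snippet below is a minimal sketch assuming a pre-cropped image of a single code field; the file name and whitelist characters are placeholders for illustration.

```python
import pytesseract
from PIL import Image

field = Image.open("code_field.png")  # placeholder: a cropped image of a numeric code

# Treat the crop as a single line of text (--psm 7) and restrict the
# character set so the letter 'O' cannot be returned where a zero belongs
config = "--psm 7 -c tessedit_char_whitelist=0123456789-"
print(pytesseract.image_to_string(field, config=config).strip())
```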

Use Case: Scene Text

Scene text, or text-in-the-wild, refers to images with regions of text at various angles in a complex background. Images of a busy marketplace, a street with signage, or commercial products can fall into this category. In the example below, a natural health product is pictured. It can be quite challenging to detect text in various fonts, colours and alignments, and neither OCR tool in our tests achieved perfect accuracy. It can be noted, however, that CRAFT detected more regions of text than Tesseract. This is mainly because CRAFT is a scene text detection and recognition model trained to accurately identify a variety of text regions in complex backgrounds. Our experiments showed that CRAFT identifies even the smallest regions of text in an image and surpasses other available scene text detection models.

Photo of a natural health care product, Jamieson's Cold Fighter

Description: The photo shows only the front of the outer box, with an angled "NEW" in the right corner, the "Jamieson" logo under a partial rainbow, and "Natural Sources" and "Since 1922" in decreasing font sizes. "Chewable Cold Fighter" appears in three different font sizes. A short product description, "Fights early signs of cold and flu symptoms", sits beside an image of honey dripping onto a slice of lemon, with the text "Soothing Honey Lemon" next to the lemon slice and "Just 1 per day!" in the bottom left corner.


Tesseract output

ea
Jamieson:
NATURAL SOURCES,
Fights early signs of
cold and flu symptoms
nN
oneY LeMo!
SOOTHING Ht
'R DAY!


CRAFT output

Jamieson
NATURAL sources
Since 1022
CHEWABLE
FIGHTER
Fights early signs of
cold and flu symptoms
Soothing
DAYI
pOts
6
LEMON
HOneY
JJUST
PER

Customized OCR Tool

In addition to the variety of tools available for different use cases, a custom OCR pipeline can be created by combining the text detection model from one tool with the recognition model from another. This was implemented successfully in one of our information extraction projects dealing with scanned forms, where CRAFT was used for detection and Tesseract was used for recognition. The combination worked well because CRAFT is better at identifying all regions of text, including the much smaller font sizes used for form field titles, and Tesseract is better at converting the individual text regions it is given into plain strings.
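A rough sketch of such a pipeline is shown below. It assumes the community craft-text-detector package, which wraps Clova AI's detector and, to the best of our knowledge, exposes a Craft class whose detect_text method returns the detected boxes; each crop is then passed to Tesseract via pytesseract. The file name, constructor arguments and box format are assumptions for illustration, not a definitive implementation.

```python
import cv2
import pytesseract
from craft_text_detector import Craft  # assumed API of the community CRAFT wrapper

# Detection stage: CRAFT finds the text regions, including very small ones
craft = Craft(output_dir=None, crop_type="box", cuda=False)  # assumed constructor arguments
detection = craft.detect_text("scanned_form.png")            # placeholder file name

# Recognition stage: Tesseract converts each detected region to a plain string
image = cv2.imread("scanned_form.png")
results = []
for box in detection["boxes"]:  # each box is assumed to be four (x, y) corner points
    xs = [int(point[0]) for point in box]
    ys = [int(point[1]) for point in box]
    crop = image[min(ys):max(ys), min(xs):max(xs)]
    text = pytesseract.image_to_string(crop, config="--psm 7").strip()  # one line per region
    if text:
        results.append(text)

craft.unload_craftnet_model()
craft.unload_refinenet_model()
print(results)
```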

Conclusion

Open-source OCR tools are a great resource for extracting information from images or scanned documents. Many state-of-the-art OCR tools are trained for a specific use case and can be used without any prior training. Performance relies heavily on image quality, but image pre-processing techniques can be applied to improve results. In our tests, Google's OCR engine Tesseract performed better on text-dense documents and Clova AI's CRAFT performed better on text-in-the-wild, or scene text, data. By combining them, we were able to use the strengths of both tools to perform more accurate information extraction from more complex sets of images containing a variety of text.
