

OCR, or optical character recognition, allows us to transform a scan or photograph of a letter or court filing into searchable, sortable text that we can analyze. Do you need to pay a lot of money to get reliable OCR results? Is Google Cloud Vision actually better than Tesseract? Are any cutting-edge neural network-based OCR engines worth the time investment of getting them set up?

One of our projects at Factful is to build tools that make state-of-the-art machine learning and artificial intelligence accessible to investigative reporters. We have been testing the components that already exist so we can prioritize our own efforts. We couldn’t find a single side-by-side comparison of the most accessible OCR options, so we ran a handful of documents through seven different tools and compared the results.

There are a lot of OCR options available. Some are easy to use, some require a bit of programming to make them work, and some require a lot of programming. Some are quite expensive, some are free and open source. We selected several documents (two easy-to-read reports, a receipt, a historical document, a legal filing with a lot of redaction, a filled-in disclosure form, and a water-damaged page) to run through the OCR engines we are most interested in. We tested three free and open source options (Calamari, OCRopus and Tesseract) as well as one desktop app (Adobe Acrobat Pro) and three cloud services (Abbyy Cloud, Google Cloud Vision, and Microsoft Azure Computer Vision).

All the scripts we used, as well as the complete output from each OCR engine, are available on GitHub. You can use the scripts to check our work, or to run your own documents against any of the clients we tested.

The quality of results varied between applications, but there wasn’t a standout winner. Most of the tools handled a clean document just fine. None got perfect results on trickier documents, but most were good enough to make text significantly more comprehensible. In most cases, if you need a complete, accurate transcription, you’ll have to do additional review and correction.

The current slate of good document recognition OCR engines uses a mix of techniques to read text from images, but they are all optimized for documents. They assume that material fits on a rectangular page. Most start with a line detection process that identifies lines of text in a document and then breaks them down into words or letter forms. Some use a dictionary to improve results: when a string is ambiguous, the engine will err on the side of the known word. A dictionary isn’t always enough, however, as Wesley Raabe learned as he was transcribing the 1879 edition of Uncle Tom’s Cabin.

The most promising advances in OCR technology are happening in the field of scene text recognition.
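To make the dictionary idea mentioned above concrete, here is a minimal Python sketch of how an ambiguous string might be nudged toward a known word. It is not how any of the engines we tested actually work internally. The word list, the `prefer_known_word` function, and the 0.6 similarity cutoff are all assumptions made up for illustration, and the matching relies on Python's standard `difflib` module rather than anything OCR-specific.

```python
from difflib import get_close_matches

# Hypothetical word list for illustration; a real engine would use a far larger lexicon.
KNOWN_WORDS = {"the", "cabin", "uncle", "filing", "court", "letter"}


def prefer_known_word(candidate, known_words=KNOWN_WORDS, cutoff=0.6):
    """Return the closest known word if one is similar enough, else the raw string."""
    lowered = candidate.lower()
    if lowered in known_words:
        # Already a dictionary word: keep what the recognizer produced.
        return candidate
    # Ask difflib for the single best match above the (assumed) similarity cutoff.
    matches = get_close_matches(lowered, sorted(known_words), n=1, cutoff=cutoff)
    return matches[0] if matches else candidate


if __name__ == "__main__":
    print(prefer_known_word("c0urt"))  # a zero misread for the letter "o"
    print(prefer_known_word("Raabe"))  # a proper noun with no close dictionary match
```

The second call hints at why a dictionary isn’t always enough: names, archaic spellings, and anything else missing from the word list either pass through uncorrected or, with a looser cutoff, risk being "corrected" into the wrong word.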

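The line detection step can also be sketched in a few lines. The example below uses a horizontal projection profile, one classic way to find lines of text; we are not claiming that any particular engine does it this way. The `find_text_lines` function, the `min_ink` parameter, and the toy bitmap are hypothetical, and the page is assumed to be already binarized (1 for ink, 0 for background) with text running in straight horizontal rows.

```python
def find_text_lines(bitmap, min_ink=1):
    """Return (top, bottom) row ranges where consecutive rows contain ink.

    bitmap is a list of rows, each a list of 0 (background) or 1 (ink).
    min_ink is the assumed number of ink pixels a row needs to count as text.
    """
    lines = []
    start = None  # row index where the current run of inked rows began
    for y, row in enumerate(bitmap):
        has_ink = sum(row) >= min_ink
        if has_ink and start is None:
            start = y  # a new text line begins
        elif not has_ink and start is not None:
            lines.append((start, y - 1))  # a blank row closes the line
            start = None
    if start is not None:
        # The last text line runs to the bottom edge of the page.
        lines.append((start, len(bitmap) - 1))
    return lines


if __name__ == "__main__":
    page = [
        [0, 0, 0, 0],
        [1, 1, 0, 1],
        [1, 0, 0, 1],
        [0, 0, 0, 0],
        [0, 1, 1, 0],
    ]
    print(find_text_lines(page))  # [(1, 2), (4, 4)]
```

A sketch like this only works when text sits in tidy horizontal rows on a rectangular page. Skewed, curved, or scattered text never produces clean blank rows between lines, which is one way to see why document-oriented engines struggle outside of documents.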