ocr

If you have a bunch of scanned PDF files sitting on your hard drive and no OCR software to convert them into text, here’s how you can recognize text in PDF files with Google OCR. There are two types of PDF documents: those created by sending Office files, images, etc. to an Acrobat-like PDF printer, and those created by scanning physical paper such as pages of a book, legal documents, etc. Google could always index PDF documents created by conversion, but now it also uses OCR to recognize text in PDFs that were generated by scanning paper documents. This is a scanned document and this is the HTML text view of that same document converted by Google. Since scanned PDFs are nothing but images, don’t be surprised if Google adds a "search by text" function to its Image Search engine, similar to OneNote or EverNote. That will surely be huge. http://www.labnol.org/software/convert-scanned-pdf-images-to-text-with-google-ocr/5158/

Convert Scanned PDF Documents to Text with Google OCR

http://code.google.com/p/tesseract-ocr/ Tesseract is probably the most accurate open-source OCR engine available. Combined with the Leptonica Image Processing Library, it can read a wide variety of image formats and convert them to text in over 60 languages. It was one of the top 3 engines in the 1995 UNLV Accuracy test. Between 1995 and 2006 it had little work done on it, but since then it has been improved extensively by Google. It is released under the Apache License 2.0. ReadMe - Installation and usage information.
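As a rough illustration of converting an image to text with Tesseract, the sketch below calls the `tesseract` command-line program from Python. It assumes the program is installed and on the PATH, and that the language data for the chosen language code is present. The helper names `build_tesseract_command` and `ocr_image` and the file names in the usage note are hypothetical, chosen only for this example.

```python
import shutil
import subprocess
from pathlib import Path


def build_tesseract_command(image_path, output_base, lang="eng"):
    """Return the argument list for: tesseract <image> <outputbase> -l <lang>."""
    return ["tesseract", str(image_path), str(output_base), "-l", lang]


def ocr_image(image_path, output_base, lang="eng"):
    """Run Tesseract on one image and return the recognized text.

    Returns None when the tesseract executable cannot be found.
    """
    if shutil.which("tesseract") is None:
        return None
    subprocess.run(build_tesseract_command(image_path, output_base, lang), check=True)
    # Tesseract writes its plain-text result to <outputbase>.txt.
    return Path(str(output_base) + ".txt").read_text(encoding="utf-8")
```

Calling `ocr_image("scan.png", "scan")` would, under those assumptions, write `scan.txt` next to the working directory and return its contents. A scanned PDF would first need its pages exported as images by some other tool, which this sketch does not cover.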

tesseract-ocr - Google Code

ocropus - Google Code

OCRopus™ is an OCR system written in Python, NumPy, and SciPy that focuses on the use of large-scale machine learning for addressing problems in document analysis. OCRopus 0.6 is being released. It features much simpler installation, fewer dependencies, and improved character recognition rates. http://code.google.com/p/ocropus/