Tesseract-ocr - An OCR Engine that was developed at HP Labs between 1985 and 1995... and now at Google.

Chapter 1. Using Google Refine to Clean Messy Data Google Refine (the program formerly known as Freebase Gridworks) is described by its creators as a “power tool for working with messy data” but could very well be advertised as “remedy for eye fatigue, migraines, depression, and other symptoms of prolonged data-cleaning.” Even journalists with little database expertise should be using Refine to organize and analyze data; it doesn't require much more technical skill than clicking through a webpage. For skilled programmers, and journalists well-versed in Access and Excel, Refine can greatly reduce the time spent doing the most tedious part of data-management. Other reasons why you should try Google Refine: It’s free. It works in any browser and uses a point-and-click interface similar to Google Docs. Despite the Google moniker, it works offline. Download and installation instructions for Refine are here. This tutorial covers the same ground as this screencast by Refine’s developer David Huynh (the other two videos are here): Starting a Project

ocropus - Google Code

Tesseract (software) Tesseract is an optical character recognition engine for various operating systems.[2] It is free software, released under the Apache License, Version 2.0,[1][3][4] and development has been sponsored by Google since 2006.[5] Tesseract is considered one of the most accurate open source OCR engines currently available.[4][6] Tesseract up to and including version 2 could only accept TIFF images of simple one-column text as inputs. These early versions did not include layout analysis, so inputting multi-columned text, images, or equations produced garbled output. Since version 3.00 Tesseract has supported output text formatting, hOCR positional information and page layout analysis. The initial versions of Tesseract could only recognize English language text. If Tesseract is used to process right-to-left text such as Arabic or Hebrew, the results are ordered as though it were left-to-right text.[10]
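As a rough illustration of the version 3 features mentioned above, here is a minimal Python sketch that assembles a Tesseract command line. It assumes the `tesseract` executable is installed and on the PATH, and that the usual `tesseract IMAGE OUTPUTBASE -l LANG [hocr]` invocation applies to the installed version. The helper names `build_tesseract_command` and `run_ocr` are hypothetical, made up for this example.

```python
import shutil
import subprocess


def build_tesseract_command(image_path, output_base, lang="eng", hocr=False):
    """Return the argument list for a Tesseract 3.x-style command line.

    Hypothetical helper: it only assembles arguments, it does not run anything.
    """
    command = ["tesseract", image_path, output_base, "-l", lang]
    if hocr:
        # "hocr" is a config name that requests hOCR output
        # (recognized text plus positional information).
        command.append("hocr")
    return command


def run_ocr(image_path, output_base, lang="eng", hocr=False):
    """Run Tesseract if it is installed; raise a clear error otherwise."""
    if shutil.which("tesseract") is None:
        raise RuntimeError("tesseract executable not found on PATH")
    command = build_tesseract_command(image_path, output_base, lang, hocr)
    # check=True raises CalledProcessError if Tesseract exits with an error.
    subprocess.run(command, check=True)
```

Calling `run_ocr("page.tif", "page")` would ask Tesseract to write its recognized text next to the given output base name; the exact output file extension depends on the installed version and the chosen config.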

ScanR - use camera phones for OCR scanR is a free service that lets you transform camera phone pictures into PDF documents. You can take a picture of a document, send it to scanR by email and in less than a minute you'll get a PDF file. If you save the file as text in Acrobat Reader, you'll have the text contained in the document. You can also use it for whiteboard images. scanR requires 1 megapixel cameras for whiteboard scanning and 2 megapixel cameras for document scanning. I tested this online OCR service with a Sony Ericsson K750i and the results were pretty good.

OCR for Linux: Teaching Linux to Read Rod Smith covers the optical character recognition (OCR) options for Linux, their limitations, and how to install and use Tesseract for your OCR needs on Linux. Computers are excellent number-crunching machines, but they’ve traditionally been very poor at dealing with the “fuzzier” everyday world at which humans excel. Ask a computer to add a thousand numbers and it wouldn’t blink an eye if it had one; however, ask a computer to read those thousand numbers from a sheet of paper and you’ll run into problems. The software that attempts to teach computers about the printed alphabet and words is known as Optical Character Recognition (OCR) software. In some cases the OCR software can use a scanner directly, bypassing the need to store a file on disk. OCR Software Capabilities and Limitations OCR can be very tough; for instance, the difference between a lower-case L (l) and a digit one (1) is very small. OCR software often has limitations other than the raw accuracy rate, though.
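The confusion between a lower-case L and a digit one can sometimes be patched up after recognition. The following Python sketch is one possible approach and is not taken from the article: it assumes that any token made up only of digits and digit look-alikes, and containing at least one real digit, was meant to be a number. The function name `fix_numeric_tokens` and the look-alike table are illustrative choices.

```python
import re

# Letters that OCR output commonly confuses with digits (an assumed, partial list).
CONFUSABLE = str.maketrans({"l": "1", "I": "1", "O": "0", "o": "0"})


def fix_numeric_tokens(text):
    """Replace digit look-alikes inside tokens that already contain a digit."""

    def repair(match):
        token = match.group(0)
        # Leave the token alone unless it contains at least one real digit,
        # so words made only of look-alike letters are not turned into numbers.
        if any(ch.isdigit() for ch in token):
            return token.translate(CONFUSABLE)
        return token

    # Match whole tokens made up only of digits and look-alike letters.
    return re.sub(r"\b[0-9lIOo]+\b", repair, text)
```

A heuristic like this will still miss cases such as a number recognized entirely as letters, so it complements proofreading rather than replacing it.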

Chapter 2: Reading Data from Flash Sites Flash applications often disallow the direct copying of data from them. But we can instead use the raw data files sent to the web browser. Adobe Flash can make data difficult to extract. For example, the data displayed on this Recovery.gov Flash map is drawn from this text file, which is downloaded to your browser upon accessing the web page. Inspecting your web browser traffic is a basic technique that you should use when first examining a database-backed website. Background In September 2008, drug company Cephalon pleaded guilty to a misdemeanor charge and settled a civil lawsuit involving allegations of fraudulent marketing of its drugs. Cephalon's report is not downloadable and the site disables the mouse’s right-click function, which typically brings up a pop-up menu with the option to save the webpage or inspect its source code. We asked the company why it chose this format. Software to Get A Series of Tubes...and Files Firebug can tell you what files your browser is receiving.
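Once you have a list of the files your browser received, picking out the likely data files can be automated. The Python sketch below is an illustration under stated assumptions: the URLs are copied by hand from a traffic inspector such as Firebug, the extension list is a guess at what data files tend to look like, and `find_data_files` is a hypothetical helper name. The example.com addresses are placeholders, not real endpoints.

```python
import posixpath
from urllib.parse import urlparse

# Extensions that often indicate raw data rather than page assets (an assumption).
DATA_EXTENSIONS = {".txt", ".xml", ".json", ".csv"}


def find_data_files(urls):
    """Return the URLs whose path ends in a data-like file extension."""
    hits = []
    for url in urls:
        # Look only at the path, so query strings such as "?v=2" are ignored.
        path = urlparse(url).path
        extension = posixpath.splitext(path)[1].lower()
        if extension in DATA_EXTENSIONS:
            hits.append(url)
    return hits


captured = [
    "http://example.com/map.swf",
    "http://example.com/data/states.txt?v=2",
    "http://example.com/style.css",
    "http://example.com/feed.XML",
]
candidates = find_data_files(captured)
```

Each candidate can then be opened directly in the browser or downloaded to see whether it holds the data the Flash application displays.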

Convert Scanned PDF Documents to Text with Google OCR If you have a bunch of scanned PDF files sitting on your hard drive and no OCR software to convert them into text, here’s what you can do to recognize text from PDF files with Google OCR. There are two types of PDF documents: those created by sending Office files, images, etc. to an Acrobat-like PDF printer, and those created by scanning physical paper such as pages of a book, legal documents, etc. Google could always index PDF documents created by conversion, but now they also recognize text from PDFs that are generated by scanning paper documents using OCR software. This is a scanned document and this is the HTML text view of that same document converted by Google. Since scanned PDFs are nothing but images, don’t be surprised if Google adds a "search by text" function to their Image Search engine similar to OneNote or EverNote. Now if you have a bunch of scanned PDF files on your hard drive and no OCR software, here’s what you can do to convert them into recognizable text.

GOCR
