
Newspapers


How Good Can It Get? Analysing and Improving OCR Accuracy in Lar.

Abstract: This article details the work undertaken by the National Library of Australia Newspaper Digitisation Program on identifying and testing solutions to improve OCR accuracy in large-scale newspaper digitisation programs. In 2007 and 2008 several different solutions were identified, applied and tested on digitised material now available in the Australian Newspapers Digitisation Program beta service. This article gives a state-of-the-art overview of how OCR software works on newspapers, factors that affect OCR accuracy, methods of measuring accuracy, methods of improving accuracy, and testing methods and results for specific solutions that were considered viable for large-scale text digitisation projects.

Optical Character Recognition (OCR) software was first used by libraries for historic newspaper digitisation projects in the early 1990s. Some OCR software has the capacity for 'training'. Measuring accuracy rates

The Sky is Falling! - O'Reilly Radar

It’s been a busy week for the “death of newspapers” camp. We’ve had Michael Hirschorn’s Atlantic Monthly piece forecasting the demise of The New York Times by May, Jack Shafer weighing in at Slate, James Surowiecki in The New Yorker, Clay Shirky raising some very interesting points, and today Fred Wilson joining the chorus with My Focus Group of One.

A simple Google search for terms like “death of newspapers” or “end of print” will yield millions of results. Some media websites and blogs have “death watch” sections, ready to ring the bell and announce a new heavyweight champion. Yet we’ve been doing this since the early 1800s: dishing out condemnations of past technologies and rushing to announce the incarnation of the next big thing. Let’s travel back to a brisk morning on March 22nd, 1876. Pretty dramatic. Accusations of people “never leaving their house again” or books and the written word “ceasing to exist” didn’t start with the telephone or the phonograph.