No more OCR excuses: Here's how to do it RIGHT
I am, to put it in simple words, sick and tired of hearing the same lame excuses for the poor quality of old and backlist books that are scanned and converted via Optical Character Recognition (OCR) software into ebooks. All of us ebook readers are painfully familiar with the ridiculous text errors, the "a"s turned to "o"s, the "p"s turned to "q"s, the lost punctuation, the nonsense words and phrases, the missing lines, paragraphs and entire sections, etc., that turn up in these ebooks. And we are also painfully familiar with the publishers' excuses for this state, which usually boil down to: "We work hard... it's the hardware/software's fault!"
BULL. It's how you're using the hardware/software; in other words, WRONG.
So, I’m going to tell you how to do it... the way we did it over a decade ago, when we got less than 0.1% errors in our work. Pay attention.
Over a decade ago, I worked in an in-house print department that produced thousands of short documents a day, using 3 high-speed printers, 2 of them networked DocuTech digital printers. Those machines were capable of printing anything (in black and white, anyway) in high quality. All they needed was high quality going in. My initial job, upon arriving there, was showing them how to get that high quality input that would give them high quality output. I had done the same at my previous job. So I knew whereof I spoke when I was hired.
Occasionally, we were asked to produce a digital version of a book our offices had created. After going through the process a few times, I hit upon the best method for clean and accurate digital conversion. My attempts using this method were highly successful, and not that difficult at all... any small to large organization can do this.
The process starts with the original book... and this is where the most important first steps must be taken to ensure high quality. Many books aren't in the best shape for scanning, due to age, flimsy paper, discoloration, stains, etc. Also, text is often rendered rather small, especially in paperbacks. This is notoriously poor source material, and it must be improved before it is used.
To deal with this, the first step is to CREATE NEW AND BETTER PAGES. Use a high-quality photocopying machine or scanner that has both adjustable brightness and contrast settings, and an enlarging feature. Before you start, take sample images of a page, adjusting brightness and contrast to basically darken text to as close to 100% black as possible, and erase any image artifacts from browning or stained pages; you want as clean an image as possible, solid black text against pure white backgrounds.
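For anyone doing this step in software rather than on a copier's control panel, here is a minimal sketch of the idea: push every pixel to either solid black or pure white. The `binarize` helper, the sample pixel values and the threshold of 128 are all assumptions made up for illustration; a real page would need the threshold tuned on sample images, exactly as described above.

```python
# Hypothetical helper: force a grayscale page to solid black text on pure white.
# Pixel values run from 0 (black) to 255 (white).
# The default threshold of 128 is an assumption; tune it on sample pages.

def binarize(gray_rows, threshold=128):
    """Return a copy where pixels darker than threshold become 0, all others 255."""
    return [[0 if px < threshold else 255 for px in row] for row in gray_rows]


# A tiny made-up patch: faded text (~90) on browned paper (~200) with a stain (~150).
sample = [
    [200, 90, 200, 150],
    [205, 85, 198, 155],
]

clean = binarize(sample, threshold=128)
print(clean)  # the text column goes to 0, paper and stain go to 255
```

Set the threshold too low and faded text disappears with the stains; set it too high and the stains turn black. That is why the sample passes matter.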
Once you have your clean high-contrast setting, ENLARGE THE IMAGE to letter-size. This simple step, which few OCR processes use, improves the recognition of characters immensely during the scanning process, because every letter is captured with many more pixels.
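To work out what enlargement percentage to dial in, the arithmetic can be sketched like this. The `enlargement_percent` function, the quarter-inch margin and the 4.25 by 7 inch paperback page are assumptions for illustration; measure your own book.

```python
# Hypothetical calculation of the largest copier enlargement that still fits
# a book page onto a letter-size sheet (8.5 x 11 inches).

LETTER_W_IN = 8.5
LETTER_H_IN = 11.0


def enlargement_percent(page_w_in, page_h_in, margin_in=0.25):
    """Largest whole-number percentage that fits the page inside the margins."""
    usable_w = LETTER_W_IN - 2 * margin_in
    usable_h = LETTER_H_IN - 2 * margin_in
    # The tighter of the two directions limits the enlargement.
    scale = min(usable_w / page_w_in, usable_h / page_h_in)
    return int(scale * 100)  # round down so nothing gets clipped


# Assumed paperback page size of 4.25 x 7 inches.
print(enlargement_percent(4.25, 7.0))
```

In this example the page height is the limiting side, so the page width ends up with room to spare.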
Finally, if you are using a scanner, this is an important addition to the process: Before doing your scan, MAKE SURE THE SCANNER IS SET TO BLACK-AND-WHITE AND AT LEAST 300DPI RESOLUTION. Many photocopiers will allow you to make the same settings. This results in a larger file, which can take a longer time to process, but is important to get the best image of each character possible.
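A rough sketch of why resolution matters, and what it costs in file size, follows. The function names and the 8-point type size are assumptions for illustration, and the byte count ignores compression and any row padding a real image format would add.

```python
# Hypothetical back-of-the-envelope numbers for a black-and-white scan.

def scan_size(width_in, height_in, dpi=300, bits_per_pixel=1):
    """Pixel dimensions and uncompressed size in bytes of a scanned sheet."""
    w_px = round(width_in * dpi)
    h_px = round(height_in * dpi)
    size_bytes = w_px * h_px * bits_per_pixel // 8
    return w_px, h_px, size_bytes


def type_height_px(point_size, dpi=300, enlargement=1.0):
    """Approximate pixel height of type: one point is 1/72 of an inch."""
    return point_size / 72 * dpi * enlargement


# A letter-size sheet at 300 DPI versus 150 DPI.
print(scan_size(8.5, 11.0, dpi=300))
print(scan_size(8.5, 11.0, dpi=150))

# Assumed 8-point paperback type, scanned as-is and after a 150% enlargement.
print(type_height_px(8))
print(type_height_px(8, enlargement=1.5))
```

Halving the resolution cuts the file to about a quarter of the size, which is the time saving that tempts people, but it also halves the number of pixels available across each character.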
With these settings saved, SCAN OR PHOTOCOPY YOUR PAGES. For best results, cut the pages free of the spine so they can be perfectly flat when scanning/photocopying. If your scanner/photocopier has a feeder that will automatically feed and scan both sides of a page, by all means, use it and save yourself the grief (and the time).
If you used a photocopier, you should now have a letter-sized, one-sided stack of papers that represent your book. This is ideal for running through a sheetfed scanner with pretty much any OCR software. Many high-speed sheetfed scanners have limited adjustment controls... that's why it is important to provide the highest-quality input sheets. If you do have contrast and resolution controls, make sure they are set to black-and-white and at least 300DPI, just like your earlier photocopied images.
Most scanners these days come with OCR software, or recommend OCR software to use with their hardware. Start with these applications, but don't be afraid to try other apps with more features if needed. When you do your scan, the software should automatically start the OCR process. WAIT UNTIL THE OCR FILE IS DONE, AND SAVE IT.
MAKE A COPY OF THE FILE FOR EDITING. If you have good OCR software, it will allow you to do sophisticated find-and-replace tasks; this is great to have if you discover a consistent OCR artifact, such as every "h." coming out as "la" or every "&" as "$". Most of the artifacts you're likely to find will be the result of dressy typography that uses non-standard or oddly-shaped characters. These are things that even the best OCR applications can't always interpret correctly. Your larger, high-res images should give much better recognition, and far fewer errors, everywhere else.
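If your software has no such feature and you are comfortable with a little scripting, a document-wide fix can be sketched as below. The `apply_fix` helper and the sample sentence are made up for illustration; the replacement count it reports is the useful part, since it tells you how far each edit reached.

```python
import re

# Hypothetical helper for one document-wide OCR fix at a time.

def apply_fix(text, wrong, right, whole_word=False):
    """Replace every occurrence of `wrong` with `right`.

    Returns (new_text, number_of_replacements).
    """
    pattern = re.escape(wrong)  # treat the artifact as literal text
    if whole_word:
        pattern = r"\b" + pattern + r"\b"
    # A function replacement keeps `right` literal as well.
    return re.subn(pattern, lambda match: right, text)


# Made-up OCR output with two consistent artifacts.
text = "Smith $ Sons sold tbe book to tbe library."

text, n_amp = apply_fix(text, "$", "&")
text, n_the = apply_fix(text, "tbe", "the", whole_word=True)

print(text)
print(n_amp, n_the)
```

The `whole_word` option is there so that a fix aimed at one short word does not reach into the middle of longer, correct words.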
If your OCR software can't do the find-and-replace tasks, open the file in a word processing app like MS Word, and use its find-and-replace functions there. After each document-wide edit, check that you didn't mess something up (such as correct words that your find-and-replace made incorrect), and if everything's good, save the file. Then do the next edit, check it, and save it. This way, if you make an edit that doesn't work out, and the damage is too widespread to fix easily, you can revert to the last saved file and try again.
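The edit, check, save, revert routine can be sketched in code as well. This is only an illustration under assumptions: the `apply_checked` function and the list of protected words are hypothetical, the "saved file" is just a variable in memory, and the automatic check (did any known-good word disappear?) is far cruder than looking over the result yourself.

```python
# Hypothetical sketch of "edit, check, save, and revert if it went wrong".

def apply_checked(text, edits, protected_words):
    """Apply (wrong, right) edits one at a time, keeping only the safe ones."""
    saved = text  # stands in for the last saved file
    log = []
    for wrong, right in edits:
        candidate = saved.replace(wrong, right)
        # The check: an edit must not destroy any word we know is correct.
        broken = [w for w in protected_words
                  if candidate.count(w) < saved.count(w)]
        if broken:
            log.append((wrong, right, "reverted"))  # keep the last saved version
        else:
            saved = candidate                        # "save the file"
            log.append((wrong, right, "saved"))
    return saved, log


# Made-up text: the second edit would wreck the correct word "clay".
page = "Tbe clay model sat on tbe table."
result, log = apply_checked(page, [("tbe", "the"), ("cl", "d")], ["clay"])

print(result)
print(log)
```

Note that the capitalized "Tbe" survives, because the edit was case-sensitive; that is the kind of leftover the proofing pass is for.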
Done with that? Good. Now, READ IT. I mean, REALLY, REALLY READ IT. Look for artifacts that were missed by your initial find-and-replace process, especially words that have been misread as similar real words, such as "words" read as "wards," "it" read as "if," etc. A spell checker won't flag these, but a thorough proofing pass should catch them, and thanks to the process outlined above, there shouldn't be many.
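Nothing replaces reading the text, but a script can point the proofreader at the likeliest trouble spots. The sketch below is an assumption-laden illustration: the `CONFUSABLE` table holds only the two pairs mentioned above, and `flag_confusables` merely lists where those words occur; deciding whether each one is right still takes a human.

```python
import re

# Hypothetical table of words OCR tends to swap for each other.
CONFUSABLE = {"it": "if", "if": "it", "words": "wards", "wards": "words"}


def flag_confusables(text, pairs=CONFUSABLE):
    """List (line_number, word_found, possible_intended_word) for review."""
    hits = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        for match in re.finditer(r"[A-Za-z']+", line):
            word = match.group()
            if word.lower() in pairs:
                hits.append((lineno, word, pairs[word.lower()]))
    return hits


# Made-up OCR output containing two real-word errors.
ocr_text = "He said if was late.\nThe wards ran together."

for hit in flag_confusables(ocr_text):
    print(hit)
```

It will flag plenty of correct uses too, which is fine: the point is to slow the reader down at those spots, not to make the corrections automatically.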
I have used this process to create ebooks, and have seen fewer than one word or character error per 20 pages of OCR results.
I am sure that most of the organizations doing scan-and-OCR of older books—most of which are probably contractors working to generate as much text per hour as they can—are not enlarging copy or scanning at 300DPI resolution. Why? To save time; low-res versions of smaller pages process faster. This is why the overall quality of files presented to publishers is so dismal. Unfortunately, the publishers are still responsible for proofing the text and catching these errors, and I have serious doubts that they are doing that job at all. If scan-and-OCR workers use my production steps, the publishers' proofers will have even less of an excuse for bad copy.
Feel free to forward this entry to any publishers or OCR companies you know of; maybe we'll see much better quality ebooks in the future, especially if some people take heed of these steps.