
3 September 2011

Guest Blog: No more OCR excuses: Here's how to do it right by Steven Lyle Jordan

Steven Lyle Jordan is a Maryland-based Sci-Fi author who has published his series The Kestral Voyages in e-book format, as well as a number of stand-alone novels. For more information visit his site at rightbrane.com. His guest-blog is on how to get clean, accurate e-books out of the scan-and-OCR process.


No more OCR excuses: Here's how to do it RIGHT



I am, to put it in simple words, sick and tired of hearing the same lame excuses for the quality of old and backlist books that are scanned and converted via Optical Character Recognition (OCR) software into ebooks.  All of us ebook readers are painfully familiar with the ridiculous text errors, the "a"s turned to "o"s, the "p"s turned to "q"s, the lost punctuation, the nonsense words and phrases, the missing lines, paragraphs and entire sections, etc, etc, that turn up in these ebooks.  And we are also painfully familiar with the publishers' excuses for this state, which usually boil down to: "We work hard... it's the hardware/software's fault!"


BULL.  It's how you're using the hardware/software; in other words, WRONG.


So, I’m going to tell you how to do it… the way we did it over a decade ago, when we got an error rate of less than 0.1% in our work. Pay attention.


Over a decade ago, I worked in an in-house print department that produced thousands of short documents a day, using 3 high-speed printers, 2 of them networked DocuTech digital printers.  Those machines were capable of printing anything (in black and white, anyway) in high quality.  All they needed was high quality going in.  My initial job, upon arriving there, was showing them how to get that high quality input that would give them high quality output.  I had done the same at my previous job.  So I knew whereof I spoke when I was hired.


Occasionally, we were asked to produce a digital version of a book our offices had created.  After going through the process a few times, I hit upon the best method for clean and accurate digital conversion.  My attempts using this method were highly successful, and not that difficult at all... any small to large organization can do this.


The process starts with the original book... and this is where the most important first steps must be taken to ensure high quality.  Many books aren't in the best shape for scanning, due to age, flimsiness of the paper, coloring, stains, etc.  Also, text is often rendered rather small, especially in paperbacks.  This is notoriously poor source material, and it must be improved before it is used.


To deal with this, the first step is to CREATE NEW AND BETTER PAGES.  Use a high-quality photocopying machine or scanner that has both adjustable brightness and contrast settings, and an enlarging feature.  Before you start, take sample images of a page, adjusting brightness and contrast to basically darken text to as close to 100% black as possible, and erase any image artifacts from browning or stained pages; you want as clean an image as possible, solid black text against pure white backgrounds.
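
If your pages are already image files rather than paper, the same "solid black on pure white" idea can be approximated in software. The following is only a minimal sketch, not what we did on the copier: the `binarize` function, the threshold of 128 and the tiny sample patch are all assumptions made up for illustration, and a real page would need the threshold tuned by eye, just like the brightness and contrast dials.

```python
def binarize(gray_rows, threshold=128):
    """Map each grayscale pixel (0 = black, 255 = white) to pure black or pure white."""
    return [[0 if px < threshold else 255 for px in row] for row in gray_rows]


# Hypothetical 3x4 patch: faded text (values around 60-90)
# on a browned page (values around 170-200).
patch = [
    [190, 70, 185, 200],
    [175, 60, 90, 180],
    [200, 85, 170, 195],
]

clean = binarize(patch)
for row in clean:
    print(row)
```

Everything darker than the threshold becomes 0 (black) and everything lighter becomes 255 (white), so the brown background disappears and the faded text turns solid.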


Once you have your clean high-contrast setting, ENLARGE THE IMAGE to letter-size.  This simple step, which few OCR processes use, improves the recognition of characters immensely during the scanning process, because every character is captured with more pixels.
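
To get a rough feel for why enlarging helps, you can work out how many pixels tall a line of type ends up once it is scanned. This is back-of-the-envelope arithmetic, not a measurement: the 9-point type size and the 150% enlargement are assumed figures for a typical paperback, and `glyph_height_px` is a helper named here purely for illustration.

```python
POINTS_PER_INCH = 72  # standard typographic point


def glyph_height_px(font_pt, dpi, enlargement=1.0):
    """Approximate pixel height of a nominal type size after enlarging and scanning."""
    return font_pt / POINTS_PER_INCH * enlargement * dpi


small = glyph_height_px(9, 300)        # assumed 9 pt text, scanned as-is
large = glyph_height_px(9, 300, 1.5)   # same text enlarged to 150% first

print(f"as-is:    {small} px tall")
print(f"enlarged: {large} px tall")
print(f"pixels per character grow by a factor of {(large / small) ** 2}")
```

Under these assumptions the type goes from 37.5 to 56.25 pixels tall, which gives the OCR software a little over twice as many pixels to work with for each character.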


Finally, if you are using a scanner, this is an important addition to the process: Before doing your scan, MAKE SURE THE SCANNER IS SET TO BLACK-AND-WHITE AND AT LEAST 300DPI RESOLUTION.  Many photocopiers will allow you to make the same settings.  This results in a larger file, which can take a longer time to process, but is important to get the best image of each character possible.
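
The size penalty is easy to estimate. Here is a small sketch that counts the pixels in a letter-sized page at two resolutions; the 150 DPI comparison figure is my assumption of a typical "fast" setting, and `scan_pixels` is a hypothetical helper. Real file sizes depend on the format and compression your scanner uses.

```python
def scan_pixels(width_in, height_in, dpi):
    """Number of pixels in a page of the given size scanned at the given resolution."""
    return round(width_in * dpi) * round(height_in * dpi)


letter_300 = scan_pixels(8.5, 11, 300)
letter_150 = scan_pixels(8.5, 11, 150)

print(letter_300)               # pixels per page at 300 DPI
print(letter_150)               # pixels per page at 150 DPI
print(letter_300 / letter_150)  # how much more data the 300 DPI scan carries
# In pure black-and-white, each pixel needs only 1 bit before compression.
print(letter_300 / 8 / 1_000_000, "MB uncompressed at 1 bit per pixel")
```

Doubling the resolution quadruples the pixel count, which is where the extra processing time comes from.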


With these settings saved, SCAN OR PHOTOCOPY YOUR PAGES.  For best results, cut the pages free of the spine so they can be perfectly flat when scanning/photocopying.  If your scanner/photocopier has a feeder that will automatically feed and scan both sides of a page, by all means, use it and save yourself the grief (and the time).


If you used a photocopier, you should now have a letter-sized, one-sided stack of papers that represent your book.  This is ideal for running through a sheetfed scanner with pretty much any OCR software.  Many high-speed sheetfed scanners have limited adjustment controls... that's why it is important to provide the highest-quality input sheets.  If you do have contrast and resolution controls, make sure they are set to black-and-white and at least 300DPI, just like your earlier photocopied images.


Most scanners these days come with OCR software, or recommend OCR software to use with their hardware.  Start with these applications, but don't be afraid to try other apps with more features if needed.  When you do your scan, the software should automatically start the OCR process.  WAIT UNTIL THE OCR FILE IS DONE, AND SAVE IT.


MAKE A COPY OF THE FILE FOR EDITING.  If you have good OCR software, it will allow you to do sophisticated find-and-replace tasks; this is great to have if you discover a recurring OCR artifact, such as the substitution of every "h." with "la" or "&" with "$".  Most of the artifacts you're likely to find will be a result of dressy typography that uses non-standard or oddly-shaped characters.  These are things that even the best OCR applications can't always interpret correctly.  Your larger and high-res images should generate much better recognition, and far fewer errors, for almost everything else.
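
If you are comfortable with a little scripting, the same kind of find-and-replace pass can be sketched in a few lines. Treat this as an illustration only: the `FIXES` table, the sample sentence and the `apply_fixes` helper are invented for the example, and your own table would be built from whatever artifacts actually turn up in your file.

```python
import re

# Hypothetical artifact table: (pattern as it appears in the OCR output, correction).
FIXES = [
    (r"(?<=\s)\$(?=\s)", "&"),  # a lone "$" between spaces was really an ampersand
    (r"\bwhicla\b", "which"),   # "h" misread as "la" in one recurring word
]


def apply_fixes(text, fixes=FIXES):
    """Apply each fix to the whole text and report how many times it fired."""
    counts = []
    for pattern, replacement in fixes:
        text, n = re.subn(pattern, replacement, text)
        counts.append((pattern, n))
    return text, counts


sample = "Smith $ Sons paid $40 for the press, whicla was a bargain."
fixed, counts = apply_fixes(sample)
print(fixed)
for pattern, n in counts:
    print(f"{n} replacement(s) for {pattern}")
```

Note that the first pattern only touches a "$" standing alone between spaces, so the genuine "$40" is left untouched. The replacement counts are worth a glance too: a fix that fires far more often than you expected is probably hitting correct text.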


If your OCR software can't do the find-and-replace tasks, open the file in a word processing app like MS Word, and use its find-and-replace functions there.  As you make these document-wide edits, check to make sure you didn't mess something up (such as correct words that your find-and-replace made incorrect), and if everything's good, save the file.  Then do the next one, check it, and save it.  This way, if you do an edit that doesn't work out, and it's too wide-spread to easily fix, you can revert to the last saved file to try it again.
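
The edit, check, save routine can be pictured as a short history of saved versions. The sketch below keeps everything in memory instead of in files, and the `EditHistory` class, the sample sentence and the `looks_ok` check are all hypothetical stand-ins; the check in particular is a placeholder for you actually looking over the result.

```python
class EditHistory:
    """Keep every version that passed its check, so a bad edit is never saved."""

    def __init__(self, text):
        self.saved = [text]  # saved[0] is the untouched copy of the OCR output

    @property
    def current(self):
        return self.saved[-1]

    def apply(self, old, new, looks_ok):
        """Try one document-wide replace; save it only if the check passes."""
        candidate = self.current.replace(old, new)
        if looks_ok(candidate):
            self.saved.append(candidate)
            return True
        return False  # nothing saved: we are still on the last good version


history = EditHistory("tbe cat sat by tbe tub")


def looks_ok(text):
    """Stand-in for a manual check: a word that was correct must stay correct."""
    return " by " in text


print(history.apply("b", "h", looks_ok))      # too broad: it would also wreck "by"
print(history.apply("tbe", "the", looks_ok))  # narrow enough to be safe
print(history.current)
```

The too-broad edit is thrown away without ever being saved, and the narrower one becomes the new working version.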


Done with that?  Good.  Now, READ IT.  I mean, REALLY, REALLY READ IT.  Look for type artifacts that were missed by your initial find-and-replace process, and for words that have been misrecognized as similar words, such as "words" as "wards," "it" as "if," etc.  A thorough proofing pass should catch these typos, and thanks to the process outlined above, there shouldn't be many.
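
Nothing replaces reading, but a script can at least point your eyes at the likeliest trouble spots. This is a rough sketch under my own assumptions: the `SUSPECTS` list is seeded with just the pairs mentioned above, `flag_suspects` is a hypothetical helper, and every hit still needs a human to decide whether the word is right in context.

```python
import re

# Hypothetical list: a word OCR may have produced -> the word it may have been.
SUSPECTS = {"wards": "words", "if": "it", "it": "if"}


def flag_suspects(text, suspects=SUSPECTS):
    """Return (line number, word as found, possible intended word) for review."""
    hits = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        for match in re.finditer(r"[A-Za-z']+", line):
            word = match.group().lower()
            if word in suspects:
                hits.append((lineno, match.group(), suspects[word]))
    return hits


page = "He chose his wards carefully.\nIf was a cold night."
hits = flag_suspects(page)
for lineno, found, maybe in hits:
    print(f"line {lineno}: '{found}' (did the book say '{maybe}'?)")
```

A list like this will flag plenty of perfectly good words, which is fine; its only job is to make sure you look twice.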


I have used this process to create ebooks, and have experienced fewer than 1 word or type error per 20 pages of OCR results.
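
To put that figure next to a percentage, you need a words-per-page count. The 250 words per page used below is purely an assumed round number for a typical book page, and `error_rate_percent` is a helper named for this example only.

```python
def error_rate_percent(errors, pages, words_per_page=250):
    """Word-level error rate as a percentage, for an assumed page length."""
    return 100 * errors / (pages * words_per_page)


rate = error_rate_percent(errors=1, pages=20)
print(f"{rate:.2f}% of words in error")
print("under 0.1%?", rate < 0.1)
```

Under that assumption, one error in 20 pages works out to one error in 5,000 words, comfortably below a 0.1% error rate.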


I am sure that most of the organizations doing scan-and-OCR of older books—most of which are probably contractors working to generate as much text per hour as they can—are not enlarging copy or scanning at 300DPI resolution.  Why?  To save time; low-res versions of smaller pages process faster.  This is why the overall quality of files presented to publishers is so dismal.  Unfortunately, the publishers are still responsible for proofing the text and catching these errors, and I have serious doubts that they are doing that job at all.  If scan-and-OCR workers use my production steps, the publishers' proofers will have even less of an excuse for bad copy.


Feel free to forward this entry to any publishers or OCR companies you know of; maybe we'll see much better quality ebooks in the future, especially if some people take heed of these steps.




29 June 2011

Guest Blog: Steven Lyle Jordan on e-book proofing and publishing

Steven Lyle Jordan is a Maryland-based Sci-Fi author who has published his series The Kestral Voyages in e-book format, as well as a number of stand-alone novels. For more information visit his site at rightbrane.com. His guest-blog is on the advantages of e-publishing in terms of making revisions.



Take 2: Revising an e-book in no time flat


There’s a unique feeling an ebook author gets when they’ve released their book into the markets, ready to be bought… they’ve alerted the media, and told their customers and friends to spread the word… and they sit back, take a breath, and wait to see what develops. Naturally, what they want to see is right-off healthy sales, and maybe a few emails in their inbox telling them how great their book was.


So it was a let-down, to say the least, when I released a book a year ago, and quickly received emails… counting down my editing and grammatical errors. I felt sick inside. Admittedly, I generally do all of my own editing and proofing, because I can’t afford to hire an editor, and because I regularly received high marks for writing and reports in school.  But for whatever reason, I hadn’t done much of a job on this book.  Over half a dozen outright mistakes… things I should have caught… were right there in front of me. I couldn’t let them slide; I had to fix them.


Fortunately, that decision allowed me to demonstrate one of the greatest advantages of ebooks, not to mention the web and social media: being able to make amends, fix the ebook, and get the revised copies to my customers in days, as opposed to months or years (or sometimes never) with printed books.


To begin with, the edits themselves were simple enough.  I made the changes (kicking myself along the way), and as I had just a few days previously, I made new versions of the ebook.  For me, this meant 5 formats for my website, plus Kindle, Barnes & Noble and Smashwords versions.  Sure, I grumbled about it a bit… but once I started, I had new versions online, on 4 different sites, in less than 2 hours.  Any new customers would get the new version, updated mere days after the first release.


Then came the real kicker: Through my site, I store the names and emails of all of my customers, so I can easily keep track of who bought what, and so I can respond personally to customer inquiries.  This came in handy, because it allowed me to send an email to each of my customers who had bought that book.  I explained what happened, told them that they only had to contact me, and I’d personally send them a new copy of the ebook.  It worked like a charm: The customers emailed me back, and I answered each request with a new file attached, in their format of choice.


Even the process of revising the copyright on an ebook is easier now, thanks to the U.S. Copyright Office’s acceptance of digital files.  Through their website, you can fill out an online form, upload the document, make an online payment with a credit card or established account, and your copyright is essentially done in minutes.


All told, the quick revision and resending of the novel took less than 2 weeks.  And as a bonus, I received new emails, complimenting me on my quick and forthright response to the errors, and the efforts I made to satisfy my customers.  I had accomplished in days what print books sometimes never manage: To fix a simple error.


Print book users love to talk about the advantages of print, about the “look and feel” of books, the joy of a physical copy on their shelves, etc, etc.  But print books have errors, too… or did you think publishers do “revisions” just for the heck of it?  Unfortunately, when a book is re-issued with revisions, you can’t take your old book back to the store and ask to have it replaced with the new revision.


And that assumes a revision is made.  A lot of books get only one printing, especially if the market is considered to be limited, or if it doesn’t do as well in the stores as the publishers hoped.  You may have a book with the most glaring error, and have to accept the fact that it will never be revised.  Ever.


The new world of ebooks, by contrast, means that revisions can be made, and released to the market, in days… even hours.  And there are mechanisms in place to manually or automatically send revised copies to consumers.  This is only one of ebooks’ gifts to the 21st century.  With things like this, and many others, going for ebooks… who needs print?


That little episode helped to remind me of the need for a good editing/proofing pass on my backlist books, before they are re-released.  I’ve been working on one book, Evoguía, for the past few months (off and on), and I was surprised at how many little errors crept by my original proofing pass.  Chances are, most of my earlier books are roughly the same, with small errors and mis-types, and the occasional odd phrasing that should have been yanked and fixed before the book was released. (If anyone who’s familiar with the original manuscript takes a “moment,” they’ll know what I mean.)


The only excuse I can offer for the earlier releases is that my eagerness to release the books resulted in rushed proofing.  But now that I am a relative old hand at this, and don’t feel the need to rush (especially when releasing backlist titles), I can afford to take my time and do it right.


There’s another reason to take that time, as well: Major publishers are scrambling to convert their backlist titles to ebooks; but in many cases, they are doing an incredibly sloppy job at it, doing fast scan-and-OCR jobs, and not putting any effort into proofing their text.  As a result, major publishers are releasing ebook versions of their backlist that can only be called “hack jobs” by any consumer unlucky enough to purchase one.  This is my competition… and anything I can do to make my work look better than a major publisher’s work will benefit my sales.


Hopefully readers will notice the quality (or, at least, note the superior quality compared to major publishers’ works) and will help spread the word about my work through reviews and recommendations.  So it’s clearly in my best interests to do the better proofing job, elevate my reputation as a quality artist, and be a positive element in the evolution of publishing from Big-Publisher domination to a more integrated field of publishers and independent artists.


