Last year I needed to help out with digitizing an old book, so that its current content could be updated, expanded and ultimately reprinted. The current copy was written directly on a typewriter and ran to almost 250 pages of very dense text. Transcribing that much content manually was not an enticing prospect, so I started investigating the options for automation. I quickly found the open source Tesseract OCR software, which runs on Linux, Windows and OS X. It dates back to the mid-1980s and was open sourced by HP in 2005. Tesseract focuses on the core OCR tasks and leaves image acquisition, and likewise post-recognition processing, to other tools.

Reading about how it works, it becomes clear that the biggest factor in accuracy is the quality of the input images. Approximately speaking, it converts the input image to monochrome by applying a threshold algorithm. For this to work effectively the image has to be evenly illuminated; any kind of gradient across the background will confound the monochrome conversion, leading to large blocks of text getting lost.

A flatbed scanner is not a satisfactory way of capturing the pages of the book, because it is impossible to get the pages flat without damaging the spine. Instead, a digital SLR camera is the preferred tool, with flashgun(s) to provide illumination. Even when using a camera, the spine of the book is still a problem: if you simply open the book on a flat surface, the pages will curve near the spine, leading to uneven illumination and distortion of the text, which will ruin OCR accuracy. The solution is to construct some kind of book scanning rig that supports the book so that it opens to an angle somewhere in the region of 110-140 degrees. This might sound like overkill for a single book, but it is well worth the effort.

The simple book scanning rig, constructed from MDF, large enough to hold a book up to approximately 20x30cm in size.
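To make the point about even illumination concrete, here is a minimal sketch of what a fixed global threshold does to a page with a brightness gradient. It is an illustration only, not Tesseract's actual implementation: the function names, the synthetic one-row "page", the cutoff of 128 and the pixel values are all assumptions chosen for the example.

```python
# Hypothetical illustration: one row of grayscale pixels (0 = black, 255 = white).
# Every fourth pixel is "ink"; the rest is paper background.

def synth_row(width=20, bright=220, dark=220, ink_drop=120, ink_every=4):
    """Build a row whose background fades linearly from `bright` to `dark`."""
    row = []
    for x in range(width):
        background = bright + (dark - bright) * x // (width - 1)
        is_ink = x % ink_every == 0
        row.append(max(0, background - ink_drop) if is_ink else background)
    return row

def threshold(row, cutoff=128):
    """Convert to monochrome with a single global cutoff: 1 = white, 0 = black."""
    return [0 if pixel < cutoff else 1 for pixel in row]

def lost_background(bits, ink_every=4):
    """Count paper pixels that were wrongly turned black."""
    return sum(1 for x, b in enumerate(bits) if x % ink_every != 0 and b == 0)

def render(bits):
    """Show black pixels as '#' and white pixels as '.'."""
    return "".join("#" if b == 0 else "." for b in bits)

if __name__ == "__main__":
    even = threshold(synth_row(bright=220, dark=220))
    shaded = threshold(synth_row(bright=220, dark=80))
    print("even illumination:  ", render(even))
    print("gradient background:", render(shaded))
    print("paper pixels lost under gradient:", lost_background(shaded))
```

With even lighting the ink and the paper separate cleanly. With the gradient, the darker end of the row falls below the cutoff, so the paper merges with the ink into a solid black block, which is the "large blocks of text getting lost" effect described above.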