
You just finished writing a rough draft for a term paper and printed a copy of it on your laser printer. Then lightning struck your house and blew out your computer, including the hard drive. You got the computer fixed, but were unable to save any of the data from your hard drive. You dread the thought of re-typing ten pages of text and wonder if there's any other way to get your paper back into your word processing program.

It occurs to you that your new scanner may be useful here. The only problem is that scanners produce bitmap images, which look like the one below. Word processors are not capable of editing bitmap images. So how do you convert the scanned images of your term paper into something that you can edit with a word processor?

Bitmap image of the letter "a" as produced by a scanner.

The OCR software that came with your scanner is designed just for this task. You'll have your term paper back in the computer relatively quickly with OCR software. This document tells you how it all works.

 

The human brain can easily recognize the letter "a" in hundreds of different sizes and fonts. Computers, however, aren't as smart as people. The promise of Optical Character Recognition (OCR) software is to recognize the text in a scanned image, then convert it to a word processor file for further editing.

OCR software does this in three primary ways: Pattern Matching, Feature Extraction and Spell Checking.

Most text is set in the Times, Courier, or Helvetica typefaces at sizes between 10 and 14 points. OCR programs that use the Pattern Matching method store a bitmap, similar to the picture above, for every character in each supported font and size. The program attempts to recognize each scanned letter by comparing its bitmap with these stored bitmaps. An obvious limitation of this method is that it only works for the fonts and sizes that are stored.
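As a rough illustration of Pattern Matching, the sketch below compares a scanned bitmap against stored bitmaps pixel by pixel and picks the closest one. The 5x5 templates, the function names, and the 80% cutoff are assumptions made for this example, not details of any particular OCR program.

```python
# Hypothetical 5x5 templates ("#" = black pixel, "." = white pixel).
# A real OCR program would store far larger bitmaps for every
# supported font and size.
TEMPLATES = {
    "I": ["..#..",
          "..#..",
          "..#..",
          "..#..",
          "..#.."],
    "L": ["#....",
          "#....",
          "#....",
          "#....",
          "#####"],
    "T": ["#####",
          "..#..",
          "..#..",
          "..#..",
          "..#.."],
}


def count_matching_pixels(a, b):
    """Count the pixel positions where two equally sized bitmaps agree."""
    return sum(pa == pb
               for row_a, row_b in zip(a, b)
               for pa, pb in zip(row_a, row_b))


def match_character(scanned, templates=TEMPLATES, min_score=0.8):
    """Return the best-matching character, or "~" if nothing is close enough."""
    total = len(scanned) * len(scanned[0])
    best_char, best_score = "~", 0.0
    for char, template in templates.items():
        score = count_matching_pixels(scanned, template) / total
        if score > best_score:
            best_char, best_score = char, score
    return best_char if best_score >= min_score else "~"


# A scanned "T" with one stray black pixel still matches the stored "T".
noisy_t = ["#####",
           "..#..",
           "..#..",
           ".##..",
           "..#.."]
print(match_character(noisy_t))  # T
```

A bitmap that resembles none of the stored templates falls below the cutoff and comes back as "~", which mirrors the limitation described above: the method can only recognize what has been stored.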

Rather than trying to match a bitmap to the scanned letters, Feature Extraction attempts to recognize letters by reducing each scanned letter to its basic "features", which are then compared to a list of features stored in the program's code.

    For example: the letter "a" is made from a circle, a line on the right side and an arc over the middle. The arc over the middle is optional. So, if a scanned letter had these "features" it would be correctly identified as the letter "a" by the OCR program.
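The example of the letter "a" can be sketched in code. The sketch assumes that an earlier step has already reduced the scanned bitmap to a set of named features; the feature names and the table below are invented for illustration and do not come from any real OCR program.

```python
# Hypothetical feature table. Each letter lists the features it must
# have ("required") and the features it may have ("optional").
LETTER_FEATURES = {
    "a": {"required": {"circle", "right_line"}, "optional": {"top_arc"}},
    "o": {"required": {"circle"}, "optional": set()},
    "l": {"required": {"vertical_line"}, "optional": set()},
}


def identify_letter(found, table=LETTER_FEATURES):
    """Return the letter whose feature list fits the found features, else "~"."""
    for letter, spec in table.items():
        has_required = spec["required"] <= found
        # Features that the letter neither requires nor allows.
        extras = found - spec["required"] - spec["optional"]
        if has_required and not extras:
            return letter
    return "~"


print(identify_letter({"circle", "right_line", "top_arc"}))  # a (with the arc)
print(identify_letter({"circle", "right_line"}))             # a (without the arc)
print(identify_letter({"circle"}))                           # o
```

Because the comparison uses shapes rather than exact pixels, the same table entry covers an "a" in many fonts and sizes.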

No OCR software ever recognizes 100% of the scanned letters. Some OCR programs use the Pattern Matching and/or Feature Extraction methods to recognize as many characters as possible. After initial recognition is performed, unrecognized letters can often be determined by looking at the surrounding letters. For example: if the OCR program was unable to recognize the letter "e" in the word "th~ir", by spell checking "th~ir" the program could determine the missing letter is an "e".
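The "th~ir" example can be sketched as a dictionary lookup in which each "~" stands for one unknown letter. The word list here is a tiny stand-in chosen for the example; a real spell checker would use a full dictionary and would likely rank the candidates as well.

```python
import re

# Tiny stand-in word list for illustration only.
WORDS = {"the", "she", "their", "there", "chair"}


def fill_unrecognized(word, words=WORDS):
    """Return the dictionary words that fit, treating each "~" as one unknown letter."""
    pattern = "".join("[a-z]" if ch == "~" else re.escape(ch) for ch in word)
    return sorted(w for w in words if re.fullmatch(pattern, w))


print(fill_unrecognized("th~ir"))  # ['their']
print(fill_unrecognized("~he"))    # ['she', 'the']
```

The second call shows the limit of this method: when more than one word fits, spell checking alone cannot decide which letter was on the page.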

The best optical character recognition programs, such as the one shipped with Mustek scanners, use more than one method to determine what a character is. By combining several of the above methods, accuracy is increased dramatically.

All Mustek scanners come with OCR software that works with common word processors. The steps below show how the TWAIN module works in conjunction with OCR software to scan text documents into your word processor:

1. A Word Processing application calls a TWAIN Compliant OCR application such as TextBridge or Wordlinx.

2. Settings are adjusted if necessary in the OCR application which then calls the TWAIN Module.

3. The TWAIN module takes control of the scanner and allows the user to set the Scan Mode to Line Art and the Resolution to 300 DPI. *See Note Below

4. When the Scan button is clicked, the scanner begins transmitting the image data back to the TWAIN Module.

5. The TWAIN module transfers the image data back to the OCR program that TWAIN was called from. The OCR program uses one or more of the methods described above to convert the bitmap image of your text into letters.

6. The OCR program sends the recognized letters back to your word processor. If the OCR program could not recognize a letter, it places a ~ symbol where the unreadable letter was. Sometimes OCR programs incorrectly recognize letters. This is almost always due to poor quality original documents. **See Note Below

*Note: You can use 400 DPI if your text is smaller than 10 point. If your text is 10 point or larger, use 300 DPI because OCR software is optimized for 300 DPI scans. Believe it or not, OCR will usually be more accurate scanning at 300 DPI than at 400 DPI unless your text is very small.
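The arithmetic behind this rule of thumb can be worked through in a few lines. The sketch relies on the standard definition of a typographic point as 1/72 of an inch; the function names are made up for the example, and the heights are approximate because the point size measures the type body rather than any single letter.

```python
POINTS_PER_INCH = 72  # one typographic point is 1/72 of an inch


def choose_dpi(point_size):
    """Apply the rule of thumb above: 400 DPI below 10 point, otherwise 300 DPI."""
    return 400 if point_size < 10 else 300


def text_height_pixels(point_size, dpi):
    """Approximate height in pixels of the type body at a given resolution."""
    return point_size / POINTS_PER_INCH * dpi


for size in (8, 10, 12):
    dpi = choose_dpi(size)
    height = text_height_pixels(size, dpi)
    print(f"{size} pt at {dpi} DPI: about {height:.0f} pixels tall")
```

By this estimate, 8 point text scanned at 400 DPI is about 44 pixels tall, close to the roughly 42 pixels of 10 point text at 300 DPI, so the higher resolution brings small text up to about the size the software expects.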

**Note: Documents printed by high quality printing processes are most suitable for OCR. This includes pages from laser printers and printing presses, such as books. Pages printed on inkjet printers, as well as newspaper articles, will give good results, but there will be more mistakes than with laser printed originals. Items printed on dot matrix printers, copy machines and FAX machines do not produce good results with OCR software.

If you are not getting good results with your OCR software, try scanning your document with iPhoto Plus or Picture Publisher. You should set the Scan Mode to Line Art and the Resolution to 300 DPI. After scanning, zoom in on some of your text and see if it looks recognizable to you. If it does not look good and smooth like the letter "a" above, you are either scanning a poor quality document or you need to reset the Brightness and Contrast settings in your TWAIN Module prior to scanning.
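To see why Brightness and Contrast matter, it helps to remember that a Line Art scan stores each pixel as either black or white. The sketch below assumes that this decision is a simple cutoff on a grayscale value and that the brightness setting effectively moves the cutoff; how a particular TWAIN module actually does this is not documented here, and the pixel values are invented.

```python
def to_line_art(gray, threshold=128):
    """Convert grayscale pixels (0 = black, 255 = white) to "#" and "." marks."""
    return ["".join("#" if px < threshold else "." for px in row) for row in gray]


# Hypothetical scan of a thin vertical stroke with soft, gray edges.
stroke = [
    [250, 170, 40, 170, 250],
    [250, 170, 40, 170, 250],
    [250, 170, 40, 170, 250],
]

for threshold in (30, 128, 200):
    print(f"threshold {threshold}:")
    print("\n".join(to_line_art(stroke, threshold)))
```

With a cutoff of 30 the stroke disappears, with 128 it is one pixel wide, and with 200 it thickens to three pixels. Letters that look broken or blotchy when you zoom in are often a sign that the setting is too far in one direction.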



Copyright, 2000, Mustek, Inc. All rights reserved.