Google Docs OCR option | Good for text-only documents

[Friday, December 03, 2010]

Recently Google gave its users another useful feature in Google Docs: you can now convert your images and PDF documents into searchable text. The procedure is very easy; all you need to do is select the option “Convert text from PDF or images files to Google Docs documents”.

While uploading your documents to Google Docs, you have the option of converting them into searchable text.

Once Google has finished converting the file to the Google Docs format, you get searchable text from your images and PDFs. When you open the converted document, Google shows a message like this:

“This document contains text automatically extracted from a PDF or image file. Formatting may have been lost and not all text may have been recognized.” (And true to its warning, most of the time I did lose my original formatting while testing this feature.)

So how good is this new OCR feature of Google Docs? I ran a few tests to get an idea about its capabilities and to find out where it falls short of expectations.

And what exactly can be expected from a typical OCR app?

A good OCR app should

Recognize the text in a document.
Support the document’s language.
Retain the format of the document.
Segregate the images and text present in a document.
Convert quickly.

While we expect these basic functions in any OCR app, the challenges an OCR application faces are

A scanned document with hardly recognizable text in it.
A document that is not properly vertically aligned, which decreases OCR accuracy drastically.
Documents with complex structure having images and text together.

Now let’s see how Google Docs OCR fared against these expectations. I chose the following documents to test its capabilities

A scan of an old document (text only)

A scan of a newspaper cutting with different fonts in different sizes and with images

A document with images and text

And the last one

A scanned document with complex structure

And here are the results

Text in the first scan was recognized with considerable accuracy. Here is the result

“"Molasses will remove mildew" stains from the most delicate fabrìfo,” Writes a; reader in reply to a remedy for the stahl published recently in these columns. In place of the more laborious method he prescribes the treatment A employed by seamen for mildewed sails. The sheets which are covered for a time with common molasses, when washed are found free from blemishes. I e

A package of absorbent cotton is a con-
venience in the household. One of its uses is in removing grease spots from woolìens. If applied immediately after oil, milk, butt/er or has been spilled onthe fabric it will absorb every trace.
A strong solution of ammonia. is the best ‘agent for 'cleaning out glass. If the carafe shows murky inside markings, fill it half full of the liquid and add some small pieces of potato parings. Shake it vigor: ously and rinse it carefully in clear water, Scrub the outside with a, small brush.

“

With minor glitches, Google Docs OCR converted the entire text with considerable accuracy.
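For anyone who wants to put a number on “considerable accuracy”, here is a minimal sketch in Python of one way to compare OCR output against a reference transcript. It is not how I ran the tests above. The function name `char_accuracy` is made up for this illustration, and the reference sentence is my own hand-corrected guess at what the scan says, not a verified transcript. The score comes from the standard library's `difflib.SequenceMatcher`, which gives a rough similarity ratio rather than a true character error rate.

```python
from difflib import SequenceMatcher


def char_accuracy(reference: str, ocr_output: str) -> float:
    """Return a rough similarity ratio in [0, 1] between two strings."""
    # Collapse whitespace so line-wrapping differences are not counted as errors.
    ref = " ".join(reference.split())
    out = " ".join(ocr_output.split())
    if not ref and not out:
        return 1.0
    # autojunk=False keeps the matcher from ignoring frequent characters in long texts.
    return SequenceMatcher(None, ref, out, autojunk=False).ratio()


# Hypothetical reference: a hand-corrected guess at one sentence from the first scan.
reference = "One of its uses is in removing grease spots from woollens."
# The same sentence as it appears in the OCR output quoted above.
ocr_output = "One of its uses is in removing grease spots from woolìens."

print(f"similarity: {char_accuracy(reference, ocr_output):.3f}")
```

A score close to 1 means the OCR text is nearly identical to the reference. The number is only as trustworthy as the reference transcript you type in by hand.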

Next I tried converting the second scan, which had different fonts and images.

And this is what I got in return

“I Newsmaker
vidya
journey called India
British historian and broadcaster Michael W00d’s "The Story of India” made its debut on American TV recently. India’s long pluralistic history has relevance today in a world coming toterms and cultural issues, says Wood. Excerpts
with 21 host of human from a conversation... PATEL
hat attracted yan tu India? On British and American TV, You see ancient Rome. Greece,
dlines. It aired in England in 200'?. Then
Women's Day
Celebrating the new
Se
mg sateume maps, changing some ila sages, cutting others. Ayodhya, fo

“

Forget about images; in this case Google Docs OCR was not even able to recognize the entire text correctly. It messed up the entire formatting and gave a result far from what is desired of even the most basic OCR application.

And now the last test, with a scanned document having a complex format.

To my disappointment, the result was even poorer than the last one

“Convert to perceptually-oriented color space
Scan a document
Luminance
Next pixel
..._1 Examine local neighborhood
- Inverse halftoning
Y Í More Pixels?
| Corr. y Ü 120 _ 4,“ Variable contrast mapping tf?" - - - -I | L | 12' il
Chrominance Channel Cleaning
Selective smoothing including bleed-
through reduction
* 128 I Post-processing i-I

“

My verdict on Google Docs OCR capabilities: Google Docs is not meant to be a replacement for commercial OCR software like ABBYY FineReader, Readiris, etc. It simply gives very basic OCR functionality, which is capable of converting text-only scans. If you are looking for something more than this, read my blog post on a comparative test of commercial OCR software.
