Quick look at Text Detection in the Google Vision API


Last week I posted some observations based on using the Computer Vision API (part of the Azure Cognitive Services) in an attempt to read text content from complex images.  For these experiments I’ve chosen to use Pokémon cards as they contain a large variety of fonts and font sizes, and do not use a basic grid for displaying all text.

Today I’ve taken some time to do a comparative test with the same images using the text detection feature within Google Cloud Vision API (which at the time of writing is also still in preview).

API Comparison

The simplicity of creating a test client was on par with the Azure service, again requiring only three lines of code once the client library and SDK were installed.
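As a rough illustration, a minimal client can be sketched as below. This is an assumption-laden sketch, not the post's actual code: it assumes the google-cloud-vision Python library with configured credentials, and the filename `card.jpg` and the stub values are hypothetical. The live call is shown commented out; the response handling is demonstrated against a hand-built stand-in with the same shape.

```python
# Hypothetical sketch of a minimal test client using the google-cloud-vision
# Python library (assumes credentials are already configured):
#
#   from google.cloud import vision
#   client = vision.ImageAnnotatorClient()
#   with open("card.jpg", "rb") as f:
#       response = client.text_detection(image=vision.Image(content=f.read()))
#
# The service returns one annotation per detected word, preceded by a first
# entry holding the combined 'description'. Extracting the individual words:

def detected_words(annotations):
    """Skip the combined first entry; return the per-word descriptions."""
    return [a["description"] for a in annotations[1:]]

# Stand-in for the annotations list, shaped like the API response:
stub = [
    {"description": "STAGE1 Servine"},   # combined description
    {"description": "STAGE1"},
    {"description": "Servine"},
]
print(detected_words(stub))  # ['STAGE1', 'Servine']
```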

There are a few observable differences in the way the Google and Azure services have been constructed:

  • Azure Computer Vision API attempts to break all text down to regions, lines, and words – while the Google Cloud Vision API only identifies individual words.
  • Azure Computer Vision API attempts to identify the orientation of the image based on the words detected.
  • Azure Computer Vision API returns graphical boundaries as rectangles, while Google Cloud Vision API uses polygons.  This allows the Google service to deal more accurately with angled text, and mitigates the lack of orientation detection.
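To illustrate the rectangle-versus-polygon difference, the sketch below (with hypothetical coordinates, not real API output) collapses a polygon into an axis-aligned rectangle of the kind an Azure-style result reports – for an angled word, the tilt information is lost:

```python
# Collapsing a Vision-style bounding polygon into an axis-aligned rectangle.
# The polygon for a word rotated ~45 degrees records the slant; the rectangle
# only records the overall bounding box.

def polygon_to_rect(vertices):
    """Axis-aligned (x, y, width, height) enclosing a list of (x, y) vertices."""
    xs = [x for x, _ in vertices]
    ys = [y for _, y in vertices]
    return (min(xs), min(ys), max(xs) - min(xs), max(ys) - min(ys))

# Hypothetical polygon for an angled word:
angled_word = [(0, 10), (10, 0), (14, 4), (4, 14)]
print(polygon_to_rect(angled_word))  # (0, 0, 14, 14) -- tilt is gone
```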

Using the Google Cloud Vision API, the orientation can be derived from the coordinates provided.  Taking the longest word returned, one can generally assume that the longest side of its bounding polygon represents the horizontal axis of a correctly oriented image.
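That heuristic can be sketched as follows – the polygon coordinates here are hypothetical, and ties between equal-length edges are resolved arbitrarily:

```python
import math

def polygon_orientation(vertices):
    """Angle (degrees) of the longest edge of a bounding polygon,
    taken as an estimate of the word's horizontal axis."""
    edges = list(zip(vertices, vertices[1:] + vertices[:1]))
    (x1, y1), (x2, y2) = max(edges, key=lambda e: math.dist(e[0], e[1]))
    return math.degrees(math.atan2(y2 - y1, x2 - x1))

# Hypothetical polygon for a long word in an upright image ...
upright = [(10, 20), (210, 20), (210, 40), (10, 40)]
# ... and the same word in an image rotated 90 degrees:
rotated = [(20, 210), (20, 10), (40, 10), (40, 210)]
print(polygon_orientation(upright))  # 0.0
print(polygon_orientation(rotated))  # approximately -90.0
```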

The Google Cloud Vision API also attempts to group all text into a single ‘description’ value containing an approximation of all the identified words in the correct order.  This works well for standard text, where the topmost points of the bounding polygons for each word align – however, for images where the polygons are bottom-aligned (due to differing font sizes), the description comes out in reverse order.
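A small sketch (with hand-made word boxes, not real API output) shows how ordering by the topmost point of each polygon goes wrong when font sizes differ on the same line:

```python
def reading_order(words):
    """Order (text, polygon) pairs by topmost point, then leftmost point."""
    def key(item):
        _, poly = item
        return (min(y for _, y in poly), min(x for x, _ in poly))
    return [text for text, _ in sorted(words, key=key)]

# Two words sharing one visual line; the second uses a larger font, so its
# top edge sits above the first word's top edge:
line = [
    ("Evolves", [(0, 10), (40, 10), (40, 20), (0, 20)]),    # small font, left
    ("SERVINE", [(45, 0), (110, 0), (110, 20), (45, 20)]),  # large font, right
]
print(reading_order(line))  # ['SERVINE', 'Evolves'] -- reversed
```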

Text Detection Comparison

The results varied from card to card, but comparing the two services overall, the Google Cloud Vision API appears slightly more accurate.

Note: The comparison was done using photographs of the cards, saved in .jpg format.  This method was selected as it realistically simulates how a text detection service would be used from a smartphone application.  High-resolution .png images (as opposed to photographs) produced slightly more accurate results from both services.

Clean Image Comparison

The following results are based on a ‘clean image’ – in this case Servine, which produced the best results in both APIs.

  • Both APIs did a good job of reading clear blocks of text, including content in a very small (but non-italic) font.
  • Neither service identified the word ‘Ability’, presumably due to the font/style used.
  • Neither service correctly identified the small italic text (Evolves from Snivy).
  • Neither service identified the font change for the numeric ‘1’ in STAGE1.
Original: STAGE1 Servine / Evolves from Snivy
Azure: STAGEI Ser vine / Evolves from Snivy
Google: STAGE1 Servine / Evolves from Snivy

Original: NO. 496 Grass Snake Pokémon HT: 2’07” WT 35.5 lbs
Azure: NO, 496 Grass Snake Pokémon HI: 2’07” WT: 35B lbs
Google: NO, 496 Grass Snake Pokémon HT 2 O7 WT 35.3 lbs

Original: Ability Serpentine Strangle / When you play this Pokémon from your hand to evolve 1 of your Pokémon, you may flip a coin. If heads, your opponent’s Active Pokémon is now Paralyzed.
Azure: Ability Serpentine Strangle / When you play this Pokémon from your hand to evolve l of your Pokémon, you may flip a coin. If heads, your opponent’s Active Pokémon is now Paralyzed.
Google: Ability Serpentine Strangle / When you play this Pokémon from your hand to evolve l of your Pokémon, you may flip a coin. If heads, your opponent’s Active Pokémon is now Paralyzed.

Lower Quality Images

Lower-quality images produced much bigger differences between the two services, with the Google Cloud Vision API providing noticeably more accurate text recognition.

Original: They photosynthesize by bathing their tails in sunlight.  When they are not feeling well, their tails droop.
Azure: their tails in trot
Google: rhey photosynthesize bybathinv their tails in sunlight, when they are not feelinu well, their tails droop.

Original: Basic Machamp-EX HP180
Azure: Basic Machamp-EX HP180
Google: Basic Machamp-EX MP180

Original: Pokémon-EX rule / When a Pokémon-EX has been Knocked Out, your opponent takes 2 Prize cards.
Azure: Pokémon-EX rule / Wheoa Pokémon•EX has your opponent takes 2 Prue
Google: Pokémon-EX rule / When a Pokémon Exhas been Knocked out, your opponent takes 2 Prize cards.

Original: If its coat becomes fully charged with electricity, its tail lights up. It fires hair that zaps on impact.
Azure: l/ its coat becomes fully ch.itxed 9ith electricity, its tail lis’hts up. It fires hair that zaps ort impact.
Google: If its coat becomes fully charged with electricity, its tail lights up. lt fires hair that zaps on impact.

Conclusion

Although the Azure Computer Vision API does include more helpful features (such as identifying image orientation and pre-grouping regions and lines of text), the text detection exhibited by the Google Cloud Vision API was far more accurate when using lower-quality images.

Author: Nick Hoggard

I’m an architect working with the Microsoft development stack; my real interest lies in how we can utilise the tools we have to continuously improve our processes and quality.
