Quick look at Text Detection in the Google Vision API

Last week I posted some observations based on using the Computer Vision API (part of the Azure Cognitive Services) in an attempt to read text content from complex images.  For these experiments I’ve chosen to use Pokémon cards as they contain a large variety of fonts and font sizes, and do not use a basic grid for displaying all text.

Today I’ve taken some time to do a comparative test with the same images using the text detection feature within Google Cloud Vision API (which at the time of writing is also still in preview).

API Comparison

The simplicity of creating a test client was on par with the Azure service, again requiring only three lines of code once the client library and SDK were installed.

There are a few observable differences in the way the Google and Azure services have been constructed:

  • Azure Computer Vision API attempts to break all text down to regions, lines, and words – while the Google Cloud Vision API only identifies individual words.
  • Azure Computer Vision API attempts to identify the orientation of the image based on the words detected
  • Azure Computer Vision API returns graphical boundaries as rectangles, while Google Cloud Vision API uses polygons.  This allows the Google service to more accurately deal with angled text, and mitigates the lack of orientation detection.
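
To see what the rectangular representation gives up, here is a minimal sketch (the function name and the tuple-based vertex format are my own, not from either SDK) that collapses a bounding polygon into an axis-aligned rectangle – an angled word's rectangle inflates to cover the tilt, which is exactly the information the polygon preserves:

```python
def poly_to_rect(vertices):
    """Axis-aligned bounding rectangle (x, y, width, height) enclosing a polygon.

    `vertices` is a list of (x, y) tuples. For angled text the rectangle
    grows to cover the whole tilted polygon, losing the angle itself.
    """
    xs = [x for x, _ in vertices]
    ys = [y for _, y in vertices]
    return (min(xs), min(ys), max(xs) - min(xs), max(ys) - min(ys))
```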

Using the Google Cloud Vision API, the orientation could be derived from the co-ordinates provided.  Taking the longest word returned, one can generally assume that the longest side of its bounding polygon represents the horizontal axis of a correctly orientated image.
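
That heuristic can be sketched as follows – a hypothetical helper of my own, assuming the word's bounding polygon arrives as a list of (x, y) tuples rather than the SDK's vertex objects:

```python
import math

def orientation_from_poly(vertices):
    """Estimate image rotation (degrees) from one word's bounding polygon.

    The longest edge of the polygon is assumed to run along the text
    baseline, so its angle approximates the image's rotation.
    """
    edges = []
    for i in range(len(vertices)):
        (x1, y1), (x2, y2) = vertices[i], vertices[(i + 1) % len(vertices)]
        length = math.hypot(x2 - x1, y2 - y1)
        angle = math.degrees(math.atan2(y2 - y1, x2 - x1))
        edges.append((length, angle))
    angle = max(edges)[1]              # angle of the longest edge
    return ((angle + 90) % 180) - 90   # normalise into (-90, 90]
```

For a longer word the estimate gets more stable, since small pixel errors in the vertices matter less relative to the edge length.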

The Google Cloud Vision API also makes an attempt to group all text into a single ‘description’ value containing an approximation of all the identified words in the correct order.  This works well for standard text where the topmost points of the bounding polygons for each word align – however, for images where the polygons are bottom-aligned (due to differing font sizes) the description comes out in reverse order.
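
A possible workaround is to group words into lines by the vertical centre of each bounding polygon rather than its top edge. The sketch below is my own, using simplified word dicts in place of the API's per-word annotation objects:

```python
def order_words(words):
    """Rebuild reading order from per-word annotations.

    Each word is a dict like {"description": "...", "vertices": [(x, y), ...]}.
    Words are grouped into lines by the vertical centre of their bounding
    polygon – not the top edge – so mixed font sizes on one line stay together.
    """
    def centre(word):
        ys = [y for _, y in word["vertices"]]
        return sum(ys) / len(ys)

    def left(word):
        return min(x for x, _ in word["vertices"])

    lines = []  # each entry: [centre_y, height, [words]]
    for word in sorted(words, key=centre):
        ys = [y for _, y in word["vertices"]]
        height = max(ys) - min(ys)
        for line in lines:
            # same line if the centres sit within half the taller word's height
            if abs(line[0] - centre(word)) < max(line[1], height) / 2:
                line[2].append(word)
                break
        else:
            lines.append([centre(word), height, [word]])
    lines.sort(key=lambda line: line[0])
    return " ".join(
        w["description"] for line in lines for w in sorted(line[2], key=left)
    )
```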

Text Detection Comparison

The results varied from card to card, but overall, comparing the two services, the Google Cloud Vision API appears slightly more accurate.

Note: Comparison was done using photographs of the cards, saved in .jpg format.  This method was selected as it realistically simulates how the text detection service would be used from a smartphone application.  High-resolution .png images (as opposed to photographs) resulted in slightly more accurate results from both services.

Clean Image Comparison

The following results are based on a ‘clean image’ – in this case Servine provided the best results in both APIs.

  • Both APIs did a good job of reading clear blocks of text, including content in a very small (but non-italic) font.
  • Neither service identified the word ‘Ability’, presumably due to the font/style used.
  • Neither service correctly identified the small italic text (‘Evolves from Snivy’)
  • Neither service identified the font change for the numeric ‘1’ in STAGE1
Original image:
STAGE1 Servine
Evolves from Snivy

Azure:
STAGEI Ser vine
Evolves from Snivy

Google:
STAGE1 Servine
Evolves from Snivy


Original image:
NO. 496 Grass Snake Pokémon HT: 2’07” WT 35.5 lbs

Azure:
NO, 496 Grass Snake Pokémon HI: 2’07” WT: 35B lbs

Google:
NO, 496 Grass Snake Pokémon HT 2 O7 WT 35.3 lbs


Original image:
Ability Serpentine Strangle
When you play this Pokémon from your hand to evolve 1 of your Pokémon, you may flip a coin. If heads, your opponent’s Active Pokémon is now Paralyzed.

Azure:
Ability Serpentine Strangle
When you play this Pokémon from your hand to evolve l of your Pokémon, you may flip a coin. If heads, your opponent’s Active Pokémon is now Paralyzed.

Google:
Ability Serpentine Strangle
When you play this Pokémon from your hand to evolve l of your Pokémon, you may flip a coin. If heads, your opponent’s Active Pokémon is now Paralyzed.

Lower Quality Images

Lower quality images resulted in much bigger differences between the Vision APIs provided by Azure and Google, with the Google Cloud Vision API providing much more accurate text recognition.

Original image:
They photosynthesize by bathing their tails in sunlight.  When they are not feeling well, their tails droop.

Azure:
their tails in trot

Google:
rhey photosynthesize bybathinv their tails in sunlight, when they
are not feelinu well, their tails


Original image:
Basic Machamp-EX HP180

Azure:
Basic Machamp-EX HP180

Google:
Basic Machamp-EX MP180


Original image:
Pokémon-EX rule
When a Pokémon-EX has been Knocked Out, your opponent takes 2 Prize cards.

Azure:
Pokémon-EX rule
Wheoa Pokémon•EX has your opponent takes 2 Prue

Google:
Pokémon-EX rule
When a Pokémon Exhas been Knocked out,
your opponent takes 2 Prize cards.


Original image:
If its coat becomes fully charged with electricity, its tail lights up. It fires hair that zaps on impact.

Azure:
l/ its coat becomes fully ch.itxed 9ith electricity, its tail lis’hts up. It fires hair that zaps ort impact.

Google:
If its coat becomes fully charged with electricity, its tail lights up. lt fires hair that zaps on impact.


Although the Azure Computer Vision API does include more helpful features (such as identifying image orientation and pre-grouping regions and lines of text), the text detection exhibited by the Google Cloud Vision API was far more accurate when using lower-quality images.


Quick look at Computer Vision API (Azure Cognitive Services)

As both my children have been playing the Pokémon TCG, an idea I’ve been toying with is an application that uses image recognition and OCR to identify and catalogue cards as they are added to the collection.

With recent improvements in cloud-based machine learning services I was interested in whether this historically difficult task has become any simpler – so today I’ve taken some time to look into the OCR capabilities within the Computer Vision API, released as a preview as part of the Azure Cognitive Services family of APIs.

Using the API couldn’t be simpler (processing the image required only three lines of code), but my interest was more focussed around the quality of the results – which I must admit initially surprised me; ultimately, though, I found that the output depends largely on the clarity of each individual card.

The primary test image I was using was a photo of Servine. As you can see from the image on the right there is a lot of content within the card, and a huge amount of variety in terms of fonts and font sizes to detect.

As part of the service the Computer Vision API attempts to first split the card into logical text Regions, which are then broken down further to Lines and Words.  
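
Walking that hierarchy is straightforward; the sketch below (the function is my own) assumes the OCR JSON response shape – a "regions" list, each region with "lines", each line with "words" carrying a "text" field – and flattens it back to plain text:

```python
def flatten_ocr(result):
    """Collapse an OCR response (regions -> lines -> words) into text.

    `result` is a dict following the Computer Vision OCR JSON shape:
    each region has "lines", each line has "words", each word has "text".
    Returns one output line per detected line of text.
    """
    return "\n".join(
        " ".join(word["text"] for word in line["words"])
        for region in result.get("regions", [])
        for line in region["lines"]
    )
```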


The red borders overlaid on the images to the right outline the regions that the Computer Vision API returned. Note that the way the algorithm has grouped the regions doesn’t necessarily align with the logical blocks of text, or with obvious ‘human readable’ differences such as font size.

It’s also worth noting that the regions calculated are not necessarily consistent – observe that the regions identified for the Clefairy card are completely different to those for Servine. Other cards resulted in even larger differences, depending on picture quality and the glossiness of the card.

Lines and Words

Each region is further broken down into lines and words – I’ve highlighted the identified lines in yellow in the image to the right. I was very impressed that even the text in the very small font was identified – though this was not the case on all cards I tested with.

The actual text recognition wasn’t quite as accurate – for example:

  • STAGE1 was identified as STAGEI
  • Servine was actually identified as two words (Ser vine).
  • Oddly the very small ‘Evolves from Snivy’ in the second line was read perfectly.

Likewise the main body text read very well – though it was interesting to note that while most fonts on the card were recognised, the word ‘Ability’ was not included as part of the line, which I’m putting down to its slightly different font and background.

Other test cards did not process so well, though. This example from the Clefairy card shows that while most of the description text was correctly identified, the final line wasn’t recognised at all.


Overall the character recognition is certainly impressive, despite the service still being in preview.  In a number of my test cases sufficient data was derived from the card to identify the basic characteristics of the Pokémon (name, HP, attack details, etc.).  This would be enough to populate a basic catalogue of data, though it’s still far from perfect.

Getting a good result relies not just on the quality of the photo, but also on the characteristics of the card itself. Low contrast between the text and background was problematic, as was any reflection from the card, such as in the case of a holographic card or where protective card sleeves were used.

There is still not enough consistency in the results to be able to build out a robust card catalogue application, but it’s still a fun subject matter to investigate a little further.  My next step (for another day) is to look at whether a series of images of the same card can be used to apply some error correction and derive more accurate results.
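
As a starting point, a naive per-position majority vote over several readings might look like this – the function and its word-for-word alignment assumption are mine, and real error correction would need proper sequence alignment to handle dropped or split words:

```python
from collections import Counter

def vote_line(readings):
    """Majority-vote over several OCR readings of the same line of text.

    `readings` is a list of strings – one OCR result per photo of the
    card. Voting is done position-by-position, which assumes the
    readings split into the same number of words in the same order.
    """
    split = [reading.split() for reading in readings]
    width = max(len(words) for words in split)
    voted = []
    for i in range(width):
        # count the candidates seen at position i across all readings
        candidates = Counter(words[i] for words in split if i < len(words))
        voted.append(candidates.most_common(1)[0][0])
    return " ".join(voted)
```

With three photos of the Servine card, a one-off misread such as ‘STAGEI’ would be outvoted by the two correct ‘STAGE1’ readings.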


In a more recent post I’ve taken a look at the comparison of the results for these images using the Google Cloud Vision API.  Take a look at my follow up post Quick look at Text Detection in the Google Vision API.