Quick look at Computer Vision API (Azure Cognitive Services)

As both my children have been playing the Pokémon TCG, an idea I’ve been toying with is an application that uses image recognition and OCR to identify and catalogue cards as they are added to the collection.

With recent improvements in cloud-based machine learning services, I was interested in whether this historically difficult task has become any simpler – so today I’ve taken some time to look into the OCR capabilities within the Computer Vision API, released as a preview as part of the Azure Cognitive Services family of APIs.

Using the API couldn’t be simpler (processing the image required only three lines of code), but my interest was more in the quality of the results. These initially surprised me; ultimately, however, I found that the output depends largely on the clarity of each individual card.

The primary test image I was using was a photo of Servine. As you can see from the image on the right there is a lot of content within the card, and a huge amount of variety in terms of fonts and font sizes to detect.

As part of the service the Computer Vision API attempts to first split the card into logical text Regions, which are then broken down further to Lines and Words.  
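To make the Regions, Lines and Words hierarchy concrete, here is a minimal Python sketch that walks a response of that shape and joins the words back into lines of text. It assumes the JSON uses the field names `regions`, `lines`, `words`, `text` and `boundingBox`; the sample response is hand-written for illustration and is not real output from the service, and `extract_lines` is a hypothetical helper name.

```python
# Hypothetical, hand-written fragment in the regions -> lines -> words shape.
# The field names and values are assumptions for illustration only.
sample_response = {
    "regions": [
        {
            "boundingBox": "21,16,304,60",
            "lines": [
                {
                    "boundingBox": "21,16,304,30",
                    "words": [
                        {"boundingBox": "21,16,90,30", "text": "STAGEI"},
                        {"boundingBox": "120,16,40,30", "text": "Ser"},
                        {"boundingBox": "165,16,50,30", "text": "vine"},
                    ],
                },
                {
                    "boundingBox": "21,50,200,12",
                    "words": [
                        {"boundingBox": "21,50,60,12", "text": "Evolves"},
                        {"boundingBox": "85,50,40,12", "text": "from"},
                        {"boundingBox": "130,50,45,12", "text": "Snivy"},
                    ],
                },
            ],
        }
    ]
}


def extract_lines(response):
    """Return a list of regions, each a list of text lines (words joined by spaces)."""
    regions = []
    for region in response.get("regions", []):
        lines = [
            " ".join(word["text"] for word in line.get("words", []))
            for line in region.get("lines", [])
        ]
        regions.append(lines)
    return regions


for index, lines in enumerate(extract_lines(sample_response)):
    print(f"Region {index}:")
    for text in lines:
        print(f"  {text}")
```

Keeping the region grouping in the result, rather than flattening everything into one list, makes it easier to see how the service has chosen to split up a given card.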


The red borders overlaid upon the images to the right outline the regions that the Computer Vision API returned. Note that the way the algorithm has grouped the regions doesn’t necessarily align with the logical blocks of text, or obvious ‘human readable’ differences such as font size.

It’s also worth noting that the regions calculated are not necessarily consistent – observe that the regions identified for the Clefairy card are completely different to those for Servine. Other cards resulted in even larger differences, depending on picture quality and glossiness of the card.
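Outlining a region on top of the photo needs its rectangle in pixel coordinates. The sketch below assumes each bounding box arrives as a comma-separated string in the order left, top, width, height; that ordering, the `Box` type and the `parse_bounding_box` helper are assumptions made for illustration, so check them against the actual response before relying on them.

```python
from typing import NamedTuple


class Box(NamedTuple):
    """Axis-aligned rectangle in image pixel coordinates."""
    left: int
    top: int
    width: int
    height: int

    @property
    def right(self) -> int:
        return self.left + self.width

    @property
    def bottom(self) -> int:
        return self.top + self.height


def parse_bounding_box(value: str) -> Box:
    """Parse an assumed 'left,top,width,height' string into a Box."""
    parts = [int(part) for part in value.split(",")]
    if len(parts) != 4:
        raise ValueError(f"expected 4 comma-separated integers, got {value!r}")
    return Box(*parts)


# Example value made up for illustration.
box = parse_bounding_box("21,16,304,451")
print((box.left, box.top), (box.right, box.bottom))
```

The two corner points are the form most drawing libraries expect when overlaying a rectangle on an image.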

Lines and Words

Each region is further broken down into lines and words – I’ve highlighted the identified lines using yellow in the image to the right. I was very impressed by the fact that even the text in the very small font was identified – though this was not the case on all cards I tested with.

The actual text recognition wasn’t quite as accurate. For example:

  • STAGE1 was identified as STAGEI.
  • Servine was actually identified as two words (Ser vine).
  • Oddly the very small ‘Evolves from Snivy’ in the second line was read perfectly.

Likewise the main body text read very well – though it was interesting to note that, while most fonts on the card were recognised, the word ‘Ability’ was not included as part of the line, which I’m putting down to the slightly different font and background.

Other test cards did not process so well though. This example from the Clefairy card shows that while most of the description text was correctly identified, the final line wasn’t recognised at all.


Overall the character recognition is certainly impressive, despite still being in preview. In a number of my test cases sufficient data was derived from the card to identify the basic characteristics of the Pokémon (name, HP, attack details, etc.). This would be enough to populate a basic catalogue of data, though it’s still far from perfect.

Getting a good result relies not just on the quality of the photo, but also on the characteristics of the card itself. Low contrast between the text and background was problematic, as was any reflection from the card, such as in the case of a holographic card or where protective card sleeves were used.

There is still not enough consistency in the results to be able to build out a robust card catalogue application, but it’s still a fun subject matter to investigate a little further.  My next step (for another day) is to look at whether a series of images of the same card can be used to apply some error correction and derive more accurate results.
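I haven’t built this yet, but one simple form that error correction could take is a majority vote: read the same field from several photos of the card and keep the most common value. The sketch below is only an illustration of that idea; `vote_on_readings` is a hypothetical helper and the sample readings are made up rather than taken from real output.

```python
from collections import Counter


def vote_on_readings(readings):
    """Return the most common non-empty reading, or None if there are none.

    Ties are resolved in favour of the reading that appeared first.
    """
    cleaned = [reading.strip() for reading in readings if reading and reading.strip()]
    if not cleaned:
        return None
    value, _count = Counter(cleaned).most_common(1)[0]
    return value


# Made-up readings of the card name from three photos of the same card.
name_readings = ["Servine", "Ser vine", "Servine"]
print(vote_on_readings(name_readings))
```

A vote like this only helps when the misreads differ from photo to photo; an error that repeats in every image, such as a 1 consistently read as an I, would still win.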


In a more recent post I’ve compared the results for these images with those from the Google Cloud Vision API. Take a look at my follow-up post, Quick look at Text Detection in the Google Vision API.