Last week I posted some observations based on using the Computer Vision API (part of the Azure Cognitive Services) in an attempt to read text content from complex images. For these experiments I’ve chosen to use Pokémon cards as they contain a large variety of fonts and font sizes, and do not use a basic grid for displaying all text.
Today I’ve taken some time to do a comparative test with the same images using the text detection feature within Google Cloud Vision API (which at the time of writing is also still in preview).
The simplicity of creating a test client was on par with the Azure service, again requiring only three lines of code once the client library and SDK were installed.
There are a few observable differences in the way the Google and Azure services have been constructed:
- Azure Computer Vision API attempts to break all text down to regions, lines, and words – while the Google Cloud Vision API only identifies individual words.
- Azure Computer Vision API attempts to identify the orientation of the image based on the words detected, while the Google Cloud Vision API does not.
- Azure Computer Vision API returns graphical boundaries as rectangles, while Google Cloud Vision API uses polygons. This allows the Google service to more accurately deal with angled text, and mitigates the lack of orientation detection.
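To compare the two outputs side by side, it can help to bring them into one shape. The following is a minimal sketch, not code from either SDK, that flattens a nested regions/lines/words result into a flat word list and turns each rectangle into a four-point polygon. The field names (`regions`, `lines`, `words`, `boundingBox`, `text`) and the `x,y,width,height` string format are assumptions about the response shape, and the `sample` data is invented for illustration.

```python
def rect_to_polygon(bounding_box):
    """Convert an 'x,y,width,height' string into four (x, y) corner points."""
    x, y, w, h = (int(v) for v in bounding_box.split(","))
    # Corners in clockwise order, starting at the top-left.
    return [(x, y), (x + w, y), (x + w, y + h), (x, y + h)]


def flatten_words(ocr_result):
    """Walk regions -> lines -> words and return a flat list of (text, polygon)."""
    words = []
    for region in ocr_result.get("regions", []):
        for line in region.get("lines", []):
            for word in line.get("words", []):
                words.append((word["text"], rect_to_polygon(word["boundingBox"])))
    return words


# Invented sample in the assumed nested shape.
sample = {
    "orientation": "Up",
    "regions": [{
        "boundingBox": "10,10,200,40",
        "lines": [{
            "boundingBox": "10,10,200,40",
            "words": [
                {"boundingBox": "10,10,90,40", "text": "STAGE1"},
                {"boundingBox": "110,10,100,40", "text": "Servine"},
            ],
        }],
    }],
}

for text, polygon in flatten_words(sample):
    print(text, polygon)
```

With both services reduced to `(text, polygon)` pairs, the same comparison code can be run over either result.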
Using the Google Cloud Vision API, the orientation could be derived based on the co-ordinates provided. Using the longest word returned, one can generally assume that the longest side of its bounding polygon represents the horizontal axis of a properly orientated image.
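The longest-word heuristic can be sketched roughly as follows. This is an illustration only: it assumes each word has already been extracted as a `(text, polygon)` pair, where the polygon is a list of `(x, y)` vertices in order, and the function name `text_axis_angle` is hypothetical. Because the longest side gives an axis rather than a direction, the result is folded into the range 0 to 180 degrees, so an upside-down image cannot be distinguished from an upright one by this step alone.

```python
import math


def text_axis_angle(words):
    """Estimate the angle (degrees, 0 <= angle < 180) of the text's horizontal axis.

    words: list of (text, polygon) pairs, polygon being ordered (x, y) vertices.
    Uses the longest side of the longest word's polygon as the text axis.
    """
    # Pick the word with the most characters; ties go to the first one found.
    _, polygon = max(words, key=lambda w: len(w[0]))

    best_length, best_angle = -1.0, 0.0
    n = len(polygon)
    for i in range(n):
        (x1, y1), (x2, y2) = polygon[i], polygon[(i + 1) % n]
        length = math.hypot(x2 - x1, y2 - y1)
        if length > best_length:
            best_length = length
            best_angle = math.degrees(math.atan2(y2 - y1, x2 - x1))

    # Fold opposite directions together: an axis has no direction.
    return best_angle % 180.0


# Invented example data: an upright card and the same word rotated by 90 degrees.
upright = [("HP", [(0, 0), (20, 0), (20, 10), (0, 10)]),
           ("Servine", [(0, 20), (100, 20), (100, 40), (0, 40)])]
rotated = [("Servine", [(20, 0), (20, 100), (0, 100), (0, 0)])]

print(text_axis_angle(upright))
print(text_axis_angle(rotated))
```

An angle near 0 suggests the image is already horizontal (or upside down), while an angle near 90 suggests it is lying on its side.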
The Google Cloud Vision API also makes an attempt to group all text into a single ‘description’ value containing an approximation of all the identified words in the correct order. This works well for standard text where the topmost points of the bounding polygons for each word align; however, for images where the polygons are bottom-aligned (due to different font sizes), the description is in reverse order.
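One way to work around the ordering problem is to rebuild the reading order from the individual word polygons instead of relying on the combined description. The sketch below is an assumption about how that could be done, not a description of how the service orders words: it groups words into lines by checking whether a word's vertical centre falls inside an existing line's vertical span, then sorts each line left to right. The function name `reading_order` and the sample data are invented for illustration.

```python
def reading_order(words):
    """Return the words joined in an estimated reading order.

    words: list of (text, polygon) pairs, polygon being (x, y) vertices.
    Words are grouped into lines by vertical centre, so differing top edges
    (for example, mixed font sizes sharing a baseline) do not split a line.
    """
    items = []
    for text, polygon in words:
        xs = [x for x, _ in polygon]
        ys = [y for _, y in polygon]
        top, bottom = min(ys), max(ys)
        items.append({"text": text, "left": min(xs), "top": top,
                      "bottom": bottom, "centre": (top + bottom) / 2})

    # Visit words from top to bottom and attach each one to the first line
    # whose vertical span contains the word's centre.
    items.sort(key=lambda it: it["centre"])
    lines = []
    for item in items:
        for line in lines:
            if line["top"] <= item["centre"] <= line["bottom"]:
                line["items"].append(item)
                line["top"] = min(line["top"], item["top"])
                line["bottom"] = max(line["bottom"], item["bottom"])
                break
        else:
            lines.append({"top": item["top"], "bottom": item["bottom"],
                          "items": [item]})

    lines.sort(key=lambda ln: (ln["top"] + ln["bottom"]) / 2)
    ordered = []
    for line in lines:
        ordered.extend(it["text"] for it in sorted(line["items"],
                                                   key=lambda it: it["left"]))
    return " ".join(ordered)


# Invented sample: a small 'HP' label sharing a baseline with a larger '90',
# followed by a second line of text.
card_words = [
    ("90", [(60, 10), (120, 10), (120, 50), (60, 50)]),
    ("HP", [(0, 30), (40, 30), (40, 50), (0, 50)]),
    ("Servine", [(0, 70), (100, 70), (100, 90), (0, 90)]),
]

print(reading_order(card_words))
```

Sorting purely by top edge would place the taller word first; grouping by vertical centre keeps bottom-aligned words on the same line before ordering them by their left edge.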
Text Detection Comparison
The comparison results varied from card to card, but overall the Google Cloud Vision API appears slightly more accurate.
Clean Image Comparison
The following is an example of the results based on a ‘clean image’; in this case Servine provided the best results in both APIs.
- Both APIs did a good job of reading clear blocks of text, including content in a very small (but non-italic) font.
- Neither service identified the word ‘Ability’, presumably due to the font/style used.
- Neither service correctly identified the small italic text (Evolves from Snivy).
- Neither service identified the font change for the numeric ‘1’ in STAGE1.
Lower Quality Images
Lower quality images resulted in much bigger differences between the Vision APIs provided by Azure and Google, with the Google Cloud Vision API providing much more accurate text recognition.
Although the Azure Computer Vision API does include more helpful features (such as identifying image orientation and pre-grouping regions and lines of text), the text detection exhibited by the Google Cloud Vision API was far more accurate when using lower-quality images.