Danny Briskin, QA Consultants Senior Automation Engineer

Motivation

During accessibility (WCAG) testing of a website, one of the most confusing checks concerns decorative images. The requirement says:

Text Alternatives: Provide text alternatives for any non-text content so that it can be changed into other forms people need, such as large print, braille, speech, symbols or simpler language.

But, on the other hand:

Decoration, Formatting, Invisible: If non-text content is pure decoration, is used only for visual formatting, or is not presented to users, then it is implemented in a way that it can be ignored by assistive technology.

In other words, if an image is informative, it should have an alt attribute filled with an intelligible description. Otherwise, the alt attribute must be empty.
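Before judging whether an alt text is appropriate, a test tool first has to find out what the page author declared. Below is a minimal sketch of that first step, using only the Python standard library. The class name, the function name, and the sample markup are all invented for illustration. It does not decide whether an image really is decorative; it only sorts images by the state of their alt attribute.

```python
from html.parser import HTMLParser


class AltAuditor(HTMLParser):
    """Collects <img> tags and sorts them by the state of their alt attribute."""

    def __init__(self):
        super().__init__()
        self.missing = []    # no alt attribute at all
        self.empty = []      # alt="": the author declares the image decorative
        self.described = []  # non-empty alt: the author declares it informative

    def handle_starttag(self, tag, attrs):
        if tag != "img":
            return
        attributes = dict(attrs)
        src = attributes.get("src", "")
        if "alt" not in attributes:
            self.missing.append(src)
        elif not (attributes["alt"] or "").strip():
            # A bare `alt` without a value is parsed as None; treat it as empty.
            self.empty.append(src)
        else:
            self.described.append(src)


def audit_alt_attributes(html):
    """Parses an HTML string and returns the populated auditor."""
    auditor = AltAuditor()
    auditor.feed(html)
    auditor.close()
    return auditor


# Sample markup invented for this example.
page = """
<img src="logo.png" alt="Company logo">
<img src="divider.png" alt="">
<img src="hero.jpg">
"""
report = audit_alt_attributes(page)
print("missing:", report.missing)
print("empty:", report.empty)
print("described:", report.described)
```

The interesting question, whether the images in the "empty" and "described" groups were declared correctly, is what the rest of this article is about.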

The issue

For a web page author, it is obvious which images are decorative, and a human tester can tell just as easily. But how can this be figured out automatically? Because of the nature of web design, no approach based on the size or color of an image is reliable. Any decision about whether an image is decorative has to be based on the contents of the image. But that is not the only meaningful criterion. The same image can be decorative on some pages and informative on others. A picture of a pretty woman in underwear is quite informative in the catalogue of a women's underwear shop, but it becomes purely decorative on an online banking website.

Let’s sum up: for an image to be informative, its topic should match the topic of the page it appears on.

Idea

We are faced with a classic classification task. More precisely, there are two classifications: one for the page text and one for the image. If the classes are the same, the image is marked as informative; otherwise, it is decorative. Nowadays, machine learning technologies allow making reliable decisions in a relatively brief time. The simplest way is to use Python and one of the well-known libraries to create two classifiers. There are some drawbacks to that solution:

We need to create a set of all possible topics.
We need to train a model on a lot of examples to achieve good accuracy.

Attempt number one

I started with a text classifier (yes, it looked much simpler than the one for images). Using the spaCy library, with a 2017 slice of English Wikipedia, Google News, and several RSS news feeds as training examples, I trained a model that reached 91% accuracy across 19 predefined categories. The number of training examples was not large enough, and I was working on increasing it, but the first drawback worried me much more. What if the text does not relate to any of the given topics? Is it a promising idea to increase the number of topics, and how would that affect accuracy? Thinking about that, I gradually stopped the development.

Attempt number two

Then I read about zero-shot learning and CLIP. CLIP stands for Contrastive Language-Image Pre-training. “It can’t be possible!” was my first thought. How can it work with any image and any text? Literally, any! The library ships with models, but one cannot train a model on all the images in the world. A kind of magic…

I started to verify it with some simple tests:

First one

A cow

Category	Probability
a boy	0.0001 %
a cow	99.98 %
a horse	0.01 %
BMW	0.001 %
a cowboy is riding through prairie	0.01 %

Well, it was simple.
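Where do such percentages come from? As far as I understand CLIP, the image and every candidate text are embedded into the same vector space, the cosine similarity of each pair is computed and scaled, and a softmax turns the scaled similarities into probabilities. The sketch below reproduces only that arithmetic in plain Python. The three-dimensional vectors are invented for illustration (real embeddings have hundreds of dimensions and come from the model), and the scale factor of 100 is an assumption based on commonly published CLIP checkpoints.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def softmax(logits):
    """Numerically stable softmax: subtract the maximum before exponentiating."""
    peak = max(logits)
    exps = [math.exp(value - peak) for value in logits]
    total = sum(exps)
    return [e / total for e in exps]


def zero_shot_probabilities(image_vec, text_vecs, scale=100.0):
    """Scaled cosine similarities pushed through a softmax (scale is an assumption)."""
    logits = [scale * cosine_similarity(image_vec, t) for t in text_vecs]
    return softmax(logits)


# Toy embeddings invented for this example; they do not come from a real model.
image_embedding = [0.9, 0.1, 0.2]
candidates = {
    "a boy": [0.1, 0.9, 0.3],
    "a cow": [0.88, 0.12, 0.25],
    "a horse": [0.7, 0.3, 0.4],
}
probabilities = zero_shot_probabilities(image_embedding, list(candidates.values()))
for label, probability in zip(candidates, probabilities):
    print(f"{label}: {probability:.6f}")
```

Because of the softmax, the probabilities always sum to 1 across the candidate texts, so each value says how well a text fits the image relative to the other candidates.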

Let’s make it a bit confusing (images were taken from a web shop catalog):

what it is?

What do you think it is? A candy? A toy? No, it is “Candy Corn Dancin Tricky Treat Singing Stuffed Animal With Motion 10”, with 98% confidence!

And this one?

sweater

“CHAMPION HERITAGE OVER SHOULDER SCRIPT L/S – MEN’S” - 99%

And finally, here the model was not entirely sure:

tshirt

“Aeropostale Marilyn Monroe Graphic Tee - Navy, Medium” - 62%

Well, okay, even I, a human, was not sure what they wanted to sell using this picture (the answer is a T-shirt).

So, it is working. Let’s make it usable.

Architecture

Let’s assume that we have WCAG testing software, and we want to provide it with an API/service that receives a link to a picture and one or more texts (words or sentences) that are possible topics of the given image. The response contains, for each given text, the probability that it matches the topic of the image.

The request:

{
    "image_url": "https://somesite.com/image.jpg",
    "image_texts": [
        "Mens Naruto Hero Of The Hidden Leaf Tee - White",
        "Candy Corn Dancin Tricky Treat Singing Stuffed Animal With Motion 10",
        "Old Skool Rainbow Checkerboard    - Little Kid - Black / Multi",
        "Converse Chuck Taylor All Star Lo Sneaker - Little Kid - Black",
        "DISNEY MICKEY MOUSE QUILTED OH BOY CROSSBODY",
        "Christmas Lady & Tramp",
        "Aeropostale Marilyn Monroe Graphic Tee - Navy, Medium",
        "CHAMPION HERITAGE OVER SHOULDER SCRIPT L/S - MEN'S"
    ]
}
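From the side of the testing tool, preparing such a request could look roughly like the sketch below. It uses only the Python standard library and builds the request without sending it. The endpoint URL and the use of POST are my assumptions for illustration; only the field names image_url and image_texts come from the request format above.

```python
import json
import urllib.request

# Hypothetical endpoint: the article does not state the service URL or path.
SERVICE_URL = "http://localhost:8080/classify"


def build_request(image_url, image_texts, service_url=SERVICE_URL):
    """Builds (but does not send) a JSON request in the format shown above."""
    if not image_texts:
        raise ValueError("at least one candidate text is required")
    payload = {"image_url": image_url, "image_texts": list(image_texts)}
    return urllib.request.Request(
        service_url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",  # assumption: the service accepts POST with a JSON body
    )


request = build_request(
    "https://somesite.com/image.jpg",
    ["Christmas Lady & Tramp", "DISNEY MICKEY MOUSE QUILTED OH BOY CROSSBODY"],
)
print(request.get_method(), request.full_url)
print(request.data.decode("utf-8"))
# Once the service is running, urllib.request.urlopen(request) would send it.
```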

The response:

[
    {
        "probability": 0.0010680541163310409,
        "text": "Mens Naruto Hero Of The Hidden Leaf Tee - White"
    },
    {
        "probability": 2.7145318881593994e-07,
        "text": "Candy Corn Dancin Tr <...> nimal With Motion 10"
    },
    {
        "probability": 0.01749918796122074,
        "text": "  Old Skool Rainbow  <...>  Kid - Black / Multi"
    },
    {
        "probability": 5.644326392939547e-06,
        "text": "Converse Chuck Taylo <...> - Little Kid - Black"
    },
    {
        "probability": 0.00018363054550718516,
        "text": "DISNEY MICKEY MOUSE QUILTED OH BOY CROSSBODY"
    },
    {
        "probability": 0.00025380559964105487,
        "text": "Christmas Lady & Tramp"
    },
    {
        "probability": 4.111427188036032e-05,
        "text": "Aeropostale Marilyn  <...> c Tee - Navy, Medium"
    },
    {
        "probability": 0.9796881079673767,
        "text": "CHAMPION HERITAGE OVER SHOULDER SCRIPT L/S - MEN'S"
    }
]
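On the consumer side, the testing tool has to turn this list into a verdict. One possible way, sketched below, is to take the best-matching text and compare its probability with a threshold. The function names and the threshold of 0.5 are assumptions made for illustration, not part of the service; the sample items are a shortened copy of the response above.

```python
def best_match(response_items):
    """Returns the response item with the highest probability."""
    return max(response_items, key=lambda item: item["probability"])


def classify_image(response_items, threshold=0.5):
    """Labels the image by comparing the best probability with a threshold.

    The threshold is a hypothetical tuning parameter, not a value from the service.
    """
    top = best_match(response_items)
    verdict = "informative" if top["probability"] >= threshold else "decorative"
    return verdict, top["text"], top["probability"]


# Shortened copy of the response shown above.
sample_response = [
    {"probability": 0.0010680541163310409,
     "text": "Mens Naruto Hero Of The Hidden Leaf Tee - White"},
    {"probability": 0.00025380559964105487,
     "text": "Christmas Lady & Tramp"},
    {"probability": 0.9796881079673767,
     "text": "CHAMPION HERITAGE OVER SHOULDER SCRIPT L/S - MEN'S"},
]

verdict, text, probability = classify_image(sample_response)
print(verdict, "|", text, "|", round(probability, 4))
```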

A Swagger-like UI should be provided to make the service easier to use. The service should be easy to deploy to any kind of environment and should not depend on that environment.

Implementation

Flask served by Waitress was chosen as the API/service engine. Swagger UI serves the API documentation. The service is packaged in a Docker container.
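The sketch below is not the project code. It is a framework-agnostic outline, under my own assumptions, of what the request handling behind such a route might do: validate the payload, call a scoring function, and shape the response. The function names, the error format, and the status codes are hypothetical, and uniform_scorer is a stand-in for the real CLIP call so that the example runs without a model.

```python
def validate_request(payload):
    """Checks the payload against the request format and returns (url, texts)."""
    if not isinstance(payload, dict):
        raise ValueError("request body must be a JSON object")
    image_url = payload.get("image_url")
    texts = payload.get("image_texts")
    if not isinstance(image_url, str) or not image_url.strip():
        raise ValueError("image_url must be a non-empty string")
    if not isinstance(texts, list) or not texts:
        raise ValueError("image_texts must contain at least one text")
    if not all(isinstance(text, str) for text in texts):
        raise ValueError("every entry of image_texts must be a string")
    return image_url, texts


def handle_request(payload, score_image):
    """Returns (status_code, body); score_image is a hypothetical model wrapper."""
    try:
        image_url, texts = validate_request(payload)
    except ValueError as error:
        return 400, {"error": str(error)}
    probabilities = score_image(image_url, texts)
    body = [
        {"probability": probability, "text": text}
        for probability, text in zip(probabilities, texts)
    ]
    return 200, body


def uniform_scorer(image_url, texts):
    """Placeholder scorer: gives every text the same probability."""
    return [1.0 / len(texts)] * len(texts)


status, body = handle_request(
    {"image_url": "https://somesite.com/image.jpg", "image_texts": ["a cow", "a horse"]},
    uniform_scorer,
)
bad_status, bad_body = handle_request({"image_url": "", "image_texts": []}, uniform_scorer)
print(status, body)
print(bad_status, bad_body)
```

In a Flask application, a route function could pass the parsed JSON body to handle_request and return the resulting body and status code, which keeps the logic testable without starting a server.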

Project code can be found here

Summary

The idea of using machine learning in test automation is not brand new. I believe that one day a fully fledged artificial intelligence will perform all testing activities. But even now, we can replace the tedious and boring work of a manual tester with a fast and robust automation script. An implementation in the form of an API/service packaged in a Docker container makes it a flexible, reliable, maintainable, testable, and scalable solution.


Automatic detection of decorative images using Machine Learning