A/B Evaluation

You can use Prodigy to evaluate models that generate outputs such as text, images or video. The evaluation works like an A/B test: you give Prodigy the output of your model and the output of a baseline, on the same inputs. This starts a web service that mixes the outputs of the models.

prodigy dataset image_captions "Compare image captions"
✨ Created dataset 'image_captions'.

prodigy compare image_captions model_a.jsonl model_b.jsonl
✨ Starting the web server on port 8080...

Once the service is running, you can open the web app in your browser, and mark which output you prefer. At the end of the session, you're told how often you preferred your system's output, and how often you preferred the baseline.

[Example card: photo of a pug wrapped in a blanket outdoors (Unsplash, Matthew Henry, unsplash.com/@matthewhenry), with two candidate captions: "A pug in a blanket on the grass" and "An unhappy child in the garden"]

To keep the interface consistent, the accept button will always approve the second output. To keep the evaluation fair, Prodigy will randomly map the inputs A and B to accept and reject. This mapping is passed through to the front-end, but not displayed.

The two input files should be newline-delimited JSON (JSONL), with each record containing an input, an output and a unique ID that is used to align the outputs of model A and model B. If an ID only occurs in one of the two comparison files, the example will be skipped.


model_a.jsonl

{ "id": 0, "input": {"image": "https://images.unsplash.com/photo-1433162653888-a571db5ccccf?ixlib=rb-0.3.5&q=80&fm=jpg&crop=entropy&cs=tinysrgb&w=400&fit=max&s=cb3099ba9dc50a500db3b298c6d7c156"}, "output": {"text": "A pug in a blanket on the grass"} }


model_b.jsonl

{ "id": 0, "input": {"image": "https://images.unsplash.com/photo-1433162653888-a571db5ccccf?ixlib=rb-0.3.5&q=80&fm=jpg&crop=entropy&cs=tinysrgb&w=400&fit=max&s=cb3099ba9dc50a500db3b298c6d7c156"}, "output": {"text": "An unhappy child in the garden"} }
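Before starting the server, it can help to sanity-check that the two files line up. Here's a minimal sketch of the pairing logic – the file names follow the examples above, and the function mirrors what the recipe does by skipping IDs that only occur in one file:

```python
import json

def read_jsonl(path):
    # Parse one JSON record per line, keyed by the record's unique ID
    with open(path, encoding="utf8") as f:
        return {record["id"]: record for record in map(json.loads, f)}

def pair_outputs(path_a, path_b):
    """Align model A's outputs with model B's by shared ID."""
    a, b = read_jsonl(path_a), read_jsonl(path_b)
    shared = a.keys() & b.keys()    # IDs present in both files
    skipped = a.keys() ^ b.keys()   # IDs present in only one file
    pairs = [(a[i], b[i]) for i in sorted(shared)]
    return pairs, skipped
```

Records whose IDs land in `skipped` won't be shown in the app, so a quick check here can catch misaligned files early.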

Each annotation record will be serialised as a JSON object and stored as a string in the database. The annotation records will have the following structure:

Annotation format

{ "id": 0, "input": {"image": "https://images.unsplash.com/photo-1433162653888-a571db5ccccf?ixlib=rb-0.3.5&q=80&fm=jpg&crop=entropy&cs=tinysrgb&w=400&fit=max&s=cb3099ba9dc50a500db3b298c6d7c156"}, "accept": {"text": "A pug in a blanket on the grass"}, "reject": {"text": "An unhappy child in the garden"}, "mapping": {"accept": "A", "reject": "B"}, "answer": "accept" }

To find out which output was better, first check whether the user clicked accept or reject, then look up whether that answer corresponded to the output from file A or file B:

answer = annotation['answer']            # user clicked "accept"
result = annotation['mapping'][answer]   # "A" is better
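Applied over a whole session, the same lookup gives a preference count per model. A small sketch – the `annotations` list here is hypothetical; in practice you'd load the records from the dataset, for example via `prodigy db-out`:

```python
from collections import Counter

def preference_counts(annotations):
    """Tally how often each model's output was preferred."""
    counts = Counter()
    for annotation in annotations:
        answer = annotation["answer"]
        if answer in ("accept", "reject"):  # skip ignored examples
            # The mapping resolves the answer to model "A" or "B"
            counts[annotation["mapping"][answer]] += 1
    return counts
```

The totals tell you how often your system's output was preferred over the baseline's, which is the headline number of the A/B evaluation.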

The compare recipe is very flexible, and can be used for many different model types, especially with a custom view component.

Machine translation

Evaluation has always been an important problem for machine translation, especially now that systems are able to take into account longer-range dependencies. To evaluate the translations, you need an annotator who at least speaks the target language. If the annotator doesn't speak the source language, you can display a reference translation in the prompt instead.

Input (German): FedEx von weltweiter Cyberattacke getroffen
Output 1: FedEx hit by worldwide cyberattack
Output 2: FedEx from worldwide Cyberattacke hit
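The translation outputs above can be converted into the recipe's input format with a few lines of Python. A sketch, assuming the source (or reference) text goes into the record's input field and the file names follow the earlier examples:

```python
import json

source = "FedEx von weltweiter Cyberattacke getroffen"
outputs_a = ["FedEx hit by worldwide cyberattack"]
outputs_b = ["FedEx from worldwide Cyberattacke hit"]

def write_jsonl(path, translations):
    # One record per line: shared id and input, model-specific output
    with open(path, "w", encoding="utf8") as f:
        for i, text in enumerate(translations):
            record = {"id": i, "input": {"text": source}, "output": {"text": text}}
            f.write(json.dumps(record) + "\n")

write_jsonl("model_a.jsonl", outputs_a)
write_jsonl("model_b.jsonl", outputs_b)
```

Because both files use the same IDs for the same source sentence, the recipe can pair the two translations on each card.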

If the differences are very subtle, you can also set the --diff flag to render the annotation using the diff interface. Prodigy supports diffing by sentence, word or character.

Input (German): FedEx von weltweiter Cyberattacke getroffen
Diff: FedEx hit by [world-wide → worldwide] cyberattack

Spelling and grammar correction

Spelling and grammar correction is another example of a task where the system has a range of acceptable outputs, making it hard to automatically evaluate against a single "gold standard". The compact --diff view is useful for this task, because the outputs of the two systems are usually very similar to the input.

Cyber researchers have linked the vulnerability exploited by the latest ransomware to Wanna Cry. Both versions of malicious software rely on weaknessses discovered by the National Security Agency years ago, Kaspersky said.

To make Prodigy highlight diffs on a character basis, add "diff_style": "chars" to your prodigy.json configuration file and run the recipe with the --diff flag enabled.
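For example, the relevant entry in prodigy.json would look like this (any other settings in your configuration stay as they are):

```json
{
  "diff_style": "chars"
}
```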

prodigy compare spellcheck_dataset model_a.jsonl model_b.jsonl --diff
✨ Starting the web server on port 8080...

Artistic style transfer

In a style transfer task, the model receives two inputs and produces an output that adjusts the first input using stylistic cues from the second. This has been applied to visual inputs with very impressive results. There's obviously no way to evaluate these systems automatically – beauty is in the eye of the beholder. However, even a subjective evaluation can be conducted rigorously and fairly, so that it's possible to tell which of several approaches is better for a given application.

Take this "Starry Night" rendition from the paper A Neural Algorithm of Artistic Style (Gatys et al., 2015) as an example:

[Figure: photo of the Neckarfront in Tübingen next to the painting "The Starry Night" by van Gogh, followed by two machine-generated renditions of the Neckarfront in that style – one with slightly brighter colors, one with slightly browner, less saturated colors]

Sentence similarity

Semantic similarity systems assign a numeric score indicating whether two pieces of text have a similar meaning. This task is also hard to evaluate, because there's no easy way to ensure that annotators judge similarity on the same scale. Instead of assigning a numeric score to a pair of sentences, systems can be evaluated by marking which of two sentences is more similar to a single input.

Input: To Close Digital Divide, Microsoft to Harness Unused Television Channels
Output 1: Microsoft is buying television bandwidth
Output 2: Microsoft is selling its shares in local news stations