A/B Evaluation

Rigorous evaluation is the most important part of any machine learning project. If you can't evaluate, you can't make progress. Prodigy gives you a simple, sound and efficient methodology that works on any problem. Even if you're making subjective quality decisions, you can still run repeatable experiments.

Image: photo of a pug wrapped in a blanket outdoors (source: Unsplash, by Matthew Henry)
Candidate output: "A pug in a blanket on the grass"
Candidate output: "An unhappy child in the garden"

Evaluate your generative models

Many of the most exciting capabilities of today's neural network models produce outputs that aren't simply "right" or "wrong". Prodigy makes it easy to conduct rigorous manual evaluations of these models in just a few minutes, using randomised A/B testing. To keep your judgments unbiased, you won't know which model produced which output when you mark the better one. As soon as you exit the server, Prodigy will show you the result. A full log of your decisions is available, allowing anyone to reproduce the evaluation.
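The idea behind a randomised, blind A/B comparison can be sketched in a few lines of plain Python. This is an illustration of the general technique, not Prodigy's implementation: the function names `make_blind_pairs` and `tally`, and the field names `shown` and `hidden_order`, are hypothetical and chosen only for this example.

```python
import random
from collections import Counter


def make_blind_pairs(outputs_a, outputs_b, seed=0):
    """Pair the two systems' outputs and randomise the display order.

    The annotator only ever sees task["shown"]; the mapping back to the
    systems is kept in task["hidden_order"] (hypothetical field names).
    """
    rng = random.Random(seed)  # fixed seed so the experiment is repeatable
    tasks = []
    for idx, (out_a, out_b) in enumerate(zip(outputs_a, outputs_b)):
        options = [("A", out_a), ("B", out_b)]
        rng.shuffle(options)  # the annotator cannot tell which system is first
        tasks.append({
            "id": idx,
            "shown": [text for _, text in options],
            "hidden_order": [name for name, _ in options],
        })
    return tasks


def tally(tasks, choices):
    """Map each choice (position 0 or 1) back to a system and count wins.

    Returns the win counts and a per-decision log that can be stored
    so the evaluation can be audited later.
    """
    wins = Counter()
    log = []
    for task, choice in zip(tasks, choices):
        winner = task["hidden_order"][choice]
        wins[winner] += 1
        log.append({"id": task["id"], "choice": choice, "winner": winner})
    return wins, log
```

Fixing the seed and keeping the decision log are what make a subjective evaluation like this repeatable: the same inputs produce the same presentation order, and every preference can be traced back to the example it was made on.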

Machine Translation

Evaluation has always been an important problem for machine translation, especially now that systems are able to take into account longer-range dependencies. To evaluate the translations, you need an annotator who speaks at least the target language. If the annotator doesn't speak the source language, you can display a reference translation in the prompt instead.
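Records like the ones in this section, each with an `input` text and an `output` text, can be produced from a list of source sentences and a list of system translations. The following is a minimal sketch, assuming one JSON record per line and one such file per system. The helper name `to_jsonl` is hypothetical, and the `id` field is an assumption added here so that the records of the two systems can be aligned; it does not appear in the records shown in this section.

```python
import json


def to_jsonl(sources, translations):
    """Serialise source/translation pairs as newline-delimited JSON.

    Each record has the shape
    {"id": ..., "input": {"text": ...}, "output": {"text": ...}}.
    The "id" field is an assumption of this sketch (see the lead-in).
    """
    if len(sources) != len(translations):
        raise ValueError("sources and translations must have the same length")
    lines = []
    for idx, (src, hyp) in enumerate(zip(sources, translations)):
        record = {"id": idx, "input": {"text": src}, "output": {"text": hyp}}
        # ensure_ascii=False keeps non-ASCII characters readable in the file
        lines.append(json.dumps(record, ensure_ascii=False))
    return "\n".join(lines) + "\n"


# Example usage: one string per system, ready to be written to a file.
sources = ["FedEx von weltweiter Cyberattacke getroffen"]
system_a = to_jsonl(sources, ["FedEx hit by worldwide cyberattack"])
system_b = to_jsonl(sources, ["FedEx from worldwide Cyberattacke hit"])
```

Both systems are serialised from the same `sources` list, so record `0` in one file always refers to the same input as record `0` in the other.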


{ "input": {"text": "FedEx von weltweiter Cyberattacke getroffen"}, "output": {"text": "FedEx hit by worldwide cyberattack"} }


{ "input": {"text": "FedEx von weltweiter Cyberattacke getroffen"}, "output": {"text": "FedEx from worldwide Cyberattacke hit"} }
Cyber researchers have linked the vulnerability exploited by the latest ransomware to Wanna Cry. Both versions of malicious software rely on weaknessses discovered by the National Security Agency years ago, Kaspersky said.

Spelling and grammar correction

Spelling and grammar correction is another example of a task where the system has a range of acceptable outputs, making it hard to automatically evaluate against a single "gold standard". The compact diff view is useful for this task, because the outputs of the two systems are usually very similar to the input.
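To illustrate why a compact diff suits this task, here is a minimal sketch of a word-level diff built on `difflib` from the Python standard library. It is not Prodigy's diff view: the function name `compact_diff` and the `[-...-]` / `{+...+}` markers are choices made for this example only.

```python
import difflib


def compact_diff(original, corrected):
    """Return a word-level diff of two strings.

    Unchanged words are kept as-is, deleted words are wrapped in [-...-]
    and inserted words in {+...+}.
    """
    src = original.split()
    tgt = corrected.split()
    matcher = difflib.SequenceMatcher(a=src, b=tgt, autojunk=False)
    pieces = []
    for op, i1, i2, j1, j2 in matcher.get_opcodes():
        if op == "equal":
            pieces.extend(src[i1:i2])
            continue
        if op in ("delete", "replace"):
            pieces.append("[-" + " ".join(src[i1:i2]) + "-]")
        if op in ("insert", "replace"):
            pieces.append("{+" + " ".join(tgt[j1:j2]) + "+}")
    return " ".join(pieces)


print(compact_diff(
    "rely on weaknessses discovered",
    "rely on weaknesses discovered",
))
# prints: rely on [-weaknessses-] {+weaknesses+} discovered
```

Because a correction system changes only a few words, the annotator can focus on the marked spans instead of rereading two nearly identical sentences.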

Artistic Style Transfer

In a style transfer task, the model receives two inputs and produces an output that adjusts the first input using stylistic cues from the second. This has been applied to visual inputs with very impressive results. There's no way to evaluate these systems automatically, since beauty is in the eye of the beholder. However, even a subjective evaluation can be conducted rigorously and fairly, so that it's possible to tell which of several approaches is better for a given application.

Example: "Starry Night" rendition from the paper A Neural Algorithm of Artistic Style (Gatys et al., 2015)

Photo of the Neckarfront in Tübingen next to painting "The Starry Night" by van Gogh
Machine-generated rendition of the Neckarfront in Tübingen in the style of "The Starry Night" by van Gogh, with slightly brighter colors
Machine-generated rendition of the Neckarfront in Tübingen in the style of "The Starry Night" by van Gogh, with slightly browner and less saturated colors
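Once preferences between two approaches have been collected, one common way to check whether the observed difference could be due to chance is a two-sided sign test. The sketch below is one possible analysis, not something the text above prescribes: the function name `sign_test_p_value` is hypothetical, and it assumes that each decision is independent and that ties or skipped examples have already been excluded.

```python
from math import comb


def sign_test_p_value(wins_a, wins_b):
    """Two-sided exact sign test for paired A/B preferences.

    Under the null hypothesis that both systems are equally likely to be
    preferred, the number of wins follows Binomial(n, 0.5). The p-value is
    the probability of a split at least as lopsided as the one observed.
    """
    n = wins_a + wins_b
    if n == 0:
        return 1.0  # no decisions, no evidence either way
    k = max(wins_a, wins_b)
    # Probability of k or more wins for one side, out of n fair coin flips
    tail = sum(comb(n, i) for i in range(k, n + 1)) / 2 ** n
    # Double it for the two-sided test, capped at 1
    return min(1.0, 2 * tail)


print(sign_test_p_value(9, 1))   # 0.021484375
print(sign_test_p_value(5, 5))   # 1.0
```

A small p-value suggests the preference for one approach is unlikely to be an accident of the sample; with only a handful of decisions, even a clear-looking split may not be conclusive.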