A/B Evaluation · Prodigy · An annotation tool for AI, Machine Learning & NLP

Evaluate your generative models

For tasks like translation, captioning, image generation or dialogue, outputs aren't simply "right" or "wrong". Automatic metrics capture only part of what makes an output good, but that doesn't mean you should give up on evaluation altogether. If your model takes hours to train, why not spend a few minutes on manual evaluation? You'll understand your models better, and get much better data about what's working and what's not.

Randomized A/B testing is a simple and effective way to evaluate any model. Prodigy makes this tried-and-tested approach easy, so you can evaluate your experiments in minutes. All you need are the outputs of the two models you want to compare, produced from the same inputs. Prodigy shuffles them and asks which output you prefer for each example. A few hundred samples are usually enough to reach a reliable conclusion, and every decision is saved so that you or a colleague can review it later. This lets you apply a rigorous approach to even the most subjective decisions.
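To make the idea concrete, here is a minimal sketch of a randomized A/B comparison in plain Python. It is not Prodigy's implementation or API: the function names (make_blind_pairs, tally, sign_test_p_value) and the task dictionary layout are hypothetical, chosen only for illustration. The sketch assumes every example gets a clear preference (no ties or skips) and uses an exact two-sided sign test to check whether the observed preference could plausibly be a coin flip.

```python
import math
import random


def make_blind_pairs(inputs, outputs_a, outputs_b, seed=0):
    """Pair each input with both outputs in a random left/right order."""
    rng = random.Random(seed)
    tasks = []
    for text, a, b in zip(inputs, outputs_a, outputs_b):
        options = [("A", a), ("B", b)]
        rng.shuffle(options)  # hide which model produced which output
        tasks.append({"input": text, "options": options})
    rng.shuffle(tasks)  # also randomize the order of the examples
    return tasks


def tally(tasks, choices):
    """Map each chosen position (0 or 1) back to the model behind it."""
    wins = {"A": 0, "B": 0}
    for task, choice in zip(tasks, choices):
        model, _ = task["options"][choice]
        wins[model] += 1
    return wins


def sign_test_p_value(wins_a, wins_b):
    """Exact two-sided binomial test against a 50/50 preference."""
    n = wins_a + wins_b
    k = max(wins_a, wins_b)
    # Probability of a split at least this lopsided in one direction.
    tail = sum(math.comb(n, i) for i in range(k, n + 1)) / 2 ** n
    return min(1.0, 2 * tail)
```

An annotator only ever sees the input and the two shuffled options, and records the position of the preferred one. The tally then reveals which model won each comparison, and a small p-value suggests the preference is unlikely to be due to chance alone.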

Example: Evaluate contextually-keyed word embeddings

Topic models and word vectors often reveal unexpected associations between words or phrases, making them great tools for knowledge discovery. They're even better if you make the terms you're modeling more precise, using NLP annotations. In 2019, we used our NLP library spaCy to process over 150 billion words of Reddit comments, and trained sense2vec models using the annotations.

Knowledge discovery technologies will always be hard to evaluate automatically, because we're looking for outputs that are useful, rather than merely true. We trained several different word vector models, using different software, settings and data. We then used Prodigy's A/B evaluation to quickly compare our models. This helped us home in on the settings that produced interesting output, without having to form a precise definition of what exactly we were looking for.
