First Steps with Prodigy

Before you get started, make sure Prodigy is installed in your current environment, and the Prodigy home directory is created in the right place. You can always use the PRODIGY_HOME environment variable to change it to a custom location. Also make sure you've downloaded and installed a spaCy model, e.g. the default English model.

For a detailed API reference and available options, see the PRODIGY_README.html available for download with Prodigy. For questions and bug reports, or to exchange ideas and recipes with other users, check out the Prodigy Support forum.


In the documentation, you'll come across a variety of terms specific to collecting annotations with Prodigy. Here's a quick overview of the most common ones, including a short description and links for more information.

annotation task: A single question you're collecting feedback on from the annotator – for example, whether an entity is correct or whether a label applies to a text. Internally, annotation tasks are simple dictionaries containing the task properties, like the text, the entity spans or the labels. Annotation tasks are also often referred to as "(annotation) examples".
annotation interface: The visual presentation of the annotation task – for example, text with highlighted entities, text with a category label, an image or a multiple-choice question. In the code, this is also often referred to as the view_id. See here for a list of available options.
dataset: A named collection of annotated tasks. A new dataset is usually created for each project or experiment. The data can be exported or used to train a model later on.
session: A single annotation session, from starting the Prodigy server to exiting it. You can start multiple sessions that add data to the same dataset. The annotations of each session will also be stored as a separate dataset, named after the timestamp. This lets you inspect or delete individual sessions.
database: The storage backend used to save your datasets. Prodigy currently supports SQLite (default), PostgreSQL and MySQL out-of-the-box, but also lets you integrate custom solutions.
recipe: A Python function that can be executed from the command line and starts the Prodigy server for a specific task – for example, correcting entity predictions or annotating text classification labels. Prodigy comes with a range of built-in recipes, but also allows you to write your own.
stream: An iterable of annotation tasks, e.g. a generator that yields dictionaries. When you load in your data from a file or an API, Prodigy will convert it to a stream. Streams can be annotated in order, or filtered and reordered to only show the most relevant examples.
loader: A function that loads data and returns a stream of annotation tasks. Prodigy comes with built-in loaders for the most common file types and a selection of live APIs, but you can also create your own functions.
sorter: A function that takes a stream of (score, example) tuples and yields the examples in a different order, based on the score – for example, to prefer uncertain or high scores. Prodigy comes with several built-in sorters that are used in the active-learning-powered recipes.
spaCy model: One of the available pre-trained statistical language models for spaCy. Models can be installed as Python packages and are available in different sizes and for different languages. They can be used as the basis for training your own model with Prodigy.
active learning: Using the model to select examples for annotation based on the current state of the model. In Prodigy, the selection is usually based on the examples the model is most uncertain about, i.e. the ones with a prediction closest to 50/50.
batch training: Training a new model from a dataset of collected annotations. Using larger batches of data and multiple iterations usually leads to better results than just updating the model in the loop. This is why you usually want to collect annotations first, and then use them to batch train a model from scratch.
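A few of these concepts – streams, tasks and sorters – are easiest to see in code. Here's a minimal, self-contained sketch; the function names are illustrative, not Prodigy's actual API:

```python
def make_stream(texts):
    # A stream is just an iterable of task dicts – here, a generator
    # that yields one annotation task per input text.
    for text in texts:
        yield {"text": text}

def prefer_uncertain(scored_stream):
    # A sorter takes (score, example) tuples and yields the examples
    # in a new order. This simplified version prefers scores closest
    # to 0.5, i.e. the examples the model is least sure about.
    # (Prodigy's built-in sorters also work on infinite streams.)
    for score, example in sorted(scored_stream, key=lambda pair: abs(pair[0] - 0.5)):
        yield example

scored = [(0.9, {"text": "a"}), (0.51, {"text": "b"}), (0.1, {"text": "c"})]
ordered = list(prefer_uncertain(scored))
# The most uncertain example (score 0.51) comes first
```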

Create a new dataset

Datasets let you group annotations together. It's recommended to create a new dataset for each annotation project, evaluation run or experiment. If you add a description or author, this info will also be displayed in the web application.

prodigy dataset my_set "A new dataset" --author Me
✨ Created dataset 'my_set'.

A single annotation can be part of several datasets. Prodigy will also create a session dataset for each individual annotation session, using the timestamp as the dataset name. For a list of all datasets and sessions, use the prodigy stats -ls command.

Annotate data

To start annotating, you need a source of examples. You can either load in your own data, or use one of the sample datasets below.

News headlines

200 headlines from stories about Silicon Valley from The New York Times.

GitHub issues

830 GitHub issue titles for search queries related to documentation and instructions.

Named Entity Recognition

The ner.teach recipe uses a spaCy model to detect entities in the stream of examples. It then starts the web server so you can accept or reject the entity suggestions. As you annotate, the model is updated and Prodigy will use the updated predictions to suggest the most relevant entities for annotation. All annotations you collect will be stored in your dataset.

prodigy ner.teach my_set en_core_web_sm news_headlines.jsonl
✨ Starting the web server on port 8080...

Disruptions: The Echo Chamber of Silicon Valley [LOC]
source: The New York Times

By default, all entities will be shown. To only annotate one or more specific entity labels, use the --label option – for example, --label ORG or --label ORG,PERSON. Keep in mind that this recipe only works for entity types already present in the model. See here for an overview of spaCy's NER scheme.
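The tasks ner.teach creates are plain dictionaries with character-offset entity spans. Here's a hand-written task in that shape – the "spans" layout follows Prodigy's JSONL format, but treat the exact fields as a sketch to verify against the README:

```python
# An annotation task with one suggested entity. "start" and "end"
# are character offsets into "text".
task = {
    "text": "Disruptions: The Echo Chamber of Silicon Valley",
    "spans": [{"start": 33, "end": 47, "label": "LOC"}],
    "meta": {"source": "The New York Times"},
}

span = task["spans"][0]
highlighted = task["text"][span["start"]:span["end"]]
# highlighted == "Silicon Valley"
```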

Text Classification

The textcat.teach recipe lets you start off with an existing spaCy model or a blank one. It starts the web server so you can accept or reject texts with a category label. As you annotate, the model is updated and Prodigy will use the updated predictions to suggest the most relevant texts for annotation. All annotations you collect will be stored in your dataset.

prodigy textcat.teach my_set en_core_web_sm news_headlines.jsonl --label POLITICS
✨ Starting the web server on port 8080...

Next Job for Obama? Silicon Valley Is Hiring [POLITICS]
source: The New York Times

Load your own data

Prodigy supports loading data from a variety of different file types, and will use the file extension to determine which loader to use. For JSON-like and CSV formats, the text you want to load should be available under the key or column header "text".


{"text": "Pinterest Hires Its First Head of Diversity"}
{"text": "Airbnb and Others Set Terms for Employees to Cash Out"}


Pinterest Hires Its First Head of Diversity
Airbnb and Others Set Terms for Employees to Cash Out
prodigy ner.teach my_set en_core_web_sm /path/to/my_data.jsonl
prodigy textcat.teach my_set en_core_web_sm path/to/my_data.txt --label POLITICS
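Under the hood, a loader just turns a file into a stream of task dicts. Here's a minimal sketch of what a JSONL loader does – one JSON object per line, each with a "text" key. Prodigy's built-in loaders are more robust; this only shows the shape of the resulting stream:

```python
import json

def load_jsonl(path):
    # Minimal JSONL loader sketch: parse one JSON object per
    # non-empty line and yield it as an annotation task.
    with open(path, encoding="utf8") as f:
        for line in f:
            line = line.strip()
            if line:
                yield json.loads(line)
```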

Streaming data from live APIs

Streaming in content like news headlines or images from live APIs is a great way to jumpstart your project, test how your model is performing on real-world data or quickly bootstrap a set of evaluation examples. To get started, pick one of the supported APIs, sign up for a key and add it to your prodigy.json config file.
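For example, a prodigy.json entry for the Guardian API might look like the fragment below. The exact "api_keys" section name is an assumption here – check the PRODIGY_README.html for the authoritative config schema:

```json
{
    "api_keys": {
        "guardian": "YOUR-API-KEY"
    }
}
```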

prodigy textcat.teach my_set en_core_web_sm "Silicon Valley" --api guardian --label POLITICS

Import existing annotations

If you've created annotations using a different tool, you can import them into Prodigy via the db-in command. This will let you use them for training or evaluation. All loadable file types, like .jsonl or .csv, are supported. If your annotations contain entity spans or complex metadata, it's recommended to convert them to Prodigy's JSONL format first.

prodigy dataset my_set "A new dataset"
prodigy db-in my_set annotations.jsonl --answer "accept"
✨ Imported 600 annotations to 'my_set'.
Added 'accept' answer to 600 annotations.

Using the --answer option, you can set whether the annotations are correct or incorrect, and how Prodigy should train from them later. You can also choose to --overwrite all existing answers in your data.
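Conceptually, the --answer and --overwrite options behave like this sketch (illustrative only, not Prodigy's internals):

```python
def set_answers(examples, answer="accept", overwrite=False):
    # Add an "answer" to each imported example. Existing answers
    # are kept unless overwrite=True.
    for eg in examples:
        if overwrite or "answer" not in eg:
            eg = {**eg, "answer": answer}
        yield eg

examples = [{"text": "a"}, {"text": "b", "answer": "reject"}]
imported = list(set_answers(examples, answer="accept"))
# The first example gets "accept"; the existing "reject" is kept
```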

Train a model

Once all of the annotations are collected, the best accuracy can usually be achieved by retraining from scratch. You can choose to supply an evaluation set, create an evaluation set interactively, or split off a percentage of the collected annotations for evaluation using the --eval-split option. If you've specified an output directory, Prodigy will export the best model, together with the training and evaluation data.

prodigy textcat.batch-train gh_issues --output-model /tmp/model --eval-split 0.2
Loaded blank model
Using 20% of examples (156) for evaluation
Using 100% of remaining examples for training
Correct      142
Incorrect    14
Baseline     0.65
Precision    0.87
Recall       0.87
F-score      0.87
Model: /tmp/model
Training data: /tmp/model/training.jsonl
Evaluation data: /tmp/model/evaluation.jsonl
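The --eval-split option holds out a fraction of the dataset before training. Conceptually, it works like the sketch below – this is not the recipe's actual implementation, just the idea:

```python
import random

def train_eval_split(examples, eval_split=0.2, seed=0):
    # Shuffle, then hold out a fraction of the examples for
    # evaluation; the rest is used for training.
    examples = list(examples)
    random.Random(seed).shuffle(examples)
    n_eval = int(len(examples) * eval_split)
    return examples[n_eval:], examples[:n_eval]

train, evaluation = train_eval_split(range(780), eval_split=0.2)
# With 780 examples, a 0.2 split holds out 156 for evaluation
```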

Use a model

After training the model, Prodigy outputs a ready-to-use spaCy model, making it easy to put into production. It's recommended to use spaCy's package command to turn the model into a loadable Python package.

spacy package /tmp/model /tmp --create-meta
python /tmp/en_model/setup.py sdist
pip install /tmp/en_model/dist/en_model-1.0.0.tar.gz

Usage in spaCy v2.0

import spacy

nlp = spacy.load('en_model')
doc = nlp(u"As New Zealand Courts Tech Talent, Isolation Becomes a Draw")

# Named entities
print([(ent.text, ent.label_) for ent in doc.ents])

# Text classification
print(doc.cats)