First Steps with Prodigy

Before you get started, make sure Prodigy is installed in your current environment, and the Prodigy home directory is created in the right place. You can always use the PRODIGY_HOME environment variable to change it to a custom location. Also make sure you've downloaded and installed a spaCy model, e.g. the default English model.
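
For example, on a bash-like shell (the home directory path below is just a placeholder):

export PRODIGY_HOME=/path/to/prodigy_home
python -m spacy download en_core_web_sm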

For a detailed API reference and available options, see the PRODIGY_README.html available for download with Prodigy. For questions and bug reports, or to exchange ideas and recipes with other users, check out the Prodigy Support forum.

Create a new dataset

Datasets let you group annotations together. It's recommended to create a new dataset for each annotation project, evaluation run or experiment. If you add a description or author, this info will also be displayed in the web application.

prodigy dataset my_set "A new dataset" --author Me
✨ Created dataset 'my_set'.

A single annotation can be part of several datasets. Prodigy will also create a session dataset for each individual annotation session, using the timestamp as the dataset name. For a list of all datasets and sessions, use the prodigy stats -ls command.
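
If you prefer to inspect your datasets from Python, here's a minimal sketch using Prodigy's database API, assuming the default connection settings:

from prodigy.components.db import connect

db = connect()                       # connects to the database configured in prodigy.json
print(db.datasets)                   # names of all datasets
examples = db.get_dataset('my_set')  # annotated examples in 'my_set'
print(len(examples))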

Annotate data

To start annotating, you need a source of examples. You can either load in your own data, or use one of the sample datasets below.

News headlines

200 headlines from stories about Silicon Valley from The New York Times.

GitHub issues

830 GitHub issue titles from search queries related to documentation and instructions.

Named Entity Recognition

The ner.teach recipe uses a spaCy model to detect entities in the stream of examples. It then starts the web server so you can accept or reject the entity suggestions. As you annotate, the model is updated and Prodigy will use the updated predictions to suggest the most relevant entities for annotation. All annotations you collect will be stored in your dataset.

prodigy ner.teach my_set en_core_web_sm news_headlines.jsonl
✨ Starting the web server on port 8080...

[Annotation card: "Disruptions: The Echo Chamber of Silicon Valley", with "Silicon Valley" suggested as LOC (source: The New York Times)]

By default, all entities will be shown. To only annotate one or more specific entity labels, use the --label option, for example --label ORG or --label ORG,PERSON. Keep in mind that this recipe only works for entity types already present in the model. See the spaCy documentation for an overview of its NER annotation scheme.
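
For example, to only annotate organizations and people:

prodigy ner.teach my_set en_core_web_sm news_headlines.jsonl --label ORG,PERSON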

Text Classification

The textcat.teach recipe lets you start off with an existing spaCy model or a blank one. It starts the web server so you can accept or reject texts with a category label. As you annotate, the model is updated and Prodigy will use the updated predictions to suggest the most relevant texts for annotation. All annotations you collect will be stored in your dataset.

prodigy textcat.teach my_set en_core_web_sm news_headlines.jsonl --label POLITICS
✨ Starting the web server on port 8080...

[Annotation card: "Next Job for Obama? Silicon Valley Is Hiring", with the suggested label POLITICS (source: The New York Times)]

Load your own data

Prodigy supports loading data from a variety of different file types and will use the file extension to determine which loader to use. For JSON-like and CSV formats, the text you want to load should be stored under the key or column header "text".

my_data.jsonl

{"text": "Pinterest Hires Its First Head of Diversity"} {"text": "Airbnb and Others Set Terms for Employees to Cash Out"}

my_data.txt

Pinterest Hires Its First Head of Diversity
Airbnb and Others Set Terms for Employees to Cash Out
prodigy ner.teach my_set en_core_web_sm /path/to/my_data.jsonl
prodigy textcat.teach my_set en_core_web_sm /path/to/my_data.txt --label POLITICS
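
If you need more control, you can also load files programmatically. A minimal sketch using Prodigy's JSONL loader (the path is a placeholder):

from prodigy.components.loaders import JSONL

stream = JSONL('/path/to/my_data.jsonl')  # yields dicts like {"text": "..."}
for eg in stream:
    print(eg['text'])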

Streaming data from live APIs

Streaming in content like news headlines or images from live APIs is a great way to jumpstart your project, test how your model is performing on real-world data or quickly bootstrap a set of evaluation examples. To get started, pick one of the supported APIs, sign up for a key and add it to your prodigy.json config file.
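
For example, your prodigy.json could include an entry like this (a sketch only; the exact key names here are an assumption, so check the README for the format your version expects):

{
    "api_keys": {
        "guardian": "YOUR_API_KEY"
    }
}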

prodigy textcat.teach my_set en_core_web_sm "Silicon Valley" --api guardian --label POLITICS

Import existing annotations

If you've created annotations using a different tool, you can import them into Prodigy via the db-in command. This will let you use them for training or evaluation. All loadable file types, like .jsonl or .csv, are supported. If your annotations contain entity spans or complex metadata, it's recommended to convert them to Prodigy's JSONL format first.
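
For reference, a single annotation in Prodigy's JSONL format with an entity span could look like this (the text and character offsets are illustrative only):

{"text": "Uber hires a new CEO", "spans": [{"start": 0, "end": 4, "label": "ORG"}], "answer": "accept"}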

prodigy dataset my_set "A new dataset"
prodigy db-in my_set annotations.jsonl --answer "accept"
✨ Imported 600 annotations to 'my_set'.
Added 'accept' answer to 600 annotations.

Using the --answer option, you can set whether the annotations are correct or incorrect, and how Prodigy should train from them later. You can also choose to --overwrite all existing answers in your data.

Train a model

Once you've collected enough annotations, the best accuracy can usually be achieved by retraining the model from scratch. You can choose to supply an evaluation set, create one interactively, or split off a percentage of the collected annotations for evaluation using the --eval-split option. If you specify an output directory, Prodigy will export the best model, together with the training and evaluation data.

prodigy textcat.batch-train gh_issues /tmp/model --eval-split 0.2 --label DOCUMENTATION
Loaded blank model
Using 20% of examples (156) for evaluation
Using 100% of remaining examples for training

Correct     142
Incorrect   14
Baseline    0.65
Precision   0.87
Recall      0.87
F-score     0.87

Model: /tmp/model
Training data: /tmp/model/training.jsonl
Evaluation data: /tmp/model/evaluation.jsonl

Use a model

After training the model, Prodigy outputs a ready-to-use spaCy model, making it easy to put into production. It's recommended to use spaCy's package command to turn the model into a loadable Python package.

spacy package /tmp/model /tmp --create-meta
cd /tmp/en_model
python setup.py sdist
pip install dist/en_model-1.0.0.tar.gz

Usage in spaCy v2.0

import spacy

nlp = spacy.load('en_model')
doc = nlp(u"As New Zealand Courts Tech Talent, Isolation Becomes a Draw")

# Named entities
print([(ent.text, ent.label_) for ent in doc.ents])

# Text classification
print(doc.cats)
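
Since doc.cats maps every label the model was trained with to a score between 0 and 1, here is a short follow-up sketch, continuing the example above, for picking the highest-scoring label:

# e.g. ('POLITICS', 0.87), depending on your trained labels
best_label, best_score = max(doc.cats.items(), key=lambda item: item[1])
print(best_label, best_score)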