API

Built-in Recipes

A Prodigy recipe is a Python function that can be run via the command line. Prodigy comes with lots of useful recipes, and it’s very easy to write your own. Recipes don’t have to start the web server – you can also use the recipe decorator as a quick way to make your Python function into a command-line utility. To view the recipe arguments and documentation on the command line, run the command with --help, for example prodigy ner.manual --help.

Named Entity Recognition: Tag names and concepts as spans in text.
Span Categorization: Label arbitrary and potentially overlapping spans in text.
Text Classification: Assign one or more categories to whole texts.
Part-of-speech Tagging: Assign part-of-speech tags to tokens.
Sentence Segmentation: Assign sentence boundaries.
Dependency Parsing: Assign and correct syntactic dependency attachments in text.
Coreference Resolution: Resolve mentions and references to the same words in text.
Relations: Annotate any relations between words and phrases.
Computer Vision: Annotate images and image segments.
Audio & Video: Annotate and segment audio and video files.
Training: Train models and export training corpora.
Vectors & Terminology: Create patterns and terminology lists from word vectors.
Review & Evaluate: Review annotations and outputs and resolve conflicts.
Utilities & Commands: Manage datasets, view data and streams, and more.
Deprecated Recipes: Recipes that have already been replaced by better alternatives.

Named Entity Recognition

ner.manual manual

  • Interface: ner_manual
  • Saves: annotations to the database
  • Use case: highlight names and concepts in text manually or semi-manually

Mark entity spans in a text by highlighting them and selecting the respective labels. The model is used to tokenize the text to allow less sensitive highlighting, since the token boundaries are used to set the entity spans. The label set can be defined as a comma-separated list on the command line or as a path to a text file with one label per line. If no labels are specified, Prodigy will check if labels are present in the model. This recipe does not require an entity recognizer, and doesn’t do any active learning.
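The snapping of a rough selection to token boundaries can be sketched as follows. This is a minimal illustration, not Prodigy's implementation: the regex tokenizer stands in for a spaCy pipeline, and `tokenize` and `snap_to_tokens` are hypothetical names.

```python
import re

def tokenize(text):
    """Return (start, end) character offsets; a stand-in for a spaCy tokenizer."""
    return [(m.start(), m.end()) for m in re.finditer(r"\w+|[^\w\s]", text)]

def snap_to_tokens(text, start, end):
    """Expand a rough character selection to the enclosing token boundaries."""
    touched = [(s, e) for s, e in tokenize(text) if s < end and e > start]
    if not touched:
        return None  # the selection only covered whitespace
    return touched[0][0], touched[-1][1]

text = "Apple updates its iPhone lineup"
print(snap_to_tokens(text, 2, 4))    # selection inside "Apple"
print(snap_to_tokens(text, 19, 23))  # selection inside "iPhone"
```

Because the span is always widened to whole tokens, an annotator does not need to hit the exact first and last character of an entity.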


prodigy ner.manual dataset spacy_model source --loader --label --patterns --exclude --highlight-chars

| Argument | Type | Description | Default |
| dataset | str | Prodigy dataset to save annotations to. | |
| spacy_model | str | Loadable spaCy pipeline for tokenization or blank:lang for a blank model (e.g. blank:en for English). | |
| source | str | Path to text source or - to read from standard input. | |
| --loader, -lo | str | Optional ID of text source loader. If not set, source file extension is used to determine loader. | None |
| --label, -l | str | One or more labels to annotate. Supports a comma-separated list or a path to a file with one label per line. If no labels are set, Prodigy will check the model for available labels. | |
| --patterns, -pt | str | (New: 1.9) Optional path to match patterns file to pre-highlight entity spans. | None |
| --exclude, -e | str | Comma-separated list of dataset IDs containing annotations to exclude. | None |
| --highlight-chars, -C | bool | (New: 1.10) Allow highlighting individual characters instead of snapping to token boundaries. If set, no "tokens" information will be saved with the example. | |

Example

prodigy ner.manual ner_news en_core_web_sm ./news_headlines.jsonl --label PERSON,ORG,PRODUCT
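The two accepted forms of the --label value (a comma-separated list, or a path to a file with one label per line) could be resolved along these lines. This is only a sketch of the documented behavior; `parse_labels` is a hypothetical helper, not a Prodigy function.

```python
from pathlib import Path

def parse_labels(value):
    """Return labels from a comma-separated string or a one-label-per-line file."""
    if not value:
        return []
    path = Path(value)
    if path.is_file():
        lines = path.read_text(encoding="utf-8").splitlines()
        return [line.strip() for line in lines if line.strip()]
    return [label.strip() for label in value.split(",") if label.strip()]

print(parse_labels("PERSON,ORG,PRODUCT"))
```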

ner.correct manual

  • Interface: ner_manual
  • Saves: annotations to the database
  • Use case: correct a spaCy model's predictions manually

Create gold-standard data for NER by correcting the model’s suggestions. The spaCy pipeline will be used to predict entities contained in the text, which the annotator can remove and correct if necessary.


prodigy ner.correct dataset spacy_model source --loader --label --update --exclude --unsegmented --component

| Argument | Type | Description | Default |
| dataset | str | Prodigy dataset to save annotations to. | |
| spacy_model | str | Loadable spaCy pipeline. | |
| source | str | Path to text source or - to read from standard input. | |
| --loader, -lo | str | Optional ID of text source loader. If not set, source file extension is used to determine loader. | None |
| --label, -l | str | One or more labels to annotate. Supports a comma-separated list or a path to a file with one label per line. If no labels are set, Prodigy will check the model for available labels. | |
| --update, -UP | bool | (New: 1.11) Update the model in the loop with the received annotations. | False |
| --exclude, -e | str | Comma-separated list of dataset IDs containing annotations to exclude. | None |
| --unsegmented, -U | bool | Don’t split sentences. | False |
| --component, -c | str | (New: 1.11) Name of NER component in the pipeline. | "ner" |

Example

prodigy ner.correct gold_ner en_core_web_sm ./news_headlines.jsonl --label PERSON,ORG

ner.teach binary

  • Interface: ner
  • Saves: annotations to the database
  • Updates: spaCy model in the loop
  • Active learning: prefers most uncertain scores
  • Use case: updating and improving NER models

Collect the best possible training data for a named entity recognition model with the model in the loop. Based on your annotations, Prodigy will decide which questions to ask next. If the suggested entity is fully correct, you can accept it. If it’s entirely or partially wrong, you should reject it. As of v1.11, the recipe will also ask you about examples containing no entities at all, which can improve overall accuracy of your model. So if you see an example with no highlighted suggestions, you can accept it if the text contains no entities, or reject it if it does contain entities of the labels you’re annotating.
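The idea behind "prefers most uncertain scores" can be illustrated with a small sketch: candidates whose score is closest to 0.5 are asked about first. This assumes a finite list of already-scored examples; Prodigy itself works on streams, and `prefer_uncertain` is a hypothetical name.

```python
def prefer_uncertain(scored_examples, batch_size=2):
    """Order (score, example) pairs so scores closest to 0.5 come first."""
    ranked = sorted(scored_examples, key=lambda pair: abs(pair[0] - 0.5))
    return [example for _, example in ranked[:batch_size]]

candidates = [
    (0.98, {"text": "Apple", "label": "ORG"}),
    (0.52, {"text": "Amazon", "label": "ORG"}),
    (0.07, {"text": "the", "label": "ORG"}),
    (0.41, {"text": "Jaguar", "label": "ORG"}),
]
for example in prefer_uncertain(candidates):
    print(example["text"])
```

Suggestions the model is already confident about (very high or very low scores) teach it little, which is why they are ranked last here.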


prodigy ner.teach dataset spacy_model source --loader --label --patterns --exclude --unsegmented

| Argument | Type | Description | Default |
| dataset | str | Prodigy dataset to save annotations to. | |
| spacy_model | str | Loadable spaCy pipeline. | |
| source | str | Path to text source or - to read from standard input. | |
| --loader, -lo | str | Optional ID of text source loader. If not set, source file extension is used to determine loader. | None |
| --label, -l | str | Label(s) to annotate. Accepts single label or comma-separated list. If not set, all available labels will be returned. | None |
| --patterns, -pt | str | Optional path to match patterns file to pre-highlight entity spans in addition to those suggested by the model. | None |
| --exclude, -e | str | Comma-separated list of dataset IDs containing annotations to exclude. | None |
| --unsegmented, -U | bool | Don’t split sentences. | False |

Example

prodigy ner.teach ner_news en_core_web_sm ./news_headlines.jsonl --label PERSON,EVENT

ner.silver-to-gold manual

  • Interface: ner_manual
  • Saves: annotations to the database
  • Use case: converting binary datasets to gold-standard data with no missing values

Take existing “silver” datasets with binary accept/reject annotations, merge the annotations to find the best possible analysis given the constraints defined in the annotations, and manually edit it to create a perfect and complete “gold” dataset.
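As a rough illustration of the first step, binary decisions about the same text can be collected into constraints: spans known to be correct and spans known to be wrong. The sketch below only gathers those constraints; finding the best analysis under them is left to the model and is not shown. `merge_binary` is a hypothetical helper, and the example dicts assume "text", "spans" and "answer" keys.

```python
from collections import defaultdict

def merge_binary(annotations):
    """Group accept/reject decisions by text into known and ruled-out spans."""
    merged = defaultdict(lambda: {"accepted": set(), "rejected": set()})
    for eg in annotations:
        if eg.get("answer") not in ("accept", "reject"):
            continue  # skipped examples carry no constraint
        key = "accepted" if eg["answer"] == "accept" else "rejected"
        for span in eg.get("spans", []):
            merged[eg["text"]][key].add((span["start"], span["end"], span["label"]))
    return dict(merged)

silver = [
    {"text": "Apple hires Tim", "spans": [{"start": 0, "end": 5, "label": "ORG"}], "answer": "accept"},
    {"text": "Apple hires Tim", "spans": [{"start": 12, "end": 15, "label": "ORG"}], "answer": "reject"},
    {"text": "Apple hires Tim", "spans": [{"start": 12, "end": 15, "label": "PERSON"}], "answer": "accept"},
]
print(merge_binary(silver))
```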


prodigy ner.silver-to-gold dataset silver_sets spacy_model --label

| Argument | Type | Description | Default |
| dataset | str | Prodigy dataset ID to save annotations to. | |
| silver_sets | str | Comma-separated names of existing binary datasets to convert. | |
| spacy_model | str | Loadable spaCy pipeline. | |
| --label, -l | str | One or more labels to annotate. Supports a comma-separated list or a path to a file with one label per line. If no labels are set, Prodigy will check the model for available labels. | |

ner.eval-ab binary

  • Interface: choice
  • Saves: evaluation results to the database
  • Use case: comparing and evaluating two models (e.g. before and after training)

Load two models and a stream of text, compare their predictions and select which result you prefer. The outputs will be randomized, so you won’t know which model is which. When you stop the server, the results are calculated. This recipe is especially helpful if you’re updating an existing model or if you’re trying out a new strategy on the same problem. Even if two models achieve similar accuracy, one of them can still be subjectively “better”, so this recipe lets you analyze that.
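The blind comparison can be sketched as a choice task with shuffled options, followed by a simple tally once annotation stops. This is a simplified illustration under assumptions: the "options" and "accept" keys follow the choice interface's general shape, the model outputs are plain strings, and `make_question` and `tally` are hypothetical names.

```python
import random
from collections import Counter

def make_question(text, output_a, output_b, rng):
    """Build a choice task whose option order hides which model is which."""
    options = [{"id": "A", "text": output_a}, {"id": "B", "text": output_b}]
    rng.shuffle(options)
    return {"text": text, "options": options}

def tally(answers):
    """Count how often each model's output was preferred."""
    return Counter(a["accept"][0] for a in answers if a.get("accept"))

rng = random.Random(0)
question = make_question("Apple hires Tim", "Apple=ORG", "Apple=ORG, Tim=PERSON", rng)
answers = [{"accept": ["B"]}, {"accept": ["A"]}, {"accept": ["B"]}, {"accept": []}]
print(tally(answers))
```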


prodigy ner.eval-ab dataset model_a model_b source --loader --label --exclude --unsegmented

| Argument | Type | Description | Default |
| dataset | str | Prodigy dataset to save annotations to. | |
| model_a | str | First loadable spaCy pipeline to compare. | |
| model_b | str | Second loadable spaCy pipeline to compare. | |
| source | str | Path to text source or - to read from standard input. | |
| --loader, -lo | str | Optional ID of text source loader. If not set, source file extension is used to determine loader. | None |
| --label, -l | str | One or more labels to annotate. Supports a comma-separated list or a path to a file with one label per line. If no labels are set, Prodigy will check the model for available labels. | |
| --exclude, -e | str | Comma-separated list of dataset IDs containing annotations to exclude. | None |
| --unsegmented, -U | bool | Don’t split sentences. | False |

Example

prodigy ner.eval-ab eval_dataset en_core_web_sm ./improved_ner_model ./news_headlines.jsonl

Span Categorization

spans.manual manual New: 1.11

  • Interface: spans_manual
  • Saves: annotations to the database
  • Use case: highlight arbitrary and potentially overlapping spans of text manually or semi-manually

Mark entity spans in a text by highlighting them and selecting the respective labels. The model is used to tokenize the text to allow less sensitive highlighting, since the token boundaries are used to set the entity spans. The label set can be defined as a comma-separated list on the command line or as a path to a text file with one label per line. If no labels are specified, Prodigy will check if labels are present in the model. This recipe does not require an entity recognizer, and doesn’t do any active learning.


prodigy spans.manual dataset spacy_model source --loader --label --patterns --suggester --exclude --highlight-chars

| Argument | Type | Description | Default |
| dataset | str | Prodigy dataset to save annotations to. | |
| spacy_model | str | Loadable spaCy pipeline for tokenization or blank:lang for a blank model (e.g. blank:en for English). | |
| source | str | Path to text source or - to read from standard input. | |
| --loader, -lo | str | Optional ID of text source loader. If not set, source file extension is used to determine loader. | None |
| --label, -l | str | One or more labels to annotate. Supports a comma-separated list or a path to a file with one label per line. If no labels are set, Prodigy will check the model for available labels. | |
| --patterns, -pt | str | Optional path to match patterns file to pre-highlight entity spans. | None |
| --suggester, -sg | str | Optional name of suggester function registered in spaCy’s misc registry. If set, annotations will be validated against the suggester during annotation and you will see an error if the annotation doesn’t match any suggestions. Should be a function that creates the suggester with all required arguments. You can use the -F option to provide a Python file. | None |
| --exclude, -e | str | Comma-separated list of dataset IDs containing annotations to exclude. | None |
| --highlight-chars, -C | bool | Allow highlighting individual characters instead of snapping to token boundaries. If set, no "tokens" information will be saved with the example. | |

Example

prodigy spans.manual covid_articles blank:en ./journal_papers.jsonl --label FACTOR,CONDITION,METHOD,EFFECT

If you’re using a custom suggester function for the span categorizer, you can provide it via the --suggester argument and Prodigy will validate submitted annotations against it as you annotate. If you’re not using a suggester, data-to-spacy and train will infer the best-matching ngram suggester based on the available span annotations in your data.

Example with suggester validation

prodigy spans.manual covid_articles blank:en ./journal_papers.jsonl --label CONDITION --suggester 123_ngram_suggester.v1 -F ./suggester.py
suggester.py
from spacy import registry
from spacy.pipeline.spancat import build_ngram_suggester

@registry.misc("123_ngram_suggester.v1")
def custom_ngram_suggester():
    return build_ngram_suggester(sizes=[1, 2, 3])  # all ngrams of size 1, 2 and 3
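To make the validation step concrete, here is a dependency-free sketch of what "matching a suggestion" means for an n-gram suggester of sizes 1, 2 and 3. It does not call spaCy; `ngram_spans` and `is_valid_annotation` are hypothetical helpers, and token offsets are assumed to be end-exclusive.

```python
def ngram_spans(tokens, sizes=(1, 2, 3)):
    """Return (start, end) token offsets for every n-gram of the given sizes."""
    return {
        (start, start + size)
        for size in sizes
        for start in range(len(tokens) - size + 1)
    }

def is_valid_annotation(tokens, token_start, token_end, sizes=(1, 2, 3)):
    """Check that an annotated token span is one the suggester would propose."""
    return (token_start, token_end) in ngram_spans(tokens, sizes)

tokens = ["chronic", "obstructive", "pulmonary", "disease"]
print(is_valid_annotation(tokens, 0, 3))  # three tokens: within the sizes
print(is_valid_annotation(tokens, 0, 4))  # four tokens: no suggestion covers it
```

A four-token annotation would be rejected here because the suggester never proposes spans longer than three tokens, so the span categorizer could not learn to predict it.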

spans.correct manual New: 1.11.1

  • Interface: spans_manual
  • Saves: annotations to the database
  • Use case: correct a spaCy model's predictions manually

Create gold-standard data for span categorization by correcting the model’s predictions. Requires a spaCy pipeline with a trained span categorizer and will show all spans in the given group. To customize the span group to read from, you can use the --key argument.


prodigy spans.correct dataset spacy_model source --loader --label --update --exclude --component

| Argument | Type | Description | Default |
| dataset | str | Prodigy dataset to save annotations to. | |
| spacy_model | str | Loadable spaCy pipeline. | |
| source | str | Path to text source or - to read from standard input. | |
| --loader, -lo | str | Optional ID of text source loader. If not set, source file extension is used to determine loader. | None |
| --label, -l | str | One or more labels to annotate. Supports a comma-separated list or a path to a file with one label per line. If no labels are set, Prodigy will check the model for available labels. | |
| --update, -UP | bool | Update the model in the loop with the received annotations. | False |
| --exclude, -e | str | Comma-separated list of dataset IDs containing annotations to exclude. | None |
| --component, -c | str | Name of span categorizer component in the pipeline. | "spancat" |

Example

prodigy spans.correct gold_spans ./spancat_model ./journal_papers_new.jsonl

Text Classification

textcat.manual manual

  • Interface: choice / classification
  • Saves: annotations to the database
  • Use case: select one or more categories to apply to the text

Manually annotate categories that apply to a text. If only one label is set, the classification interface is used. If more than one label is specified, the choice interface is used and categories are added as multiple choice options. If the --exclusive flag is set, categories become mutually exclusive, meaning that only one can be selected during annotation.
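The interface selection rule above can be written down as a small function. This is a sketch of the described logic only: `build_task` is a hypothetical helper, and the "options" and "choice_style" keys are assumptions about how a choice task and its config might be shaped.

```python
def build_task(text, labels, exclusive=False):
    """Return (interface, task, config) for one text, following the rules above."""
    if len(labels) == 1:
        # A single label becomes a binary accept/reject question.
        return "classification", {"text": text, "label": labels[0]}, {}
    options = [{"id": label, "text": label} for label in labels]
    config = {"choice_style": "single" if exclusive else "multiple"}
    return "choice", {"text": text, "options": options}, config

print(build_task("Stocks rally", ["Economy"]))
print(build_task("Stocks rally", ["Economy", "Politics"], exclusive=True))
```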


prodigy textcat.manual dataset source --loader --label --exclusive --exclude

| Argument | Type | Description | Default |
| dataset | str | Prodigy dataset to save annotations to. | |
| source | str | Path to text source or - to read from standard input. | |
| --loader, -lo | str | Optional ID of text source loader. If not set, source file extension is used to determine loader. | None |
| --label, -l | str | Category label to apply. | '' |
| --exclusive, -E | bool | Treat labels as mutually exclusive. If not set, an example may have multiple correct classes. | False |
| --exclude, -e | str | Comma-separated list of dataset IDs containing annotations to exclude. | None |

Example

prodigy textcat.manual news_topics ./news_headlines.jsonl --label Technology,Politics,Economy,Entertainment

textcat.correct manual New: 1.11

  • Interface: choice
  • Saves: annotations to the database
  • Use case: correct a spaCy model's predictions manually

Create training data for an existing trained text classification model by correcting the model’s suggestions. The --threshold is used to determine whether a label should be pre-selected, e.g. if it’s set to 0.5 (default), all labels with a score of 0.5 and above will be checked automatically. Prodigy will automatically infer whether the categories are mutually exclusive, based on the component configuration.
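The effect of the threshold can be shown with a few lines of Python. The scores below are made up for illustration, and `preselect` is a hypothetical helper rather than part of the recipe.

```python
def preselect(scores, threshold=0.5):
    """Return the labels whose score is at or above the threshold."""
    return [label for label, score in scores.items() if score >= threshold]

scores = {"Technology": 0.91, "Politics": 0.5, "Economy": 0.32}
print(preselect(scores))        # default threshold of 0.5
print(preselect(scores, 0.75))  # stricter threshold
```

Note that the comparison is inclusive, so a label scoring exactly 0.5 is pre-selected at the default threshold.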


prodigy textcat.correct dataset spacy_model source --loader --label --update --threshold --component --exclude

| Argument | Type | Description | Default |
| dataset | str | Prodigy dataset to save annotations to. | |
| spacy_model | str | Loadable spaCy pipeline. | |
| source | str | Path to text source or - to read from standard input. | |
| --loader, -lo | str | Optional ID of text source loader. If not set, source file extension is used to determine loader. | None |
| --label, -l | str | One or more labels to annotate. Supports a comma-separated list or a path to a file with one label per line. If no labels are set, Prodigy will check the model for available labels. | |
| --update, -UP | bool | Update the model in the loop with the received annotations. | False |
| --threshold, -t | float | Score threshold to pre-select label, e.g. 0.75 to select all labels with a score of 0.75 and above. | 0.5 |
| --component, -c | str | Name of text classification component in the pipeline. Will be guessed if not set. | None |
| --exclude, -e | str | Comma-separated list of dataset IDs containing annotations to exclude. | None |

Example

prodigy textcat.correct news_topics ./en_textcat_news ./news_headlines.jsonl --label Technology,Politics,Economy,Entertainment

textcat.teach binary

  • Interface: classification
  • Saves: annotations to the database
  • Updates: spaCy model in the loop
  • Active learning: prefers most uncertain scores
  • Use case: updating and improving text classification models

Collect the best possible training data for a text classification model with the model in the loop. Based on your annotations, Prodigy will decide which questions to ask next. All annotations will be stored in the database. If a patterns file is supplied via the --patterns argument, the matches will be included in the stream and the matched spans are highlighted, so you’re able to tell which words or phrases the selection was based on. Note that the exact pattern matches have no influence when updating the model; they’re only used to help pre-select examples for annotation.


prodigy textcat.teach dataset spacy_model source --loader --label --patterns --exclude --unsegmented

| Argument | Type | Description | Default |
| dataset | str | Prodigy dataset to save annotations to. | |
| spacy_model | str | Loadable spaCy pipeline or blank:lang for a blank model (e.g. blank:en for English). | |
| source | str | Path to text source or - to read from standard input. | |
| --loader, -lo | str | Optional ID of text source loader. If not set, source file extension is used to determine loader. | None |
| --label, -l | str | Category label to apply. | '' |
| --patterns, -pt | str | Optional path to match patterns file to filter out examples containing terms and phrases. | None |
| --exclude, -e | str | Comma-separated list of dataset IDs containing annotations to exclude. | None |

Example

prodigy textcat.teach news_topics en_core_web_sm ./news_headlines.jsonl --label Technology,Politics,Economy,Entertainment

Part-of-speech Tagging

pos.correct manual

  • Interface: pos_manual
  • Saves: annotations to the database
  • Use case: correct a spaCy model's predictions manually

Create gold-standard data for part-of-speech tagging by correcting the model’s suggestions. The spaCy pipeline will be used to predict fine-grained part-of-speech tags (Token.tag_), which the annotator can remove and correct if necessary. It’s often more efficient to focus on a few labels at a time, instead of annotating all labels jointly.


prodigy pos.correct dataset spacy_model source --loader --label --exclude --unsegmented

| Argument | Type | Description | Default |
| dataset | str | Prodigy dataset to save annotations to. | |
| spacy_model | str | Loadable spaCy pipeline. | |
| source | str | Path to text source or - to read from standard input. | |
| --loader, -lo | str | Optional ID of text source loader. If not set, source file extension is used to determine loader. | None |
| --label, -l | str | One or more tags to annotate. Supports a comma-separated list or a path to a file with one label per line. If not set, all tags are shown. | |
| --exclude, -e | str | Comma-separated list of dataset IDs containing annotations to exclude. | None |
| --unsegmented, -U | bool | Don’t split sentences. | False |

Example

prodigy pos.correct news_tag en_core_web_sm ./news_headlines.jsonl --label NN,NNS,NNP,NNPS

pos.teach binary

  • Interface: pos
  • Saves: annotations to the database
  • Updates: spaCy model in the loop
  • Active learning: prefers most uncertain scores
  • Use case: updating and improving part-of-speech tagging models

Collect the best possible training data for a part-of-speech tagging model with the model in the loop. Based on your annotations, Prodigy will decide which questions to ask next. It’s often more efficient to focus on a few labels at a time, instead of annotating all labels jointly.


prodigy pos.teach dataset spacy_model source --loader --label --exclude --unsegmented

| Argument | Type | Description | Default |
| dataset | str | Prodigy dataset to save annotations to. | |
| spacy_model | str | Loadable spaCy pipeline. | |
| source | str | Path to text source or - to read from standard input. | |
| --loader, -lo | str | Optional ID of text source loader. If not set, source file extension is used to determine loader. | None |
| --label, -l | str | Label(s) to annotate. Accepts single label or comma-separated list. If not set, all available labels will be returned. | None |
| --exclude, -e | str | Comma-separated list of dataset IDs containing annotations to exclude. | None |
| --unsegmented, -U | bool | Don’t split sentences. | False |

Example

prodigy pos.teach tag_news en_core_web_sm ./news_headlines.jsonl --label NN,NNS,NNP,NNPS

Sentence Segmentation

sent.correct manual New: 1.11

  • Interface: pos_manual
  • Saves: annotations to the database
  • Use case: correct a spaCy model's predictions manually

Create gold-standard data for sentence segmentation by correcting the model’s suggestions. The spaCy pipeline will be used to predict sentence boundaries, which the annotator can correct if necessary. The recipe uses the label S to mark tokens that start a sentence. You can double-click a sentence start token in the UI to add a new sentence boundary, or click on an incorrect prediction to remove it.
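How a set of S-labelled tokens translates into sentences can be sketched as follows. The function is a hypothetical illustration; it assumes the first token always opens a sentence, which the text above does not state explicitly.

```python
def sentences_from_starts(tokens, start_indices):
    """Split a token list into sentences at the tokens labelled S."""
    if not tokens:
        return []
    # Assumption: token 0 starts a sentence even if it was not labelled.
    starts = sorted(set(start_indices) | {0})
    bounds = starts + [len(tokens)]
    return [tokens[a:b] for a, b in zip(bounds, bounds[1:])]

tokens = ["It", "works", ".", "Try", "it", "!"]
print(sentences_from_starts(tokens, [0, 3]))
```

Removing an incorrect S label simply merges the two neighbouring sentences, and adding one splits a sentence in two.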


prodigy sent.correct dataset spacy_model source --loader --exclude

| Argument | Type | Description | Default |
| dataset | str | Prodigy dataset to save annotations to. | |
| spacy_model | str | Loadable spaCy pipeline. | |
| source | str | Path to text source or - to read from standard input. | |
| --loader, -lo | str | Optional ID of text source loader. If not set, source file extension is used to determine loader. | None |
| --exclude, -e | str | Comma-separated list of dataset IDs containing annotations to exclude. | None |

Example

prodigy sent.correct sentence_data xx_sent_ud_sm ./paragraphs.jsonl

sent.teach binary

  • Interface: pos
  • Saves: annotations to the database
  • Updates: spaCy model in the loop
  • Active learning: prefers most uncertain scores
  • Use case: updating and improving sentence segmentation models

Collect the best possible training data for a sentence segmentation model with the model in the loop. Based on your annotations, Prodigy will decide which questions to ask next. The recipe uses S to mark tokens that start sentences and I for all other tokens. You can then hit accept or reject, depending on whether the suggested token is correctly labelled as a sentence start or other token.


prodigy sent.teach dataset spacy_model source --loader --exclude

| Argument | Type | Description | Default |
| dataset | str | Prodigy dataset to save annotations to. | |
| spacy_model | str | Loadable spaCy pipeline. | |
| source | str | Path to text source or - to read from standard input. | |
| --loader, -lo | str | Optional ID of text source loader. If not set, source file extension is used to determine loader. | None |
| --exclude, -e | str | Comma-separated list of dataset IDs containing annotations to exclude. | None |

Example

prodigy sent.teach sent_news xx_sent_ud_sm ./news_headlines.jsonl

Dependency Parsing

dep.correct manual New: 1.10

  • Interface: relations
  • Saves: annotations to the database
  • Updates: spaCy model in the loop (if enabled)
  • Active learning: no example selection
  • Use case: correct a spaCy model's predictions manually

Create gold-standard data for dependency parsing by correcting the model’s suggestions. The spaCy pipeline will be used to predict dependencies for the given labels, which the annotator can remove and correct if necessary. If --update is set, the model in the loop will be updated with the annotations and its updated predictions will be reflected in future batches. The recipe performs no example selection and all texts will be shown as they come in.


prodigy dep.correct dataset spacy_model source --loader --label --update --wrap --unsegmented --exclude

| Argument | Type | Description | Default |
| dataset | str | Prodigy dataset to save annotations to. | |
| spacy_model | str | Loadable spaCy pipeline with a dependency parser. | |
| source | str | Path to text source or - to read from standard input. | |
| --loader, -lo | str | Optional ID of text source loader. If not set, source file extension is used to determine loader. | None |
| --label, -l | str | Label(s) to annotate. Accepts single label or comma-separated list. If not set, all available labels will be used. | None |
| --update, -UP | bool | Whether to update the model in the loop during annotation. | False |
| --wrap, -W | bool | Wrap lines in the UI by default (instead of showing tokens in one row). | False |
| --unsegmented, -U | bool | Don’t split sentences. | False |
| --exclude, -e | str | Comma-separated list of dataset IDs containing annotations to exclude. | None |

Example

prodigy dep.correct deps_news en_core_web_sm ./news_headlines.jsonl --label ROOT,csubj,nsubj,dobj,pobj --update

dep.teach binary

  • Interface: dep
  • Saves: annotations to the database
  • Updates: spaCy model in the loop
  • Active learning: prefers most uncertain scores
  • Use case: updating and improving dependency parsing models

Collect the best possible training data for a dependency parsing model with the model in the loop. Based on your annotations, Prodigy will decide which questions to ask next. It’s often more efficient to focus on a few most relevant labels at a time, instead of annotating all labels jointly.


prodigy dep.teach dataset spacy_model source --loader --label --exclude --unsegmented

| Argument | Type | Description | Default |
| dataset | str | Prodigy dataset to save annotations to. | |
| spacy_model | str | Loadable spaCy pipeline. | |
| source | str | Path to text source or - to read from standard input. | |
| --loader, -lo | str | Optional ID of text source loader. If not set, source file extension is used to determine loader. | None |
| --label, -l | str | Label(s) to annotate. Accepts single label or comma-separated list. If not set, all available labels will be returned. | None |
| --exclude, -e | str | Comma-separated list of dataset IDs containing annotations to exclude. | None |
| --unsegmented, -U | bool | Don’t split sentences. | False |

Example

prodigy dep.teach deps_news en_core_web_sm ./news_headlines.jsonl --label csubj,nsubj,dobj,pobj

Coreference Resolution

coref.manual manual New: 1.10

  • Interface: relations
  • Saves: annotations to the database
  • Use case: create annotations for coreference resolution

Create training data for coreference resolution. Coreference resolution is the challenge of linking ambiguous mentions such as “her” or “that woman” back to an antecedent providing more context about the entity in question. This recipe allows you to focus on nouns, proper nouns and pronouns specifically, by disabling all other tokens. You can customize the labels used to extract those using the recipe arguments. Also see the usage guide on coreference annotation.
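The idea of disabling all tokens except the relevant parts of speech can be sketched like this. The token dicts with "text" and "pos" keys and the "disabled" flag are assumptions for illustration, `mark_disabled` is a hypothetical helper, and the default tag set mirrors the --pos-tags default listed in the arguments for this recipe.

```python
def mark_disabled(tokens, pos_tags=("NOUN", "PROPN", "PRON", "DET")):
    """Copy tokens, flagging those whose coarse POS tag is not selectable."""
    return [{**tok, "disabled": tok["pos"] not in pos_tags} for tok in tokens]

tokens = [
    {"text": "Anna", "pos": "PROPN"},
    {"text": "said", "pos": "VERB"},
    {"text": "she", "pos": "PRON"},
    {"text": "left", "pos": "VERB"},
]
selectable = [t["text"] for t in mark_disabled(tokens) if not t["disabled"]]
print(selectable)
```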


prodigy coref.manual dataset spacy_model source --loader --label --pos-tags --poss-pron-tags --ner-labels --exclude

| Argument | Type | Description | Default |
| dataset | str | Prodigy dataset to save annotations to. | |
| spacy_model | str | Loadable spaCy pipeline with the required capabilities (entity recognizer and part-of-speech tagger) or blank:lang for a blank model (e.g. blank:en for English). | |
| source | str | Path to text source, - to read from standard input or dataset:name to load from existing annotations. | |
| --loader, -lo | str | Optional ID of text source loader. If not set, source file extension is used to determine loader. | None |
| --label, -l | str | Label(s) to use for coreference annotation. Accepts single label or comma-separated list. | "COREF" |
| --pos-tags, -ps | str | List of coarse-grained POS tags to enable for annotation. | "NOUN,PROPN,PRON,DET" |
| --poss-pron-tags, -pp | str | List of fine-grained tag values for possessive pronoun to use. | "PRP$" |
| --ner-labels, -nl | str | List of NER labels to use if model has a named entity recognizer. | "PERSON,ORG" |
| --exclude, -e | str | Comma-separated list of dataset IDs containing annotations to exclude. | None |

Example

prodigy coref.manual coref_movies en_core_web_sm ./plot_summaries.jsonl --label COREF

Relations

rel.manual manual New: 1.10

  • Interface: relations
  • Saves: annotations to the database
  • Use case: annotate relations between words and expressions

Annotate directional relations and dependencies between tokens and expressions by selecting the head, child and dependency label and optionally assign labelled spans for named entities or other expressions. This workflow is extremely powerful and can be used for basic dependency annotation, as well as joint named entity and entity relation annotation. If --span-label defines additional span labels, a second mode for span highlighting is added.

The recipe lets you take advantage of several efficiency tricks: spans can be pre-defined using an existing NER dataset, entities or noun phrases from a model or fully custom match patterns. You can also disable certain tokens to make them unselectable. This lets you focus on what matters and prevents annotators from introducing mistakes. For more details and examples, check out the usage guide on custom relation annotation and see the task-specific recipes dep.correct and coref.manual that include pre-defined configurations.


prodigy rel.manual dataset spacy_model source --loader --label --span-label --patterns --disable-patterns --add-ents --add-nps --wrap --exclude

| Argument | Type | Description | Default |
| dataset | str | Prodigy dataset to save annotations to. | |
| spacy_model | str | Loadable spaCy pipeline with the required capabilities (if entities or noun phrases should be merged) or blank:lang for a blank model (e.g. blank:en for English). | |
| source | str | Path to text source, - to read from standard input or dataset:name to load from existing annotations. | |
| --loader, -lo | str | Optional ID of text source loader. If not set, source file extension is used to determine loader. | None |
| --label, -l | str | Label(s) to annotate. Accepts single label or comma-separated list. | None |
| --span-label, -sl | str | Optional span label(s) to annotate. If set, an additional span highlighting mode is added. | None |
| --patterns, -pt | str | Path to patterns file defining spans to be added and merged. | None |
| --disable-patterns, -dpt | str | Path to patterns file defining tokens to disable (make unselectable). | None |
| --add-ents, -AE | bool | Add entities predicted by the model. | False |
| --add-nps, -AN | bool | Add noun phrases (if noun chunks rules are available), based on tagger and parser. | False |
| --wrap, -W | bool | Wrap lines in the UI by default (instead of showing tokens in one row). | False |
| --exclude, -e | str | Comma-separated list of dataset IDs containing annotations to exclude. | None |

Example

prodigy rel.manual relation_data en_core_web_sm ./data.jsonl --label COREF,OBJECT --span-label PERSON,PRODUCT,NP --disable-patterns ./disable_patterns.jsonl --add-ents --wrap
disable_patterns.jsonl
{"pattern": [{"is_punct": true}]}
{"pattern": [{"pos": "VERB"}]}
{"pattern": [{"lower": {"in": ["'s", "’s"]}}]}
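To show what the three disable patterns above select, here is a simplified, dependency-free matcher. It is not spaCy's Matcher: it only handles single-token patterns with exact values or an "in" list, and the token dicts (with "lower", "pos" and "is_punct" keys) are hypothetical stand-ins for real token attributes.

```python
def token_matches(token, spec):
    """Check one token (a dict of attributes) against a single-token pattern."""
    for attr, expected in spec.items():
        value = token.get(attr)
        if isinstance(expected, dict):
            if value not in expected.get("in", []):
                return False
        elif value != expected:
            return False
    return True

def disable_tokens(tokens, patterns):
    """Flag every token matched by at least one single-token pattern."""
    return [
        {**tok, "disabled": any(token_matches(tok, p["pattern"][0]) for p in patterns)}
        for tok in tokens
    ]

patterns = [
    {"pattern": [{"is_punct": True}]},
    {"pattern": [{"pos": "VERB"}]},
    {"pattern": [{"lower": {"in": ["'s", "’s"]}}]},
]
tokens = [
    {"text": "Anna", "lower": "anna", "pos": "PROPN", "is_punct": False},
    {"text": "'s", "lower": "'s", "pos": "PART", "is_punct": False},
    {"text": "dog", "lower": "dog", "pos": "NOUN", "is_punct": False},
    {"text": "barked", "lower": "barked", "pos": "VERB", "is_punct": False},
    {"text": ".", "lower": ".", "pos": "PUNCT", "is_punct": True},
]
print([t["text"] for t in disable_tokens(tokens, patterns) if not t["disabled"]])
```

With these patterns, punctuation, verbs and possessive markers become unselectable, leaving only the tokens that are plausible relation heads or children.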

Computer Vision

image.manual manual

  • Interface: image_manual
  • Saves: annotations to the database
  • Use case: Add bounding boxes and segments to images

Annotate images by drawing rectangular bounding boxes and polygon shapes. Each shape will be added to the task’s "spans" with its label and a "points" property containing the [x, y] pixel coordinate tuples. See here for more details on the JSONL format. You can click and drag or click and release to draw boxes. Polygon shapes can also be closed by double-clicking when adding the last point, similar to closing a shape in Photoshop or Illustrator. Clicking on the label will select a shape so you can change the label or delete it.
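The shape of such a span can be illustrated with a short sketch. Only the "spans", label and "points" properties come from the description above; the "image" key, the file name and the helper names are assumptions made for the example.

```python
def box_to_span(x, y, width, height, label):
    """Describe a rectangle as a labelled span with four [x, y] corner points."""
    points = [[x, y], [x, y + height], [x + width, y + height], [x + width, y]]
    return {"label": label, "points": points}

def bounding_box(points):
    """Return (x, y, width, height) of the box enclosing any list of points."""
    xs = [x for x, _ in points]
    ys = [y for _, y in points]
    return min(xs), min(ys), max(xs) - min(xs), max(ys) - min(ys)

# Hypothetical task: the "image" key and file name are assumptions.
task = {"image": "laptop.jpg", "spans": [box_to_span(10, 20, 100, 50, "LAPTOP")]}
print(task["spans"][0]["points"])
print(bounding_box(task["spans"][0]["points"]))
```

Because polygons are stored the same way, as a list of pixel coordinates, `bounding_box` works for them as well.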


prodigy image.manual dataset source --loader --label --exclude --width --darken --no-fetch --remove-base64
ArgumentTypeDescriptionDefault
datasetstrProdigy dataset to save annotations to.
sourcestrPath to a directory containing image files or pre-formatted JSONL file if --loader jsonl is set.
--loader, -lostrOptional ID of source loader.images
--label, -lstr / PathOne or more labels to annotate. Supports a comma-separated list or a path to a file with one label per line.
--exclude, -estrComma-separated list of dataset IDs containing annotations to exclude.None
--width, -wintNew: 1.10 Width of card and maximum image width in pixels.675
--darken, -DboolDarken image to make boxes stand out more.False
--no-fetch, -NFboolNew: 1.9 Don’t fetch images as base64. Ideally requires a JSONL file as input, with --loader jsonl set and all images available as URLs.False
--remove-base64, -RboolNew: 1.10 Remove base64-encoded image data before storing example in the database and only keep the reference to the local file path. Caution: If enabled, make sure to keep original files!False

Example

prodigy image.manual photo_objects ./stock-photos --label LAPTOP,CUP,PLANT

Audio and Video

audio.manual manual New: 1.10

  • Interface: audio_manual
  • Saves: annotations to the database
  • Use case: Manually annotate audio regions in audio and video files

Manually label regions for the given labels in the audio or video file. The recipe expects a directory of audio files as the source argument and will use the audio loader (default) to load the data. To load video files instead, you can set --loader video. Each added region will be added to the "audio_spans" with a start and end timestamp and the selected label.
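
To make the "audio_spans" structure more concrete, here is a minimal sketch that sums the annotated time per label. The key names "start", "end" and "label" are assumed from the description above, and the task values and the helper name seconds_per_label are hypothetical.

```python
from collections import defaultdict

# Hypothetical annotated task; "start", "end" and "label" are assumed key names.
task = {
    "audio_spans": [
        {"start": 0.5, "end": 2.0, "label": "SPEAKER_1"},
        {"start": 2.5, "end": 4.0, "label": "SPEAKER_2"},
        {"start": 4.0, "end": 6.5, "label": "SPEAKER_1"},
    ]
}


def seconds_per_label(task):
    """Sum the duration of all annotated regions, grouped by label."""
    totals = defaultdict(float)
    for span in task.get("audio_spans", []):
        totals[span["label"]] += span["end"] - span["start"]
    return dict(totals)


totals = seconds_per_label(task)
print(totals)
```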


prodigy audio.manual dataset source --loader --label --autoplay --keep-base64 --fetch-media --exclude
ArgumentTypeDescriptionDefault
datasetstrProdigy dataset to save annotations to.
sourcestrPath to a directory containing audio files or pre-formatted JSONL file if --loader jsonl is set.
--loader, -lostrOptional ID of source loader, e.g. audio or video.audio
--label, -lstr / PathOne or more labels to annotate. Supports a comma-separated list or a path to a file with one label per line.
--autoplay, -AboolAutoplay the audio when a new task loads.False
--keep-base64, -BboolIf audio loader is used: don’t remove the base64-encoded audio data from the task before it’s saved to the database.False
--fetch-media, -FMboolConvert local paths and URLs to base64. Can be enabled if you’re annotating a JSONL file with paths or for re-annotating an existing dataset.False
--exclude, -estrComma-separated list of dataset IDs containing annotations to exclude.None

Example

prodigy audio.manual speaker_data ./recordings --label SPEAKER_1,SPEAKER_2,NOISE

Recipe command

prodigy audio.manual speaker_data ./recordings --loader video --label SPEAKER_1,SPEAKER_2

audio.transcribe manual New: 1.10

  • Interface: blocks / audio / text_input
  • Saves: annotations to the database
  • Use case: Manually create transcriptions for audio and video files

Manually transcribe audio and video files by typing the transcript into a text field. The recipe expects a directory of audio files as the source argument and will use the audio loader (default) to load the data. To load video files instead, you can set --loader video. The transcript will be stored as the key "transcript". To make it easier to toggle play and pause as you transcribe and to prevent clashes with the text input field (like with the default enter), this recipe lets you customize the keyboard shortcuts. To toggle play/pause, you can press command/option/alt/ctrl+enter or provide your own overrides via --playpause-key.


prodigy audio.transcribe dataset source --loader --autoplay --keep-base64 --fetch-media --playpause-key --text-rows --field-id --exclude
ArgumentTypeDescriptionDefault
datasetstrProdigy dataset to save annotations to.
sourcestrPath to a directory containing audio files or pre-formatted JSONL file if --loader jsonl is set.
--loader, -lostrOptional ID of source loader, e.g. audio or video.audio
--autoplay, -AboolAutoplay the audio when a new task loads.False
--keep-base64, -BboolIf audio loader is used: don’t remove the base64-encoded audio data from the task before it’s saved to the database.False
--fetch-media, -FMboolConvert local paths and URLs to base64. Can be enabled if you’re annotating a JSONL file with paths or for re-annotating an existing dataset.False
--playpause-key, -pkstrAlternative keyboard shortcuts to toggle play/pause so it doesn’t conflict with text input field."command+enter, option+enter, ctrl+enter"
--text-rows, -trintHeight of the text input field, in rows.4
--field-id, -fistrNew: 1.10.1 Add the transcript text to the data using this key, e.g. "transcript": "Text here"."transcript"
--exclude, -estrComma-separated list of dataset IDs containing annotations to exclude.None

Example

prodigy audio.transcribe speaker_transcripts ./recordings --text-rows 3

Training models

train command New: 1.9

  • Interface: terminal only
  • Saves: trained model to a directory (optional)
  • Use case: run training experiments

Train a model with one or more components (NER, text classification, tagger, parser, sentence recognizer or span categorizer) using one or more Prodigy datasets with annotations. The recipe calls into spaCy directly and can update an existing model or train a new model from scratch. For each component, you can provide optional datasets for evaluation using the eval: prefix, e.g. --ner dataset,eval:eval_dataset. If no evaluation sets are specified, the --eval-split is used to determine the percentage held back for evaluation.

Datasets will be merged and conflicts will be filtered out. If your data contains potentially conflicting annotations, it’s recommended to first use review to resolve them. If you specify an output directory as the first argument, the best model will be saved at the end. You can then load it into spaCy by pointing spacy.load at the directory.
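
The eval: prefix convention can be pictured with a small sketch that separates training and evaluation dataset names from one comma-separated argument value. The helper split_datasets is hypothetical and only mirrors the convention described above; it is not Prodigy's own parsing code.

```python
def split_datasets(value):
    """Split a comma-separated dataset argument into (train, eval) name lists."""
    train, evaluation = [], []
    for name in value.split(","):
        name = name.strip()
        if not name:
            continue  # tolerate stray commas
        if name.startswith("eval:"):
            evaluation.append(name[len("eval:"):])
        else:
            train.append(name)
    return train, evaluation


train_sets, eval_sets = split_datasets("fashion_brands_training,eval:fashion_brands_eval")
print(train_sets, eval_sets)
```

If the evaluation list comes back empty for a component, that is the case in which --eval-split would apply.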


prodigy train output_dir --ner --textcat --textcat-multilabel --tagger --parser --senter --spancat --eval-split --config --base-model --lang --label-stats --verbose --silent --gpu-id overrides -F
ArgumentTypeDescriptionDefault
output_dirstrPath to output directory. If not set, nothing will be saved.None
--ner, -nstrOne or more (comma-separated) datasets for the named entity recognizer. Use the eval: prefix for evaluation sets.None
--textcat, -tcstrOne or more (comma-separated) datasets for the text classifier (exclusive categories). Use the eval: prefix for evaluation sets.None
--textcat-multilabel, -tcmstrOne or more (comma-separated) datasets for the text classifier (non-exclusive categories). Use the eval: prefix for evaluation sets.None
--tagger, -tstrOne or more (comma-separated) datasets for the part-of-speech tagger. Use the eval: prefix for evaluation sets.None
--parser, -pstrOne or more (comma-separated) datasets for the dependency parser. Use the eval: prefix for evaluation sets.None
--senter, -sstrOne or more (comma-separated) datasets for the sentence recognizer. Use the eval: prefix for evaluation sets.None
--spancat, -scstrOne or more (comma-separated) datasets for the span categorizer. Use the eval: prefix for evaluation sets.None
--config, -cstrOptional path to training config.cfg to use. If not set, it will be auto-generated using the default settings.None
--base-model, -mstrOptional spaCy pipeline to update or use for tokenization and sentence segmentation.None
--lang, -lstrCode of language to use if no config or base model are provided."en"
--eval-split, -esfloatIf no evaluation sets are provided for a component, split off a percentage of the training examples for evaluation.0.2
--label-stats, -LboolShow a breakdown of per-label stats after training.False
--verbose, -VboolEnable verbose logging.False
--silent, -SboolDon’t print any updates.False
--gpu-id, -gintGPU ID for training on GPU or -1 for CPU.-1
overridesanyConfig parameters to override. Should be options starting with -- that correspond to the config section and value to override, e.g. --training.batch_size 128.None
-FstrOne or more comma-separated paths to Python files to import, e.g. for custom registered functions.None

Example

prodigy train --ner fashion_brands_training,eval:fashion_brands_eval

======================== Generating Prodigy config ========================
Auto-generating config with spaCy
Generated training config

============================ Training pipeline ============================
Components: ner
Merging training and evaluation data for 1 components
  - [ner] Training: 685 | Evaluation: 300 (from datasets)
Training: 685 | Evaluation: 300
Labels: ner (1)
Pipeline: ['tok2vec', 'ner']
Initial learn rate: 0.001

============================ Training pipeline ============================
E     #      LOSS TOK2VEC   LOSS NER   ENTS_F   ENTS_P   ENTS_R   SCORE
---   ------ ------------   --------   ------   ------   ------   ------
  0        0         0.00      26.50     0.73     0.39     5.43     0.01
  0      200        33.58     847.68    10.88    44.44     6.20     0.11
  1      400        70.88     267.65    33.50    45.95    26.36     0.33
  2      600        67.56     156.63    45.32    62.16    35.66     0.45
  3      800       138.28     134.12    48.17    74.19    35.66     0.48
  4     1000       177.95     109.77    51.43    66.67    41.86     0.51
  6     1200        94.95      52.13    54.63    67.82    45.74     0.55
  8     1400       126.85      66.19    56.00    65.62    48.84     0.56
 10     1600        38.34      24.16    51.96    70.67    41.09     0.52
 13     1800       105.14      23.23    56.88    69.66    48.06     0.57
 16     2000        32.27      12.44    54.55    71.25    44.19     0.55

train-curve command New: 1.9

  • Interface: terminal only
  • Use case: test how accuracy improves with more data

Train a model with one or more components (NER, text classification, tagger, parser, sentence recognizer or span categorizer) with different portions of the training examples and print the accuracy figures and accuracy improvements with more data. This recipe takes pretty much the same arguments as train. --n-samples sets the number of sample models to train at different stages. For instance, 10 will train models for 10% of the examples, 20%, 30% and so on. This recipe is useful to determine the quality of the collected annotations, and whether more training examples will improve the accuracy. As a rule of thumb, if accuracy improves within the last 25%, training with more examples will likely result in better accuracy.
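
The sampling scheme and the rule of thumb can be sketched in a few lines. The helpers sample_fractions and improved_in_last_segment are hypothetical names used for illustration only, and the scores are the ones printed in the example output of this recipe.

```python
def sample_fractions(n_samples):
    """Portions of the training data used for each run, e.g. 4 -> 25%, 50%, 75%, 100%."""
    return [(i + 1) / n_samples for i in range(n_samples)]


def improved_in_last_segment(scores):
    """Rule of thumb: did the score still go up between the last two runs?"""
    return len(scores) >= 2 and scores[-1] > scores[-2]


fractions = sample_fractions(4)
scores = [0.31, 0.44, 0.43, 0.56]  # scores at 25%, 50%, 75% and 100%

for fraction, score in zip(fractions, scores):
    print(f"{fraction:.0%}: {score:.2f}")

if improved_in_last_segment(scores):
    print("More annotations of the same type might improve the model further.")
```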


prodigy train-curve --ner --textcat --textcat-multilabel --tagger --parser --senter --spancat --eval-split --config --base-model --lang --gpu-id --n-samples --show-plot overrides -F
ArgumentTypeDescriptionDefault
output_dirstrPath to output directory. If not set, nothing will be saved.None
--ner, -nstrOne or more (comma-separated) datasets for the named entity recognizer. Use the eval: prefix for evaluation sets.None
--textcat, -tcstrOne or more (comma-separated) datasets for the text classifier (exclusive categories). Use the eval: prefix for evaluation sets.None
--textcat-multilabel, -tcmstrOne or more (comma-separated) datasets for the text classifier (non-exclusive categories). Use the eval: prefix for evaluation sets.None
--tagger, -tstrOne or more (comma-separated) datasets for the part-of-speech tagger. Use the eval: prefix for evaluation sets.None
--parser, -pstrOne or more (comma-separated) datasets for the dependency parser. Use the eval: prefix for evaluation sets.None
--senter, -sstrOne or more (comma-separated) datasets for the sentence recognizer. Use the eval: prefix for evaluation sets.None
--spancat, -scstrOne or more (comma-separated) datasets for the span categorizer. Use the eval: prefix for evaluation sets.None
--config, -cstrOptional path to training config.cfg to use. If not set, it will be auto-generated using the default settings.None
--base-model, -mstrOptional spaCy pipeline to use for tokenization and sentence segmentation.None
--lang, -lstrCode of language to use if no config or base model are provided."en"
--eval-split, -esfloatIf no evaluation sets are provided for a component, split off a percentage of the training examples for evaluation.0.2
--verbose, -VboolEnable verbose logging.False
--n-samples, -nsintNumber of samples to train, e.g. 4 for results at 25%, 50%, 75% and 100%.4
--show-plot, -PboolShow a visual plot of the curve (requires the plotext library).False
overridesanyConfig parameters to override. Should be options starting with -- that correspond to the config section and value to override, e.g. --training.batch_size 128.None
-FstrOne or more comma-separated paths to Python files to import, e.g. for custom registered functions.None

Example

prodigy train-curve --ner news_headlines --show-plot

======================== Generating Prodigy config ========================
Auto-generating config with spaCy
Generated training config

========================== Train curve diagnostic ==========================
Training 4 times with 25%, 50%, 75%, 100% of the data

%      Score    ner
----   ------   ------
  0%   0.00     0.00
 25%   0.31 ▲   0.31 ▲
 50%   0.44 ▲   0.44 ▲
 75%   0.43 ▼   0.43 ▼
100%   0.56 ▲   0.56 ▲

(Plot of the score against the percentage of training data: 0.00 at 0%, 0.31 at 25%, 0.44 at 50%, 0.43 at 75%, 0.56 at 100%.)

Accuracy improved in the last sample
As a rule of thumb, if accuracy increases in the last segment, this could indicate that collecting more annotations of the same type might improve the model further.

data-to-spacy command New: 1.9

  • Interface: terminal only
  • Saves: training and evaluation data in spaCy's format, config and cached labels
  • Use case: merge annotations and export a training corpus

Combine multiple datasets, merge annotations on the same examples and output training and evaluation data in spaCy’s binary .spacy format, which you can use with spacy train. The command takes an output directory and generates all data required to train a pipeline with spaCy, including the config and pre-generated labels data to speed up the training process. This recipe will merge annotations for the different pipeline components and output a combined training corpus. If an example is only present in one dataset type, its annotations for the other components will be missing values. It’s recommended to use the review recipe on the different annotation types first to resolve conflicts properly.
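
The effect of --eval-split can be pictured with a simple holdout split. This is only a sketch under assumptions: the helper holdout_split, the shuffling and the rounding are illustrative choices and not necessarily what Prodigy does internally.

```python
import random


def holdout_split(examples, eval_split=0.2, seed=0):
    """Shuffle the examples and hold back a portion for evaluation.

    Returns (train, dev). With eval_split=0, dev is empty.
    """
    examples = list(examples)
    random.Random(seed).shuffle(examples)  # fixed seed keeps the split repeatable
    n_eval = int(len(examples) * eval_split)  # assumed rounding: truncate
    return examples[n_eval:], examples[:n_eval]


# 768 stand-in examples with a 30% split, as in "--eval-split 0.3".
train, dev = holdout_split(range(768), eval_split=0.3)
print(len(train), len(dev))
```

With 768 examples and a 0.3 split, this assumed rounding gives 538 training and 230 evaluation examples.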


prodigy data-to-spacy output_dir --ner --textcat --textcat-multilabel --tagger --parser --senter --spancat --eval-split --config --base-model --lang --verbose -F
ArgumentTypeDescriptionDefault
output_pathstrPath to output directory.
--ner, -nstrOne or more (comma-separated) datasets for the named entity recognizer. Use the eval: prefix for evaluation sets.None
--textcat, -tcstrOne or more (comma-separated) datasets for the text classifier (exclusive categories). Use the eval: prefix for evaluation sets.None
--textcat-multilabel, -tcmstrOne or more (comma-separated) datasets for the text classifier (non-exclusive categories). Use the eval: prefix for evaluation sets.None
--tagger, -tstrOne or more (comma-separated) datasets for the part-of-speech tagger. Use the eval: prefix for evaluation sets.None
--parser, -pstrOne or more (comma-separated) datasets for the dependency parser. Use the eval: prefix for evaluation sets.None
--senter, -sstrOne or more (comma-separated) datasets for the sentence recognizer. Use the eval: prefix for evaluation sets.None
--spancat, -scstrOne or more (comma-separated) datasets for the span categorizer. Use the eval: prefix for evaluation sets.None
--config, -cstrOptional path to training config.cfg to use. If not set, it will be auto-generated using the default settings.None
--base-model, -mstrOptional spaCy pipeline to use for tokenization and sentence segmentation.None
--lang, -lstrCode of language to use if no config or base model are provided."en"
--eval-split, -esfloatIf no evaluation sets are provided for a component, split off a percentage of the training examples for evaluation. If set to 0, no evaluation set will be generated.0.2
--verbose, -VboolEnable verbose logging.False
-FstrOne or more comma-separated paths to Python files to import, e.g. for custom registered functions.None

Example

prodigy data-to-spacy ./corpus --ner news_ner_person,news_ner_org,news_ner_product --textcat news_cats2018,news_cats2019 --eval-split 0.3

Using language 'en'

============================= Generating data =============================
Components: ner, textcat
Merging training and evaluation data for 2 components
  - [ner] Training: 685 | Evaluation: 300 (from datasets)
  - [textcat] Training: 538 | Evaluation: 230 (30% split)
Training: 1223 | Evaluation: 530
Labels: ner (3) | textcat (5)
Saved 1223 training examples
./corpus/train.spacy
Saved 530 evaluation examples
./corpus/dev.spacy

============================ Generating config ============================
Auto-generating config with spaCy
Generated training config

======================= Generating cached label data =======================
Saving label data for component 'ner'
./corpus/labels/ner.json
Saving label data for component 'textcat'
./corpus/labels/textcat.json

============================ Finalizing export ============================
Saved training config
./corpus/config.cfg

To use this data for training with spaCy, you can run:
python -m spacy train ./corpus/config.cfg --paths.train ./corpus/train.spacy --paths.dev ./corpus/dev.spacy

Training in spaCy v3

spacy train ./corpus/config.cfg --paths.train ./corpus/train.spacy --paths.dev ./corpus/dev.spacy

Vectors and Terminology

terms.teach binary

  • Interface: text
  • Saves: accepted and rejected terms to the database
  • Updates: target vector used for similarity comparison
  • Use case: building terminology lists and pre-processing candidates for NER training

Build a terminology list interactively using a model’s word vectors and seed terms, either a comma-separated list or a text file containing one term per line. Based on the seed terms, a target vector is created and only terms similar to that target vector are shown. As you annotate, the recipe iterates over the vector model’s vocab and updates the target vector with the words you accept.
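
The idea of a target vector can be sketched with toy data. The snippet below averages the vectors of accepted terms and ranks the remaining vocabulary by cosine similarity. The 2-dimensional vectors, the averaging and the helper names are assumptions made for illustration; Prodigy's actual update logic may differ.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def mean_vector(vectors):
    """Element-wise mean, used here as a simple stand-in for the target vector."""
    return [sum(dim) / len(vectors) for dim in zip(*vectors)]


# Toy vocabulary with made-up 2-d vectors (real word vectors have hundreds of dimensions).
vocab = {
    "python": [0.9, 0.1],
    "ruby": [0.8, 0.2],
    "java": [0.85, 0.15],
    "banana": [0.1, 0.9],
}

accepted = ["python", "ruby"]  # seed terms plus anything accepted so far
target = mean_vector([vocab[term] for term in accepted])

# Rank the remaining terms by similarity to the target vector.
candidates = [word for word in vocab if word not in accepted]
ranked = sorted(candidates, key=lambda word: cosine(vocab[word], target), reverse=True)
print(ranked)
```

Accepting another term would add its vector to the accepted set and shift the target, which in turn changes the ranking of what is suggested next.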


prodigy terms.teach dataset vectors --seeds --resume
ArgumentTypeDescriptionDefault
datasetstrProdigy dataset to save annotations to.
vectorsstrLoadable spaCy pipeline with word vectors and a vocab, e.g. en_core_web_lg or custom vectors trained on domain-specific text.
--seeds, -sstr / PathComma-separated list or path to file with seed terms (one term per line).''
--resume, -RboolResume from existing terms dataset and update target vector accordingly.False

Example

prodigy terms.teach prog_lang_terms en_core_web_lg --seeds Python,C++,Ruby

terms.to-patterns command

  • Interface: terminal only
  • Saves: JSONL-formatted patterns file
  • Use case: Convert terms dataset to match patterns to bootstrap annotation or for spaCy's entity ruler

Convert a dataset collected with terms.teach or sense2vec.teach to a JSONL-formatted patterns file. You can optionally provide a spaCy pipeline for tokenization to create token-based patterns and make them case-insensitive. If no model is provided, the patterns will be generated as exact string matches. Pattern files can be used in Prodigy to bootstrap annotation and pre-highlight suggestions, for example in ner.manual. You can also use them with spaCy’s EntityRuler for rule-based named entity recognition.
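
The difference between exact string patterns and token-based patterns can be sketched as follows. The helper term_to_pattern is hypothetical, and a whitespace split stands in for a real spaCy tokenizer, so the output only approximates what the recipe would write.

```python
import json


def term_to_pattern(term, label, token_based=False, case_sensitive=False):
    """Build one pattern entry for a term.

    Without tokenization the pattern is an exact string match. With
    token_based=True, each token becomes a dict keyed by "lower"
    (case-insensitive) or "text" (case-sensitive).
    """
    if not token_based:
        return {"label": label, "pattern": term}
    tokens = term.split()  # assumption: whitespace split instead of a spaCy tokenizer
    if case_sensitive:
        token_patterns = [{"text": token} for token in tokens]
    else:
        token_patterns = [{"lower": token.lower()} for token in tokens]
    return {"label": label, "pattern": token_patterns}


terms = ["Python", "Objective C"]  # stand-ins for accepted terms in a dataset
patterns = [term_to_pattern(t, "PROGRAMMING_LANGUAGE", token_based=True) for t in terms]

# One JSON object per line, as in a JSONL patterns file.
jsonl = "\n".join(json.dumps(pattern) for pattern in patterns)
print(jsonl)
```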


prodigy terms.to-patterns dataset output_file --label --spacy-model --case-sensitive
ArgumentTypeDescriptionDefault
datasetstrProdigy dataset ID to convert.
output_filestrOptional path to an output file.sys.stdout
--label, -lstrLabel to assign to the patterns.None
--spacy-model, -mstrNew: 1.9 Optional spaCy pipeline for tokenization to create token-based patterns, or blank:lang to start with a blank model (e.g. blank:en for English).None
--case-sensitive, -CSboolNew: 1.9 Make patterns case-sensitive.False

Example

prodigy terms.to-patterns prog_lang_terms ./prog_lang_patterns.jsonl --label PROGRAMMING_LANGUAGE --spacy-model blank:en
✨ Exported 59 patterns
./prog_lang_patterns.jsonl

Review and Evaluate

review New: 1.8

  • Interface: review
  • Saves: reviewed master annotations to the database
  • Use case: review annotations by multiple annotators and resolve conflicts

Review existing annotations created by multiple annotators and resolve potential conflicts by creating one final “master annotation”. Can be used for both binary and manual annotations and supports all interfaces except image_manual and compare. If the annotations were created with a manual interface, the “most popular” version, i.e. the version most sessions agreed on, will be pre-selected automatically.
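
Picking the “most popular” version amounts to a majority vote over the annotations from different sessions. The sketch below shows one way to do that; the helper most_popular and the example spans are hypothetical and not taken from Prodigy's implementation.

```python
import json
from collections import Counter


def most_popular(versions):
    """Return the annotation version most sessions agreed on, plus its vote count."""
    # Serialize with sorted keys so identical annotations compare equal.
    counts = Counter(json.dumps(version, sort_keys=True) for version in versions)
    winner, votes = counts.most_common(1)[0]
    return json.loads(winner), votes


# Hypothetical span annotations from three sessions on the same text.
session_versions = [
    [{"start": 0, "end": 5, "label": "ORG"}],
    [{"start": 0, "end": 5, "label": "ORG"}],
    [{"start": 0, "end": 5, "label": "PRODUCT"}],
]

selected, votes = most_popular(session_versions)
print(selected, votes)
```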


prodigy review dataset in_sets --label --view-id --fetch-media --show-skipped --auto-accept
ArgumentTypeDescriptionDefault
datasetstrProdigy dataset ID to save reviewed annotations.
in_setsstrComma-separated names of datasets to review.
--label, -lstrOptional comma-separated labels to display in manual annotation mode.None
--view-id, -vstrInterface to use if none present in the task, e.g. ner or ner_manual.None
--fetch-media, -FMboolNew: 1.10 Temporarily replace paths and URLs with base64 string so they can be reannotated. Will be removed again before examples are placed in the database.False
--show-skipped, -SboolNew: 1.10.5 Include answers that would otherwise be skipped, like annotations with answer "ignore" or annotations with answer "reject" in manual interfaces.False
--auto-accept, -AboolNew: 1.11 Automatically accept annotations with no conflicts and add them to the dataset.False

Example (binary)

prodigy review food_reviews_final food_reviews2019,food_reviews2018

Example (manual)

prodigy review ner_final ner_news,ner_misc --label ORG,PRODUCT

compare

  • Interface: choice / diff
  • Saves: evaluation results to the database
  • Use case: A/B evaluation of two outputs, e.g. to evaluate different models

Compare the output of your model and the output of a baseline on the same inputs. To prevent bias during annotation, Prodigy will randomly decide which output to suggest as the correct answer. When you exit the application, you’ll see detailed stats, including the preferred output. Expects two JSONL files where each entry has an "id" (to match up the outputs on the same input), and an "input" and "output" object with the content to render, e.g. the "text".
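
Matching the two files by "id" and randomizing the presentation order could look roughly like this. The helper pair_outputs and the "options" structure of the resulting tasks are hypothetical; only the "id", "input" and "output" keys come from the expected file format.

```python
import random


def pair_outputs(a_rows, b_rows, randomize=True, seed=0):
    """Match rows from two systems by "id" and optionally shuffle their order."""
    b_by_id = {row["id"]: row for row in b_rows}
    rng = random.Random(seed)
    tasks = []
    for a in a_rows:
        b = b_by_id.get(a["id"])
        if b is None:
            continue  # skip inputs that only one system produced output for
        options = [("A", a["output"]), ("B", b["output"])]
        if randomize:
            rng.shuffle(options)  # hide which system produced which output
        tasks.append({"id": a["id"], "input": a["input"], "options": options})
    return tasks


a_rows = [{"id": 1, "input": {"text": "FedEx von weltweiter Cyberattacke getroffen"},
           "output": {"text": "FedEx hit by worldwide cyberattack"}}]
b_rows = [{"id": 1, "input": {"text": "FedEx von weltweiter Cyberattacke getroffen"},
           "output": {"text": "FedEx from worldwide Cyberattacke hit"}}]

tasks = pair_outputs(a_rows, b_rows)
fixed = pair_outputs(a_rows, b_rows, randomize=False)  # comparable to --no-random
print(tasks[0]["options"])
```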


prodigy compare dataset a_file b_file --no-random --diff
ArgumentTypeDescriptionDefault
datasetstrProdigy dataset to save annotations to.
a_filestrFirst file to compare, e.g. system responses.
b_filestrSecond file to compare, e.g. baseline responses.
--no-random, -nrboolDon’t randomize which annotation is shown as the “correct” suggestion (always use the first option).False
--diff, -DboolShow examples as visual diff.False

Example

prodigy compare eval_translation ./model_a.jsonl ./model_b.jsonl

model_a.jsonl
{"id": 1, "input": {"text": "FedEx von weltweiter Cyberattacke getroffen"}, "output": {"text": "FedEx hit by worldwide cyberattack"}}

model_b.jsonl
{"id": 1, "input": {"text": "FedEx von weltweiter Cyberattacke getroffen"}, "output": {"text": "FedEx from worldwide Cyberattacke hit"}}

Other Utilities and Commands

mark binary

  • Interface: n/a
  • Saves: annotations to the database
  • Use case: show data and accept or reject examples

Start the annotation server, display whatever comes in with a given interface and collect binary annotations. At the end of the annotation session, a breakdown of the answer counts is printed. The --view-id lets you specify one of the existing annotation interfaces – just make sure your input data includes everything the interface needs, since this recipe does no preprocessing and will just show you whatever is in the data. The recipe is also very useful if you want to re-annotate data exported with db-out.


prodigy mark dataset source --loader --label --view-id --exclude
ArgumentTypeDescriptionDefault
datasetstrProdigy dataset to save annotations to.
sourcestrPath to text source or - to read from standard input.
--loader, -lostrOptional ID of text source loader. If not set, source file extension is used to determine loader.None
--label, -lstrLabel to apply in classification mode or comma-separated labels to show for manual annotation.''
--view-id, -vstrAnnotation interface to use.None
--exclude, -estrComma-separated list of dataset IDs containing annotations to exclude.None

Example

prodigy mark news_marked ./news_headlines.jsonl --label INTERESTING --view-id classification

match binary New: 1.9.8

  • Interface: n/a
  • Saves: annotations to the database
  • Use case: select examples based on match patterns

Select examples based on match patterns and accept or reject the result. Unlike ner.manual with patterns, this recipe will only show examples if they contain pattern matches. It can be used for NER and text classification annotations – for instance, to bootstrap a text category if the classes are very imbalanced and not enough positive examples are presented during manual annotation or textcat.teach. The --label-task and --label-span flags can be used to specify where the label should be added. This will also be reflected via the "label" property (on the top-level task or the spans) in the data you create with the recipe. If --combine-matches is set, all matches will be presented together. Otherwise, each match will be presented as a separate task.


prodigy match dataset spacy_model source --loader --label --patterns --label-task --label-span --combine-matches --exclude
ArgumentTypeDescriptionDefault
datasetstrProdigy dataset to save annotations to.
spacy_modelstrLoadable spaCy pipeline for tokenization to initialize the matcher, or blank:lang for a blank model (e.g. blank:en for English).
sourcestrPath to text source or - to read from standard input.
--loader, -lostrOptional ID of text source loader. If not set, source file extension is used to determine loader.None
--label, -lstrComma-separated label(s) to annotate or text file with one label per line. Only pattern matches for those labels will be shown.
--patterns, -ptstrPath to match patterns file.
--label-task, -LTboolWhether to add a label to the top-level task if a match for that label was found. For example, if you use this recipe for text classification, you typically want to add a label to the whole task.False
--label-span, -LSboolWhether to add a label to the matched span that’s highlighted. For example, if you use this recipe for NER, you typically want to add a label to the span but not the whole task.False
--combine-matches, -CboolWhether to show all matches in one task. If False, the matcher will output one task for each match and duplicate tasks if necessary.False
--exclude, -estrComma-separated list of dataset IDs containing annotations to exclude.None

Example

prodigy match news_matched blank:en ./news_headlines.jsonl --patterns ./news_patterns.jsonl --label ORG,PRODUCT --label-span

print-stream command

  • Interface: terminal only
  • Use case: quickly view a spaCy model's predictions

Pretty-print the model’s predictions on the command line. Supports named entities and text categories and will display the annotations if the model components are available. For textcat annotations, only the category with the highest score is shown if the score is greater than 0.5.
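
The textcat rule described above (show only the highest-scoring category, and only if its score is greater than 0.5) can be written as a short function. The helper top_category and the example scores are hypothetical; the dict is meant to resemble the label-to-score mapping a text classifier produces.

```python
def top_category(cats, threshold=0.5):
    """Return the highest-scoring label if its score exceeds the threshold, else None."""
    if not cats:
        return None
    label, score = max(cats.items(), key=lambda item: item[1])
    return label if score > threshold else None


confident = top_category({"SPORTS": 0.82, "POLITICS": 0.11})
uncertain = top_category({"SPORTS": 0.4, "POLITICS": 0.3})
print(confident, uncertain)
```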


prodigy print-stream spacy_model source --loader
ArgumentTypeDescriptionDefault
spacy_modelstrLoadable spaCy pipeline.
sourcestrPath to text source or - to read from standard input.
--loader, -lostrOptional ID of text source loader. If not set, source file extension is used to determine loader.None

print-dataset command

  • Interface: terminal only
  • Use case: quickly inspect collected annotations

Pretty-print annotations from a given dataset on the command line. Supports plain text, text classification and NER annotations. If no --style is specified, Prodigy will try to infer it from the data via the "_view_id" that’s automatically added since v1.8.


prodigy print-dataset dataset --style
ArgumentTypeDescriptionDefault
datasetstrProdigy dataset ID.
--style, -sstrDataset type: auto (try to infer from the data, default), text, spans or textcat.auto

db-out command

  • Interface: terminal only
  • Saves: JSONL file to disk
  • Use case: export annotated data

Export annotations in Prodigy’s JSONL format. If the output directory doesn’t exist, it will be created. If no output directory is specified, the data will be printed so it can be redirected to a file.
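
Once exported, the JSONL data can be processed line by line. The sketch below counts the "answer" values in a small in-memory export; the records are made up for illustration, and the helper count_answers is hypothetical.

```python
import json
from collections import Counter

# Made-up export with one JSON object per line, as in a JSONL file.
exported = "\n".join([
    json.dumps({"text": "First headline", "answer": "accept"}),
    json.dumps({"text": "Second headline", "answer": "reject"}),
    json.dumps({"text": "Third headline", "answer": "accept"}),
])


def count_answers(jsonl_text):
    """Count how often each answer occurs in JSONL-formatted annotations."""
    counts = Counter()
    for line in jsonl_text.splitlines():
        if line.strip():  # ignore empty lines
            record = json.loads(line)
            counts[record.get("answer", "unknown")] += 1
    return counts


answer_counts = count_answers(exported)
print(answer_counts)
```

For a real export, the same function could be applied to the contents of the file that db-out writes or that its printed output was redirected to.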


prodigy db-out dataset out_dir --dry
ArgumentTypeDescriptionDefault
datasetstrDataset ID to import or export.
out_dirstrOptional path to output directory to export annotation file to.None
--dry, -DboolPerform a dry run and don’t save any files.False

Example

prodigy db-out news_headlines > ./news_headlines.jsonl

db-merge command

  • Interface: terminal only
  • Saves: merged examples to the database
  • Use case: merging multiple datasets with annotations into one

Merge two or more existing datasets into a new set, e.g. to create a final dataset that can be reviewed or used to train a model. Keeps a copy of the original datasets and creates a new set for the merged examples.


prodigy db-merge in_sets out_set --rehash --dry
ArgumentTypeDescriptionDefault
in_setsstrComma-separated names of datasets to merge.
out_setstrName of dataset to save the merged examples to.
--rehash, -RboolNew: 1.10 Force-update all hashes assigned to examples.False
--dry, -DboolPerform a dry run and don’t save anything.False

Example

prodigy db-merge news_person,news_org,news_product news_training
Merged 2893 examples from 3 datasets
Created merged dataset 'news_training'

db-in command

  • Interface: terminal only
  • Saves: imported examples to the database
  • Use case: importing existing annotated data

Import existing annotations to the database. Can load all file types supported by Prodigy. To import NER annotations, the files should be converted into Prodigy’s JSONL annotation format.


prodigy db-in dataset in_file --rehash --dry
ArgumentTypeDescriptionDefault
datasetstrDataset ID to import or export.
in_filestrPath to input annotation file.
--rehash, -rhboolUpdate and overwrite all hashes.False
--dry, -DboolPerform a dry run and don’t save any files.False

drop command

  • Interface: terminal only
  • Saves: updated database
  • Use case: remove datasets and sessions

Remove a dataset or annotation session from a project. Can’t be undone. To see all dataset and session IDs in the database, use prodigy stats -ls.


prodigy drop dataset --batch-size
ArgumentTypeDescriptionDefault
datasetstrDataset or session ID.
--batch-size, -nintDelete examples in batches of the given size. Prevents possible database error for large datasets.None

stats command

  • Interface: terminal only
  • Use case: view installation details and database statistics

Print Prodigy and database statistics. Specifying a dataset ID will show detailed stats for the dataset, like annotation counts and meta data. You can also choose to list all available dataset or session IDs.


prodigy stats dataset -l -ls --no-format
ArgumentTypeDescriptionDefault
datasetstrOptional Prodigy dataset ID.
--list-datasets, -lboolList IDs of all datasets in the database.False
--list-sessions, -lsboolList IDs of all datasets and sessions in the database.False
--no-format, -nfboolDon’t pretty-print the stats and print a simple dict instead.False

Example

prodigy stats news_headlines -l

============================== Prodigy Stats ==============================

Version          1.9.0
Database Name    SQLite
Database Id      sqlite
Total Datasets   4
Total Sessions   23

================================= Datasets =================================

news_headlines, news_headlines_eval, github_docs, test

============================== Dataset Stats ==============================

Dataset       news_headlines
Created       2017-07-29 15:29:28
Description   Annotate news headlines
Author        Ines
Annotations   1550
Accept        671
Reject        435
Ignore        444

progress command (New: 1.11)

  • Interface: terminal only
  • Use case: view annotation progress over time

View the annotation progress of one or more datasets over time and optionally compare it against an input source to check the coverage. The command will output the new annotations created during the given intervals, the total annotations at each point, as well as the number of unique annotations if the data contains multiple annotations on the same input data.


prodigy progress datasets --interval --source --loader

| Argument | Type | Description | Default |
| --- | --- | --- | --- |
| datasets | str | One or more comma-separated dataset names. | |
| --interval, -i | str | Time period to calculate progress for. Can be "day", "week", "month", "year". | "month" |
| --source, -s | str | Optional path to text source or - to read from standard input. If set, will be used to calculate percentage of annotated examples based on the input data. | |
| --loader, -lo | str | Optional ID of text source loader. If not set, source file extension is used to determine loader. | None |

Example

prodigy progress news_headlines_person,news_headlines_org

    ================================== Legend ==================================

    New      New annotations collected in interval
    Total    Total annotations collected
    Unique   Unique examples (not counting multiple annotations of same example)

    ================================= Progress =================================

                   New   Unique   Total   Unique
    -----------   ----   ------   -----   ------
    10 Jul 2021   1123      733    1123      733
    12 Jul 2021    200      200    1323      933
    13 Jul 2021    831      711    2154     1644
    14 Jul 2021    157      150    2311     1790
    15 Jul 2021   1464     1401    3775     3191
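
To make the New, Total and Unique columns concrete, here is a rough sketch of how such numbers could be derived per day. It is not Prodigy's implementation. It assumes each annotation carries a "_timestamp" (seconds since the epoch) and an "_input_hash" identifying the input it was made on, and it counts an input as unique the first time it is seen. The `progress_by_day` and `ts` helpers and the sample data are hypothetical.

```python
from datetime import datetime, timezone


def ts(day, hour):
    """Helper for the sample data: a UTC timestamp in July 2021."""
    return datetime(2021, 7, day, hour, tzinfo=timezone.utc).timestamp()


def progress_by_day(examples):
    """Return rows of (day, new, new_unique, total, total_unique)."""
    by_day = {}
    for eg in sorted(examples, key=lambda e: e["_timestamp"]):
        day = datetime.fromtimestamp(eg["_timestamp"], tz=timezone.utc).date()
        by_day.setdefault(day, []).append(eg)
    seen = set()  # input hashes annotated so far
    total = 0
    rows = []
    for day, day_examples in by_day.items():
        # Inputs annotated for the first time in this interval.
        new_inputs = {eg["_input_hash"] for eg in day_examples} - seen
        seen |= new_inputs
        total += len(day_examples)
        rows.append((day.isoformat(), len(day_examples), len(new_inputs), total, len(seen)))
    return rows


# Input 1 is annotated twice on the same day, input 2 on two different days.
examples = [
    {"_input_hash": 1, "_timestamp": ts(10, 9)},
    {"_input_hash": 1, "_timestamp": ts(10, 10)},
    {"_input_hash": 2, "_timestamp": ts(10, 11)},
    {"_input_hash": 2, "_timestamp": ts(12, 9)},
    {"_input_hash": 3, "_timestamp": ts(12, 10)},
]

for row in progress_by_day(examples):
    print(row)
```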

prodigy command

  • Interface: terminal only
  • Use case: run recipe scripts

Run a built-in or custom Prodigy recipe. The -F option lets you load a recipe from a simple Python file, containing one or more recipe functions. All recipe arguments will be available from the command line. To print usage info and a list of available arguments, use the --help flag.


prodigy recipe_name ...recipe arguments -F

| Argument | Type | Description |
| --- | --- | --- |
| recipe_name | positional | Recipe name. |
| *recipe_arguments | | Recipe arguments. |
| -F | str | Path to recipe file to load custom recipe. |
| --help, -h | bool | Show help message and available arguments. |

Example

prodigy custom-recipe my_dataset ./data.jsonl --custom-opt 123 -F recipe.py
recipe.py (pseudocode)

    import prodigy
    from prodigy.components.loaders import JSONL

    # Each argument annotation is (description, argument style, shortcut, type).
    @prodigy.recipe(
        "custom-recipe",
        dataset=("The dataset", "positional", None, str),
        source_file=("A positional argument", "positional", None, str),
        custom_opt=("An option", "option", "co", int)
    )
    def custom_recipe_function(dataset, source_file, custom_opt=10):
        # Load the source file as a stream of examples.
        stream = JSONL(source_file)
        print("Custom option passed in via command line:", custom_opt)
        # Return the components: where to save, what to stream, how to display.
        return {
            "dataset": dataset,
            "stream": stream,
            "view_id": "text"
        }
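
To show how the argument annotations in the recipe relate to the command-line call above, here is a small sketch that translates the same tuples into a standard-library argparse parser. This is only an illustration of how the tuples can be read, not how Prodigy builds its command-line interface. The `build_parser` helper is hypothetical.

```python
import argparse

# Annotation tuples copied from the recipe: (description, style, shortcut, type).
ANNOTATIONS = {
    "dataset": ("The dataset", "positional", None, str),
    "source_file": ("A positional argument", "positional", None, str),
    "custom_opt": ("An option", "option", "co", int),
}


def build_parser(annotations, defaults):
    """Build an argparse parser from recipe-style argument annotations."""
    parser = argparse.ArgumentParser(prog="custom-recipe")
    for name, (help_text, style, shortcut, type_) in annotations.items():
        if style == "positional":
            parser.add_argument(name, help=help_text, type=type_)
        elif style == "option":
            # custom_opt becomes --custom-opt, with -co as its shortcut.
            flags = ["--" + name.replace("_", "-")]
            if shortcut:
                flags.append("-" + shortcut)
            parser.add_argument(
                *flags, dest=name, help=help_text, type=type_,
                default=defaults.get(name),
            )
    return parser


parser = build_parser(ANNOTATIONS, {"custom_opt": 10})
args = parser.parse_args(["my_dataset", "./data.jsonl", "--custom-opt", "123"])
print(args.dataset, args.source_file, args.custom_opt)
```

In this sketch, leaving out the option falls back to the function's default of 10, matching custom_opt=10 in the recipe signature.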

Deprecated recipes

The following recipes have been deprecated in favor of newer workflows and best practices. See the table for details and replacements. The version numbers indicate when the feature was deprecated (but still available) and when it was removed. For instance, 1.10 1.11 indicates that the recipe was deprecated but still available in v1.10 and removed in v1.11. To view the recipe details and documentation of deprecated recipes, run the recipe command with the --help flag.

| Recipe | Versions | Details |
| --- | --- | --- |
| ner.match | 1.10 1.11 | This recipe has been deprecated in favor of ner.manual with --patterns, which lets you match patterns and allows editing the results at the same time, and the general-purpose match recipe, which lets you match patterns and accept or reject the result. |
| ner.eval | 1.10 1.11 | This recipe has been deprecated in favor of creating regular gold-standard evaluation sets with ner.manual (fully manual) or ner.correct (semi-automatic). |
| ner.print-stream | 1.10 1.11 | This recipe has been deprecated in favor of the general-purpose print-stream command that can print streams of all supported types. |
| ner.print-dataset | 1.10 1.11 | This recipe has been deprecated in favor of the general-purpose print-dataset command that can print datasets of all supported types. |
| ner.gold-to-spacy | 1.10 1.11 | This recipe has been deprecated in favor of data-to-spacy, which can take multiple datasets of different types (e.g. NER and text classification) and outputs a JSON file in spaCy’s training format that can be used with spacy train. |
| ner.iob-to-gold | 1.10 1.11 | This recipe has been deprecated because it only served a very limited purpose. To convert IOB annotations, you can either use spacy convert or write a custom script. |
| ner.batch-train | 1.10 1.11 | This recipe has been deprecated in favor of the general-purpose train recipe that supports all components. |
| ner.train-curve | 1.10 1.11 | This recipe has been deprecated in favor of the general-purpose train-curve recipe that supports all components. |
| textcat.eval | 1.10 1.11 | This recipe has been deprecated in favor of creating regular gold-standard evaluation sets with textcat.manual. |
| textcat.print-stream | 1.10 1.11 | This recipe has been deprecated in favor of the general-purpose print-stream command that can print streams of all supported types. |
| textcat.print-dataset | 1.10 1.11 | This recipe has been deprecated in favor of the general-purpose print-dataset command that can print datasets of all supported types. |
| textcat.batch-train | 1.10 1.11 | This recipe has been deprecated in favor of the general-purpose train recipe that supports all components and works with both binary accept/reject annotations and multiple choice annotations out of the box. |
| textcat.train-curve | 1.10 1.11 | This recipe has been deprecated in favor of the general-purpose train-curve recipe that supports all components. |
| pos.gold-to-spacy | 1.10 1.11 | This recipe has been deprecated in favor of data-to-spacy, which can take multiple datasets of different types (e.g. POS tags and NER) and outputs a JSON file in spaCy’s training format that can be used with spacy train. |
| pos.batch-train | 1.10 1.11 | This recipe has been deprecated in favor of the general-purpose train recipe that supports all components. |
| pos.train-curve | 1.10 1.11 | This recipe has been deprecated in favor of the general-purpose train-curve recipe that supports all components. |
| dep.batch-train | 1.10 1.11 | This recipe has been deprecated in favor of the general-purpose train recipe that supports all components. |
| dep.train-curve | 1.10 1.11 | This recipe has been deprecated in favor of the general-purpose train-curve recipe that supports all components. |
| terms.train-vectors | 1.10 1.11 | This recipe has been deprecated since wrapping word vector training in a recipe only introduces a layer of unnecessary abstraction. If you want to train your own vectors, use GloVe, fastText or Gensim directly and then add the vectors to a spaCy pipeline. |
| image.test | 1.10 1.11 | This recipe has been deprecated since it was mostly intended to demonstrate the new image capabilities on launch. For a real-world example of using Prodigy for object detection with a model in the loop, see this TensorFlow tutorial. |
| pipe | 1.10 1.11 | This command has been deprecated since it didn’t provide any Prodigy-specific functionality. To pipe data forward, you can convert the data to JSONL and run cat data.jsonl \| prodigy ... or write a custom loader. |
| dataset | 1.10 1.11 | This command has been deprecated since it’s mostly redundant. If a dataset doesn’t exist in the database, it’s added automatically. |
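
The pipe entry above suggests converting data to JSONL before passing it on. A minimal sketch of such a conversion, using only the standard library, might look like the following. It assumes the source is CSV with a "text" column and that any remaining columns can be kept under a "meta" key. The `csv_to_jsonl` helper and the sample data are hypothetical.

```python
import csv
import io
import json


def csv_to_jsonl(csv_text, text_column="text"):
    """Convert CSV text to JSONL, one task dict per row."""
    reader = csv.DictReader(io.StringIO(csv_text))
    lines = []
    for row in reader:
        task = {"text": row[text_column]}
        # Keep the other columns as metadata (an assumption of this sketch).
        meta = {key: value for key, value in row.items() if key != text_column}
        if meta:
            task["meta"] = meta
        lines.append(json.dumps(task))
    return "\n".join(lines)


SAMPLE = "text,source\nFirst headline,blog\nSecond headline,news\n"
print(csv_to_jsonl(SAMPLE))
```

The printed lines could be written to a file such as data.jsonl and then passed on as described in the pipe entry.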