Built-in Recipes

A Prodigy recipe is a Python function that can be run via the command line. Prodigy comes with lots of useful recipes, and it’s very easy to write your own. Recipes don’t have to start the web server – you can also use the recipe decorator as a quick way to make your Python function into a command-line utility. To view the recipe arguments and documentation on the command line, run the command with --help, for example prodigy ner.manual --help.


Named Entity Recognition	Tag names and concepts as spans in text.
Span Categorization	Label arbitrary and potentially overlapping spans in text.
Text Classification	Assign one or more categories to whole texts.
Part-of-speech Tagging	Assign part-of-speech tags to tokens.
Sentence Segmentation	Assign sentence boundaries.
Dependency Parsing	Assign and correct syntactic dependency attachments in text.
Coreference Resolution	Resolve mentions and references to the same words in text.
Relations	Annotate any relations between words and phrases.
Computer Vision	Annotate images and image segments.
Audio & Video	Annotate and segment audio and video files.
Large Language Models	Perform zero or few-shot annotation using large-language models.
Training	Train models and export training corpora.
Vectors & Terminology	Create patterns and terminology lists from word vectors.
Review & Evaluate	Review data, resolve conflicts and compute inter-annotator agreement.
Utilities & Commands	Manage datasets, view data and streams, and more.
Plugins	Extend Prodigy with more workflows, e.g. PDFs, Hugging Face and more.
Deprecated Recipes	Recipes that have already been replaced by better alternatives.

Named Entity Recognition

`ner.manual` manual

Interface: ner_manual
Saves: annotations to the database
Use case: highlight names and concepts in text manually or semi-manually

Mark entity spans in a text by highlighting them and selecting the respective labels. The model is used to tokenize the text to allow less sensitive highlighting, since the token boundaries are used to set the entity spans. The label set can be defined as a comma-separated list on the command line or as a path to a text file with one label per line. If no labels are specified, Prodigy will check if labels are present in the model. This recipe does not require an entity recognizer, and doesn’t do any active learning.

prodigy ner.manual dataset spacy_model source --loader --label --patterns --exclude --highlight-chars --edit-text

Argument	Type	Description	Default
`dataset`	str	Prodigy dataset to save annotations to.
`spacy_model`	str	Loadable spaCy pipeline for tokenization or `blank:lang` for a blank model (e.g. `blank:en` for English).
`source`	str	Path to text source or `-` to read from standard input.
`--loader`, `-lo`	str	Optional ID of text source loader. If not set, source file extension is used to determine loader.	`None`
`--label`, `-l`	str	One or more labels to annotate. Supports a comma-separated list or a path to a file with one label per line. If no labels are set, Prodigy will check the model for available labels.
`--patterns`, `-pt`	str	New: 1.9 Optional path to match patterns file to pre-highlight entity spans.	`None`
`--exclude`, `-e`	str	Comma-separated list of dataset IDs containing annotations to exclude.	`None`
`--highlight-chars`, `-C`	bool	New: 1.14.5 Allow switching between highlighting individual characters and tokens. If set, character highlighting is enabled by default and no `"tokens"` information will be saved with the example.	`False`
`--edit-text`, `-E`	bool	New: 1.18.0 Allow editing text during annotation. On save, the text will be retokenized by the recipe and existing spans will be reset, but you can change how the task JSON is regenerated by providing a custom event hook.	`False`
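
The token-based highlighting described for `ner.manual` can be illustrated with a small sketch. This is not Prodigy's implementation: the regex tokenizer and the `snap_to_tokens` helper are hypothetical stand-ins for the spaCy tokenizer and the web app's selection logic. It only shows why a sloppy selection still yields a clean span when token boundaries are used.

```python
import re

def tokenize(text):
    """Split text into word and punctuation tokens with character offsets.

    Assumption: a simple regex stands in for the spaCy tokenizer here.
    """
    return [
        {"text": m.group(), "start": m.start(), "end": m.end(), "id": i}
        for i, m in enumerate(re.finditer(r"\w+|[^\w\s]", text))
    ]

def snap_to_tokens(tokens, start, end, label):
    """Expand a rough character selection to the tokens it overlaps."""
    touched = [t for t in tokens if t["start"] < end and t["end"] > start]
    if not touched:
        return None  # the selection covered whitespace only
    return {
        "start": touched[0]["start"],
        "end": touched[-1]["end"],
        "token_start": touched[0]["id"],
        "token_end": touched[-1]["id"],
        "label": label,
    }

text = "Apple hires Tim Cook as CEO"
tokens = tokenize(text)
# A sloppy selection covering "im Coo" (characters 13 to 19)
span = snap_to_tokens(tokens, 13, 19, "PERSON")
print(text[span["start"]:span["end"]])  # prints "Tim Cook"
```

The same idea explains why `--highlight-chars` drops the `"tokens"` information: once selections are no longer tied to tokens, there are no boundaries to snap to.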

Example

prodigy ner.manual ner_news en_core_web_sm ./news_headlines.jsonl --label PERSON,ORG,PRODUCT

`ner.correct` manual

Interface: ner_manual
Saves: annotations to the database
Use case: correct a spaCy model's predictions manually

Create gold-standard data for NER by correcting the model’s suggestions. The spaCy pipeline will be used to predict entities contained in the text, which the annotator can remove and correct if necessary.

prodigy ner.correct dataset spacy_model source --loader --label --update --exclude --unsegmented --component

Argument	Type	Description	Default
`dataset`	str	Prodigy dataset to save annotations to.
`spacy_model`	str	Loadable spaCy pipeline.
`source`	str	Path to text source or `-` to read from standard input.
`--loader`, `-lo`	str	Optional ID of text source loader. If not set, source file extension is used to determine loader.	`None`
`--label`, `-l`	str	One or more labels to annotate. Supports a comma-separated list or a path to a file with one label per line. If no labels are set, Prodigy will check the model for available labels.
`--update`, `-UP`	bool	New: 1.11 Update the model in the loop with the received annotations.	`False`
`--exclude`, `-e`	str	Comma-separated list of dataset IDs containing annotations to exclude.	`None`
`--unsegmented`, `-U`	bool	Don’t split sentences.	`False`
`--component`, `-c`	str	New: 1.11 Name of NER component in the pipeline.	`"ner"`

Example

prodigy ner.correct gold_ner en_core_web_sm ./news_headlines.jsonl --label PERSON,ORG

`ner.teach` binary

Interface: ner
Saves: annotations to the database
Updates: spaCy model in the loop
Active learning: prefers most uncertain scores
Use case: updating and improving NER models

Collect the best possible training data for a named entity recognition model with the model in the loop. Based on your annotations, Prodigy will decide which questions to ask next. If the suggested entity is fully correct, you can accept it. If it’s entirely or partially wrong, you should reject it. As of v1.11, the recipe will also ask you about examples containing no entities at all, which can improve overall accuracy of your model. So if you see an example with no highlighted suggestions, you can accept it if the text contains no entities, or reject it if it does contain entities of the labels you’re annotating.

prodigy ner.teach dataset spacy_model source --loader --label --patterns --exclude --unsegmented

Argument	Type	Description	Default
`dataset`	str	Prodigy dataset to save annotations to.
`spacy_model`	str	Loadable spaCy pipeline.
`source`	str	Path to text source or `-` to read from standard input.
`--loader`, `-lo`	str	Optional ID of text source loader. If not set, source file extension is used to determine loader.	`None`
`--label`, `-l`	str	Label(s) to annotate. Accepts single label or comma-separated list. If not set, all available labels will be returned.	`None`
`--patterns`, `-pt`	str	Optional path to match patterns file to pre-highlight entity spans in addition to those suggested by the model.	`None`
`--exclude`, `-e`	str	Comma-separated list of dataset IDs containing annotations to exclude.	`None`
`--unsegmented`, `-U`	bool	Don’t split sentences.	`False`
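
The "prefers most uncertain scores" strategy can be sketched in a few lines. This is a simplified illustration, not Prodigy's own sorter, which works on a stream rather than a fixed list. The `uncertainty` and `most_uncertain_first` helpers and the example scores are hypothetical.

```python
def uncertainty(score):
    """Map a model score in [0, 1] to an uncertainty in [0, 1].

    A score of 0.5 is maximally uncertain; 0.0 and 1.0 are fully confident.
    """
    return 1.0 - abs(score - 0.5) * 2

def most_uncertain_first(scored_examples):
    """Order (score, example) pairs so the least confident come first."""
    return sorted(scored_examples, key=lambda pair: uncertainty(pair[0]), reverse=True)

# Hypothetical model scores for four suggested entities
candidates = [(0.98, "Apple"), (0.52, "Jaguar"), (0.07, "the"), (0.41, "Amazon")]
ranked = [example for _, example in most_uncertain_first(candidates)]
print(ranked)  # ['Jaguar', 'Amazon', 'the', 'Apple']
```

Suggestions the model is already sure about, whether right or wrong, end up last, so accept/reject decisions are spent where they carry the most information.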

Example

prodigy ner.teach ner_news en_core_web_sm ./news_headlines.jsonl --label PERSON,EVENT

`ner.silver-to-gold` manual

Interface: ner_manual
Saves: annotations to the database
Use case: converting binary datasets to gold-standard data with no missing values

Take existing “silver” datasets with binary accept/reject annotations, merge the annotations to find the best possible analysis given the constraints defined in the annotations, and manually edit it to create a perfect and complete “gold” dataset.

prodigy ner.silver-to-gold dataset silver_sets spacy_model --label

Argument	Type	Description
`dataset`	str	Prodigy dataset ID to save annotations to.
`silver_sets`	str	Comma-separated names of existing binary datasets to convert.
`spacy_model`	str	Loadable spaCy pipeline.
`--label`, `-l`	str	One or more labels to annotate. Supports a comma-separated list or a path to a file with one label per line. If no labels are set, Prodigy will check the model for available labels.

`ner.eval-ab` binary

Interface: choice
Saves: evaluation results to the database
Use case: comparing and evaluating two models (e.g. before and after training)

Load two models and a stream of text, compare their predictions and select which result you prefer. The outputs will be randomized, so you won’t know which model is which. When you stop the server, the results are calculated. This recipe is especially helpful if you’re updating an existing model or if you’re trying out a new strategy on the same problem. Even if two models achieve similar accuracy, one of them can still be subjectively “better”, so this recipe lets you analyze that.

prodigy ner.eval-ab dataset model_a model_b source --loader --label --exclude --unsegmented

Argument	Type	Description	Default
`dataset`	str	Prodigy dataset to save annotations to.
`model_a`	str	First loadable spaCy pipeline to compare.
`model_b`	str	Second loadable spaCy pipeline to compare.
`source`	str	Path to text source or `-` to read from standard input.
`--loader`, `-lo`	str	Optional ID of text source loader. If not set, source file extension is used to determine loader.	`None`
`--label`, `-l`	str	One or more labels to annotate. Supports a comma-separated list or a path to a file with one label per line. If no labels are set, Prodigy will check the model for available labels.
`--exclude`, `-e`	str	Comma-separated list of dataset IDs containing annotations to exclude.	`None`
`--unsegmented`, `-U`	bool	Don’t split sentences.	`False`
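
The blind comparison described for `ner.eval-ab` can be sketched as follows. This is an illustration under stated assumptions, not the recipe's code: the task layout, the hidden `mapping` key, and both helper functions are hypothetical.

```python
import random

def make_comparison_task(text, output_a, output_b, rng):
    """Build one comparison task with the two model outputs in random order."""
    options = [
        {"model": "A", "output": output_a},
        {"model": "B", "output": output_b},
    ]
    rng.shuffle(options)
    return {
        "text": text,
        # The annotator only sees positions 0 and 1, never the model names
        "options": [{"id": i, "output": o["output"]} for i, o in enumerate(options)],
        # Kept aside so the choice can be attributed once annotation is done
        "mapping": {i: o["model"] for i, o in enumerate(options)},
    }

def tally_preferences(answered_tasks):
    """Count how often each model's output was preferred."""
    counts = {"A": 0, "B": 0}
    for task in answered_tasks:
        counts[task["mapping"][task["chosen"]]] += 1
    return counts

rng = random.Random(0)
task = make_comparison_task(
    "Apple hires Tim Cook",
    ["Apple/ORG"],
    ["Apple/ORG", "Tim Cook/PERSON"],
    rng,
)
task["chosen"] = 0  # the annotator picked the first option shown
print(tally_preferences([task]))
```

Shuffling per task is what keeps the annotator from learning which position belongs to which model.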

Example

prodigy ner.eval-ab eval_dataset en_core_web_sm ./improved_ner_model ./news_headlines.jsonl

`ner.model-annotate` command New: 1.13.1

Interface: terminal only
Saves: model annotations to the database
Use case: use a model to annotate examples as-if done by a human

Leverage a model to add NER annotations to the database. You can repeat this multiple times with different models so that you may easily compare their predictions using the review recipe and curate examples where models disagree.

For more information on this method of annotating you can consult this guide.

prodigy ner.model-annotate dataset spacy_model source model_alias --label --loader --component

Argument	Type	Description	Default
`dataset`	str	Prodigy dataset to save annotations to.
`spacy_model`	str	Loadable spaCy pipeline that can do named entity recognition.
`source`	str	Path to text source or `-` to read from standard input.
`model_alias`	str	Model alias to be used as “annotator id” in the UI.
`--label`, `-l`	str	Optional subset of labels to annotate. Supports a comma-separated list or a path to a file with one label per line. If no labels are set, Prodigy will check the model for available labels.	`None`
`--loader`, `-lo`	str	Optional ID of text source loader. If not set, source file extension is used to determine loader.	`None`
`--component`	str	Specific NER component to use in spaCy pipeline.	`ner`

Example

prodigy ner.model-annotate ner_headlines en_core_web_md ./news_headlines.jsonl spacy_md --labels PERSON
Getting labels from the 'ner' component
Using 1 labels: ['PERSON']
100%|███████████████████████████████████████████████████████| 201/201 [00:00<00:00, 395.14it/s]

Span Categorization

`spans.manual` manual New: 1.11

Interface: spans_manual
Saves: annotations to the database
Use case: highlight arbitrary and potentially overlapping spans of text manually or semi-manually

prodigy spans.manual dataset spacy_model source --loader --label --patterns --suggester --exclude --highlight-chars --edit-text

Argument	Type	Description	Default
`dataset`	str	Prodigy dataset to save annotations to.
`spacy_model`	str	Loadable spaCy pipeline for tokenization or `blank:lang` for a blank model (e.g. `blank:en` for English).
`source`	str	Path to text source or `-` to read from standard input.
`--loader`, `-lo`	str	Optional ID of text source loader. If not set, source file extension is used to determine loader.	`None`
`--label`, `-l`	str	One or more labels to annotate. Supports a comma-separated list or a path to a file with one label per line. If no labels are set, Prodigy will check the model for available labels.
`--patterns`, `-pt`	str	Optional path to match patterns file to pre-highlight entity spans.	`None`
`--suggester`, `-sg`	str	Optional name of suggester function registered in spaCy’s `misc` registry. If set, annotations will be validated against the suggester during annotation and you will see an error if the annotation doesn’t match any suggestions. Should be a function that creates the suggester with all required arguments. You can use the `-F` option to provide a Python file.	`None`
`--exclude`, `-e`	str	Comma-separated list of dataset IDs containing annotations to exclude.	`None`
`--highlight-chars`, `-C`	bool	New: 1.14.5 Allow switching between highlighting individual characters and tokens. If set, character highlighting is enabled by default and no `"tokens"` information will be saved with the example.	`False`
`--edit-text`, `-E`	bool	New: 1.18.0 Allow editing text during annotation. On save, the text will be retokenized by the recipe and existing spans will be reset, but you can change how the task JSON is regenerated by providing a custom event hook.	`False`

Example

prodigy spans.manual covid_articles blank:en ./journal_papers.jsonl --label FACTOR,CONDITION,METHOD,EFFECT

If you’re using a custom suggester function for the span categorizer, you can provide it via the --suggester argument and Prodigy will validate submitted annotations against it as you annotate. If you’re not using a suggester, data-to-spacy and train will infer the best-matching ngram suggester based on the available span annotations in your data.

Example with suggester validation

prodigy spans.manual covid_articles blank:en ./journal_papers.jsonl --label CONDITION --suggester 123_ngram_suggester.v1 -F ./suggester.py

suggester.py
from spacy import registry
from spacy.pipeline.spancat import build_ngram_suggester

@registry.misc("123_ngram_suggester.v1")
def custom_ngram_suggester():
    return build_ngram_suggester(sizes=[1, 2, 3])  # all ngrams of size 1, 2 and 3
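
To make concrete what validating against such a suggester means, the following pure-Python sketch enumerates the candidate spans that n-grams of sizes 1, 2 and 3 would cover and checks annotations against them. It does not call spaCy; `ngram_candidates` and `is_valid_annotation` are hypothetical helpers that only mirror the idea.

```python
def ngram_candidates(n_tokens, sizes):
    """Return the (start, end) token offsets, end exclusive, of every n-gram."""
    return {
        (start, start + size)
        for size in sizes
        for start in range(n_tokens - size + 1)
    }

def is_valid_annotation(span, n_tokens, sizes=(1, 2, 3)):
    """An annotated span is valid only if the suggester would propose it."""
    return span in ngram_candidates(n_tokens, sizes)

tokens = ["acute", "respiratory", "distress", "syndrome"]
candidates = ngram_candidates(len(tokens), [1, 2, 3])
print(len(candidates))                           # 4 + 3 + 2 = 9
print(is_valid_annotation((0, 3), len(tokens)))  # True
print(is_valid_annotation((0, 4), len(tokens)))  # False: four tokens is too long
```

A span the suggester never proposes cannot be learned by the span categorizer, which is why catching it during annotation is useful.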

`spans.correct` manual New: 1.11.1

Interface: spans_manual
Saves: annotations to the database
Use case: correct a spaCy model's predictions manually

Create gold-standard data for span categorization by correcting the model’s predictions. Requires a spaCy pipeline with a trained span categorizer and will show all spans in the given group. To customize the span group to read from, you can use the --key argument.

prodigy spans.correct dataset spacy_model source --loader --label --update --exclude --component

Argument	Type	Description	Default
`dataset`	str	Prodigy dataset to save annotations to.
`spacy_model`	str	Loadable spaCy pipeline.
`source`	str	Path to text source or `-` to read from standard input.
`--loader`, `-lo`	str	Optional ID of text source loader. If not set, source file extension is used to determine loader.	`None`
`--label`, `-l`	str	One or more labels to annotate. Supports a comma-separated list or a path to a file with one label per line. If no labels are set, Prodigy will check the model for available labels.
`--update`, `-UP`	bool	Update the model in the loop with the received annotations.	`False`
`--exclude`, `-e`	str	Comma-separated list of dataset IDs containing annotations to exclude.	`None`
`--component`, `-c`	str	Name of span categorizer component in the pipeline.	`"spancat"`

Example

prodigy spans.correct gold_spans ./spancat_model ./journal_papers_new.jsonl

`spans.model-annotate` command New: 1.13.1

Interface: terminal only
Saves: model annotations to the database
Use case: use a model to annotate examples as-if done by a human

Leverage a model to add span annotations to the database. You can repeat this multiple times with different models so that you may easily compare their predictions using the review recipe and curate examples where models disagree.

For more information on this method of annotating you can consult this guide.

prodigy spans.model-annotate dataset spacy_model source model_alias --labels --loader --component

Argument	Type	Description	Default
`dataset`	str	Prodigy dataset to save annotations to.
`spacy_model`	str	Loadable spaCy pipeline that can do span categorization.
`source`	str	Path to text source or `-` to read from standard input.
`model_alias`	str	Model alias to be used as “annotator id” in the UI.
`--labels`, `-l`	str	Optional subset of labels to annotate. Supports a comma-separated list or a path to a file with one label per line. If no labels are set, Prodigy will check the model for available labels.	`None`
`--loader`, `-lo`	str	Optional ID of text source loader. If not set, source file extension is used to determine loader.	`None`
`--component`	str	Specific spancat component to use in spaCy pipeline.	`spancat`

Example

prodigy spans.model-annotate spans_headlines ./local_spacy_model ./news_headlines.jsonl spacy_local --labels PERSON,COMPANY,CITY
Getting labels from the 'spancat' component
Using 3 labels: ['PERSON', 'COMPANY', 'CITY']
100%|███████████████████████████████████████████████████████| 201/201 [00:00<00:00, 395.14it/s]

Text Classification

`textcat.manual` manual

Interface: choice / classification
Saves: annotations to the database
Use case: select one or more categories to apply to the text

Manually annotate categories that apply to a text. If only one label is set, the classification interface is used. If more than one label is specified, the choice interface is used and categories are added as multiple choice options. If the --exclusive flag is set, categories become mutually exclusive, meaning that only one can be selected during annotation.
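
The interface selection described above can be summarized as a small decision function. This is a sketch of the documented behavior, not the recipe's source. `choose_interface` is a hypothetical helper, and the returned keys are assumed to correspond to the interface ID and choice settings used elsewhere in Prodigy.

```python
def choose_interface(labels, exclusive=False):
    """Pick the annotation interface from the label set and --exclusive flag."""
    if len(labels) == 1:
        # A single label becomes a binary accept/reject question
        return {"view_id": "classification", "label": labels[0]}
    return {
        "view_id": "choice",
        # Mutually exclusive labels allow only one selection per example
        "choice_style": "single" if exclusive else "multiple",
        "options": [{"id": label, "text": label} for label in labels],
    }

print(choose_interface(["SPAM"])["view_id"])                                   # classification
print(choose_interface(["Technology", "Politics"])["choice_style"])            # multiple
print(choose_interface(["Technology", "Politics"], exclusive=True)["choice_style"])  # single
```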

prodigy textcat.manual dataset source --loader --label --exclusive --exclude --accept-empty

Argument	Type	Description	Default
`dataset`	str	Prodigy dataset to save annotations to.
`source`	str	Path to text source or `-` to read from standard input.
`--loader`, `-lo`	str	Optional ID of text source loader. If not set, source file extension is used to determine loader.	`None`
`--label`, `-l`	str	Category label to apply.	`''`
`--exclusive`, `-E`	bool	Treat labels as mutually exclusive. If not set, an example may have multiple correct classes.	`False`
`--exclude`, `-e`	str	Comma-separated list of dataset IDs containing annotations to exclude.	`None`
`--accept-empty`, `-ae`	bool	New: 1.14.7 Allow empty choices, even when annotating mutually exclusive classes.	`False`

Example

prodigy textcat.manual news_topics ./news_headlines.jsonl --label Technology,Politics,Economy,Entertainment

`textcat.correct` manual New: 1.11

Interface: choice
Saves: annotations to the database
Use case: correct a spaCy model's predictions manually

Create training data for an existing trained text classification model by correcting the model’s suggestions. The --threshold is used to determine whether a label should be pre-selected, e.g. if it’s set to 0.5 (default), all labels with a score of 0.5 and above will be checked automatically. Prodigy will automatically infer whether the categories are mutually exclusive, based on the component configuration.

prodigy textcat.correct dataset spacy_model source --loader --label --update --threshold --component --exclude --accept-empty

Argument	Type	Description	Default
`dataset`	str	Prodigy dataset to save annotations to.
`spacy_model`	str	Loadable spaCy pipeline.
`source`	str	Path to text source or `-` to read from standard input.
`--loader`, `-lo`	str	Optional ID of text source loader. If not set, source file extension is used to determine loader.	`None`
`--label`, `-l`	str	One or more labels to annotate. Supports a comma-separated list or a path to a file with one label per line. If no labels are set, Prodigy will check the model for available labels.
`--update`, `-UP`	bool	Update the model in the loop with the received annotations.	`False`
`--threshold`, `-t`	float	Score threshold to pre-select label, e.g. `0.75` to select all labels with a score of `0.75` and above.	`0.5`
`--component`, `-c`	str	Name of text classification component in the pipeline. Will be guessed if not set.	`None`
`--exclude`, `-e`	str	Comma-separated list of dataset IDs containing annotations to exclude.	`None`
`--accept-empty`, `-ae`	bool	New: 1.14.7 Allow empty choices, even when annotating mutually exclusive classes.	`False`
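
The effect of `--threshold` on pre-selection can be sketched as below. The scores are made up, and `preselect_labels` is a hypothetical helper rather than part of the recipe. The `accept` key is assumed to hold the selected option IDs, as in the choice interface.

```python
def preselect_labels(scores, threshold=0.5):
    """Return the labels whose score is at or above the threshold."""
    return [label for label, score in scores.items() if score >= threshold]

# Hypothetical category scores predicted by a text classifier
scores = {"Technology": 0.91, "Politics": 0.5, "Economy": 0.32, "Entertainment": 0.04}

task = {
    "text": "Chipmaker shares rally after election",
    "accept": preselect_labels(scores),  # options checked when the task is shown
}
print(task["accept"])                  # ['Technology', 'Politics']
print(preselect_labels(scores, 0.75))  # ['Technology']
```

Raising the threshold means fewer boxes are checked up front, so the annotator adds labels more often than they remove them.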

Example

prodigy textcat.correct news_topics ./en_textcat_news ./news_headlines.jsonl --label Technology,Politics,Economy,Entertainment

`textcat.teach` binary

Interface: classification
Saves: annotations to the database
Updates: spaCy model in the loop
Active learning: prefers most uncertain scores
Use case: updating and improving text classification models

Collect the best possible training data for a text classification model by using a model in the loop. Based on your annotations, Prodigy will decide which questions to ask next. All annotations will be stored in the database. If a patterns file is supplied via the --patterns argument, the matches will be included in the stream and the matched spans are highlighted, so you’re able to tell which words or phrases the selection was based on. Note that the exact pattern matches have no influence when updating the model – they’re only used to help pre-select examples for annotation.

prodigy textcat.teach dataset spacy_model source --loader --label --patterns --exclude

Argument	Type	Description	Default
`dataset`	str	Prodigy dataset to save annotations to.
`spacy_model`	str	Loadable spaCy pipeline or `blank:lang` for a blank model (e.g. `blank:en` for English).
`source`	str	Path to text source or `-` to read from standard input.
`--loader`, `-lo`	str	Optional ID of text source loader. If not set, source file extension is used to determine loader.	`None`
`--label`, `-l`	str	Category label to apply.	`''`
`--patterns`, `-pt`	str	Optional path to match patterns file to filter out examples containing terms and phrases.	`None`
`--exclude`, `-e`	str	Comma-separated list of dataset IDs containing annotations to exclude.	`None`

Example

prodigy textcat.teach news_topics en_core_web_sm ./news_headlines.jsonl --label Technology,Politics,Economy,Entertainment

`textcat.model-annotate` command New: 1.13.1

Interface: terminal only
Saves: model annotations to the database
Use case: use a model to annotate examples as-if done by a human

Leverage a model to add textcat annotations to the database. You can repeat this multiple times with different models so that you may easily compare their predictions using the review recipe and curate examples where models disagree.

For more information on this method of annotating you can consult this guide.

prodigy textcat.model-annotate dataset spacy_model source model_alias --labels --loader --threshold --component

Argument	Type	Description	Default
`dataset`	str	Prodigy dataset to save annotations to.
`spacy_model`	str	Loadable spaCy pipeline that can do text classification.
`source`	str	Path to text source or `-` to read from standard input.
`model_alias`	str	Model alias to be used as “annotator id” in the UI.
`--labels`, `-l`	str	Optional subset of labels to annotate. Supports a comma-separated list or a path to a file with one label per line. If no labels are set, Prodigy will check the model for available labels.	`None`
`--loader`, `-lo`	str	Optional ID of text source loader. If not set, source file extension is used to determine loader.	`None`
`--threshold`	str	Override the threshold for the classification model.	`None`
`--component`	str	Specific textcat component to use in spaCy pipeline. Will try to make an educated guess if no component is passed.	`None`

Example

prodigy textcat.model-annotate textcat_headlines ./local_spacy_model ./news_headlines.jsonl spacy_local --labels POLICY,VC,AI
Getting labels from the 'textcat' component.
Using 3 labels: ['POLICY', 'VC', 'AI']
100%|███████████████████████████████████████████████████████| 201/201 [00:00<00:00, 395.14it/s]

Part-of-speech Tagging

`pos.correct` manual

Interface: pos_manual
Saves: annotations to the database
Use case: correct a spaCy model's predictions manually

Create gold-standard data for part-of-speech tagging by correcting the model’s suggestions. The spaCy pipeline will be used to predict fine-grained part-of-speech tags (Token.tag_), which the annotator can remove and correct if necessary. It’s often more efficient to focus on a few labels at a time, instead of annotating all labels jointly.

prodigy pos.correct dataset spacy_model source --loader --label --exclude --unsegmented

Argument	Type	Description	Default
`dataset`	str	Prodigy dataset to save annotations to.
`spacy_model`	str	Loadable spaCy pipeline.
`source`	str	Path to text source or `-` to read from standard input.
`--loader`, `-lo`	str	Optional ID of text source loader. If not set, source file extension is used to determine loader.	`None`
`--label`, `-l`	str	One or more tags to annotate. Supports a comma-separated list or a path to a file with one label per line. If not set, all tags are shown.
`--exclude`, `-e`	str	Comma-separated list of dataset IDs containing annotations to exclude.	`None`
`--unsegmented`, `-U`	bool	Don’t split sentences.	`False`

Example

prodigy pos.correct news_tag en_core_web_sm ./news_headlines.jsonl --label NN,NNS,NNP,NNPS

`pos.teach` binary

Interface: pos
Saves: annotations to the database
Updates: spaCy model in the loop
Active learning: prefers most uncertain scores
Use case: updating and improving part-of-speech tagging models

Collect the best possible training data for a part-of-speech tagging model with the model in the loop. Based on your annotations, Prodigy will decide which questions to ask next. It’s often more efficient to focus on a few labels at a time, instead of annotating all labels jointly.

prodigy pos.teach dataset spacy_model source --loader --label --exclude --unsegmented

Argument	Type	Description	Default
`dataset`	str	Prodigy dataset to save annotations to.
`spacy_model`	str	Loadable spaCy pipeline.
`source`	str	Path to text source or `-` to read from standard input.
`--loader`, `-lo`	str	Optional ID of text source loader. If not set, source file extension is used to determine loader.	`None`
`--label`, `-l`	str	Label(s) to annotate. Accepts single label or comma-separated list. If not set, all available labels will be returned.	`None`
`--exclude`, `-e`	str	Comma-separated list of dataset IDs containing annotations to exclude.	`None`
`--unsegmented`, `-U`	bool	Don’t split sentences.	`False`

Example

prodigy pos.teach tag_news en_core_web_sm ./news_headlines.jsonl --label NN,NNS,NNP,NNPS

Sentence Segmentation

`sent.correct` manual New: 1.11

Interface: pos_manual
Saves: annotations to the database
Use case: correct a spaCy model's predictions manually

Create gold-standard data for sentence segmentation by correcting the model’s suggestions. The spaCy pipeline will be used to predict sentence boundaries, which the annotator can correct if necessary. The recipe uses the label S to mark tokens that start a sentence. You can double-click a sentence start token in the UI to add a new sentence boundary, or click on an incorrect prediction to remove it.

prodigy sent.correct dataset spacy_model source --loader --exclude

Argument	Type	Description	Default
`dataset`	str	Prodigy dataset to save annotations to.
`spacy_model`	str	Loadable spaCy pipeline.
`source`	str	Path to text source or `-` to read from standard input.
`--loader`, `-lo`	str	Optional ID of text source loader. If not set, source file extension is used to determine loader.	`None`
`--exclude`, `-e`	str	Comma-separated list of dataset IDs containing annotations to exclude.	`None`

Example

prodigy sent.correct sentence_data xx_sent_ud_sm ./paragraphs.jsonl

`sent.teach` binary

Interface: pos
Saves: annotations to the database
Updates: spaCy model in the loop
Active learning: prefers most uncertain scores
Use case: updating and improving sentence segmentation models

Collect the best possible training data for a sentence segmentation model with the model in the loop. Based on your annotations, Prodigy will decide which questions to ask next. The recipe uses S to mark tokens that start sentences and I for all other tokens. You can then hit accept or reject, depending on whether the suggested token is correctly labelled as a sentence start or other token.

prodigy sent.teach dataset spacy_model source --loader --exclude

Argument	Type	Description	Default
`dataset`	str	Prodigy dataset to save annotations to.
`spacy_model`	str	Loadable spaCy pipeline with SentenceRecognizer.
`source`	str	Path to text source or `-` to read from standard input.
`--loader`, `-lo`	str	Optional ID of text source loader. If not set, source file extension is used to determine loader.	`None`
`--exclude`, `-e`	str	Comma-separated list of dataset IDs containing annotations to exclude.	`None`
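
The `S` / `I` labelling scheme used by the sentence recipes can be illustrated with a short sketch. It assumes sentences are already given as lists of tokens. `sentence_labels` is a hypothetical helper and not part of Prodigy.

```python
def sentence_labels(sentences):
    """Label the first token of each sentence S and every other token I."""
    labeled = []
    for sentence in sentences:
        for position, token in enumerate(sentence):
            labeled.append((token, "S" if position == 0 else "I"))
    return labeled

sentences = [["Hello", "world", "."], ["How", "are", "you", "?"]]
labeled = sentence_labels(sentences)
print([label for _, label in labeled])
# ['S', 'I', 'I', 'S', 'I', 'I', 'I']
```

In `sent.teach`, each accept/reject decision then concerns one such token and its suggested label.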

Example

prodigy sent.teach sent_news xx_sent_ud_sm ./news_headlines.jsonl

Dependency Parsing

`dep.correct` manual New: 1.10

Interface: relations
Saves: annotations to the database
Updates: spaCy model in the loop (if enabled)
Active learning: no example selection
Use case: correct a spaCy model's predictions manually

Create gold-standard data for dependency parsing by correcting the model’s suggestions. The spaCy pipeline will be used to predict dependencies for the given labels, which the annotator can remove and correct if necessary. If --update is set, the model in the loop will be updated with the annotations and its updated predictions will be reflected in future batches. The recipe performs no example selection and all texts will be shown as they come in.

prodigy dep.correct dataset spacy_model source --loader --label --update --wrap --unsegmented --exclude

Argument	Type	Description	Default
`dataset`	str	Prodigy dataset to save annotations to.
`spacy_model`	str	Loadable spaCy pipeline with a dependency parser.
`source`	str	Path to text source or `-` to read from standard input.
`--loader`, `-lo`	str	Optional ID of text source loader. If not set, source file extension is used to determine loader.	`None`
`--label`, `-l`	str	Label(s) to annotate. Accepts single label or comma-separated list. If not set, all available labels will be used.	`None`
`--update`, `-U`	bool	Whether to update the model in the loop during annotation.	`False`
`--wrap`, `-W`	bool	Wrap lines in the UI by default (instead of showing tokens in one row).	`False`
`--unsegmented`, `-U`	bool	Don’t split sentences.	`False`
`--exclude`, `-e`	str	Comma-separated list of dataset IDs containing annotations to exclude.	`None`

Example

prodigy dep.correct deps_news en_core_web_sm ./news_headlines.jsonl --label ROOT,csubj,nsubj,dobj,pobj --update

`dep.teach` binary

Interface: dep
Saves: annotations to the database
Updates: spaCy model in the loop
Active learning: prefers most uncertain scores
Use case: updating and improving dependency parsing models

Collect the best possible training data for a dependency parsing model with the model in the loop. Based on your annotations, Prodigy will decide which questions to ask next. It’s often more efficient to focus on a few most relevant labels at a time, instead of annotating all labels jointly.

prodigy dep.teach dataset spacy_model source --loader --label --exclude --unsegmented

Argument	Type	Description	Default
`dataset`	str	Prodigy dataset to save annotations to.
`spacy_model`	str	Loadable spaCy pipeline.
`source`	str	Path to text source or `-` to read from standard input.
`--loader`, `-lo`	str	Optional ID of text source loader. If not set, source file extension is used to determine loader.	`None`
`--label`, `-l`	str	Label(s) to annotate. Accepts single label or comma-separated list. If not set, all available labels will be used.	`None`
`--exclude`, `-e`	str	Comma-separated list of dataset IDs containing annotations to exclude.	`None`
`--unsegmented`, `-U`	bool	Don’t split sentences.	`False`

Example

prodigy dep.teach deps_news en_core_web_sm ./news_headlines.jsonl --label csubj,nsubj,dobj,pobj

Coreference Resolution

`coref.manual` manual New: 1.10

Interface: relations
Saves: annotations to the database
Use case: create annotations for coreference resolution

Create training data for coreference resolution. Coreference resolution is the challenge of linking ambiguous mentions such as “her” or “that woman” back to an antecedent providing more context about the entity in question. This recipe allows you to focus on nouns, proper nouns and pronouns specifically, by disabling all other tokens. You can customize the labels used to extract those using the recipe arguments. Also see the usage guide on coreference annotation.

prodigy coref.manual dataset spacy_model source --loader --label --pos-tags --poss-pron-tags --ner-labels --exclude

Argument	Type	Description	Default
`dataset`	str	Prodigy dataset to save annotations to.
`spacy_model`	str	Loadable spaCy pipeline with the required capabilities (entity recognizer and part-of-speech tagger) or `blank:lang` for a blank model (e.g. `blank:en` for English).
`source`	str	Path to text source, `-` to read from standard input or `dataset:name` to load from existing annotations.
`--loader`, `-lo`	str	Optional ID of text source loader. If not set, source file extension is used to determine loader.	`None`
`--label`, `-l`	str	Label(s) to use for coreference annotation. Accepts single label or comma-separated list.	`"COREF"`
`--pos-tags`, `-ps`	str	List of coarse-grained POS tags to enable for annotation.	`"NOUN,PROPN,PRON,DET"`
`--poss-pron-tags`, `-pp`	str	List of fine-grained tag values for possessive pronoun to use.	`"PRP$"`
`--ner-labels`, `-nl`	str	List of NER labels to use if model has a named entity recognizer.	`"PERSON,ORG"`
`--exclude`, `-e`	str	Comma-separated list of dataset IDs containing annotations to exclude.	`None`

Example

prodigy coref.manual coref_movies en_core_web_sm ./plot_summaries.jsonl --label COREF

Relations

`rel.manual` manual New: 1.10

Interface: relations
Saves: annotations to the database
Use case: annotate relations between words and expressions

Annotate directional relations and dependencies between tokens and expressions by selecting the head, child and dependency label and optionally assign labelled spans for named entities or other expressions. This workflow is extremely powerful and can be used for basic dependency annotation, as well as joint named entity and entity relation annotation. If --span-label defines additional span labels, a second mode for span highlighting is added.

The recipe lets you take advantage of several efficiency tricks: spans can be pre-defined using an existing NER dataset, entities or noun phrases from a model or fully custom match patterns. You can also disable certain tokens to make them unselectable. This lets you focus on what matters and prevents annotators from introducing mistakes. For more details and examples, check out the usage guide on custom relation annotation and see the task-specific recipes dep.correct and coref.manual that include pre-defined configurations.

prodigy rel.manual dataset spacy_model source --loader --label --span-label --patterns --disable-patterns --add-ents --add-nps --wrap --exclude --hide-arrow-heads

Argument	Type	Description	Default
`dataset`	str	Prodigy dataset to save annotations to.
`spacy_model`	str	Loadable spaCy pipeline with the required capabilities (if entities or noun phrases should be merged) or `blank:lang` for a blank model (e.g. `blank:en` for English).
`source`	str	Path to text source, `-` to read from standard input or `dataset:name` to load from existing annotations.
`--loader`, `-lo`	str	Optional ID of text source loader. If not set, source file extension is used to determine loader.	`None`
`--label`, `-l`	str	Label(s) to annotate. Accepts single label or comma-separated list.	`None`
`--span-label`, `-sl`	str	Optional span label(s) to annotate. If set, an additional span highlighting mode is added.	`None`
`--patterns`, `-pt`	str	Path to patterns file defining spans to be added and merged.	`None`
`--disable-patterns`, `-dpt`	str	Path to patterns file defining tokens to disable (make unselectable).	`None`
`--add-ents`, `-AE`	bool	Add entities predicted by the model.	`False`
`--add-nps`, `-AN`	bool	Add noun phrases (if noun chunks rules are available), based on tagger and parser.	`False`
`--wrap`, `-W`	bool	Wrap lines in the UI by default (instead of showing tokens in one row).	`False`
`--exclude`, `-e`	str	Comma-separated list of dataset IDs containing annotations to exclude.	`None`
`--hide-arrow-heads`, `-HA`	bool	Hide the arrow heads visually.	`False`

Example

prodigy rel.manual relation_data en_core_web_sm ./data.jsonl --label COREF,OBJECT --span-label PERSON,PRODUCT,NP --disable-patterns ./disable_patterns.jsonl --add-ents --wrap

disable_patterns.jsonl
{"pattern": [{"is_punct": true}]}
{"pattern": [{"pos": "VERB"}]}
{"pattern": [{"lower": {"in": ["'s", "’s"]}}]}
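
A patterns file like the one above can also be generated from Python rather than written by hand. The following is a minimal sketch using only the standard library. The helper name `to_jsonl` is an illustrative assumption and not part of Prodigy; the patterns themselves are copied from the example above.

```python
import json

# Token patterns copied from the disable_patterns.jsonl example.
DISABLE_PATTERNS = [
    {"pattern": [{"is_punct": True}]},
    {"pattern": [{"pos": "VERB"}]},
    {"pattern": [{"lower": {"in": ["'s", "’s"]}}]},
]


def to_jsonl(patterns):
    """Serialize patterns as JSONL: one JSON object per line (hypothetical helper)."""
    return "\n".join(json.dumps(p, ensure_ascii=False) for p in patterns) + "\n"


if __name__ == "__main__":
    # Write the file that is passed to --disable-patterns.
    with open("disable_patterns.jsonl", "w", encoding="utf-8") as f:
        f.write(to_jsonl(DISABLE_PATTERNS))
```

Generating the file programmatically can be convenient when the list of tokens to disable is derived from other data, for example a list of stop words.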

Computer Vision

`image.manual` manual

Interface: image_manual
Saves: annotations to the database
Use case: Add bounding boxes and segments to images

Annotate images by drawing rectangular bounding boxes and polygon shapes. Each shape will be added to the task’s "spans" with its label and a "points" property containing the [x, y] pixel coordinate tuples. See here for more details on the JSONL format. You can click and drag or click and release to draw boxes. Polygon shapes can also be closed by double-clicking when adding the last point, similar to closing a shape in Photoshop or Illustrator. Clicking on the label will select a shape so you can change the label or delete it.
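
To illustrate how the "points" of a shape might be consumed downstream, here is a small sketch that computes the enclosing box of a polygon. The task dictionary is made up for illustration and only uses the "spans", "label" and "points" properties described above; the helper `bounding_box` is hypothetical and not a Prodigy function.

```python
def bounding_box(points):
    """Return (x, y, width, height) of the box enclosing a list of [x, y] points."""
    xs = [x for x, _ in points]
    ys = [y for _, y in points]
    return min(xs), min(ys), max(xs) - min(xs), max(ys) - min(ys)


# Hypothetical annotated task: one polygon span with pixel coordinates.
task = {
    "image": "photo.jpg",
    "spans": [
        {
            "label": "LAPTOP",
            "points": [[150, 80], [270, 100], [250, 200], [170, 190]],
        },
    ],
}

for span in task["spans"]:
    print(span["label"], bounding_box(span["points"]))
```

The same helper works for rectangles and polygons alike, since both are stored as a list of points.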

prodigy image.manual dataset source --loader --label --exclude --width --darken --no-fetch --remove-base64

Argument	Type	Description	Default
`dataset`	str	Prodigy dataset to save annotations to.
`source`	str	Path to a directory containing image files or pre-formatted JSONL file if `--loader jsonl` is set.
`--loader`, `-lo`	str	Optional ID of source loader.	`images`
`--label`, `-l`	str / `Path`	One or more labels to annotate. Supports a comma-separated list or a path to a file with one label per line.
`--exclude`, `-e`	str	Comma-separated list of dataset IDs containing annotations to exclude.	`None`
`--width`, `-w`	int	New: 1.10 Width of card and maximum image width in pixels.	`675`
`--darken`, `-D`	bool	Darken image to make boxes stand out more.	`False`
`--no-fetch`, `-NF`	bool	New: 1.9 Don’t fetch images as base64. Ideally requires a JSONL file as input, with `--loader jsonl` set and all images available as URLs.	`False`
`--remove-base64`, `-R`	bool	New: 1.10 Remove base64-encoded image data before storing example in the database and only keep the reference to the local file path. Caution: If enabled, make sure to keep original files!	`False`

Example

prodigy image.manual photo_objects ./stock-photos --label LAPTOP,CUP,PLANT

If you organize your images in subdirectories, you can set --loader pages to group them together in a single interface using the pages UI. This can be especially useful for multi-page documents or collections of images that should be viewed together. If you’re working with PDFs, the Prodigy-PDF plugin also supports loading paginated documents.

Example

prodigy image.manual papers_layout ./documents --loader pages --label FIGURE,FOOTNOTE,PARAGRAPH

Audio and Video

`audio.manual` manual New: 1.10

Interface: audio_manual
Saves: annotations to the database
Use case: Manually annotate audio regions in audio and video files

Manually label regions for the given labels in the audio or video file. The recipe expects a directory of audio files as the source argument and will use the audio loader (default) to load the data. To load video files instead, you can set --loader video. Each added region will be added to the "audio_spans" with a start and end timestamp and the selected label.
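
As a rough sketch of how the saved regions could be analyzed afterwards, the snippet below sums the annotated duration per label. The task is invented for illustration, and the key names "start", "end" and "label" inside each region are assumptions based on the description above.

```python
from collections import defaultdict


def seconds_per_label(task):
    """Sum the length of all annotated regions per label (hypothetical helper)."""
    totals = defaultdict(float)
    for span in task.get("audio_spans", []):
        totals[span["label"]] += span["end"] - span["start"]
    return dict(totals)


# Made-up example task with three annotated regions (timestamps in seconds).
task = {
    "audio": "recording.wav",
    "audio_spans": [
        {"start": 0.5, "end": 2.0, "label": "SPEAKER_1"},
        {"start": 2.0, "end": 3.5, "label": "SPEAKER_2"},
        {"start": 4.0, "end": 6.0, "label": "SPEAKER_1"},
    ],
}

print(seconds_per_label(task))
```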

prodigy audio.manual dataset source --loader --label --autoplay --keep-base64 --fetch-media --exclude

Argument	Type	Description	Default
`dataset`	str	Prodigy dataset to save annotations to.
`source`	str	Path to a directory containing audio files or pre-formatted JSONL file if `--loader jsonl` is set.
`--loader`, `-lo`	str	Optional ID of source loader, e.g. `audio` or `video`.	`audio`
`--label`, `-l`	str / `Path`	One or more labels to annotate. Supports a comma-separated list or a path to a file with one label per line.
`--autoplay`, `-A`	bool	Autoplay the audio when a new task loads.	`False`
`--keep-base64`, `-B`	bool	If `audio` loader is used: don’t remove the base64-encoded audio data from the task before it’s saved to the database.	`False`
`--fetch-media`, `-FM`	bool	Convert local paths and URLs to base64. Can be enabled if you’re annotating a JSONL file with paths or for re-annotating an existing dataset.	`False`
`--exclude`, `-e`	str	Comma-separated list of dataset IDs containing annotations to exclude.	`None`

Example

prodigy audio.manual speaker_data ./recordings --label SPEAKER_1,SPEAKER_2,NOISE

Recipe command

prodigy audio.manual speaker_data ./recordings --loader video --label SPEAKER_1,SPEAKER_2

`audio.transcribe` manual New: 1.10

Interface: blocks / audio / text_input
Saves: annotations to the database
Use case: Manually create transcriptions for audio and video files

Manually transcribe audio and video files by typing the transcript into a text field. The recipe expects a directory of audio files as the source argument and will use the audio loader (default) to load the data. To load video files instead, you can set --loader video. The transcript will be stored under the key "transcript" by default, which can be changed via --field-id. To make it easier to toggle play and pause as you transcribe and to prevent clashes with the text input field (such as the default enter key), this recipe lets you customize the keyboard shortcuts. To toggle play/pause, you can press command/option/alt/ctrl+enter or provide your own overrides via --playpause-key.

prodigy audio.transcribe dataset source --loader --autoplay --keep-base64 --fetch-media --playpause-key --text-rows --field-id --exclude

Argument	Type	Description	Default
`dataset`	str	Prodigy dataset to save annotations to.
`source`	str	Path to a directory containing audio files or pre-formatted JSONL file if `--loader jsonl` is set.
`--loader`, `-lo`	str	Optional ID of source loader, e.g. `audio` or `video`.	`audio`
`--autoplay`, `-A`	bool	Autoplay the audio when a new task loads.	`False`
`--keep-base64`, `-B`	bool	If `audio` loader is used: don’t remove the base64-encoded audio data from the task before it’s saved to the database.	`False`
`--fetch-media`, `-FM`	bool	Convert local paths and URLs to base64. Can be enabled if you’re annotating a JSONL file with paths or for re-annotating an existing dataset.	`False`
`--playpause-key`, `-pk`	str	Alternative keyboard shortcuts to toggle play/pause so it doesn’t conflict with text input field.	`"command+enter, option+enter, ctrl+enter"`
`--text-rows`, `-tr`	int	Height of the text input field, in rows.	`4`
`--field-id`, `-fi`	str	New: 1.10.1 Add the transcript text to the data using this key, e.g. `"transcript": "Text here"`.	`"transcript"`
`--exclude`, `-e`	str	Comma-separated list of dataset IDs containing annotations to exclude.	`None`

Example

prodigy audio.transcribe speaker_transcripts ./recordings --text-rows 3

Training models

`train` command New: 1.9

Interface: terminal only
Saves: trained model to a directory (optional)
Use case: run training experiments

Train a model with one or more components (NER, text classification, tagger, parser, sentence recognizer or span categorizer) using one or more Prodigy datasets with annotations. The recipe calls into spaCy directly and can update an existing model or train a new model from scratch. For each component, you can provide optional datasets for evaluation using the eval: prefix, e.g. --ner dataset,eval:eval_dataset. If no evaluation sets are specified, the --eval-split is used to determine the percentage held back for evaluation.
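
To make the eval: convention concrete, the sketch below shows one way a value such as dataset,eval:eval_dataset could be separated into training and evaluation dataset names. This is an illustration only; the function `split_datasets` is hypothetical and is not how Prodigy necessarily parses the argument internally.

```python
def split_datasets(value):
    """Split "a,b,eval:c" into (["a", "b"], ["c"]). Hypothetical helper."""
    train, evaluation = [], []
    for name in value.split(","):
        name = name.strip()
        if not name:
            continue  # ignore empty entries such as a trailing comma
        if name.startswith("eval:"):
            evaluation.append(name[len("eval:"):])
        else:
            train.append(name)
    return train, evaluation


train_sets, eval_sets = split_datasets(
    "fashion_brands_training,eval:fashion_brands_eval"
)
print("Training:", train_sets)
print("Evaluation:", eval_sets)
```

If the evaluation list comes back empty for a component, that is the case in which the --eval-split percentage would apply.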

Datasets will be merged and conflicts will be filtered out. If your data contains potentially conflicting annotations, it’s recommended to first use review to resolve them. If you specify an output directory as the first argument, the best model will be saved at the end. You can then load it into spaCy by pointing spacy.load at the directory.

prodigy train output_dir --ner --textcat --textcat-multilabel --tagger --parser --senter --spancat --coref --eval-split --config --base-model --lang --label-stats --verbose --silent --gpu-id overrides -F

Argument	Type	Description	Default
`output_dir`	str	Path to output directory. If not set, nothing will be saved.	`None`
`--ner`, `-n`	str	One or more (comma-separated) datasets for the named entity recognizer. Use the `eval:` prefix for evaluation sets.	`None`
`--textcat`, `-tc`	str	One or more (comma-separated) datasets for the text classifier (exclusive categories). Use the `eval:` prefix for evaluation sets.	`None`
`--textcat-multilabel`, `-tcm`	str	One or more (comma-separated) datasets for the text classifier (non-exclusive categories). Use the `eval:` prefix for evaluation sets.	`None`
`--tagger`, `-t`	str	One or more (comma-separated) datasets for the part-of-speech tagger. Use the `eval:` prefix for evaluation sets.	`None`
`--parser`, `-p`	str	One or more (comma-separated) datasets for the dependency parser. Use the `eval:` prefix for evaluation sets.	`None`
`--senter`, `-s`	str	One or more (comma-separated) datasets for the sentence recognizer. Use the `eval:` prefix for evaluation sets.	`None`
`--spancat`, `-sc`	str	One or more (comma-separated) datasets for the span categorizer. Use the `eval:` prefix for evaluation sets.	`None`
`--coref`, `-co`	str	New: 1.12 One or more (comma-separated) datasets for the coreference model. Requires spacy-experimental. Use the `eval:` prefix for evaluation sets.	`None`
`--config`, `-c`	str	Optional path to training `config.cfg` to use. If not set, it will be auto-generated using the default settings.	`None`
`--base-model`, `-m`	str	Optional spaCy pipeline to update or use for tokenization and sentence segmentation.	`None`
`--lang`, `-l`	str	Code of language to use if no config or base model are provided.	`"en"`
`--eval-split`, `-es`	float	If no evaluation sets are provided for a component, split off a percentage of the training examples for evaluation.	`0.2`
`--label-stats`, `-L`	bool	Show a breakdown of per-label stats after training.	`False`
`--verbose`, `-V`	bool	Enable verbose logging.	`False`
`--silent`, `-S`	bool	Don’t print any updates.	`False`
`--gpu-id`, `-g`	int	GPU ID for training on GPU or `-1` for CPU.	`-1`
`overrides`	any	Config parameters to override. Should be options starting with `--` that correspond to the config section and value to override, e.g. `--training.max_epochs=5`.	`None`
`-F`	str	One or more comma-separated paths to Python files to import, e.g. for custom registered functions.	`None`

Example

prodigy train --ner fashion_brands_training,eval:fashion_brands_eval

======================== Generating Prodigy config ========================
ℹ Auto-generating config with spaCy
✔ Generated training config

============================ Training pipeline ============================
Components: ner
Merging training and evaluation data for 1 components
  - [ner] Training: 685 | Evaluation: 300 (from datasets)
Training: 685 | Evaluation: 300
Labels: ner (1)
ℹ Pipeline: ['tok2vec', 'ner']
ℹ Initial learn rate: 0.001

============================ Training pipeline ============================
E    #       LOSS TOK2VEC  LOSS NER  ENTS_F  ENTS_P  ENTS_R  SCORE
0       0          0.00     26.50    0.73    0.39    5.43    0.01
0     200         33.58    847.68   10.88   44.44    6.20    0.11
1     400         70.88    267.65   33.50   45.95   26.36    0.33
2     600         67.56    156.63   45.32   62.16   35.66    0.45
3     800        138.28    134.12   48.17   74.19   35.66    0.48
4    1000        177.95    109.77   51.43   66.67   41.86    0.51
6    1200         94.95     52.13   54.63   67.82   45.74    0.55
8    1400        126.85     66.19   56.00   65.62   48.84    0.56
10    1600         38.34     24.16   51.96   70.67   41.09    0.52
13    1800        105.14     23.23   56.88   69.66   48.06    0.57
16    2000         32.27     12.44   54.55   71.25   44.19    0.55

The example below shows how you might run the same command with overrides. Notice how this training run uses fewer steps and a different learning rate.

Example with overrides

prodigy train --ner fashion_brands_training,eval:fashion_brands_eval --training.max_steps=1000 --training.optimizer.learn_rate=0.002

======================== Generating Prodigy config ========================
ℹ Auto-generating config with spaCy
✔ Generated training config

============================ Training pipeline ============================
Components: ner
Merging training and evaluation data for 1 components
  - [ner] Training: 685 | Evaluation: 300 (from datasets)
Training: 685 | Evaluation: 300
Labels: ner (1)
ℹ Pipeline: ['tok2vec', 'ner']
ℹ Initial learn rate: 0.002

============================ Training pipeline ============================
E    #       LOSS TOK2VEC  LOSS NER  ENTS_F  ENTS_P  ENTS_R  SCORE
0       0          0.00     26.50    0.73    0.39    5.43    0.01
0     200         33.58    847.68   10.88   44.44    6.20    0.11
1     400         70.88    267.65   33.50   45.95   26.36    0.33
2     600         67.56    156.63   45.32   62.16   35.66    0.45
3     800        138.28    134.12   48.17   74.19   35.66    0.48
4    1000        177.95    109.77   51.43   66.67   41.86    0.51

`train-curve` command New: 1.9

Interface: terminal only
Use case: test how accuracy improves with more data

Train a model with one or more components (NER, text classification, tagger, parser, sentence recognizer or span categorizer) with different portions of the training examples and print the accuracy figures and accuracy improvements with more data. This recipe takes pretty much the same arguments as train. --n-samples sets the number of sample models to train at different stages. For instance, 10 will train models for 10% of the examples, 20%, 30% and so on. This recipe is useful to determine the quality of the collected annotations, and whether more training examples will improve the accuracy. As a rule of thumb, if accuracy improves within the last 25%, training with more examples will likely result in better accuracy.
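
The relationship between --n-samples and the portions of data used can be sketched in a few lines. This is an illustration of the arithmetic described above, not Prodigy's implementation; both helper names are hypothetical and the scores are illustrative values.

```python
def sample_portions(n_samples):
    """Fractions of the training data used for each run, e.g. 4 -> 25%, 50%, 75%, 100%."""
    return [(i + 1) / n_samples for i in range(n_samples)]


def improved_in_last_segment(scores):
    """Rule of thumb: did the score still go up in the final segment?"""
    return len(scores) >= 2 and scores[-1] > scores[-2]


portions = sample_portions(4)
scores = [0.31, 0.44, 0.43, 0.56]  # illustrative scores, one per portion

for portion, score in zip(portions, scores):
    print(f"{portion:>4.0%}  {score:.2f}")

if improved_in_last_segment(scores):
    print("Accuracy improved in the last sample; more data may help.")
```

A dip in the middle of the curve, as in the illustrative scores, does not change the rule of thumb: only the final segment is compared.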

prodigy train-curve --ner --textcat --textcat-multilabel --tagger --parser --senter --spancat --coref --eval-split --config --base-model --lang --gpu-id --n-samples --show-plot overrides -F

Argument	Type	Description	Default
`--ner`, `-n`	str	One or more (comma-separated) datasets for the named entity recognizer. Use the `eval:` prefix for evaluation sets.	`None`
`--textcat`, `-tc`	str	One or more (comma-separated) datasets for the text classifier (exclusive categories). Use the `eval:` prefix for evaluation sets.	`None`
`--textcat-multilabel`, `-tcm`	str	One or more (comma-separated) datasets for the text classifier (non-exclusive categories). Use the `eval:` prefix for evaluation sets.	`None`
`--tagger`, `-t`	str	One or more (comma-separated) datasets for the part-of-speech tagger. Use the `eval:` prefix for evaluation sets.	`None`
`--parser`, `-p`	str	One or more (comma-separated) datasets for the dependency parser. Use the `eval:` prefix for evaluation sets.	`None`
`--senter`, `-s`	str	One or more (comma-separated) datasets for the sentence recognizer. Use the `eval:` prefix for evaluation sets.	`None`
`--spancat`, `-sc`	str	One or more (comma-separated) datasets for the span categorizer. Use the `eval:` prefix for evaluation sets.	`None`
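`--coref`, `-co`	str	New: 1.12 One or more (comma-separated) datasets for the coreference model. Requires spacy-experimental. Use the `eval:` prefix for evaluation sets.	`None`
`--config`, `-c`	str	Optional path to training `config.cfg` to use. If not set, it will be auto-generated using the default settings.	`None`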
`--base-model`, `-m`	str	Optional spaCy pipeline to use for tokenization and sentence segmentation.	`None`
`--lang`, `-l`	str	Code of language to use if no config or base model are provided.	`"en"`
`--eval-split`, `-es`	float	If no evaluation sets are provided for a component, split off a percentage of the training examples for evaluation.	`0.2`
`--verbose`, `-V`	bool	Enable verbose logging.	`False`
`--n-samples`, `-ns`	int	Number of samples to train, e.g. `4` for results at 25%, 50%, 75% and 100%.	`4`
`--show-plot`, `-P`	bool	Show a visual plot of the curve (requires the `plotext` library).	`False`
`overrides`	any	Config parameters to override. Should be options starting with `--` that correspond to the config section and value to override, e.g. `--training.max_epochs=3`.	`None`
`-F`	str	One or more comma-separated paths to Python files to import, e.g. for custom registered functions.	`None`

Example

prodigy train-curve --ner news_headlines --show-plot

======================== Generating Prodigy config ========================
ℹ Auto-generating config with spaCy
✔ Generated training config

========================== Train curve diagnostic ==========================
Training 4 times with 25%, 50%, 75%, 100% of the data

%      Score    ner
----   ------   ------
  0%   0.00     0.00
 25%   0.31 ▲   0.31 ▲
 50%   0.44 ▲   0.44 ▲
 75%   0.43 ▼   0.43 ▼
100%   0.56 ▲   0.56 ▲

    ┌──────────────────────────────────┐
0.56┤                                 •│
0.44┤                 •           •••• │
0.43┤              ••• •••••••••••     │
    │           •••                    │
0.31┤        •••                       │
    │      ••                          │
    │    ••                            │
    │  ••                              │
0.00┤••                                │
    └┬───────┬────────┬───────┬───────┬┘
     0%     25%      50%     75%    100%

✔ Accuracy improved in the last sample
As a rule of thumb, if accuracy increases in the last segment, this could indicate that collecting more annotations of the same type might improve the model further.

`data-to-spacy` command New: 1.9

Interface: terminal only
Saves: training and evaluation data in spaCy's format, config and cached labels
Use case: merge annotations and export a training corpus

Combine multiple datasets, merge annotations on the same examples and output training and evaluation data in spaCy’s binary .spacy format, which you can use with spacy train. The command takes an output directory and generates all data required to train a pipeline with spaCy, including the config and pre-generated labels data to speed up the training process. This recipe merges annotations for the different pipeline components and outputs a combined training corpus. If an example is only present in one dataset type, its annotations for the other components will be missing values. It’s recommended to use the review recipe on the different annotation types first to resolve conflicts properly.

prodigy data-to-spacy output_dir --ner --textcat --textcat-multilabel --tagger --parser --senter --spancat --coref --eval-split --config --base-model --lang --verbose -F

Argument	Type	Description	Default
`output_dir`	str	Path to output directory.
`--ner`, `-n`	str	One or more (comma-separated) datasets for the named entity recognizer. Use the `eval:` prefix for evaluation sets.	`None`
`--textcat`, `-tc`	str	One or more (comma-separated) datasets for the text classifier (exclusive categories). Use the `eval:` prefix for evaluation sets.	`None`
`--textcat-multilabel`, `-tcm`	str	One or more (comma-separated) datasets for the text classifier (non-exclusive categories). Use the `eval:` prefix for evaluation sets.	`None`
`--tagger`, `-t`	str	One or more (comma-separated) datasets for the part-of-speech tagger. Use the `eval:` prefix for evaluation sets.	`None`
`--parser`, `-p`	str	One or more (comma-separated) datasets for the dependency parser. Use the `eval:` prefix for evaluation sets.	`None`
`--senter`, `-s`	str	One or more (comma-separated) datasets for the sentence recognizer. Use the `eval:` prefix for evaluation sets.	`None`
`--spancat`, `-sc`	str	One or more (comma-separated) datasets for the span categorizer. Use the `eval:` prefix for evaluation sets.	`None`
`--coref`, `-co`	str	New: 1.12 One or more (comma-separated) datasets for the coreference resolver. Requires spacy-experimental. Use the `eval:` prefix for evaluation sets.	`None`
`--config`, `-c`	str	Optional path to training `config.cfg` to use. If not set, it will be auto-generated using the default settings.	`None`
`--base-model`, `-m`	str	Optional spaCy pipeline to use for tokenization and sentence segmentation.	`None`
`--lang`, `-l`	str	Code of language to use if no config or base model are provided.	`"en"`
`--eval-split`, `-es`	float	If no evaluation sets are provided for a component, split off a percentage of the training examples for evaluation. If set to `0`, no evaluation set will be generated.	`0.2`
`--verbose`, `-V`	bool	Enable verbose logging.	`False`
`-F`	str	One or more comma-separated paths to Python files to import, e.g. for custom registered functions.	`None`

Example

prodigy data-to-spacy ./corpus --ner news_ner_person,news_ner_org,news_ner_product --textcat news_cats2018,news_cats2019 --eval-split 0.3

ℹ Using language ‘en’

============================= Generating data =============================
Components: ner, textcat
Merging training and evaluation data for 2 components
  - [ner] Training: 685 | Evaluation: 300 (from datasets)
  - [textcat] Training: 538 | Evaluation: 230 (30% split)
Training: 1223 | Evaluation: 530
Labels: ner (3) | textcat (5)
✔ Saved 1223 training examples
./corpus/train.spacy
✔ Saved 530 evaluation examples
./corpus/dev.spacy

============================ Generating config ============================
ℹ Auto-generating config with spaCy
✔ Generated training config

======================= Generating cached label data =======================
✔ Saving label data for component ‘ner’
./corpus/labels/ner.json
✔ Saving label data for component ‘textcat’
./corpus/labels/textcat.json

============================ Finalizing export ============================
✔ Saved training config
./corpus/config.cfg

To use this data for training with spaCy, you can run:
python -m spacy train ./corpus/config.cfg --paths.train ./corpus/train.spacy --paths.dev ./corpus/dev.spacy

Training in spaCy v3

python -m spacy train ./corpus/config.cfg --paths.train ./corpus/train.spacy --paths.dev ./corpus/dev.spacy

`spacy-config` command New: 1.11

Interface: terminal only
Saves: config file ready to use in spaCy
Use case: generate a config file for spaCy training that can load data from Prodigy

Generate a starter config for training from Prodigy datasets which you can use with spacy train. A custom reader will be used that merges annotations on the same examples. It’s recommended to use the review recipe on the different annotation types first to resolve conflicts properly (instead of relying on this recipe to just filter conflicting annotations and decide on one).

prodigy spacy-config output_file --ner --textcat --textcat-multilabel --tagger --parser --senter --spancat --coref --eval-split --config --base-model --verbose --lang --silent -F

Argument	Type	Description	Default
`output_file`	str	Path to output config file.
`--ner`, `-n`	str	One or more (comma-separated) datasets for the named entity recognizer. Use the `eval:` prefix for evaluation sets.	`None`
`--textcat`, `-tc`	str	One or more (comma-separated) datasets for the text classifier (exclusive categories). Use the `eval:` prefix for evaluation sets.	`None`
`--textcat-multilabel`, `-tcm`	str	One or more (comma-separated) datasets for the text classifier (non-exclusive categories). Use the `eval:` prefix for evaluation sets.	`None`
`--tagger`, `-t`	str	One or more (comma-separated) datasets for the part-of-speech tagger. Use the `eval:` prefix for evaluation sets.	`None`
`--parser`, `-p`	str	One or more (comma-separated) datasets for the dependency parser. Use the `eval:` prefix for evaluation sets.	`None`
`--senter`, `-s`	str	One or more (comma-separated) datasets for the sentence recognizer. Use the `eval:` prefix for evaluation sets.	`None`
`--spancat`, `-sc`	str	One or more (comma-separated) datasets for the span categorizer. Use the `eval:` prefix for evaluation sets.	`None`
`--coref`, `-co`	str	New: 1.12 One or more (comma-separated) datasets for the coreference resolver. Requires spacy-experimental. Use the `eval:` prefix for evaluation sets.	`None`
`--eval-split`, `-es`	float	If no evaluation sets are provided for a component, split off a percentage of the training examples for evaluation. If set to `0`, no evaluation set will be generated.	`0.2`
`--config`, `-c`	str	Optional path to training `config.cfg` to use. If not set, it will be auto-generated using the default settings.	`None`
`--base-model`, `-m`	str	Optional spaCy pipeline to use for tokenization and sentence segmentation.	`None`
`--lang`, `-l`	str	Code of language to use if no config or base model are provided.	`"en"`
`--verbose`, `-V`	bool	Enable verbose logging.	`False`
`--silent`, `-S`	bool	Don’t output any status or logs.	`False`
`-F`	str	One or more comma-separated paths to Python files to import, e.g. for custom registered functions.	`None`

Example

prodigy spacy-config config.cfg --ner news_ner_person,news_ner_org,news_ner_product --textcat news_cats2018,news_cats2019 --eval-split 0.3

============================ Generating config ============================
ℹ Auto-generating config with spaCy
✔ Generated training config
✔ Saved training config
config.cfg

You can now add your data and train your pipeline:
python -m spacy train config.cfg

Training in spaCy v3

python -m spacy train config.cfg

Vectors and Terminology

`terms.teach` binary

Interface: text
Saves: accepted and rejected terms to the database
Updates: target vector used for similarity comparison
Use case: building terminology lists and pre-processing candidates for NER training

Build a terminology list interactively using a model’s word vectors and seed terms, either a comma-separated list or a text file containing one term per line. Based on the seed terms, a target vector is created and only terms similar to that target vector are shown. As you annotate, the recipe iterates over the vector model’s vocab and updates the target vector with the words you accept.
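
The idea of a target vector that is refined as terms are accepted can be sketched with toy data. The snippet below assumes that the target is the mean of the accepted terms' vectors and that candidates are ranked by cosine similarity; both are simplifying assumptions for illustration, and the two-dimensional vectors and helper names are made up rather than taken from Prodigy.

```python
import math


def mean_vector(vectors):
    """Element-wise mean of a list of equally sized vectors."""
    n = len(vectors)
    return [sum(dim) / n for dim in zip(*vectors)]


def cosine(a, b):
    """Cosine similarity between two vectors (0.0 if either has zero length)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


# Toy 2-dimensional "word vectors" standing in for a real vectors model.
VECTORS = {
    "python": [0.9, 0.1],
    "ruby": [0.8, 0.2],
    "java": [0.85, 0.15],
    "banana": [0.1, 0.9],
}


def rank_candidates(accepted, vectors):
    """Rank terms not yet accepted by similarity to the current target vector."""
    target = mean_vector([vectors[term] for term in accepted])
    candidates = [term for term in vectors if term not in accepted]
    return sorted(candidates, key=lambda t: cosine(vectors[t], target), reverse=True)


accepted = ["python", "ruby"]              # seed terms
print(rank_candidates(accepted, VECTORS))  # most similar candidate first
accepted.append("java")                    # accepting a term changes the target
print(rank_candidates(accepted, VECTORS))
```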

prodigy terms.teach dataset vectors --seeds --resume

Argument	Type	Description	Default
`dataset`	str	Prodigy dataset to save annotations to.
`vectors`	str	Loadable spaCy pipeline with word vectors and a vocab, e.g. `en_core_web_lg` or custom vectors trained on domain-specific text.
`--seeds`, `-s`	str / `Path`	Comma-separated list or path to file with seed terms (one term per line).	`''`
`--resume`, `-R`	bool	Resume from existing terms dataset and update target vector accordingly.	`False`

Example

prodigy terms.teach prog_lang_terms en_core_web_lg --seeds Python,C++,Ruby

`terms.to-patterns` command

Interface: terminal only
Saves: JSONL-formatted patterns file
Use case: Convert terms dataset to match patterns to bootstrap annotation or for spaCy's entity ruler

Convert a dataset collected with terms.teach or sense2vec.teach to a JSONL-formatted patterns file. You can optionally provide a spaCy pipeline for tokenization to create token-based patterns and make them case-insensitive. If no model is provided, the patterns will be generated as exact string matches. Pattern files can be used in Prodigy to bootstrap annotation and pre-highlight suggestions, for example in ner.manual. You can also use them with spaCy’s EntityRuler for rule-based named entity recognition.
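
The difference between exact string patterns and token-based, case-insensitive patterns can be illustrated as follows. This is a sketch under stated assumptions: whitespace splitting stands in for a real spaCy tokenizer, the helper `term_to_pattern` is hypothetical, and the exact output of the recipe may differ.

```python
import json


def term_to_pattern(term, label, token_based=False):
    """Build a match pattern for a term (hypothetical helper, not Prodigy's code)."""
    if not token_based:
        # No tokenizer available: fall back to an exact string match.
        return {"label": label, "pattern": term}
    # Whitespace split is a stand-in for a real spaCy tokenizer.
    tokens = term.split()
    # Matching on the lowercase form makes the pattern case-insensitive.
    return {"label": label, "pattern": [{"lower": tok.lower()} for tok in tokens]}


terms = ["Python", "Visual Basic"]
for term in terms:
    print(json.dumps(term_to_pattern(term, "PROGRAMMING_LANGUAGE")))
    print(json.dumps(term_to_pattern(term, "PROGRAMMING_LANGUAGE", token_based=True)))
```

Each printed line is one JSON object, which matches the one-pattern-per-line layout of a JSONL patterns file.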

prodigy terms.to-patterns dataset output_file --label --spacy-model --case-sensitive

Argument	Type	Description	Default
`dataset`	str	Prodigy dataset ID to convert.
`output_file`	str	Optional path to an output file.	`sys.stdout`
`--label`, `-l`	str	Label to assign to the patterns.	`None`
`--spacy-model`, `-m`	str	New: 1.9 Optional spaCy pipeline for tokenization to create token-based patterns, or `blank:lang` to start with a blank model (e.g. `blank:en` for English).	`None`
`--case-sensitive`, `-CS`	bool	New: 1.9 Make patterns case-sensitive.	`False`

Example

prodigy terms.to-patterns prog_lang_terms ./prog_lang_patterns.jsonl --label PROGRAMMING_LANGUAGE --spacy-model blank:en
✨ Exported 59 patterns
./prog_lang_patterns.jsonl
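To make the difference between exact string patterns and token-based patterns concrete, here is a small sketch of the conversion. It is not the recipe's code: `terms_to_patterns` is a hypothetical helper, `str.split` stands in for a spaCy tokenizer, and the use of `"orth"` for the case-sensitive variant is an assumption.

```python
import json

def terms_to_patterns(terms, label, tokenize=None, case_sensitive=False):
    """Build one pattern dict per term.

    Without a tokenizer the pattern is the exact string. With one, it is a
    list of token dicts keyed by "lower" (case-insensitive) or "orth".
    """
    patterns = []
    for term in terms:
        if tokenize is None:
            pattern = term
        elif case_sensitive:
            pattern = [{"orth": tok} for tok in tokenize(term)]
        else:
            pattern = [{"lower": tok.lower()} for tok in tokenize(term)]
        patterns.append({"label": label, "pattern": pattern})
    return patterns

terms = ["Python", "Visual Basic"]  # hypothetical accepted terms
# str.split is a stand-in for a real tokenizer (assumption).
token_patterns = terms_to_patterns(terms, "PROGRAMMING_LANGUAGE", tokenize=str.split)
jsonl = "\n".join(json.dumps(p) for p in token_patterns)
print(jsonl)
```

Each printed line is one JSON object with a `"label"` and a `"pattern"`, which is the shape a JSONL patterns file uses.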

Large Language Models

`ner.llm.correct` New: 1.13

Interface: ner_manual
Saves: NER annotations to the database
Use case: leverage large language models via spacy-llm as a model in the loop

This recipe marks entity predictions obtained from a large language model configured by spacy-llm and allows you to accept them as correct, or to manually curate them. This recipe may require environment variables to be set, depending on the large language model that you’re using. The details of this are explained in this section.

prodigy ner.llm.correct dataset config_path source --loader --segment --component --overrides

Argument	Type	Description	Default
`dataset`	str	Prodigy dataset to save annotations to.
`config_path`	str	Path to the spacy-llm config file.
`source`	str	Path to source data to annotate. The data should at least contain a `"text"` field.
`--loader`, `-lo`	str	Loader (guessed from file extension if not set).	`None`
`--segment`, `-S`	bool	Split text into sentences	`False`
`--component`, `-c`	str	Name of component to use for annotation.	`llm`
`--overrides`	str	Overrides for the spacy-llm config file	`None`

Give me an example configuration file to get started.

Check the spacy-llm documentation to learn more about how to set up these configuration files.

Example spacy-llm config file for NER
[nlp]
lang = "en"
pipeline = ["llm"]

[components]

[components.llm]
factory = "llm"

[components.llm.task]
@llm_tasks = "spacy.NER.v2"
labels = ["DISH", "INGREDIENT", "EQUIPMENT"]

[components.llm.model]
@llm_models = "spacy.GPT-4.v1"
config = {"temperature": 0.3}

[components.llm.cache]
@llm_misc = "spacy.BatchCache.v1"
path = "local-ner-cache"
batch_size = 3
max_batches_in_mem = 10

Example

prodigy ner.llm.correct annotated-recipes config.cfg examples.jsonl

`ner.llm.fetch` New: 1.13

Interface: n/a
Saves: LLM suggestions to the database or a file
Use case: download pre-annotated examples for a specific NER task

The ner.llm.correct recipe fetches examples from large language models while annotating, whereas this recipe fetches a large batch of examples upfront. After downloading such a batch, you can use ner.manual to correct the annotations. This recipe may require environment variables to be set, depending on the large language model that you’re using. The details of this are explained in this section.

prodigy ner.llm.fetch config_path source output --loader --segment --component --resume --overrides

Argument	Type	Description	Default
`config_path`	str	Path to the spacy-llm config file.
`source`	str	Path to source data to annotate. The data should at least contain a `"text"` field.
`output`	str	`dataset:name` or file path to save the annotations to.
`--loader`, `-lo`	str	Loader (guessed from file extension if not set).	`None`
`--resume`, `-r`	bool	Resume fetching from dataset or file on disk.	`False`
`--segment`, `-S`	bool	Split text into sentences	`False`
`--component`, `-c`	str	Name of component to use for annotation.	`llm`
`--overrides`	str	Overrides for the spacy-llm config file	`None`

Example

prodigy ner.llm.fetch config.cfg examples.jsonl ner-annotated.jsonl
100%|████████████████████████████| 50/50 [00:12<00:00, 3.88it/s]
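Before correcting a fetched batch, it can help to check which labels the model actually produced. The sketch below assumes the output is JSONL in which each task has a `"text"` and an optional `"spans"` list whose entries carry a `"label"`. The sample records are made up and `count_span_labels` is a hypothetical helper.

```python
import json
from collections import Counter

def count_span_labels(lines):
    """Count span labels across JSONL task lines, skipping blank lines."""
    counts = Counter()
    for line in lines:
        line = line.strip()
        if not line:
            continue
        task = json.loads(line)
        for span in task.get("spans", []):
            counts[span["label"]] += 1
    return counts

# Made-up records used in place of reading ner-annotated.jsonl from disk.
sample = [
    '{"text": "Bake the lasagna in the oven.", "spans": [{"start": 9, "end": 16, "label": "DISH"}, {"start": 24, "end": 28, "label": "EQUIPMENT"}]}',
    '{"text": "Add salt.", "spans": [{"start": 4, "end": 8, "label": "INGREDIENT"}]}',
    '{"text": "Serve warm."}',
]
label_counts = count_span_labels(sample)
```

With a real file, the same function can be called on an open file handle, since iterating over it yields one line at a time.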

`textcat.llm.correct` New: 1.13

Interface: text
Saves: textcat annotations to the database
Use case: leverage large language models via spacy-llm as a model in the loop

prodigy textcat.llm.correct dataset config_path source --loader --component --overrides

Argument	Type	Description	Default
`dataset`	str	Prodigy dataset to save annotations to.
`config_path`	str	Path to the spacy-llm config file.
`source`	str	Path to source data to annotate. The data should at least contain a `"text"` field.
`--loader`, `-lo`	str	Loader (guessed from file extension if not set).	`None`
`--component`, `-c`	str	Name of component to use for annotation.	`llm`
`--overrides`	str	Overrides for the spacy-llm config file	`None`

Give me an example configuration file to get started.

Check the spacy-llm documentation to learn more about how to set up these configuration files.

Example spacy-llm config file for textcat
[nlp]
lang = "en"
pipeline = ["llm"]

[components]

[components.llm]
factory = "llm"

[components.llm.task]
@llm_tasks = "spacy.TextCat.v2"
labels = ["RECIPE", "QUESTION", "FEEDBACK"]

[components.llm.model]
@llm_models = "spacy.GPT-4.v1"
config = {"temperature": 0.3}

[components.llm.cache]
@llm_misc = "spacy.BatchCache.v1"
path = "local-ner-cache"
batch_size = 3
max_batches_in_mem = 10

Example

prodigy textcat.llm.correct annotated-recipes config.cfg examples.jsonl

`textcat.llm.fetch` New: 1.13

Interface: n/a
Saves: LLM suggestions to the database or a file
Use case: download pre-annotated examples for a specific textcat task

The textcat.llm.correct recipe fetches examples from large language models while annotating, whereas this recipe fetches a large batch of examples upfront. After downloading such a batch, you can use textcat.manual to correct the annotations. This recipe may require environment variables to be set, depending on the large language model that you’re using. The details of this are explained in this section.

prodigy textcat.llm.fetch config_path source output --loader --resume --component --overrides

Argument	Type	Description	Default
`config_path`	str	Path to the spacy-llm config file.
`source`	str	Path to source data to annotate. The data should at least contain a `"text"` field.
`output`	str	`dataset:name` or file path to save the annotations to.
`--loader`, `-lo`	str	Loader (guessed from file extension if not set).	`None`
`--resume`, `-r`	bool	Resume download from dataset or file on disk	`False`
`--component`, `-c`	str	Name of component to use for annotation.	`llm`
`--overrides`	str	Overrides for the spacy-llm config file	`None`

Example

prodigy textcat.llm.fetch config.cfg examples.jsonl dataset:textcat-annotated
100%|████████████████████████████| 50/50 [00:12<00:00, 3.88it/s]

`spans.llm.correct` New: 1.13

Interface: spans_manual
Saves: Spancat annotations to the database
Use case: leverage spacy-llm as a model in the loop

This recipe marks overlapping span predictions obtained from a large language model configured by spacy-llm and allows you to accept them as correct, or to manually curate them. This recipe may require environment variables to be set, depending on the large language model that you’re using. The details of this are explained in this section.

prodigy spans.llm.correct dataset config_path source --loader --component --overrides

Argument	Type	Description	Default
`dataset`	str	Prodigy dataset to save annotations to.
`config_path`	str	Path to the spacy-llm config file.
`source`	str	Path to source data to annotate. The data should at least contain a `"text"` field.
`--loader`, `-lo`	str	Loader (guessed from file extension if not set).	`None`
`--component`, `-c`	str	Name of component to use for annotation.	`llm`
`--overrides`	str	Overrides for the spacy-llm config file	`None`

Give me an example configuration file to get started.

Here’s what an example spacy-llm config file might look like for spancat.

Basic spacy-llm config for spans
[nlp]
lang = "en"
pipeline = ["llm"]

[components]

[components.llm]
factory = "llm"
save_io = true

[components.llm.task]
@llm_tasks = "spacy.SpanCat.v2"
labels = ["DISH", "INGREDIENT", "EQUIPMENT"]

[components.llm.task.label_definitions]
DISH = "Extract the name of a known dish."
INGREDIENT = "Extract the name of a cooking ingredient, including herbs and spices."
EQUIPMENT = "Extract any mention of cooking equipment. e.g. oven, cooking pot, grill"

[components.llm.model]
@llm_models = "spacy.GPT-3-5.v1"
config = {"temperature": 0.3}

[components.llm.task.examples]
@misc = "spacy.FewShotReader.v1"
path = "span_examples.yaml"

[components.llm.cache]
@llm_misc = "spacy.BatchCache.v1"
path = "local-cached"
batch_size = 3
max_batches_in_mem = 10

This file refers to a span_examples.yaml file, which might look like this:

span_examples.yaml
- text: 'Mac and Cheese is a popular American pasta variant.'
  entities:
    INGREDIENT: ['Cheese']
    DISH: ['Mac and Cheese']

Example

prodigy spans.llm.correct annotated-recipes config.cfg examples.jsonl

`spans.llm.fetch` New: 1.13

Interface: spans_manual
Saves: LLM suggestions to the database or a file
Use case: download pre-annotated examples for a specific spancat task via spacy-llm

The spans.llm.correct recipe fetches examples from large language models while annotating, whereas this recipe fetches a large batch of examples upfront. After downloading such a batch, you can use spans.manual to correct the annotations. This recipe may require environment variables to be set, depending on the large language model that you’re using. The details of this are explained in this section.

prodigy spans.llm.fetch config_path source output --loader --resume --component --overrides

Argument	Type	Description	Default
`config_path`	str	Path to the spacy-llm config file.
`source`	str	Path to source data to annotate. The data should at least contain a `"text"` field.
`output`	str	`dataset:name` or file path to save the annotations to.
`--loader`, `-lo`	str	Loader (guessed from file extension if not set).	`None`
`--resume`, `-r`	bool	Resume download from dataset or file on disk	`False`
`--component`, `-c`	str	Name of component to use for annotation.	`llm`
`--overrides`	str	Overrides for the spacy-llm config file	`None`

Example

prodigy spans.llm.fetch config.cfg examples.jsonl spancat-annotated.jsonl
100%|████████████████████████████| 50/50 [00:12<00:00, 3.88it/s]

`terms.llm.fetch` New: 1.13.2

Interface: n/a
Saves: Phrase candidates to the database
Use case: Leverage an LLM to help construct phrases for pattern files

This recipe generates terms and phrases obtained from a large language model. These terms can be curated and turned into patterns files, which can help with downstream annotation tasks. The recipe works by iteratively requesting terms from the LLM and by deduplicating the results on each batch. It can be helpful for this recipe to choose a high temperature setting for the LLM.

This recipe may require environment variables to be set, depending on the large language model that you’re using. The details of this are explained in this section.

prodigy terms.llm.fetch dataset config_path topic --n_requests --auto-accept --component --overrides

Argument	Type	Description	Default
`dataset`	str	Dataset to save annotations into.
`config_path`	str	Path to the spacy-llm config file.
`topic`	str	Description of the topic that you’re interested in.	`None`
`--n_requests`, `-r`	int	Number of requests to send to LLM.	5
`--auto-accept`, `-a`	bool	Automatically accept generated examples.	`False`
`--component`, `-c`	str	Name of the spacy-llm component that generates terms	`llm`
`--overrides`	str	Overrides for the spacy-llm config file	`None`

Give me an example configuration file to get started.

Here’s what an example spacy-llm config file might look like for the terms recipe.

Example spacy-llm config for terms
[nlp]
lang = "en"
pipeline = ["llm"]

[components]

[components.llm]
factory = "llm"

[components.llm.task]
@llm_tasks = "prodigy.Terms.v1"
batch_size = 50

[components.llm.model]
@llm_models = "spacy.GPT-3-5.v1"
config = {"temperature": 0.3}

Example

prodigy terms.llm.fetch skateboard-trick-terms config.cfg "skateboard tricks"
100%|████████████████████████████| 50/50 [00:12<00:00, 3.88it/s]
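The request-and-deduplicate loop that this recipe is described as running can be sketched as follows. This is an illustration under assumptions, not the recipe's code: `fetch_terms` and `fake_llm` are hypothetical names, the stub replaces a real LLM call, and lowercasing plus whitespace stripping is an assumed way of deciding that two terms are duplicates.

```python
def fetch_terms(request_batch, n_requests=5):
    """Call request_batch repeatedly, keeping only terms not seen before."""
    seen = set()
    terms = []
    for i in range(n_requests):
        for term in request_batch(i):
            key = term.strip().lower()  # assumed normalization for dedup
            if key and key not in seen:
                seen.add(key)
                terms.append(term.strip())
    return terms

# Stub used in place of a real LLM call; it returns overlapping batches.
def fake_llm(i):
    batches = [["kickflip", "ollie"], ["Ollie", "heelflip"], ["kickflip ", "grind"]]
    return batches[i % len(batches)]

unique_terms = fetch_terms(fake_llm, n_requests=3)
```

Because repeated requests tend to return overlapping terms, a higher temperature gives the deduplication step more new candidates to keep.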

`ab.llm.tournament` New: 1.13.2

Interface: n/a
Saves: A/B evaluations to the database
Use case: Compare prompts and LLM backends by passing them through a tournament

The goal of this recipe is to quickly compare the quality of outputs from a collection of prompts by leveraging a tournament. It uses the Glicko rating system internally to determine the duels as well as the best performing prompt. You can also compare different LLM backends using this recipe.

This recipe may require environment variables to be set, depending on the large language model that you’re using. The details of this are explained in this section.

prodigy ab.llm.tournament dataset inputs_path prompt_path config_path --display-template-path --no-random --resume --no-meta

Argument	Type	Description	Default
`dataset`	str	Dataset to save annotations into.
`inputs_path`	Path	Path to jsonl inputs
`prompt_path`	Path	Path file/folder with jinja2 prompt config(s)
`config_path`	Path	Path file/folder with spacy-llm config(s)
`--display-template-path`, `-dp`	Path	Template for summarizing the arguments	`None`
`--no-random`,`-NR`	bool	Don’t randomize which annotation is shown as correct	`False`
`--resume`, `-r`	bool	Resume from the dataset, replaying the matches before starting	`False`
`--no-meta`,`-nm`	bool	Don’t add meta information to the annotation interface	`False`

Give me sample configuration files to get started.

Here’s what example spacy-llm config files might look like for the tournament recipe.

Example spacy-llm config for the tournament recipe
[nlp]
lang = "en"
pipeline = ["llm"]

[components]

[components.llm]
factory = "llm"

[components.llm.task]
@llm_tasks = "prodigy.TextPrompter.v1"

[components.llm.model]
@llm_models = "spacy.GPT-3-5.v1"
config = {"temperature": 0.3}

Example spacy-llm config for the tournament recipe
[nlp]
lang = "en"
pipeline = ["llm"]

[components]

[components.llm]
factory = "llm"

[components.llm.task]
@llm_tasks = "prodigy.TextPrompter.v1"

[components.llm.model]
@llm_models = "spacy.GPT-4.v1"
config = {"temperature": 0.3}

You may also consider these prompts.

prompt1.jinja2
Write a haiku about {{topic}} that rhymes.

prompt2.jinja2
Write a super funny haiku about {{topic}} that rhymes.

Given these prompts, you may also consider such a file with topics to use.

{"topic": "star wars"}
{"topic": "python"}
{"topic": "stroopwafels"}

Example

prodigy ab.llm.tournament haiku-tournament inputs.jsonl ./prompt_folder ./config_folder --resume

Output after annotating a few examples

=============== Current winner: [prompt1.jinja2 + cfg1.cfg] ===============
comparison                                                  prob   trials
[prompt1.jinja2 + cfg1.cfg] > [prompt1.jinja2 + cfg2.cfg]   0.50        0
[prompt1.jinja2 + cfg1.cfg] > [prompt2.jinja2 + cfg1.cfg]   0.50        0
[prompt1.jinja2 + cfg1.cfg] > [prompt2.jinja2 + cfg1.cfg]   0.71        1

Output after annotating more examples, ratings will converge

=============== Current winner: [prompt1.jinja2 + cfg1.cfg] ===============
comparison                                                  prob   trials
[prompt1.jinja2 + cfg1.cfg] > [prompt1.jinja2 + cfg2.cfg]   0.96       15
[prompt1.jinja2 + cfg1.cfg] > [prompt2.jinja2 + cfg1.cfg]   0.94       12
[prompt1.jinja2 + cfg1.cfg] > [prompt2.jinja2 + cfg1.cfg]   0.89       10
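To give a feel for how a Glicko-style rating reacts to duel outcomes, here is a sketch of the original Glicko update for one competitor after a set of results. Which Glicko variant the recipe uses, and how the probabilities in the table above are derived from the ratings, is not stated here, so treat this as background rather than as the recipe's implementation. The function names are chosen for this example.

```python
import math

Q = math.log(10) / 400

def g(rd):
    """Attenuation factor for an opponent's rating deviation."""
    return 1 / math.sqrt(1 + 3 * Q**2 * rd**2 / math.pi**2)

def expected(r, r_opp, rd_opp):
    """Expected score against one opponent."""
    return 1 / (1 + 10 ** (-g(rd_opp) * (r - r_opp) / 400))

def glicko_update(r, rd, results):
    """Update (r, rd) from a list of (opp_rating, opp_rd, score) results."""
    d2_inv = Q**2 * sum(
        g(rd_j) ** 2 * expected(r, r_j, rd_j) * (1 - expected(r, r_j, rd_j))
        for r_j, rd_j, _ in results
    )
    denom = 1 / rd**2 + d2_inv
    delta = sum(g(rd_j) * (s - expected(r, r_j, rd_j)) for r_j, rd_j, s in results)
    return r + Q / denom * delta, math.sqrt(1 / denom)

# One win (score 1) and two losses (score 0) against three opponents.
new_r, new_rd = glicko_update(1500, 200, [(1400, 30, 1), (1550, 100, 0), (1700, 300, 0)])
print(round(new_r), round(new_rd, 1))
```

The rating deviation shrinks with every duel, which is why the probabilities become more decisive as more examples are annotated.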

`ner.openai.correct` New: 1.12

Interface: ner_manual
Saves: NER annotations to the database
Use case: leverage OpenAI as a model in the loop

This recipe marks entity predictions obtained from a large language model and allows you to accept them as correct, or to manually curate them. This allows you to quickly gather a gold-standard dataset through zero-shot or few-shot learning. It’s very much like using the ner.correct recipe, but with GPT-3 as a backend model to make predictions.

prodigy ner.openai.correct dataset source --labels --lang --model --examples-path --max-examples --prompt-path --batch-size --segment --loader --verbose

Argument	Type	Description	Default
`dataset`	str	Prodigy dataset to save annotations to.
`source`	str	Path to source data to annotate. The data should at least contain a `"text"` field.
`--labels`, `-L`	str	Comma-separated list defining the NER labels the model should predict.
`--model`, `-m`	str	GPT-3 model to use for initial predictions.	`"text-davinci-003"`
`--examples-path`, `-e`	Path	Path to examples to help define the task. The file can be a .yml, .yaml or .json. If set to `None`, zero-shot learning is applied.	`None`
`--lang`, `-l`	str	Language of the input data - will be used to obtain a relevant tokenizer.	`"en"`
`--max-examples`, `-n`	int	Max number of examples to include in the prompt to OpenAI. If set to 0, zero-shot learning is always applied, even when examples are available.	2
`--prompt-path`, `-p`	Path	Path to custom `.jinja2` prompt template	`None`
`--batch-size`, `-b`	int	Batch size of queries to send to the OpenAI API.	10
`--segment`, `-S`	bool	Flag to set when examples should be split into sentences. By default, the full input article is shown.	`False`
`--loader`, `-lo`	str	Loader (guessed from file extension if not set).	`None`
`--verbose`, `-v`	bool	Flag to print extra information to the terminal.	`False`
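The interaction between `--examples-path` and `--max-examples` described in the table can be sketched like this. It is only an illustration: `select_prompt_examples` is a hypothetical helper, the example records are made up, and taking the first N examples is an assumption about how the subset is chosen.

```python
def select_prompt_examples(examples, max_examples=2):
    """Return the examples to include in the prompt; an empty list means zero-shot."""
    if not examples or max_examples <= 0:
        return []
    return examples[:max_examples]

# Hypothetical examples, as they might be loaded from --examples-path.
examples = [
    {"text": "Mix the flour.", "entities": {"INGREDIENT": ["flour"]}},
    {"text": "Heat the pan.", "entities": {"EQUIPMENT": ["pan"]}},
    {"text": "Serve the stew.", "entities": {"DISH": ["stew"]}},
]
few_shot = select_prompt_examples(examples, max_examples=2)
zero_shot = select_prompt_examples(examples, max_examples=0)
no_file = select_prompt_examples(None)
```

Both an unset examples path and a maximum of 0 lead to the same zero-shot prompt.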

`ner.openai.fetch` New: 1.12

Interface: n/a
Saves: OpenAI NER suggestions to disk
Use case: download pre-filled examples for a specific NER task

The ner.openai.correct recipe fetches examples from OpenAI while annotating, whereas this recipe fetches a large batch of examples upfront. After downloading such a batch, you can use ner.manual to correct the OpenAI annotations.

prodigy ner.openai.fetch dataset source output_path --labels --lang --model --examples-path --max-examples --prompt-path --batch-size --segment --resume --loader --verbose

Argument	Type	Description	Default
`dataset`	str	Prodigy dataset to save annotations to.
`source`	str	Path to source data to annotate. The data should at least contain a `"text"` field.
`output_path`	Path	Path to `.jsonl` file to save OpenAI annotations into.
`--labels` , `-L`	str	Comma-separated list defining the NER labels the model should predict.
`--lang`, `-l`	str	Language of the input data - will be used to obtain a relevant tokenizer.	`"en"`
`--model`, `-m`	str	GPT-3 model to use for initial predictions.	`"text-davinci-003"`
`--examples-path`, `-e`	Path	Path to examples to help define the task. The file can be a .yml, .yaml or .json. If set to `None`, zero-shot learning is applied.	`None`
`--max-examples`, `-n`	int	Max number of examples to include in the prompt to OpenAI. If set to 0, zero-shot learning is always applied, even when examples are available.	2
`--prompt-path`, `-p`	Path	Path to custom `.jinja2` prompt template	`None`
`--batch-size`, `-b`	int	Batch size of queries to send to the OpenAI API.	10
`--segment`, `-S`	bool	Flag to set when examples should be split into sentences. By default, the full input article is shown.	`False`
`--resume`, `-r`	bool	Resume fetch from output file.	`False`
`--loader`, `-lo`	str	Loader (guessed from file extension if not set).	`None`
`--verbose`, `-v`	bool	Flag to print extra information to the terminal.	`False`

Example

prodigy ner.openai.fetch examples.jsonl ner-annotated.jsonl --labels dish,ingredient,equipment
100%|████████████████████████████| 50/50 [00:12<00:00, 3.88it/s]

`textcat.openai.correct` New: 1.12

Interface: ner_manual
Saves: text classification annotations to the database
Use case: leverage OpenAI as a model in the loop

This recipe enables you to classify texts by correcting the annotations from an OpenAI language model. OpenAI will also provide a “reason” to explain why a particular label was chosen.

prodigy textcat.openai.correct dataset source --labels --lang --model --batch-size --segment --prompt-path --examples-path --max-examples --loader --verbose

Argument	Type	Description	Default
`dataset`	str	Prodigy dataset to save annotations to.
`source`	Path	Path to source data to annotate. The data should at least contain a `"text"` field.
`--labels`, `-L`	str	Comma-separated list defining the text categorization labels the model should predict.	`None`
`--lang`, `-l`	str	Language of the input data - will be used to obtain a relevant tokenizer.	`"en"`
`--model`, `-m`	str	GPT-3 model to use for initial predictions.	`"text-davinci-003"`
`--batch-size`, `-b`	int	Batch size of queries to send to the OpenAI API.	10
`--segment`, `-S`	bool	Flag to set when examples should be split into sentences. By default, the full input article is shown.	`False`
`--prompt-path`, `-p`	Path	Path to custom `.jinja2` prompt template. Will use default template if not provided.	`None`
`--examples-path`, `-e`	Path	Path to examples to help define the task. The file can be a .yml, .yaml or .json. If set to `None`, zero-shot learning is applied.	`None`
`--max-examples`, `-n`	int	Max number of examples to include in the prompt to OpenAI. If set to 0, zero-shot learning is always applied, even when examples are available.	2
`--exclusive-classes`, `-E`	bool	Flag to make the classification task exclusive.	`False`
`--loader`, `-lo`	str	Loader (guessed from file extension if not set).	`None`
`--verbose`, `-v`	bool	Flag to print extra information to the terminal.	`False`

Example

prodigy textcat.openai.correct food-comments examples.jsonl --labels "recipe,feedback,question"

`textcat.openai.fetch` New: 1.12

Interface: n/a
Saves: OpenAI textcat suggestions to disk
Use case: download pre-annotated examples for a specific text classification task

The textcat.openai.correct recipe fetches examples from OpenAI while annotating, whereas this recipe fetches a large batch of examples upfront. This is helpful when you are working with highly imbalanced data and are only interested in rare examples. After downloading such a batch, you can use textcat.manual to correct the OpenAI annotations.

prodigy textcat.openai.fetch dataset source output_path --labels --lang --model --prompt-path --examples-path --max-examples --batch-size --segment --exclusive-classes --resume --loader --verbose

Argument	Type	Description	Default
`dataset`	str	Prodigy dataset to save annotations to.
`source`	Path	Path to source data to annotate. The data should at least contain a `"text"` field.
`output_path`	Path	Path to `.jsonl` file to save OpenAI annotations into.
`--labels`, `-L`	str	Comma-separated list defining the text categorization labels the model should predict.	`None`
`--lang`, `-l`	str	Language of the input data - will be used to obtain a relevant tokenizer.	`"en"`
`--model`, `-m`	str	GPT-3 model to use for initial predictions.	`"text-davinci-003"`
`--prompt-path`, `-p`	Path	Path to custom `.jinja2` prompt template	`None`
`--examples-path`, `-e`	Path	Path to examples to help define the task. The file can be a .yml, .yaml or .json. If set to `None`, zero-shot learning is applied.	`None`
`--max-examples`, `-n`	int	Max number of examples to include in the prompt to OpenAI. If set to 0, zero-shot learning is always applied, even when examples are available.	2
`--batch-size`, `-b`	int	Batch size of queries to send to the OpenAI API.	10
`--segment`, `-S`	bool	Flag to set when examples should be split into sentences. By default, the full input article is shown.	`False`
`--exclusive-classes`, `-E`	bool	Make the classification task exclusive	`False`
`--resume`, `-r`	bool	Resume fetch from output file.	`False`
`--loader`, `-lo`	str	Loader (guessed from file extension if not set).	`None`
`--verbose`, `-v`	bool	Flag to print extra information to the terminal.	`False`

Example

prodigy textcat.openai.fetch examples.jsonl textcat-annotated.jsonl --labels recipe,feedback,question
100%|████████████████████████████| 50/50 [00:12<00:00, 3.88it/s]

`terms.openai.fetch` New: 1.12

Interface: n/a
Saves: text examples to disk
Use case: generate terminology lists via OpenAI

This recipe generates terms and phrases obtained from a large language model. These terms can be curated and turned into patterns files, which can help with downstream annotation tasks.

prodigy terms.openai.fetch query output_path --seeds --n --model --prompt-path --resume --temperature --top-p --best-of --n-batch --max-tokens

Argument	Type	Description	Default
`query`	str	Query to send to OpenAI
`output_path`	Path	Path to save the output
`--seeds`,`-s`	str	One or more comma-separated seed phrases.	`""`
`--n`,`-n`	int	Minimum number of items to generate	`100`
`--model`, `-m`	str	GPT-3 model to use for completion	`"text-davinci-003"`
`--prompt-path`, `-p`	Path	Path to custom jinja2 prompt template	`None`
`--resume`, `-r`	bool	Resume by loading in text examples from output file	`False`
`--temperature`,`-t`	float	OpenAI temperature param	`1.0`
`--top-p`, `--tp`	float	OpenAI top_p param	`1.0`
`--best-of`, `-bo`	int	OpenAI best_of param	`10`
`--n-batch`,`-nb`	int	OpenAI batch size param	`10`
`--max-tokens`, `-mt`	int	Max tokens to generate per call	`100`

Example

prodigy terms.openai.fetch "skateboard tricks" ./tricks.jsonl --n 100 --seeds "kickflip,ollie"
100%|████████████████████████████| 100/100 [00:12<00:00, 3.88it/s]

`ab.openai.prompts` New: 1.12

Interface: choice
Saves: A/B evaluations to the database
Use case: Compare OpenAI prompts by passing them through an A/B test

The goal of this recipe is to quickly compare the quality of outputs from two prompts in a quantifiable and blind way.

prodigy ab.openai.prompts dataset inputs_path display_template_path prompt1_template_path prompt2_template_path --model --batch-size --temperature --no-random --repeat --verbose --no-meta

Argument	Type	Description	Default
`dataset`	str	Prodigy dataset to save answers into
`inputs_path`	Path	Path to jsonl inputs
`display_template_path`	Path	Template for summarizing the arguments
`prompt1_template_path`	Path	Path to the first jinja2 prompt template
`prompt2_template_path`	Path	Path to the second jinja2 prompt template
`--model`, `-m`	str	GPT-3 model to use for completion	`"text-davinci-003"`
`--batch-size`, `-b`	int	Batch size to send to OpenAI API	`10`
`--temperature`,`-t`	float	OpenAI temperature param	`1.0`
`--no-random`,`-NR`	bool	Don’t randomize which annotation is shown as correct	`False`
`--repeat`, `-r`	int	How often to send the same prompt to OpenAI	`1`
`--verbose`,`-v`	bool	Print extra information to terminal	`False`
`--no-meta`,`-nm`	bool	Don’t add meta information to the annotation interface	`False`

Example

prodigy ab.openai.prompts haiku haiku-inputs.jsonl title.jinja2 prompt1.jinja2 prompt2.jinja2 --repeat 5

Output after annotating

========================== ✨  Evaluation results ==========================
✔ You preferred prompt1.jinja2
prompt1.jinja2   11
prompt2.jinja2    5

`ab.openai.tournament` New: 1.12

Interface: choice
Saves: A/B evaluations to the database
Use case: Compare OpenAI prompts by passing them through a tournament

The goal of this recipe is to quickly compare the quality of outputs from a collection of prompts by leveraging a tournament. It uses the Glicko rating system internally to determine the duels as well as the best performing prompt. You can read more about the expectations of the jsonl/jinja2 files in the tournaments guide.

prodigy ab.openai.tournament dataset inputs_path display_template_path prompt_template_folder --model --batch-size --resume --temperature --no-random --verbose --no-meta

Argument	Type	Description	Default
`dataset`	str	Prodigy dataset to save answers into
`inputs_path`	Path	Path to jsonl inputs
`display_template_path`	Path	Template for summarizing the arguments
`prompt_template_folder`	Path	Path to folder with jinja2 prompt templates
`--model`, `-m`	str	GPT-3 model to use for completion	`"text-davinci-003"`
`--batch-size`, `-b`	int	Batch size to send to OpenAI API	`1`
`--resume`, `-r`	bool	Resume from the dataset, starting with ratings based on matches from before	`False`
`--temperature`,`-t`	float	OpenAI temperature param	`1.0`
`--no-random`,`-NR`	bool	Don’t randomize which annotation is shown as correct	`False`
`--verbose`,`-v`	bool	Print extra information to terminal	`False`
`--no-meta`,`-nm`	bool	Don’t add meta information to the annotation interface	`False`

Example

prodigy ab.openai.tournament haiku-tournament input.jsonl title.jinja2 prompt_folder --resume

Output after annotating a few examples

==================== ✨  Current winner: prompt3.jinja2 ====================
comparison                          value  count
P(prompt3.jinja2 > prompt1.jinja2)   0.60      7
P(prompt3.jinja2 > prompt2.jinja2)   0.83      3

Output after annotating more examples, ratings will converge

==================== ✨  Current winner: prompt3.jinja2 ====================
comparison                          value   count
P(prompt3.jinja2 > prompt1.jinja2)   0.95      19
P(prompt3.jinja2 > prompt2.jinja2)   0.97       5

Review and Evaluate

`review` New: 1.8

Interface: review
Saves: reviewed master annotations to the database
Use case: review annotations by multiple annotators and resolve conflicts

Review existing annotations created by multiple annotators and resolve potential conflicts by creating one final “master annotation”. Can be used for both binary and manual annotations and supports all interfaces except image_manual and compare. If the annotations were created with a manual interface, the “most popular” version, e.g. the version most sessions agreed on, will be pre-selected automatically.

prodigy review dataset in_sets --label --view-id --fetch-media --show-skipped --auto-accept --accept-single

Argument	Type	Description	Default
`dataset`	str	Prodigy dataset ID to save reviewed annotations.
`in_sets`	str	Comma-separated names of datasets to review.
`--label`, `-l`	str	Optional comma-separated labels to display in manual annotation mode.	`None`
`--view-id`, `-v`	str	Interface to use if none present in the task, e.g. `ner` or `ner_manual`.	`None`
`--fetch-media`, `-FM`	bool	New: 1.10 Temporarily replace paths and URLs with base64 strings so they can be reannotated. Will be removed again before examples are placed in the database.	`False`
`--show-skipped`, `-S`	bool	New: 1.10.5 Include answers that would otherwise be skipped, like annotations with answer `"ignore"` or annotations with answer `"reject"` in manual interfaces.	`False`
`--auto-accept`, `-A`	bool	New: 1.11 Automatically accept annotations with no conflicts and add them to the dataset.	`False`
`--accept-single`, `-AS`	bool	New: 1.12 Also auto-accept annotations that have only been annotated by a single annotator.	`False`

Example (binary)

prodigy review food_reviews_final food_reviews2019,food_reviews2018

Example (manual)

prodigy review ner_final ner_news,ner_misc --label ORG,PRODUCT
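The idea of pre-selecting the "most popular" version of a manual annotation can be sketched as a simple vote per input. This is an illustration, not the recipe's logic: `most_popular_versions` is a hypothetical helper, the annotations are made up, and comparing versions by their serialized `"spans"` is an assumption.

```python
import json
from collections import Counter, defaultdict

def most_popular_versions(annotations, key="spans"):
    """For each input hash, return the version most annotators agreed on."""
    by_input = defaultdict(list)
    for eg in annotations:
        # Serialize the annotated field so identical versions compare equal.
        by_input[eg["_input_hash"]].append(json.dumps(eg.get(key, []), sort_keys=True))
    return {
        input_hash: json.loads(Counter(versions).most_common(1)[0][0])
        for input_hash, versions in by_input.items()
    }

# Made-up annotations from three sessions on the same input.
annotations = [
    {"_input_hash": 1, "spans": [{"start": 0, "end": 5, "label": "ORG"}]},
    {"_input_hash": 1, "spans": [{"start": 0, "end": 5, "label": "ORG"}]},
    {"_input_hash": 1, "spans": [{"start": 0, "end": 5, "label": "PRODUCT"}]},
]
preselected = most_popular_versions(annotations)
```

Here two of the three sessions chose ORG, so that version would be the pre-selected one and the reviewer only needs to confirm or correct it.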

`compare`

Interface: choice / diff
Saves: evaluation results to the database
Use case: A/B evaluation of two outputs, e.g. to evaluate different models

Compare the output of your model and the output of a baseline on the same inputs. To prevent bias during annotation, Prodigy will randomly decide which output to suggest as the correct answer. When you exit the application, you’ll see detailed stats, including the preferred output. Expects two JSONL files where each entry has an "id" (to match up the outputs on the same input), and an "input" and "output" object with the content to render, e.g. the "text".

prodigy compare dataset a_file b_file --no-random --diff

Argument	Type	Description	Default
`dataset`	str	Prodigy dataset to save annotations to.
`a_file`	str	First file to compare, e.g. system responses.
`b_file`	str	Second file to compare, e.g. baseline responses.
`--no-random`, `-nr`	bool	Don’t randomize which annotation is shown as the “correct” suggestion (always use the first option).	`False`
`--diff`, `-D`	bool	Show examples as visual diff.	`False`

prodigy compare eval_translation ./model_a.jsonl ./model_b.jsonl

model_a.jsonl
{"id": 1, "input": {"text": "FedEx von weltweiter Cyberattacke getroffen"}, "output": {"text": "FedEx hit by worldwide cyberattack"}}

model_b.jsonl
{"id": 1, "input": {"text": "FedEx von weltweiter Cyberattacke getroffen"}, "output": {"text": "FedEx from worldwide Cyberattacke hit"}}
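Entries like the ones above can also be produced from Python. The following is a minimal sketch, not part of Prodigy: the helper name `compare_line` is an illustrative assumption, and each returned string would still need to be written to its own line in `model_a.jsonl` or `model_b.jsonl`.

```python
import json


def compare_line(example_id, source_text, output_text):
    """Serialize one entry in the shape described above for `compare`."""
    entry = {
        "id": example_id,  # matches up the two outputs for the same input
        "input": {"text": source_text},
        "output": {"text": output_text},
    }
    # ensure_ascii=False keeps non-ASCII characters readable in the file
    return json.dumps(entry, ensure_ascii=False)


source = "FedEx von weltweiter Cyberattacke getroffen"
line_a = compare_line(1, source, "FedEx hit by worldwide cyberattack")
line_b = compare_line(1, source, "FedEx from worldwide Cyberattacke hit")
print(line_a)
print(line_b)
```

Both lines share the same "id", which is what allows the two outputs to be paired for the same input.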


`metric.iaa.doc` command New: 1.14.3

Interface: terminal only
Use case: computes document-level inter-annotator agreement

Compute the inter-annotator agreement (IAA) for document-level annotations using percent agreement, Krippendorff’s Alpha, and Gwet’s AC2 as metrics. The algorithm implementation is ported from https://github.com/pmbaumgartner/prodigy-iaa and is benchmarked against the Gwet (2015) paper. The current implementation supports two types of annotations: multiclass and multilabel (for binary annotations, please see metric.iaa.binary). Importantly, the annotations are grouped by the _input_hash, i.e. all annotations that have the same _input_hash are considered to be the same annotation task. For details on other source data assumptions and the interpretation of the results, please see the metrics guide.

The command will output the results to the terminal:

Example metric.iaa.doc output

prodigy metric.iaa.doc dataset:textcat multiclass -l POSITIVE,NEGATIVE

ℹ Annotation Statistics

Attribute                     Value
--------------------------    -----
Examples                      100
Categories                    3
Coincident Examples*          50
Single Annotation Examples    4
Annotators                    2
Avg. Annotations per Example  2.00
* (>1 annotation)

ℹ Agreement Statistics

Statistic                     Value
--------------------------    -----
Percent (Simple) Agreement    0.4167
Krippendorff's Alpha          0.1809
Gwet's AC2                    0.1640

Argument	Type	Description	Default
`source`	str	Path to source or `dataset:name` to load from existing annotations. Directories with JSONL files are also supported.
`annotation_type`	str	Annotation type in the source. `multiclass` or `multilabel`
`--loader`, `-lo`	str	Optional ID of text source loader. If not set, source file extension is used to determine loader.	`None`
`--labels`, `-l`	str	Comma-separated list of labels. If not provided, it will be inferred from the dataset.	`None`
`--annotators`, `-a`	str	Comma-separated list of annotators. If not provided, it will be inferred from the dataset.	`None`
`--output`, `-o`	str	Path to a `json` file to save the results to disk.	`None`
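To make the grouping by `_input_hash` concrete, here is a minimal sketch of simple percent agreement. It is not Prodigy's implementation: the flat `"label"` field is a simplifying assumption about the data, and Krippendorff’s Alpha and Gwet’s AC2 are not covered.

```python
from collections import defaultdict
from itertools import combinations


def percent_agreement(annotations):
    """Average pairwise agreement over tasks annotated more than once."""
    by_task = defaultdict(list)
    for eg in annotations:
        # Annotations sharing an _input_hash count as the same task
        by_task[eg["_input_hash"]].append(eg["label"])
    scores = []
    for labels in by_task.values():
        if len(labels) < 2:
            continue  # single-annotation examples carry no agreement signal
        pairs = list(combinations(labels, 2))
        scores.append(sum(a == b for a, b in pairs) / len(pairs))
    return sum(scores) / len(scores) if scores else None


# Hypothetical annotations; "label" is a simplified stand-in for the real task format
annotations = [
    {"_input_hash": 1, "_annotator_id": "alice", "label": "POSITIVE"},
    {"_input_hash": 1, "_annotator_id": "bob", "label": "POSITIVE"},
    {"_input_hash": 2, "_annotator_id": "alice", "label": "POSITIVE"},
    {"_input_hash": 2, "_annotator_id": "bob", "label": "NEGATIVE"},
    {"_input_hash": 3, "_annotator_id": "alice", "label": "NEGATIVE"},
]
print(percent_agreement(annotations))  # 0.5
```

Task 1 is a full agreement, task 2 a full disagreement, and task 3 is skipped because only one annotator saw it, which gives 0.5.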

`metric.iaa.span` command New: 1.14.3

Interface: terminal only
Use case: computes span-level inter-annotator agreement

Compute the inter-annotator agreement (IAA) for span-level text annotations using micro-averaged pairwise F1 score as the metric. For computing IAA on reject/accept decisions of span annotations see metric.iaa.binary. Importantly, the annotations are grouped by the _input_hash i.e. all annotations that have the same _input_hash are considered to be the same annotation task. For more details on other source data assumptions and the interpretation of the results please see the metrics guide.

The command will output the results to the terminal:

Example metric.iaa.span output

prodigy metric.iaa.span dataset:ner -l PER,ORG

ℹ Annotation Statistics

Attribute                     Value
--------------------------    -----
Examples                      22
Categories                    2
Coincident Examples*          12
Single Annotation Examples    0
Annotators                    2
Avg. Annotations per Example  2.00
* (>1 annotation)

ℹ Agreement Statistics

Label                         Value
--------------------------    -----
PER                           0.0
ORG                           0.5

ℹ Confusion matrix

        PER    ORG   NONE
---    ----   ----   ----
PER    0.00   0.00   0.00
ORG    0.00   0.67   0.33
NONE   0.00   0.60   0.40

Argument	Type	Description	Default
`source`	str	Path to source or `dataset:name` to load from existing annotations. Directories with JSONL files are also supported.
`--loader`, `-lo`	str	Optional ID of text source loader. If not set, source file extension is used to determine loader.	`None`
`--labels`, `-l`	str	Comma-separated list of labels. If not provided, it will be inferred from the dataset.	`None`
`--annotators`, `-a`	str	Comma-separated list of annotators. If not provided, it will be inferred from the dataset.	`None`
`--partial`, `-P`	bool	Consider partial span matches as agreement.	`False`
`--output`, `-o`	str	Path to a `json` file to save the results to disk.	`None`
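As an illustration of a micro-averaged pairwise F1 score, the sketch below compares the spans of two annotators on the tasks they share. It is not Prodigy's implementation: it handles exactly two annotators, counts exact matches only (the equivalent of leaving `--partial` unset), and the function names and sample data are hypothetical.

```python
def span_set(example):
    """Represent an example's spans as comparable (start, end, label) tuples."""
    return {(s["start"], s["end"], s["label"]) for s in example.get("spans", [])}


def pairwise_f1(examples_a, examples_b):
    """Micro-averaged F1 between two annotators, exact span matches only."""
    tp = fp = fn = 0
    # Only tasks seen by both annotators (same _input_hash) are compared
    for input_hash in examples_a.keys() & examples_b.keys():
        a = span_set(examples_a[input_hash])
        b = span_set(examples_b[input_hash])
        tp += len(a & b)  # spans both annotators marked
        fp += len(b - a)  # spans only annotator B marked
        fn += len(a - b)  # spans only annotator A marked
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Hypothetical annotations keyed by _input_hash
annotator_a = {1: {"spans": [
    {"start": 0, "end": 5, "label": "ORG"},
    {"start": 10, "end": 14, "label": "PER"},
]}}
annotator_b = {1: {"spans": [{"start": 0, "end": 5, "label": "ORG"}]}}
print(round(pairwise_f1(annotator_a, annotator_b), 4))  # 0.6667
```

Counts are pooled across all shared tasks before computing precision and recall, which is what makes the score micro-averaged rather than averaged per task.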

`metric.iaa.binary` command New: 1.14.3

Interface: terminal only
Use case: computes inter-annotator agreement for binary annotations

Compute the inter-annotator agreement (IAA) for binary annotations using percent agreement, Krippendorff’s Alpha, and Gwet’s AC2 as metrics. The algorithm implementation is ported from https://github.com/pmbaumgartner/prodigy-iaa and is benchmarked against the Gwet (2015) paper. Importantly, the annotations are grouped by the _input_hash, i.e. all annotations that have the same _input_hash are considered to be the same annotation task. For details on other source data assumptions and the interpretation of the results, please see the metrics guide.

The command will output the results to the terminal:

Example metric.iaa.binary output

prodigy metric.iaa.binary dataset:ner

ℹ Annotation Statistics

Attribute                     Value
--------------------------    -----
Examples                      37
Categories                    2
Coincident Examples*          17
Single Annotation Examples    3
Annotators                    2
Avg. Annotations per Example  1.85
* (>1 annotation)

ℹ Agreement Statistics

Statistic                     Value
--------------------------    -----
Percent (Simple) Agreement    0.5294
Krippendorff's Alpha          0.1833
Gwet's AC2                    0.1681

Argument	Type	Description	Default
`source`	str	Path to source or `dataset:name` to load from existing annotations. Directories with JSONL files are also supported.
`--loader`, `-lo`	str	Optional ID of text source loader. If not set, source file extension is used to determine loader.	`None`
`--annotators`, `-a`	str	Comma-separated list of annotators. If not provided, it will be inferred from the dataset.	`None`
`--output`, `-o`	str	Path to a `json` file to save the results to disk.	`None`

Other Utilities and Commands

`mark` binary / manual

Interface: n/a
Saves: annotations to the database
Use case: show data and accept or reject examples

Start the annotation server, display whatever comes in with a given interface and collect binary annotations. At the end of the annotation session, a breakdown of the answer counts is printed. The --view-id option lets you specify one of the existing annotation interfaces. Just make sure your input data includes everything the interface needs, since this recipe does no preprocessing and will show you whatever is in the data. The recipe is also very useful if you want to re-annotate data exported with db-out.

prodigy mark dataset source --loader --label --view-id --exclude

Argument	Type	Description	Default
`dataset`	str	Prodigy dataset to save annotations to.
`source`	str	Path to text source or `-` to read from standard input.
`--loader`, `-lo`	str	Optional ID of text source loader. If not set, source file extension is used to determine loader.	`None`
`--label`, `-l`	str	Label to apply in classification mode or comma-separated labels to show for manual annotation.	`''`
`--view-id`, `-v`	str	Annotation interface to use.	`None`
`--exclude`, `-e`	str	Comma-separated list of dataset IDs containing annotations to exclude.	`None`

prodigy mark news_marked ./news_headlines.jsonl --label INTERESTING --view-id classification


`match` binary New: 1.9.8

Interface: n/a
Saves: annotations to the database
Use case: select examples based on match patterns

Select examples based on match patterns and accept or reject the result. Unlike ner.manual with patterns, this recipe will only show examples if they contain pattern matches. It can be used for NER and text classification annotations – for instance, to bootstrap a text category if the classes are very imbalanced and not enough positive examples are presented during manual annotation or textcat.teach. The --label-task and --label-span flags can be used to specify where the label should be added. This will also be reflected via the "label" property (on the top-level task or the spans) in the data you create with the recipe. If --combine-matches is set, all matches will be presented together. Otherwise, each match will be presented as a separate task.

prodigy match dataset spacy_model source --loader --label --patterns --label-task --label-span --combine-matches --exclude

Argument	Type	Description	Default
`dataset`	str	Prodigy dataset to save annotations to.
`spacy_model`	str	Loadable spaCy pipeline for tokenization to initialize the matcher, or `blank:lang` for a blank model (e.g. `blank:en` for English).
`source`	str	Path to text source or `-` to read from standard input.
`--loader`, `-lo`	str	Optional ID of text source loader. If not set, source file extension is used to determine loader.	`None`
`--label`, `-l`	str	Comma-separated label(s) to annotate or text file with one label per line. Only pattern matches for those labels will be shown.
`--patterns`, `-pt`	str	Path to match patterns file.
`--label-task`, `-LT`	bool	Whether to add a label to the top-level task if a match for that label was found. For example, if you use this recipe for text classification, you typically want to add a label to the whole task.	`False`
`--label-span`, `-LS`	bool	Whether to add a label to the matched span that’s highlighted. For example, if you use this recipe for NER, you typically want to add a label to the span but not the whole task.	`False`
`--combine-matches`, `-C`	bool	Whether to show all matches in one task. If `False`, the matcher will output one task for each match and duplicate tasks if necessary.	`False`
`--exclude`, `-e`	str	Comma-separated list of dataset IDs containing annotations to exclude.	`None`

prodigy match news_matched blank:en ./news_headlines.jsonl --patterns ./news_patterns.jsonl --label ORG,PRODUCT --label-span
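For orientation, a patterns file such as `./news_patterns.jsonl` contains one JSON object per line with a "label" and a "pattern". The sketch below builds a few hypothetical patterns and keeps only those for the requested labels. This mirrors the documented effect of `--label` and is not Prodigy's internal code; the helper `select_patterns` and the pattern contents are assumptions for illustration.

```python
import json

# Hypothetical patterns: token-based patterns and one exact string pattern
patterns = [
    {"label": "ORG", "pattern": [{"lower": "apple"}]},
    {"label": "PRODUCT", "pattern": "iPhone"},
    {"label": "PERSON", "pattern": [{"lower": "tim"}, {"lower": "cook"}]},
]


def select_patterns(patterns, labels):
    """Keep only patterns whose label is in the requested label set."""
    wanted = set(labels)
    return [p for p in patterns if p["label"] in wanted]


selected = select_patterns(patterns, ["ORG", "PRODUCT"])
# One JSON object per line, ready to be saved as a .jsonl file
patterns_jsonl = "\n".join(json.dumps(p) for p in selected)
print(patterns_jsonl)
```

With the labels `ORG,PRODUCT`, the `PERSON` pattern is dropped, so no matches for it would be shown.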


`print-stream` command New: 1.9

Interface: terminal only
Use case: quickly view a spaCy model's predictions

Pretty-print the model’s predictions on the command line. Supports named entities and text categories and will display the annotations if the model components are available. For textcat annotations, only the category with the highest score is shown if the score is greater than 0.5.

prodigy print-stream spacy_model source --loader

Argument	Type	Description	Default
`spacy_model`	str	Loadable spaCy pipeline.
`source`	str	Path to text source or `-` to read from standard input.
`--loader`, `-lo`	str	Optional ID of text source loader. If not set, source file extension is used to determine loader.	`None`
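The display rule for text categories described above can be sketched as follows. This is an illustration of the rule only, not the command's actual code; it assumes a dictionary of label scores like the one spaCy exposes as `doc.cats`, and the function name `top_category` is hypothetical.

```python
def top_category(cats, threshold=0.5):
    """Return the highest-scoring (label, score) pair if it clears the threshold."""
    if not cats:
        return None
    label, score = max(cats.items(), key=lambda item: item[1])
    # Only show the category if its score is greater than the threshold
    return (label, score) if score > threshold else None


print(top_category({"SPORTS": 0.82, "POLITICS": 0.11}))  # ('SPORTS', 0.82)
print(top_category({"SPORTS": 0.41, "POLITICS": 0.38}))  # None
```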

`print-dataset` command New: 1.9

Interface: terminal only
Use case: quickly inspect collected annotations

Pretty-print annotations from a given dataset on the command line. Supports plain text, text classification and NER annotations. If no --style is specified, Prodigy will try to infer it from the data via the "_view_id" that’s automatically added since v1.8.

prodigy print-dataset dataset --style

Argument	Type	Description	Default
`dataset`	str	Prodigy dataset ID.
`--style`, `-s`	str	Dataset type: `auto` (try to infer from the data, default), `text`, `spans` or `textcat`.	`auto`

`filter-by-patterns` command New: 1.12

Interface: terminal only
Saves: JSONL file to disk
Use case: filter data to produce interesting subset

Filter data using match patterns in order to produce a representative subset for downstream tasks. Such subsets can be useful to jump-start a model, e.g. in an active learning setting, especially when dealing with sparse entities. The output dataset will contain entity spans added to matching examples. The command also displays a progress bar.

Example

prodigy filter-by-patterns source output spacy_model --patterns --label --loader

Argument	Type	Description	Default
`source`	str	Data to filter (file path or `-` to read from standard input).
`output`	str	Path to .jsonl file, or dataset, to write subset into.
`spacy_model`	str	Loadable spaCy pipeline or `blank:lang` (e.g. `blank:en`).
`--patterns`, `-pt`	str	Path to match patterns file	`None`
`--label`, `-l`	str	Comma-separated label(s) to select subset of patterns	`None`
`--loader`, `-lo`	str	Loader (guessed from file extension if not set)	`None`

Example

prodigy filter-by-patterns examples.jsonl subset.jsonl en_core_web_sm --patterns ./patterns.jsonl
100%|████████████████████████████| 250/250 [00:12<00:00, 20.88it/s]

`db-out` command

Interface: terminal only
Saves: JSONL file to disk
Use case: export annotated data

Export annotations in Prodigy’s JSONL format. If the output directory doesn’t exist, it will be created. If no output directory is specified, the data will be printed so it can be redirected to a file.

prodigy db-out dataset out_dir --dry

Argument	Type	Description	Default
`dataset`	str	Dataset ID to import or export.
`out_dir`	str	Optional path to output directory to export annotation file to.	`None`
`--dry`, `-D`	bool	Perform a dry run and don’t save any files.	`False`

Example

prodigy db-out news_headlines > ./news_headlines.jsonl
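Once exported, the JSONL file can be inspected with a few lines of Python. The sketch below counts the "answer" values in a handful of hypothetical exported lines. The helper `count_answers` is an assumption for illustration; for a real export, the lines could come from `open("./news_headlines.jsonl", encoding="utf8")` instead.

```python
import json
from collections import Counter


def count_answers(lines):
    """Count the 'answer' values in an iterable of JSONL lines."""
    counts = Counter()
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines
        task = json.loads(line)
        counts[task.get("answer", "missing")] += 1
    return counts


# Hypothetical exported lines standing in for the file contents
exported = [
    '{"text": "Headline one", "answer": "accept"}',
    '{"text": "Headline two", "answer": "reject"}',
    '{"text": "Headline three", "answer": "accept"}',
    '{"text": "Headline four", "answer": "ignore"}',
]
print(dict(count_answers(exported)))
```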

`db-merge` command

Interface: terminal only
Saves: merged examples to the database
Use case: merging multiple datasets with annotations into one

Merge two or more existing datasets into a new set, e.g. to create a final dataset that can be reviewed or used to train a model. Keeps a copy of the original datasets and creates a new set for the merged examples.

prodigy db-merge in_sets out_set --rehash --dry

Argument	Type	Description	Default
`in_sets`	str	Comma-separated names of datasets to merge.
`out_set`	str	Name of dataset to save the merged examples to.
`--rehash`, `-R`	bool	New: 1.10 Force-update all hashes assigned to examples.	`False`
`--dry`, `-D`	bool	Perform a dry run and don’t save anything.	`False`

prodigy db-merge news_person,news_org,news_product news_training
✔ Merged 2893 examples from 3 datasets
Created merged dataset 'news_training'

`db-in` command

Interface: terminal only
Saves: imported examples to the database
Use case: importing existing annotated data

Import existing annotations to the database. Can load all file types supported by Prodigy. To import NER annotations, the files should be converted into Prodigy’s JSONL annotation format.

prodigy db-in dataset in_file --rehash --dry

Argument	Type	Description	Default
`dataset`	str	Dataset ID to import or export.
`in_file`	str	Path to input annotation file.
`--answer`	str	Set this answer key if none is present	`"accept"`
`--rehash`, `-rh`	bool	Update and overwrite all hashes.	`False`
`--dry`, `-D`	bool	Perform a dry run and don’t save any files.	`False`
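As a rough sketch of converting existing NER annotations before importing them, the code below builds a task with a "text" and character-offset "spans" and serializes it as one JSONL line. The helper `to_prodigy_example` and the sample sentence are assumptions for illustration. The explicit `"answer"` value mirrors what `--answer` would otherwise fill in.

```python
import json


def to_prodigy_example(text, entities, answer="accept"):
    """Build a task dict with character-offset spans for NER annotations."""
    spans = []
    for start, end, label in entities:
        # Offsets must point inside the text, otherwise the span is unusable
        if not 0 <= start < end <= len(text):
            raise ValueError(f"Span ({start}, {end}) is outside the text")
        spans.append({"start": start, "end": end, "label": label})
    return {"text": text, "spans": spans, "answer": answer}


text = "Apple updates its analytics service"
example = to_prodigy_example(text, [(0, 5, "ORG")])
print(json.dumps(example))  # one line of the file passed as in_file
```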

`drop` command

Interface: terminal only
Saves: updated database
Use case: remove datasets and sessions

Remove a dataset or annotation session from a project. This can’t be undone. To see all dataset and session IDs in the database, use prodigy stats -ls.

prodigy drop dataset --batch-size

Argument	Type	Description	Default
`dataset`	str	Dataset or session ID.
`--batch-size`, `-n`	int	Delete examples in batches of the given size. Prevents possible database error for large datasets.	`None`

`stats` command

Interface: terminal only
Use case: view installation details and database statistics

Print Prodigy and database statistics. Specifying a dataset ID will show detailed stats for the dataset, like annotation counts and meta data. You can also choose to list all available dataset or session IDs.

prodigy stats dataset -l -ls --no-format

Argument	Type	Description	Default
`dataset`	str	Optional Prodigy dataset ID.
`--list-datasets`, `-l`	bool	List IDs of all datasets in the database.	`False`
`--list-sessions`, `-ls`	bool	List IDs of all datasets and sessions in the database.	`False`
`--no-format`, `-nf`	bool	Don’t pretty-print the stats and print a simple dict instead.	`False`

Example

prodigy stats news_headlines -l

============================== Prodigy Stats ==============================

Version            1.9.0
Database Name      SQLite
Database Id        sqlite
Total Datasets     4
Total Sessions     23

================================= Datasets =================================

news_headlines, news_headlines_eval, github_docs, test

============================== Dataset Stats ==============================

Dataset            news_headlines
Created            2017-07-29 15:29:28
Description        Annotate news headlines
Author             Ines
Annotations        1550
Accept             671
Reject             435
Ignore             444

`progress` command New: 1.11

Interface: terminal only
Use case: view annotation progress over time

View the annotation progress of one or more datasets over time and optionally compare it against an input source to check the coverage. The command will output the new annotations created during the given intervals, the total annotations at each point, as well as the number of unique annotations if the data contains multiple annotations on the same input data.

prodigy progress datasets --interval --source --loader

Argument	Type	Description	Default
`datasets`	str	One or more comma-separated dataset names.
`--interval`, `-i`	str	Time period to calculate progress for. Can be `"day"`, `"week"`, `"month"`, `"year"`.	`"month"`
`--source`, `-s`	str	Optional path to text source or `-` to read from standard input. If set, will be used to calculate percentage of annotated examples based on the input data.
`--loader`, `-lo`	str	Optional ID of text source loader. If not set, source file extension is used to determine loader.	`None`

Example

prodigy progress news_headlines_person,news_headlines_org

================================== Legend ==================================

New         New annotations collected in interval
Total       Total annotations collected
Unique      Unique examples (not counting multiple annotations of same example)

================================= Progress =================================

               New   Unique   Total   Unique
10 Jul 2021   1123      733    1123      733
12 Jul 2021    200      200    1323      933
13 Jul 2021    831      711    2154     1644
14 Jul 2021    157      150    2311     1790
15 Jul 2021   1464     1401    3775     3191
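The relationship between the "New" and "Total" columns is a running sum over intervals. The sketch below reproduces that idea for daily intervals, assuming each annotation carries a Unix timestamp (for instance in a `"_timestamp"` field, which is an assumption here). It is not the command's actual implementation, does not compute the "Unique" columns, and the function name `progress_by_day` is hypothetical.

```python
from collections import Counter
from datetime import datetime, timezone
from itertools import accumulate


def progress_by_day(timestamps):
    """Return (day, new, total) rows from Unix timestamps of annotations."""
    per_day = Counter(
        datetime.fromtimestamp(ts, tz=timezone.utc).date() for ts in timestamps
    )
    days = sorted(per_day)
    new = [per_day[day] for day in days]
    totals = list(accumulate(new))  # running total up to and including each day
    return list(zip(days, new, totals))


# Hypothetical annotation timestamps: two on 10 Jul 2021, one on 12 Jul 2021
timestamps = [
    datetime(2021, 7, 10, 9, 0, tzinfo=timezone.utc).timestamp(),
    datetime(2021, 7, 10, 15, 30, tzinfo=timezone.utc).timestamp(),
    datetime(2021, 7, 12, 11, 0, tzinfo=timezone.utc).timestamp(),
]
for day, new_count, total_count in progress_by_day(timestamps):
    print(day.isoformat(), new_count, total_count)
```

Days without annotations produce no row, which matches the gap between 10 Jul and 12 Jul in the example output.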

`prodigy` command

Interface: terminal only
Use case: run recipe scripts

Run a built-in or custom Prodigy recipe. The -F option lets you load a recipe from a simple Python file, containing one or more recipe functions. All recipe arguments will be available from the command line. To print usage info and a list of available arguments, use the --help flag.

prodigy recipe_name …recipe arguments -F

Argument	Type	Description
`recipe_name`	positional	Recipe name.
`*recipe_arguments`		Recipe arguments.
`-F`	str	Path to recipe file to load custom recipe.
`--help`, `-h`	bool	Show help message and available arguments.

Example

prodigy custom-recipe my_dataset ./data.jsonl --custom-opt 123 -F recipe.py

recipe.py
import prodigy
from prodigy.components.stream import get_stream

# Each argument annotation is a tuple of (description, kind, shortcut, type)
@prodigy.recipe(
    "custom-recipe",
    dataset=("The dataset", "positional", None, str),
    source_file=("A positional argument", "positional", None, str),
    custom_opt=("An option", "option", "co", int)
)
def custom_recipe_function(dataset, source_file, custom_opt=10):
    # Load the source file as a stream of examples
    stream = get_stream(source_file)
    print("Custom option passed in via command line:", custom_opt)
    # Return the components: where to save, what to show and which interface to use
    return {
        "dataset": dataset,
        "stream": stream,
        "view_id": "text"
    }

Deprecated recipes

The following recipes have been deprecated in favor of newer workflows and best practices. See the table for details and replacements. The version numbers indicate when the feature was deprecated (but still available) and when it was removed. For instance, 1.10 1.11 indicates that the recipe was deprecated but still available in v1.10 and removed in v1.11. To view the recipe details and documentation of deprecated recipes, run the recipe command with the --help flag.


`ner.match`	1.10 1.11 This recipe has been deprecated in favor of `ner.manual` with `--patterns`, which lets you match patterns and allows editing the results at the same time, and the general-purpose `match`, which lets you match patterns and accept or reject the result.
`ner.eval`	1.10 1.11 This recipe has been deprecated in favor of creating regular gold-standard evaluation sets with `ner.manual` (fully manual) or `ner.correct` (semi-automatic).
`ner.print-stream`	1.10 1.11 This recipe has been deprecated in favor of the general-purpose `print-stream` command that can print streams of all supported types.
`ner.print-dataset`	1.10 1.11 This recipe has been deprecated in favor of the general-purpose `print-dataset` command that can print datasets of all supported types.
`ner.gold-to-spacy`	1.10 1.11 This recipe has been deprecated in favor of `data-to-spacy`, which can take multiple datasets of different types (e.g. NER and text classification) and outputs a JSON file in spaCy’s training format that can be used with `spacy train`.
`ner.iob-to-gold`	1.10 1.11 This recipe has been deprecated because it only served a very limited purpose. To convert IOB annotations, you can either use `spacy convert` or write a custom script.
`ner.batch-train`	1.10 1.11 This recipe has been deprecated in favor of the general-purpose `train` recipe that supports all components.
`ner.train-curve`	1.10 1.11 This recipe has been deprecated in favor of the general-purpose `train-curve` recipe that supports all components.
`textcat.eval`	1.10 1.11 This recipe has been deprecated in favor of creating regular gold-standard evaluation sets with `textcat.manual`.
`textcat.print-stream`	1.10 1.11 This recipe has been deprecated in favor of the general-purpose `print-stream` command that can print streams of all supported types.
`textcat.print-dataset`	1.10 1.11 This recipe has been deprecated in favor of the general-purpose `print-dataset` command that can print datasets of all supported types.
`textcat.batch-train`	1.10 1.11 This recipe has been deprecated in favor of the general-purpose `train` recipe that supports all components and works with both binary accept/reject annotations and multiple choice annotations out-of-the-box.
`textcat.train-curve`	1.10 1.11 This recipe has been deprecated in favor of the general-purpose `train-curve` recipe that supports all components.
`pos.gold-to-spacy`	1.10 1.11 This recipe has been deprecated in favor of `data-to-spacy`, which can take multiple datasets of different types (e.g. POS tags and NER) and outputs a JSON file in spaCy’s training format that can be used with `spacy train`.
`pos.batch-train`	1.10 1.11 This recipe has been deprecated in favor of the general-purpose `train` recipe that supports all components.
`pos.train-curve`	1.10 1.11 This recipe has been deprecated in favor of the general-purpose `train-curve` recipe that supports all components.
`dep.batch-train`	1.10 1.11 This recipe has been deprecated in favor of the general-purpose `train` recipe that supports all components.
`dep.train-curve`	1.10 1.11 This recipe has been deprecated in favor of the general-purpose `train-curve` recipe that supports all components.
`terms.train-vectors`	1.10 1.11 This recipe has been deprecated since wrapping word vector training in a recipe only introduces a layer of unnecessary abstraction. If you want to train your own vectors, use GloVe, fastText or Gensim directly and then add the vectors to a spaCy pipeline.
`image.test`	1.10 1.11 This recipe has been deprecated since it was mostly intended to demonstrate the new image capabilities on launch. For a real-world example of using Prodigy for object detection with a model in the loop, see this TensorFlow tutorial.
`pipe`	1.10 1.11 This command has been deprecated since it didn’t provide any Prodigy-specific functionality. To pipe data forward, you can convert the data to JSONL and run `cat data.jsonl
`dataset`	1.10 1.11 This command has been deprecated since it’s mostly redundant. If a dataset doesn’t exist in the database, it’s added automatically.