Usage

Prodigy Plugins

Some Prodigy recipes require a third-party library in order to work. To keep Prodigy lightweight, we’ve separated some of these recipes out into their own packages so that you can install them as plugins. These plugins always target the most recent version of Prodigy with regard to compatibility.

This section of the docs showcases such plugins. Note that you can also explore these recipes on GitHub to serve as a source of inspiration for further customisation.

🤗 Prodigy HF: Recipes that interact with the Hugging Face stack.
Contains hf.train.ner, hf.correct.ner, hf.upload and more.
GitHub repo
📄 Prodigy PDF: Recipes that help with the annotation of PDF files.
Contains pdf.image.manual and pdf.ocr.correct.
GitHub repo
🤫 Prodigy Whisper: Recipes that leverage OpenAI’s Whisper model for audio transcription.
Contains whisper.audio.transcribe.
GitHub repo
🍰 Prodigy Segment: Recipes that leverage Meta’s Segment Anything model for image segmentation.
Contains segment.image.manual and more.
GitHub repo
🏘 Prodigy ANN: Recipes that allow you to use approximate nearest neighbor techniques to help you annotate.
Contains ann.text.index, ann.image.index, ann.text.fetch and more.
GitHub repo
🌕 Prodigy Lunr: Recipes that allow you to use old-school string matching techniques to help you annotate.
Contains lunr.text.index, lunr.text.fetch and more.
GitHub repo
🦆 sense2vec: Recipes that allow you to fetch terms using phrase embeddings trained on Reddit.
Contains sense2vec.teach, sense2vec.to-patterns and more.
GitHub repo
🔎 Prodigy evaluate: Recipes that compute evaluation metrics for spaCy pipelines.
Contains evaluate.evaluate, evaluate.evaluate-example and more.
GitHub repo

🤗 Prodigy-HF

This plugin contains recipes that interact with the Hugging Face stack. Some recipes let you train transformer models directly on top of your annotations, while others let you upload artifacts to the Hugging Face cloud environment.

To use these recipes, you’ll first need to install the plugin.

Install prodigy-hf
pip install "prodigy-hf @ git+https://github.com/explosion/prodigy-hf"

Once it is installed you can explore some of the new recipes.

Training Hugging Face models

The first recipe you may want to try from this plugin is hf.train.ner, which trains custom NER models on top of your annotations.


prodigy hf.train.ner fashion,eval:fashion-eval hf-model-dir --epochs 10 --model-name distilbert-base-uncased

Once the model is done training you’ll be able to inspect the hf-model-dir folder to find all the trained state, including the checkpoints.
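
If you want a quick sanity check of the trained weights outside of Prodigy, you can load a checkpoint with the transformers pipeline API. This is only a minimal sketch, assuming the checkpoint is a standard token-classification model; the checkpoint path and example sentence are placeholders.

# A minimal sketch: load a trained checkpoint with the transformers pipeline API.
# The checkpoint path and example sentence are placeholders.
from transformers import pipeline

ner = pipeline(
    "token-classification",
    model="hf-model-dir/checkpoint-20",
    aggregation_strategy="simple",  # merge word pieces into whole entity spans
)
print(ner("The new jacket from the fashion dataset sold out in a day."))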

You can also choose to re-use this trained model to help you annotate data. The plugin features a hf.correct.ner recipe that works similarly to ner.correct, except here we get to use a Hugging Face model. This means that you can also use models from the Hugging Face Hub. This recipe will internally map the predictions from the transformer model to spaCy tokens.


prodigy hf.correct.ner fashion hf-model-dir/checkpoint-20 examples.jsonl --lang en

Note that this plugin also offers variants of these recipes for text classification. Check out the API docs for hf.train.textcat and hf.correct.textcat for more details.

Interacting with Hugging Face Hub

Alternatively, you can use this plugin to upload your annotated datasets to the Hugging Face Hub.


prodigy hf.upload fashion,eval:fashion-eval username/reponame

✔ Upload completed! You should be able to view repo at
https://huggingface.co/datasets/username/reponame.

Internally, this recipe validates the dataset for consistency and attempts to anonymise the annotators before uploading. You can turn this behavior off with flags, and you can also upload the dataset as a private repository so that it doesn’t appear publicly.
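
Once uploaded, the dataset can be pulled back down with the datasets library. A minimal sketch, assuming the username/reponame placeholder from the example above; the split names depend on how you uploaded your data.

# A minimal sketch: download the uploaded annotations with the datasets library.
# "username/reponame" is the placeholder repo from the example above.
from datasets import load_dataset

dataset = load_dataset("username/reponame")
print(dataset)               # shows the available splits
print(dataset["train"][0])   # inspect a single annotated example (split names may differ)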

API

hf.train.ner command

  • Interface: terminal only
  • Use case: train Hugging Face models directly

Trains a Hugging Face model for NER directly on your annotated datasets.


prodigy hf.train.ner datasets out_dir --model-name --batch-size --eval-split --learning-rate --verbose

Argument | Type | Description | Default
datasets | positional | One or more (comma-separated) datasets for the named entity recognizer. Use the eval: prefix for evaluation.
out_dir | positional | Folder to store the trained model and checkpoints.
--model-name, -mn | option | The model you’d like to use as a starting point for training. | "distilbert-base-uncased"
--batch-size, -bs | option | Batch size for training. | 8
--eval-split, -es | option | If no evaluation sets are provided for a component, this setting can be used to split off a percentage of the training examples for evaluation. If no evaluation split is given, performance is reported on the training set.
--learning-rate, -lr | option | Learning rate. | 2e-5
--verbose, -v | flag | Output all the logs/warnings from the Hugging Face libraries. | False

hf.correct.ner manual

  • Interface: ner_manual
  • Use case: Annotate NER with a model in the loop

Annotate NER data with a transformer model in the loop.


prodigy hf.correct.ner dataset model_name source --lang

Argument | Type | Description | Default
dataset | positional | Dataset to save annotations into.
model_name | positional | Path to the transformer model. Can also point to a model on the Hugging Face Hub.
source | positional | Source file to annotate.
--lang, -l | option | Language to assume for the spaCy tokeniser. | "en"

hf.train.textcat command

  • Interface: terminal only
  • Use case: train Hugging Face models directly

Trains a Hugging Face model for text classification directly on your annotated datasets.


prodigy hf.train.textcat datasets out_dir --model-name --batch-size --eval-split --learning-rate --verbose

Argument | Type | Description | Default
datasets | positional | One or more (comma-separated) datasets for the text classifier. Use the eval: prefix for evaluation.
out_dir | positional | Folder to store the trained model and checkpoints.
--model-name, -mn | option | The name of the model to be used as a starting point for training. | "distilbert-base-uncased"
--batch-size, -bs | option | Batch size for training. | 8
--eval-split, -es | option | If no evaluation sets are provided for a component, this setting can be used to split off a percentage of the training examples for evaluation. If no evaluation split is given, performance is reported on the training set.
--learning-rate, -lr | option | Learning rate. | 2e-5
--verbose, -v | flag | Output all the logs/warnings from the Hugging Face libraries. | False

hf.correct.textcat manual

  • Interface: choice
  • Use case: Annotate textcat data with a model in the loop

Annotate data for text classification with a transformer model in the loop.


prodigy hf.correct.textcat dataset model_name source

Argument | Type | Description | Default
dataset | positional | Dataset to save annotations into.
model_name | positional | Path to the transformer model. Can also point to a model on the Hugging Face Hub.
source | positional | Source file to annotate.

hf.upload command

  • Interface: terminal only
  • Use case: upload annotations to the Hugging Face Hub

Upload your annotations to Hugging Face Hub.

You can use the same command multiple times to upload the most recent version of your data to the hub.


prodigy hf.upload datasets repo_id --keep-annotator-ids --patch_values --private

Argument | Type | Description | Default
datasets | positional | One or more (comma-separated) datasets to upload. Use the name: prefix to add keys to the dataset.
repo_id | positional | Name of the repo to upload to, formatted as username/reponame.
--keep-annotator-ids, -k | flag | Don’t anonymize the annotators. | False
--patch_values, -nv | flag | If keys are missing between datasets, patch them with None values. | False
--private, -p | flag | Upload the dataset as a private repository. | False

Prodigy-PDF

This plugin contains recipes that help you annotate PDF files by turning them into images first. This way, they can be annotated using the familiar image_manual interface. It also contains recipes for OCR. If you’re interested in a quick overview, you may appreciate this YouTube explainer.

To use these recipes, you’ll first need to install the plugin.

Install prodigy-pdf
pip install "prodigy-pdf @ git+https://github.com/explosion/prodigy-pdf"

Then once it is installed, you can start annotating PDFs as images via pdf.image.manual.


prodigy pdf.image.manual papers path/pdfs --labels figure,footnote,paragraph


If you like, you can re-use the PDF annotations with the pdf.ocr.correct recipe to apply OCR to the annotated segments. This recipe uses pytesseract under the hood to give suggestions that you can correct.


prodigy pdf.ocr.correct ocr_images papers path/pdfs --labels paragraph --fold-dashes

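Under the hood, the OCR step boils down to cropping the annotated region out of the page image and running it through pytesseract. The sketch below only illustrates that idea and is not the plugin’s exact implementation; the page image and box coordinates are placeholders.

# A minimal sketch of the OCR idea: crop an annotated segment and run pytesseract.
# "page.png" and the box coordinates are placeholders for your own data.
from PIL import Image
import pytesseract

page = Image.open("page.png")
x, y, w, h = 40, 120, 500, 300              # a hypothetical annotated segment
segment = page.crop((x, y, x + w, y + h))
print(pytesseract.image_to_string(segment))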

API

pdf.image.manual manual

Add annotations to a PDF by first converting it to an image.

In order for this recipe to work, you may need to install system dependencies for tesseract. These can usually be installed directly via:

# for mac
brew install tesseract

# for ubuntu
sudo apt install tesseract-ocr

prodigy pdf.image.manual dataset pdf_folder --labels --remove-base64

Argument | Type | Description | Default
dataset | str | Prodigy dataset to save annotations to.
pdf_folder | Path | Folder that contains your PDF files.
--labels, -l | str | Comma-delimited labels to attach.
--remove-base64, -R | bool | Don’t save the base64 images of the PDFs. | False

pdf.ocr.correct manual

  • Interface: text_input
  • Use case: Add OCR annotations to PDF segments.

Applies OCR to annotated segments from pdf.image.manual and gives a textbox for corrections.


prodigy pdf.ocr.correct dataset source --labels --scale --fold-dashes --remove-base64 --autofocus

Argument | Type | Description | Default
dataset | str | Prodigy dataset to save annotations to.
source | str | Source with PDF annotations.
--labels, -l | str | Labels to consider.
--scale, -s | int | Zoom scale. Increase above 3 to upscale the image for OCR. | 3
--remove-base64, -R | bool | Don’t save the base64 images of the PDFs. | False
--fold-dashes, -f | bool | Remove dashes at the end of a text line and fold them with the next term. | False
--autofocus, -af | bool | Autofocus on the transcript UI. | False

🤫 Prodigy-Whisper

OpenAI released an open model for audio transcription called Whisper. The model can be downloaded locally, it supports multiple languages, and you can even pick from a selection of model sizes. The model isn’t perfect, but when you’re transcribing audio it can really help to have such a model provide a starting point. The goal of this plugin is to help you get started with it right away.

To use this plugin, you’ll need to install it first.

Install prodigy-whisper
pip install "prodigy-whisper @ git+https://github.com/explosion/prodigy-whisper"

In order to use the plugin you’ll also need to have ffmpeg installed. Most package managers have it available, so you should be able to use one of the following commands.

Install ffmpeg
# on Ubuntu or Debian
sudo apt update && sudo apt install ffmpeg

# on Arch Linux
sudo pacman -S ffmpeg

# on MacOS using Homebrew (https://brew.sh/)
brew install ffmpeg

# on Windows using Chocolatey (https://chocolatey.org/)
choco install ffmpeg

# on Windows using Scoop (https://scoop.sh/)
scoop install ffmpeg

Once the plugin is installed you can use the whisper.audio.transcribe recipe. It is very similar to the audio.transcribe recipe that Prodigy provides, but this recipe uses Whisper to provide an initial transcription.

Example

prodigy whisper.audio.transcribe transcripts ./recordings --model base

Even in its base form you can see that Whisper does a pretty good job at transcription. But it may be easier to correct short pieces of audio instead of one long recording. This is where Whisper can help out as well: it is able to segment a long audio clip into shorter segments, and each of these segments can then be annotated in Prodigy.

To use this feature, you can add the --segment flag to the recipe call.

Example

prodigy whisper.audio.transcribe transcripts ./recordings --model base --segment

Now you can go through the segments one by one, and each segment will have metadata attached so that you can link it back to the timestamps in the original audio file.
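
If you want to see what Whisper produces for these segments outside of Prodigy, you can call the library directly. A minimal sketch with the openai-whisper package; "recording.wav" is a placeholder for one of your own audio files.

# A minimal sketch using the openai-whisper package directly.
# "recording.wav" is a placeholder for one of your own audio files.
import whisper

model = whisper.load_model("base")
result = model.transcribe("recording.wav")

print(result["text"])               # the full transcript
for segment in result["segments"]:  # per-segment timestamps and text
    print(segment["start"], segment["end"], segment["text"])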

API

whisper.audio.transcribe manual

  • Interface: blocks/ audio/ text_input
  • Saves: annotations to the database
  • Use case: Manually create transcriptions for audio with a Whisper model in the loop

Manually transcribe audio files by typing the transcript into a text field with the help of Whisper. The recipe is built on top of audio.transcribe and lets you configure everything that the original recipe can. The main addition is that this recipe also allows you to select a Whisper model. It uses the "base" model by default, but you should be able to pick any of the models shown here.


prodigy whisper.audio.transcribe dataset source --model --loader --autoplay --keep-base64 --fetch-media --playpause-key --text-rows --field-id --exclude

Argument | Type | Description | Default
dataset | str | Prodigy dataset to save annotations to.
source | str | Path to a directory containing audio files or a pre-formatted JSONL file if --loader jsonl is set.
--model, -m | str | Name of the OpenAI Whisper model to use. | base
--loader, -lo | str | Optional ID of source loader, e.g. audio or video. | audio
--autoplay, -A | bool | Autoplay the audio when a new task loads. | False
--keep-base64, -B | bool | If the audio loader is used: don’t remove the base64-encoded audio data from the task before it’s saved to the database. | False
--fetch-media, -FM | bool | Convert local paths and URLs to base64. Can be enabled if you’re annotating a JSONL file with paths or re-annotating an existing dataset. | False
--playpause-key, -pk | str | Alternative keyboard shortcuts to toggle play/pause so it doesn’t conflict with the text input field. | "command+enter, option+enter, ctrl+enter"
--text-rows, -tr | int | Height of the text input field, in rows. | 6
--field-id, -fi | str | Add the transcript text to the data using this key, e.g. "transcript": "Text here". | "transcript"
--exclude, -e | str | Comma-separated list of dataset IDs containing annotations to exclude. | None

Prodigy-Segment

Sometimes you’re interested in selecting pixels from an image, as opposed to merely selecting a bounding box. Selecting the right pixels can be tedious work so you may want to use a model in the loop to help you. A good choice for such a model is Meta’s Segment Anything model, which we’ve integrated into Prodigy via the prodigy-segment plugin.

This model is able to take bounding box annotations from Prodigy to construct a pixel segmentation map under the hood. From the UI, that might look like this:

Using Prodigy-Segment

For a quick overview of the features, you may also enjoy this Youtube tutorial.
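
To get an intuition for what happens when you draw a bounding box, the sketch below uses the segment-anything library directly: it loads a checkpoint, embeds an image and turns a box prompt into a pixel mask. This only illustrates the underlying API, not the plugin’s exact code; the image path and box coordinates are placeholders.

# A minimal sketch of the Segment Anything API that the recipes build on.
# "example.jpg" and the box coordinates are placeholders.
import numpy as np
from PIL import Image
from segment_anything import sam_model_registry, SamPredictor

sam = sam_model_registry["default"](checkpoint="sam_vit_h_4b8939.pth")
predictor = SamPredictor(sam)

image = np.array(Image.open("example.jpg").convert("RGB"))
predictor.set_image(image)                    # the slow, cacheable step

box = np.array([100, 150, 400, 500])          # x0, y0, x1, y1 of a drawn box
masks, scores, _ = predictor.predict(box=box, multimask_output=False)
print(masks.shape, scores)                    # one boolean pixel mask for the box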

Before you can use the recipes, you’ll want to make sure you’ve downloaded the appropriate model checkpoint. You can check the available models here, but this tutorial will assume the "default" model type. The weights for this model can be downloaded via:

Download the weights for the `default` model-type
wget https://dl.fbaipublicfiles.com/segment_anything/sam_vit_h_4b8939.pth

Once the model is downloaded you can get started by running the segment.image.manual recipe.


prodigy segment.image.manual segment-cat-dog images sam_vit_h_4b8939.pth --model-type default --label cat,dog

When you run this recipe, you may notice that it’s fairly slow. That’s no big surprise given the size of the model, but it can be a serious burden, especially if your machine does not have a GPU. For a better experience, you may want to pre-compute the features ahead of annotation time and cache the results to disk. It may take a while to precompute the features for all the images, but once that is done the annotation experience feels seamless and real-time again.

To precompute a cache, you can use the segment.fill-cache recipe.


prodigy segment.fill-cache images sam_vit_h_4b8939.pth --model-type default --cache segment-anything-cache

This will store all the features in a folder (configurable via the --cache flag) which the segment.image.manual recipe can immediately pick up.


prodigy segment.image.manual segment-cat-dog images sam_vit_h_4b8939.pth --model-type default --label cat,dog --cache segment-anything-cache

The pixel maps, once annotated, are stored under the spans key in your examples. You can explore these maps one by one in a Jupyter notebook using the script shown below.

Script to loop over all annotated examples
import base64
from io import BytesIO
from PIL import Image

from prodigy.components.db import connect

db = connect()
examples = db.get_dataset_examples("<dataset-name>")

def mask_to_pil(mask_str):
    # Strip the "data:...;base64," prefix and decode the pixel mask into a PIL image
    indicator = "base64,"
    mask_str = mask_str[mask_str.find(indicator) + len(indicator):]
    buffer = BytesIO(base64.b64decode(mask_str))
    return Image.open(buffer)

# Loop over all the examples and display them.
for ex in examples:
    print(ex['path'])
    for span in ex.get("spans", []):
        # Use builtin `display` to view pixel map
        display(mask_to_pil(span['mask']))

From here you can re-use the Pillow library to either store these pixel maps in the required format for your pipeline or stream them directly into a learning algorithm from Python.
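
For example, building on the mask_to_pil helper from the script above, you could turn each mask into a NumPy array or write it out as a PNG. A small sketch; the output file names are placeholders.

# A small sketch that builds on mask_to_pil and `examples` from the script above.
import numpy as np

for ex in examples:
    for i, span in enumerate(ex.get("spans", [])):
        mask_img = mask_to_pil(span["mask"])
        mask_arr = np.array(mask_img)          # pixel map as a NumPy array
        print(ex["path"], span.get("label"), mask_arr.shape)
        mask_img.save(f"mask-{i}.png")         # or persist it as a PNG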

API

segment.image.manual manual

  • Interface: blocks/ image_manual
  • Saves: annotations to the database
  • Use case: Annotate pixels by drawing bounding boxes

Manually annotate pixels in images with Meta’s Segment Anything model under the hood.


prodigy segment.image.manual dataset source checkpoint --label --loader --exclude --width --darken --no-fetch --remove-base64 --model-type --cache

Argument | Type | Description | Default
dataset | str | Prodigy dataset to save annotations to.
source | str | Path to a directory containing image files or a pre-formatted JSONL file if --loader jsonl is set.
checkpoint | Path | Path to a model checkpoint.
--label, -l | str / Path | One or more labels to annotate. Supports a comma-separated list or a path to a file with one label per line.
--loader, -lo | str | Optional ID of source loader. | images
--exclude, -e | str | Comma-separated list of dataset IDs containing annotations to exclude. | None
--width, -w | int | Width of card and maximum image width in pixels. | 675
--darken, -D | bool | Darken image to make boxes stand out more. | False
--no-fetch, -NF | bool | Don’t fetch images as base64. Ideally requires a JSONL file as input, with --loader jsonl set and all images available as URLs. | False
--remove-base64, -R | bool | Remove base64-encoded image data before storing the example in the database and only keep the reference to the local file path. Caution: if enabled, make sure to keep the original files! | False
--model-type, -mt | str | Type of model to use. | default
--cache, -c | Path | Path to feature cache to speed up inference. | segment-anything-cache

segment.fill-cache command

  • Interface: terminal only
  • Saves: inference features into disk cache
  • Use case: Prepare images for segmented annotation

Prepares a local disk cache to speed up inference for segment.image.manual. This can cause a huge speedup if you’re running on a non-GPU device.


prodigy segment.fill-cache source checkpoint --loader --model-type --cache

Argument | Type | Description | Default
source | str | Path to a directory containing image files or a pre-formatted JSONL file if --loader jsonl is set.
checkpoint | Path | Path to a model checkpoint.
--loader, -lo | str | Optional ID of source loader. | images
--model-type, -mt | str | Type of model to use. | default
--cache, -c | Path | Path to feature cache to speed up inference. | segment-anything-cache

Prodigy-ANN

Sometimes you may want to query your examples to find a relevant subset for annotation. A modern method for doing this is to represent texts as numeric vectors and use approximate nearest neighbor (ANN) techniques to fetch relevant examples. The goal is to spend more time looking at examples that matter, like examples similar to items that the model gets wrong. Curating these examples first can be a pragmatic way to steer the model in the right direction.

The general approach for the ANN recipes.

If you’re interested in a quick demo of Prodigy-ANN applied to a text dataset, you may appreciate this Prodigy short on YouTube.

To use this plugin, you’ll need to install it first.

Install prodigy-ann
pip install "prodigy-ann @ git+https://github.com/explosion/prodigy-ann"

As a first step, you’ll need to generate an index with vector representations of your text. To encode the text this library uses sentence-transformers, and it uses hnswlib as an index for these vectors.
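
To make the moving parts concrete, the sketch below shows how texts might be embedded with sentence-transformers and indexed with hnswlib. The model name and example texts are placeholders, and this illustrates the idea rather than the recipe’s exact implementation.

# A minimal sketch of the encode-then-index idea behind the ann.* recipes.
# The model name and example texts are placeholders.
import hnswlib
from sentence_transformers import SentenceTransformer

texts = ["this is an outrage!", "great service, thanks!", "my order never arrived"]
model = SentenceTransformer("all-MiniLM-L6-v2")
vectors = model.encode(texts)

index = hnswlib.Index(space="cosine", dim=vectors.shape[1])
index.init_index(max_elements=len(texts))
index.add_items(vectors, list(range(len(texts))))

# Encode a query and fetch the closest examples from the index.
labels, distances = index.knn_query(model.encode(["angry customer"]), k=2)
print([texts[i] for i in labels[0]])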

To index your documents, you can run the ann.text.index recipe.


prodigy ann.text.index examples.jsonl examples.index

indexing: 100%|███████████████████████████| 2210/2210 [00:09<00:00, 243.64it/s]

Once it is indexed you can use text queries to find and curate interesting subsets. A general method to prepare these subsets is to use ann.text.fetch. This will fetch a subset of examples that are close to your query in vector space and save them to disk. From there you can use any Prodigy recipe you like.


prodigy ann.text.fetch examples.jsonl examples.index subset.jsonl --query "this is an outrage!"

More interfaces

As a convenience, this plugin also provides the textcat.ann.manual, ner.ann.manual and spans.ann.manual recipes so that you can query and annotate directly. These recipes have the same arguments as their native Prodigy textcat.manual, ner.manual and spans.manual counterparts, but add a --query parameter so that you can pass your query.

Interactive Queries

Sometimes you may want to update the stream while you’re annotating. You can do that without restarting the server by using the --allow-reset flag when you’re starting the textcat.ann.manual, ner.ann.manual or spans.ann.manual recipes.


prodigy textcat.ann.manual examples.jsonl examples.index --query "new academic dataset" --allow-reset

Here’s an example of what the experience might look like from the UI.

Retrieving Images

You can use these embedding retrieval techniques for images too. Models like CLIP allow you to embed images and text in the same space, which means that you can query the images by using text.
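
As an illustration of that idea, sentence-transformers ships CLIP models that can embed both PIL images and text into the same vector space. A minimal sketch; the model name and file names are placeholders.

# A minimal sketch of text-to-image search with a CLIP model.
# The model name and image file names are placeholders.
from PIL import Image
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("clip-ViT-B-32")

image_embeddings = model.encode([Image.open("laptop.jpg"), Image.open("cat.jpg")])
query_embedding = model.encode("a photo of a laptop")

print(util.cos_sim(query_embedding, image_embeddings))  # higher score = closer match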

The approach for images is very similar to the approach for text. To get started you’ll first want to run an indexing recipe over a folder of images via the ann.image.index recipe.


prodigy ann.image.index path/to/image_folder image.index

indexing: 100%|███████████████████████████| 210/210 [01:49<00:00]

Once the index is built, you can query it. You can choose to query it to prepare a .jsonl file to re-use later via the ann.image.fetch recipe.


prodigy ann.image.fetch path/to/image_folder examples.index out.jsonl --query "laptops" --remove-base64 --n 100

Alternatively, the plugin also provides a wrapper around the familiar image.manual recipe. This will retrieve the images before passing them on to the image_manual interface. This interface also allows you to reset the stream via the --allow-reset flag.


prodigy image.ann.manual annotated_laptops path/to/image_folder examples.index --query "laptops" --remove-base64 --n 100 --labels laptop,phone --allow-reset

Here’s an example of what the experience might look like from the UI.

API

ann.text.index command

  • Interface: terminal only
  • Use case: Prepare an HNSWlib index.

Builds an HNSWlib index on example text data.


prodigy ann.text.index source index_path

Argument | Type | Description | Default
source | Path | Path to the source data to index.
index_path | Path | Path to the trained index.

ann.text.fetch command

  • Interface: terminal only
  • Use case: Query to get a subset of interest.

Fetch a relevant subset using a HNSWlib index.


prodigy ann.text.fetch source index_path out_path --query --n

Argument | Type | Description | Default
source | Path | Path to the source data to index.
index_path | Path | Path to the trained index.
out_path | Path | Path to store the subset of interest.
--query, -q | str | Query to encode and pass to the index.
--n, -n | int | Number of results to return from the index. | 200

ann.image.index command

  • Interface: terminal only
  • Use case: Prepare an HNSWlib index.

Builds an HNSWlib index on example image data.


prodigy ann.image.index source index_path

Argument | Type | Description | Default
source | Path | Path to the source folder of images to index.
index_path | Path | Path to the trained index.

ann.image.fetch command

  • Interface: terminal only
  • Use case: Query to get a subset of interest.

Fetch a relevant subset of images using a HNSWlib index.


prodigy ann.image.fetch source index_path out_path --query --n --remove-base64

Argument | Type | Description | Default
source | Path | Path to the source folder of images for the index.
index_path | Path | Path to the trained index.
out_path | Path | Path to store the subset of interest.
--query, -q | str | Query to encode and pass to the index.
--n, -n | int | Number of items to retrieve. | 200
--remove-base64, -R | bool | Don’t save the base64 images on disk. | False

Prodigy-Lunr

Instead of using semantic vectors with approximate nearest neighbors to find relevant subsets, you can also resort to "regular" search techniques. To accommodate these, we’ve added support for recipes that use lunr. These recipes are very similar to their ann.* counterparts, but they rely on string-matching techniques to retrieve relevant examples.

The general approach for these recipes.
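
For a feel of what keyword-based retrieval looks like in Python, the sketch below uses the lunr package directly. The documents and query are placeholders, and this only illustrates the idea behind the recipes.

# A minimal sketch of keyword search with the lunr package.
# The documents and the query are placeholders.
from lunr import lunr

documents = [
    {"id": "1", "text": "outrage over the poor customer service"},
    {"id": "2", "text": "very happy with my new laptop"},
    {"id": "3", "text": "unhappy customer asking for a refund"},
]

index = lunr(ref="id", fields=("text",), documents=documents)
for result in index.search("unhappy service"):
    print(result["ref"], result["score"])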


To use this plugin, you’ll need to install it first.

Install prodigy-lunr
pip install "prodigy-lunr @ git+https://github.com/explosion/prodigy-lunr"

To index your documents, you can run the lunr.text.index recipe. This will generate an index and serialize it to disk by writing it into a gzipped JSON file.


prodigy lunr.text.index examples.jsonl index.gz.json

indexing: 100%|███████████████████████████| 2210/2210 [00:09<00:00, 243.64it/s]

Once it is indexed you can use text queries to find and curate interesting subsets. A general method to prepare these subsets is to use lunr.text.fetch. This will fetch a subset of matching examples and save them to disk. From there you can use any Prodigy recipe you like.


prodigy lunr.text.fetch examples.jsonl index.gz.json subset.jsonl --query "outrage better service unhappy"

More interfaces

As a convenience, this plugin also provides the textcat.lunr.manual, ner.lunr.manual and spans.lunr.manual recipes so that you can query and annotate directly. These recipes have the same arguments as their native Prodigy textcat.manual, ner.manual and spans.manual counterparts, but add a --query parameter so that you can pass your query.

Interactive Queries

Sometimes you may want to update the stream while you’re annotating. You can do that without restarting the server by using the --allow-reset flag when you’re starting the textcat.lunr.manual, ner.lunr.manual or spans.lunr.manual recipes.


prodigy textcat.lunr.manual examples.jsonl index.gz.json --query "outrage better service unhappy" --allow-reset

API

lunr.text.index command

  • Interface: terminal only
  • Use case: Prepare a Lunr index.

Builds a Lunr index on example text data.


prodigy lunr.text.index source index_path

Argument | Type | Description | Default
source | Path | Path to the source data to index.
index_path | Path | Path to the stored Lunr index.

lunr.text.fetch command

  • Interface: terminal only
  • Use case: Query to get a subset of interest.

Fetch a relevant subset using a Lunr index.


prodigy lunr.text.fetch source index_path out_path --query

Argument | Type | Description | Default
source | Path | Path to the source data to index.
index_path | Path | Path to the stored Lunr index.
out_path | Path | Path to store the subset of interest.
--query, -q | str | Query to pass to the index.

Sense2vec

sense2vec (Trask et al., 2015) is a nice twist on word2vec that lets you learn more interesting and detailed word vectors. The library is a simple Python implementation for loading, querying and training sense2vec models. To explore the semantic similarities across all Reddit comments of 2015 and 2019, see the interactive demo. There are also more details in this blog post.

To see a demo on how to use this tool with Prodigy, you may enjoy this Youtube video where we use it to detect video games in text.

To use sense2vec, you’ll first need to install it.

python -m pip install sense2vec

To use the pre-trained vectors in Prodigy you’ll need to download the archive(s) and extract them. Large files have been split into multi-part downloads. All the available versions can be found below.

Vectors | Size | Description | Download Link (zipped)
s2v_reddit_2019_lg | 4 GB | Reddit comments 2019 (01-07) | part 1, part 2, part 3
s2v_reddit_2015_md | 573 MB | Reddit comments 2015 | part 1

To merge the multi-part archives, you can run the following:

cat s2v_reddit_2019_lg.tar.gz.* > s2v_reddit_2019_lg.tar.gz

Once downloaded (and merged) you should be able to unarchive via:

tar -xvf s2v_reddit_2019_lg.tar.gz

Now that the archive is extracted you can point the sense2vec.teach recipe to it. This will allow Prodigy to suggest similar terms based on the most similar phrases from sense2vec, and the suggestions will be adjusted as you annotate and accept similar phrases. For each seed term, the best matching sense according to the sense2vec vectors will be used.


prodigy
sense2vec.teach
video_game_yesno
/path/to/s2v_reddit_2019_lg
--seeds "mass effect,knights of the old republic,halo 3"
--resume

Suggestions from Sense2Vec

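You can also query the downloaded vectors directly from Python with the sense2vec library, which is handy for sanity-checking seed terms. A minimal sketch; the path and the query phrase are placeholders.

# A minimal sketch of querying sense2vec vectors directly.
# The path and the query phrase are placeholders.
from sense2vec import Sense2Vec

s2v = Sense2Vec().from_disk("/path/to/s2v_reddit_2019_lg")

query = s2v.get_best_sense("mass effect")     # e.g. "mass_effect|NOUN"
if query is not None:
    for phrase, score in s2v.most_similar(query, n=5):
        print(phrase, score)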

After curating the generated examples, you can export the collected phrases as pattern files using the sense2vec.to-patterns recipe. These patterns can be used with spaCy’s EntityRuler or with recipes like ner.manual.


prodigy sense2vec.to-patterns video_game_yesno blank:en VIDEO_GAME --output-file patterns.jsonl

This will generate a patterns.jsonl file locally that has contents that may look like:

{"label": "VIDEO_GAME", "pattern": [{"LOWER": "mass"}, {"LOWER": "effect"}]}
{"label": "VIDEO_GAME", "pattern": [{"LOWER": "knights"}, {"LOWER": "of"}, {"LOWER": "the"}, {"LOWER": "old"}, {"LOWER": "republic"}]}
{"label": "VIDEO_GAME", "pattern": [{"LOWER": "halo"}, {"LOWER": "3"}]}
{"label": "VIDEO_GAME", "pattern": [{"LOWER": "jade"}, {"LOWER": "empire"}]}

More recipes

Sense2vec also has the sense2vec.eval, sense2vec.eval-most-similar and sense2vec.eval-ab recipes. These may be useful if you want to evaluate a sense2vec model. For more information on those, you can check the README in the GitHub repository.

sense2vec.teach binary

  • Interface: html
  • Saves: annotations to the database
  • Use case: curate terminology phrases via sense2vec

Bootstrap a terminology list using sense2vec.


prodigy sense2vec.teach dataset vectors_path --seeds --threshold --n-similar --batch-size --case-sensitive --resume

Argument | Type | Description | Default
dataset | positional | Dataset to save annotations to.
vectors_path | positional | Path to pretrained sense2vec vectors.
--seeds, -s | option | One or more comma-separated seed phrases.
--threshold, -t | option | Similarity threshold. | 0.85
--n-similar, -n | option | Number of similar items to get at once. | 100
--batch-size, -b | option | Batch size for submitting annotations. | 5
--case-sensitive, -CS | option | Show the same terms with different casing. | False
--resume, -R | flag | Resume from an existing phrases dataset. | False

sense2vec.to-patterns command

  • Interface: terminal only
  • Use case: generate pattern files

Convert a dataset of phrases collected with sense2vec.teach to token-based match patterns.


prodigy sense2vec.to-patterns dataset spacy_model label --output-file --case-sensitive --dry

Argument | Type | Description | Default
dataset | positional | Phrase dataset to convert.
spacy_model | positional | spaCy model for tokenization.
label | positional | Label to apply to all patterns.
--output-file, -o | option | Optional output file. Defaults to stdout.
--case-sensitive, -CS | flag | Make patterns case-sensitive. | False
--dry, -D | flag | Perform a dry run and don’t output anything. | False

🔎 Prodigy-evaluate

This Prodigy plug-in allows you to evaluate your spaCy pipeline overall or on a per-example basis. To use these recipes, you’ll first need to install the plugin.

Install prodigy-evaluate
pip install "prodigy-evaluate @ git+https://github.com/explosion/prodigy-evaluate"

Once installed, you can make use of the two main recipes in this plugin: evaluate.evaluate and evaluate.evaluate-example.

evaluate.evaluate command

  • Interface: evaluate
  • Use case: evaluate a spaCy pipeline on one or more datasets

This recipe allows you to evaluate a spaCy pipeline on one or more datasets for different components. Per-component datasets are passed in the same way as for the train recipe, except that all datasets are treated as evaluation sets.

The --label-stats flag lets you investigate per-label scores like precision, recall and F1 for the NER and textcat components. The --confusion-matrix flag will output a confusion matrix for the NER and textcat components. If you’d like to customize how the confusion matrix is rendered, you can save an array of the confusion matrix by passing an output path via the --cf-path argument and use it with your favourite data visualization library. Please note that a separate inference run is used to obtain the confusion matrix, and since results are not deterministic, there may be slight variations between the evaluation and confusion matrix results.
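
If you do export the array, rendering it takes only a few lines with matplotlib. The sketch below assumes you have already loaded the confusion matrix into a NumPy array and know the label order; the values and label names are placeholders, and the loading step depends on how you saved the file via --cf-path.

# A minimal sketch for rendering a confusion matrix array with matplotlib.
# The matrix values and label names below are placeholders.
import numpy as np
import matplotlib.pyplot as plt

labels = ["SKILL", "EXPERIENCE", "O"]
matrix = np.array([[120, 3, 7],
                   [2, 85, 5],
                   [9, 4, 900]])

fig, ax = plt.subplots()
im = ax.imshow(matrix)
ax.set_xticks(range(len(labels)), labels=labels)
ax.set_yticks(range(len(labels)), labels=labels)
ax.set_xlabel("Predicted")
ax.set_ylabel("True")
fig.colorbar(im)
plt.show()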

Example evaluate.evaluate output

prodigy evaluate.evaluate my_custom_ner_model --ner ner_dataset --label-stats

Using CPU

================================== Results ==================================

TOK      100.00
NER P     92.80
NER R     99.58
NER F     96.07
SPEED     26868

=============================== NER (per type) ===============================

                 P        R       F
SKILL        92.53    99.55   95.91
EXPERIENCE   96.88   100.00   98.41
Argument | Type | Description | Default
model | str | Name of the spaCy pipeline to evaluate.
--ner | str | One or more (comma-separated) datasets for the named entity recognizer. | None
--textcat | str | One or more (comma-separated) datasets for the text classifier (exclusive categories). | None
--textcat-multilabel | str | One or more (comma-separated) datasets for the text classifier (non-exclusive categories). | None
--senter | str | One or more (comma-separated) datasets for the sentence recognizer. | None
--parser | str | One or more (comma-separated) datasets for the dependency parser. | None
--tagger | str | One or more (comma-separated) datasets for the part-of-speech tagger. | None
--spancat | str | One or more (comma-separated) datasets for the span categorizer. | None
--coref | str | One or more (comma-separated) datasets for the coreference model. Requires spacy-experimental. | None
--label-stats, -LS | bool | Compute per-label statistics for NER and textcat components. | False
--gpu_id | int | ID of the GPU to use. | -1
--verbose | bool | Print detailed information about the evaluation. | False
--confusion-matrix, -CF | bool | Compute confusion matrix for NER, textcat and textcat-multilabel components. | False
--cf-path, -CP | str | Local path to save the confusion matrix to. Available for NER, textcat and textcat-multilabel components. | None
--spans-key | str | Key to use for spans in the evaluation data. | sc

evaluate.evaluate-example command

  • Interface: evaluate-example
  • Use case: evaluate a spaCy pipeline on one or more datasets on a per-example basis.

Evaluate a spaCy pipeline on one or more datasets for different components on a per-example basis. Datasets are provided in the same per-component format as the prodigy evaluate command, e.g. --ner my_eval_dataset_1,my_eval_dataset_2. This command will run an evaluation on each example individually and then sort by the desired --metric argument.

This is helpful for debugging and for understanding the hardest or easiest examples for your model. The example below shows how to evaluate a model on a dataset on a per-example basis and sort by the lowest NER F1 score.

If you would like to save the top examples sorted by your metric, you can use the --output-path argument to save the examples in .jsonl format to a file. If you’re evaluating an NER, spancat or textcat pipeline, this .jsonl file can then be used as input to Prodigy’s correct (ner.correct, spans.correct, textcat.correct) or model-annotate (ner.model-annotate, spans.model-annotate, textcat.model-annotate) workflows to quickly inspect your model’s predictions on the hardest examples.

Example evaluate.evaluate-example output

prodigy evaluate.evaluate-example my_custom_ner_model --ner ner_dataset --metric ents_f --n-results 3

Using CPU

============================== Scored Examples ==============================

Example             ents_f
-----------------   ------
I live in london.   0.0
My name is Freya.   0.0
Where is Antonia?   0.0
Argument | Type | Description | Default
model | str | Name of the spaCy pipeline to evaluate.
--ner | str | One or more (comma-separated) datasets for the named entity recognizer. | None
--textcat | str | One or more (comma-separated) datasets for the text classifier (exclusive categories). | None
--textcat-multilabel | str | One or more (comma-separated) datasets for the text classifier (non-exclusive categories). | None
--senter | str | One or more (comma-separated) datasets for the sentence recognizer. | None
--parser | str | One or more (comma-separated) datasets for the dependency parser. | None
--tagger | str | One or more (comma-separated) datasets for the part-of-speech tagger. | None
--spancat | str | One or more (comma-separated) datasets for the span categorizer. | None
--coref | str | One or more (comma-separated) datasets for the coreference model. Requires spacy-experimental. | None
--metric | str | The metric to sort the examples by. The following metrics are supported: token_acc, tag_acc, pos_acc, morph_acc, lemma_acc, dep_uas, dep_las, ents_p, ents_r, ents_f, cats_score, sents_p, sents_r, sents_f, spans_sc_p, spans_sc_r, spans_sc_f, speed. Please choose a metric most appropriate to your model. | None
--n-results, -NR | int | Number of top examples to display. | 10
--gpu_id | int | ID of the GPU to use. | -1
--verbose | bool | Print detailed information about the evaluation. | False
--output-path, -OP | str | Path to a .jsonl file to save the scored examples to. | None

evaluate.nervaluate command

  • Interface: nervaluate
  • Use case: evaluate a spaCy NER component using full named-entity evaluation metrics based on SemEval '13.

Evaluate a spaCy NER component using full named-entity evaluation metrics based on SemEval ’13. Datasets are provided in the same per-component format as the prodigy evaluate command, e.g. --ner my_eval_dataset_1,my_eval_dataset_2.

This command leverages the nervaluate Python library to "go beyond a simple token/tag based schema, and consider different scenarios based on whether all the tokens that belong to a named entity were classified or not, and also whether the correct entity type was assigned."

This is helpful if you are interested in partial matches as part of your NER evaluation use-case. If you are interested in per-label evaluation metrics, you can pass the --per-label flag to the command.


prodigy evaluate.nervaluate my_custom_ner_model --ner ner_dataset --per-label

Argument | Type | Description | Default
model | str | Name of the spaCy pipeline to evaluate. Must have a trained NER model to evaluate.
--ner | str | One or more (comma-separated) datasets for the named entity recognizer. Use the eval: prefix for evaluation sets. | None
--gpu_id | int | ID of the GPU to use. | -1
--verbose | bool | Print detailed information about the evaluation. | False
--per-label | bool | Print per-label NER metrics to the terminal. | False