Components and Functions

Top-level functions

Prodigy provides the following top-level utilities for writing your own scripts and recipes. To use them, import the prodigy module at the top of your file.

prodigy.serve function

Serve a Prodigy recipe and start the web app from Python. Does the same as the prodigy command on the command line.

prodigy.serve("ner.manual ner_news_headlines en_core_web_sm ./news_headlines.jsonl --label PERSON,ORG", port=9000)

| Argument | Type | Description |
| --- | --- | --- |
| command | str | The full recipe command without “prodigy”. See the recipe documentation for examples. |
| *args | - | Deprecated: Recipe-specific arguments, in the same order as the recipe function arguments. Only available for backwards-compatibility. |
| **config | - | Additional config parameters to overwrite the project-specific, global and recipe config. |

prodigy.recipe decorator

Decorator that transforms a recipe function into a Prodigy recipe. The decorated function needs to return a dictionary of recipe components or a Controller. The decorator’s first argument is the recipe name, followed by a variable number of argument annotations, mapping to the arguments of the decorated function. This lets you execute the recipe with Prodigy.

import prodigy
from prodigy.components.loaders import JSONL

@prodigy.recipe(
    "example",
    dataset=("Dataset to save annotations to", "positional", None, str),
    source=("Source data to load", "positional", None, str),
    view_id=("Annotation interface to use", "option", "v", str)
)
def example(dataset, source, view_id="text"):
    stream = JSONL(source)  # load the source file as a stream of tasks
    return {
        "dataset": dataset,
        "view_id": view_id,
        "stream": stream
    }

| Argument | Type | Description |
| --- | --- | --- |
| name | str | Unique recipe name. Used to register the recipe and call it from the command line or via prodigy.serve. |
| **annotations | - | Argument annotations in Plac style, i.e. argument name mapped to a tuple of description, style, shortcut and type. |
| RETURNS | callable | Recipe function. |

prodigy.get_recipe function

Get a recipe for a given name.

| Argument | Type | Description |
| --- | --- | --- |
| name | str | The recipe name. |
| path | str / Path | Optional path to recipe file. |
| RETURNS | callable | The recipe function. |

prodigy.set_recipe function

Register a recipe function with a name. Also adds aliases with - and _ swapped. When you use the @recipe decorator, the recipe will be set automatically.

| Argument | Type | Description |
| --- | --- | --- |
| name | str | The recipe name. |
| func | callable | The recipe function. |
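
To make the registration behavior more concrete, here is a minimal sketch of a recipe registry that stores a function under its name and under the alias with - and _ swapped. The names register_recipe, lookup_recipe and recipe_sketch are hypothetical and only illustrate the idea. This is not Prodigy's internal implementation.

```python
# Hypothetical registry for illustration only, not Prodigy's internals.
_RECIPES = {}


def register_recipe(name, func):
    """Store func under name and under the variants with - and _ swapped."""
    aliases = {name, name.replace("-", "_"), name.replace("_", "-")}
    for alias in aliases:
        _RECIPES[alias] = func


def lookup_recipe(name):
    """Return the registered recipe function, or None if it is unknown."""
    return _RECIPES.get(name)


def recipe_sketch(name):
    """Decorator that registers the decorated function automatically."""
    def wrapper(func):
        register_recipe(name, func)
        return func
    return wrapper


@recipe_sketch("custom-recipe")
def custom_recipe(dataset):
    # A recipe returns a dictionary of components.
    return {"dataset": dataset, "view_id": "text", "stream": []}
```

With this sketch, looking up either "custom-recipe" or "custom_recipe" returns the same function.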

prodigy.get_config function

Read and combine the user configuration from the available prodigy.json config files. Helpful in recipes to read off database settings, API keys or entirely custom config parameters.

config = prodigy.get_config()
theme = config["theme"]

| Argument | Type | Description |
| --- | --- | --- |
| RETURNS | dict | The user configuration. |

prodigy.get_stream function

Get an iterable stream of tasks. This function is also used in recipes that allow streaming data from a source or from standard input. If a loader ID is set, Prodigy will look for a matching loader, and try to load the source. If the source is a file path or Path-like object, Prodigy will try to guess the loader from the file extension (defaults to "jsonl"). If the source is set to "-", Prodigy will read from standard input, letting you pipe data forward on the command line.

stream = prodigy.get_stream("/tmp/data.jsonl")
stream = prodigy.get_stream("/tmp/myfile.tmp", loader="txt")
stream = prodigy.get_stream("/tmp/data.json", input_key="text", skip_invalid=False)

| Argument | Type | Description |
| --- | --- | --- |
| source | str | A text source, e.g. a file path or an API query. Defaults to sys.stdin if not set. |
| api | str | Deprecated: ID of an API to use. |
| loader | str | ID of a loader, e.g. 'csv'. |
| rehash | bool | Rehash the stream and assign new input and task IDs. |
| dedup | bool | Deduplicate the stream and filter out duplicate input tasks. |
| input_key | str | Optional input key relevant to this task, to filter out examples with invalid keys. For example, 'text' for NER and text classification projects and 'image' for image projects. |
| skip_invalid | bool | If an input key is set, skip invalid tasks. Defaults to True. If set to False, a ValueError will be raised. |
| RETURNS | iterable | An iterable stream of tasks produced by the loader. |

If the source is an iterable stream itself – e.g. a generator or a list – get_stream will simply return the stream. This is useful if you’re calling an existing recipe function from Python – for example, in your custom recipe – and want to use an already loaded stream.

prodigy.get_loader function

Get a loader ID from an ID or guess a loader ID based on a file’s extension.

loader = prodigy.get_loader("jsonl")  # JSONL
loader = prodigy.get_loader(file_path="/tmp/data.json")  # JSON

| Argument | Type | Description |
| --- | --- | --- |
| loader_id | str | ID of a loader, e.g. 'jsonl' or 'csv'. |
| file_path | str / Path | Optional file path to allow guessing loader from file extension. |
| RETURNS | callable | A loader. |
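
The guessing logic described above can be sketched in a few lines. The helper guess_loader_id and the KNOWN_LOADERS set are assumptions made for this illustration. The loaders Prodigy actually ships with, and how it resolves them, may differ.

```python
from pathlib import Path

# Assumed set of loader IDs for this sketch; the real list may differ.
KNOWN_LOADERS = {"jsonl", "json", "csv", "txt"}


def guess_loader_id(loader_id=None, file_path=None, default="jsonl"):
    """Prefer an explicit loader ID, then the file extension, then the default."""
    if loader_id is not None:
        return loader_id
    if file_path is not None:
        extension = Path(file_path).suffix.lower().lstrip(".")
        if extension in KNOWN_LOADERS:
            return extension
    return default
```

An unknown extension such as .tmp falls back to the default, which is why the get_stream example above passes loader="txt" explicitly for /tmp/myfile.tmp.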

prodigy.set_hashes function

Set hash IDs on a task based on the task properties. This is usually done by Prodigy automatically as the stream is processed by the controller. However, in some cases, you may want to take care of the hashing yourself, to implement custom filtering.

Input hashes are based on the input data, like the text, image or HTML. Task hashes are based on the input hash, plus optional features you’re annotating, like the label or spans. This allows Prodigy to distinguish between two tasks that collect annotations on the same input, but for different features – for example, two different entities in the same text.

from prodigy import set_hashes

stream = (set_hashes(eg) for eg in stream)
stream = (set_hashes(eg, input_keys=("text", "custom_text")) for eg in stream)

| Argument | Type | Description |
| --- | --- | --- |
| task | dict | The annotation task to set hashes on. |
| input_keys | tuple | Dictionary keys to consider for the input hash. Defaults to ("text", "image", "html", "input"). |
| task_keys | tuple | Dictionary keys to consider for the task hash. Defaults to ("spans", "label", "options"). |
| ignore | tuple | Dictionary keys (including nested) to ignore when creating hashes. Defaults to ("score", "rank", "model", "source", "_view_id", "_session_id"). |
| overwrite | bool | Overwrite existing hashes in the task. Defaults to False. |
| RETURNS | dict | The task with hashes set. |
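
The relationship between the two hashes can be illustrated with a small standalone sketch. The function sketch_hashes and its use of SHA-256 are assumptions for this example, and the hash values will not match the ones Prodigy produces. It only shows why two tasks on the same text share an input hash but get different task hashes.

```python
import hashlib
import json


def _stable_hash(value):
    """Deterministic integer hash of a JSON-serializable value."""
    data = json.dumps(value, sort_keys=True).encode("utf8")
    return int(hashlib.sha256(data).hexdigest()[:8], 16)


def sketch_hashes(task,
                  input_keys=("text", "image", "html", "input"),
                  task_keys=("spans", "label", "options")):
    """Return a copy of the task with illustrative input and task hashes."""
    input_data = {key: task[key] for key in input_keys if key in task}
    input_hash = _stable_hash(input_data)
    # The task hash builds on the input hash plus the annotated features.
    task_data = {key: task[key] for key in task_keys if key in task}
    task_hash = _stable_hash([input_hash, task_data])
    return {**task, "_input_hash": input_hash, "_task_hash": task_hash}


a = sketch_hashes({"text": "Apple is great", "label": "ORG"})
b = sketch_hashes({"text": "Apple is great", "label": "PRODUCT"})
```

Here a and b have the same "_input_hash" because the text is identical, but different "_task_hash" values because the labels differ.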

prodigy.get_schema function

Get the JSON schema for a given view_id. This is the schema that Prodigy will validate against when you run a recipe. The JSON schemas describe the properties and types needed in order for an interface to render your task. The very first batch of the stream is validated before the server starts. After that, tasks in the stream are validated before they’re sent out to the web app. To disable validation, set "validate": false in your Prodigy config.

Validation is powered by pydantic, so if you want to implement your own validation of Prodigy tasks and annotations, you can set json=False and receive the pydantic model.

schema = prodigy.get_schema("text")

Example output:

{
  "title": "TextTask",
  "type": "object",
  "properties": {
    "meta": { "title": "Meta", "default": {}, "type": "object" },
    "_input_hash": { "title": " Input Hash", "type": "integer" },
    "_task_hash": { "title": " Task Hash", "type": "integer" },
    "_view_id": {
      "title": " View Id",
      "enum": ["text", "classification", "ner", "ner_manual", "pos", "pos_manual", "image", "image_manual", "html", "choice", "diff", "compare", "review", "text_input", "blocks"]
    },
    "_session_id": { "title": " Session Id", "type": "string" },
    "text": { "title": "Text", "type": "string" }
  },
  "required": ["_input_hash", "_task_hash", "text"]
}

| Argument | Type | Description |
| --- | --- | --- |
| view_id | str | One of the available annotation interface IDs, e.g. ner. |
| json | bool | Return the schema as a JSON schema. Defaults to True. If False, the pydantic model is returned. |
| RETURNS | dict | The expected JSON schema for a task rendered by the interface. |

prodigy.log function

Add an entry to Prodigy’s log. For more details, see the docs on debugging and logging.

prodigy.log("RECIPE: Starting recipe custom-recipe", locals())

| Argument | Type | Description |
| --- | --- | --- |
| message | str | The basic message to display, e.g. “RECIPE: Starting recipe ner.teach”. |
| details | - | Optional details to log only in verbose mode. |

Preprocessors

Preprocessors convert and modify a stream of examples and their properties, or pre-process documents before annotation. They’re available via prodigy.components.preprocess.

split_sentences function

Use spaCy’s sentence boundary detector to split example text into sentences, convert the existing "spans" and their start and end positions accordingly and yield one example per sentence. Setting a min_length will only split texts longer than a certain number of characters. This lets you use your own logic, while still preventing very long examples from throwing off the beam search algorithm and affecting performance. If min_length is set to None, Prodigy will check the config for a 'split_sents_threshold' setting.

| Argument | Type | Description |
| --- | --- | --- |
| nlp | spacy.language.Language | A spaCy nlp object with a sentence boundary detector (a custom implementation or any model that supports dependency parsing). |
| stream | iterable | The stream of examples. |
| text_key | str | Task key containing the text. Defaults to 'text'. |
| batch_size | int | Batch size to use when processing the examples with nlp.pipe. Defaults to 32. |
| min_length | int | Minimum character length of text to be split. If None, Prodigy will check the config for a 'split_sents_threshold' setting. If False, all texts are split, if possible. Defaults to False. |
| YIELDS | dict | The individual sentences as annotation examples. |

from prodigy.components.preprocess import split_sentences
import spacy

nlp = spacy.load("en_core_web_sm")
stream = [{"text": "spaCy is a library. It is written in Python."}]
stream = split_sentences(nlp, stream, min_length=30)
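
The span conversion mentioned above comes down to shifting character offsets so that they are relative to each sentence. The sketch below shows that step without spaCy, assuming the sentence boundaries are already known as (start, end) character offsets. The helper split_with_spans is hypothetical, and spans that cross a sentence boundary are simply dropped here, which may not match what Prodigy does.

```python
def split_with_spans(text, spans, sentence_offsets):
    """Yield one example per sentence with span offsets made sentence-relative."""
    for sent_start, sent_end in sentence_offsets:
        sent_spans = [
            {**span,
             "start": span["start"] - sent_start,
             "end": span["end"] - sent_start}
            for span in spans
            # Keep only spans that lie fully inside this sentence.
            if span["start"] >= sent_start and span["end"] <= sent_end
        ]
        yield {"text": text[sent_start:sent_end], "spans": sent_spans}


text = "spaCy is a library. It is written in Python."
spans = [{"start": 37, "end": 43, "label": "LANGUAGE"}]
# Assumed sentence boundaries; in practice these come from the nlp object.
sentence_offsets = [(0, 19), (20, 44)]
sentences = list(split_with_spans(text, spans, sentence_offsets))
```

The second example now has the text "It is written in Python." and a span with start 17 and end 23, which still covers "Python".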

split_spans function

Split a stream with multiple spans per example so that there’s one span per task.

| Argument | Type | Description |
| --- | --- | --- |
| stream | iterable | The stream of examples. |
| labels | list | Only create examples for entities of those labels. If None, all entities will be used. |
| YIELDS | dict | The annotation examples. |
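
As a rough illustration of what splitting spans means, the following sketch yields one copy of each example per span. The function one_span_per_task is a hypothetical stand-in, and it drops examples without spans, which is an assumption and not documented behavior.

```python
import copy


def one_span_per_task(stream, labels=None):
    """Yield a separate task for every span, optionally filtered by label."""
    for eg in stream:
        for span in eg.get("spans", []):
            if labels is not None and span.get("label") not in labels:
                continue
            # Copy the example so the tasks don't share mutable state.
            task = copy.deepcopy(eg)
            task["spans"] = [copy.deepcopy(span)]
            yield task


example = {
    "text": "Apple hired Tim",
    "spans": [
        {"start": 0, "end": 5, "label": "ORG"},
        {"start": 12, "end": 15, "label": "PERSON"},
    ],
}
all_tasks = list(one_span_per_task([example]))
org_tasks = list(one_span_per_task([example], labels=["ORG"]))
```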

add_tokens function

Tokenize the incoming text and add a 'tokens' key to each example in the stream. If the example has spans, each span is updated with a "token_start" and "token_end" key. This pre-processor is mostly used in manual NER annotation to allow entity annotation based on token boundaries.

| Argument | Type | Description |
| --- | --- | --- |
| nlp | spacy.language.Language | A spaCy nlp object with a tokenizer. |
| stream | iterable | The stream of examples. |
| skip | bool | Don’t raise ValueError for mismatched tokenization and skip example instead. Defaults to False. |
| YIELDS | dict | The annotation examples with added tokens. |

from prodigy.components.preprocess import add_tokens
import spacy

nlp = spacy.load("en_core_web_sm")
stream = [{"text": "Hello world"}, {"text": "Another text"}]
stream = add_tokens(nlp, stream, skip=True)

fetch_images function

Replace all image paths and URLs in a stream with base64 data URIs. The skip keyword argument lets you specify whether to skip invalid images that can’t be converted (for example, because the path doesn’t exist, or the URL can’t be fetched). If set to False, Prodigy will raise a ValueError if it encounters invalid images.

| Argument | Type | Description |
| --- | --- | --- |
| stream | iterable | The stream of examples. |
| image_key | str | The task key containing the image. Defaults to 'image'. |
| skip | bool | Skip and don’t include tasks with images that can’t be fetched. Defaults to False, which will raise a ValueError. |
| YIELDS | dict | The annotation examples with converted images. |

from prodigy.components.preprocess import fetch_images

stream = [{"image": "/path/to/image.jpg"}, {"image": "https://example.com/image.jpg"}]
stream = fetch_images(stream, skip=True)

Sorters

Sorters are helper functions that wrap a stream of (score, example) tuples (usually returned by a model), re-sort it and yield examples in the new order. They’re available via prodigy.components.sorters.

All sorters follow the same API and take two arguments:

| Argument | Type | Description |
| --- | --- | --- |
| stream | iterable | The stream to sort. |
| bias | float | Bias towards high or low scoring. |
| YIELDS | dict | Annotation examples. |

The following sorters are available and can be imported from prodigy.components.sorters:

| Sorter | Description |
| --- | --- |
| prefer_uncertain | Re-sort the stream to prefer uncertain scores. |
| prefer_high_scores | Re-sort the stream to prefer high scores. |
| prefer_low_scores | Re-sort the stream to prefer low scores. |

The prefer_uncertain function also supports an additional keyword argument, algorithm, which lets you specify either "probability" or "ema" (exponential moving average).

from prodigy.components.sorters import prefer_uncertain

def score_stream(stream):
    for example in stream:
        score = model.predict(example["text"])
        yield (score, example)

stream = prefer_uncertain(score_stream(stream))
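
To show the general shape of a sorter, here is a deliberately simplified stand-in that consumes (score, example) tuples and yields bare examples, most uncertain first within each batch. The function prefer_uncertain_sketch and its batch_size argument are assumptions for this illustration. It does not reproduce the "probability" or "ema" algorithms and keeps every example.

```python
from itertools import islice


def prefer_uncertain_sketch(scored_stream, batch_size=4):
    """Yield examples so that scores closest to 0.5 come first in each batch."""
    iterator = iter(scored_stream)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            break
        # Distance from 0.5 measures how uncertain the score is.
        batch.sort(key=lambda item: abs(item[0] - 0.5))
        for _score, example in batch:
            yield example


scored = [
    (0.95, {"text": "a"}),
    (0.5, {"text": "b"}),
    (0.1, {"text": "c"}),
    (0.45, {"text": "d"}),
]
ordered = [eg["text"] for eg in prefer_uncertain_sketch(scored)]
```

Processing the stream in batches keeps the sketch usable with generators of unknown length, at the cost of only sorting locally.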

Filters

Filters are helper functions used across recipes that wrap a stream and filter it based on certain conditions. They’re available via prodigy.components.filters.

filter_empty function

Remove examples with a missing, empty or otherwise falsy value from a stream. This filter can also be enabled by specifying an input_key argument on the get_stream helper function.

from prodigy.components.filters import filter_empty
stream = [{"text": "test"}, {"image": "test.jpg"}, {"text": ""}]
stream = filter_empty(stream, key="text")
# [{'text': 'test'}]

| Argument | Type | Description |
| --- | --- | --- |
| stream | iterable | The stream of examples. |
| key | str | The key in the annotation task to check, e.g. 'text'. |
| skip | bool | Skip filtered examples. If set to False, a ValueError is raised. Defaults to True. |
| YIELDS | dict | Filtered annotation examples. |

filter_duplicates function

Filter duplicate examples from a stream. You can choose to filter by task, which includes the input data as well as the added spans, labels etc., or by input data only. This filter can also be enabled by setting dedup=True on the get_stream helper function.

from prodigy.components.filters import filter_duplicates

stream = [{"text": "foo", "label": "bar"}, {"text": "foo", "label": "bar"}, {"text": "foo"}]
stream = filter_duplicates(stream, by_input=False, by_task=True)
# [{'text': 'foo', 'label': 'bar'}, {'text': 'foo'}]
stream = filter_duplicates(stream, by_input=True, by_task=True)
# [{'text': 'foo', 'label': 'bar'}]

| Argument | Type | Description |
| --- | --- | --- |
| stream | iterable | The stream of examples. |
| by_input | bool | Filter out duplicates of the same input data. Defaults to False. |
| by_task | bool | Filter out duplicates of the same task data. Defaults to True. |
| YIELDS | dict | Filtered annotation examples. |

filter_inputs function

Filter out tasks based on a list of input hashes, referring to the input data. Useful for filtering out already annotated tasks. To get the input hashes of one or more datasets, you can use db.get_input_hashes(*dataset_names).

from prodigy.components.filters import filter_inputs
stream = [{"_input_hash": 5, "text": "foo"}, {"_input_hash": 9, "text": "bar"}]
stream = filter_inputs(stream, [1, 2, 3, 4, 5])
# [{'_input_hash': 9, 'text': 'bar'}]

| Argument | Type | Description |
| --- | --- | --- |
| stream | iterable | The stream of examples. |
| input_ids | list | The input IDs to filter out. |
| YIELDS | dict | Filtered annotation examples. |

filter_tasks function

Filter out tasks based on a list of task hashes, referring to the input data plus the added spans, label etc. Useful for filtering out already annotated tasks and used by Prodigy’s built-in --exclude logic. To get the task hashes of one or more datasets, you can use db.get_task_hashes(*dataset_names).

from prodigy.components.filters import filter_tasks
stream = [{"_task_hash": 5, "text": "foo"}, {"_task_hash": 9, "text": "bar"}]
stream = filter_tasks(stream, [1, 2, 3, 4, 5])
# [{'_task_hash': 9, 'text': 'bar'}]

| Argument | Type | Description |
| --- | --- | --- |
| stream | iterable | The stream of examples. |
| task_ids | list | The task IDs to filter out. |
| YIELDS | dict | Filtered annotation examples. |
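
Both hash-based filters follow the same pattern, which can be sketched as a small generator. The function filter_by_hash is a hypothetical stand-in written for this example. It only assumes that each task carries its hash under the given key.

```python
def filter_by_hash(stream, hashes, key="_task_hash"):
    """Yield only the tasks whose hash under key is not in hashes."""
    excluded = set(hashes)  # set membership checks are cheap
    for eg in stream:
        if eg.get(key) not in excluded:
            yield eg


tasks = [{"_task_hash": 5, "text": "foo"}, {"_task_hash": 9, "text": "bar"}]
remaining = list(filter_by_hash(tasks, [1, 2, 3, 4, 5]))

inputs = [{"_input_hash": 5, "text": "foo"}, {"_input_hash": 9, "text": "bar"}]
remaining_inputs = list(filter_by_hash(inputs, [1, 2, 3, 4, 5], key="_input_hash"))
```

Because the filter is a generator, it works on streams of any size without loading them into memory.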

PatternMatcher

The PatternMatcher wraps spaCy’s Matcher and PhraseMatcher and will match both token-based and exact string match patterns on a stream of incoming examples. It’s typically used in recipes like ner.teach or textcat.teach to use match pattern files to add suggestions. The PatternMatcher can be updated with annotations and will score the individual patterns. Patterns that are accepted more often will be scored higher than patterns that are mostly rejected. Combined with a sorter, this lets you focus on the most uncertain or the highest/lowest scoring patterns. The pattern matcher is available via prodigy.models.matcher.

PatternMatcher.__init__ method

Initialize a pattern matcher.

from prodigy.models.matcher import PatternMatcher
import spacy

nlp = spacy.blank("en")
matcher = PatternMatcher(nlp)

| Argument | Type | Description |
| --- | --- | --- |
| nlp | spacy.language.Language | The spaCy language class to use for the matchers and to process text. |
| label_span | bool | Whether to add a "label" to the matched span that’s highlighted. For example, if you use the matcher for NER, you typically want to add a label to the span but not the whole task. |
| label_task | bool | Whether to add a "label" to the top-level task if a match for that label was found. For example, if you use the matcher for text classification, you typically want to add a label to the whole task. |
| combine_matches | bool | New in 1.9. Whether to show all matches in one task. If False, the matcher will output one task for each match and duplicate tasks if necessary. |
| filter_labels | list | New in 1.9. Only add patterns if their labels are part of this list. If None (default), all labels are used. Can be set in recipes to make sure the matcher is only producing matches related to the specified labels, even if the file contains patterns for other labels. |
| task_hash_keys | tuple | New in 1.9. Optional key names to consider for setting task hashes to prevent duplicates. For instance, only hashing by "label" would mean that "spans" added by the pattern matcher wouldn’t be considered for the hashes. |
| prior_correct | float | Initial value of a pattern’s accepted count. Defaults to 2.0. Modifying this value changes how much of an impact a single accepted pattern has on the overall confidence. |
| prior_incorrect | float | Initial value of a pattern’s rejected count. Defaults to 2.0. Modifying this value changes how much of an impact a single rejected pattern has on the overall confidence. |

The PatternMatcher assigns a confidence score to examples based on how many examples matching that pattern have been accepted and rejected. The calculation for this is:

score = (n_accept + prior_correct) / ((n_accept + prior_correct) + (n_reject + prior_incorrect))

Let’s say you’re working on a problem with imbalanced classes and you expect that only about 5% of your examples will be accepted in your data. If an example matches a pattern, there’s definitely a higher chance it will be accepted – but the chances still aren’t that high. Let’s say matching examples have about a 20% chance of being accepted. If the matching score was something like n_accept / (n_accept + n_reject), then if you had a pattern that you’d accepted once and rejected once, the scores would come out that examples matching that pattern had a 50% chance of being accepted. But you know that’s not actually likely – it’s probably not a great pattern that’s a huge indicator of acceptance. It’s just that you haven’t seen many examples of it yet. You have a prior belief about the distribution of positive and negative examples, and you haven’t seen enough evidence from this pattern to really alter your beliefs.

The prior_correct and prior_incorrect settings let you represent how many examples you expect to be accepted, and also how confident you are in that belief. If you want each example of a pattern match to only change your prior probability a little, you can set high absolute values on the priors – for instance, setting prior_correct to 10.0 and prior_incorrect to 90.0 means the first example you see will only change the score by about 1%. If you set them to 1.0 and 9.0, the score would change by about 10% instead.
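
The effect of the priors is easy to check numerically. The sketch below assumes the score is the smoothed acceptance ratio, i.e. the accepted count plus prior_correct divided by the sum of both prior-adjusted counts, which is the reading consistent with the percentages quoted above. The function pattern_score is written for this example only.

```python
def pattern_score(n_accept, n_reject, prior_correct=2.0, prior_incorrect=2.0):
    """Smoothed acceptance ratio with the priors added to the counts."""
    accepted = n_accept + prior_correct
    rejected = n_reject + prior_incorrect
    return accepted / (accepted + rejected)


# Strong priors: one accepted example barely moves the score.
strong_before = pattern_score(0, 0, prior_correct=10.0, prior_incorrect=90.0)
strong_after = pattern_score(1, 0, prior_correct=10.0, prior_incorrect=90.0)

# Weak priors with the same 10% expectation: one example moves it much more.
weak_before = pattern_score(0, 0, prior_correct=1.0, prior_incorrect=9.0)
weak_after = pattern_score(1, 0, prior_correct=1.0, prior_incorrect=9.0)
```

With priors of 10.0 and 90.0 the score moves from 0.1 to 11/101, a change of just under one percentage point. With priors of 1.0 and 9.0 it moves from 0.1 to 2/11, a change of roughly eight percentage points.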

PatternMatcher.__call__ method

Process a stream of examples and add pattern matches. Will add an entry to the example dict’s "spans" for each match. Each span includes a "pattern" key, mapped to the ID of the pattern that was used to produce this match. This is also later used in PatternMatcher.update.

| Argument | Type | Description |
| --- | --- | --- |
| stream | iterable | The stream of examples. |
| YIELDS | tuple | The (score, example) tuples. |

PatternMatcher.from_disk method

Load patterns from a patterns file.

matcher = PatternMatcher(nlp).from_disk("./patterns.jsonl")

| Argument | Type | Description |
| --- | --- | --- |
| path | str / Path | The JSONL file to load the patterns from. |
| RETURNS | PatternMatcher | The updated matcher. |

PatternMatcher.add_patterns method

Add patterns to the pattern matcher.

patterns = [
    {"label": "FRUIT", "pattern": [{"lower": "goji"}, {"lower": "berry"}]},
    {"label": "VEGETABLE", "pattern": "Lamb's lettuce"}
]
matcher.add_patterns(patterns)

| Argument | Type | Description |
| --- | --- | --- |
| patterns | list | The patterns. Expects a list of dictionaries with the keys "pattern" and "label". See the patterns format for details. |

PatternMatcher.has_label method

Whether patterns for a given label have been added to the PatternMatcher.

matcher.add_patterns([{"label": "FRUIT", "pattern": [{"lower": "apple"}]}])
assert matcher.has_label("FRUIT")

| Argument | Type | Description |
| --- | --- | --- |
| label | str | The label to check. |
| RETURNS | bool | Whether patterns have been added for the label. |

PatternMatcher.update method

Update the pattern matcher from annotations and update its scores. Typically called as part of a recipe’s update callback and with answers received from the web app. Expects the examples to have an "answer" key ("accept", "reject" or "ignore") and will use all "spans" that have a "pattern" key, which is the ID of the pattern assigned by PatternMatcher.__call__.

answers = [
    {
      "text": "Hello Apple",
      "spans": [{"start": 0, "end": 11, "label": "ORG", "pattern": 0}],
      "answer": "reject"
    }
]
matcher.update(answers)

| Argument | Type | Description |
| --- | --- | --- |
| examples | list | The annotated examples with accept/reject annotations. |
| RETURNS | int | Always 0 (only for compatibility with other annotation models that return a loss). |

Utility functions

combine_models function

Combine two models and return a predict and update function. Predictions of both models are combined using the toolz.interleave function. Requires both model objects to have a __call__ and an update() method. This helper function is mostly used to combine annotation models with a PatternMatcher to mix pattern matches and model suggestions. For an example, see the docs on custom text classification models.

from prodigy.util import combine_models
from prodigy.models.matcher import PatternMatcher
import spacy

class CustomModel(object):
    # predict_something and update_something are placeholders for your own logic
    def __call__(self, stream):
        yield from predict_something(stream)

    def update(self, answers):
        update_something(answers)

nlp = spacy.blank("en")
# Pass model instances, since both objects need a __call__ and an update method
predict, update = combine_models(CustomModel(), PatternMatcher(nlp))

| Argument | Type | Description |
| --- | --- | --- |
| one | callable | First model. Requires a __call__ and update method. |
| two | callable | Second model. Requires a __call__ and update method. |
| batch_size | int | The batch size to use for predicting the stream. |
| RETURNS | tuple | A (predict, update) tuple of the respective functions. |
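
The idea of interleaving two models can be sketched without any dependencies. Everything below is an assumption made for illustration: interleave is a plain round-robin generator standing in for toolz.interleave, combine_models_sketch is not Prodigy's implementation, and ConstantModel is a toy model that assigns the same score to every example.

```python
from itertools import islice


def interleave(*iterables):
    """Round-robin over the iterables, dropping each one once it is exhausted."""
    iterators = [iter(it) for it in iterables]
    while iterators:
        still_active = []
        for it in iterators:
            try:
                yield next(it)
            except StopIteration:
                continue
            still_active.append(it)
        iterators = still_active


def combine_models_sketch(one, two, batch_size=32):
    """Return predict and update functions that use both models."""
    def predict(stream):
        iterator = iter(stream)
        while True:
            batch = list(islice(iterator, batch_size))
            if not batch:
                break
            # Both models see the same batch; their output alternates.
            yield from interleave(one(batch), two(batch))

    def update(answers):
        one.update(answers)
        two.update(answers)

    return predict, update


class ConstantModel:
    """Toy model that scores every example with a fixed value."""
    def __init__(self, score):
        self.score = score
        self.n_updates = 0

    def __call__(self, stream):
        for eg in stream:
            yield (self.score, eg)

    def update(self, answers):
        self.n_updates += len(answers)


low, high = ConstantModel(0.1), ConstantModel(0.9)
predict, update = combine_models_sketch(low, high)
scored = list(predict([{"text": "a"}, {"text": "b"}]))
update([{"text": "a", "answer": "accept"}])
```

Each example appears once per model in the combined output, so a sorter or duplicate filter is usually applied afterwards.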

b64_uri_to_bytes function

Convert a base64-encoded data URI to bytes. Can be used to convert the inlined base64 "image" value of image tasks to byte strings that can be consumed by image models. See the docs on integrating image models for examples.

from prodigy.util import b64_uri_to_bytes
image = "data:image/jpeg;base64,/9j/4AAQSkZJRgABAQEASABIA..."  # and so on
image_bytes = b64_uri_to_bytes(image)

| Argument | Type | Description |
| --- | --- | --- |
| data_uri | str | The data URI to convert. |
| RETURNS | bytes | The image data. |
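
For readers unfamiliar with data URIs, the conversion in both directions can be sketched with the standard library. The helpers bytes_to_data_uri and data_uri_to_bytes are hypothetical names for this example and are not part of Prodigy.

```python
import base64


def bytes_to_data_uri(data, mimetype="image/jpeg"):
    """Encode raw bytes as a base64 data URI."""
    encoded = base64.b64encode(data).decode("ascii")
    return f"data:{mimetype};base64,{encoded}"


def data_uri_to_bytes(data_uri):
    """Decode the payload of a base64 data URI back into bytes."""
    header, _, payload = data_uri.partition(",")
    if not header.startswith("data:") or not header.endswith(";base64"):
        raise ValueError("Expected a base64-encoded data URI")
    return base64.b64decode(payload)


jpeg_magic = b"\xff\xd8\xff\xe0"  # first bytes of a typical JPEG file
uri = bytes_to_data_uri(jpeg_magic)
restored = data_uri_to_bytes(uri)
```

The encoded JPEG header starts with /9j/, which is why base64 image values of JPEG tasks usually begin with that sequence.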

split_string function

Split a comma-separated string and strip whitespace. A very simple utility that’s mostly used as a converter function in the recipe argument annotations to convert labels passed in from the command line.

from prodigy.util import split_string
assert split_string("PERSON,ORG,PRODUCT") == ["PERSON", "ORG", "PRODUCT"]

| Argument | Type | Description |
| --- | --- | --- |
| text | str | The text to split. |
| RETURNS | list | The split text or empty list if text is falsy. |

get_labels function

Utility function used in recipe argument annotations to handle command-line arguments that can either take a comma-separated list of labels or a file with one label per line. If the string is a valid file path, the file contents are read in line by line. Otherwise, the string is split on commas.

from prodigy.util import get_labels
assert get_labels("PERSON,ORG,PRODUCT") == ["PERSON", "ORG", "PRODUCT"]
assert get_labels("./labels.txt") == ["SOME", "LABELS", "FROM", "FILE"]

| Argument | Type | Description |
| --- | --- | --- |
| labels_data | str | The value passed in from the command line. |
| RETURNS | list | The list of labels read from a file or the string, or empty list if labels_data is falsy. |
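
The file-or-string behavior described above can be sketched as follows. The functions split_string_sketch and get_labels_sketch are stand-ins written for this example. Details such as skipping blank lines in the file are assumptions.

```python
from pathlib import Path


def split_string_sketch(text):
    """Split a comma-separated string and strip whitespace."""
    if not text:
        return []
    return [part.strip() for part in text.split(",")]


def get_labels_sketch(labels_data):
    """Read labels from a file if the value is a file path, else split on commas."""
    if not labels_data:
        return []
    path = Path(labels_data)
    if path.is_file():
        lines = path.read_text(encoding="utf8").splitlines()
        # Assumption: blank lines in the labels file are ignored.
        return [line.strip() for line in lines if line.strip()]
    return split_string_sketch(labels_data)
```

Used as a converter in the recipe argument annotations, a function like this lets the same --label option accept either PERSON,ORG or a path such as ./labels.txt.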

Controller

The controller takes care of putting the individual recipe components together and exposes methods that allow the application to interact with the REST API. This is usually done when you use the @recipe decorator on a function that returns a dictionary of components. However, you can also choose to initialize a Controller yourself and make your recipe return it.

Controller.__init__ method

Initialize the controller.

from prodigy.controller import Controller
controller = Controller(dataset, view_id, stream, update, store,
                        progress, on_load, on_exit, get_session_id,
                        exclude, config)

| Argument | Type | Description |
| --- | --- | --- |
| dataset | str | The ID of the current project. |
| view_id | str | The annotation interface to use. |
| stream | iterable | The stream of annotation tasks. |
| update | callable | The update function called when annotated tasks are received. |
| db | callable | The database ID, component or custom storage function. |
| progress | callable | The progress function that computes the annotation progress. |
| on_load | callable | The on load function that gets called when Prodigy is started. |
| on_exit | callable | The on exit function that gets called when the user exits Prodigy. |
| get_session_id | callable | Function that returns a custom session ID. If not set, a timestamp is used. |
| exclude | list | List of dataset IDs whose annotations to exclude. |
| config | dict | Recipe-specific configuration. |
| RETURNS | Controller | The recipe controller. |

All arguments of the controller are also accessible as attributes, for example controller.store. In addition, the controller exposes the following attributes:

| Attribute | Type | Description |
| --- | --- | --- |
| home | Path | Path to Prodigy home directory. |
| session_id | str | ID of the current session, generated from a timestamp. |
| batch_size | int | The number of tasks to return at once. Taken from config and defaults to 10. |
| queue | generator | The batched-up stream of annotation tasks. |
| total_annotated | int | Number of tasks annotated in the current project. |
| session_annotated | int | Number of tasks annotated in the current session. |
Controller.get_questions method

Get a batch of annotation tasks from the queue.

next_batch = controller.get_questions()

| Argument | Type | Description |
| --- | --- | --- |
| RETURNS | list | Batch of annotation tasks. |

Controller.receive_answers method

Receive a batch of annotated tasks. If available, stores the tasks in the database and calls the update function.

tasks = [
    {"_input_hash": 0, "_task_hash": 0, "text": "x", "answer": "accept"},
    {"_input_hash": 1, "_task_hash": 1, "text": "y", "answer": "reject"}
]
controller.receive_answers(tasks)

| Argument | Type | Description |
| --- | --- | --- |
| answers | list | Annotated tasks. |

Controller.save method

Saves the current project and progress when the user exits the application. If available, calls the store’s save method and the on_exit function.

Controller.progress property

Get the current progress. If a progress function is available, it is called. Otherwise, the controller checks whether the stream has a length and returns the quotient of the session annotations and the stream length. If the stream has no length, it returns None. A progress of None is visualized with an infinity symbol in the web application.

progress = controller.progress

| Argument | Type | Description |
| --- | --- | --- |
| RETURNS | float or None | The current annotation progress. |
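
The fallback order described for the progress property can be sketched as a plain function. The name compute_progress and the zero-argument progress_func are assumptions for this example. The real progress callback may take different arguments.

```python
def compute_progress(session_annotated, stream, progress_func=None):
    """Return a progress value, or None if it cannot be determined."""
    if progress_func is not None:
        # A custom progress function takes priority.
        return progress_func()
    try:
        total = len(stream)
    except TypeError:
        # Generators have no length, so the progress is unknown.
        return None
    if total == 0:
        return None
    return session_annotated / total


from_list = compute_progress(5, [{"text": "x"}] * 20)
from_generator = compute_progress(5, (eg for eg in [{"text": "x"}]))
from_custom = compute_progress(1, [], progress_func=lambda: 0.5)
```

This also shows why streams built from generators typically display the infinity symbol: without a length there is nothing to divide by.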