Custom Recipes

A Prodigy recipe is a Python function that can be run via the command line. Prodigy comes with lots of useful recipes, and it’s very easy to write your own. All you have to do is wrap the @prodigy.recipe decorator around your function, which should return a dictionary of components, specifying the stream of examples, and optionally the web view, database, progress calculator, and callbacks on update, load and exit.

Custom recipes let you integrate machine learning models using any framework of your choice, load in data from different sources, implement your own storage solution or add other hooks and features. No matter how complex your pipeline is – if you can call it from a Python function, you can use it in Prodigy.

Writing a custom recipe

A recipe is a simple Python function that returns a dictionary of its components. The arguments of the recipe function will become available from the command line and let you pass in parameters like the dataset ID, the text source and other settings. Recipes can receive a name and a variable number of argument annotations, following the radicli syntax (see the docs on recipe arguments for more details).

recipe.py
import prodigy
from prodigy.core import Arg

@prodigy.recipe(
    "my-custom-recipe",
    dataset=Arg(help="Dataset to save answers to."),
    view_id=Arg("--view-id", "-v", help="Annotation interface")
)
def my_custom_recipe(dataset: str, view_id: str = "text"):
    # Load your own streams from anywhere you want
    stream = load_my_custom_stream()

    def update(examples):
        # This function is triggered when Prodigy receives annotations
        print(f"Received {len(examples)} annotations!")

    return {
        "dataset": dataset,
        "view_id": view_id,
        "stream": stream,
        "update": update
    }

Custom recipes can be used from the command line just like the built-in recipes. All you need to do is point the -F option to the Python file containing your recipe.

prodigy my-custom-recipe my_dataset --view-id text -F recipe.py

Files can contain multiple recipes, so you can group them however you like. Argument annotations, as well as the recipe function’s docstring, will also be displayed when you use the --help flag on the command line.

Recipe components

The components returned by the recipe need to include an iterable stream, a view_id and a dataset (if you want to use the storage to save the annotations to the database). The following components can be defined by a recipe:

Component	Type	Description
`dataset`	str	ID of the current project. Used to associate the annotation with a project in the database.
`view_id`	str	Annotation interface to use.
`stream`	iterable	Stream of annotation tasks in Prodigy’s JSON format.
`update`	callable	Function invoked when Prodigy receives annotations. Can be used to update a model. See here for details.
`db`	-	Storage ID, `True` for default database, `False` for no database or custom database class.
`progress`	callable	Function that takes two arguments (as of v1.10): the controller and an `update_return_value` (return value of the `update` callback, if the recipe provides one). It returns a progress value (float). See here for details.
`on_load`	callable	Function that is executed when Prodigy is started. Can be used to update a model with existing annotations.
`on_exit`	callable	Function that is executed when the user exits Prodigy. Can be used to save a model’s state to disk or export other data.
`before_db`	callable	New: 1.10 Function that is called on examples before they’re placed in the database and can be used to strip out base64 data etc. Use cautiously, as a bug in your code here could lead to data loss. See here for details.
`validate_answer`	callable	New: 1.10 Function that’s called on each answer when it’s submitted in the UI and can raise validation errors. See here for details.
`task_router`	callable	New: 1.12 Function that determines which annotator gets to annotate each example. See here for details.
`session_factory`	callable	New: 1.12 Function that defines how a session object should be initialized.
`get_session_id`	callable	Function that returns a custom session ID. If not set, a timestamp is used.
`exclude`	list	List of dataset IDs whose annotations to exclude.
`config`	dict	Recipe-specific configuration. Can be overwritten by the global and project config.

For more details on the callback functions and their usage, check out the section on recipe callbacks below.

Examples

Video tutorial: image captioning with PyTorch

The following video shows how you can use Prodigy to script fully custom annotation workflows in Python, how to plug in your own machine learning models and how to mix and match different interfaces for your specific use case. We’ll create a dataset of image captions, use an image captioning model implemented in PyTorch to suggest captions and perform error analysis to find out what the model is getting right, and where it needs improvement. You can find the code from this tutorial on GitHub.

Example: Customer feedback sentiment

Let’s say you’ve extracted examples of customer feedback and support emails, and you want to classify those by sentiment in the categories “happy”, “sad”, “angry” or “neutral”. The result could be used to get some insights into your customers’ overall satisfaction, or to provide statistics for your yearly report. Maybe you also want to experiment with training a model to predict the tone of incoming customer emails, so you can make sure that unhappy customers or critical situations can receive priority support. Your data could look something like this:

feedback.jsonl
{"text": "Thanks for your great work – really made my day!"}
{"text": "Worst experience ever, never ordering from here again."}
{"text": "My order arrived last Tuesday."}

Prodigy comes with a built-in choice interface that lets you render a task with a number of multiple or single choice options. The result could look like this:

The combination of the options and Prodigy’s accept, reject and ignore actions provides a powerful and intuitive annotation system. If none of the options apply, you can simply ignore the task. You can also select an option and reject the task – for instance, to generate negative examples for training a model. To use a multiple-choice interface, you can set "choice_style": "multiple" in the config settings returned by your recipe. "choice_auto_accept": true will automatically accept selected answers in single-choice mode, so you won’t have to hit accept in addition to selecting the option.

Recipes should be as reusable as possible, so in your custom sentiment recipe, you likely want a command-line option that allows passing in the file path to the texts. You can write your own function to load in data from a file, or use one of Prodigy’s built-in loaders via get_stream. To add the four options, you can write a simple helper function that takes a stream of tasks, and yields individual tasks with an added "options" key.

recipe.py
from pathlib import Path

import prodigy
from prodigy.core import Arg
from prodigy.components.stream import get_stream

@prodigy.recipe(
    "sentiment",
    dataset=Arg(help="Dataset to save answers to."),
    file_path=Arg(help="Path to texts")
)
def sentiment(dataset: str, file_path: Path):
    """Annotate the sentiment of texts using different mood options."""
    stream = get_stream(file_path) # load in the JSONL file
    stream = add_options(stream)   # add options to each task

    return {
        "dataset": dataset,   # save annotations in this dataset
        "view_id": "choice",  # use the choice interface
        "stream": stream,
    }

def add_options(stream):
    # Helper function to add options to every task in a stream
    options = [
        {"id": "happy", "text": "😀 happy"},
        {"id": "sad", "text": "😢 sad"},
        {"id": "angry", "text": "😠 angry"},
        {"id": "neutral", "text": "😶 neutral"},
    ]
    for task in stream:
        task["options"] = options
        yield task

Each option requires a unique id, which will be used internally. Options are simple dictionaries and can contain the same properties as regular annotation tasks – so you can also use an image, or add spans to highlight entities in the text.
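
Because the helper is a plain generator, you can try it on an in-memory list without starting the server. The following is a minimal sketch: passing the options in as a parameter and copying them per task are choices made for this illustration (so tasks don't share mutable state), not something the recipe above requires.

```python
def add_options(stream, options):
    """Yield each task with an "options" key added."""
    for task in stream:
        # Copy the task and the option dicts so tasks don't share mutable state
        yield {**task, "options": [dict(opt) for opt in options]}


options = [
    {"id": "happy", "text": "happy"},
    {"id": "sad", "text": "sad"},
]
# Illustrative in-memory stream standing in for the loaded file
tasks = list(add_options([{"text": "Thanks!"}, {"text": "Never again."}], options))
print(tasks[0])
```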

Using the recipe above, you can now start the Prodigy server with a new dataset:

prodigy sentiment customer_feedback feedback.jsonl -F recipe.py
✨ Starting the web server at http://localhost:8080 ...
Open the app in your browser and start annotating!

All tasks you annotate will be stored in the database in Prodigy’s JSON format. When you annotate an example, Prodigy will add an "answer" key, as well as an "accept" key, mapped to a list of the selected option IDs. Since this example is using a single-choice interface, this list will only ever contain one or zero IDs.

Adding other hooks

Once you’re done with annotating a bunch of examples, you might want to see a quick overview of the total sentiment counts in the dataset. The easiest way to do this is to add an on_exit component to your recipe – a function that is invoked when you stop the Prodigy server. It takes the Controller as an argument, giving you access to the database.

The Controller class takes care of putting the individual recipe components together and manages your Prodigy annotation session. It exposes various attributes, like the database and session ID, as well as methods that allow the application to interact with the REST API. For more details on the controller API, see the PRODIGY_README.html.

Adding an on_exit hook
def on_exit(controller):
    # Get all annotations in the dataset, filter out the accepted tasks,
    # count them by the selected options and print the counts.
    examples = controller.db.get_dataset(controller.dataset)
    examples = [eg for eg in examples if eg["answer"] == "accept"]
    for option in ("happy", "sad", "angry", "neutral"):
        count = len([eg for eg in examples if option in eg["accept"]])
        print(f"Annotated {count} {option} examples")

Instead of getting all annotations in the dataset, you can also choose to only get the annotations of the current session. Prodigy creates an additional dataset for each session, usually named after the current timestamp. The ID is available as the session_id attribute of the controller:

examples = controller.db.get_dataset(controller.session_id)

To integrate the new hook, simply add it to the dictionary of components returned by your recipe, e.g. 'on_exit': on_exit. When you exit the Prodigy server, you’ll now see a count of the selected options:

prodigy sentiment customer_feedback feedback.jsonl -F recipe.py
✨ Starting the web server at http://localhost:8080 ...
Open the app in your browser and start annotating!
Annotated 56 happy examples
Annotated 10 sad examples
Annotated 12 angry examples
Annotated 36 neutral examples
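
If you want to check the counting logic without running the server, you can separate it from the hook and call it with stand-in objects. This is only a sketch: the sample annotations, `fake_db` and `fake_controller` are hypothetical placeholders that expose just the `db.get_dataset` method and `dataset` attribute used in the hook above.

```python
from collections import Counter
from types import SimpleNamespace


def count_options(examples):
    """Count selected option IDs across accepted examples."""
    counts = Counter()
    for eg in examples:
        if eg.get("answer") == "accept":
            counts.update(eg.get("accept", []))
    return counts


def on_exit(controller):
    examples = controller.db.get_dataset(controller.dataset)
    for option, count in count_options(examples).items():
        print(f"Annotated {count} {option} examples")


# Hypothetical stand-ins for the database and controller, for illustration only
annotated = [
    {"text": "Thanks for your great work!", "accept": ["happy"], "answer": "accept"},
    {"text": "Worst experience ever.", "accept": ["angry"], "answer": "accept"},
    {"text": "My order arrived last Tuesday.", "accept": [], "answer": "ignore"},
]
fake_db = SimpleNamespace(get_dataset=lambda name: annotated)
fake_controller = SimpleNamespace(db=fake_db, dataset="customer_feedback")
on_exit(fake_controller)
```

Using a `Counter` also means the hook reports whichever option IDs occur in the data, so the list of options doesn't have to be repeated inside the hook.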

Example: Custom interfaces with choice, manual NER, free-form input and custom API loader

The new blocks interface lets you build even more powerful custom recipes by combining different interfaces. For some annotation tasks, it’s important to collect different pieces of information at the same time - for instance, you might want the annotator to leave a comment explaining their decision, or select a span of text that corresponds to the option they selected.

In this example, we’re writing a simple custom recipe that loads facts about cats from the Cat Facts API and adds blocks for highlighting spans, selecting multiple-choice options and writing a free-form comment. Here’s the result we want to achieve: for each incoming fact about cats, we want to annotate whether the fact is fully correct, partially correct or wrong (or whether it’s unclear). We also want to give the annotator the chance to explain their decision and highlight the relevant section in the text above.

We can break the above interface down into three blocks:

Blocks
[{"view_id": "ner_manual"}, {"view_id": "choice"}, {"view_id": "text_input"}]

An ner_manual block with a "text" containing the fact and a "tokens" property to allow fast selection. It specifies one label RELEVANT.
A choice block with four different options: “fully correct”, “partially correct”, “wrong” and “don’t know”. When the user selects an option, its "id" is added to the task’s "accept" list.
A text_input block with three rows and a custom label, “Explain your decision”. Whatever the annotator types in here should be added to the task data.

To support this combination of interfaces, we need to create input data that looks like this:

Example JSON task
{
    "text":"Adult cats only meow to communicate with humans.",
    "options":[
        {"id": 3,"text": "😺 Fully correct"},
        {"id": 2, "text": "😼 Partially correct"},
        {"id": 1, "text": "😾 Wrong"},
        { "id": 0, "text": "🙀 Don't know" }
    ],
    "tokens": [
        {"text": "Adult", "start": 0, "end": 5, "id": 0},
        {"text": "cats", "start": 6, "end": 10, "id": 1},
        {"text": "only", "start": 11, "end": 15, "id": 2},
        {"text": "meow", "start": 16, "end": 20, "id": 3},
        {"text": "to", "start": 21, "end": 23, "id": 4},
        {"text": "communicate", "start": 24, "end": 35, "id": 5},
        {"text": "with", "start": 36, "end": 40, "id": 6},
        {"text": "humans", "start": 41, "end": 47, "id": 7},
        {"text": ".", "start": 47, "end": 48, "id": 8}
    ]
}
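
To make the "tokens" format more concrete, here is a rough sketch that produces the same structure with a regular expression. The `simple_tokens` helper is hypothetical and only a stand-in for illustration: it is not spaCy's tokenizer and will split differently on many inputs, so a real recipe would typically rely on proper tokenization instead.

```python
import re


def simple_tokens(text):
    """Split text into word and punctuation tokens with character offsets."""
    tokens = []
    for i, match in enumerate(re.finditer(r"\w+|[^\w\s]", text)):
        tokens.append(
            {"text": match.group(), "start": match.start(), "end": match.end(), "id": i}
        )
    return tokens


tokens = simple_tokens("Adult cats only meow to communicate with humans.")
for token in tokens:
    print(token)
```

Each token records where it starts and ends in the original text, which is what lets the manual interface snap a selection to token boundaries.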

Here’s the cat-facts recipe that puts all of this together. To load the stream, we can write a generator function that makes a request to the /facts endpoint of the API and yields dictionaries containing the "text" and the choice "options". We can then use spaCy and the add_tokens preprocessor to tokenize the stream and add a "tokens" property.

Within each block, we can also add to and override content of the task dictionary. This makes sense for the text_input block for instance, since those settings are mostly for presentation and don’t need to be stored with each annotation. (You could also have multiple text fields that require different settings, like the "field_id" of the field that the text is written to.) Similarly, the choice and ner_manual interface will both render a "text" if it’s present in the data. However, in this case, we only want to show the text once in the manual NER block, so we can override "text": None in the choice block.

recipe.py
import prodigy
from prodigy.components.preprocess import add_tokens
import requests
import spacy

@prodigy.recipe("cat-facts")
def cat_facts_ner(dataset, lang="en"):
    # We can use the blocks to override certain config and content, and set
    # "text": None for the choice interface so it doesn't also render the text
    blocks = [
        {"view_id": "ner_manual"},
        {"view_id": "choice", "text": None},
        {"view_id": "text_input", "field_rows": 3, "field_label": "Explain your decision"}
    ]
    options = [
        {"id": 3, "text": "😺 Fully correct"},
        {"id": 2, "text": "😼 Partially correct"},
        {"id": 1, "text": "😾 Wrong"},
        {"id": 0, "text": "🙀 Don't know"}
    ]

    def get_stream():
        res = requests.get("https://cat-fact.herokuapp.com/facts").json()
        for fact in res:
            yield {"text": fact["text"], "options": options}

    nlp = spacy.blank(lang)                           # blank spaCy pipeline for tokenization
    stream = get_stream()                             # set up the stream
    stream = add_tokens(nlp, stream)                  # tokenize the stream for ner_manual

    return {
        "dataset": dataset,          # the dataset to save annotations to
        "view_id": "blocks",         # set the view_id to "blocks"
        "stream": stream,            # the stream of incoming examples
        "config": {
            "labels": ["RELEVANT"],  # the labels for the manual NER interface
            "blocks": blocks         # add the blocks to the config
        }
    }

We can now run the recipe on the command line using its name and arguments, and -F pointing to the Python file containing the code:

prodigy cat-facts cat_facts_data -F recipe.py
✨ Starting the web server at http://localhost:8080 ...
Open the app in your browser and start annotating!

Recipe callback functions in detail

Recipes let you define different callback functions that are executed at different points of the lifecycle and give you access to the data and created annotations, and let you compute results based on the submitted answers.

update

The update callback is executed whenever the server receives a new batch of annotated examples. It can be used to update a model in the loop with the answers, but also to perform any other action that requires access to the annotations as they come in – for instance, sending a status update to another service. The update function doesn’t have to return anything, but if it does, that value is passed to the progress function, if available. This lets you implement a custom progress calculation that takes things like the loss into account.

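def update(answers):
    # Collect texts and per-example entity offsets from the annotated spans
    texts = [eg["text"] for eg in answers]
    annots = [
        {"entities": [(s["start"], s["end"], s["label"]) for s in eg.get("spans", [])]}
        for eg in answers
    ]
    losses = {}
    # Assumes an `nlp` object whose update() accepts texts and annotation dicts
    nlp.update(texts, annots, losses=losses)
    return losses["ner"]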

Argument	Type	Description
`answers`	list	List of annotated example dicts.
RETURNS	float / int / `None`	Optional value that’s available in the `progress` callback.

progress New: 1.10

While the progress callback itself isn’t new, v1.10 introduces a new and consistent API. The function is used to calculate the progress shown in the sidebar and is executed on load and whenever new answers are received by the server, right after the update callback, if available.

def progress(ctrl, update_return_value):
    return ctrl.session_annotated / 2500

Argument	Type	Description
`ctrl`	`Controller`	The `Controller` that lets you access details about the annotation process.
`update_return_value`	float / int / `None`	Value returned by the `update` callback, if available. Can be used in active learning workflows to calculate the progress based on the loss.
RETURNS	float	A progress value between `0` and `1`, or `None` for infinite.
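
The hard-coded target in the example above can exceed 1 once more than 2500 answers come in. One possible variation, sketched here under the assumption that `session_annotated` is the only controller attribute you need, is a small factory that clamps the value. The `make_progress` helper and `fake_ctrl` are hypothetical names used for illustration.

```python
from types import SimpleNamespace


def make_progress(target):
    """Build a progress callback that reaches 1.0 after `target` annotations."""

    def progress(ctrl, update_return_value):
        # update_return_value is accepted to match the callback signature
        # but is not used in this count-based calculation
        return min(ctrl.session_annotated / target, 1.0)

    return progress


progress = make_progress(target=2500)
# Hypothetical stand-in exposing only the attribute used above
fake_ctrl = SimpleNamespace(session_annotated=500)
print(progress(fake_ctrl, None))
```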

validate_answer New: 1.10

If a validate_answer callback is provided, it is executed every time the user submits an annotation via the accept or reject button. The function receives the submitted example and can perform any validation checks and assert or raise errors if needed. If validation fails, the user will see the validation error message and Prodigy won’t save the answer. The user can then either fix the problem, or skip the example by pressing ignore.

Using the validate_answer callback makes the most sense in combination with manual interfaces that modify the task during annotation, e.g. to add entity spans or select multiple choice options. Here’s an example of a validation function that raises an error if less than 1 or more than 3 options were selected in a multiple-choice UI:

def validate_answer(eg):
    selected = eg.get("accept", [])
    assert 1 <= len(selected) <= 3, "Select at least 1 but not more than 3 options."

If you’re checking for multiple conditions, keep in mind that Python will exit after raising the first exception and the annotator will only ever see the first raised error. If you want to give feedback on multiple validation errors, you can collect them and then raise one error at the end:

def validate_answer(eg):
    selected = eg.get("accept", [])
    spans = eg.get("spans", [])
    errors = []
    if not 1 <= len(selected) <= 3:
        errors.append("Select at least 1 but not more than 3 options.")
    if len(spans) == 0:
        errors.append("Select at least one span.")
    if errors:
        raise ValueError(" ".join(errors))

Argument	Type	Description
`eg`	dict	The annotated example, including the `"answer"`.
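
Since a validation callback is a regular function, you can call it directly with hand-written tasks to see which messages an annotator would get. A minimal sketch, with illustrative task dictionaries rather than real annotation data:

```python
def validate_answer(eg):
    """Raise a single error listing every failed check."""
    selected = eg.get("accept", [])
    spans = eg.get("spans", [])
    errors = []
    if not 1 <= len(selected) <= 3:
        errors.append("Select at least 1 but not more than 3 options.")
    if not spans:
        errors.append("Select at least one span.")
    if errors:
        raise ValueError(" ".join(errors))


# Illustrative tasks, not real annotation data
valid = {"accept": ["a"], "spans": [{"start": 0, "end": 5, "label": "RELEVANT"}]}
invalid = {"accept": [], "spans": []}

validate_answer(valid)  # passes silently
try:
    validate_answer(invalid)
except ValueError as err:
    message = str(err)
    print(message)
```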

before_db New: 1.10

The before_db callback lets you modify examples before they’re placed in the database. It’s mostly useful to prevent database bloat and strip out data like base64-encoded images or audio files. It shouldn’t be used for other more complex postprocessing – that’s a much better fit for a separate step. You typically want the data in the database to reflect exactly what the annotator saw and worked on, so you don’t lose any information and can easily reproduce a single datapoint later on.

def before_db(examples):
    for eg in examples:
        # If the image is a base64 string and the path to the original file
        # is present in the task, remove the image data
        if eg["image"].startswith("data:") and "path" in eg:
            eg["image"] = eg["path"]
    return examples

Argument	Type	Description
`examples`	list	A list of annotated example dictionaries.
RETURNS	list	A list of (optionally) modified example dictionaries.
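
If a dataset mixes image tasks with tasks that have no "image" key, a slightly more defensive variant may be safer. The sketch below makes that assumption explicit and uses made-up sample tasks, with a shortened base64 string standing in for real image data.

```python
def before_db(examples):
    """Replace inline base64 image data with the original file path."""
    for eg in examples:
        image = eg.get("image")
        # Only touch tasks that carry a data URI and know their source path
        if isinstance(image, str) and image.startswith("data:") and "path" in eg:
            eg["image"] = eg["path"]
    return examples


# Illustrative tasks: one image task with inline data, one text-only task
examples = [
    {"image": "data:image/png;base64,iVBORw0KGgo=", "path": "cat.png", "answer": "accept"},
    {"text": "No image here", "answer": "accept"},
]
cleaned = before_db(examples)
print(cleaned[0]["image"])
```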

task_router New: 1.12

In Prodigy, a task router is a Python function that accepts a Controller, a session_id and a dictionary that describes the current annotation task, and returns a list of session IDs that will annotate this example.

If you’re writing a custom task router, it can help to understand how task routing is handled in Prodigy internally. You can find a guide on this topic here.

recipe.py (excerpt)
def task_router_conf(ctrl: Controller, session_id: str, item: Dict) -> List[str]:
    """Route tasks based on the confidence of a model."""
    # Get all sessions known to the Controller now
    all_annotators = ctrl.all_session_ids

    # Calculate a confidence score from a custom model
    confidence_score = model(item['text'])

    # If the confidence is low, the example might be hard
    # and then everyone needs to check
    if confidence_score < 0.3:
        return all_annotators

    # Otherwise just one person needs to check.
    # We re-use the task_hash to ensure consistent routing of the task.
    idx = item['_task_hash'] % len(all_annotators)
    return [all_annotators[idx]]

Argument	Type	Description
`ctrl`	`Controller`	Prodigy Controller object.
`session_id`	string	The current session in consideration
`item`	dict	The example to be annotated
RETURNS	list	A list of strings that represent session ids
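
The hash-based part of the routing can be looked at in isolation. The following sketch uses a hypothetical `route_by_hash` helper and made-up session names. It assumes the task hash is an integer, which may be negative, and sorts the sessions first so that the result doesn't depend on the order in which they were registered.

```python
from typing import List


def route_by_hash(task_hash: int, annotators: List[str]) -> List[str]:
    """Pick one annotator deterministically from the task hash."""
    if not annotators:
        return []
    # Sort first so the choice doesn't depend on the order sessions were seen in
    ordered = sorted(annotators)
    # Python's % returns a non-negative result for a positive divisor,
    # so negative hashes still map to a valid index
    return [ordered[task_hash % len(ordered)]]


annotators = ["alex", "blake", "casey"]
print(route_by_hash(-1234567, annotators))
```

Note that the assignment is only stable while the set of sessions stays the same: if an annotator joins or leaves, the same hash can map to a different session.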

session_factory New: 1.12

Whenever an annotator accesses the Prodigy server with a URL, a new annotation session is created. Each annotation session is modelled by a Session object whose initialization can be fully customized by providing the session_factory callback to the Controller. This is useful if, for example, you want to pre-populate a new session with a custom history to match your custom task router. See the Session component documentation for details on how to initialize it. Apart from initializing the session object, this callback should also add the newly created session to the Controller’s internal list by calling ctrl.add_session.

The example below gives a demonstration of a custom session factory that distinguishes between supervisor sessions and other sessions.

recipe.py (excerpt)
def session_factory(ctrl: Controller, session_id: str) -> Session:
    """Initialize session with total_annotated based on session_id"""
    total_annotated = 0
    session_history_hashes = None
    supervisor_ids = ["supervisor_1", "supervisor_2"]
    supervisor_sessions = set([ctrl.get_session_name(n) for n in supervisor_ids])
    # get the history for this session if it exists in the DB
    if ctrl.exclude_by == "task":
        session_history_hashes = ctrl.db.get_task_hashes(session_id)
    else:
        session_history_hashes = ctrl.db.get_input_hashes(session_id)

    # A supervisor session will only consider "supervisor" totals.
    if session_id in supervisor_sessions:
        sessions = supervisor_sessions
    else:
        sessions = set(ctrl.session_ids).difference(supervisor_sessions)
    for s in sessions:
        total_annotated += ctrl.db.count_dataset(s, session=True)

    # initialize Session object
    session = Session(
        session_id,
        ctrl.stream,
        batch_size=ctrl.batch_size,
        answered_input_hashes=(
            session_history_hashes if ctrl.exclude_by == "input" else None
        ),
        answered_task_hashes=(
            session_history_hashes if ctrl.exclude_by == "task" else None
        ),
        total_annotated=total_annotated,
        target_annotated=ctrl.target_annotated or 0,
    )
    # add it to Controller
    ctrl.add_session(session)
    return session

Argument	Type	Description
`ctrl`	`Controller`	Prodigy Controller object.
`session_id`	string	ID of the session to be initialized
RETURNS	Session	Initialized Session object

Customizing recipe arguments

Prodigy’s @recipe decorator turns the decorated function into a CLI command. The first argument is the recipe name, followed by optional argument annotations that let you specify how the argument can be set on the command line. Argument annotations are also documented when you run a recipe command with --help. The syntax is defined by the radicli command line library, which should feel familiar if you’ve ever used click or typer.

Here’s an example:

recipe.py
from typing import List
from prodigy.core import Arg, recipe

@recipe(
    "custom-recipe",
    dataset=Arg(help="The dataset to use"),
    label=Arg("--label", "-l", help="Comma-separated label(s)"),
    silent=Arg("--silent", "-S", help="Don't output anything")
)
def custom_recipe(dataset: str, label: List[str], silent: bool = False):
    print(f"{dataset=} {label=} {silent=}")
    # Do something more here...

This file can now be passed to Prodigy on the command line via the -F argument, which will also make the custom recipe available. You could call this recipe from the command line via:

Command line usage

prodigy custom-recipe my_dataset --label PERSON,ORG --silent -F recipe.py
dataset='my_dataset' label=['PERSON', 'ORG'] silent=True

You’ll notice that the arguments are parsed and converted to their intended types. Argument annotations are defined with the Arg class, which adds the help text and also lets you specify shorthand arguments available on the CLI.

You may wonder how the types are automatically converted because the values that come back from the command line are always strings. When you type --label PERSON,ORG, the label you receive is a string "PERSON,ORG". However, radicli allows you to apply a converter to ensure that the variable is parsed into the correct type. Many types are already implemented on your behalf, but you are always able to write a custom converter.

The example below will add a converter for the --label argument.

recipe.py
from typing import List
from prodigy.core import Arg, recipe

# Define a converter to be used in the recipe below
demo_converter = lambda d: [lab.lower() for lab in d.split(",")]

@recipe(
    "custom-recipe",
    dataset=Arg(help="The dataset to use"),
    label=Arg("--label", "-l", help="Comma-separated label(s)", converter=demo_converter),
    silent=Arg("--silent", "-S", help="Don't output anything")
)
def custom_recipe(dataset: str, label: List[str], silent: bool = False):
    print(f"{dataset=} {label=} {silent=}")
    # Do something more here...

And you can see that the labels are now lower case when you run the recipe.

Command line usage

prodigy custom-recipe my_dataset --label PERSON,ORG --silent -F recipe.py
dataset='my_dataset' label=['person', 'org'] silent=True
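
A converter is just a function from the raw string to the value you want, so it can be written as a named function and checked on its own before it's attached to an argument. The `parse_labels` function below is a hypothetical example that also strips whitespace and skips empty entries. It should be usable in the same position as the lambda above, although that isn't verified here.

```python
from typing import List


def parse_labels(value: str) -> List[str]:
    """Convert a comma-separated string into a list of lower-cased labels."""
    # Strip whitespace around each label and skip empty entries such as "a,,b"
    return [label.strip().lower() for label in value.split(",") if label.strip()]


print(parse_labels("PERSON, ORG,"))
```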

Testing recipes

The @recipe decorator leaves the original function intact, so calling it from within Python will simply return a dictionary of its components. This lets you write comprehensive unit tests to ensure that your recipes are working correctly.

Prodigy recipe
@prodigy.recipe("my-recipe")
def my_recipe(dataset, database):
    stream = my_database.load(database)
    view_id = "classification" if database == "products" else "text"
    return {"dataset": dataset, "stream": stream, "view_id": view_id}

Unit test
def test_recipe():
    components = my_recipe("my_dataset", "products")
    assert "dataset" in components
    assert "stream" in components
    assert "view_id" in components
    assert components["dataset"] == "my_dataset"
    assert components["view_id"] == "classification"
    assert hasattr(components["stream"], "__iter__")
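
Tests like this can go one step further and consume the stream to check the tasks themselves. The sketch below uses an undecorated stand-in for a recipe so that it runs without Prodigy installed. `missing_components` and `my_recipe` are hypothetical helpers, and the list of required components reflects the case where annotations are saved to a dataset.

```python
REQUIRED_COMPONENTS = ("dataset", "view_id", "stream")


def missing_components(components):
    """Return the names of required components absent from a recipe's output."""
    return [name for name in REQUIRED_COMPONENTS if name not in components]


def my_recipe(dataset, texts):
    # Undecorated stand-in for a recipe function, for illustration only
    stream = ({"text": text} for text in texts)
    return {"dataset": dataset, "view_id": "text", "stream": stream}


components = my_recipe("my_dataset", ["first", "second"])
assert missing_components(components) == []
# Consuming the stream checks the task content, not just that it is iterable
assert list(components["stream"]) == [{"text": "first"}, {"text": "second"}]
```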