API

Loaders and Input Data

Loaders are helper classes that turn a source file into an iterable stream. Calling a loader returns a generator that yields annotation tasks in Prodigy’s JSON format. Prodigy supports streaming in data from a variety of formats via the available loader components. To load data from other formats or sources, like a database or an API, you can write your own loader function that returns an iterable stream and include it in your custom recipe.

Input data formats

Text sources

data.jsonl

{"text": "This is a sentence."}
{"text": "This is another sentence.", "meta": {"score": 0.1}}

data.json

[
  { "text": "This is a sentence." },
  { "text": "This is another sentence.", "meta": { "score": 0.1 } }
]

data.csv

Text,Label,Meta
This is a sentence.,POSITIVE,0
This is another sentence.,NEGATIVE,0.1

Column headers can be lowercase or title case. Columns for label and meta are optional. The value of the meta column will be added as a "meta" key within the "meta" dictionary, e.g. {"text": "...", "meta": {"meta": 0}}.

data.txt

This is a sentence.
This is another sentence.
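The CSV mapping described above can be sketched in plain Python. This is a toy illustration using the stdlib csv module, not Prodigy’s own loader; note that values read from a CSV file arrive as strings.

```python
import csv
import io

# Toy illustration of the CSV-to-task mapping described above (not Prodigy's
# own CSV loader): "Text" becomes "text", "Label" becomes "label", and the
# "Meta" column is nested as a "meta" key within the "meta" dictionary.
CSV_DATA = """Text,Label,Meta
This is a sentence.,POSITIVE,0
This is another sentence.,NEGATIVE,0.1
"""

def csv_to_tasks(file_like):
    for row in csv.DictReader(file_like):
        task = {"text": row["Text"]}
        if row.get("Label"):  # the label column is optional
            task["label"] = row["Label"]
        if row.get("Meta"):  # the meta column is optional
            task["meta"] = {"meta": row["Meta"]}
        yield task

tasks = list(csv_to_tasks(io.StringIO(CSV_DATA)))
```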

Comparison files

Each entry in a comparison file needs to include an output key containing the annotation example data – for example, the text, a text and an entity span or an image. Optionally, you can also include an input key for the baseline annotation. The id is used to combine the examples from each file. If an ID is only present in one file, the example is skipped.

model_a.jsonl

{"id": 0, "input": {"text": "NLP"}, "output": {"text": "Natural Language Processing"}}
{"id": 1, "input": {"text": "Hund"}, "output": {"text": "dog"}}

model_b.jsonl

{"id": 0, "input": {"text": "NLP"}, "output": {"text": "Neuro-Linguistic Programming"}}
{"id": 1, "input": {"text": "Hund"}, "output": {"text": "hound"}}
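The pairing behavior described above can be sketched as follows. This is a simplified illustration, not Prodigy’s internal code, and the output_a/output_b keys are made up for this sketch:

```python
# Simplified sketch of pairing comparison files by "id", as described above.
# Entries whose ID only appears in one file are skipped.
model_a = [
    {"id": 0, "input": {"text": "NLP"}, "output": {"text": "Natural Language Processing"}},
    {"id": 1, "input": {"text": "Hund"}, "output": {"text": "dog"}},
    {"id": 2, "input": {"text": "Katze"}, "output": {"text": "cat"}},  # only in one file
]
model_b = [
    {"id": 0, "input": {"text": "NLP"}, "output": {"text": "Neuro-Linguistic Programming"}},
    {"id": 1, "input": {"text": "Hund"}, "output": {"text": "hound"}},
]

def pair_by_id(examples_a, examples_b):
    by_id = {eg["id"]: eg for eg in examples_b}
    for eg in examples_a:
        other = by_id.get(eg["id"])
        if other is None:
            continue  # ID only present in one file: skip the example
        yield {
            "id": eg["id"],
            "input": eg.get("input"),  # optional baseline annotation
            "output_a": eg["output"],
            "output_b": other["output"],
        }

pairs = list(pair_by_id(model_a, model_b))
```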

Match patterns

Match patterns can be used in the ner.manual, ner.match and ner.teach recipes to focus on specific entities you’re interested in – for example, to collect training data for a new entity type. You can also use the terms.to-patterns recipe to convert a dataset of seed terms to a JSONL pattern file.

Each entry should contain a "label" and "pattern" key. A pattern can be an exact string, or a rule-based token pattern (used by spaCy’s Matcher class), consisting of a list of dictionaries, each describing one individual token and its attributes. When using token patterns, keep in mind that their interpretation depends on the model’s tokenizer.

patterns.jsonl

{"label": "FRUIT", "pattern": [{"lower": "apple"}]}
{"label": "FRUIT", "pattern": [{"lower": "goji"}, {"lower": "berry"}]}
{"label": "VEGETABLE", "pattern": [{"lower": "squash", "pos": "NOUN"}]}
{"label": "VEGETABLE", "pattern": "Lamb's lettuce"}

Here are some examples of match patterns and the respective matched strings. For more details, see the spaCy documentation on rule-based matching.

Pattern | Matches
[{"lower": "apple"}] | “apple”, “APPLE”, “Apple”, “ApPlLe” etc.
[{"text": "apple"}] | “apple”
[{"lower": "squash", "pos": "NOUN"}] | “squash”, “Squash” etc. (nouns only, i.e. not “to squash”)
"Lamb's lettuce" | “Lamb’s lettuce”
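To make the table above concrete, here is a deliberately simplified matcher that only understands the "lower" and "text" token attributes. spaCy’s Matcher supports many more attributes, and real matching depends on the model’s tokenizer, so treat this purely as an illustration:

```python
# Minimal sketch of token-pattern matching over a pre-tokenized text.
# Only the "lower" and "text" attributes are handled here.
def matches_at(tokens, pattern, i):
    if i + len(pattern) > len(tokens):
        return False
    for spec, token in zip(pattern, tokens[i:i + len(pattern)]):
        if "lower" in spec and token.lower() != spec["lower"]:
            return False
        if "text" in spec and token != spec["text"]:
            return False
    return True

def find_matches(tokens, pattern):
    return [(i, i + len(pattern)) for i in range(len(tokens)) if matches_at(tokens, pattern, i)]

tokens = "I love Goji Berry smoothies".split()  # naive whitespace tokenization
print(find_matches(tokens, [{"lower": "goji"}, {"lower": "berry"}]))  # [(2, 4)]
print(find_matches(tokens, [{"text": "goji"}]))  # [] (case-sensitive, no match)
```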

Prodigy will assign each entry a hash ID based on its label and pattern. This ID will be used when updating the matcher from an existing dataset, e.g. using ner.match with the --resume flag. If you want to use your own pattern ID system, you can define an "id" value on each entry, which will be respected by Prodigy.

Images

Images can be loaded via any of the data formats that support keyed inputs. If the data contains an image key, it will be copied over into the annotation task, and you’ll be able to use it with the image or classification interface.

data.jsonl

{"image": "https://media.giphy.com/media/LHZyixOnHwDDy/giphy.gif"}

Alternatively, the Images loader lets you stream in image files from a directory. All images will be encoded as base64 data URIs and included as the image key.

from prodigy.components.loaders import Images
stream = Images("/path/to/images")

As of Prodigy v1.9.4, you can also use the ImageServer loader, which will serve the images via the Prodigy server and lets you bypass the base64 encoding. To use the loader from the command line, you can set --loader image-server.

from prodigy.components.loaders import ImageServer
stream = ImageServer("/path/to/images")

Fetching images from local paths and URLs

You can also use the fetch_images preprocessor to replace all image paths and URLs in your stream with base64 data URIs. The skip keyword argument lets you specify whether to skip invalid images that can’t be converted (for example, because the path doesn’t exist, or the URL can’t be fetched). If set to False, Prodigy will raise a ValueError if it encounters invalid images.

from prodigy.components.preprocess import fetch_images

stream = [{"image": "/path/to/image.jpg"}, {"image": "https://example.com/image.jpg"}]
stream = fetch_images(stream, skip=True)

Hashing and deduplication

When a new example comes in, Prodigy assigns it two hashes: the input hash and the task hash. Both hashes are uint32 values, so they can be stored as JSON with each task. Based on those hashes, Prodigy is able to determine whether two examples are entirely different, different questions about the same input (e.g. the same text), or the same question about the same input. For more details on how the hashes are generated and how to set custom hashes, see the set_hashes docs.

Hash | Type | Description
_input_hash | uint32 | Hash representing the input that annotations are collected on, e.g. the "text", "image" or "html". Examples with the same text will receive the same input hash.
_task_hash | uint32 | Hash representing the “question” about the input, i.e. the "label", "spans" or "options". Examples with the same text but different label suggestions or options will receive the same input hash, but different task hashes.
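The distinction can be illustrated with a toy hashing scheme. This is not Prodigy’s actual algorithm (see the set_hashes docs for that); the point is only that the input hash covers the input keys, while the task hash also covers the keys that make up the “question”:

```python
import hashlib
import json

# Toy illustration of input hash vs. task hash, not Prodigy's actual algorithm.
INPUT_KEYS = ("text", "image", "html")
TASK_KEYS = INPUT_KEYS + ("label", "spans", "options")

def simple_hash(task, keys):
    data = json.dumps({k: task[k] for k in keys if k in task}, sort_keys=True)
    # Truncate to 32 bits to mimic the uint32 range mentioned above
    return int(hashlib.sha1(data.encode("utf8")).hexdigest(), 16) % (2 ** 32)

eg1 = {"text": "This is a sentence.", "label": "POSITIVE"}
eg2 = {"text": "This is a sentence.", "label": "NEGATIVE"}

# Same input (same text): equal input hashes, different task hashes
assert simple_hash(eg1, INPUT_KEYS) == simple_hash(eg2, INPUT_KEYS)
assert simple_hash(eg1, TASK_KEYS) != simple_hash(eg2, TASK_KEYS)
```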

As of v1.9, recipes can return an optional "exclude_by" setting in their "config" to specify whether to exclude by "input" or "task" (default). Filtering and excluding by input hash is especially useful for manual and semi-manual workflows like ner.manual and ner.correct. If you’ve already annotated an example and it comes in again with suggestions from a model or pattern, Prodigy will correctly determine that it’s a different “question”. However, unlike in the binary workflows, you typically don’t want to see the example again, because you already created a gold-standard annotation for it.
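For example, a custom recipe can opt into filtering by input hash by including the setting in its returned "config". A minimal sketch, where the stream and view ID are placeholders:

```python
# Sketch of the components dictionary a custom recipe would return, with
# "exclude_by" set to "input" instead of the default "task".
def build_components(dataset, stream):
    return {
        "dataset": dataset,
        "stream": stream,
        "view_id": "ner_manual",  # placeholder view ID
        "config": {"exclude_by": "input"},  # filter out already-annotated inputs
    }

components = build_components("my_dataset", [{"text": "hello"}])
```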


File loaders

Out of the box, Prodigy currently supports loading in data from single files of JSONL, JSON, CSV or plain text. You can specify the loader via the --loader argument on the command line. If no loader is set, Prodigy will use the file extension to pick the respective loader. Loaders are available via prodigy.components.loaders.

ID | Component | Description
jsonl | JSONL | Stream in newline-delimited JSON from a file. Prodigy’s preferred format, as it’s flexible and doesn’t require parsing the entire file.
json | JSON | Stream in JSON from a file. Requires loading and parsing the entire file.
csv | CSV | Stream in a CSV file using the csv module. The keys will be read off the headers in the first line. Supports an optional delimiter keyword argument.
txt | TXT | Stream in plain text from a file containing one example per line. Will yield tasks containing only a text property.
images | Images | Stream in images from a directory. All images will be encoded as base64 data URIs and included as the image key to be rendered with the image interface.
image-server | ImageServer | New: v1.9.4. Stream in images from a directory. Image files will be served via the Prodigy server and their data won’t be included with the task.

Example

from prodigy.components.loaders import JSONL, JSON, CSV, TXT, Images, ImageServer

jsonl_stream = JSONL("path/to/file.jsonl")
json_stream = JSON("path/to/file.json")
csv_stream = CSV("path/to/file.csv", delimiter=",")
txt_stream = TXT("path/to/file.txt")
img_stream = Images("path/to/images")
img_stream2 = ImageServer("path/to/images")

Example

prodigy ner.manual your_dataset en_core_web_sm /tmp/your_data.dump --loader txt --label PERSON,ORG

Corpus loaders

Prodigy also supports converting data from popular datasets and corpora.

ID | Component | Description
reddit | Reddit | Stream in examples from a file of the Reddit corpus. Will extract, clean and validate the comments.

Example

from prodigy.components.loaders import Reddit

reddit_stream = Reddit("path/to/reddit.bz2")

Loading from standard input

If the source argument on the command line is set to -, Prodigy will read from sys.stdin. This lets you pipe data forward. If you’re loading data in a different format, make sure to set the --loader argument on the command line so Prodigy knows how to interpret the incoming data.


cat ./your_data.jsonl | prodigy ner.manual your_dataset en_core_web_sm - --loader jsonl

Loading text files from a directory or custom format

A custom loader should be a function that loads your data and yields dictionaries in Prodigy’s JSON format. If you’re writing a custom recipe, you can implement your loading in your recipe function:

recipe.py (pseudocode)

import prodigy

@prodigy.recipe("custom-recipe-with-loader")
def custom_recipe_with_loader(dataset, source):
    stream = load_your_source_here(source)  # implement your custom loading
    return {"dataset": dataset, "stream": stream, "view_id": "text"}

Using custom loaders with built-in recipes

If you want to use a built-in recipe like ner.manual but load in data from a custom source, there’s usually no need to copy-paste the recipe script only to replace the loader. Instead, you can write a loader script that outputs the data, and then pipe that output forward. If the source argument on the command line is set to -, Prodigy will read from sys.stdin:


python load_data.py | prodigy ner.manual your_dataset en_core_web_sm -

All your custom loader script needs to do is load the data somehow, create annotation tasks in Prodigy’s format (e.g. dictionary with a "text" key) and print the dumped JSON.

load_data.py (pseudocode)

from pathlib import Path
import json

data_path = Path("/path/to/directory")
for file_path in data_path.iterdir():  # iterate over the directory
    with file_path.open("r", encoding="utf8") as lines:  # open each file
        for line in lines:
            task = {"text": line.strip()}  # create one task per line of text
            print(json.dumps(task))  # dump and print the JSON

This approach works for any file format and data type – for example, you could also load in data from a different database or via an API. For extra convenience, you can also wrap your loader in a custom recipe and let Prodigy take care of adding the command-line interface. If a custom recipe doesn’t return a dictionary of components, Prodigy won’t start the server and will simply execute the code.

load_data.py (pseudocode)

import prodigy

@prodigy.recipe("load-data")  # add argument annotations and shortcuts if needed
def load_data(dir_path):
    ...  # the loader code here: print one JSON task per line

You can then use your custom loader like this:


prodigy load-data /path/to/directory -F load_data.py | prodigy ner.manual your_dataset en_core_web_sm -

Live APIs

API loaders are similar to file format loaders, but stream in content via a web API – for example, news headlines or teasers for a topic or from a specific publication, images for a search term or related tags. Individual APIs differ in the type of content they provide and the respective rate limit restrictions. All APIs supported by Prodigy come with a free license option and should provide sufficient rate limits for use on a single machine.

API loaders are available via prodigy.components.loaders, and using them requires an entry for the loader ID in the "api_keys" section of your prodigy.json. The value of the source argument on the command line is used as the API query.
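A prodigy.json with API keys might look like this. The plain string values are placeholders, and the Twitter entry uses the dict format described below; treat the exact shape as a sketch, not a definitive reference.

```json
{
  "api_keys": {
    "nyt": "YOUR_API_KEY",
    "twitter": {
      "consumer_key": "...",
      "consumer_secret": "...",
      "access_token": "...",
      "access_token_secret": "..."
    }
  }
}
```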

ID | Loader | Description
nyt | NewYorkTimes | The New York Times API.
guardian | Guardian | The Guardian API.
zeit | Zeit | Die Zeit API (German).
newsapi | NewsAPI | News API.
twitter | Twitter | Twitter API. Requires the API key to be a dict with consumer_key, consumer_secret, access_token and access_token_secret.
tumblr | Tumblr | Tumblr API. Returns images.
github | GitHub | GitHub API. Doesn’t require an API key.
unsplash | Unsplash | Unsplash API. Returns images.