API

Loaders and Input Data

Loaders are helper classes that turn a source file into an iterable stream. Calling a loader returns a generator that yields annotation tasks in Prodigy’s JSON format. Prodigy supports streaming in data from a variety of formats via the available loader components. To load data from other formats or sources, like a database or an API, you can write your own loader function that returns an iterable stream and include it in your custom recipe.

Input data formats

Text sources

data.jsonl

{"text": "This is a sentence."}
{"text": "This is another sentence.", "meta": {"score": 0.1}}

data.json

[
  { "text": "This is a sentence." },
  { "text": "This is another sentence.", "meta": { "score": 0.1 } }
]

data.csv

Text,Label,Meta
This is a sentence.,POSITIVE,0
This is another sentence.,NEGATIVE,0.1

Column headers can be lowercase or title case. Columns for label and meta are optional. The value of the meta column will be added as a "meta" key within the "meta" dictionary, e.g. {"text": "...", "meta": {"meta": 0}}.

data.txt

This is a sentence.
This is another sentence.
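The CSV mapping described above can be sketched in plain Python. This is a toy illustration using the stdlib csv module, not Prodigy’s own loader; note that values read from a CSV file arrive as strings.

```python
import csv
import io

# Toy illustration of the CSV-to-task mapping described above (not Prodigy's
# own CSV loader): "Text" becomes "text", "Label" becomes "label", and the
# "Meta" column is nested as a "meta" key within the "meta" dictionary.
CSV_DATA = """Text,Label,Meta
This is a sentence.,POSITIVE,0
This is another sentence.,NEGATIVE,0.1
"""

def csv_to_tasks(file_like):
    for row in csv.DictReader(file_like):
        task = {"text": row["Text"]}
        if row.get("Label"):  # the label column is optional
            task["label"] = row["Label"]
        if row.get("Meta"):  # the meta column is optional
            task["meta"] = {"meta": row["Meta"]}
        yield task

tasks = list(csv_to_tasks(io.StringIO(CSV_DATA)))
```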

Comparison files

Each entry in a comparison file needs to include an output key containing the annotation example data – for example, the text, a text and an entity span or an image. Optionally, you can also include an input key for the baseline annotation. The id is used to combine the examples from each file. If an ID is only present in one file, the example is skipped.

model_a.jsonl

{"id": 0, "input": {"text": "NLP"}, "output": {"text": "Natural Language Processing"}}
{"id": 1, "input": {"text": "Hund"}, "output": {"text": "dog"}}

model_b.jsonl

{"id": 0, "input": {"text": "NLP"}, "output": {"text": "Neuro-Linguistic Programming"}}
{"id": 1, "input": {"text": "Hund"}, "output": {"text": "hound"}}
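The pairing behavior described above can be sketched as follows. This is a simplified illustration, not Prodigy’s internal code, and the output_a/output_b keys are made up for this sketch:

```python
# Simplified sketch of pairing comparison files by "id", as described above.
# Entries whose ID only appears in one file are skipped.
model_a = [
    {"id": 0, "input": {"text": "NLP"}, "output": {"text": "Natural Language Processing"}},
    {"id": 1, "input": {"text": "Hund"}, "output": {"text": "dog"}},
    {"id": 2, "input": {"text": "Katze"}, "output": {"text": "cat"}},  # only in one file
]
model_b = [
    {"id": 0, "input": {"text": "NLP"}, "output": {"text": "Neuro-Linguistic Programming"}},
    {"id": 1, "input": {"text": "Hund"}, "output": {"text": "hound"}},
]

def pair_by_id(examples_a, examples_b):
    by_id = {eg["id"]: eg for eg in examples_b}
    for eg in examples_a:
        other = by_id.get(eg["id"])
        if other is None:
            continue  # ID only present in one file: skip the example
        yield {
            "id": eg["id"],
            "input": eg.get("input"),  # optional baseline annotation
            "output_a": eg["output"],
            "output_b": other["output"],
        }

pairs = list(pair_by_id(model_a, model_b))
```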

Match patterns

Match patterns can be used in the ner.manual, ner.match and ner.teach recipes to focus on specific entities you’re interested in – for example, to collect training data for a new entity type. You can also use the terms.to-patterns recipe to convert a dataset of seed terms to a JSONL pattern file.

Each entry should contain a "label" and "pattern" key. A pattern can be an exact string, or a rule-based token pattern (used by spaCy’s Matcher class), consisting of a list of dictionaries, each describing one individual token and its attributes. When using token patterns, keep in mind that their interpretation depends on the model’s tokenizer.

patterns.jsonl

{"label": "FRUIT", "pattern": [{"lower": "apple"}]}
{"label": "FRUIT", "pattern": [{"lower": "goji"}, {"lower": "berry"}]}
{"label": "VEGETABLE", "pattern": [{"lower": "squash", "pos": "NOUN"}]}
{"label": "VEGETABLE", "pattern": "Lamb's lettuce"}

Here are some examples of match patterns and the respective matched strings. For more details, see the spaCy documentation on rule-based matching.

Pattern | Matches
[{"lower": "apple"}] | “apple”, “APPLE”, “Apple”, “ApPlLe” etc.
[{"text": "apple"}] | “apple”
[{"lower": "squash", "pos": "NOUN"}] | “squash”, “Squash” etc. (nouns only, i.e. not “to squash”)
"Lamb's lettuce" | “Lamb’s lettuce”
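To make the table above concrete, here is a deliberately simplified matcher that only understands the "lower" and "text" token attributes. spaCy’s Matcher supports many more attributes, and real matching depends on the model’s tokenizer, so treat this purely as an illustration:

```python
# Minimal sketch of token-pattern matching over a pre-tokenized text.
# Only the "lower" and "text" attributes are handled here.
def matches_at(tokens, pattern, i):
    if i + len(pattern) > len(tokens):
        return False
    for spec, token in zip(pattern, tokens[i:i + len(pattern)]):
        if "lower" in spec and token.lower() != spec["lower"]:
            return False
        if "text" in spec and token != spec["text"]:
            return False
    return True

def find_matches(tokens, pattern):
    return [(i, i + len(pattern)) for i in range(len(tokens)) if matches_at(tokens, pattern, i)]

tokens = "I love Goji Berry smoothies".split()  # naive whitespace tokenization
print(find_matches(tokens, [{"lower": "goji"}, {"lower": "berry"}]))  # [(2, 4)]
print(find_matches(tokens, [{"text": "goji"}]))  # [] (case-sensitive, no match)
```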

Prodigy will assign each entry a hash ID based on its label and pattern. This ID will be used when updating the matcher from an existing dataset, e.g. using ner.match with the --resume flag. If you want to use your own pattern ID system, you can define an "id" value on each entry, which will be respected by Prodigy.

Images

Images can be loaded via any of the data formats that support keyed inputs. If the data contains an image key, it will be copied over into the annotation task, and you’ll be able to use it with the image or classification interface.

data.jsonl

{"image": "https://media.giphy.com/media/LHZyixOnHwDDy/giphy.gif"}

Alternatively, the Images loader lets you stream in image files from a directory. All images will be encoded as base64 data URIs and included as the image key.

from prodigy.components.loaders import Images
stream = Images("/path/to/images")

As of Prodigy v1.9.4, you can also use the ImageServer loader, which will serve the images via the Prodigy server and lets you bypass the base64 encoding. To use the loader from the command line, you can set --loader image-server.

from prodigy.components.loaders import ImageServer
stream = ImageServer("/path/to/images")

Fetching images from local paths and URLs

You can also use the fetch_images preprocessor to replace all image paths and URLs in your stream with base64 data URIs. The skip keyword argument lets you specify whether to skip invalid images that can’t be converted (for example, because the path doesn’t exist, or the URL can’t be fetched). If set to False, Prodigy will raise a ValueError if it encounters invalid images.

from prodigy.components.preprocess import fetch_images

stream = [{"image": "/path/to/image.jpg"}, {"image": "https://example.com/image.jpg"}]
stream = fetch_images(stream, skip=True)

Hashing and deduplication

When a new example comes in, Prodigy assigns it two hashes: the input hash and the task hash. Both hashes are uint32 values, so they can be stored as JSON with each task. Based on those hashes, Prodigy is able to determine whether two examples are entirely different, different questions about the same input (e.g. the same text), or the same question about the same input. For more details on how the hashes are generated and how to set custom hashes, see the set_hashes docs.

Hash | Type | Description
_input_hash | uint32 | Hash representing the input that annotations are collected on, e.g. the "text", "image" or "html". Examples with the same text will receive the same input hash.
_task_hash | uint32 | Hash representing the “question” about the input, i.e. the "label", "spans" or "options". Examples with the same text but different label suggestions or options will receive the same input hash, but different task hashes.
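The distinction can be illustrated with a toy hashing scheme. This is not Prodigy’s actual algorithm (see the set_hashes docs for that); the point is only that the input hash covers the input keys, while the task hash also covers the keys that make up the “question”:

```python
import hashlib
import json

# Toy illustration of input hash vs. task hash, not Prodigy's actual algorithm.
INPUT_KEYS = ("text", "image", "html")
TASK_KEYS = INPUT_KEYS + ("label", "spans", "options")

def simple_hash(task, keys):
    data = json.dumps({k: task[k] for k in keys if k in task}, sort_keys=True)
    # Truncate to 32 bits to mimic the uint32 range mentioned above
    return int(hashlib.sha1(data.encode("utf8")).hexdigest(), 16) % (2 ** 32)

eg1 = {"text": "This is a sentence.", "label": "POSITIVE"}
eg2 = {"text": "This is a sentence.", "label": "NEGATIVE"}

# Same input (same text): equal input hashes, different task hashes
assert simple_hash(eg1, INPUT_KEYS) == simple_hash(eg2, INPUT_KEYS)
assert simple_hash(eg1, TASK_KEYS) != simple_hash(eg2, TASK_KEYS)
```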

As of v1.9, recipes can return an optional "exclude_by" setting in their "config" to specify whether to exclude by "input" or "task" (default). Filtering and excluding by input hash is especially useful for manual and semi-manual workflows like ner.manual and ner.correct. If you’ve already annotated an example and it comes in again with suggestions from a model or pattern, Prodigy will correctly determine that it’s a different “question”. However, unlike in the binary workflows, you typically don’t want to see the example again, because you already created a gold-standard annotation for it.
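For example, a custom recipe can opt into filtering by input hash by including the setting in its returned "config". A minimal sketch, where the stream and view ID are placeholders:

```python
# Sketch of the components dictionary a custom recipe would return, with
# "exclude_by" set to "input" instead of the default "task".
def build_components(dataset, stream):
    return {
        "dataset": dataset,
        "stream": stream,
        "view_id": "ner_manual",  # placeholder view ID
        "config": {"exclude_by": "input"},  # filter out already-annotated inputs
    }

components = build_components("my_dataset", [{"text": "hello"}])
```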


File loaders

Out of the box, Prodigy currently supports loading in data from single files of JSONL, JSON, CSV or plain text. You can specify the loader via the --loader argument on the command line. If no loader is set, Prodigy will use the file extension to pick the respective loader. Loaders are available via prodigy.components.loaders.

ID | Component | Description
jsonl | JSONL | Stream in newline-delimited JSON from a file. Prodigy’s preferred format, as it’s flexible and doesn’t require parsing the entire file.
json | JSON | Stream in JSON from a file. Requires loading and parsing the entire file.
csv | CSV | Stream in a CSV file using the csv module. The keys will be read off the headers in the first line. Supports an optional delimiter keyword argument.
txt | TXT | Stream in plain text from a file containing one example per line. Will yield tasks containing only a text property.
images | Images | Stream in images from a directory. All images will be encoded as base64 data URIs and included as the image key to be rendered with the image interface.
image-server | ImageServer | New: v1.9.4. Stream in images from a directory. Image files will be served via the Prodigy server and their data won’t be included with the task.

Example

from prodigy.components.loaders import JSONL, JSON, CSV, TXT, Images, ImageServer

jsonl_stream = JSONL("path/to/file.jsonl")
json_stream = JSON("path/to/file.json")
csv_stream = CSV("path/to/file.csv", delimiter=",")
txt_stream = TXT("path/to/file.txt")
img_stream = Images("path/to/images")
img_stream2 = ImageServer("path/to/images")

Example

prodigy ner.manual your_dataset en_core_web_sm /tmp/your_data.dump --loader txt --label PERSON,ORG

Corpus loaders

Prodigy also supports converting data from popular datasets and corpora.

ID | Component | Description
reddit | Reddit | Stream in examples from a file of the Reddit corpus. Will extract, clean and validate the comments.

Example

from prodigy.components.loaders import Reddit

reddit_stream = Reddit("path/to/reddit.bz2")

Loading from standard input

If the source argument on the command line is set to -, Prodigy will read from sys.stdin. This lets you pipe data forward. If you’re loading data in a different format, make sure to set the --loader argument on the command line so Prodigy knows how to interpret the incoming data.


cat ./your_data.jsonl | prodigy ner.manual your_dataset en_core_web_sm - --loader jsonl

Loading text files from a directory or custom format

A custom loader should be a function that loads your data and yields dictionaries in Prodigy’s JSON format. If you’re writing a custom recipe, you can implement your loading in your recipe function:

recipe.py (pseudocode)

import prodigy

@prodigy.recipe("custom-recipe-with-loader")
def custom_recipe_with_loader(dataset, source):
    stream = load_your_source_here(source)  # implement your custom loading
    return {"dataset": dataset, "stream": stream, "view_id": "text"}

Using custom loaders with built-in recipes

If you want to use a built-in recipe like ner.manual but load in data from a custom source, there’s usually no need to copy-paste the recipe script only to replace the loader. Instead, you can write a loader script that outputs the data, and then pipe that output forward. If the source argument on the command line is set to -, Prodigy will read from sys.stdin:


python load_data.py | prodigy ner.manual your_dataset en_core_web_sm -

All your custom loader script needs to do is load the data somehow, create annotation tasks in Prodigy’s format (e.g. dictionary with a "text" key) and print the dumped JSON.

load_data.py (pseudocode)

from pathlib import Path
import json

data_path = Path("/path/to/directory")
for file_path in data_path.iterdir():  # iterate over the directory
    with file_path.open("r", encoding="utf8") as lines:  # open each file
        for line in lines:
            task = {"text": line.strip()}  # create one task per line of text
            print(json.dumps(task))  # dump and print the JSON

This approach works for any file format and data type – for example, you could also load in data from a different database or via an API. For extra convenience, you can also wrap your loader in a custom recipe and let Prodigy take care of adding the command-line interface. If a custom recipe doesn’t return a dictionary of components, Prodigy won’t start the server and will simply execute the code.

load_data.py (pseudocode)

import prodigy

@prodigy.recipe("load-data")  # add argument annotations and shortcuts if needed
def load_data(dir_path):
    ...  # the loader code here: print one JSON task per line

You can then use your custom loader like this:


prodigy load-data /path/to/directory -F load_data.py | prodigy ner.manual your_dataset en_core_web_sm -

Live APIs

API loaders are similar to file format loaders, but stream in content via a web API – for example, news headlines or teasers for a topic or from a specific publication, images for a search term or related tags. Individual APIs differ in the type of content they provide and the respective rate limit restrictions. All APIs supported by Prodigy come with a free license option and should provide sufficient rate limits for use on a single machine.

API loaders are available via prodigy.components.loaders, and using them requires an entry for the loader ID in the "api_keys" section of your prodigy.json. The value of the source argument on the command line is used as the API query.
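A prodigy.json with API keys might look like this. The plain string values are placeholders, and the Twitter entry uses the dict format described below; treat the exact shape as a sketch, not a definitive reference.

```json
{
  "api_keys": {
    "nyt": "YOUR_API_KEY",
    "twitter": {
      "consumer_key": "...",
      "consumer_secret": "...",
      "access_token": "...",
      "access_token_secret": "..."
    }
  }
}
```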

ID | Loader | Description
nyt | NewYorkTimes | The New York Times API.
guardian | Guardian | The Guardian API.
zeit | Zeit | Die Zeit API (German).
newsapi | NewsAPI | News API.
twitter | Twitter | Twitter API. Requires the API key to be a dict with consumer_key, consumer_secret, access_token and access_token_secret.
tumblr | Tumblr | Tumblr API. Returns images.
github | GitHub | GitHub API. Doesn’t require an API key.
unsplash | Unsplash | Unsplash API. Returns images.