Loaders and Input Data
Loaders are helper classes that turn a source file into an iterable stream.
Calling a loader returns a generator that yields annotation tasks in Prodigy’s
JSON format. Prodigy supports streaming in data from a variety of different
formats via the get_stream utility.
To load data from other formats or sources, like a database or an API, you can
write your own loader function that returns an iterable stream, and include it
in your custom recipe.
Input data formats
Text sources
data.jsonl
{"text": "This is a sentence."}
{"text": "This is another sentence.", "meta": {"score": 0.1}}
data.json
[
  { "text": "This is a sentence." },
  { "text": "This is another sentence.", "meta": { "score": 0.1 } }
]
data.csv
Text,Label,Meta
This is a sentence.,POSITIVE,0
This is another sentence.,NEGATIVE,0.1
Column headers can be lowercase or title case. Columns for label and meta are
optional. The value of the meta column will be added as a "meta" key within
the "meta" dictionary, e.g. {"text": "...", "meta": {"meta": 0}}.
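The column rules above can be sketched in plain Python. This is only an illustration built on the standard csv module, not Prodigy's own CSV loader: the function name csv_rows_to_tasks is hypothetical, and values stay strings here because csv does not convert types.

```python
import csv
import io


def csv_rows_to_tasks(csv_text):
    """Yield task dicts from CSV text, following the column rules above."""
    reader = csv.DictReader(io.StringIO(csv_text))
    for row in reader:
        # Normalize headers so "Text" and "text" are treated the same
        row = {key.lower(): value for key, value in row.items()}
        task = {"text": row["text"]}
        if row.get("label"):
            task["label"] = row["label"]
        if row.get("meta"):
            # The meta column's value is nested under a "meta" key in "meta"
            task["meta"] = {"meta": row["meta"]}
        yield task


example = (
    "Text,Label,Meta\n"
    "This is a sentence.,POSITIVE,0\n"
    "This is another sentence.,NEGATIVE,0.1\n"
)
tasks = list(csv_rows_to_tasks(example))
```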
data.txt
This is a sentence.
This is another sentence.
PARQUET New: 1.12
Comparison files
Each entry in a comparison file needs to include an output key containing the
annotation example data – for example, the text, a text and an entity span or an
image. Optionally, you can also include an input key for the baseline
annotation. The id is used to combine the examples from each file. If an ID is
only present in one file, the example is skipped.
model_a.jsonl
{"id": 0, "input": {"text": "NLP"}, "output": {"text": "Natural Language Processing"}}
{"id": 1, "input": {"text": "Hund"}, "output": {"text": "dog"}}
model_b.jsonl
{"id": 0, "input": {"text": "NLP"}, "output": {"text": "Neuro-Linguistic Programming"}}
{"id": 1, "input": {"text": "Hund"}, "output": {"text": "hound"}}
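To make the ID matching concrete, here is a minimal sketch of how two comparison files could be combined. It is not Prodigy's implementation: the function name merge_comparison and the "a" and "b" keys of the merged records are assumptions made for illustration.

```python
def merge_comparison(examples_a, examples_b):
    """Pair up examples that share an "id"; skip IDs present in only one file."""
    by_id_b = {eg["id"]: eg for eg in examples_b}
    for eg_a in examples_a:
        eg_b = by_id_b.get(eg_a["id"])
        if eg_b is None:
            continue  # ID only exists in the first file, so skip it
        yield {
            "id": eg_a["id"],
            "input": eg_a.get("input"),  # optional baseline
            "a": eg_a["output"],  # hypothetical key for the first output
            "b": eg_b["output"],  # hypothetical key for the second output
        }


model_a = [
    {"id": 0, "input": {"text": "NLP"}, "output": {"text": "Natural Language Processing"}},
    {"id": 1, "input": {"text": "Hund"}, "output": {"text": "dog"}},
    {"id": 2, "input": {"text": "Katze"}, "output": {"text": "cat"}},
]
model_b = [
    {"id": 0, "input": {"text": "NLP"}, "output": {"text": "Neuro-Linguistic Programming"}},
    {"id": 1, "input": {"text": "Hund"}, "output": {"text": "hound"}},
]
merged = list(merge_comparison(model_a, model_b))
```

IDs that only occur in the second file are never visited by the loop, so they are skipped as well.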
Match patterns
Match patterns can be used in recipes like ner.manual, textcat.teach or match
to filter out specific entities you’re interested in – for example, to collect
training data for a new entity type. You can also generate a subset of data for
any downstream task with filter-by-patterns. To convert a dataset of seed terms
to a JSONL pattern file, you can use the terms.to-patterns recipe.
Each entry should contain a "label" and "pattern" key. A pattern can be an
exact string, or a rule-based token pattern (used by spaCy’s Matcher class),
consisting of a list of dictionaries, each describing one individual token and
its attributes. When using token patterns, keep in mind that their
interpretation depends on the model’s tokenizer.
patterns.jsonl
{"label": "FRUIT", "pattern": [{"lower": "apple"}]}
{"label": "FRUIT", "pattern": [{"lower": "goji"}, {"lower": "berry"}]}
{"label": "VEGETABLE", "pattern": [{"lower": "squash", "pos": "NOUN"}]}
{"label": "VEGETABLE", "pattern": "Lamb's lettuce"}
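As a rough illustration of how seed terms map to entries like these, the sketch below builds case-insensitive token patterns from a list of terms. It is not the terms.to-patterns recipe: the helper name terms_to_patterns is hypothetical, and splitting on whitespace is a simplifying assumption, since real token boundaries depend on the model’s tokenizer.

```python
import json


def terms_to_patterns(terms, label):
    """Yield one token pattern per seed term, matching on the lowercase form."""
    for term in terms:
        # Assumption: whitespace splitting stands in for real tokenization
        tokens = term.split()
        yield {"label": label, "pattern": [{"lower": tok.lower()} for tok in tokens]}


patterns = list(terms_to_patterns(["apple", "Goji berry"], "FRUIT"))
# One JSON object per line, as in a JSONL pattern file
jsonl_lines = [json.dumps(pattern) for pattern in patterns]
```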
Here are some examples of match patterns and the respective matched strings. For more details, see the spaCy documentation on rule-based matching.
Pattern | Matches |
---|---|
[{"lower": "apple"}] | “apple”, “APPLE”, “Apple”, “ApPlLe” etc. |
[{"text": "apple"}] | “apple” |
[{"lower": "squash", "pos": "NOUN"}] | “squash”, “Squash” etc. (nouns only, i.e. not “to squash”) |
"Lamb's lettuce" | “Lamb’s lettuce” |
Images
Images can be loaded from a URL or base64 data URI via any of the data formats
that support keyed inputs or from a directory of files. See the details on
file loaders for how to load images from a directory and the image and
image_manual docs for details on the expected JSON format.
Audio and video New: 1.10
Audio and video data can be loaded from a URL or base64 data URI via any of the
data formats that support keyed inputs or from a directory of files. See the
details on file loaders for how to load audio and video files from a directory
and the audio and audio_manual docs for details on the expected JSON format.
File loaders
Out of the box, Prodigy currently supports loading in data from single files of
JSONL, JSON, CSV or plain text. You can specify the loader via the --loader
argument on the command line. If no loader is set, Prodigy will use the file
extension to pick the respective loader. Loaders are available via
prodigy.components.loaders.
ID | Component | Description |
---|---|---|
jsonl | JSONL | Stream in newline-delimited JSON from a file. Prodigy’s preferred format, as it’s flexible and doesn’t require parsing the entire file. |
json | JSON | Stream in JSON from a file. Requires loading and parsing the entire file. |
csv | CSV | Stream in a CSV file using the csv module. The keys will be read off the headers in the first line. Supports an optional delimiter keyword argument. |
txt | TXT | Stream in plain text from a file containing one example per line. Will yield tasks containing only a text property. |
images | Images | Stream in images from a directory. All images will be encoded as base64 data URIs and included as the image key to be rendered with the image or image_manual interface. |
image-server | ImageServer | New: 1.9.4 Stream in images from a directory. Image files will be served via the Prodigy server and their data won’t be included with the task. |
audio | Audio | New: 1.10 Stream in audio files from a directory. All files will be encoded as base64 data URIs and included as the audio key to be rendered with the audio or audio_manual interface. |
audio-server | AudioServer | New: 1.10 Stream in audio files from a directory. Audio files will be served via the Prodigy server and their data won’t be included with the task. |
video | Video | New: 1.10 Stream in video files from a directory. All files will be encoded as base64 data URIs and included as the video key to be rendered with the audio or audio_manual interface. |
video-server | VideoServer | New: 1.10 Stream in video files from a directory. Video files will be served via the Prodigy server and their data won’t be included with the task. |
pages | Pages | New: 1.17 Load collections of files to annotate with the multi-page pages interface. Supports all other loader types for the content, which can be set as pages:json or pages:images on the command line. |
Example
from prodigy.components.loaders import JSONL, JSON, CSV, TXT, Images, ImageServer, Pages

jsonl_stream = JSONL("path/to/file.jsonl")
json_stream = JSON("path/to/file.json")
csv_stream = CSV("path/to/file.csv", delimiter=",")
txt_stream = TXT("path/to/file.txt")
img_stream = Images("path/to/images")
img_stream2 = ImageServer("path/to/images")
pages_stream = Pages("path/to/files")
Media loader APIs
The Images, ImageServer, Audio, AudioServer, Video and VideoServer loaders all
follow the same API and accept a path to a directory of files and optional file
extensions.
Argument | Type | Description |
---|---|---|
f | str | Path to directory of files. |
file_ext | tuple | New: 1.10 File extensions to load. All other files will be ignored. See also the default file extensions. |
YIELDS | dict | The annotation tasks with the loaded data. |
As of v1.10, Prodigy also exposes more generic Base64 and Server loaders
that can be used to implement loading for other file types.
Argument | Type | Description |
---|---|---|
f | str | Path to directory of files. |
input_key | str | The key of the task dict to assign the string or URL to, e.g. "image" or "audio". |
file_ext | tuple | File extensions to consider. If None (default), all files in the directory will be loaded. |
YIELDS | dict | The annotation tasks with the loaded data. |
Default media file extensions
Images | (".jpg", ".jpeg", ".png", ".gif", ".svg") |
Audio | (".mp3", ".m4a", ".wav") |
Video | (".mpeg", ".mpg", ".mp4") |
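For intuition about what a base64-style loader does with a directory, the following is a minimal sketch using only the standard library. It is not the Base64 loader itself: the function name base64_tasks and the "file" entry in "meta" are assumptions for illustration, while the f, input_key and file_ext roles mirror the table above.

```python
import base64
import mimetypes
import tempfile
from pathlib import Path


def base64_tasks(dir_path, input_key, file_ext=None):
    """Yield one task per file, with the file encoded as a base64 data URI."""
    for path in sorted(Path(dir_path).iterdir()):
        if not path.is_file():
            continue
        # If file_ext is None, every file in the directory is loaded
        if file_ext is not None and path.suffix.lower() not in file_ext:
            continue
        mime_type, _ = mimetypes.guess_type(path.name)
        encoded = base64.b64encode(path.read_bytes()).decode("ascii")
        data_uri = f"data:{mime_type or 'application/octet-stream'};base64,{encoded}"
        # The "meta" content is a hypothetical choice, not Prodigy's format
        yield {input_key: data_uri, "meta": {"file": path.name}}


# Small demonstration with throwaway files
with tempfile.TemporaryDirectory() as tmp_dir:
    Path(tmp_dir, "photo.png").write_bytes(b"not a real image")
    Path(tmp_dir, "notes.txt").write_text("ignored", encoding="utf8")
    demo_tasks = list(base64_tasks(tmp_dir, "image", file_ext=(".png",)))
```

Because the file contents travel inside each task, this approach suits small files; the server-style loaders described above avoid that by serving the files instead.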
Fetching images from local paths and URLs
You can also use the fetch_media preprocessor to replace all local paths and
URLs in your stream with base64 data URIs. The skip keyword argument lets you
specify whether to skip invalid files that can’t be converted (for example,
because the path doesn’t exist, or the URL can’t be fetched). If set to False,
Prodigy will raise a ValueError if it encounters invalid files.
from prodigy.components.preprocess import fetch_media
stream = [{"image": "/path/to/image.jpg"}, {"image": "https://example.com/image.jpg"}]
stream = fetch_media(stream, ["image"], skip=True)
Pages loader New: 1.17
The Pages loader lets you load collections of files to annotate with the multi-page pages UI. If it receives a directory of text files, a task is created for each file with one page per entry in the text file. If it receives subdirectories of images, it will create one task per directory with one page per image. For more custom pages, e.g. combinations of different interfaces, you can also skip the loader and create the respective JSON format yourself in the recipe.
Argument | Type | Description |
---|---|---|
f | str | Path to file or directory. |
view_id | str | The interface ID to use for the individual pages. |
loader | str | Name of loader to use for the content. If not set, Prodigy will try to infer it from the file extension. |
YIELDS | dict | The paginated annotation tasks. |
from prodigy.components.loaders import Pages

text_stream = Pages("/path/to/file.jsonl", view_id="text")
image_stream = Pages("/path/to/directory", view_id="image_manual", loader="images")
To specify the content type on the command line, you can add it after a :, for example --loader pages:jsonl or pages:images. If no content type is provided, Prodigy will try to infer it from the file extension.
Loading paginated text files
If one or more JSON or JSONL files are provided, the loader will look for a "_page" key specifying the page index. Whenever a record starts at page 0 again, a new paginated example will be created. For plain text files, pages are separated with a single line break and paginated examples with two line breaks.
📄 data.jsonl
{"text": "Doc one, page one", "_page": 0}
{"text": "Doc one, page two", "_page": 1}
{"text": "Doc two, page one", "_page": 0}
{"text": "Doc two, page two", "_page": 1}
📄 data.txt
Doc one, page one
Doc one, page two

Doc two, page one
Doc two, page two
JSON tasks
{"pages": [{"text": "Doc one, page one", "view_id": "text"}, {"text": "Doc one, page two", "view_id": "text"}]}
{"pages": [{"text": "Doc two, page one", "view_id": "text"}, {"text": "Doc two, page two", "view_id": "text"}]}
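The grouping rule for "_page" can be sketched as follows. This is an illustration of the behavior described above rather than the Pages loader’s code, and the function name paginate is hypothetical.

```python
def paginate(records, view_id="text"):
    """Group flat records into paginated tasks, starting a new task at page 0."""
    pages = []
    for record in records:
        record = dict(record)  # copy so the caller's data is not modified
        page_index = record.pop("_page", None)
        if page_index == 0 and pages:
            # A record at page 0 closes the previous paginated example
            yield {"pages": pages}
            pages = []
        pages.append({**record, "view_id": view_id})
    if pages:
        yield {"pages": pages}


records = [
    {"text": "Doc one, page one", "_page": 0},
    {"text": "Doc one, page two", "_page": 1},
    {"text": "Doc two, page one", "_page": 0},
    {"text": "Doc two, page two", "_page": 1},
]
paginated = list(paginate(records))
```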
Loading paginated image, audio and video files
If a directory of subdirectories is provided, e.g. for images, an annotation task will be created for each subdirectory. The subdirectory name will be preserved in the example’s "meta". Each file in the subdirectory will become a page. On the command line, you can set --loader to pages:images or pages:image-server to specify how the media content should be loaded and served.
Directory structure
📂 images
┣━━ 📂 Document one
┃   ┣━━ 📄 image1.jpg
┃   ┣━━ 📄 image2.jpg
┃   ┗━━ 📄 image3.jpg
┗━━ 📂 Document two
    ┣━━ 📄 image4.jpg
    ┗━━ 📄 image5.jpg
JSON task
{
  "pages": [
    {"image": "image1.jpg", "view_id": "image_manual"},
    {"image": "image2.jpg", "view_id": "image_manual"},
    {"image": "image3.jpg", "view_id": "image_manual"}
  ],
  "meta": {"title": "Document one"}
}
If you’re working with multi-page PDFs, the recipe included with the Prodigy-PDF plugin now also supports paginated loading out of the box, so you can annotate one document per task view from a directory of PDFs.
Corpus loaders
Prodigy also supports converting data from popular datasets and corpora.
ID | Component | Description |
---|---|---|
reddit | Reddit | Stream in examples from a file of the Reddit corpus. Will extract, clean and validate the comments. |
Example
from prodigy.components.loaders import Reddit

reddit_stream = Reddit("path/to/reddit.bz2")
Loading from existing datasets New: 1.10
The dataset: syntax lets you specify an existing dataset as the input source.
Prodigy will then load the annotations from the dataset and stream them in
again. Annotation interfaces respect pre-defined annotations and will
pre-select them in the UI. This is useful if you want to re-annotate a dataset
to correct it, or if you want to add new information with a different
interface. The following command will stream in annotations from the dataset
ner_data and save the resulting reannotated data in a new dataset ner_data_new:
Optionally, you can also add another : plus the value of the answer to load if
you only want to load examples with specific answers like "accept" or "ignore".
For example, you may want to re-annotate difficult questions you previously
skipped by hitting ignore. Similarly, if you’re using rel.manual to assign
relations to pre-annotated spans, you typically only want to load in accepted
answers.
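Conceptually, restricting the input to one answer amounts to a filter over the saved examples. The sketch below shows that idea on plain dictionaries; it is not how Prodigy implements the dataset: syntax, and the function name filter_by_answer is hypothetical.

```python
def filter_by_answer(examples, answer=None):
    """Yield only examples with the given answer, or all if no answer is set."""
    for eg in examples:
        if answer is None or eg.get("answer") == answer:
            yield eg


saved = [
    {"text": "Easy example", "answer": "accept"},
    {"text": "Difficult example", "answer": "ignore"},
    {"text": "Wrong suggestion", "answer": "reject"},
]
skipped = list(filter_by_answer(saved, "ignore"))
everything = list(filter_by_answer(saved))
```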
Loading from standard input
If the source argument on the command line is set to -, Prodigy will read
from sys.stdin. This lets you pipe data forward. If you’re loading data in a
different format, make sure to set the --loader argument on the command line
so Prodigy knows how to interpret the incoming data.
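As a rough picture of what reading piped JSONL involves, here is a small sketch that parses one JSON object per line from any text stream. It only illustrates the idea and is not Prodigy’s reader; read_jsonl is a hypothetical helper, and io.StringIO stands in for sys.stdin so the example runs on its own.

```python
import io
import json


def read_jsonl(stream):
    """Yield one parsed JSON object per non-empty line of a text stream."""
    for line in stream:
        line = line.strip()
        if line:
            yield json.loads(line)


# In a real script you would pass sys.stdin instead of this stand-in
piped = io.StringIO(
    '{"text": "This is a sentence."}\n'
    "\n"
    '{"text": "This is another sentence.", "meta": {"score": 0.1}}\n'
)
piped_tasks = list(read_jsonl(piped))
```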
Loading text files from a directory or custom format
A custom loader should be a function that loads your data and yields dictionaries in Prodigy’s JSON format. If you’re writing a custom recipe, you can implement your loading in your recipe function:
recipe.py
import prodigy

@prodigy.recipe("custom-recipe-with-loader")
def custom_recipe_with_loader(dataset, source):
    stream = load_your_source_here(source)  # implement your custom loading
    return {"dataset": dataset, "stream": stream, "view_id": "text"}
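As one possible shape for such a loading function, the sketch below streams rows from a SQLite database as task dictionaries. Everything specific here is an assumption for illustration: the function name load_from_sqlite, the docs table with its id and body columns, and the "row_id" entry in "meta".

```python
import sqlite3


def load_from_sqlite(conn, query):
    """Yield one task per row; the query must return (id, text) pairs."""
    for row_id, text in conn.execute(query):
        # Keep the row ID in "meta" so annotations can be traced back
        yield {"text": text, "meta": {"row_id": row_id}}


# In-memory database standing in for a real data source
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE docs (id INTEGER PRIMARY KEY, body TEXT)")
conn.executemany(
    "INSERT INTO docs (body) VALUES (?)",
    [("This is a sentence.",), ("This is another sentence.",)],
)
db_tasks = list(load_from_sqlite(conn, "SELECT id, body FROM docs ORDER BY id"))
conn.close()
```

Because the function is a generator, rows are only fetched as the stream is consumed, which keeps memory use low for large tables.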
Using custom loaders with built-in recipes
If you want to use a built-in recipe like ner.manual but load in data
from a custom source, there’s usually no need to copy-paste the recipe script
only to replace the loader. Instead, you can write a loader script that outputs
the data, and then pipe that output forward. If the source argument on the
command line is set to -, Prodigy will read from sys.stdin:
All your custom loader script needs to do is load the data somehow, create
annotation tasks in Prodigy’s format (e.g. a dictionary with a "text" key) and
print the dumped JSON.
load_data.py
from pathlib import Path
import json

data_path = Path("/path/to/directory")
for file_path in data_path.iterdir():  # iterate over directory
    with file_path.open("r", encoding="utf8") as lines:  # open file
        for line in lines:
            task = {"text": line.strip()}  # one task per line, without the trailing newline
            print(json.dumps(task))  # dump and print the JSON
This approach works for any file format and data type – for example, you could also load in data from a different database or via an API. For extra convenience, you can also wrap your loader in a custom recipe and have Prodigy take care of adding the command-line interface. If a custom recipe doesn’t return a dictionary of components, Prodigy won’t start the server and will just execute the code.
load_data.py
import prodigy

@prodigy.recipe("load-data")  # add argument annotations and shortcuts if needed
def load_data(dir_path):
    # the loader code here
    ...
You can then use your custom loader like this:
Hashing and deduplication
When a new example comes in, Prodigy assigns it two hashes: the input hash
and the task hash. Both hashes are integers, so they can be stored as JSON
with each task. Based on those hashes, Prodigy is able to determine whether two
examples are entirely different, different questions about the same input, e.g.
text, or the same question about the same input. For more details on how the
hashes are generated and how to set custom hashes, see the set_hashes docs.
Hash | Type | Description |
---|---|---|
_input_hash | int | Hash representing the input that annotations are collected on, e.g. the "text", "image" or "html". Examples with the same text will receive the same input hash. |
_task_hash | int | Hash representing the “question” about the input, i.e. the "label", "spans" or "options". Examples with the same text but different label suggestions or options will receive the same input hash, but different task hashes. |
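The relationship between the two hashes can be illustrated with a toy version. This is not how Prodigy computes its hashes: the function toy_set_hashes, the use of SHA-1 truncated to 32 bits, and the exact key lists are all assumptions chosen to keep the sketch self-contained.

```python
import hashlib
import json

INPUT_KEYS = ("text", "image", "html")
TASK_KEYS = ("label", "spans", "options")


def _to_int_hash(values):
    """Hash any JSON-serializable value to a 32-bit integer."""
    payload = json.dumps(values, sort_keys=True).encode("utf8")
    return int.from_bytes(hashlib.sha1(payload).digest()[:4], "big")


def toy_set_hashes(task):
    """Return a copy of the task with toy input and task hashes added."""
    input_values = {key: task[key] for key in INPUT_KEYS if key in task}
    task_values = {key: task[key] for key in TASK_KEYS if key in task}
    hashed = dict(task)
    hashed["_input_hash"] = _to_int_hash(input_values)
    # The task hash builds on the input hash plus the "question" keys
    hashed["_task_hash"] = _to_int_hash([hashed["_input_hash"], task_values])
    return hashed


first = toy_set_hashes({"text": "Apple pie", "label": "FRUIT"})
second = toy_set_hashes({"text": "Apple pie", "label": "VEGETABLE"})
third = toy_set_hashes({"text": "Goji berry", "label": "FRUIT"})
```

Here first and second share an input hash because the text is identical, while their task hashes differ because the label differs.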
As of v1.9, recipes can return an optional "exclude_by" setting in their
"config" to specify whether to exclude by "input" or "task" (default).
Filtering and excluding by input hash is especially useful for manual and
semi-manual workflows like ner.manual and ner.correct. If you’ve already
annotated an example and it comes in again with suggestions from a model or
pattern, Prodigy will correctly determine that it’s a different “question”.
However, unlike in the binary workflows, you typically don’t want to see the
example again, because you already created a gold-standard annotation for it.
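The difference between the two "exclude_by" modes can be sketched as a filter over a stream of hashed tasks. This is a simplified illustration, not Prodigy’s filtering code: the function name exclude_seen is hypothetical and the hash values are made up.

```python
def exclude_seen(stream, annotated, exclude_by="task"):
    """Skip incoming tasks whose input or task hash was already annotated."""
    if exclude_by not in ("input", "task"):
        raise ValueError("exclude_by must be 'input' or 'task'")
    key = f"_{exclude_by}_hash"
    seen = {eg[key] for eg in annotated}
    for task in stream:
        if task[key] not in seen:
            yield task


annotated = [{"text": "Apple pie", "_input_hash": 1, "_task_hash": 10}]
incoming = [
    # Same text as before, but with a new model suggestion
    {"text": "Apple pie", "spans": [{"start": 0, "end": 5, "label": "FRUIT"}],
     "_input_hash": 1, "_task_hash": 11},
    {"text": "Goji berry", "_input_hash": 2, "_task_hash": 20},
]
kept_by_task = list(exclude_seen(incoming, annotated, exclude_by="task"))
kept_by_input = list(exclude_seen(incoming, annotated, exclude_by="input"))
```

Excluding by task keeps both incoming examples, since the suggestion makes the first one a new question. Excluding by input drops it, which matches the manual workflow case described above.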