Database

By default, Prodigy uses SQLite to store annotations in a simple database file in your Prodigy home directory. If you want to use the default database with its default settings, no further configuration is required and you can start using Prodigy straight away. Alternatively, you can choose to use Prodigy with MySQL or PostgreSQL, or write your own custom recipe to plug in any other storage solution.

Prodigy uses the peewee package to manage the database integration. This gives you a lot of flexibility in terms of setup and debugging, and allows you to use more advanced features via the Playhouse extension.

Prodigy uses the database to store all annotations by project and annotation session. Even if you’re only using Prodigy with a model in the loop, you’ll usually want a record of the collected annotations as a backup, or to use them as evaluation data. Prodigy is also very powerful without a model in the loop – for example to bootstrap word lists or collect feedback on the output of two models and generate evaluation data for machine translation or image classification.

Note that you don’t need to import your raw data into the database before you can work with it — the database only exists to store collected annotations. You can read in the raw data straight from a file or a custom source. As your data comes in, Prodigy will assign hashes to the examples, so it’ll always be able to tell whether an example has been annotated already or not.
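Prodigy’s actual hashing logic is internal, but the idea can be sketched with the standard library: an input hash derived from the raw content only, and a task hash that also covers the question being asked. The helper names below are illustrative, not Prodigy’s API.

```python
import hashlib
import json

def make_hash(data, keys):
    # Illustrative stand-in for Prodigy's hashing: digest selected keys
    content = json.dumps({k: data[k] for k in keys if k in data}, sort_keys=True)
    return int(hashlib.sha1(content.encode("utf8")).hexdigest()[:8], 16)

def add_hashes(example):
    # Input hash covers the raw input; task hash also covers the question asked
    example["_input_hash"] = make_hash(example, ("text",))
    example["_task_hash"] = make_hash(example, ("text", "label"))
    return example

eg1 = add_hashes({"text": "hello world", "label": "GREETING"})
eg2 = add_hashes({"text": "hello world", "label": "FAREWELL"})
assert eg1["_input_hash"] == eg2["_input_hash"]  # same input
assert eg1["_task_hash"] != eg2["_task_hash"]    # different tasks
```

Two tasks over the same input share an input hash but get distinct task hashes, which is how duplicate questions can be detected without re-reading the dataset.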

If you have existing annotations, you can convert them to Prodigy’s format for the given task and use the db-in command to import them to the database.


Setup

SQLite

SQLite is the default database option and stores all annotations in a flat SQLite database file. Unless otherwise specified, the database file is created in the Prodigy home directory.

prodigy.json

{
  "db": "sqlite",
  "db_settings": {
    "sqlite": {
      "name": "prodigy.db",
      "path": "/custom/path"
    }
  }
}
Setting | Description | Default
name | Database file name. To only store the database in memory, use :memory:. | prodigy.db
path | Path to the directory containing the database file. | PRODIGY_HOME
… | Any other sqlite3 connection parameters. | -

MySQL

The settings for a MySQL database can take any MySQLdb or PyMySQL connection parameters, depending on your database driver.

prodigy.json

{
  "db": "mysql",
  "db_settings": {
    "mysql": {
      "host": "localhost",
      "user": "username",
      "passwd": "xxx",
      "db": "prodigy"
    }
  }
}

PostgreSQL

The settings for a PostgreSQL database can take any psycopg2 connection parameters.

prodigy.json

{
  "db": "postgresql",
  "db_settings": {
    "postgresql": {
      "dbname": "prodigy",
      "user": "username",
      "password": "xxx"
    }
  }
}

Permissions and user privileges

When setting up your database integration, you might not want to give your user permission to perform all operations. Prodigy uses the following operations, some of which are optional and only required for certain commands or initial setup:

Operation | Required | Details
SELECT | yes | Retrieving datasets and annotations.
INSERT | yes | Adding datasets and annotations.
UPDATE | yes | Updating datasets and annotations.
DELETE | optional | Deleting datasets. Only used for the prodigy drop command, so the permission can be omitted if you don’t need this feature or prefer to delete records manually.
CREATE | optional | Creation of the tables Dataset, Example and Link. Not required if you create them manually.

To test your database connection, you can also write a simple Python script that connects to Prodigy’s database and performs the most important operations. For more details, check out this thread on the forum.

test_database.py

from prodigy.components.db import connect

examples = [{"text": "hello world", "_task_hash": 123, "_input_hash": 456}]

db = connect()                               # uses settings from prodigy.json
db.add_dataset("test_dataset")               # add dataset
assert "test_dataset" in db                  # check that dataset was added
db.add_examples(examples, ["test_dataset"])  # add examples to dataset
dataset = db.get_dataset("test_dataset")     # retrieve a dataset
assert len(dataset) == 1                     # check that examples were added

Tables

Here are the tables Prodigy creates and how they map to the annotations you collect. You typically shouldn’t have to interact with the database and its tables directly.

Table | Description
Dataset | The dataset / session IDs and meta information.
Example | The individual annotation examples. Each example is only added once, so if you add the same annotation to multiple datasets, it’ll only have one record here.
Link | Example IDs linked to datasets. This is how Prodigy knows which examples belong to which datasets and sessions.
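The real schema is created and managed by peewee, but a simplified sqlite3 sketch (the column names here are illustrative, not the exact schema) shows how the Link table resolves the many-to-many relationship between examples and datasets:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE dataset (id INTEGER PRIMARY KEY, name TEXT, session INTEGER);
    CREATE TABLE example (id INTEGER PRIMARY KEY, task_hash INTEGER, content TEXT);
    CREATE TABLE link (example_id INTEGER REFERENCES example(id),
                       dataset_id INTEGER REFERENCES dataset(id));
""")
# One user-created dataset plus one session, sharing a single stored example
conn.execute("INSERT INTO dataset VALUES (1, 'ner_news', 0)")
conn.execute("INSERT INTO dataset VALUES (2, 'ner_news-session', 1)")
conn.execute("INSERT INTO example VALUES (1, 456, '{\"text\": \"hello\"}')")
conn.executemany("INSERT INTO link VALUES (?, ?)", [(1, 1), (1, 2)])

# The Link table tells us which datasets an example belongs to
names = [row[0] for row in conn.execute(
    "SELECT d.name FROM link l JOIN dataset d ON d.id = l.dataset_id "
    "WHERE l.example_id = 1 ORDER BY d.id"
)]
```

The single Example row appears in both datasets via two Link rows — which is why adding the same annotation to multiple datasets doesn’t duplicate it.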

Importing existing data

To use existing annotations collected with other tools in Prodigy, you can import them via the db-in command. You can import data in any file format supported by Prodigy, but JSON or JSONL is usually recommended, as it gives you more flexibility. By default, all examples will be set to "answer": "accept". You can specify a different answer using the --answer argument on the command line.


prodigy db-in new_dataset /path/to/data.jsonl
Imported 1550 annotations to 'new_dataset'.
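For example, annotations exported from another tool can be written out as JSONL — one JSON task per line — ready for db-in. The file path and task fields below are made up for illustration:

```python
import json
import tempfile
from pathlib import Path

# Annotations from another tool, converted to Prodigy's task format
annotations = [
    {"text": "hello world", "label": "GREETING", "answer": "accept"},
    {"text": "goodbye", "label": "GREETING", "answer": "reject"},
]

# JSONL: one self-contained JSON object per line
path = Path(tempfile.mkdtemp()) / "data.jsonl"
path.write_text("\n".join(json.dumps(eg) for eg in annotations) + "\n",
                encoding="utf8")

loaded = [json.loads(line) for line in path.read_text(encoding="utf8").splitlines()]
```

Setting "answer" explicitly, as here, preserves existing decisions; if the key is omitted, db-in falls back to the default "accept" (or whatever --answer specifies).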

Database class

Prodigy’s database model is included with the library and is available via prodigy.components.db.

connect function

Prodigy provides a simple connection helper that takes care of connecting to one of the built-in database options using the database ID and database settings. If no database config is provided, it will be read from your prodigy.json settings, defaulting to 'sqlite' with the standard settings.

Argument | Type | Description
db_id | str | ID of the database, i.e. 'sqlite', 'postgresql' or 'mysql'. Defaults to 'sqlite'.
db_settings | dict | Database-specific settings. If not provided, settings will be read from the prodigy.json.
RETURNS | Database | The database.
from prodigy.components.db import connect
db = connect("sqlite", {"name": "my_db.db"})

Database.__init__ method

Argument | Type | Description
db | peewee.Database | The peewee database. Will be available as the db attribute of the database.
display_id | str | ID of the database, e.g. 'sqlite'. Will be available as the db_id attribute of the database. For custom databases plugged in by the user, the ID defaults to 'custom'.
display_name | str | Human-readable name of the database, e.g. 'SQLite'. Will be available as the db_name attribute. For custom database modules, the display name defaults to the function name, class name, or repr(db).
RETURNS | Database | The database.

To plug in a custom database, you can initialize the Database class with a custom instance of peewee.Database, or one from its extension package Playhouse, for example:

import prodigy
from prodigy.components.db import Database
from playhouse.postgres_ext import PostgresqlExtDatabase

psql_db = PostgresqlExtDatabase("my_database", user="postgres")
db = Database(psql_db, "postgresql", "Custom PostgreSQL Database")

@prodigy.recipe("recipe-with-custom-db")
def recipe_with_custom_db():
    return {"db": db}

Database.__len__ method

Argument | Type | Description
RETURNS | int | The number of datasets in the database.

Database.__contains__ method

Argument | Type | Description
name | str | Dataset ID.
RETURNS | bool | Whether the dataset exists in the database.

Database.datasets property

Argument | Type | Description
RETURNS | list | A list of all dataset IDs in the database (excluding sessions).

Database.sessions property

Argument | Type | Description
RETURNS | list | A list of all session IDs in the database (excluding user-created datasets).

Database.reconnect method

Reconnect to the database. Called on API requests to avoid timeout issues, especially with MySQL. If the database connection is still open, it will be closed before reconnecting.

Database.close method

Close the database connection (if not already closed). Called after API requests to avoid timeout issues, especially with MySQL.

Database.get_dataset method

Argument | Type | Description
name | str | The dataset ID.
RETURNS | list | A list of all examples in the dataset.

Database.get_meta method

Argument | Type | Description
name | str | The dataset ID.
RETURNS | dict | The meta data associated with the dataset.

Database.get_examples method

Argument | Type | Description
ids | list | A list of task IDs.
by | str | Field to select examples by. Defaults to task_hash.
RETURNS | list | The examples with the given IDs.

Database.get_input_hashes method

Argument | Type | Description
*names | list | Dataset IDs.
RETURNS | set | A set of input hashes for examples of the given datasets.

Database.get_task_hashes method

Argument | Type | Description
*names | list | Dataset IDs.
RETURNS | set | A set of task hashes for examples of the given datasets.
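These hash sets are typically used to skip examples that have already been annotated when loading a new stream. A pure-Python sketch of that filtering, with made-up hash values standing in for the set a call like db.get_task_hashes("my_dataset") would return:

```python
def filter_seen(stream, seen_task_hashes):
    # Yield only tasks whose _task_hash isn't in the database yet
    for eg in stream:
        if eg["_task_hash"] not in seen_task_hashes:
            yield eg

seen = {111, 222}                       # e.g. db.get_task_hashes("my_dataset")
stream = [
    {"text": "a", "_task_hash": 111},   # already annotated -> skipped
    {"text": "b", "_task_hash": 333},   # new task -> kept
]
remaining = list(filter_seen(stream, seen))
```

Because membership tests on a set are O(1), this scales to large datasets without re-reading stored annotations.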

Database.add_dataset method

Argument | Type | Description
name | str | The dataset ID.
meta | dict | The dataset meta.
session | bool | Whether the dataset is a session.
RETURNS | list | The dataset.

Database.add_examples method

Argument | Type | Description
examples | list | A list of annotated examples.
datasets | list | A list of datasets to add the examples to.

Database.link method

Argument | Type | Description
name | str | The dataset ID.
example_ids | list | IDs of examples to link to the dataset.

Database.unlink method

Argument | Type | Description
example_ids | list | IDs of examples to unlink.

Database.drop_dataset method

Argument | Type | Description
name | str | The dataset ID.
batch_size | int | Delete examples in batches of the given size.
RETURNS | bool | Whether the dataset was successfully deleted.

Implementing a custom database

Custom recipes let you return an optional "db" component. If it’s not set, it will default to the database specified in your prodigy.json or to "sqlite". The database plugged in via a custom recipe can also be False (to not use any DB) or a custom class that follows Prodigy’s Database API.

@prodigy.recipe('custom-recipe')
def custom_recipe():
    return {'db': YourCustomDB}  # etc.

Essentially, all your custom class needs to do is expose methods to add and retrieve datasets and annotated examples. For instance:

class YourCustomDB(object):
    def __init__(self, *args, **kwargs):
        # initialize your custom database
        pass

    def get_dataset(self, name, default=None):
        # get examples for a given dataset name
        pass

    # other methods and properties

How your database handler resolves those queries is entirely up to you. If your database class implements the methods reconnect and close, Prodigy will call those on each request to the REST API, allowing you to explicitly manage the connection and prevent timeouts between requests.
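Putting this together, a minimal in-memory database satisfying the core of this API might look like the sketch below. It is a toy backed by a plain dict, not a production implementation — a real one would also cover sessions, meta data, and persistence:

```python
class InMemoryDB:
    # Toy stand-in for Prodigy's Database API, backed by a plain dict

    def __init__(self, *args, **kwargs):
        self._datasets = {}  # dataset name -> list of example dicts

    def __contains__(self, name):
        return name in self._datasets

    def __len__(self):
        return len(self._datasets)

    def add_dataset(self, name, meta=None, session=False):
        self._datasets.setdefault(name, [])

    def add_examples(self, examples, datasets):
        for name in datasets:
            self._datasets.setdefault(name, []).extend(examples)

    def get_dataset(self, name, default=None):
        return self._datasets.get(name, default)

    def get_task_hashes(self, *names):
        return {eg["_task_hash"]
                for name in names for eg in self._datasets.get(name, [])}

db = InMemoryDB()
db.add_examples([{"text": "hi", "_task_hash": 1}], ["test_dataset"])
```

A recipe could then return {"db": InMemoryDB()} to use it in place of the built-in storage.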