Prodigy Modal

To streamline the process of running Prodigy in a cloud environment, we’ve developed an integration with the serverless cloud platform Modal. Modal is a natural fit for Prodigy, as it can be fully configured directly from Python, allowing for seamless interaction with the Prodigy backend, as well as custom recipes, data loaders, and databases. The Prodigy Modal plugin offers the following recipes:

  • modal.setup: Generates a Python file containing the Modal script with the cloud function definition. This script can be used directly with Modal’s modal run, modal serve and modal deploy commands.
  • modal.run: A wrapper around modal run. It generates a Modal script and executes a Prodigy command as an ephemeral app on the Modal cloud. The ephemeral app stops when the Prodigy command exits. This recipe is intended for terminal-only Prodigy commands, like train or the model.annotate family of recipes.
  • modal.serve: A wrapper around modal serve. It generates a Modal script and serves Prodigy as an ephemeral app in Modal’s cloud. The cloud function stops if Prodigy raises an error. This recipe is useful for testing the Prodigy endpoint on Modal before fully deploying it.
  • modal.deploy: A wrapper around modal deploy. It generates a Modal script and deploys the Prodigy app as a web endpoint in the Modal cloud.

Installation

This plugin is designed specifically for use with the Modal cloud service provider. Ensure you have a Modal account and the Modal client installed before proceeding. You can find detailed setup instructions here. To use the Modal plugin, make sure you have Prodigy installed with the appropriate extras:

python -m pip install prodigy[company] --extra-index-url https://XXXX-XXXX-XXXX-XXXX@download.prodi.gy

Quickstart

I want to deploy Prodigy remotely in a single command.

If you are happy with the standard setup of the Modal infrastructure, you can use Prodigy modal.deploy to directly deploy Prodigy in the Modal cloud. You just need to make sure that all the files required by your recipe live in a single --assets folder. Please see the modal.deploy command for details.

I want to customize my Modal deployment setup.

If you need to modify parts of the Prodigy Modal script, e.g. to change the default container image or specify the GPU count, you can use modal.setup to generate the starter script and edit it according to your needs. When ready, you can use the Modal CLI: run modal serve path/to/modal/script to serve Prodigy as an ephemeral app in development mode, or modal deploy path/to/modal/script to deploy it as a web endpoint.

How do I get started with Modal?
  1. Head over to the Modal website and create a user account. You can skip the usual first step of installing the Modal client, as it comes pre-installed with the Prodigy company plugins.
  2. Set up a Modal API token for authentication:

    Modal setup

    python -m modal setup
  3. Run a Prodigy Modal recipe

    Run a recipe

    dotenv run -- python -m prodigy modal.serve "textcat.manual dataset ./data/input.jsonl --label TECH,SOCIAL" --assets ./data

Check the terminal for the URL of the Modal dashboard, where you can find the URL at which the Prodigy UI can be accessed. See the sections below for additional information on database persistence and security considerations.

Using the Prodigy Modal Plugin

Using Prodigy with Modal generally consists of two steps:

  1. Generate a Python script that defines the cloud function to be executed in Modal.
  2. Trigger the cloud function. Depending on the recipe used, the function can be:
    • launched as a web endpoint in development mode (modal.serve)
    • deployed as a full-fledged app (modal.deploy)
    • executed as a command without starting a web endpoint (modal.run)

These steps can be decoupled in case further customization of the Modal script is required. In such scenarios, the modal.setup recipe should be used. For a Modal-native experience, you can also use the script generated by the modal.setup recipe directly with the Modal CLI. It is recommended to first test the deployment in development mode before sharing the web endpoint link with annotators.

Creating a Modal setup file

To create a Modal script file without executing anything, use the modal.setup recipe:

Creating a setup file

dotenv run -- python -m prodigy modal.setup "textcat.manual dataset ./data/input.jsonl --label TECH,SOCIAL" --assets ./data

This command generates a modal_setup.py file in the current working directory (or in the directory specified with the --output argument). This file contains the specification of the Modal cloud function as well as instructions on how the Prodigy image should be built and how to store any outputs, such as model weights when Prodigy train is used.
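If you prefer to trigger modal.setup from a Python script instead of typing the command by hand, the main thing to get right is that the inner Prodigy command stays a single argument. The following is a minimal sketch under that assumption. build_modal_setup_argv is a hypothetical helper, not part of Prodigy, and it only uses the --assets and --output arguments described above.

```python
import shlex
import sys


def build_modal_setup_argv(prodigy_command, assets, output=None):
    """Assemble the argument list for `python -m prodigy modal.setup`.

    Hypothetical helper for illustration: it only builds the list and
    does not call Prodigy or Modal.
    """
    argv = [
        sys.executable, "-m", "prodigy", "modal.setup",
        prodigy_command,  # the whole Prodigy command is ONE argument
        "--assets", assets,
    ]
    if output is not None:
        argv += ["--output", output]
    return argv


argv = build_modal_setup_argv(
    "ner.manual ner_dataset_v1 blank:en ./data/news_headlines.jsonl --label PER,ORG",
    assets="./data",
    output="./modal",
)
# shlex.join shows the shell-quoted form of the same command line.
print(shlex.join(argv))
```

Passing the list to subprocess.run(argv, check=True) would execute the recipe. The environment variables still need to be available to that process, for example by wrapping the call in dotenv run -- as in the commands in this document.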

Storage

By default, a Modal volume named prodigy-vol is created at the remote path /vol/prodigy_data/. Model weights are stored in the models directory within this volume. Users can create additional folders relative to the prodigy-vol path (/vol/prodigy_data). This path can be used as the output folder in any Prodigy command. See the terminal-only commands section for more details on custom storage. It’s also possible to interact with Modal volumes via the web-based Modal dashboard or via the Modal CLI.

Importantly, all resources such as input JSONL files, local base models, spaCy config files and extra Python code (used with -F) should be placed in the directory specified with the --assets argument. This tells Prodigy which resources should be copied over to the Modal Prodigy image.
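A quick local check can catch files that were left outside the assets directory before anything is sent to Modal. The sketch below is an illustration only, not part of the plugin: paths_outside_assets is a hypothetical helper, and it relies on the simplifying assumption that local paths in the command start with "./" or "../".

```python
import shlex
from pathlib import Path


def paths_outside_assets(prodigy_command, assets_dir):
    """Return path-like tokens of the command that are not inside assets_dir.

    Heuristic for illustration only: a token counts as a local path if it
    starts with "./" or "../". The files do not need to exist.
    """
    assets = Path(assets_dir).resolve()
    outside = []
    for token in shlex.split(prodigy_command):
        if not token.startswith(("./", "../")):
            continue  # dataset names, labels, flags, model names, ...
        resolved = Path(token).resolve()
        if resolved != assets and assets not in resolved.parents:
            outside.append(token)
    return outside


command = (
    "ner.manual ner_dataset_v1 blank:en ./data/news_headlines.jsonl "
    "--label PER,ORG -F ./recipes/my_custom_recipe.py"
)
# The recipe file lives in ./recipes, so it would not be copied with --assets ./data.
print(paths_outside_assets(command, "./data"))
```

An empty list means every path the heuristic recognized sits under the assets directory. Paths written without a leading "./" are not detected by this sketch.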

Similarly, when using spacy-llm workflows with a cache, it’s worth pointing the cache directory to a path on the volume, e.g.:

spacy-llm config excerpt

[...]
[components.llm.cache]
@llm_misc = "spacy.BatchCache.v1"
path = "/vol/prodigy_data/llm-cache"
batch_size = 3
max_batches_in_mem = 10

Deploying Prodigy as a web endpoint on Modal

To create the setup file and immediately deploy Prodigy on Modal in “production” mode, use the modal.deploy recipe:

Deploy Prodigy on Modal

dotenv run -- python -m prodigy modal.deploy "ner.manual.custom ner_dataset_v2 en_core_web_lg ./data/news_headlines.jsonl --label PER,ORG -F ./data/my_custom_recipe.py" --assets ./data

This command creates the Modal setup file and triggers the cloud function deploying your Prodigy annotation project on Modal. With this recipe, you can optionally persist the Modal script. Please note that the script will serialize all environment variable values, so it should be considered a sensitive file.

After a successful deployment, the modal.deploy recipe will output a URL where your Prodigy instance is accessible. You can share this URL with your annotators to begin the annotation process. The Prodigy URL is available either in the terminal stdout or on the Modal dashboard.

It is recommended to test the Prodigy endpoint before fully deploying it. For this, you can use modal.serve. This recipe will serve Prodigy as an ephemeral app with the -dev prefix in the URL. Prodigy app logs will be forwarded to your local terminal, and you can stop the app with Ctrl+C. The ephemeral app will automatically stop if Prodigy throws an error, overriding the default Modal serve behavior, which would restart the function.

For a fully deployed app, the process is automatically detached from your local terminal, and you can inspect the logs, as well as start and stop the app, from your Modal dashboard.

Persisting the database

Modal can serve your Prodigy instance and store your data and model weights while the app is running, but it won’t persist the default Prodigy SQLite database. For this reason, it is recommended to set up a remote Postgres database that works independently of Modal. We recommend Neon for this purpose. It’s quick to set up and you can instruct Prodigy to use it via prodigy.json:

prodigy.json (excerpt)

{
  "db": "postgresql",
  "db_settings": {
    "postgresql": {
      "dbname": "prodigy",
      "user": "prodigy_owner",
      "password": "XXXXXX",
      "host": "your-neon-host.aws.neon.tech"
    }
  }
}
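If a prodigy.json with other settings already exists, the database block can be merged in rather than written from scratch. This is a minimal sketch using only the keys shown in the excerpt above. with_postgres_settings is a hypothetical helper, and the existing dictionary stands in for whatever your file already contains.

```python
import json


def with_postgres_settings(config, host, password,
                           dbname="prodigy", user="prodigy_owner"):
    """Return a copy of a prodigy.json-style dict with PostgreSQL settings set.

    Other top-level keys and other entries under "db_settings" are kept.
    """
    updated = dict(config)
    updated["db"] = "postgresql"
    db_settings = dict(updated.get("db_settings", {}))
    db_settings["postgresql"] = {
        "dbname": dbname,
        "user": user,
        "password": password,
        "host": host,
    }
    updated["db_settings"] = db_settings
    return updated


existing = {"feed_overlap": True}  # stand-in for settings already in prodigy.json
config = with_postgres_settings(
    existing, host="your-neon-host.aws.neon.tech", password="XXXXXX"
)
print(json.dumps(config, indent=2))
```

The result can be written back with json.dump. Since it contains the database password, the file should be handled like any other credential.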

Executing terminal-only recipes in Modal

For Prodigy commands that do not require an annotation server (e.g., train), modal.run should be used to avoid launching the web endpoint when it’s not needed.

Run training command

dotenv run -- python -m prodigy modal.run "train experiment_123 --ner ner-dataset --config ./local_assets/train_123.cfg --base-model en_core_web_lg --gpu-id 0" --assets ./local_assets --require-gpu --detach

Similar to modal.serve and modal.deploy, this recipe generates the script and uses the modal run command to execute it.

As with modal.serve, the logs are tailed to the local terminal. For long-running functions, such as training, it is recommended to use the --detach option to prevent the function from stopping if the local terminal is shut down. Detached functions can be controlled from the Modal dashboard. Unlike undetached functions, which stop the Modal container when the function exits, detached functions must be explicitly stopped from the dashboard.

By default, trained models are stored in the prodigy-vol Modal volume, under the models folder. However, you can define a custom folder relative to the default path. For example, models trained with the command above will be stored in prodigy-vol/models/experiment_123. This path is relevant when browsing volumes in the Modal dashboard under the Storage tab.

If a model stored in a Modal volume is supposed to be used as the base model for another training experiment, you can specify its programmatic path on the volume like so: /vol/prodigy_data/models/experiment_123, where /vol/prodigy_data/ corresponds to the prodigy-vol volume:

Run training command with base model

dotenv run -- python -m prodigy modal.run "train experiment_124 --ner ner-dataset --config ./local_assets/train_124.cfg --base-model /vol/prodigy_data/models/experiment_123/model-best --gpu-id 0" --assets ./local_assets --require-gpu --detach
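The two ways of writing the same location (relative to prodigy-vol in the dashboard, absolute inside the container) are easy to mix up. The sketch below shows the translation, assuming the default mount point /vol/prodigy_data stated above. remote_path is a hypothetical helper and would need adjusting if you change the volume settings in the Modal script.

```python
from pathlib import PurePosixPath

# Mount point of the default prodigy-vol volume, as stated in this document.
VOLUME_MOUNT = PurePosixPath("/vol/prodigy_data")


def remote_path(volume_relative_path):
    """Translate a path inside prodigy-vol into the path seen by the container."""
    relative = PurePosixPath(volume_relative_path.lstrip("/"))
    return str(VOLUME_MOUNT / relative)


# Path as shown in the dashboard -> path to pass as the base model.
print(remote_path("models/experiment_123/model-best"))
# /vol/prodigy_data/models/experiment_123/model-best
```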

As mentioned above, Modal volumes come with a full CLI, where files can be listed, copied, moved and removed across different directories and volumes.

Example workflows

Common setup: DB configuration

As mentioned in the DB persistence section, it is necessary to configure Prodigy to use a persistent database outside the Modal environment. The recommended setup is a cloud PostgreSQL database, such as Neon. You can find information on setting up Prodigy with a PostgreSQL database here. Make sure you provide the DB config in the global or local prodigy.json file.

Common setup: Environment variables definition

Prodigy Modal recipes require the Prodigy license to be provided as the PRODIGY_LICENSE_KEY environment variable. Similarly, all other environment variables that your recipe requires, such as API keys or SSO variables, should be exported and available in your local environment. The easiest way to manage this is by defining all environment variables in a .env file and using dotenv to load them before invoking modal.setup, modal.run, modal.serve, or modal.deploy.
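Before invoking any of the recipes, it can help to verify that the expected variables are actually set. The following sketch is an assumption-laden illustration: missing_env_vars is a hypothetical helper, and the list of required names is an example that depends on your own recipe. Only PRODIGY_LICENSE_KEY is named as mandatory in this document.

```python
import os


def missing_env_vars(required, environ=None):
    """Return the names from `required` that are unset or empty.

    `environ` defaults to the current process environment; passing a dict
    makes the check easy to try out without touching real variables.
    """
    env = os.environ if environ is None else environ
    return [name for name in required if not env.get(name)]


# Example list: the license key plus the basic auth pair used in Example 1.
required = [
    "PRODIGY_LICENSE_KEY",
    "PRODIGY_BASIC_AUTH_USER",
    "PRODIGY_BASIC_AUTH_PASS",
]
example_env = {
    "PRODIGY_LICENSE_KEY": "XXXX-XXXX-XXXX-XXXX",
    "PRODIGY_BASIC_AUTH_USER": "user",
}
print(missing_env_vars(required, example_env))
# ['PRODIGY_BASIC_AUTH_PASS']
```

Calling missing_env_vars(required) without the second argument checks the real environment, so it reflects whatever dotenv has loaded for the current process.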

Example 1: Deploy a built-in Prodigy recipe for multiple annotators with basic authentication

  1. Define the target Postgres DB and the required overlap in the global or local prodigy.json configuration file:

    prodigy.json (excerpt)

    {
      "db": "postgresql",
      "db_settings": {
        "postgresql": {
          "dbname": "prodigy",
          "user": "prodigy_owner",
          "password": "XXXXXX",
          "host": "your-db-host"
        }
      },
      "feed_overlap": true
    }
  2. Enable HTTP basic authentication via environment variables defined in the .env file:

    .env

    PRODIGY_LICENSE_KEY="XXXX-XXXX-XXXX-XXXX"
    PRODIGY_ALLOWED_SESSIONS="alice,bob,steve"
    PRODIGY_BASIC_AUTH_PASS="password"
    PRODIGY_BASIC_AUTH_USER="user"
    PRODIGY_LOGGING="basic"
  3. Create a local assets directory with all local resources required by the recipe, such as the input dataset and patterns files.

  4. Call modal.serve to test the Prodigy instance.

    dotenv run -- python -m prodigy modal.serve "ner.manual ner_dataset_v1 blank:en ./local_assets/news_headlines.jsonl --label PER,ORG" --assets local_assets
  5. Check the terminal output for the links to the Modal dashboard and Prodigy web endpoint.

Since we have constrained the allowed sessions via the PRODIGY_ALLOWED_SESSIONS env var, the annotator will have to access the Prodigy server with the ?session=XXX attached to the base URL, e.g. https://YOUR_URL.modal.run/?session=alice.
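To hand each annotator a ready-made link, the session URLs can be derived from the same comma-separated value used for PRODIGY_ALLOWED_SESSIONS. This is a small sketch rather than a Prodigy feature: session_urls is a hypothetical helper, and the base URL is the placeholder from the example above.

```python
from urllib.parse import urlencode


def session_urls(base_url, allowed_sessions):
    """Map each allowed session name to its Prodigy URL.

    `allowed_sessions` uses the same comma-separated format as the
    PRODIGY_ALLOWED_SESSIONS environment variable.
    """
    names = [name.strip() for name in allowed_sessions.split(",") if name.strip()]
    base = base_url.rstrip("/")
    return {name: f"{base}/?{urlencode({'session': name})}" for name in names}


urls = session_urls("https://YOUR_URL.modal.run", "alice,bob,steve")
for name, url in urls.items():
    print(name, url)
# alice https://YOUR_URL.modal.run/?session=alice
# bob https://YOUR_URL.modal.run/?session=bob
# steve https://YOUR_URL.modal.run/?session=steve
```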

You can inspect the logs and start or stop the Prodigy web app either from your terminal or from the Modal dashboard. Once the testing is finished, you can deploy the app with modal.deploy.

Example 2: Train a transformer-based spaCy pipeline on GPU

  1. Define the target Postgres DB in the global or local prodigy.json configuration file:

    prodigy.json (excerpt)

    {
      "db": "postgresql",
      "db_settings": {
        "postgresql": {
          "dbname": "prodigy",
          "user": "prodigy_owner",
          "password": "XXXXXX",
          "host": "your-db-host"
        }
      }
    }
  2. Provide the Prodigy license key and other environment variables via the .env file:

    .env

    PRODIGY_LICENSE_KEY="XXXX-XXXX-XXXX-XXXX"
    PRODIGY_LOGGING="basic"
  3. Create a local assets directory with all local resources required by the recipe, such as the spaCy training config file. If you need any third-party libraries, make sure to specify them in requirements.txt and pass the path to it via the --requirements argument. spacy[cuda12x,transformers] is installed by default in the image if the --require-gpu flag is used.

  4. Since the Prodigy web application is not required for training the model, call modal.run with the --detach option to detach it from the local terminal:

    dotenv run -- python -m prodigy modal.run "train --ner ner_dataset_v1 --config ./assets/train.cfg --label-stats --verbose --gpu-id 0" --assets assets --require-gpu --modal-gpu a100 --detach

Note that we are not specifying the output directory of the train recipe. The model weights will be saved to a Modal volume. The default volume name is prodigy-vol (you can change it by editing the Modal script). If you specify an output directory for the models, it is created relative to the default models directory in the prodigy-vol volume.

Once the training has finished, you can access the trained models either via the Modal dashboard or the Modal CLI. For example, you can download the models to your machine:

Download trained model

modal volume get prodigy-vol models/model-best ./
⠙ Downloading file(s) to local...
Downloading file(s) to local... 0:00:11 ━━━━━━━━━━━━━━━ (12 out of 13 files completed)
model-best/transformer/model ━━━━━━━━━━━━━━━ 0.0% • 0.0/502.0 MB • ? • -:--:--

You can interact with volumes either through the Modal dashboard or via the Modal CLI, for example to list the current content of prodigy-vol. For more commands to operate on volumes, see the Modal volume CLI docs.

List volume content

modal volume ls prodigy-vol
┏━━━━━━━━━━━━┳━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━┓
┃ Filename ┃ Type ┃ Created/Modified ┃ Size ┃
┡━━━━━━━━━━━━╇━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━┩
│ models │ dir │ 2024-08-26 13:51 CEST │ 47 B │
│ other │ dir │ 2024-08-26 13:51 CEST │ 47 B │
└────────────┴──────┴───────────────────────┴──────┘

Example 3: Deploy a custom Prodigy recipe with SSO

  1. Define the target Postgres DB in the global or local prodigy.json configuration file:

    prodigy.json (excerpt)

    {
      "db": "postgresql",
      "db_settings": {
        "postgresql": {
          "dbname": "prodigy",
          "user": "prodigy_owner",
          "password": "XXXXXX",
          "host": "your-db-host"
        }
      }
    }
  2. Provide the Prodigy license key and SSO environment variables via the .env file:

    .env

    PRODIGY_LICENSE_KEY="XXXX-XXXX-XXXX-XXXX"
    PRODIGY_OIDC_AUTH_ENABLED="1"
    PRODIGY_DEPLOYED_URL="https://MODAL_SPACENAME--ner-annotation-project.modal.run"
    PRODIGY_OIDC_DISCOVERY_URL="https://your-oidc-provider/.well-known/openid-configuration"
    PRODIGY_OIDC_CLIENT_ID="xxxxxxxxxxxxxxx"
    PRODIGY_OIDC_CLIENT_SECRET="xxxxxxxxxxx"
    PRODIGY_LOGGING="basic"

    PRODIGY_DEPLOYED_URL will be the URL generated by Modal. It can be customized using the --modal-function-label option. Modal offers custom domains for team and enterprise plans. A custom domain can be configured from the Prodigy CLI as well, via the --modal-custom-domains option.

  3. Create a local assets directory with all local resources required by the recipe, such as the custom Prodigy recipe code. Make sure that PRODIGY_DEPLOYED_URL, i.e. the URL generated by Modal for your Prodigy deployment, is included in the “allowed callback URLs” in your OIDC provider settings, as explained here.

  4. Call modal.deploy. Now, accessing the PRODIGY_DEPLOYED_URL should trigger the usual SSO workflow.

    dotenv run -- python -m prodigy modal.deploy "ner.manual ner_dataset_v1 blank:en ./assets/news_headlines.jsonl --label PER,ORG -F ./assets/my-custom-recipe.py" --assets assets
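Because PRODIGY_DEPLOYED_URL has to be known before the first deployment, it can be useful to compute the expected value up front. The sketch below assumes the URL follows the pattern visible in the .env example above (workspace name, two hyphens, function label). deployed_url is a hypothetical helper, and the URL that Modal prints after deployment remains the authoritative value.

```python
def deployed_url(workspace, function_label):
    """Build a URL of the form https://<workspace>--<label>.modal.run.

    The pattern is inferred from the example in this document and may not
    hold for custom domains or other Modal configurations.
    """
    return f"https://{workspace}--{function_label}.modal.run"


# Placeholder workspace name and the label from the .env example.
print(deployed_url("MODAL_SPACENAME", "ner-annotation-project"))
# https://MODAL_SPACENAME--ner-annotation-project.modal.run
```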

API

modal.setup

  • Interface: terminal only
  • Use case: generate a Modal script to execute Prodigy in a cloud environment

Generates a Modal script for executing a Prodigy command in Modal’s cloud service. It uses arguments provided via the CLI as well as environment variables to define the Prodigy container and the Prodigy app in Modal.

prodigy modal.setup command --assets --requirements --output --modal-gpu --modal-function-label --modal-custom-domains --server

Argument | Type | Description | Default
command | str | Prodigy command to run in the Modal environment. Don’t forget to quote it as a string. |
--assets, -A | str | Path to the directory that contains all data, models and code files required to run the recipe. | None
--requirements, -r | str | Path to the requirements file to be installed in the Modal environment. Should be relative to --assets. | None
--output, -O | str | Output directory to save the Modal script. If not provided, the files will be written to the current working directory. | None
--require-gpu, -G | bool | Require GPU to run the recipe. Necessary for workflows with transformers in the loop. | False
--modal-gpu, -MG | str | Modal GPU configuration shortcode. Ignored if --require-gpu is not set. | any
--modal-function-label, -MFL | str | Label for the Modal function. | ""
--modal-custom-domains, -MCD | str | Comma-separated custom domains for the Modal deployment. | []
--server, -S | bool | Deploy Prodigy as a web endpoint. | False
--timeout, -T | int | Timeout for the Modal non-web function in seconds. Only relevant when --server is not set. | 18000

Example

prodigy modal.setup "ner.manual ner_dataset_v1 blank:en ./data/news_headlines.jsonl --label PER,ORG" --assets data --server

modal.serve

  • Interface: terminal only
  • Use case: serve Prodigy app in Modal cloud environment

Serve Prodigy as an ephemeral app in Modal’s serverless cloud. Recommended for testing the Prodigy endpoint on Modal before fully deploying it.

prodigy modal.serve command --assets --requirements --persist --require-gpu --modal-gpu --modal-function-label --modal-custom-domains

Argument | Type | Description | Default
command | str | Prodigy command to execute in the Modal environment. Don’t forget to quote it as a string. |
--assets, -A | str | Path to the directory that contains all data, models and code files required to run the recipe. | None
--requirements, -r | str | Path to the requirements file to be installed in the Prodigy container in Modal. | None
--persist, -P | bool | Persist the Modal Python script (modal_script.py) on disk. If set, the Modal script will be stored in the current working directory. Otherwise, it will be written to a temporary directory and deleted. | False
--require-gpu, -G | bool | Require GPU to run the recipe. Necessary for workflows with transformers in the loop. | False
--modal-gpu, -MG | str | Modal GPU configuration shortcode. Ignored if --require-gpu is not set. | any
--modal-function-label, -MFL | str | Label for the Modal function. | ""
--modal-custom-domains, -MCD | str | Comma-separated custom domains for the Modal deployment. | []

Example

prodigy modal.serve "ner.manual ner_dataset_v1 blank:en ./data/news_headlines.jsonl --label PER,ORG" --assets data

modal.deploy

  • Interface: terminal only
  • Use case: deploy Prodigy app in Modal cloud environment

Deploy Prodigy as a web endpoint in Modal’s serverless cloud.

prodigy modal.deploy command --assets --requirements --persist --require-gpu --modal-gpu --modal-function-label --modal-custom-domains

Argument | Type | Description | Default
command | str | Prodigy command to execute in the Modal environment. Don’t forget to quote it as a string. |
--assets, -A | str | Path to the directory that contains all data, models and code files required to run the recipe. | None
--requirements, -r | str | Path to the requirements file to be installed in the Prodigy container in Modal. | None
--persist, -P | bool | Persist the Modal Python script (modal_script.py) on disk. If set, the Modal script will be stored in the current working directory. Otherwise, it will be written to a temporary directory and deleted. | False
--require-gpu, -G | bool | Require GPU to run the recipe. Necessary for workflows with transformers in the loop. | False
--modal-gpu, -MG | str | Modal GPU configuration shortcode. Ignored if --require-gpu is not set. | any
--modal-function-label, -MFL | str | Label for the Modal function. | ""
--modal-custom-domains, -MCD | str | Comma-separated custom domains for the Modal deployment. | []

Example

prodigy modal.deploy "ner.manual ner_dataset_v1 blank:en ./data/news_headlines.jsonl --label PER,ORG" --assets data

modal.run

  • Interface: terminal only
  • Use case: run Prodigy terminal command in Modal cloud environment

Run a Prodigy terminal command in Modal cloud without spinning up the annotation server.

prodigy modal.run command --assets --requirements --persist --require-gpu --modal-gpu --modal-function-label --detach

Argument | Type | Description | Default
command | str | Prodigy command to execute in the Modal environment. Don’t forget to quote it as a string. |
--assets, -A | str | Path to the directory that contains all data, models and code files required to run the recipe. | None
--requirements, -r | str | Path to the requirements file to be installed in the Prodigy container in Modal. | None
--persist, -P | bool | Persist the Modal Python script (modal_script.py) on disk. If set, the Modal script will be stored in the current working directory. Otherwise, it will be written to a temporary directory and deleted. | False
--require-gpu, -G | bool | Require GPU to run the recipe. Necessary for workflows with transformers in the loop. | False
--modal-gpu, -MG | str | Modal GPU configuration shortcode. Ignored if --require-gpu is not set. | any
--modal-function-label, -MFL | str | Label for the Modal function. | ""
--timeout, -T | int | Timeout for the Modal function in seconds. | 18000
--detach, -D | bool | Detach the Modal process from the local terminal. | False

Example

prodigy modal.run "train my_model --ner ner_dataset --base-model en_core_web_lg --config data/train.cfg" --assets data --detach