Text Classification

This workflow explains how to train and evaluate a text classification system using Prodigy. You can use this tutorial to solve problems such as sentiment analysis, chatbot intent detection, and flagging abusive or fraudulent behaviours. Prodigy makes text classification particularly powerful, because you can try out new ideas very quickly.

In this example, we'll stream issue titles from the GitHub API, and create a system to predict whether an issue is about the project documentation. This could be used to organize and process a project's backlog of issues, or create a bot to suggest tags for new issues. It took us 40 minutes to create over 830 annotations, including 20% evaluation examples, which was enough to give 87% accuracy. For details, see the evaluation and results.

prodigy dataset gh_issues "Classify issues on GitHub"
✨ Created dataset 'gh_issues'.

prodigy textcat.teach gh_issues en_core_web_sm "docs" --api github --label DOCUMENTATION
✨ Starting the web server on port 8080...

The first step is to initialize a new dataset, adding a quick description for future reference. The next command starts the annotation server. The textcat.teach subcommand tells Prodigy to run the built-in recipe function teach(), using the rest of the arguments supplied on the command line.

Once the server is running, you can open the front-end by visiting http://localhost:8080 in your browser, triggering a request for examples to annotate, which in this example will be filled by streaming data from the GitHub API. As you click through the examples, your decisions will be sent back to the server, to be saved in the database as annotations.

Annotating with Prodigy

Prodigy puts the model in the loop, so that it can actively participate in the training process. The model uses what it already knows to figure out what to ask you next, and is updated by the answers you provide, so that the system learns as you go. Most annotation tools avoid making any suggestions to the user, to avoid biasing the annotations. Prodigy takes the opposite approach: by making suggestions, it can ask the user as little as possible.

Screenshot of the Prodigy web application

Opening http://localhost:8080, we get a sequence of recent GitHub issue titles, displayed with our category as the title. If the category is correct, click accept, press a, or swipe left on a touch interface.

Improve documentation for service and token urls

If the category does not apply, click reject, press x, or swipe right.

Docker for Windows fails to start

Some examples are unclear or exceptions that you don't want the model to learn from. In these cases, you can click ignore or press space. In this project, we're ignoring all non-English text, as well as ambiguous titles.
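Under the hood, each decision is saved as a simple JSON record with an "answer" field. The field names below follow Prodigy's task format, but the records themselves are invented for illustration:

```python
import json

# Hypothetical annotation records: Prodigy stores one JSON object per
# line (JSONL), recording each decision in the "answer" field.
records = [
    {"text": "Improve documentation for service and token urls",
     "label": "DOCUMENTATION", "answer": "accept"},
    {"text": "Docker for Windows fails to start",
     "label": "DOCUMENTATION", "answer": "reject"},
]

# Serialize to JSONL, then read it back and keep the accepted texts.
jsonl = "\n".join(json.dumps(record) for record in records)
accepted = [json.loads(line)["text"] for line in jsonl.splitlines()
            if json.loads(line)["answer"] == "accept"]
print(accepted)  # ['Improve documentation for service and token urls']
```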


After around 40 minutes of annotating the stream of issue titles for the search queries "docs", "documentation", "readme" and "instructions", we end up with a total of 830 annotations that break down as follows:


Training from the annotations

When you use Prodigy with a model in the loop and exit the application, the collected annotations are stored in the database. Once all of the annotations are collected, better accuracy can usually be achieved by retraining from scratch. Although there are good techniques for streaming stochastic gradient descent, nothing works quite as well or nearly as simply as the standard batchwise approach.

After collecting a batch of annotations, you can train a model on them using the textcat.batch-train recipe. The training procedure makes several passes over the annotations, shuffling them each time.
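The procedure can be pictured as a plain shuffled minibatch loop. This is only a sketch of the idea, not Prodigy's actual implementation:

```python
import random

def batch_train(examples, update, n_iter=10, batch_size=32):
    """Illustrative batchwise training loop: make several passes over
    the annotations, shuffling them before each pass and sending one
    minibatch at a time to the update callback."""
    for _ in range(n_iter):
        random.shuffle(examples)
        for i in range(0, len(examples), batch_size):
            update(examples[i:i + batch_size])

# Toy usage: record the batches instead of training a real model.
seen = []
batch_train(list(range(100)), seen.append, n_iter=2, batch_size=32)
print(len(seen))  # 8 batches: 4 per pass over 100 examples, 2 passes
```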

prodigy textcat.batch-train gh_issues --output /tmp/model --eval-split 0.2
Loaded blank model
Using 20% of examples (156) for evaluation
Using 100% of remaining examples (674) for training

Correct     142
Incorrect   14
Baseline    0.65
Precision   0.87
Recall      0.87
F-score     0.87

Model: /tmp/model
Training data: /tmp/model/training.jsonl
Evaluation data: /tmp/model/evaluation.jsonl

After making any change to a statistical model, it's important to perform a numeric evaluation, on examples that haven't been used during training. If you don't provide an evaluation data set, Prodigy will split off a percentage of the examples as evaluation data. After batch training, the best model will be saved to the output directory, together with two JSONL files containing the training and evaluation data.
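Splitting off a percentage of examples amounts to shuffling and slicing the data. A minimal sketch, where the split_eval helper is hypothetical and only illustrates the idea:

```python
import random

def split_eval(examples, eval_split=0.2, seed=0):
    """Shuffle the examples and hold out a fraction for evaluation.
    Returns (training_examples, evaluation_examples)."""
    examples = list(examples)
    random.Random(seed).shuffle(examples)
    n_eval = int(len(examples) * eval_split)
    return examples[n_eval:], examples[:n_eval]

# With 830 annotations and a 0.2 split, 166 are held out.
train, evaluation = split_eval(range(830), eval_split=0.2)
print(len(train), len(evaluation))  # 664 166
```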

Interactive evaluation

You can also use the textcat.eval recipe to build up an evaluation set from scratch, using a text source of your choice. The recipe reads in examples from the source you describe, and uses your pre-trained model to classify them. This also lets you keep the evaluation set consistent across your experiments. The --exclude option lets you exclude examples from other datasets – for example, the training data – so they don't end up in the evaluation set.
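The exclusion logic boils down to skipping incoming examples that already appear in another dataset. A simplified sketch, matching on raw text rather than Prodigy's internal task hashes:

```python
def exclude_seen(stream, seen_texts):
    """Yield only the tasks whose text hasn't been annotated before.
    A stand-in for the --exclude behaviour, for illustration only."""
    for task in stream:
        if task["text"] not in seen_texts:
            yield task

# Texts already in the (hypothetical) training dataset.
training_texts = {"Improve documentation for service and token urls"}
stream = [{"text": "Improve documentation for service and token urls"},
          {"text": "Update install instructions"}]

remaining = [task["text"] for task in exclude_seen(stream, training_texts)]
print(remaining)  # ['Update install instructions']
```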

prodigy dataset gh_issues_eval "Evaluate GitHub text classifier"
✨ Created dataset 'gh_issues_eval'.

prodigy textcat.eval gh_issues_eval /tmp/model "docs" --api github --label DOCUMENTATION --exclude gh_issues
✨ Starting the web server on port 8080...

Exporting and using the model

After training the model, Prodigy outputs a ready-to-use spaCy model, making it easy to put into production.

Usage in spaCy v2.0

import spacy

nlp = spacy.load('/tmp/model')

doc = nlp('missing documentation')
print(doc.cats)
# {'DOCUMENTATION': 0.9812000393867493}

doc = nlp('docker container not loading')
print(doc.cats)
# {'DOCUMENTATION': 0.005252907983958721}

doc = nlp('installation not working on windows')
print(doc.cats)
# {'DOCUMENTATION': 0.0033084796741604805}
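The scores in doc.cats are per-label probabilities, so a production system typically applies a decision threshold to turn them into labels. The 0.5 cutoff below is an arbitrary choice for illustration, not something the model provides:

```python
def predict(cats, threshold=0.5):
    """Return the labels whose score clears the threshold.
    The 0.5 default is an arbitrary illustrative cutoff."""
    return [label for label, score in cats.items() if score >= threshold]

print(predict({'DOCUMENTATION': 0.981}))   # ['DOCUMENTATION']
print(predict({'DOCUMENTATION': 0.0052}))  # []
```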


Evaluation and results

Within the first hour of annotation, the system classified 140 out of the 156 evaluation issues correctly. To put this into some context, we have to look at the class balance of the data. In the evaluation data, 65% of the examples were labelled reject, i.e. they were tagged as not documentation issues. This gives a baseline accuracy of 65%, which the classifier easily exceeded.
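As a quick sanity check of the arithmetic, the baseline and model accuracy can be computed directly from those counts:

```python
n_eval = 156
n_correct = 140
n_reject = int(0.65 * n_eval)  # majority class: not documentation

# Always predicting "reject" gives the baseline; the model's accuracy
# is simply the fraction of evaluation issues it got right.
baseline = n_reject / n_eval
accuracy = n_correct / n_eval
print(round(baseline, 2), round(accuracy, 2))  # 0.65 0.9
```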

We can get some sense of how the system will improve as more data is annotated by retraining the system with fewer examples. The chart below shows the accuracy achieved with 25%, 50%, 75% and 100% of the training data. The last 25% of the training data brought a 3% improvement in accuracy, suggesting that annotating more examples will keep improving the system. Similar logic is used to estimate the progress indicator during training.

Using the textcat.train-curve recipe, you can output a training curve and get an idea for how the model is performing with different numbers of examples. The recipe outputs the best accuracy score for each training run, as well as the improvement in accuracy.
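The logic of a training curve is simple to sketch: retrain on growing slices of the data and report the change in accuracy at each step. The train_eval callback and the toy accuracy function below are stand-ins for illustration, not Prodigy's implementation:

```python
def train_curve(examples, train_eval, samples=(0.25, 0.5, 0.75, 1.0)):
    """Retrain on growing slices of the data and record the accuracy
    and its improvement at each sample point. `train_eval` is a
    hypothetical callback that trains on a slice and returns a score."""
    results, previous = [], 0.0
    for frac in samples:
        n = int(len(examples) * frac)
        accuracy = train_eval(examples[:n])
        results.append((frac, accuracy, accuracy - previous))
        previous = accuracy
    return results

# Toy scoring function that improves with more data, for illustration.
curve = train_curve(list(range(100)), lambda xs: len(xs) / 125)
for frac, accuracy, delta in curve:
    print(f"{frac:.0%}  {accuracy:.2f}  {delta:+.2f}")
```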

prodigy textcat.train-curve gh_issues --n-samples 4 --eval-split 0.2 --label DOCUMENTATION

%      ACCURACY
25%    0.73    +0.73
50%    0.82    +0.09
75%    0.84    +0.02
100%   0.87    +0.03

The recipe takes the same arguments as textcat.batch-train. You can also customise the number of samples using the --n-samples argument, for example, 10 for snapshots at 10%, 20%, 30% and so on.

Recipe details

A Prodigy recipe is a Python function that can be run via the command line. The built-in recipes are all built from components that you can also import and use yourself. Any function wrapped with the @recipe decorator can be used as a Prodigy subcommand. To start the annotation server, the function should return a dictionary of components that specify the stream of examples, annotation interface to use and other parameters. Here's a simplified version of the built-in textcat.teach recipe:

textcat.teach recipe

from prodigy import recipe, get_stream
from prodigy.models.textcat import TextClassifier
from prodigy.components.sorters import prefer_uncertain
import spacy

@recipe('textcat.teach',
        dataset=("Dataset ID", "positional"),
        spacy_model=("Loadable spaCy model (for tokenization)"),
        source=("Source data (file path or API query)"),
        api=("Optional API loader to use", "option", "a", str),
        loader=("Optional file loader to use", "option", "lo", str),
        label=("Label to annotate", "option", "l", str))
def teach(dataset, spacy_model, source, api=None, loader=None, label=''):
    """Annotate texts to train a new text classification label"""
    nlp = spacy.load(spacy_model, disable=['tagger', 'parser', 'ner'])
    stream = get_stream(source, api, loader)
    model = TextClassifier(nlp, label=label)
    return {
        'dataset': dataset,
        'view_id': 'classification',
        'stream': prefer_uncertain(model(stream)),
        'update': model.update,
        'config': {'lang': nlp.lang, 'label': model.label}
    }

The recipe returns the following components:

dataset ID of the dataset to add the collected annotations to.
view_id The annotation interface to use in the web app. If not set, Prodigy will guess the best matching one from the first example.
stream Iterable stream of annotation examples.
update Function that is invoked every time Prodigy receives an annotated batch of examples. Can be used to update a model.
config Other parameters to display in the web app, or Prodigy config defaults.
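The update callback simply receives each annotated batch as Prodigy collects answers. A minimal stand-in that tallies answers instead of updating model weights (the task format here is assumed for illustration):

```python
stats = {"accept": 0, "reject": 0, "ignore": 0}

def update(answers):
    """Minimal update callback: count the answers in each batch.
    A real recipe would update the model's weights here instead."""
    for task in answers:
        stats[task["answer"]] += 1

# Simulate Prodigy sending back one annotated batch.
update([{"text": "Fix readme typo", "answer": "accept"},
        {"text": "Crash on startup", "answer": "reject"}])
print(stats)  # {'accept': 1, 'reject': 1, 'ignore': 0}
```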