Text Classification

Text classification models learn to assign one or more labels to text. You can use text classification over short pieces of text like sentences or headlines, or longer texts like paragraphs or even whole documents. One of our top tips for practical NLP is to break down complicated NLP tasks into text classification problems whenever possible. Text classification problems tend to be easier to annotate consistently, and the models need fewer examples to reach high accuracy.

Whether you’re doing intent detection, information extraction, semantic role labeling or sentiment analysis, Prodigy provides easy, flexible and powerful annotation options. Active learning keeps you efficient even if your classes are heavily imbalanced.

Quickstart

For balanced classes, the easiest way to get started is to use textcat.manual with a text source and one or more labels. See the docs on manual annotation for examples. Setting the --exclusive flag makes the categories mutually exclusive, so you’ll only be able to select one label option.

Once you’ve collected a dataset of maybe a few hundred annotations, you can run training experiments to see if you’re on the right track. The train recipe takes one or more Prodigy datasets, trains a model and outputs statistics and results. You can also use data-to-spacy to export data in spaCy’s JSON format, or db-out to export your annotations to use in any other process or application.

If your classes are imbalanced and you annotated an unbiased sample, your sample would include very few examples that your label applies to, making it difficult to train a reliable model. To make annotation more efficient, you can use the textcat.teach recipe to suggest the most relevant examples to annotate. It uses match patterns of trigger phrases to collect enough positive examples, and updates a model in the loop that suggests candidates it’s most uncertain about. See this section for an example.

Annotation can be very efficient, because you only have to press accept or reject. Once you’re done annotating, you can use train to update your model with the annotations.

If you have an existing text classification model trained with spaCy, you can load it into the textcat.teach recipe and give it feedback on the predictions it’s most uncertain about. This means you’re focusing on annotating examples that potentially make the biggest difference. The progress indicator in the sidebar shows an estimate of how much you still need to annotate until there’s nothing left to learn – or, phrased differently, an estimate of when the loss is going to hit zero. This gives you an idea of when to stop. Once you’re done annotating, you can use the train recipe to update the model with the new annotations.

If you’re not using a spaCy model, you can write a custom recipe that integrates your model, so you can use it as part of the same textcat.teach-style active learning flow.

If you have existing annotations, you can convert them to Prodigy’s format and use the db-in command to import them to a new dataset. For text classification, each record should have a "text" plus either a "label" and an "answer" (binary annotations) or an "accept" list of selected label IDs (multiple choice annotations). You can then run train to train your model, use textcat.manual to add more annotations, or run the review recipe to correct mistakes and resolve conflicts.
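As a rough illustration of the conversion step, the sketch below turns (text, is_positive) pairs into binary records with "text", "label" and "answer" keys, which is the binary export format described in the training section of this page. The helper names, the sample rows and the INSULT label are assumptions made for the example, not part of Prodigy.

```python
import io
import json


def to_binary_records(rows, label):
    """Convert (text, is_positive) pairs into binary task dicts."""
    records = []
    for text, is_positive in rows:
        records.append({
            "text": text,
            "label": label,
            "answer": "accept" if is_positive else "reject",
        })
    return records


def dump_jsonl(records, file):
    """Write one JSON object per line (JSONL) to an open text file."""
    for record in records:
        file.write(json.dumps(record, ensure_ascii=False) + "\n")


# Hypothetical existing annotations: (text, does the label apply?)
rows = [("You are an idiot", True), ("Thanks for the help!", False)]
records = to_binary_records(rows, "INSULT")

# Written to an in-memory buffer here; in practice you would open a .jsonl file
buffer = io.StringIO()
dump_jsonl(records, buffer)
jsonl_text = buffer.getvalue()
```

Once the records are saved to a file, that file is what you would pass to db-in together with the name of the new dataset.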

If all you want to do is train and you don’t need to collect or correct any annotations, you might find it more efficient to just train with spaCy (or any other library) directly.


Choosing the right recipe and workflow

  1. Fully manual. This is the classical approach and a very reliable way to get all of your examples annotated with all classes. For each example, you select one or more categories from a list. At the end of the process, you export “gold-standard” data that you can train your model with. In Prodigy, you can use this workflow with the textcat.manual recipe that displays the labels as options and lets you select one (mutually-exclusive categories) or multiple (multilabel classification).

  2. Binary with suggestions from patterns, active learning and a model in the loop. This workflow can be helpful if your classes are very imbalanced and it’s not feasible to go through all texts in order. To help select more relevant examples, you can use patterns to describe trigger words and phrases for the categories that you’re looking for. Instead of annotating every example, you can let the model suggest the most relevant examples to annotate and give it feedback on its predictions. There are many different ways you can select the “best” examples, and a whole line of research dedicated to exploring active learning techniques. Prodigy’s textcat.teach recipe implements simple uncertainty sampling. Based on your decisions, the model is updated in the loop and guided towards better predictions. Prodigy also includes utilities that let you implement custom workflows with a model in the loop.

Annotating whole documents vs. annotating sentences

If your documents are longer than a few hundred words each, we recommend applying the annotations to smaller sections of the document. Often paragraphs work well. Breaking large documents up into chunks lets the annotator focus on smaller pieces of text at a time, which helps them move through the data more consistently. It also gives you finer-grained labels: you get to see which paragraphs were marked as indicating a label, which makes it much easier to review the decisions later.
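A minimal sketch of this kind of chunking is shown below. It assumes paragraphs are separated by blank lines and stores the document ID and paragraph index in each task’s "meta" so the chunks stay in document order and can be traced back later. The function name and the "doc_id" and "paragraph" keys are illustrative choices, not a Prodigy convention.

```python
def split_into_paragraphs(doc_id, text):
    """Yield one annotation task per non-empty paragraph, in document order."""
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    for index, paragraph in enumerate(paragraphs):
        yield {
            "text": paragraph,
            # Keep enough information to map decisions back to the document
            "meta": {"doc_id": doc_id, "paragraph": index},
        }


document = "First paragraph.\n\nSecond paragraph.\n\n"
tasks = list(split_into_paragraphs("doc-1", document))
```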

You can always keep your documents in order, so that your annotators get to move through the document from start to finish. However, if you have an annotation task where the annotator really needs to see the whole document to make a decision, that’s often a sign that your text classification model might struggle. Current technologies struggle to put together information across sentences in complex ways. Often you can restructure tasks that require that much context into multiple labels applied at different points, plus a little bit of rule-based logic. This thread on the forum shows a few ideas for doing this when extracting facts from earnings news.


Fully manual annotation

To get started, you need a file with raw input text and one or more labels. The following command will start the web server, stream in news headlines from news_headlines.jsonl and show the label options Technology, Politics, Economy and Entertainment. Instead of passing in a list of comma-separated labels, you can also point the --label argument to a text file with one label per line.

Download news_headlines.jsonl

Recipe command

prodigy textcat.manual news_topics ./news_headlines.jsonl --label Technology,Politics,Economy,Entertainment

By default, you’re able to select multiple categories. If your labels are mutually exclusive and only one of them can apply, you can set the --exclusive flag. You’re now only able to select one option and the answer is submitted automatically when you make a selection.

Recipe command

prodigy textcat.manual language_identification ./web_dump.jsonl --label English,German,Other --exclusive

When you hit the accept, reject or ignore buttons, your answer will be submitted and Prodigy will add an "answer" key to the annotation task dict – for example, "answer": "accept". When you’re annotating manually with options, you typically only want to use accepted answers. Ignoring an answer typically means that you want to skip it completely and exclude it from everything – for example, because you don’t know the answer or because the question is confusing or not representative. The reject button is less relevant here, because there’s nothing to say no to – however, you can use it to reject examples that have actual problems that need fixing, like noisy preprocessing artifacts, HTML markup that wasn’t cleaned properly, texts in other languages, and so on. When you view or export your data later, e.g. with db-out, you can then explicitly filter out those examples and deal with them.

When you annotate with a model in the loop, the “score” field in the bottom right corner of the annotation card shows you the score of the current suggestion. Even though the recipe tries to present you with the most uncertain scores, it can sometimes happen that you see very different scores instead. So why does this happen?

Streams are generators and only operate on one batch at a time. They can also stream from huge files or potentially infinite sources of data, so Prodigy can’t just load everything into memory and keep sorting the whole stream. Instead, it uses an exponential moving average of the previous scores to decide whether to send out an example. This also prevents it from getting stuck if the model suddenly produces higher or lower scores. If the scores are confusing and the model isn’t producing meaningful suggestions, try collecting some gold-standard data first before switching to the binary workflow.
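To make the idea more concrete, here is a simplified sketch of a streaming filter based on an exponential moving average. It is not Prodigy’s actual implementation: the function name, the uncertainty formula and the alpha parameter are assumptions chosen for illustration. An example is sent out if its uncertainty is at least as high as the running average of the uncertainties seen so far.

```python
def prefer_uncertain_ema(scored_stream, alpha=0.5):
    """Yield examples whose uncertainty is at least the running average.

    scored_stream yields (score, example) tuples with scores in [0, 1].
    """
    average = None
    for score, example in scored_stream:
        # 1.0 for a score of 0.5, 0.0 for a score of 0.0 or 1.0
        uncertainty = 1.0 - abs(score - 0.5) * 2
        if average is None or uncertainty >= average:
            yield example
        # Update the moving average with every score, sent out or not
        if average is None:
            average = uncertainty
        else:
            average = alpha * uncertainty + (1 - alpha) * average


scored = [(0.5, "a"), (0.9, "b"), (0.45, "c"), (0.1, "d"), (0.6, "e")]
selected = list(prefer_uncertain_ema(scored))
```

Because the threshold follows the recent scores, a run of confident predictions lowers the bar, which is one way examples with scores far from 0.5 can still end up in the queue.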

When you annotate with a model in the loop, the model is also updated in the background. So why do you still need to train your model on the annotations afterwards, and can’t just export the model that was updated in the loop? The main reason is that the model in the loop is only updated once with each new annotation. This is never going to be as effective as batch training a model on the whole dataset, making multiple passes over the data, shuffling on each epoch and using other deep learning tricks like dropout rates, compounding batch sizes and so on. If you batch train your model with the collected annotations afterwards, you should end up with the same model you had in the loop, just better.

When you stop the recipe, the model in the loop is discarded and you can use train to train a better version of it using your annotations. If you just restart the recipe with the base model, it’ll start again at the beginning – otherwise, Prodigy would have to first batch train it behind the scenes and you might have to wait for quite a while until you can get started annotating. If you want to start with the updated model, you can train it with your annotations, output it to a directory and then initialize textcat.teach with the updated model:

prodigy train textcat textcat_dataset en_core_web_sm --output ./batch-trained-model
prodigy textcat.teach textcat_dataset ./batch-trained-model ./data.jsonl --label INSULT

To prevent unintended side-effects, you typically want to train the base model from scratch using all annotations every time you train – for example, you want to update en_core_web_sm with all annotations from one or more datasets and not update batch-trained-model, save the result, update that again and so on.


Manual annotations with binary labels

If you only provide a single label, the annotation decision becomes much simpler: does the label apply or not? In this case, Prodigy will present the question as a binary task using the classification interface. You can then hit accept or reject. Even if you have more than one label, it can sometimes be very efficient to make several passes over the data instead of selecting from a list of options. The annotators can focus on one concept at a time, which can reduce the potential for human error – especially when working with complicated texts and label schemes.

Recipe command

prodigy textcat.manual language_identification ./web_dump.jsonl --label English

Dealing with very large label sets or hierarchical labels

If you’re working on a task that involves more than 10 or 20 labels, it’s often better to break the annotation task up a bit more, so that annotators don’t have to remember the whole annotation scheme. Remembering and applying a complicated annotation scheme can slow annotation down a lot, and lead to much less reliable annotations. Because Prodigy is programmable, you don’t have to approach the annotations the same way you want your models to work. You can break up the work so that it’s easy to perform reliably, and then merge everything back later when it’s time to train your models.

If your annotation scheme is mutually exclusive (that is, texts receive exactly one label), you’ll often want to organize your labels into a hierarchy, grouping similar labels together. For instance, let’s say you’re working on a chat bot that supports 200 different intents. Choosing between all 200 intents will be very difficult, so you should do a first pass where you annotate much more general categories. You’d then take all the texts annotated for some general type, such as information, and set up a new annotation task to sort them into more specific subtypes. This lets the annotators study up on that part of the annotation scheme, so they can make more reliable decisions.
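The two-pass setup could be scripted roughly as follows. The sketch assumes that first-pass annotations were collected as multiple choice tasks (with an "accept" list of selected label IDs) and builds second-pass tasks whose "options" are the subtypes of the chosen general category. The hierarchy, the intent names and the function name are made up for the example.

```python
# Hypothetical two-level label scheme: general category -> specific intents
HIERARCHY = {
    "information": ["opening_hours", "pricing"],
    "action": ["book_table", "cancel_booking"],
}


def second_pass_tasks(first_pass, hierarchy):
    """Group accepted first-pass examples by general category and attach
    the matching subtypes as options for a second annotation pass."""
    tasks = {}
    for eg in first_pass:
        if eg.get("answer") != "accept":
            continue  # skip rejected and ignored examples
        for general in eg.get("accept", []):
            options = [{"id": sub, "text": sub} for sub in hierarchy.get(general, [])]
            tasks.setdefault(general, []).append({"text": eg["text"], "options": options})
    return tasks


first_pass = [
    {"text": "When do you open?", "accept": ["information"], "answer": "accept"},
    {"text": "asdf", "accept": [], "answer": "reject"},
]
tasks = second_pass_tasks(first_pass, HIERARCHY)
```

Each value in the resulting dict can be saved as its own JSONL file and annotated in a separate session, so annotators only need to keep one part of the scheme in mind at a time.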

If your annotation scheme is not mutually exclusive (that is, texts can receive zero or more labels), it’s often fastest to annotate one label at a time. This approach might seem inefficient, because you’ll have to make many more annotation passes over the data. However, if you’re annotating for just one label, you usually don’t need to read the text very closely – you can see immediately whether your label applies, letting you flash through the data at a few seconds per example.
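When it’s time to train, the separate passes need to be merged back into one set of labels per text. The sketch below shows one possible way to do this for binary annotations with "text", "label" and "answer" keys. Keying on the raw text is a simplifying assumption for the example, and the function name is hypothetical.

```python
def merge_binary_passes(examples):
    """Merge binary annotations from several passes into one dict of
    {label: bool} per text. Ignored answers are skipped."""
    merged = {}
    for eg in examples:
        if eg["answer"] == "ignore":
            continue
        categories = merged.setdefault(eg["text"], {})
        categories[eg["label"]] = eg["answer"] == "accept"
    return merged


annotations = [
    {"text": "New phone released", "label": "Technology", "answer": "accept"},
    {"text": "New phone released", "label": "Politics", "answer": "reject"},
    {"text": "Unclear headline", "label": "Technology", "answer": "ignore"},
]
merged = merge_binary_passes(annotations)
```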


Binary annotation with suggestions from patterns, active learning and a model in the loop

Annotation for text classification can get tricky if the classes you’re dealing with are very imbalanced. For instance, let’s say you want to detect insults in online comments. The majority of the comments you’ve extracted, e.g. from Reddit, (luckily) do not contain any insults. If you annotated an unbiased sample, your sample would include very few comments that your INSULT label applies to, making it difficult to train a reliable model.

The textcat.teach recipe lets you take advantage of two cool NLP techniques to collect a more representative data sample. When you start the server, you’re shown binary questions and as you annotate, the model in the loop is updated with your answers and guided towards better predictions. The suggestions you see are the ones that the model is most uncertain about. In the beginning, that’s pretty much everything. So to get over the cold start, you can provide match patterns describing words and phrases that are likely indicators of the given label – for instance, "idiot" or "douchebag". The pattern matches will be mixed in with the model suggestions. This ensures that the model starts off with enough positive examples to make meaningful suggestions.

Download INSULT patterns Download annotated dataset

Recipe command

prodigy textcat.teach textcat_insults blank:en ./reddit-comments.jsonl --label INSULT --patterns ./insults-patterns.jsonl

The progress indicator in the sidebar shows an estimate of how much you still need to annotate until there’s nothing left to learn – or, phrased differently, an estimate of when the loss is going to hit zero. As you annotate more examples, the model will slowly get a better sense of the INSULT label and will suggest more relevant examples.

In the annotation interface, the highlighted span shows the pattern match that was responsible for suggesting this example for annotation. Of course, patterns can also produce false positives that you’d have to reject – but that’s also very helpful. You don’t just want your model to learn that “sentences containing ‘douchebag’ are always an insult”. Note that the highlighted span is only added to visualize the match – it’s not going to be used directly as a feature in the model. However, the words that occur in the text will obviously have an impact on the model either way.

Video tutorial: training an insults classifier

The following video shows an end-to-end workflow using terms.teach to quickly bootstrap a list of trigger phrases based on word vectors and textcat.teach to collect annotations with a model in the loop. It took 40 minutes to create over 830 annotations, including 20% evaluation examples, which was enough to give 87% accuracy. You can download the annotated dataset from GitHub.

Working with patterns

Match patterns are typically provided as a JSONL (newline-delimited JSON) file and can be used to pre-select examples based on expressions they contain. This is especially useful to find positive candidates if your classes are very imbalanced. For instance, if you’re annotating whether a news headline is about a company sale or acquisition, you could define a condition like “contains any form of the verb ‘acquire’” or “includes this company name”. Prodigy supports two types of patterns:

patterns.jsonl

{"pattern": [{"lemma": "acquire"}, {"pos": "PROPN"}], "label": "COMPANY_SALE"}
{"pattern": "acquisition", "label": "COMPANY_SALE"}

  1. Token patterns: These patterns are lists of dictionaries with one dictionary describing one token to match. The token attributes to match on can be the token’s "text" or lowercase form "lower", but also lexical attributes like "is_punct" or linguistic features like "lemma" or "pos". You can find more details in spaCy’s documentation on rule-based matching.

  2. String matches: If the pattern value is a string, it will be used for exact string matching. While {"lower": "berlin"} matches “Berlin”, “berlin” and so on, "Berlin" will only match “Berlin”. The advantage of string patterns is that you don’t have to worry about the tokenization and whether the patterns describe the correct tokens. They also make it easy to re-use existing word lists and dictionaries.
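If you already have a word list, a small script can turn it into pattern entries. The sketch below creates case-insensitive token patterns using the "lower" attribute described above. It splits terms on whitespace, which only approximates spaCy’s tokenization (terms with punctuation or hyphens may need manual patterns), and the function name and term list are assumptions for illustration.

```python
import json


def patterns_from_terms(terms, label):
    """Create one case-insensitive token pattern per term in a word list."""
    patterns = []
    for term in terms:
        # One dict per whitespace-separated token, matched on its lowercase form
        token_pattern = [{"lower": token.lower()} for token in term.split()]
        patterns.append({"pattern": token_pattern, "label": label})
    return patterns


terms = ["Acquisition", "takes over"]  # hypothetical word list
patterns = patterns_from_terms(terms, "COMPANY_SALE")

# One JSON object per line, ready to be saved as a .jsonl patterns file
jsonl_lines = [json.dumps(pattern) for pattern in patterns]
```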

More about Prodigy pattern files


Active learning with a custom model

You don’t need to use spaCy to annotate with a model in the loop. Custom recipes are Python functions that let you script annotation workflows by returning components like the stream or an update callback to update the model in the loop. Just make sure to pick a model implementation that supports updates in small batches and that’s sensitive enough to small updates (since you want your annotations to have an effect).

Step 1: Use the model to predict and score labels (pseudocode)

import copy

class Model(object):
    def __call__(self, stream):
        for eg in stream:
            # your_model is a placeholder for your own prediction function
            predictions = your_model(eg["text"])
            for score, label in predictions:
                # Copy the example so each label gets its own task
                example = copy.deepcopy(eg)
                example["label"] = label
                yield (score, example)

On their own, the scores and examples aren’t that interesting yet – you typically want to use the scores to only select the most relevant examples for annotation. Prodigy provides several sorter functions that take a stream of (score, example) tuples and pick examples to send out for annotation. The textcat.teach recipe uses the prefer_uncertain sorter, which selects scores closest to 0.5.

Step 2: Sort the stream by score (pseudocode)

from prodigy.components.sorters import prefer_uncertain

model = Model()
stream = model(stream)
stream = prefer_uncertain(stream)

Step 3: Update the model with answers (pseudocode)

class Model(object):
    def update(self, answers):
        accepted = [eg for eg in answers if eg["answer"] == "accept"]
        rejected = [eg for eg in answers if eg["answer"] == "reject"]
        update_your_model(accepted, rejected)

By default, Prodigy streams are generators and Prodigy will only ever ask for the next batch from the stream. So as you annotate and update the model, future batches will receive scores from your updated model in the loop. For a simplified example of that loop, check out the textcat_custom_model.py recipe script. It uses a DummyModel that “predicts” random numbers to illustrate the idea – you’d obviously replace that with your own implementation using a library like scikit-learn, PyTorch or TensorFlow.

Dummy text classification model (pseudocode)

import random

class DummyModel(object):
    def __init__(self, labels):
        # The model can keep arbitrary state – let's use a simple random float
        # to represent the current weights
        self.weights = random.random()
        self.labels = labels

    def __call__(self, stream):
        for eg in stream:
            # Score the example with respect to the current weights
            eg['label'] = random.choice(self.labels)
            score = (random.random() + self.weights) / 2
            yield (score, eg)

    def update(self, answers):
        # Update the model weights with the new answers
        self.weights = random.random()
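For comparison, here is a sketch of a model with the same interface that actually learns from the answers: a tiny bag-of-words logistic regression written in plain Python. It is only meant to illustrate the predict and update contract and is not how Prodigy or spaCy implement text classification. The class name, the whitespace tokenization and the learning rate are all assumptions for the example.

```python
import copy
import math
from collections import defaultdict


class OnlineBowModel(object):
    """Bag-of-words logistic regression with one weight table per label."""

    def __init__(self, labels, learn_rate=0.5):
        self.labels = labels
        self.learn_rate = learn_rate
        self.weights = {label: defaultdict(float) for label in labels}

    def score(self, label, text):
        tokens = set(text.lower().split())
        z = sum(self.weights[label][token] for token in tokens)
        z = max(-30.0, min(30.0, z))  # clamp to avoid overflow in exp
        return 1.0 / (1.0 + math.exp(-z))

    def __call__(self, stream):
        # Yield one (score, example) tuple per example and label
        for eg in stream:
            for label in self.labels:
                example = copy.deepcopy(eg)
                example["label"] = label
                yield (self.score(label, eg["text"]), example)

    def update(self, answers):
        # One gradient step per accepted or rejected answer
        for eg in answers:
            if eg["answer"] not in ("accept", "reject"):
                continue
            target = 1.0 if eg["answer"] == "accept" else 0.0
            error = target - self.score(eg["label"], eg["text"])
            for token in set(eg["text"].lower().split()):
                self.weights[eg["label"]][token] += self.learn_rate * error


model = OnlineBowModel(["INSULT"])
before = model.score("INSULT", "what an idiot")
model.update([
    {"text": "you idiot", "label": "INSULT", "answer": "accept"},
    {"text": "thank you", "label": "INSULT", "answer": "reject"},
])
after = model.score("INSULT", "what an idiot")
```

Every answer changes the weights immediately, so the scores for later batches reflect the feedback, which is the property you need for annotating with a model in the loop.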

Finally, you can put it all together in a recipe function using the @prodigy.recipe decorator.

Step 4: Putting it all together in a recipe (pseudocode)

import prodigy
from prodigy.components.loaders import JSONL
from prodigy.components.sorters import prefer_uncertain

@prodigy.recipe("custom-textcat")
def custom_textcat_recipe(dataset, source):
    model = Model()
    stream = JSONL(source)             # load the data
    stream = model(stream)             # call custom predict function
    stream = prefer_uncertain(stream)  # sort to prefer uncertain scores
    return {
        "dataset": dataset,            # dataset to save annotations to
        "stream": stream,              # the incoming stream of examples
        "update": model.update,        # the update callback
        "view_id": "classification"    # annotation interface to use
    }

Command-line usage

prodigy custom-textcat textcat_dataset ./your_data.jsonl -F recipe.py

Optionally, you can also add pattern matching to pre-select examples based on the matches they contain. Prodigy’s PatternMatcher wraps spaCy’s Matcher and PhraseMatcher so you can use both token-based patterns and string matches. Using the combine_models helper, you can create one unified predict function that gets model predictions and matches and interleaves them, and a unified update callback that updates both the model and the pattern matcher.

Step 5: Add match patterns (optional) (pseudocode)

import prodigy
from prodigy.components.loaders import JSONL
from prodigy.components.sorters import prefer_uncertain
from prodigy.models.matcher import PatternMatcher
from prodigy.util import combine_models
import spacy

@prodigy.recipe("custom-textcat")
def custom_textcat_recipe(dataset, source, patterns=None):
    model = Model()
    if patterns is None:
        predict = model
        update = model.update
    else:
        nlp = spacy.blank("en")
        matcher = PatternMatcher(nlp, label_span=False, label_task=True)
        matcher = matcher.from_disk(patterns)
        # Combine the textcat model with the PatternMatcher to
        # annotate/update both
        predict, update = combine_models(model, matcher)
    stream = JSONL(source)             # load the data
    stream = predict(stream)           # call custom predict function
    stream = prefer_uncertain(stream)  # sort to prefer uncertain scores
    return {
        "dataset": dataset,            # dataset to save annotations to
        "stream": stream,              # the incoming stream of examples
        "update": update,              # the update callback
        "view_id": "classification"    # annotation interface to use
    }

Command-line usage

prodigy custom-textcat textcat_dataset ./your_data.jsonl ./patterns.jsonl -F recipe.py

Training text classification models

Once you’ve labelled some data with Prodigy, you can start your training experiments. If you’ve collected annotations from different sources or multiple annotators, it’s often a good idea to use the review recipe to resolve any conflicts and double-check the data. It’s also recommended to create a separate, dedicated evaluation set that you can compare different approaches against.
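One simple way to get a stable evaluation set is to assign examples based on a hash of their text, so the same text always ends up on the same side of the split, even after you add more annotations. The sketch below is an illustration under that assumption. The function names and the 20% default are arbitrary choices, not something Prodigy prescribes.

```python
import hashlib


def is_eval_example(text, eval_percent=20):
    """Deterministically assign a text to the evaluation set."""
    digest = hashlib.md5(text.encode("utf8")).hexdigest()
    return int(digest, 16) % 100 < eval_percent


def split_train_eval(examples, eval_percent=20):
    """Split examples into (train, evaluation) lists based on their text."""
    train, evaluation = [], []
    for eg in examples:
        if is_eval_example(eg["text"], eval_percent):
            evaluation.append(eg)
        else:
            train.append(eg)
    return train, evaluation


examples = [{"text": "Headline number %d" % i} for i in range(50)]
train, evaluation = split_train_eval(examples)
```

Duplicate texts always land in the same set, which avoids evaluating on examples the model has already seen during training.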

  1. Train a spaCy model using Prodigy’s CLI. The train recipe is a wrapper around spaCy’s training API and optimized for training straight from Prodigy datasets and quick experiments. It reads from a dataset, holds back data for evaluation and outputs nicely-formatted results. This workflow is the best choice if you just want to get going or quickly check if you’re “on the right track” and your model is learning things.

  2. Train a model with spaCy directly. Once you’re getting more serious, it often makes sense to train your model directly with the library you’re using – e.g. spaCy. This gives you more control over the training process and hyperparameters, and lets you train all model components at once. The data-to-spacy command lets you convert Prodigy datasets to spaCy’s JSON format to use with the spacy train command. It’s recommended to use the review recipe on the different annotation types first to resolve conflicts properly. To check if your data is valid and contains no issues, you can run spaCy’s debug-data command.

  3. Train a model with any other implementation or framework. The db-out command exports annotations in a straightforward JSONL format. If you’ve collected binary annotations, each example will have a "label" and an "answer" that’s either "accept", "reject" or "ignore" (see here for the format). If you’ve collected multiple choice annotations, each example will have an "accept" key mapped to a list of selected label IDs. This should make it easy to convert and use it to train any model.
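As a rough sketch of the conversion for the third option, the function below maps one exported record to a (text, labels) pair, handling both the binary and the multiple choice format described above. The function name and the {label: bool} output shape are assumptions for illustration, and you would adapt the output to whatever your framework expects.

```python
def to_training_pair(example, all_labels):
    """Convert one exported record to (text, {label: bool}).

    Returns None for records that shouldn't be used for training.
    """
    if example.get("answer") == "ignore":
        return None
    if "accept" in example:
        # Multiple choice: only accepted tasks carry usable selections
        if example.get("answer") != "accept":
            return None
        selected = set(example["accept"])
        return example["text"], {label: label in selected for label in all_labels}
    # Binary: the answer says whether the single label applies
    return example["text"], {example["label"]: example["answer"] == "accept"}


binary = {"text": "Nice work!", "label": "INSULT", "answer": "reject"}
choice = {"text": "Election results", "accept": ["Politics"], "answer": "accept"}

binary_pair = to_training_pair(binary, ["INSULT"])
choice_pair = to_training_pair(choice, ["Technology", "Politics"])
```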