Text Classification

Whether you're doing intent detection, information extraction, semantic role labeling or sentiment analysis, Prodigy provides easy, flexible and powerful annotation options. Active learning keeps you efficient even if your classes are heavily imbalanced.

[Example annotation card: POSITIVE – "This place has gotten plenty of bad reviews – but plenty of people know nothing about food!" (source: Yelp, score: 0.49)]

Focus on what the model is most uncertain about

Prodigy puts the model in the loop, so that it can actively participate in the training process, using what it already knows to figure out what to ask you next. The model learns as you go, based on the answers you provide. Most annotation tools avoid making any suggestions to the user, to avoid biasing the annotations. Prodigy takes the opposite approach: ask the user as little as possible.
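The "uncertainty" idea is simple to sketch: for a binary classifier, scores near 0.5 are the ones the model is least sure about, so those examples are asked about first. A minimal plain-Python illustration (not Prodigy's actual `prefer_uncertain` implementation, which streams lazily):

```python
def prefer_uncertain(scored_examples):
    """Sort (score, example) pairs so the most uncertain come first.

    A score of 0.5 means the model is maximally unsure, so we rank
    by distance from 0.5. This toy version sorts a finite list;
    the real sorter works over a lazy stream.
    """
    return [ex for score, ex in sorted(scored_examples,
                                       key=lambda pair: abs(pair[0] - 0.5))]

scored = [(0.95, "clearly positive"),
          (0.49, "hard to say"),
          (0.10, "clearly negative")]
queue = prefer_uncertain(scored)
# "hard to say" (score 0.49) is the first question presented
```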

Bootstrap with powerful patterns

Prodigy is a fully scriptable annotation tool, letting you automate as much as possible with custom rule-based logic. If your classes are imbalanced, you don't want to waste time labeling irrelevant examples. Instead, give Prodigy rules or a list of trigger words, review the matches in context and annotate the exceptions. As you annotate, a statistical model can learn to suggest similar examples, generalising beyond your initial patterns.
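As a toy illustration of the idea (not Prodigy's actual pattern matcher, which supports full spaCy token attributes), a trigger-word filter for pre-selecting candidates might look like this; the trigger words are hypothetical:

```python
TRIGGERS = {"acquire", "acquisition", "buyout"}  # hypothetical trigger words

def candidate_examples(texts, triggers=TRIGGERS):
    """Yield only the texts containing at least one trigger word,
    so annotators never see obviously irrelevant examples."""
    for text in texts:
        tokens = {token.strip(".,!?").lower() for token in text.split()}
        if tokens & triggers:
            yield text

texts = ["Microsoft will acquire GitHub.", "Nice weather today."]
matches = list(candidate_examples(texts))  # only the first text survives
```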

Try out new ideas quickly

Annotation is usually the part where projects stall. Instead of having an idea and trying it out, you start scheduling meetings, writing specifications and dealing with quality control. With Prodigy, you can have an idea over breakfast and get your first results by lunch. Once the model is trained, you can export it as a versioned Python package, giving you a smooth path from prototype to production.

Build custom workflows

Prodigy allows you to mix and match its annotation interfaces to build the experience that works best for your classification task. Customise the view to show the annotators exactly the information they need to make their decisions, and collect better data, faster.

import prodigy
from prodigy.components.loaders import JSONL
from prodigy.components.sorters import prefer_uncertain

@prodigy.recipe('custom-recipe')
def custom_recipe(dataset, source):
    stream = JSONL(source)       # load examples from a JSONL file
    model = load_your_model()    # any model exposing __call__ and update
    return {
        'dataset': dataset,                         # save annotations here
        'stream': prefer_uncertain(model(stream)),  # most uncertain first
        'update': model.update,                     # update model in the loop
        'view_id': 'classification'                 # annotation interface
    }

Plug in your own models

Custom recipes let you integrate machine learning models using any framework of your choice, load in data from different sources, implement your own storage solution or add other hooks and features. No matter how complex your pipeline is – if you can call it from a Python function, you can use it in Prodigy.
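Concretely, a model can be plugged in as long as it can score a stream and accept updates, matching the `stream` and `update` slots of the recipe above. A hypothetical wrapper (the class and names are illustrative, not a Prodigy API):

```python
class ModelInTheLoop:
    """Minimal contract for a model in the loop: calling it on a stream
    yields (score, example) pairs, and answers come back via update()."""

    def __init__(self, predict_fn):
        self.predict_fn = predict_fn   # any callable: text -> probability
        self.answers = []              # collected annotator feedback

    def __call__(self, stream):
        for example in stream:
            yield self.predict_fn(example["text"]), example

    def update(self, answers):
        # A real implementation would fine-tune the model here.
        self.answers.extend(answers)

model = ModelInTheLoop(lambda text: 0.9 if "great" in text else 0.2)
scored = list(model(iter([{"text": "great stuff"}, {"text": "meh"}])))
```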

Create patterns

Match patterns help you pre-select potential candidates for each category, based on trigger words and phrases. Each dictionary describes one token and supports the same attributes as spaCy’s Matcher.

{"label": "COMPANY_SALE", "pattern": [{"lemma": "acquire"}]}
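Patterns can also span multiple tokens, one dictionary per token. For instance, a hypothetical two-token pattern for the same category might look like this:

```
{"label": "COMPANY_SALE", "pattern": [{"lower": "takes"}, {"lower": "over"}]}
```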

Start the server

To start Prodigy, run the textcat.teach recipe with a base model, the category label, your patterns file to bootstrap suggestions and a text source. All annotations you collect will be saved to the dataset specified as the first argument.

prodigy textcat.teach your_dataset en_core_web_sm your_data.jsonl --label COMPANY_SALE --patterns patterns.jsonl

Annotate

As you click accept or reject, the model in the loop will be updated and will start learning about your new text category. Once you’ve annotated enough examples, the model will also start suggesting the examples it's most uncertain about, based on what it has learned so far.

Start the server

To start Prodigy, run the textcat.teach recipe with a base model, the category label and a text source. All annotations you collect will be saved to the dataset specified as the first argument.

prodigy textcat.teach your_dataset en_core_web_sm your_data.jsonl --label POSITIVE

Annotate

As you click accept or reject, the model in the loop will be updated and will start learning about your new text category. Prodigy will prioritise the examples the model is most uncertain about, based on what it has learned so far.

Train

You can export your annotations at any time, or use the textcat.batch-train command to train a model directly from the database – perfect for quick experiments. Part of the data will be held aside, letting you output the weights that generalised best.

prodigy textcat.batch-train your_dataset en_core_web_sm --label COMPANY_SALE --output /tmp/model --n-iter 10
prodigy textcat.batch-train your_dataset en_core_web_sm --label POSITIVE --output /tmp/model --n-iter 10

Test the model

After training, Prodigy exports a ready-to-use spaCy model that you can load in and test with examples. This also gives you a good idea of how the model is performing, and the data needed to improve the accuracy.

import spacy

nlp = spacy.load('/tmp/model')

doc = nlp("Microsoft confirms it will acquire GitHub")
categories = doc.cats

doc = nlp("This is a great movie – highly recommended!")
categories = doc.cats

Add options to your data

The easiest way to provide the labels is to add a list of "options" to each example in your data. An option should have a unique ID, as well as a text label that's displayed to the annotator.

{
    "text": "Thanks for your great work – really made my day!",
    "options": [
        {"id": 0, "text": "negative"},
        {"id": 1, "text": "positive"},
        {"id": 2, "text": "neutral"}
    ]
}
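If your source data doesn't include options yet, they're easy to attach in a small preprocessing step. A sketch (the helper name is ours, not a Prodigy API):

```python
import json

OPTIONS = [{"id": 0, "text": "negative"},
           {"id": 1, "text": "positive"},
           {"id": 2, "text": "neutral"}]

def add_options(lines, options=OPTIONS):
    """Attach the same list of options to every JSONL example."""
    for line in lines:
        example = json.loads(line)
        example["options"] = options
        yield json.dumps(example)

raw = ['{"text": "Thanks for your great work – really made my day!"}']
converted = list(add_options(raw))
```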

Optional: Configure the interface

Your local prodigy.json lets you configure how the options should be displayed. Setting "choice_style" to "multiple" will allow multiple selection. If you only allow single selections, you can set "choice_auto_accept" to true to automatically accept and confirm the selected answer and move on to the next question.

{
    "choice_style": "multiple",
    "choice_auto_accept": false
}

Start the server

To start Prodigy, run the mark recipe, which can be used to statically label data using a given annotation interface. In this case, Prodigy will use the choice UI to display the options as clickable boxes underneath the text.

prodigy mark your_dataset your_data.jsonl --view_id choice

Annotate

Select one or more options (depending on the interface settings) and click accept to confirm. Instead of clicking on the options, you can also use the number keys on the keyboard.

Export your data

Prodigy stores annotations in a simple JSON format to make it easy to reuse your data in other applications. Each record will have an "accept" property containing a list of the selected option IDs.

prodigy db-out your_dataset > annotations.jsonl
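Because the export is plain JSONL, post-processing takes only a few lines. For instance, tallying how often each option was selected (the records here are made-up examples of the exported format):

```python
import json
from collections import Counter

def count_selected(lines):
    """Tally selected option IDs across exported annotation records."""
    counts = Counter()
    for line in lines:
        record = json.loads(line)
        counts.update(record.get("accept", []))
    return counts

export = ['{"text": "Great!", "accept": [1]}',
          '{"text": "Meh.", "accept": [2]}',
          '{"text": "Love it!", "accept": [1]}']
totals = count_selected(export)   # option 1 selected twice, option 2 once
```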
