Named Entity Recognition

Tagging names, concepts or key phrases is a crucial task for Natural Language Understanding pipelines. The Prodigy annotation tool lets you label NER training data or improve an existing model's accuracy with ease.

[Example annotation card: "ordered the new beats product today, hope theyre as good as everyone says 🤔" (source: Twitter, score: 0.51)]

Focus on what the model is most uncertain about

Prodigy puts the model in the loop, so that it can actively participate in the training process, using what it already knows to figure out what to ask you next. The model learns as you go, based on the answers you provide. Most annotation tools avoid making any suggestions to the user, for fear of biasing the annotations. Prodigy takes the opposite approach: let the model make suggestions, and ask the user as little as possible.
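As a rough sketch of the idea (not Prodigy's actual implementation, which exposes this as the prefer_uncertain sorter and operates on a moving window of the stream), an uncertainty sampler can simply prefer the examples whose predicted score is closest to 0.5:

def sort_by_uncertainty(scored_stream):
    # scored_stream yields (score, example) pairs produced by the model.
    # Scores near 0.5 mean the model is least sure about the example.
    examples = sorted(scored_stream, key=lambda pair: abs(pair[0] - 0.5))
    for score, example in examples:
        yield example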

Fast and flexible annotation

Prodigy's web-based annotation app has been carefully designed to be as efficient as possible. By breaking complex tasks down into smaller units of work, your annotators stay focused on one decision at a time, giving you better data, faster.

Bootstrap with powerful patterns

Prodigy is a fully scriptable annotation tool, letting you automate as much as possible with custom rule-based logic. You don't want to waste time labeling every instance of "New York" by hand. Instead, give Prodigy rules or a list of examples, review the entities in context and annotate the exceptions. As you annotate, a statistical model can learn to suggest similar entities, generalising beyond your initial patterns.
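For example, instead of labelling every instance of "New York" by hand, a patterns file could contain token-based rules like these (the GPE label here is just an illustration):

{"label": "GPE", "pattern": [{"lower": "new"}, {"lower": "york"}]}
{"label": "GPE", "pattern": [{"lower": "berlin"}]}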

Try out new ideas quickly

Annotation is usually the part where projects stall. Instead of having an idea and trying it out, you start scheduling meetings, writing specifications and dealing with quality control. With Prodigy, you can have an idea over breakfast and get your first results by lunch. Once the model is trained, you can export it as a versioned Python package, giving you a smooth path from prototype to production.

import prodigy
from prodigy.components.loaders import JSONL
from prodigy.components.sorters import prefer_uncertain

@prodigy.recipe('custom-recipe')
def custom_recipe(dataset, source):
    stream = JSONL(source)       # stream in examples from a JSONL file
    model = load_your_model()    # placeholder: any model exposing __call__ and update
    return {
        'dataset': dataset,                         # dataset to save answers to
        'stream': prefer_uncertain(model(stream)),  # sort stream by uncertainty
        'update': model.update,                     # update the model with answers
        'view_id': 'ner'                            # annotation interface to use
    }

Plug in your own models

Custom recipes let you integrate machine learning models using any framework of your choice, load in data from different sources, implement your own storage solution or add other hooks and features. No matter how complex your pipeline is – if you can call it from a Python function, you can use it in Prodigy.
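Custom recipes live in regular Python files. Assuming the recipe above is saved as recipe.py, you can point Prodigy to it with the -F flag:

prodigy custom-recipe your_dataset your_data.jsonl -F recipe.py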

Start the server

To start Prodigy, run the ner.teach recipe with the model you want to improve, one or more labels and a text source. All annotations you collect will be saved to the dataset specified as the first argument.

prodigy ner.teach your_dataset en_core_web_sm your_data.jsonl --label PERSON
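The text source is read as newline-delimited JSON, one object per line with a "text" key. A minimal your_data.jsonl might look like this (example texts for illustration only):

{"text": "ordered the new beats product today, hope theyre as good as everyone says 🤔"}
{"text": "Apple updated its privacy policy last week."}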

Annotate

As your texts stream in, Prodigy will look up all possible analyses for each sentence and suggest the entities the model is most uncertain about. Those are also the entities that need your feedback the most. As you click accept or reject, the model in the loop is updated.

Create patterns

Match patterns help you find potential entity candidates and get over the "cold start problem". Each dictionary describes one token and supports the same attributes as spaCy’s Matcher. You can create patterns manually, using word vectors or from an existing ontology.

{"label": "WEBSITE", "pattern": [{"lower": "reddit"}]}

Start the server

To start Prodigy, run the ner.teach recipe with a base model, one or more labels you want to add, your patterns file to bootstrap suggestions and a text source. All annotations you collect will be saved to the dataset specified as the first argument.

prodigy ner.teach your_dataset en_core_web_sm your_data.jsonl --label WEBSITE --patterns patterns.jsonl

Annotate

As you click accept or reject, the model in the loop will be updated and will start learning about your new entity type. Once you’ve annotated enough examples, the model will also start suggesting entities it's most uncertain about, based on what it has learned so far.

Optional: Add manual annotations

To cover especially tricky or very specific entities, you can always add more annotations manually using the ner.manual recipe.

Start the server

To start Prodigy, run the ner.manual recipe with a spaCy model, a data source and a comma-separated list of labels. The model is only used for tokenization, which lets you annotate faster, because the selection can snap to the token boundaries. All annotations you collect will be saved to the dataset specified as the first argument.

prodigy ner.manual your_dataset en_core_web_sm your_data.jsonl --label ORG,PRODUCT,PERSON

Annotate

To highlight a span, click and drag within the entity, or double-click on single words. Labels can be selected from the menu above, or via the number keys on your keyboard.

Export your data

Prodigy stores annotations in a simple JSON-based format and exports them as newline-delimited JSON (JSONL), making it easy to reuse your data in other applications.

prodigy db-out your_dataset > annotations.jsonl
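Each line in the exported file is one annotated example. For NER, a record typically contains the original text, the labelled spans with character offsets, and your answer, roughly like this (exact fields may vary by version and recipe):

{"text": "What do you think of the Reddit redesign?", "spans": [{"start": 25, "end": 31, "label": "WEBSITE"}], "answer": "accept"}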

Train

You can export your annotations at any time, or use the ner.batch-train command to train a model directly from the database – perfect for quick experiments. Part of the data is held back for evaluation, so the command can output the weights that generalised best.

prodigy ner.batch-train your_dataset en_core_web_sm --label WEBSITE --output /tmp/model --n-iter 10
prodigy ner.batch-train your_dataset en_core_web_sm --label PRODUCT --output /tmp/model --n-iter 10

Test the model

After training, Prodigy exports a ready-to-use spaCy model that you can load in and test with examples. This also gives you a good idea of how the model is performing and what data you need to improve its accuracy.

import spacy

# Load the exported model and run it over an example text
nlp = spacy.load('/tmp/model')
doc = nlp("What do you think of the Reddit redesign?")
entities = [(ent.text, ent.label_) for ent in doc.ents]
print(entities)
