Improve a Named Entity Model

To make best use of Named Entity Recognition (NER), you usually need a model that's been trained specifically for your use-case. Generic models such as the ones we provide for free with spaCy can only go so far, because there is huge variation in which entities are common in different text types.

In this example, we'll improve the accuracy of spaCy's default English model at recognising product or brand names in social media, a common requirement for online reputation management and tracking the opinions people express. In around 45 minutes, we created 1800 annotations from the Reddit comments corpus, and achieved an accuracy of 78.1% on the new entities.

prodigy dataset ner_product "Improve PRODUCT on Reddit data"
✨ Created dataset 'ner_product'.

prodigy ner.teach ner_product en_core_web_sm ~/data/RC_2010-01.bz2 --loader reddit --label PRODUCT
✨ Starting the web server on port 8080...

Annotating with Prodigy

Prodigy puts the model in the loop, so that it can actively participate in the training process. The model uses what it already knows to figure out what to ask you next, and is updated by the answers you provide, so the system learns as you go. Most annotation tools avoid making any suggestions to the user, to avoid biasing the annotations. Prodigy takes the opposite approach: let the model make the suggestions, and ask the user for as little input as possible.
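The ordering of questions follows the classic uncertainty-sampling idea: suggestions the model scores near 0.5 are the ones it is least sure about, so they are the most informative to ask about first. Here's a toy sketch of that idea — an illustration only, not Prodigy's actual implementation:

```python
# Toy sketch of uncertainty sampling (illustrative, not Prodigy's code):
# examples whose scores are closest to 0.5 are the ones the model is
# least confident about, so they come first in the queue.

def prefer_uncertain(scored_examples):
    """Sort (score, example) pairs so the most uncertain come first."""
    return sorted(scored_examples, key=lambda pair: abs(pair[0] - 0.5))

stream = [(0.99, "very confident accept"),
          (0.52, "borderline suggestion"),
          (0.05, "very confident reject")]

for score, text in prefer_uncertain(stream):
    print(score, text)
```

In Prodigy itself this role is played by the `prefer_uncertain` sorter, which additionally works on an infinite stream rather than a finished list.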

Screenshot of the Prodigy web application

Opening http://localhost:8080, we get a sequence of comments with candidate entities highlighted. If the entity is correct, click accept, press a, or swipe left on a touch interface.

I love [Windows 7 | PRODUCT].
source: Reddit · section: fffffffuuuuuuuuuuuu · score: 0.99

If the entity is not correct, click reject, press x, or swipe right.

[Bathroom | PRODUCT] breaks are spent with the beeping of a new call in my ear.
source: Reddit · section: IAmA · score: 0.36

Some examples are unclear or exceptions that you don't want the model to learn from. In these cases, you can click ignore or press space.

No scopes in that one either, but the [M1 | PRODUCT] was a killmachine.
source: Reddit · section: badcompany2 · score: 0.74

After around 45 minutes of annotating the stream of texts (around 1.6 seconds per decision), we end up with a total of 1800 annotations for the label PRODUCT, which break down as follows:

Decision   Count
accept     438
reject     1276
ignore     86
Total      1800
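A quick sanity check on these numbers shows how sparse the PRODUCT label is in this stream — only about a quarter of the model's scored suggestions were correct:

```python
# Decision counts from the annotation session above.
counts = {"accept": 438, "reject": 1276, "ignore": 86}

total = sum(counts.values())  # 1800 annotations in total
accept_rate = counts["accept"] / (counts["accept"] + counts["reject"])

print(f"{accept_rate:.1%} of the scored suggestions were accepted")
```

This imbalance is expected early on: the generic model's notion of PRODUCT doesn't yet match the entities common in Reddit comments, which is exactly what the retraining step corrects.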

Training from the annotations

After exiting the ner.teach recipe, the collected annotations are stored in the database. While the model state might have improved over the initial baseline, you can usually achieve even better accuracy by performing an additional batch training procedure, in which the model is re-trained from scratch now that all of the annotations are available. Although there are good techniques for streaming stochastic gradient descent, nothing works quite as well or nearly as simply as the standard batchwise approach.

After collecting a batch of annotations, you can train a model on them using the ner.batch-train recipe. The training procedure makes several passes over the annotations, shuffling them each time.
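The multiple shuffled passes can be sketched as follows — a minimal illustration assuming a hypothetical model object with an `update(batch)` method, not Prodigy's internal training code:

```python
import random

def batch_train(model, annotations, n_iter=10, batch_size=32):
    """Make several passes over the annotations, reshuffling each time.

    `model` is assumed to expose an `update(batch)` method that performs
    one gradient step on a list of examples (hypothetical API).
    """
    annotations = list(annotations)
    for _ in range(n_iter):
        random.shuffle(annotations)  # reshuffle on every pass
        for i in range(0, len(annotations), batch_size):
            model.update(annotations[i:i + batch_size])
```

Reshuffling between passes prevents the model from learning any ordering effects in the annotation stream.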

prodigy ner.batch-train ner_product en_core_web_sm --output /tmp/model --eval-split 0.5 --label PRODUCT
Loaded model en_core_web_sm
Using 50% of examples (883) for evaluation
Using 100% of remaining examples (891) for training
Correct 164
Incorrect 46
Baseline 0.005
Accuracy 0.781
Model: /tmp/model
Training data: /tmp/model/training.jsonl
Evaluation data: /tmp/model/evaluation.jsonl

Prodigy supports a few options for quick-and-dirty evaluations, to help you answer the question "Is my new model any good?". You can also use Prodigy to construct a stable "gold standard" evaluation set, with complete and correct annotations, using the ner.make-gold recipe.
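The --eval-split 0.5 option used above corresponds to shuffling the annotations and holding out half of them, which are never trained on. A minimal sketch of such a split (illustrative, not Prodigy's code):

```python
import random

def train_eval_split(examples, eval_split=0.5, seed=0):
    """Shuffle examples and hold out a fraction for evaluation only."""
    examples = list(examples)
    random.Random(seed).shuffle(examples)  # fixed seed: stable split
    n_eval = int(len(examples) * eval_split)
    return examples[n_eval:], examples[:n_eval]  # (train, heldout)
```

A split like this is quick, but its score moves around with the random seed — which is why a fixed, hand-checked gold-standard set is the better long-term benchmark.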

Exporting and using the model

After training the model, Prodigy outputs a ready-to-use spaCy model, making it easy to put into production.

Usage in spaCy v2.0

import spacy

nlp = spacy.load('/tmp/model')
doc = nlp("Check out the new iPhone announcement on Twitter!")

# Print entity labels and text
for ent in doc.ents:
    print(ent.label_, ent.text)

# Visualise the entities in the browser
spacy.displacy.serve(doc, style='ent')

Results

We can get some sense of how the system will improve as more data is annotated by retraining it with fewer examples. The chart below shows the accuracy achieved with 25%, 50%, 75% and 100% of the training data. The last 25% of the training data brought an 8% improvement in accuracy, indicating that further annotation should keep improving the system. Similar logic is used to estimate the progress indicator during training.

Using the ner.train-curve recipe, you can output a training curve and get an idea for how the model is performing with different numbers of examples. The recipe outputs the best accuracy score for each training run, as well as the improvement in accuracy.

prodigy ner.train-curve ner_product --n-samples 4 --eval-split 0.5 --label PRODUCT

   %    RIGHT   WRONG   ACCURACY
  25%   146     64      0.70     +0.70
  50%   152     58      0.72     +0.03
  75%   152     58      0.72     +0.00
 100%   169     41      0.80     +0.08

The recipe takes the same arguments as ner.batch-train. You can also customise the number of samples using the --n-samples argument, for example, 10 for snapshots at 10%, 20%, 30% and so on.
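Reading the curve is simple: the rightmost column is just the difference between consecutive accuracy scores. A small sketch of that calculation (the tiny mismatches against the printed column, e.g. +0.03 vs 0.02 here, come from the displayed accuracies themselves being rounded):

```python
def improvements(accuracies):
    """Per-step accuracy deltas, as shown in the train-curve output."""
    prev = 0.0
    deltas = []
    for acc in accuracies:
        deltas.append(round(acc - prev, 2))
        prev = acc
    return deltas

print(improvements([0.70, 0.72, 0.72, 0.80]))
```

A large delta on the final step, as here, is the signal that the model is still data-hungry and more annotation is worthwhile; a flat tail would suggest diminishing returns.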

Recipe details

A Prodigy recipe is a Python function that can be run via the command line. The built-in recipes are all built from components that you can also import and use yourself. Any function wrapped with the @recipe decorator can be used as a Prodigy subcommand. To start the annotation server, the function should return a dictionary of components that specify the stream of examples, annotation interface to use and other parameters. Here's a simplified version of the built-in ner.teach recipe:

ner.teach recipe

from prodigy import recipe, get_stream
from prodigy.models.ner import EntityRecognizer
from prodigy.components.sorters import prefer_uncertain
from prodigy.preprocess import split_sentences
import spacy

@recipe('ner.teach',
    dataset=("Dataset ID", "positional"),
    spacy_model=("Loadable spaCy model (for tokenization)"),
    source=("Source data (file path or API query)"),
    api=("Optional API loader to use", "option", "a", str),
    loader=("Optional file loader to use", "option", "lo", str),
    label=("Label to annotate", "option", "l", str))
def teach(dataset, spacy_model, source, api=None, loader=None, label=''):
    """Annotate texts to train a NER model"""
    nlp = spacy.load(spacy_model)
    model = EntityRecognizer(nlp, label=label)
    stream = get_stream(source, api, loader)
    stream = split_sentences(model.orig_nlp, stream)
    return {
        'dataset': dataset,
        'view_id': 'ner',
        'stream': prefer_uncertain(model(stream), bias=0.8),
        'update': model.update,
        'config': {'lang': nlp.lang, 'label': label}
    }

The recipe returns the following components:

Component   Description
dataset     ID of the dataset to add the collected annotations to.
view_id     The annotation interface to use in the web app. If not set, Prodigy will guess the best matching one from the first example.
stream      Iterable stream of annotation examples.
update      Function that is invoked every time Prodigy receives an annotated batch of examples. Can be used to update a model.
config      Other parameters to display in the web app, or Prodigy config defaults.