Training a new entity type on Reddit comments

In this video we'll show you how to use Prodigy to train a phrase recognition system for a new concept by adding a new entity label to spaCy's named entity recognizer.

Specifically, we'll train a model to detect references to drugs and pharmaceuticals, using text from one of the largest online communities of US opiate users on Reddit. Millions of words of conversation from this community are publicly available, going back a number of years. This seems like a good resource for investigating questions about the emergence and progress of the opioid crisis.

Overview

Bootstrap a terminology list from seed terms. Using the web interface and the word vectors, we quickly collect over 100 drug terms. Based on those terms, we can create match patterns to suggest occurrences of the words in our data.
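As a rough sketch of that last step, the snippet below turns a list of terms into token-based match patterns, one JSON object per line. The seed terms, the function name `terms_to_patterns` and the whitespace-based word splitting are illustrative assumptions, not taken from the video. In practice, spaCy's tokenizer decides where the token boundaries fall.

```python
import json


def terms_to_patterns(terms, label="DRUG"):
    """Build one token-based match pattern per term (hypothetical helper)."""
    patterns = []
    for term in terms:
        # One {"lower": ...} entry per word, so matching ignores case.
        # Splitting on whitespace is a simplification of real tokenization.
        tokens = [{"lower": word.lower()} for word in term.split()]
        patterns.append({"label": label, "pattern": tokens})
    return patterns


# Illustrative seed terms; the actual list used in the video is not shown here.
seed_terms = ["heroin", "Fentanyl", "oxycodone", "black tar"]
patterns = terms_to_patterns(seed_terms)

# Serialize as JSONL: one pattern per line.
jsonl = "\n".join(json.dumps(pattern) for pattern in patterns)
print(jsonl)
```

Each line describes a sequence of tokens to match, so a multi-word term such as "black tar" becomes a two-token pattern.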

Annotate comments from Reddit based on the term patterns list. We can now stream in examples from the Reddit corpus and annotate whether the label DRUG applies to the suggested span of text. As we annotate, the model in the loop is updated and able to suggest more examples. Using the web app, we can quickly collect 600 annotations.

Train the entity recognizer and export the model. Using Prodigy's built-in training command, we train a model on 80% of the annotations and hold out the remaining 20% for evaluation. The model reaches an accuracy of 87.5% on the new DRUG entity.
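To make the 80/20 split and the accuracy figure concrete, here is a minimal sketch in plain Python. It is not Prodigy's implementation: the helper names, the shuffling strategy and the toy data are assumptions for illustration only.

```python
import random


def split_annotations(examples, eval_fraction=0.2, seed=0):
    """Shuffle the examples and hold out a fraction for evaluation."""
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)
    n_eval = round(len(shuffled) * eval_fraction)
    # The first n_eval examples are held out, the rest are used for training.
    return shuffled[n_eval:], shuffled[:n_eval]


def accuracy(predicted, gold):
    """Fraction of predictions that agree with the gold annotations."""
    correct = sum(p == g for p, g in zip(predicted, gold))
    return correct / len(gold)


# Toy stand-in for 600 binary annotations (accept / reject).
annotations = [
    {"id": i, "answer": "accept" if i % 3 else "reject"} for i in range(600)
]
train_examples, eval_examples = split_annotations(annotations)
print(len(train_examples), len(eval_examples))
```

With 600 annotations, this leaves 480 examples for training and 120 for evaluation. Accuracy is then the share of held-out decisions the model gets right.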

Try the model on test data. After training, Prodigy exports a ready-to-use spaCy model that we can load and test with examples. This also gives us a good idea of how the model is performing, and of what training data is needed to improve its accuracy.
