Dependencies and Relations (new in v1.10)
The relations
interface can be used for a wide range of classic natural
language processing tasks, such as syntactic and semantic dependency parsing,
coreference resolution, or discourse analysis. Relations can be directed or
undirected, labelled or unlabelled, and anchored either by single words or
phrases. Phrases can be recognised either as a preprocess, or jointly during the
relations annotation.
Quickstart
I want to improve an existing spaCy dependency parsing model.
The dep.correct
recipe lets you stream in the model’s predictions and
correct them if needed. You can either annotate all available dependency labels,
or focus on a subset of them that you care most about for your specific
application. spaCy can be updated with complete parses, as well as incomplete
annotations.
Once you’ve created a dataset, you can use the train
recipe to update the
existing model with the annotations. You can also use the data-to-spacy
command to convert your annotations to JSON-formatted training data to use with
spacy train
.
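As a sketch, the two follow-up steps could look like this. The dataset name news_deps is hypothetical, and the exact signatures vary between Prodigy versions (the flags below follow the v1.11+ style):

```shell
# Train a parser directly from a hypothetical Prodigy dataset "news_deps"
prodigy train ./output --parser news_deps

# Or export the annotations as training data to use with spacy train
prodigy data-to-spacy ./corpus --parser news_deps
```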
I want to plug in and correct a non-spaCy dependency parser.
Prodigy represents dependency annotations in a
simple JSON format with a "text"
, a
"relations"
property describing the head and child indices and label of each
dependency relation, and a list of "tokens"
. So you could extract the
suggestions from your model in this format, and then use the mark
recipe
with --view-id relations
to label the data exactly as it comes in.
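For illustration, here is a minimal sketch of how you might assemble one such task in Python. The helper name, text and token offsets are made up for the example; only the "text", "tokens" and "relations" keys follow the format described above:

```python
# Build one task dict in the format expected by `mark --view-id relations`.
# The offsets and labels below are illustrative, not from a real model.
def make_relations_task(text, tokens, relations):
    """tokens: (text, start, end) triples; relations: (head, child, label) triples."""
    return {
        "text": text,
        "tokens": [
            {"text": t, "start": s, "end": e, "id": i}
            for i, (t, s, e) in enumerate(tokens)
        ],
        "relations": [
            {"head": h, "child": c, "label": label}
            for h, c, label in relations
        ],
    }

task = make_relations_task(
    "She sleeps",
    [("She", 0, 3), ("sleeps", 4, 10)],
    [(1, 0, "nsubj")],  # head token 1 ("sleeps"), child token 0 ("She")
)
```

Each line of your JSONL source file would then be one such dict.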
You can also write a custom recipe with a custom stream
(a regular Python generator!) to plug in your model. If you can load and run
your model in Python, you can use it with Prodigy. See
the section on custom models for an example. If you want to use
active learning with a custom model, you can make your recipe return an update
callback that’s called whenever a new batch of answers is sent back from the web
app. However, there are a few more considerations here, like how sensitive your
model is to small updates.
I want to annotate entities and relations between them.
The rel.manual
recipe lets you switch between two annotation modes: one
for labeling/correcting entity spans, and one for defining relations between
these spans and/or other tokens. You can also load in data that has been
pre-annotated with entity spans in Prodigy’s format and annotate relations
between them, or use a pretrained model to suggest entities for you. For
details, see the section on annotating named entity relations.
I want to create training data for coreference resolution.
The coref.manual
recipe incorporates default settings for coreference
annotation, such as using the relation label COREF
and disabling all tokens
that are not nouns, proper nouns or pronouns, and pre-highlighting named
entities. You can customize the labels it uses to match your language and model.
For details, see the section on annotating coreference relations.
I want to annotate other relations for a custom domain, e.g. biomedical text.
Prodigy’s rel.manual
recipe allows building very powerful custom
workflows for semi-automated dependency and relation annotation, mixing manual
span labelling and dependency attachment. You can also provide match patterns to
pre-select and merge spans, or to disable tokens that you know are not going to
be part of a relation you’re looking for.
The example in this usage guide focuses on annotating biomedical events from literature, a complex task with multiple annotation objectives using the BioNLP 2011 GENIA Shared Task annotation scheme. A similar strategy can be applied to a variety of custom domain use cases.
I have relations annotations. I want to train a relations model.
Unlike with other annotation types, Prodigy’s prodigy train
and data-to-spacy
recipes
don’t support "relations"
annotations. Both recipes depend on spaCy, and spaCy
currently does not support relation extraction.
However, we’ve created a tutorial video
and project repository
that show how you can create a custom trainable component with spacy train
. To
adapt it to your project, you’ll want to clone the repo, update your annotations, and
modify the parse_data.py
script to export your annotations to .spacy
. For more details,
there are several related posts
on Prodigy Support. You may also find posts on
spaCy’s GitHub discussions if you
need help with spacy train
.
Choosing the right recipe and workflow
So you have a problem that requires data annotated with relationships between words and expressions, and you want to get it done as efficiently as possible. But how do you pick the right workflow for your use case?
-
Fully manual: This is the classic approach. You’re shown all tokens in the text and you annotate labelled relations between them by clicking on them. The biggest challenge here is to prevent the process from getting too messy and tedious (and as a result, slower and more error-prone). Unless your goal is to create a new dependency treebank from scratch, you typically want to use at least some automation to merge phrases, disable irrelevant tokens or pre-label some of the data for you. In Prodigy, you can use the
rel.manual
recipe for manual relation annotation, or the more task-specific coref.manual
with pre-defined configurations for coreference annotation. -
Manual with suggestions from model: If you already have a model that predicts something, you can use it to pre-label the data for you, and only correct its mistakes. This is especially useful for dependency parsing, where the data creation from scratch would otherwise be very tedious. Prodigy’s
dep.correct
workflow lets you stream in syntactic dependencies predicted by the model and correct them manually to create gold-standard data. You can also use a model with rel.manual
to add named entities and noun phrases.
Dependency Parsing
If you already have a pretrained spaCy pipeline
with a parser and you want to improve it on your own data, you can use the
built-in dep.correct
recipe. You don’t have to annotate all labels at the
same time – it can also be useful to focus on a smaller subset of labels that
are most relevant for your application. The following command will start the web
server, stream in headlines from news_headlines.jsonl
and provide the label
options ROOT
(root of the sentence), csubj
(clausal subject), nsubj
(nominal subject), dobj
(direct object) and pobj
(prepositional object).
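Such a command could look like the following sketch. The dataset name news_deps is a placeholder, and the model you pass in needs a trained parser:

```shell
prodigy dep.correct news_deps en_core_web_sm ./news_headlines.jsonl --label ROOT,csubj,nsubj,dobj,pobj
```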
Those labels will vary depending on the label scheme the model was trained with.
In the annotation UI, you can now review the dependency parse and click on
incorrectly predicted arcs to remove them, or add new dependencies by selecting
the head token and then the child token to attach it to. The ROOT
label is the
only one that should be attached to itself. You can achieve this in the UI
by double-clicking or double-tapping the token.
When you’re done with annotating, you can use the train
recipe with the
component parser
to train a dependency parser, or use the data-to-spacy
command to export JSON-formatted training data. You can also use db-out
to export data in Prodigy’s JSON format and
use it in a different process.
Combining Named Entity Recognition with relation extraction
You might already know Prodigy’s features for annotating training data for
named entity recognition. Using a workflow
like ner.manual
, you can stream in your data and highlight entity spans
for a given set of labels. For example, here we’re labelling PERSON
and GPE
(geopolitical entity):
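A sketch of such a session, assuming the same headlines file as the input source (the dataset name follows the one used below):

```shell
prodigy ner.manual ner_rels_ent blank:en ./news_headlines.jsonl --label PERSON,GPE
```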
While spans can capture a lot of important information – like the concepts that
are mentioned in the text – they can’t always capture relationships between
them. This requires another layer of data that defines two words or phrases,
typically a “head” and a “child”, and a label specifying the type of
relationship. The rel.manual
recipe allows you to stream in data that’s
already pre-annotated with named entities. In this case, we’re setting
dataset:ner_rels_ent
, which will load the previously annotated data from the
dataset ner_rels_ent
. Entities annotated in this dataset will be shown as a
merged unit, and we can assign relations between them and other tokens.
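As a sketch (the output dataset name ner_rels and the relation labels are placeholders), a command reading from that dataset could look like this:

```shell
prodigy rel.manual ner_rels blank:en dataset:ner_rels_ent --label SUBJECT,LOCATION
```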
Instead of loading in a pre-annotated dataset, you can also use an existing
pretrained model to add entities for you. Here we’re using the en_core_web_sm
model and the original raw input data, and set the --add-ents
flag to include
entities found in the text. For more options and how to add custom
preprocessing, see the section on custom relations.
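A hedged sketch of that variant, with hypothetical dataset name and relation labels:

```shell
prodigy rel.manual ner_rels en_core_web_sm ./news_headlines.jsonl --label SUBJECT,LOCATION --add-ents
```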
Joint entity and relation annotation
For some use cases, it makes sense to do entity and relation annotation at the
same time. That’s especially true if the annotation decision for both spans
and relations requires the same thought process, or if it’s difficult to
separate both tasks. In that case, you can pass an additional --span-label
argument to rel.manual
defining the entity labels to assign. The
interface now has two modes: the relation annotation mode
to connect tokens and spans, and the span
annotation mode to manually highlight and
edit spans. To add a span, click and drag across the tokens, or hold down
shift and click on the start and end token.
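A joint workflow like this could be started with a command along these lines (dataset name and labels are illustrative):

```shell
prodigy rel.manual ner_rels blank:en ./news_headlines.jsonl --label SUBJECT,LOCATION --span-label PERSON,GPE
```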
-
Choose the span highlighting mode by clicking the button. This will let you manually highlight spans or remove existing spans.
-
Drag across the token “Obama” or click on it to assign it the label
PERSON
. Then select the label GPE
(geopolitical entity) at the top and do the same for “Hawaii” and “New York”. To select multiple tokens, drag across them and they will turn green, indicating the selection is valid. If you make a mistake, click on the span and then on the button to remove it.
-
Choose the relations mode by clicking the button. This will let you select tokens or spans and assign relations to them.
-
Click “Obama” and then “born” to assign it the relation
SUBJECT
and do the same for “Obama” and “studied”. Then select the label LOCATION
at the top and connect “born” and “Hawaii” and then “studied” and “New York”.
If you have a pretrained model that already predicts something, you can also
set the --add-ents
flag to pre-highlight entities suggested by the model for
you. You can then delete incorrect spans or change their label and add missing
spans if needed.
Coreference Resolution
Coreference resolution is the challenge of linking ambiguous mentions such as
“her” or “that woman” back to an antecedent providing more context about the
entity in question. You can use the built-in coref.manual
recipe to
manually create such links. This recipe allows you to focus on nouns, proper
nouns and pronouns specifically, by disabling all other tokens. The following
command will start the web server, stream in movie summaries from
plot_summaries.jsonl
and provide the label COREF
to annotate coreference
relations.
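A sketch of that command (the dataset name coref_movies is a placeholder):

```shell
prodigy coref.manual coref_movies en_core_web_sm ./plot_summaries.jsonl --label COREF
```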
The recipe will use the model to automatically detect potential candidates for a coreference relationship. You can customize the labels used for the extraction via the recipe arguments to match the model you’re using. You can also set up your very own custom relation annotation workflow by defining custom rules for spans and disabled tokens.
To annotate, click a word or phrase and then the word or phrase you want to
connect it to. To remove an existing relationship, you can click its label. In
the above example, two coreference relationships are already annotated: “her”
→ “Lindy” and “she” → “Lindy”. Other mentions of “she” and “her” in
the sentence should be carefully annotated as either referring back to “Lindy”
or “Azaria”. Each relation you annotate will be saved as an entry under the key
"relations"
.
Single relation (example)
```json
{
  "head": 8,
  "head_span": {"start": 38, "end": 41, "token_start": 8, "token_end": 8, "label": null},
  "child": 0,
  "child_span": {"start": 0, "end": 5, "token_start": 0, "token_end": 0, "label": "PERSON"},
  "label": "COREF"
}
```
What's the difference between head and child?
Prodigy will record the direction of the relationship, from the “head” to the
“child”. This is relevant for many tasks like syntactic dependency annotation,
but less relevant for tasks like coreference resolution, where you mostly care
about pairs of coreference relations. For this use case, you can just treat the
"head"
and "child"
values of the relation as interchangeable and just
consider them as one coreference pair.
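Treating head and child as interchangeable could be done with a small helper like this sketch, which collapses directed relations into unordered pairs of token indices (the function name and sample data are made up for the example):

```python
def coref_pairs(relations):
    """Collect unordered coreference pairs from a task's "relations" list,
    treating "head" and "child" as interchangeable."""
    pairs = set()
    for rel in relations:
        if rel.get("label") == "COREF":
            # frozenset ignores direction, so (8, 0) and (0, 8) are one pair
            pairs.add(frozenset((rel["head"], rel["child"])))
    return pairs

relations = [
    {"head": 8, "child": 0, "label": "COREF"},
    {"head": 0, "child": 8, "label": "COREF"},  # same pair, opposite direction
]
```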
Custom dependencies and relations
Prodigy’s rel.manual
recipe allows building very powerful custom
workflows for semi-automated dependency and relation annotation. It’s based on
the following philosophy:
-
Dependencies should refer to consistent units. For example, relations might refer to named entities predicted by a named entity recognizer, or noun phrases extracted using part-of-speech tags and syntactic dependency labels. You shouldn’t have to ask your annotator to label all of this from scratch, if you can automate it – instead, they should only have to correct mistakes. Using a pretrained model and rules to pre-highlight spans to annotate can make data creation faster and more consistent.
-
Not all tokens are relevant. For instance, for many tasks, punctuation (outside of entities) is never going to be part of a relation you’re annotating. If you’re annotating nominal coreference, you only need to focus on nouns, proper nouns and pronouns. Or maybe you only want to annotate relations between entity spans and ignore all other tokens. Disabling irrelevant tokens automatically lets you and your annotators focus on what matters, speeds up the process and prevents mistakes.
You can customize your workflow using the following recipe settings:
--label | Relation label(s) to annotate manually in relation annotation mode . |
--span-label | Span label(s) to annotate manually in span annotation mode . |
--patterns | Match patterns defining spans to be added. |
--disable-patterns | Match patterns defining tokens to disable. |
--add-ents | Add entities predicted by the model as spans. |
--add-nps | Add noun phrases based on tagger and parser, if rules are available. |
Example: Custom biomedical relation annotation
Annotating biomedical events from literature is a complex task, and serves as a good example for the relation annotation functionality. Here, we follow the annotation scheme from the BioNLP 2011 GENIA Shared Task, which has been the foundation of many bio-event extraction algorithms in the last decade and has become a de facto standard. The annotation process involves the following:
- Annotate spans of one or more tokens describing genes and gene products (“GGPs”). In the original Shared Task, these were provided as gold annotations.
- Identify trigger words or spans like “stabilizes” referring to a positive regulation. There are 9 different relation/event types: gene expression, transcription, protein catabolism, phosphorylation, localization, binding, regulation, positive regulation and negative regulation.
- Connect trigger words to at least the object of the event, also called “theme”, which is usually a GGP. Binding events can have multiple theme annotations.
- Connect regulation events to the subject of the event, also called “cause”, if available. For the regulation events, both the “theme” and “cause” arguments can be GGPs or other events, thus allowing a nested structure of events.
For this example, we have prepared a sample of 200 sentences in
bio_events.jsonl
, taken from the Shared Task. As the Shared Task came with
gold-standard annotations of genes and proteins, we have already added those
GGP
spans to the input text. We also know that we’re only interested in
nouns, proper nouns, verbs and adjectives or other spans that have been
pre-tagged as GGP
. So we can write a disable pattern that disables all tokens
that do not have those part-of-speech tags and are also not part of the
pre-labelled GGP
spans.
patterns_disable_bio_rel.jsonl
{"pattern": [{"POS": {"NOT_IN": ["NOUN", "PROPN", "VERB", "ADJ"]}, "_": {"label": {"NOT_IN": ["GGP"]}}}]}
The following command will start the web server, stream in the biomedical
sentences from bio_events.jsonl
, apply the disable rules, and provide a list
of relevant span labels as well as the standard relations Cause
(the subject
of an event) and Theme
(the object of an event).
-
The BioNLP ST’11 annotation scheme has “trigger words” that refer to the span of tokens that expresses a relation. For instance, “stabilizes” refers to a positive regulation. You can annotate it as such by going to the span annotation mode , selecting the label
Reg+
and clicking on the token “stabilizes”. Similarly, annotate “enabling” as Reg+
and “activities” as Reg.
-
Now select the relations mode by clicking the button. To annotate the first relation, select the relation type “Cause”, click on the trigger “stabilizes” to select it, then on the
GGP
“Mdmx” to define it as the subject (cause) of this event. Similarly, you can annotate “Mdm2” as being the object of this same event by also connecting it to the “stabilizes” trigger with theTheme
relation type. This relation annotation style allows you to create nested events, as one event (e.g. “activities of p53”) could be the Theme
of another event (“Mdm2 enabling it”).
The final annotated task should look like this:
In an end-to-end setting, you could predict the gene/protein mentions with an NER model trained specifically for this challenge, for instance the models trained on scientific documents from the scispaCy project.
Using a custom model
You don’t need to use spaCy to let a model highlight suggestions for you. Under
the hood, the concept is pretty straightforward: if you stream in examples with
pre-defined "tokens"
, "relations"
and optional "spans"
, Prodigy will
accept and pre-highlight them. This means you can either stream in pre-labelled
data, or write a custom recipe that uses your model to add tokens and relations
to your data.
Expected format
```json
{
  "text": "I like cute cats",
  "tokens": [
    {"text": "I", "start": 0, "end": 1, "id": 0},
    {"text": "like", "start": 2, "end": 6, "id": 1},
    {"text": "cute", "start": 7, "end": 11, "id": 2},
    {"text": "cats", "start": 12, "end": 16, "id": 3}
  ],
  "relations": [
    {"child": 0, "head": 1, "label": "nsubj"},
    {"child": 1, "head": 1, "label": "ROOT"},
    {"child": 2, "head": 3, "label": "amod"},
    {"child": 3, "head": 1, "label": "dobj"}
  ]
}
```
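If you generate this format yourself, a few consistency checks can catch problems before annotation starts. Here is a sketch of such a validator (the helper name is made up; the checks simply mirror the format above):

```python
def validate_task(task):
    """Basic consistency checks for a task in the relations format."""
    ids = {tok["id"] for tok in task["tokens"]}
    for tok in task["tokens"]:
        # each token's offsets must slice the original text exactly
        assert task["text"][tok["start"]:tok["end"]] == tok["text"]
    for rel in task["relations"]:
        # head and child must point at existing token ids
        assert rel["head"] in ids and rel["child"] in ids
    return True

task = {
    "text": "I like cute cats",
    "tokens": [
        {"text": "I", "start": 0, "end": 1, "id": 0},
        {"text": "like", "start": 2, "end": 6, "id": 1},
        {"text": "cute", "start": 7, "end": 11, "id": 2},
        {"text": "cats", "start": 12, "end": 16, "id": 3},
    ],
    "relations": [
        {"child": 0, "head": 1, "label": "nsubj"},
        {"child": 1, "head": 1, "label": "ROOT"},
        {"child": 2, "head": 3, "label": "amod"},
        {"child": 3, "head": 1, "label": "dobj"},
    ],
}
```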
For example, let’s say your model returns the syntactic dependencies as a list
of token-based tags like ["nsubj", "ROOT", "amod", "dobj"]
and head indices
like [1, 1, 3, 1]
. You could then generate your data like this and add a
"head"
, "child"
and "label"
value for each relation:
Step 1: Write the stream generator
```python
def add_relations_to_stream(stream):
    custom_model = load_your_custom_model()
    for eg in stream:
        deps, heads = custom_model(eg["text"])
        eg["relations"] = []
        for i, (label, head) in enumerate(zip(deps, heads)):
            eg["relations"].append({"child": i, "head": head, "label": label})
        yield eg
```
If you want to extract and add the dependencies at runtime, you can write a
custom recipe that loads the raw data, uses your custom
model to add "relations"
to the stream, pre-tokenizes the text and then
renders it all using the relations
interface.
Step 2: Putting it all together in a recipe
```python
import prodigy
from prodigy.components.stream import get_stream
from prodigy.components.preprocess import add_tokens
import spacy

@prodigy.recipe("custom-dep")
def custom_dep_recipe(dataset, source):
    stream = get_stream(source)                     # load the data
    stream = add_relations_to_stream(stream)        # add custom relations
    stream = add_tokens(spacy.blank("en"), stream)  # add "tokens" to stream
    return {
        "dataset": dataset,      # dataset to save annotations to
        "stream": stream,        # the incoming stream of examples
        "view_id": "relations",  # annotation interface to use
        "labels": ["ROOT", "nsubj", "amod", "dobj"]  # labels to annotate
    }
```