Dependency Parsing & Noun Phrase Chunking¶

Dependency Parsing¶

Dependency parsing is the task of extracting the dependency parse of a sentence. The parse is typically represented as a directed graph that depicts the grammatical structure of the sentence: nodes are words, and edges define syntactic relations between those words. A dependency relation is a triplet consisting of a head (a word), a dependent (another word) and a dependency label (describing the type of the relation).
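To make the triplet structure concrete, here is a minimal sketch in plain Python. The `DependencyRelation` type, the `dependents_of` helper and the analysis of the example sentence are hand-written for illustration; they are not part of DaNLP and not the output of any model.

```python
from typing import NamedTuple


class DependencyRelation(NamedTuple):
    """One edge of the dependency graph (hypothetical helper type)."""
    head: str       # the governing word
    dependent: str  # the word attached to the head
    label: str      # the type of the relation


# Hand-written analysis of "Jeg læser en god bog" ("I read a good book")
relations = [
    DependencyRelation("læser", "Jeg", "nsubj"),
    DependencyRelation("læser", "bog", "obj"),
    DependencyRelation("bog", "en", "det"),
    DependencyRelation("bog", "god", "amod"),
]


def dependents_of(head, relations):
    """Return (dependent, label) pairs for every edge leaving `head`."""
    return [(r.dependent, r.label) for r in relations if r.head == head]


print(dependents_of("bog", relations))
# [('en', 'det'), ('god', 'amod')]
```

In a full parse, the main verb would additionally be attached to an artificial root node; this is left out here to keep the sketch short.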

Model	Train Data	License	Trained by	Tags	DaNLP
SpaCy	Danish Dependency Treebank	MIT	Alexandra Institute	39 Universal dependencies	✔️
DaCy	Danish Dependency Treebank	Apache v2	Center for Humanities Computing Aarhus, K. Enevoldsen	39 Universal dependencies	❌
Stanza	Danish Dependency Treebank	Apache v2	Stanford NLP Group	39 Universal dependencies	❌

The models have been trained on the Danish UD treebank, which has been annotated with dependencies following the Universal Dependencies scheme. It uses 39 dependency relations.

Noun Phrase Chunking¶

Chunking is the task of grouping the words of a sentence into syntactic phrases (e.g. noun phrases, verb phrases). Here, we focus on the prediction of noun phrases (NP). Noun phrases can be pronouns (PRON), proper nouns (PROPN) or nouns (NOUN), potentially bound with other tokens that act as modifiers, e.g., adjectives (ADJ) or other nouns. In sentences, noun phrases are generally used as subjects (nsubj), objects (obj) or complements of prepositions. Examples of noun phrases:

en bog (NOUN)
en god bog (ADJ+NOUN)
Lines bog (PROPN+NOUN)

NP-chunks can be deduced from dependencies. We provide a conversion function, from dependencies to NP-chunks, which therefore depends on a dependency model.
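The idea of deducing NP-chunks from dependencies can be sketched as follows. This is a deliberately simplified illustration, not the conversion function shipped with DaNLP: the rule set (`NP_HEAD_POS`, `MODIFIER_DEPS`), the function name and the `B-NP`/`I-NP`/`O` tag names are assumptions, and the parsed sentence is hand-written.

```python
# Simplified, hypothetical rules: which tokens can head a noun phrase,
# and which dependents are pulled into the chunk of their head.
NP_HEAD_POS = {"NOUN", "PROPN", "PRON"}
MODIFIER_DEPS = {"det", "amod", "nummod", "nmod:poss"}


def np_chunks_from_dependencies(tokens):
    """Convert (text, pos, head_index, dep) tuples into BIO tags.

    A chunk is a nominal head plus the contiguous modifiers directly
    to its left that are attached to it.
    """
    tags = ["O"] * len(tokens)
    for i, (_text, pos, _head, dep) in enumerate(tokens):
        if pos not in NP_HEAD_POS or dep in MODIFIER_DEPS:
            # Not a nominal, or a nominal that modifies another nominal
            # (e.g. the possessor in "Lines bog")
            continue
        start = i
        # Extend the chunk leftwards over modifiers attached to token i
        while (start > 0
               and tokens[start - 1][2] == i
               and tokens[start - 1][3] in MODIFIER_DEPS):
            start -= 1
        tags[start] = "B-NP"
        for j in range(start + 1, i + 1):
            tags[j] = "I-NP"
    return tags


# Hand-written parse of "Jeg læser en god bog" (head indices are 0-based)
sentence = [
    ("Jeg", "PRON", 1, "nsubj"),
    ("læser", "VERB", 1, "root"),
    ("en", "DET", 4, "det"),
    ("god", "ADJ", 4, "amod"),
    ("bog", "NOUN", 1, "obj"),
]

print(np_chunks_from_dependencies(sentence))
# ['B-NP', 'O', 'B-NP', 'I-NP', 'I-NP']
```

A real converter would need to handle more cases (for example modifiers to the right of the head, or coordination); the sketch only shows why a dependency model is a prerequisite for the chunker.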

Use cases¶

Dependency parsing and chunking can be used as preprocessing steps for other NLP tasks. See for example the EPE shared task, where the performance of dependency parsers is evaluated through the output of downstream tasks such as:

Biological Event Extraction
Fine-Grained Opinion Analysis
Negation Resolution

They can also be used, for instance, for the extraction of keyphrases or for building a knowledge graph (see our tutorial).

Models¶

🔧 SpaCy¶

Read more about the SpaCy model in the dedicated SpaCy docs. It has been trained using the Danish Dependency Treebank dataset.

Dependency Parser¶

Below is a small getting-started snippet for using the SpaCy dependency parser:

from danlp.models import load_spacy_model

# Load the dependency parser using the DaNLP wrapper
nlp = load_spacy_model()

# Using the spaCy dependency parser
doc = nlp('Ordene sættes sammen til meningsfulde sætninger.')

dependency_features = ['Id', 'Text', 'Head', 'Dep']
head_format = "\033[1m{!s:>11}\033[0m" * len(dependency_features)
row_format = "{!s:>11}" * len(dependency_features)

print(head_format.format(*dependency_features))
# Printing dependency features for each token 
for token in doc:
    print(row_format.format(token.i, token.text, token.head.i, token.dep_))

../../_images/dep_features.png

Visualizing the dependency tree with SpaCy¶

# SpaCy visualization tool
from spacy import displacy

# Run in a terminal
# In Jupyter, use displacy.render instead
displacy.serve(doc, style='dep')

../../_images/dep_example.png

nsubj: nominal subject, advmod: adverbial modifier, case: case marking, amod: adjectival modifier, obl: oblique nominal, punct: punctuation

Chunker¶

Below is a snippet showing how to use the chunker:

from danlp.models import load_spacy_chunking_model

# text to process
text = 'Et syntagme er en gruppe af ord, der hænger sammen'

# Load the chunker using the DaNLP wrapper
chunker = load_spacy_chunking_model()
# Using the chunker to predict BIO tags
np_chunks = chunker.predict(text)

# Using the spaCy model to get linguistic features (e.g., tokens, dependencies)
# This is only used for printing features; it is not needed for the chunking task itself
# Note: this is the same model that can be loaded with 'load_spacy_model()'
nlp = chunker.model
doc = nlp(text)

syntactic_features = ['Id', 'Text', 'Head', 'Dep', 'NP-chunk']
head_format = "\033[1m{!s:>11}\033[0m" * len(syntactic_features)
row_format = "{!s:>11}" * len(syntactic_features)

print(head_format.format(*syntactic_features))
# Printing dependency and chunking features for each token 
for token, nc in zip(doc, np_chunks):
    print(row_format.format(token.i, token.text, token.head.i, token.dep_, nc))

../../_images/chunk_features.png
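For downstream use, such as keyphrase extraction, it is often handy to turn the per-token BIO tags back into phrase strings. The following is a minimal sketch of that step. The `bio_to_phrases` helper is hypothetical (not a DaNLP function), the tag names `B-NP`/`I-NP`/`O` are an assumption about the tag format, and the tags in the example are hand-written rather than actual model output.

```python
def bio_to_phrases(tokens, tags):
    """Group tokens into phrases according to their BIO tags."""
    phrases, current = [], []
    for token, tag in zip(tokens, tags):
        if tag.startswith("B"):
            # A new chunk starts: close the previous one, if any
            if current:
                phrases.append(" ".join(current))
            current = [token]
        elif tag.startswith("I") and current:
            # Continue the chunk that is currently open
            current.append(token)
        else:
            # "O" (or a stray "I" with no open chunk) closes the chunk
            if current:
                phrases.append(" ".join(current))
            current = []
    if current:
        phrases.append(" ".join(current))
    return phrases


# Hand-written tags for illustration only
tokens = ["Et", "syntagme", "er", "en", "gruppe", "af", "ord"]
tags = ["B-NP", "I-NP", "O", "B-NP", "I-NP", "O", "B-NP"]

print(bio_to_phrases(tokens, tags))
# ['Et syntagme', 'en gruppe', 'ord']
```

With the snippet above, the same helper could be applied to `[token.text for token in doc]` and `np_chunks`.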

DaCy¶

DaCy is a transformer-based version of the SpaCy model; it obtains higher performance, but at a higher computational cost. Read more about the DaCy model in the dedicated DaCy GitHub repository. It has also been trained using the Danish Dependency Treebank dataset.

Stanza¶

Stanza is a Python library which provides a neural network pipeline for NLP in many languages. It has been developed by the Stanford NLP Group. The Stanza dependency parser has been trained on the Danish Dependency Treebank (DDT).

📈 Benchmarks¶

See detailed scoring of the benchmarks in the example folder.

Dependency Parsing Scores¶

The dependency scores reported below are LA (Label Accuracy), UAS (Unlabelled Attachment Score) and LAS (Labelled Attachment Score):

Model	LA	UAS	LAS	Sentences per second (CPU*)
SpaCy	87.68	81.36	77.46	~270
DaCy (small) v0.0.0	91.42	87.26	84.2	~38
DaCy (medium) v0.0.0	93.09	88.91	86.65	~6
DaCy (large) v0.0.0	93.64	90.49	88.42	~1
Stanza	91.82	87.13	84.42	~9
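As a rough illustration of how the three dependency metrics relate, the sketch below computes them for a single sentence. It takes LA to be the share of tokens with the correct label, UAS the share with the correct head, and LAS the share with both correct. The `attachment_scores` function and the gold and predicted analyses are hand-written assumptions; the actual benchmark script may differ in details such as punctuation handling.

```python
def attachment_scores(gold, predicted):
    """Compute LA, UAS and LAS (in percent) from (head, label) pairs."""
    if len(gold) != len(predicted) or not gold:
        raise ValueError("gold and predicted must be non-empty and aligned")
    n = len(gold)
    pairs = list(zip(gold, predicted))
    label_ok = sum(g[1] == p[1] for g, p in pairs)  # label only
    head_ok = sum(g[0] == p[0] for g, p in pairs)   # head only
    both_ok = sum(g == p for g, p in pairs)         # head and label
    return {
        "LA": 100 * label_ok / n,
        "UAS": 100 * head_ok / n,
        "LAS": 100 * both_ok / n,
    }


# Hand-written example: one (head_id, label) pair per token; 0 is the root
gold = [(2, "nsubj"), (0, "root"), (5, "det"), (5, "amod"), (2, "obj")]
pred = [(2, "nsubj"), (0, "root"), (5, "det"), (2, "amod"), (2, "obl")]

print(attachment_scores(gold, pred))
# {'LA': 80.0, 'UAS': 80.0, 'LAS': 60.0}
```

Since LAS requires both the head and the label to be right, it can never exceed LA or UAS, which matches the ordering of the columns in the table above.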

Noun Phrase Chunking Scores¶

NP chunking scores (precision, recall and F1) are reported below:

Model	Precision	Recall	F1	Sentences per second (CPU*)
SpaCy	91.32	91.79	91.56	~240

*Sentences per second is based on a MacBook Pro with Apple M1 chip.
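For reference, chunk-level precision, recall and F1 such as those in the NP chunking table can be computed from BIO tags roughly as sketched below. This assumes exact-match scoring of chunk spans and `B-NP`/`I-NP`/`O` tag names; both are assumptions, and the benchmark script may compute the scores differently. The helper names and the example tags are hypothetical.

```python
def bio_spans(tags):
    """Return the set of (start, end) chunk spans encoded by BIO tags."""
    spans, start = set(), None
    # The extra "O" acts as a sentinel that closes a chunk at the end
    for i, tag in enumerate(list(tags) + ["O"]):
        if start is not None and not tag.startswith("I"):
            spans.add((start, i))
            start = None
        if tag.startswith("B"):
            start = i
    return spans


def chunk_prf(gold_tags, pred_tags):
    """Exact-match precision, recall and F1 over chunk spans."""
    gold, pred = bio_spans(gold_tags), bio_spans(pred_tags)
    correct = len(gold & pred)
    precision = correct / len(pred) if pred else 0.0
    recall = correct / len(gold) if gold else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Hand-written tags: the second gold chunk is only partially predicted
gold_tags = ["B-NP", "I-NP", "O", "B-NP", "I-NP", "O", "B-NP"]
pred_tags = ["B-NP", "I-NP", "O", "O", "B-NP", "O", "B-NP"]

precision, recall, f1 = chunk_prf(gold_tags, pred_tags)
print(f"P={precision:.2f} R={recall:.2f} F1={f1:.2f}")
```

Here two of the three predicted chunks match a gold chunk exactly, and two of the three gold chunks are found, so precision and recall are both 2/3.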