Named Entity Recognition

Named Entity Recognition (NER) is the task of extracting named entities from raw text. Common entity types are locations, organizations and persons. A few tools are currently available for NER in Danish. Popular models for NER (BERT, Flair and spaCy) are continuously trained on the newest available named entity datasets, such as DaNE, and made available through the DaNLP library.

| Model | Train Data | License | Maintainer | Tags | DaNLP |
|---|---|---|---|---|---|
| BERT | DaNE | CC BY 4.0 | Alexandra Institute | PER, ORG, LOC | |
| Flair | DaNE | MIT | Alexandra Institute | PER, ORG, LOC | |
| spaCy | DaNE | MIT | Alexandra Institute | PER, ORG, LOC | |
| daner | Derczynski et al. (2014) | GNU GPL v3 | ITU NLP | PER, ORG, LOC | |
| NERDA | DaNE | MIT | Ekstra Bladet | PER, ORG, LOC | |
| DaLUKE | DaNE | MIT | Peleiden | PER, ORG, LOC | |
| DaCy | DaNE | Apache License 2.0 | Center for Humanities Computing Aarhus, K. Enevoldsen | PER, ORG, LOC | (✔) |
| ScandiNER | DaNE, NorNE, SUC 3.0, WikiANN | MIT | Dan Saattrup Nielsen | PER, ORG, LOC | |

Use cases

NER is one of the most widely used NLP tasks in industry, probably because its use cases are straightforward. It can be used in many systems, by itself or in combination with other NLP models. For instance, extracting entities from text can be used for:

  • classifying / indexing documents (e.g. articles for news providers) and then

  • recommending similar content (e.g. news articles)

  • customer support (e.g. for tagging tickets)

  • analysing feedback from customers (product reviews)

  • speeding up search engines

  • extracting information (e.g. from emails)

  • building a structured database or a knowledge graph (see our example tutorial) from a corpus

  • anonymizing documents.
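
As a minimal sketch of the anonymization use case, the snippet below masks entities in a text. It assumes predictions arrive as a dict with character offsets (`start_pos`, `end_pos`), like the non-BIO output of the BERT model shown in the Models section. The helper name `anonymize` and the bracketed placeholder format are illustrative assumptions, not part of DaNLP.

```python
def anonymize(prediction):
    """Replace each entity span in prediction['text'] with its type in brackets.

    Hypothetical helper for illustration; not a DaNLP function.
    """
    text = prediction["text"]
    # Work from the last entity to the first so earlier offsets stay valid
    entities = sorted(prediction["entities"], key=lambda e: e["start_pos"], reverse=True)
    for ent in entities:
        placeholder = "[{}]".format(ent["type"])
        text = text[:ent["start_pos"]] + placeholder + text[ent["end_pos"]:]
    return text


# Prediction in the dict format of the BERT example (end_pos is exclusive)
prediction = {
    "text": "Han hedder Anders And Andersen og bor i Århus C",
    "entities": [
        {"type": "PER", "text": "Anders And Andersen", "start_pos": 11, "end_pos": 30},
        {"type": "LOC", "text": "Århus C", "start_pos": 40, "end_pos": 47},
    ],
}

anonymized = anonymize(prediction)
print(anonymized)  # Han hedder [PER] og bor i [LOC]
```

Replacing from right to left avoids recomputing offsets after each substitution. A real anonymization pipeline would likely need additional entity types and manual review, since no NER model catches every mention.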

Models

🔧 BERT

The BERT (Devlin et al. 2019) NER model is based on the pre-trained Danish BERT representations by BotXO, which have been fine-tuned on the DaNE dataset (Hvingelby et al. 2020). The fine-tuning was done using the Transformers library from HuggingFace.

The BERT NER model can be loaded with the load_bert_ner_model() method. Note that it accepts at most 512 tokens as input at a time. Split longer texts beforehand, for example with sentence boundary detection (e.g. using the spaCy model).

from danlp.models import load_bert_ner_model
bert = load_bert_ner_model()
# Get lists of tokens and labels in BIO format
tokens, labels = bert.predict("Jens Peter Hansen kommer fra Danmark")
print(" ".join(["{}/{}".format(tok,lbl) for tok,lbl in zip(tokens,labels)]))

# To get a correct tokenization, provide it yourself by passing a list of tokens to BERT
# (for example, spaCy can be used for tokenization)
# With this option, the output can also be chosen to be a dict with tags and positions instead of BIO format
tekst_tokenized = ['Han', 'hedder', 'Anders', 'And', 'Andersen', 'og', 'bor', 'i', 'Århus', 'C']
bert.predict(tekst_tokenized, IOBformat=False)
"""
{'text': 'Han hedder Anders And Andersen og bor i Århus C',
 'entities': [{'type': 'PER','text':'Anders And Andersen','start_pos': 11,'end_pos': 30},
  {'type': 'LOC', 'text': 'Århus C', 'start_pos': 40, 'end_pos': 47}]}
"""

You can also find the BERT NER model on our HuggingFace page.
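
Because of the 512-token input limit mentioned above, longer texts have to be split before prediction. The following is a rough sketch of one way to do that, not the approach prescribed by DaNLP: the regex-based splitter is a naive stand-in for proper sentence boundary detection, the helper names `split_sentences` and `chunk_text` are hypothetical, and the whitespace word count is only a proxy for BERT's subword tokens, so the budget should be set well below 512.

```python
import re


def split_sentences(text):
    """Naive sentence splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def chunk_text(text, max_words=200):
    """Group consecutive sentences into chunks of at most max_words words.

    A single sentence longer than max_words still becomes its own chunk.
    """
    chunks, current, count = [], [], 0
    for sentence in split_sentences(text):
        n_words = len(sentence.split())
        if current and count + n_words > max_words:
            chunks.append(" ".join(current))
            current, count = [], 0
        current.append(sentence)
        count += n_words
    if current:
        chunks.append(" ".join(current))
    return chunks


text = "Jens Peter Hansen kommer fra Danmark. Han bor i Århus C. Han arbejder i København."
chunks = chunk_text(text, max_words=10)
print(chunks)

# Each chunk could then be passed to the model separately, e.g.:
# for chunk in chunks:
#     tokens, labels = bert.predict(chunk)
```

With `max_words=10`, the example text yields two chunks: the first sentence on its own, and the second and third sentences together.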

🔧 Flair

The Flair (Akbik et al. 2018) NER model uses pretrained Flair embeddings in combination with fastText word embeddings. The model is trained using the Flair library on the DaNE dataset.

The Flair NER model can be used with DaNLP using the load_flair_ner_model() method.

from danlp.models import load_flair_ner_model
from flair.data import Sentence

# Load the NER tagger using the DaNLP wrapper
flair_model = load_flair_ner_model()

# Using the flair NER tagger
sentence = Sentence('Jens Peter Hansen kommer fra Danmark') 
flair_model.predict(sentence) 
print(sentence.to_tagged_string())

🔧 spaCy

The spaCy model is trained for several NLP tasks (read more here) using the DDT and DaNE annotations. The spaCy model can be loaded with DaNLP to do NER predictions in the following way.

from danlp.models import load_spacy_model

nlp = load_spacy_model()

doc = nlp('Jens Peter Hansen kommer fra Danmark') 
for tok in doc:
    print("{} {}".format(tok,tok.ent_type_))

NERDA

NERDA is a Python package that provides an interface for fine-tuning pretrained transformers for NER. It also includes some ready-to-use NER models, fine-tuned on DaNE, based on a multilingual BERT and a Danish ELECTRA.

DaCy

DaCy is a multi-task transformer trained using spaCy v3. Its models are fine-tuned on DaNE and based on the Danish BERT (v2) by BotXO and XLM-RoBERTa large. For more on DaCy, see the GitHub repository or the blog post describing the training procedure.

Daner

The daner (Derczynski et al. 2014) NER tool is a wrapper around Stanford CoreNLP, using data from Derczynski et al. (2014) that has not been released. The tool is not available through DaNLP, but it can be used from the daner repository.

DaLUKE

The DaLUKE model is based on the knowledge-enhanced transformer LUKE. It was first pretrained as a language model on the Danish Wikipedia and then fine-tuned on DaNE for NER.

ScandiNER

The ScandiNER model can tag text for NER in Danish, Norwegian (Bokmål and Nynorsk), Swedish, Icelandic and Faroese. It is a fine-tuned version of NB-BERT-base, a language model for Norwegian. A combination of NER datasets was used for training: DaNE, NorNE, SUC 3.0 and WikiANN (for Icelandic and Faroese).

📈 Benchmarks

The benchmarks were performed on the test part of the DaNE dataset. None of the models have been trained on this test part. We only report scores for the LOC, ORG and PER entities, as the MISC category has limited practical use. The table below shows the F1 scores achieved on the test set:

| Model | LOC | ORG | PER | micro AVG | macro AVG | Sentences per second (CPU*) |
|---|---|---|---|---|---|---|
| BERT | 83.90 | 72.98 | 92.82 | 84.04 | 83.23 | ~6 |
| Flair | 84.82 | 62.95 | 93.15 | 81.78 | 80.31 | ~9 |
| spaCy | 75.96 | 59.57 | 87.87 | 75.73 | 74.47 | ~420 |
| NERDA (mBERT) | 80.75 | 65.73 | 92.66 | 80.66 | 79.71 | ~1 |
| NERDA (electra) | 77.67 | 60.13 | 90.16 | 76.77 | 75.99 | ~10 |
| DaCy (small) v0.0.0 | 79.23 | 61.82 | 88.52 | 77.59 | 76.52 | ~44 |
| DaCy (medium) v0.0.0 | 83.96 | 66.23 | 90.41 | 80.50 | 80.20 | ~6 |
| DaCy (large) v0.0.0 | 85.29 | 79.04 | 94.15 | 86.89 | 86.16 | ~1 |
| DaLUKE v0.0.5 | 86.43 | 74.58 | 92.52 | 84.91 | 84.51 | ~1 |
| ScandiNER | 88.32 | 82.58 | 95.53 | 89.25 | 88.81 | ~5 |

*Sentences per second was measured on a MacBook Pro with an Apple M1 chip.
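
To get a comparable throughput figure on other hardware, a simple timing loop is often enough. The sketch below is an assumption about how such a number could be obtained, not a description of the benchmark script: `sentences_per_second` is a hypothetical helper, and `dummy_predict` stands in for a real model call such as `bert.predict`.

```python
import time


def sentences_per_second(predict, sentences):
    """Time predict() over all sentences and return the throughput."""
    start = time.perf_counter()
    for sentence in sentences:
        predict(sentence)
    elapsed = time.perf_counter() - start
    # Guard against a zero reading from the timer
    return len(sentences) / elapsed if elapsed > 0 else float("inf")


def dummy_predict(sentence):
    # Stand-in for a real model call; replace with e.g. bert.predict
    return sentence.split()


sentences = ["Jens Peter Hansen kommer fra Danmark"] * 100
rate = sentences_per_second(dummy_predict, sentences)
print("{:.0f} sentences per second".format(rate))
```

For a fair measurement, the model should be loaded before timing starts, and the sentences should resemble the texts the model will see in practice.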

The evaluation script ner_benchmarks.py can be found here.
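
To make the micro AVG and macro AVG columns concrete, the sketch below computes both averages from per-class counts. The counts are hypothetical and chosen only for easy arithmetic, since the table does not report them. The last lines check that the macro average is the unweighted mean of the per-class F1 scores, using the BERT row of the table.

```python
def f1(tp, fp, fn):
    """F1 score from true positive, false positive and false negative counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Hypothetical (tp, fp, fn) counts per entity type, NOT taken from the benchmark
counts = {"LOC": (80, 20, 20), "ORG": (50, 50, 50), "PER": (180, 20, 20)}

# Macro average: mean of the per-class F1 scores, each class weighted equally
macro = sum(f1(*c) for c in counts.values()) / len(counts)

# Micro average: pool the counts over all classes, then compute a single F1
total_tp = sum(c[0] for c in counts.values())
total_fp = sum(c[1] for c in counts.values())
total_fn = sum(c[2] for c in counts.values())
micro = f1(total_tp, total_fp, total_fn)

print("macro F1: {:.4f}, micro F1: {:.4f}".format(macro, micro))

# Per-class F1 scores of the BERT row in the table above (LOC, ORG, PER)
bert_macro = round((83.90 + 72.98 + 92.82) / 3, 2)
print(bert_macro)  # 83.23, matching the macro AVG column
```

The micro average is dominated by frequent classes, while the macro average treats LOC, ORG and PER equally, which is why a weak ORG score pulls the macro average down more than the micro average.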

🎓 References