Pretrained Danish embeddings

This repository keeps a list of pretrained word embeddings publicly available in Danish. The and provides functions for downloading the embeddings as well as prepare them for use in popular NLP frameworks.

Name Model Tokens Vocab Unit Task License DaNLP
CoNLL2017 word2vec 1.6B 1,655,886 Word Skipgram CC BY-NC-SA 4.0 ✔️
Kongelige Bibliotek word2vec - 2,404,836 Word Skipgram CC0 1.0 ✔️
Facebook CC fastText - 2,000,000 Char N-gram Skipgram CC BY-SA 3.0 ✔️
Facebook Wiki fastText - 312,956 Char N-gram Skipgram CC BY-SA 3.0 ✔️
SketchEngine fastText 2B 2,722,811 Char N-gram Skipgram CC BY-NC-SA 4.0 ✔️
DSL Reddit word2vec 178,649 Word CBOW MIT ✔️
flair Flair - Char LM MIT ✔️
BERT BERT 1,611,153,110 BPE LM CC BY 4.0 ✔️

Embeddings are a way of representing text as numeric vectors, and can be calculated both for chars, subword units (Sennrich et al. 2016), words, sentences or documents. The methods for training embeddings, can roughly be categorized into static embeddings and dynamic embeddings.

Static embeddings

Static word embeddings contains a large vocabulary of words and each word has a vector representation associated. To get a representation of a word is simply a look-up in the vocabulary to get the associated vector. An example of this type of embeddings is word2vec (Mikolov et al. 2013). Relying on a vocabulary of words can result in out-of-vocabulary words. To cope with this fastText (Bojanowski et al. 2017) uses subword units that constructs a word embedding from the character n-gram embeddings occurring in the word.

Dynamic embeddings

Dynamic embeddings are contextual in the sense that the representations are dependent on the sentence they appear in. This way homonyms get different vector representations. An example of dynamic embeddings is the Flair embeddings (Akbik et al. 2018) where the embeddings are trained with the task of language modelling ie. learning to predict the next character in a sentence.

📈 Benchmarks

To evaluate word embeddings it is common to do intrinsic evaluations to directly test for syntactic or semantic relationships between words. The Danish Similarity Dataset and WordSim-353 contains word pairs annotated with a similarity score. Calculating the correlation between the word embedding similarity and the similarity score gives and indication of how well the word embeddings captures relationships between words.

Model DSD-ρ DSD-OOV WS353-ρ WS353-OOV
wiki.da.wv 0.205 1.01% 0.639 0.85%
cc.da.wv 0.313 0.00% 0.533 1.70%
conll17.da.wv 0.150 0.00% 0.549 1.70%
news.da.wv 0.306 0.00% 0.541 4.25%
sketchengine.da.wv 0.197 0.00% 0.626 0.85%
dslreddit.da.wv 0.198 0.00% 0.443 1.98%

🐣 Get started using word embeddings

Word embeddings are essentially a representation of a word in a n-dimensional space. Having a vector representation of a word enables us to find distances between words. In we have provided functions to download pretrained word embeddings and load them with the two popular NLP frameworks spaCy and Gensim.

This snippet shows how to automatically download and load pretrained static word embeddings e.g. trained on the CoNLL17 dataset, and it show some analysis the embeddings can be used for:

from danlp.models.embeddings  import load_wv_with_gensim, load_wv_with_spacy

# Load with gensim
word_embeddings = load_wv_with_gensim('conll17.da.wv')

word_embeddings.most_similar(positive=['københavn', 'england'], negative=['danmark'], topn=1)
# [('london', 0.7156291604042053)]

word_embeddings.doesnt_match("sodavand brød vin juice".split())
# 'brød'

word_embeddings.similarity('københavn', 'århus')
# 0.7677029

word_embeddings.similarity('københavn', 'esbjerg')
# 0.59988254

# Load with spacy
word_embeddings = load_wv_with_spacy('conll17.da.wv')

🔧 Flair embeddings

This repository provides pretrained Flair word embeddings trained on Danish data from Wikipedia and EuroParl both forwards and backwards. To see the code for training the Flair embeddings have a look at Flairs GitHub.

The hyperparameter are set as follows: hidden_size=1032, nlayers=1, sequence_length=250, mini_batch_size=50, max_epochs=5

The trained Flair word embeddings has been used in training a Part of Speech tagger and Name Entity Recognition tagger with Flair, check it out in the docs for pos and ner .

In the snippet below you can see how to load the pretrained flair embeddings and an example of simple use.

from danlp.models.embeddings import load_context_embeddings_with_flair
from import Sentence

# Use the wrapper from DaNLP to download and load embeddings with Flair
# You can combine it with the static embeddings
stacked_embeddings = load_context_embeddings_with_flair(word_embeddings='wiki.da.wv')

# Embed two different sentences
sentence1 = Sentence('Han fik bank')
sentence2 = Sentence('Han fik en ny bank')

# Show that it is contextual in the sense 'bank' has different embedding after context
print('{} dimensions out of {} is equal'.format(int(sum(sentence2[4].embedding==sentence1[2].embedding)), len(sentence1[2].embedding)))
# 1332 ud af 2364

🔧 BERT embeddings

BERT is a language model but the different layers can be used as embeddings of tokens or sentences. This code loads a pytorch version using the Transformers library from HuggingFace of pre-trained Danish BERT representations by BotXO model. Since the models is not a designated models for embeddings, some choices is made of what layers to use. For each tokens in a sentence there is 13 layers of dim 768. Based on the blogpost, it has been chosen to concatenate the four last layer to use as token embeddings, which gives a dimension of 4*768=3072. For sentence embeddings the second last layers is used and the mean across all tokens in the sentence is calculated.

Note, BERT tokenize out of vocabulary words into sub words.

Below is a small code snippet for getting started:

from danlp.models import load_bert_base_model
model = load_bert_base_model()
vecs_embedding, sentence_embedding, tokenized_text =model.embed_text('Han sælger frugt')

🎓 References