Models¶

Embeddings¶

This module provides functions for loading pretrained Danish word embeddings through several NLP frameworks:

flair

spaCy

Gensim

Available word embeddings:

wiki.da.wv

cc.da.wv

conll17.da.wv

news.da.wv

sketchengine.da.wv

dslreddit.da.wv

Available subword embeddings:

wiki.da.swv

cc.da.swv

sketchengine.da.swv

danlp.models.embeddings.AVAILABLE_EMBEDDINGS = ['wiki.da.wv', 'cc.da.wv', 'conll17.da.wv', 'news.da.wv', 'sketchengine.da.wv', 'dslreddit.da.wv']¶

danlp.models.embeddings.AVAILABLE_SUBWORD_EMBEDDINGS = ['wiki.da.swv', 'cc.da.swv', 'sketchengine.da.swv']¶

danlp.models.embeddings.assert_wv_dimensions(wv: gensim.models.keyedvectors.Word2VecKeyedVectors, pretrained_embedding: str)¶

Checks the dimensions of the word embeddings wv against the data stored in WORD_EMBEDDINGS for pretrained_embedding.

Parameters

wv (gensim.models.KeyedVectors) – word embeddings
pretrained_embedding (str) – the name of the pretrained embeddings
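
A check of this kind can be sketched as follows. This is a minimal illustration, not DaNLP's implementation: the EXPECTED_SHAPES table, its toy.da.wv entry and the check_dimensions helper are hypothetical stand-ins for WORD_EMBEDDINGS and assert_wv_dimensions, and the shape tuple stands in for what a loaded embedding object would report.

```python
from typing import Dict, Tuple

# Hypothetical metadata table (placeholder values, not DaNLP's real data):
# embedding name -> (vocabulary size, vector dimensions).
EXPECTED_SHAPES: Dict[str, Tuple[int, int]] = {"toy.da.wv": (5, 3)}


def check_dimensions(shape: Tuple[int, int], name: str) -> None:
    """Raise AssertionError if `shape` differs from the stored metadata."""
    expected = EXPECTED_SHAPES[name]
    assert shape == expected, f"{name}: expected {expected}, got {shape}"


# A matching shape passes silently; a mismatch raises AssertionError.
check_dimensions((5, 3), "toy.da.wv")
```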

danlp.models.embeddings.load_context_embeddings_with_flair(direction='bi', word_embeddings=None, cache_dir='/home/docs/.danlp', verbose=False)¶

Loads contextual (dynamic) word embeddings with flair.

Parameters

direction (str) – bidirectional ‘bi’, forward ‘fwd’ or backward ‘bwd’
word_embeddings –
cache_dir (str) – the directory for storing cached models
verbose (bool) – True to increase verbosity

danlp.models.embeddings.load_keras_embedding_layer(pretrained_embedding: str, cache_dir='/home/docs/.danlp', verbose=False, **kwargs)¶

Loads a Keras Embedding layer.

Parameters

pretrained_embedding (str) –
cache_dir (str) – the directory for storing cached models
verbose (bool) – True to increase verbosity
kwargs – used to forward arguments to the Keras Embedding layer

Returns

a Keras Embedding layer and index to word dictionary

danlp.models.embeddings.load_pytorch_embedding_layer(pretrained_embedding: str, cache_dir='/home/docs/.danlp', verbose=False)¶

Loads a PyTorch embedding layer.

Parameters

pretrained_embedding (str) –
cache_dir (str) – the directory for storing cached models
verbose (bool) – True to increase verbosity

Returns

a PyTorch Embedding module and a list id2word

danlp.models.embeddings.load_wv_with_gensim(pretrained_embedding: str, cache_dir='/home/docs/.danlp', verbose: bool = False)¶

Loads word embeddings with Gensim.

Parameters

pretrained_embedding (str) –
cache_dir – the directory for storing cached data
verbose (bool) – True to increase verbosity

Returns

KeyedVectors or FastTextKeyedVectors
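
A possible way to use the returned vectors is sketched below. It assumes danlp and gensim are installed and that the embeddings can be downloaded on first use; load_danish_vectors and neighbours are hypothetical helper names, and the import is kept inside the function so that defining the helpers does not trigger a download.

```python
from typing import List


def load_danish_vectors(name: str = "wiki.da.wv"):
    """Load pretrained vectors by name (requires danlp; downloads on first use)."""
    from danlp.models.embeddings import load_wv_with_gensim  # lazy import

    return load_wv_with_gensim(name)


def neighbours(wv, word: str, topn: int = 3) -> List[str]:
    """Return the `topn` words closest to `word`.

    `wv` is expected to offer gensim's KeyedVectors.most_similar, which
    returns (word, similarity) pairs.
    """
    return [w for w, _score in wv.most_similar(positive=[word], topn=topn)]
```

With the package installed, neighbours(load_danish_vectors(), "hund") would list the nearest neighbours of "hund" in the chosen embedding space.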

danlp.models.embeddings.load_wv_with_spacy(pretrained_embedding: str, cache_dir: str = '/home/docs/.danlp', verbose=False)¶

Loads a spaCy model with pretrained embeddings.

Parameters

pretrained_embedding (str) –
cache_dir (str) – the directory for storing cached data
verbose (bool) – True to increase verbosity

Returns

spaCy model

BERT models¶

class danlp.models.bert_models.BertBase(cache_dir='/home/docs/.danlp', verbose=False)¶

Bases: object

BERT language model used for embedding tokens or sentences. The model is trained by BotXO (https://github.com/botxo/nordic_bert) and has been converted to a PyTorch version.

Credit for code example: https://mccormickml.com/2019/05/14/BERT-word-embeddings-tutorial/

Parameters

cache_dir (str) – the directory for storing cached models
verbose (bool) – True to increase verbosity

embed_text(text)¶

Calculates the embedding for each token in a sentence and the embedding for the sentence as a whole, based on a BERT language model. The embedding for a token is the concatenation of the last four layers, and the sentence embedding is the mean of the second-to-last layer over all tokens in the sentence. The BERT tokenizer splits unknown (UNK) words into subwords, so the tokenized sentence is returned as well. The embeddings for the special tokens are not returned.

Parameters: text (str) – raw text
Returns: three lists: token_embeddings (dim: tokens x 3072), sentence_embedding (1x768), tokenized_text
Return type: list, list, list
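
The pooling described above can be illustrated with plain Python lists. This is only a sketch under stated assumptions, not the code used by BertBase: it assumes a hidden size of 768 (consistent with the documented 3072 = 4 x 768 token dimension) and 13 hidden-state layers, and it fills the layers with dummy values instead of real model output.

```python
from typing import List

Vector = List[float]
Layer = List[Vector]  # one vector per token


def token_embedding(layers: List[Layer], token: int) -> Vector:
    """Concatenate the last four layers' vectors for one token."""
    out: Vector = []
    for layer in layers[-4:]:
        out.extend(layer[token])
    return out


def sentence_embedding(layers: List[Layer]) -> Vector:
    """Average the second-to-last layer over all tokens."""
    layer = layers[-2]
    return [sum(tok[d] for tok in layer) / len(layer) for d in range(len(layer[0]))]


HIDDEN = 768      # assumed hidden size
NUM_LAYERS = 13   # assumed number of hidden-state layers
NUM_TOKENS = 3

# Dummy hidden states: every value in layer i equals float(i).
layers = [[[float(i)] * HIDDEN for _ in range(NUM_TOKENS)] for i in range(NUM_LAYERS)]

print(len(token_embedding(layers, 0)))   # 4 * HIDDEN
print(len(sentence_embedding(layers)))   # HIDDEN
```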

class danlp.models.bert_models.BertEmotion(cache_dir='/home/docs/.danlp', verbose=False)¶

Bases: object

BERT Emotion model.

For classifying whether a text expresses emotion and, if so, recognizing which of eight emotions it expresses.

Parameters

cache_dir (str) – the directory for storing cached models
verbose (bool) – True to increase verbosity

predict(sentence: str, no_emotion=False)¶

Predicts emotion among:

0: Glæde/Sindsro

1: Tillid/Accept

2: Forventning/Interrese

3: Overasket/Målløs

4: Vrede/Irritation

5: Foragt/Modvilje

6: Sorg/trist

7: Frygt/Bekymret

Parameters

sentence (str) – raw text
no_emotion (bool) – whether there is emotion or not in the text

Returns

index of the emotion

Return type

int
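
The index-to-label mapping listed above can be written down as a small lookup. This is an illustrative sketch, not part of the DaNLP API: EMOTIONS, emotion_label and most_likely_emotion are hypothetical names, and most_likely_emotion assumes a flat sequence with one probability per emotion in the documented order.

```python
from typing import List, Sequence

# Labels in the order documented for BertEmotion.predict (spelling as in the source).
EMOTIONS: List[str] = [
    "Glæde/Sindsro",
    "Tillid/Accept",
    "Forventning/Interrese",
    "Overasket/Målløs",
    "Vrede/Irritation",
    "Foragt/Modvilje",
    "Sorg/trist",
    "Frygt/Bekymret",
]


def emotion_label(index: int) -> str:
    """Map an emotion index (0-7) to its label."""
    return EMOTIONS[index]


def most_likely_emotion(probabilities: Sequence[float]) -> str:
    """Pick the label with the highest probability.

    Assumes one probability per emotion, ordered as in EMOTIONS.
    """
    if len(probabilities) != len(EMOTIONS):
        raise ValueError("expected one probability per emotion")
    best = max(range(len(EMOTIONS)), key=lambda i: probabilities[i])
    return EMOTIONS[best]
```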

predict_if_emotion(sentence)¶

Predicts whether there is emotion in the text.

Parameters: sentence (str) – raw sentence
Returns: 0 if no emotion else 1
Return type: int

predict_proba(sentence: str, emotions=True, no_emotion=True)¶

Predicts the probabilities of emotions.

Parameters

sentence (str) – raw text
emotions (bool) – whether to return the probability of the emotion
no_emotion (bool) – whether to return the probability of the sentence being emotional

Returns

a list of probabilities

Return type

List

class danlp.models.bert_models.BertHateSpeech(cache_dir='/home/docs/.danlp', verbose=False)¶

Bases: object

BERT HateSpeech Model.

For detecting whether a comment is offensive or not and, if offensive, predicting what type of hate speech it is. The model is meant to help moderate online comments (and therefore also detects and categorizes spam). The model can predict the following categories:

Særlig opmærksomhed

Personangreb

Sprogbrug

Spam & indhold

Parameters

cache_dir (str) – the directory for storing cached models
verbose (bool) – True to increase verbosity

predict(sentence: str, offensive: bool = True, hatespeech: bool = True)¶

Predict whether a sentence is offensive or not and/or the class of the sentence [Særlig opmærksomhed, Personangreb, Sprogbrug, Spam & indhold]

Parameters

sentence (str) – raw text
offensive (bool) – if True returns whether the sentence is offensive or not
hatespeech (bool) – if True returns the type of hate speech the sentence belongs to

Returns

a dictionary for offensive language and hate speech detection results

Return type

Dict

class danlp.models.bert_models.BertNer(cache_dir='/home/docs/.danlp', verbose=False)¶

Bases: object

BERT NER model

Parameters

cache_dir (str) – the directory for storing cached models
verbose (bool) – True to increase verbosity

predict(text: Union[str, List[str]], IOBformat=True)¶

Predict NER labels from raw text or tokenized text. If the text is a raw string, this method returns the string tokenized into BERT's subword tokens.

Parameters

text – can either be a raw text or a list of tokens
IOBformat – True to return the output in IOB format, False to return a dictionary. It can only be False if the text input is a list of tokens.

Returns

the tokenized text and the predicted labels in IOB format, or a dictionary with the tags and position

Example

“varme vafler” becomes [“varme”, “va”, “##fler”]
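
When the input is a raw string, the subword pieces may need to be joined back into words. The following is a minimal sketch of one way to do that, assuming the usual WordPiece convention that continuation pieces start with "##" and assuming that a word keeps the label of its first piece; merge_wordpieces is a hypothetical helper, not a DaNLP function.

```python
from typing import List, Tuple


def merge_wordpieces(tokens: List[str], labels: List[str]) -> Tuple[List[str], List[str]]:
    """Join '##' continuation pieces onto the previous token.

    Each merged word keeps the label of its first piece.
    """
    words: List[str] = []
    word_labels: List[str] = []
    for token, label in zip(tokens, labels):
        if token.startswith("##") and words:
            words[-1] += token[2:]  # append the piece without the '##' marker
        else:
            words.append(token)
            word_labels.append(label)
    return words, word_labels


words, word_labels = merge_wordpieces(["varme", "va", "##fler"], ["O", "O", "O"])
print(words)  # ['varme', 'vafler']
```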

class danlp.models.bert_models.BertNextSent(cache_dir='/home/docs/.danlp', verbose=False)¶

Bases: object

BERT language model trained for next sentence prediction. The model is trained by BotXO (https://github.com/botxo/nordic_bert) and has been converted to a PyTorch version.

Credit for code example: https://stackoverflow.com/questions/55111360/using-bert-for-next-sentence-prediction

Parameters

cache_dir (str) – the directory for storing cached models
verbose (bool) – True to increase verbosity

predict_if_next_sent(sent_A: str, sent_B: str)¶

Calculate the probability that sentence B follows sentence A.

Credit for code example: https://stackoverflow.com/questions/55111360/using-bert-for-next-sentence-prediction

Parameters

sent_A (str) – sentence A
sent_B (str) – sentence B

Returns

the probability of sentence B following sentence A

Return type

float

class danlp.models.bert_models.BertOffensive(cache_dir='/home/docs/.danlp', verbose=False)¶

Bases: object

BERT offensive language identification model.

For predicting whether a text is offensive or not.

Parameters

cache_dir (str) – the directory for storing cached models
verbose (bool) – True to increase verbosity

predict(sentence: str)¶

Predict whether a text is offensive or not.

Parameters: sentence (str) – raw text
Returns: a class – OFF (offensive) or NOT (not offensive)
Return type: str

predict_proba(sentence: str)¶

For a given sentence, returns the probabilities of it belonging to each class, i.e. OFF or NOT.

class danlp.models.bert_models.BertTone(cache_dir='/home/docs/.danlp', verbose=False)¶

Bases: object

BERT Tone model.

For classifying both the tone [subjective, objective] and the polarity [positive, neutral, negative] of sentences.

Parameters

cache_dir (str) – the directory for storing cached models
verbose (bool) – True to increase verbosity

predict(sentence: str, polarity: bool = True, analytic: bool = True)¶

Predict the polarity [positive, neutral, negative] and/or the tone [subjective, objective] of the sentence.

Parameters

sentence (str) – raw text
polarity (bool) – returns the polarity if True
analytic (bool) – returns the tone if True

Returns

a dictionary for polarity and tone results

Return type

Dict

danlp.models.bert_models.load_bert_base_model(cache_dir='/home/docs/.danlp', verbose=False)¶

Loads a BERT language model used for embedding tokens or sentences. The model is trained by BotXO: https://github.com/botxo/nordic_bert

Parameters

cache_dir (str) – the directory for storing cached models
verbose (bool) – True to increase verbosity

Returns

BERT model

danlp.models.bert_models.load_bert_emotion_model(cache_dir='/home/docs/.danlp', verbose=False)¶

Loads a BERT Emotion model.

Parameters

cache_dir (str) – the directory for storing cached models
verbose (bool) – True to increase verbosity

Returns

a BERT Emotion model

danlp.models.bert_models.load_bert_hatespeech_model(cache_dir='/home/docs/.danlp', verbose=False)¶

Loads a BERT HateSpeech model.

Parameters

cache_dir (str) – the directory for storing cached models
verbose (bool) – True to increase verbosity

Returns

a BERT HateSpeech model

danlp.models.bert_models.load_bert_ner_model(cache_dir='/home/docs/.danlp', verbose=False)¶

Loads a BERT NER model.

Parameters

cache_dir (str) – the directory for storing cached models
verbose (bool) – True to increase verbosity

Returns

a BERT NER model

danlp.models.bert_models.load_bert_nextsent_model(cache_dir='/home/docs/.danlp', verbose=False)¶

Loads a BERT language model used for next sentence prediction. The model is trained by BotXO: https://github.com/botxo/nordic_bert

Parameters

cache_dir (str) – the directory for storing cached models
verbose (bool) – True to increase verbosity

Returns

BERT NextSent model

danlp.models.bert_models.load_bert_offensive_model(cache_dir='/home/docs/.danlp', verbose=False)¶

Loads a BERT offensive language identification model.

Parameters

cache_dir (str) – the directory for storing cached models
verbose (bool) – True to increase verbosity

Returns

a BERT offensive language identification model

danlp.models.bert_models.load_bert_tone_model(cache_dir='/home/docs/.danlp', verbose=False)¶

Loads a BERT Tone model.

Parameters

cache_dir (str) – the directory for storing cached models
verbose (bool) – True to increase verbosity

Returns

a BERT Tone model

ELECTRA models¶

class danlp.models.electra_models.ElectraOffensive(cache_dir='/home/docs/.danlp', verbose=False)¶

Bases: object

Electra Offensive Model.

For detecting whether a comment is offensive or not.

Parameters

cache_dir (str) – the directory for storing cached models
verbose (bool) – True to increase verbosity

predict(sentence: str)¶

Predict whether a sentence is offensive or not

Parameters: sentence (str) – raw text
Returns: a class representing whether the sentence is offensive or not (OFF/NOT)
Return type: str

predict_proba(sentence: str)¶

For a given sentence, returns the probabilities of it belonging to each class, i.e. OFF or NOT.

danlp.models.electra_models.load_electra_offensive_model(cache_dir='/home/docs/.danlp', verbose=False)¶

Loads an Electra Offensive model.

Parameters

cache_dir (str) – the directory for storing cached models
verbose (bool) – True to increase verbosity

Returns

an Electra Offensive model

flair models¶

danlp.models.flair_models.load_flair_ner_model(cache_dir='/home/docs/.danlp', verbose=False)¶

Loads a flair model for NER.

Parameters

cache_dir (str) – the directory for storing cached models
verbose (bool) – True to increase verbosity

Returns

an NER flair model

danlp.models.flair_models.load_flair_pos_model(cache_dir='/home/docs/.danlp', verbose=False)¶

Loads a flair model for Part-of-Speech tagging.

Parameters

cache_dir (str) – the directory for storing cached models
verbose (bool) – True to increase verbosity

Returns

a POS flair model

spaCy models¶

class danlp.models.spacy_models.SpacyChunking(model=None, cache_dir='/home/docs/.danlp', verbose=False)¶

Bases: object

Spacy Chunking Model

Parameters

model (spaCy model) – a (preloaded) spaCy model
cache_dir (str) – the directory for storing cached models
verbose (bool) – True to increase verbosity

predict(text: Union[str, List[str]], bio=True)¶

Predict NP chunks from raw or tokenized text.

Parameters

text – can either be a raw text or a list of tokens
bio (bool) – True to return a list of labels in BIO format (same length as the sentence), False to return a list of tuples (start id, end id, chunk label)

Returns

NP chunks - either a list of labels in BIO format or a list of tuples (start id, end id, chunk label)

Example

“Jeg kommer fra en lille by” becomes

a list of BIO tags: [‘B-NP’, ‘O’, ‘O’, ‘B-NP’, ‘I-NP’, ‘I-NP’]
or a list of tuples : [(0, 1, ‘NP’), (3, 6, ‘NP’)]
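
The relation between the two output formats can be sketched with a small conversion function. This is an illustration only, not DaNLP's code: bio_to_spans is a hypothetical helper, it treats the end id as exclusive (as in the example above), and it leniently starts a new chunk on an I- tag that has no preceding B- tag.

```python
from typing import List, Optional, Tuple


def bio_to_spans(tags: List[str]) -> List[Tuple[int, int, str]]:
    """Convert BIO tags into (start id, end id, chunk label) tuples."""
    spans: List[Tuple[int, int, str]] = []
    start: Optional[int] = None
    label: Optional[str] = None
    for i, tag in enumerate(tags):
        # An I- tag with the same label continues the currently open chunk.
        if tag.startswith("I-") and start is not None and tag[2:] == label:
            continue
        # Anything else closes the open chunk, if there is one.
        if start is not None:
            spans.append((start, i, label))
            start, label = None, None
        # B- (or a stray I-) opens a new chunk; 'O' opens nothing.
        if tag.startswith("B-") or tag.startswith("I-"):
            start, label = i, tag[2:]
    if start is not None:
        spans.append((start, len(tags), label))
    return spans


print(bio_to_spans(["B-NP", "O", "O", "B-NP", "I-NP", "I-NP"]))
# [(0, 1, 'NP'), (3, 6, 'NP')]
```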

danlp.models.spacy_models.load_spacy_chunking_model(spacy_model=None, cache_dir='/home/docs/.danlp', verbose=False)¶

Loads a spaCy chunking model.

Parameters

spacy_model (spaCy model) – a (preloaded) spaCy model
cache_dir (str) – the directory for storing cached models
verbose (bool) – True to increase verbosity

Returns

a spaCy Chunking model

Note

A spaCy model can be loaded beforehand with load_spacy_model and passed as an argument to load_spacy_chunking_model (for instance, to avoid loading the model twice).
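
A sketch of that pattern, assuming danlp and spaCy are installed and the model can be downloaded on first use. The build_pipeline wrapper is a hypothetical name; the two loader calls use the signatures documented in this section, and the import is kept inside the function so that defining it does not load anything.

```python
def build_pipeline():
    """Load the spaCy model once and share it with the chunking model."""
    from danlp.models.spacy_models import (  # lazy import, requires danlp
        load_spacy_chunking_model,
        load_spacy_model,
    )

    nlp = load_spacy_model()
    # Passing the preloaded model avoids loading it a second time.
    chunker = load_spacy_chunking_model(spacy_model=nlp)
    return nlp, chunker
```

Calling build_pipeline() then gives both objects, for example to run chunker.predict("Jeg kommer fra en lille by").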

danlp.models.spacy_models.load_spacy_model(cache_dir='/home/docs/.danlp', verbose=False, textcat=None, vectorError=False)¶

Loads a spaCy model.

Parameters

cache_dir (str) – the directory for storing cached models
verbose (bool) – True to increase verbosity
textcat (str) – ‘sentiment’ for loading the spaCy sentiment analyser
vectorError (bool) –

Returns

a spaCy model

Warning

vectorError is a temporary workaround for an error encountered when keeping two models loaded and not being able to find the reference name for the vectors.

XLM-R models¶

class danlp.models.xlmr_models.XLMRCoref(cache_dir='/home/docs/.danlp', verbose=False)¶

Bases: object

XLM-Roberta Coreference Resolution Model.

For predicting which expressions (word or group of words) refer to the same entity in a document.

Parameters

cache_dir (str) – the directory for storing cached models
verbose (bool) – True to increase verbosity

predict(document: List[List[str]])¶

Predict coreferences in a document

Parameters: document (List[List[str]]) – segmented and tokenized text
Returns: a dictionary
Return type: Dict

predict_clusters(document: List[List[str]])¶

Predict clusters of entities in the document. Each predicted cluster contains a list of references. A reference is a tuple (ref text, start id, end id). The ids refer to the token ids in the entire document.

Parameters: document (List[List[str]]) – segmented and tokenized text
Returns: a list of clusters
Return type: List[List[Tuple]]
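
Because the ids refer to token positions in the entire document, it can help to map them back to a sentence and a position within it. The sketch below shows one way to do this for a segmented and tokenized document; sentence_offsets and locate are hypothetical helpers, the sample document is made up for illustration, and the code assumes that token ids count from 0.

```python
from typing import List, Tuple


def sentence_offsets(document: List[List[str]]) -> List[int]:
    """Document-level id of the first token of each sentence."""
    offsets: List[int] = []
    total = 0
    for sentence in document:
        offsets.append(total)
        total += len(sentence)
    return offsets


def locate(document: List[List[str]], token_id: int) -> Tuple[int, int]:
    """Map a document-level token id to (sentence index, position in sentence)."""
    if token_id < 0:
        raise IndexError("token id must be non-negative")
    for sent_idx, sentence in enumerate(document):
        if token_id < len(sentence):
            return sent_idx, token_id
        token_id -= len(sentence)
    raise IndexError("token id outside the document")


# Made-up example document: a list of sentences, each a list of tokens.
document = [
    ["Lotte", "arbejder", "med", "Mads", "."],
    ["Hun", "er", "tandlæge", "."],
]
sent_idx, pos = locate(document, 5)
print(document[sent_idx][pos])  # Hun
```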

class danlp.models.xlmr_models.XlmrNed(cache_dir='/home/docs/.danlp', verbose=False)¶

Bases: object

XLM-Roberta for Named Entity Disambiguation.

For predicting whether or not a specific entity (QID) is mentioned in a sentence.

Parameters

cache_dir (str) – the directory for storing cached models
verbose (bool) – True to increase verbosity

predict(sentence: str, kg_context: str)¶

Predict whether a QID is mentioned in a sentence or not.

Parameters

sentence (str) – raw text
kg_context (str) – raw text

Returns

Return type

str

danlp.models.xlmr_models.load_xlmr_coref_model(cache_dir='/home/docs/.danlp', verbose=False)¶

Loads an XLM-R coreference model.

Parameters

cache_dir (str) – the directory for storing cached models
verbose (bool) – True to increase verbosity

Returns

an XLM-R coreference model

danlp.models.xlmr_models.load_xlmr_ned_model(cache_dir='/home/docs/.danlp', verbose=False)¶

Loads an XLM-R model for named entity disambiguation.

Parameters

cache_dir (str) – the directory for storing cached models
verbose (bool) – True to increase verbosity

Returns

an XLM-R NED model