Models

Embeddings

This module provides you with functions for loading pretrained Danish word embeddings through several NLP frameworks:

  • flair

  • spaCy

  • Gensim

Available word embeddings:

  • wiki.da.wv

  • cc.da.wv

  • conll17.da.wv

  • news.da.wv

  • sketchengine.da.wv

  • dslreddit.da.wv

Available subword embeddings:

  • wiki.da.swv

  • cc.da.swv

  • sketchengine.da.swv

danlp.models.embeddings.AVAILABLE_EMBEDDINGS = ['wiki.da.wv', 'cc.da.wv', 'conll17.da.wv', 'news.da.wv', 'sketchengine.da.wv', 'dslreddit.da.wv']
danlp.models.embeddings.AVAILABLE_SUBWORD_EMBEDDINGS = ['wiki.da.swv', 'cc.da.swv', 'sketchengine.da.swv']
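
As a minimal sketch (not part of DaNLP itself), the two lists above can be used to validate an embedding name before attempting a download. The helper name embedding_kind is hypothetical, and the lists are copied from the values shown above rather than imported, so the snippet runs without DaNLP installed.

```python
# Copied from the values documented above; in real code these could be
# imported from danlp.models.embeddings instead.
AVAILABLE_EMBEDDINGS = ['wiki.da.wv', 'cc.da.wv', 'conll17.da.wv',
                        'news.da.wv', 'sketchengine.da.wv', 'dslreddit.da.wv']
AVAILABLE_SUBWORD_EMBEDDINGS = ['wiki.da.swv', 'cc.da.swv',
                                'sketchengine.da.swv']


def embedding_kind(name: str) -> str:
    """Return 'word' or 'subword' for a known name (hypothetical helper)."""
    if name in AVAILABLE_EMBEDDINGS:
        return "word"
    if name in AVAILABLE_SUBWORD_EMBEDDINGS:
        return "subword"
    raise ValueError(f"Unknown pretrained embedding: {name!r}")


print(embedding_kind("wiki.da.wv"))  # word
print(embedding_kind("cc.da.swv"))   # subword
```
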
danlp.models.embeddings.assert_wv_dimensions(wv: gensim.models.keyedvectors.Word2VecKeyedVectors, pretrained_embedding: str)

This function checks the dimensions of the word embeddings wv against the data stored in WORD_EMBEDDINGS.

Parameters
  • wv (gensim.models.KeyedVectors) – word embeddings

  • pretrained_embedding (str) – the name of the pretrained embeddings

danlp.models.embeddings.load_context_embeddings_with_flair(direction='bi', word_embeddings=None, cache_dir='/home/docs/.danlp', verbose=False)

Loads contextual (dynamic) word embeddings with flair.

Parameters
  • direction (str) – bidirectional ‘bi’, forward ‘fwd’ or backward ‘bwd’

  • word_embeddings

  • cache_dir (str) – the directory for storing cached models

  • verbose (bool) – True to increase verbosity

danlp.models.embeddings.load_keras_embedding_layer(pretrained_embedding: str, cache_dir='/home/docs/.danlp', verbose=False, **kwargs)

Loads a Keras Embedding layer.

Parameters
  • pretrained_embedding (str) – the name of the pretrained embeddings

  • cache_dir (str) – the directory for storing cached models

  • verbose (bool) – True to increase verbosity

  • kwargs – used to forward arguments to the Keras Embedding layer

Returns

a Keras Embedding layer and index to word dictionary

danlp.models.embeddings.load_pytorch_embedding_layer(pretrained_embedding: str, cache_dir='/home/docs/.danlp', verbose=False)

Loads a pytorch embedding layer.

Parameters
  • pretrained_embedding (str) – the name of the pretrained embeddings

  • cache_dir (str) – the directory for storing cached models

  • verbose (bool) – True to increase verbosity

Returns

a pytorch Embedding module and a list id2word

danlp.models.embeddings.load_wv_with_gensim(pretrained_embedding: str, cache_dir='/home/docs/.danlp', verbose: bool = False)

Loads word embeddings with Gensim.

Parameters
  • pretrained_embedding (str) – the name of the pretrained embeddings

  • cache_dir – the directory for storing cached data

  • verbose (bool) – True to increase verbosity

Returns

KeyedVectors or FastTextKeyedVectors

danlp.models.embeddings.load_wv_with_spacy(pretrained_embedding: str, cache_dir: str = '/home/docs/.danlp', verbose=False)

Loads a spaCy model with pretrained embeddings.

Parameters
  • pretrained_embedding (str) – the name of the pretrained embeddings

  • cache_dir (str) – the directory for storing cached data

  • verbose (bool) – True to increase verbosity

Returns

spaCy model

BERT models

class danlp.models.bert_models.BertBase(cache_dir='/home/docs/.danlp', verbose=False)

Bases: object

BERT language model used for embedding tokens or sentences. The model was trained by BotXO: https://github.com/botxo/nordic_bert and has been converted to a pytorch version.

Credit for code example: https://mccormickml.com/2019/05/14/BERT-word-embeddings-tutorial/

Parameters
  • cache_dir (str) – the directory for storing cached models

  • verbose (bool) – True to increase verbosity

embed_text(text)

Calculate the embedding of each token in a sentence and the embedding of the sentence itself, based on a BERT language model. The embedding of a token is the concatenation of the last four layers, and the sentence embedding is the mean of the second-to-last layer over all tokens in the sentence. The BERT tokenizer splits unknown (UNK) words into subwords, so the tokenized sentence is returned as well. The embeddings of the special tokens are not returned.

Parameters

text (str) – raw text

Returns

three lists: token_embeddings (dim: tokens x 3072), sentence_embedding (1x768), tokenized_text

Return type

list, list, list
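
The pooling described above can be illustrated with a small pure-Python sketch. It is not the DaNLP implementation: the function names, the toy sizes and the concatenation order are assumptions made for illustration. With a hidden size of 768 (implied by the documented token dimension, 3072 = 4 x 768), the same logic gives 3072 values per token and 768 values for the sentence.

```python
from typing import List

Vector = List[float]


def token_embedding(layers: List[List[Vector]], token: int) -> Vector:
    """Concatenate one token's vectors from the last four layers."""
    out: Vector = []
    for layer in layers[-4:]:
        out.extend(layer[token])
    return out


def sentence_embedding(layers: List[List[Vector]]) -> Vector:
    """Average the second-to-last layer over all tokens."""
    layer = layers[-2]
    hidden = len(layer[0])
    return [sum(vec[i] for vec in layer) / len(layer) for i in range(hidden)]


# Toy input: 12 layers, 3 tokens, hidden size 4 (assumed sizes, illustration only).
# layers[l][t] is the vector of token t at layer l.
layers = [[[float(l)] * 4 for _ in range(3)] for l in range(12)]

print(len(token_embedding(layers, 0)))  # 16, i.e. 4 layers x hidden size 4
print(sentence_embedding(layers))       # [10.0, 10.0, 10.0, 10.0]
```
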

class danlp.models.bert_models.BertEmotion(cache_dir='/home/docs/.danlp', verbose=False)

Bases: object

BERT Emotion model.

For classifying whether there is emotion in the text, and recognizing amongst eight emotions.

Parameters
  • cache_dir (str) – the directory for storing cached models

  • verbose (bool) – True to increase verbosity

predict(sentence: str, no_emotion=False)

Predicts emotion among:

  • 0: Glæde/Sindsro

  • 1: Tillid/Accept

  • 2: Forventning/Interrese

  • 3: Overasket/Målløs

  • 4: Vrede/Irritation

  • 5: Foragt/Modvilje

  • 6: Sorg/trist

  • 7: Frygt/Bekymret

Parameters
  • sentence (str) – raw text

  • no_emotion (bool) – whether there is emotion or not in the text

Returns

index of the emotion

Return type

int
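
A minimal sketch for turning the documented index into a readable label. It assumes that predict returns the integer index as stated above; if the installed version returns the label string directly, this lookup is unnecessary. The helper name emotion_label is hypothetical, and the labels are copied verbatim from the list above.

```python
# Labels copied verbatim from the documented list (index = position).
EMOTION_LABELS = [
    "Glæde/Sindsro",
    "Tillid/Accept",
    "Forventning/Interrese",
    "Overasket/Målløs",
    "Vrede/Irritation",
    "Foragt/Modvilje",
    "Sorg/trist",
    "Frygt/Bekymret",
]


def emotion_label(index: int) -> str:
    """Map an emotion index (0-7) to its label (hypothetical helper)."""
    if not 0 <= index < len(EMOTION_LABELS):
        raise ValueError(f"Emotion index out of range: {index}")
    return EMOTION_LABELS[index]


print(emotion_label(4))  # Vrede/Irritation
```
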

predict_if_emotion(sentence)

Predicts whether there is emotion in the text.

Parameters

sentence (str) – raw sentence

Returns

0 if no emotion else 1

Return type

int

predict_proba(sentence: str, emotions=True, no_emotion=True)

Predicts the probabilities of emotions.

Parameters
  • sentence (str) – raw text

  • emotions (bool) – whether to return the probability of the emotion

  • no_emotion (bool) – whether to return the probability of the sentence being emotional

Returns

a list of probabilities

Return type

List

class danlp.models.bert_models.BertHateSpeech(cache_dir='/home/docs/.danlp', verbose=False)

Bases: object

BERT HateSpeech Model.

For detecting whether a comment is offensive or not and, if offensive, predicting what type of hate speech it is. The model is meant to help moderate online comments (and therefore also detects and categorizes spam). The model can predict the following categories:

  • Særlig opmærksomhed

  • Personangreb

  • Sprogbrug

  • Spam & indhold

Parameters
  • cache_dir (str) – the directory for storing cached models

  • verbose (bool) – True to increase verbosity

predict(sentence: str, offensive: bool = True, hatespeech: bool = True)

Predict whether a sentence is offensive or not and/or the class of the sentence [Særlig opmærksomhed, Personangreb, Sprogbrug, Spam & indhold]

Parameters
  • sentence (str) – raw text

  • offensive (bool) – if True returns whether the sentence is offensive or not

  • hatespeech (bool) – if True returns the type of hate speech the sentence belongs to

Returns

a dictionary for offensive language and hate speech detection results

Return type

Dict

class danlp.models.bert_models.BertNer(cache_dir='/home/docs/.danlp', verbose=False)

Bases: object

BERT NER model

Parameters
  • cache_dir (str) – the directory for storing cached models

  • verbose (bool) – True to increase verbosity

predict(text: Union[str, List[str]], IOBformat=True)

Predict NER labels from raw text or tokenized text. If the text is a raw string, this method will return the string tokenized with BERT's subword tokens.

Parameters
  • text – can either be a raw text or a list of tokens

  • IOBformat – True to return the output in IOB format, False to return a dictionary. It can only be False if the text input is a list of tokens

Returns

the tokenized text and the predicted labels in IOB format, or a dictionary with the tags and position

Example

“varme vafler” becomes [“varme”, “va”, “##fler”]
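
When whole words are needed again, the subword tokens can be merged back together. The sketch below relies on the WordPiece convention, visible in the example above, that a continuation piece starts with “##”. The helper name merge_wordpieces is hypothetical and not part of DaNLP.

```python
from typing import List


def merge_wordpieces(tokens: List[str]) -> List[str]:
    """Join '##' continuation pieces onto the preceding token."""
    words: List[str] = []
    for tok in tokens:
        if tok.startswith("##") and words:
            # Continuation piece: append it without the '##' marker.
            words[-1] += tok[2:]
        else:
            words.append(tok)
    return words


print(merge_wordpieces(["varme", "va", "##fler"]))  # ['varme', 'vafler']
```
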

class danlp.models.bert_models.BertNextSent(cache_dir='/home/docs/.danlp', verbose=False)

Bases: object

BERT language model trained for next sentence prediction. The model was trained by BotXO: https://github.com/botxo/nordic_bert and has been converted to a pytorch version.

Credit for code example: https://stackoverflow.com/questions/55111360/using-bert-for-next-sentence-prediction

Parameters
  • cache_dir (str) – the directory for storing cached models

  • verbose (bool) – True to increase verbosity

predict_if_next_sent(sent_A: str, sent_B: str)

Calculate the probability that sentence B follows sentence A.

Credit for code example: https://stackoverflow.com/questions/55111360/using-bert-for-next-sentence-prediction

Parameters
  • sent_A (str) – sentence A

  • sent_B (str) – sentence B

Returns

the probability of sentence B following sentence A

Return type

float

class danlp.models.bert_models.BertOffensive(cache_dir='/home/docs/.danlp', verbose=False)

Bases: object

BERT offensive language identification model.

For predicting whether a text is offensive or not.

Parameters
  • cache_dir (str) – the directory for storing cached models

  • verbose (bool) – True to increase verbosity

predict(sentence: str)

Predict whether a text is offensive or not.

Parameters

sentence (str) – raw text

Returns

a class – OFF (offensive) or NOT (not offensive)

Return type

str

predict_proba(sentence: str)

For a given sentence, return its probabilities of belonging to each class, i.e. OFF or NOT

class danlp.models.bert_models.BertTone(cache_dir='/home/docs/.danlp', verbose=False)

Bases: object

BERT Tone model.

For classifying both the tone [subjective, objective] and the polarity [positive, neutral, negative] of sentences.

Parameters
  • cache_dir (str) – the directory for storing cached models

  • verbose (bool) – True to increase verbosity

predict(sentence: str, polarity: bool = True, analytic: bool = True)

Predict the polarity [positive, neutral, negative] and/or the tone [subjective, objective] of the sentence.

Parameters
  • sentence (str) – raw text

  • polarity (bool) – returns the polarity if True

  • analytic (bool) – returns the tone if True

Returns

a dictionary for polarity and tone results

Return type

Dict

danlp.models.bert_models.load_bert_base_model(cache_dir='/home/docs/.danlp', verbose=False)

Loads the BERT language model used for embedding tokens or sentences. The model was trained by BotXO: https://github.com/botxo/nordic_bert

Parameters
  • cache_dir (str) – the directory for storing cached models

  • verbose (bool) – True to increase verbosity

Returns

BERT model

danlp.models.bert_models.load_bert_emotion_model(cache_dir='/home/docs/.danlp', verbose=False)

Loads a BERT Emotion model.

Parameters
  • cache_dir (str) – the directory for storing cached models

  • verbose (bool) – True to increase verbosity

Returns

a BERT Emotion model

danlp.models.bert_models.load_bert_hatespeech_model(cache_dir='/home/docs/.danlp', verbose=False)

Loads a BERT HateSpeech model.

Parameters
  • cache_dir (str) – the directory for storing cached models

  • verbose (bool) – True to increase verbosity

Returns

a BERT HateSpeech model

danlp.models.bert_models.load_bert_ner_model(cache_dir='/home/docs/.danlp', verbose=False)

Loads a BERT NER model.

Parameters
  • cache_dir (str) – the directory for storing cached models

  • verbose (bool) – True to increase verbosity

Returns

a BERT NER model

danlp.models.bert_models.load_bert_nextsent_model(cache_dir='/home/docs/.danlp', verbose=False)

Loads the BERT language model used for next sentence prediction. The model was trained by BotXO: https://github.com/botxo/nordic_bert

Parameters
  • cache_dir (str) – the directory for storing cached models

  • verbose (bool) – True to increase verbosity

Returns

BERT NextSent model

danlp.models.bert_models.load_bert_offensive_model(cache_dir='/home/docs/.danlp', verbose=False)

Loads a BERT offensive language identification model.

Parameters
  • cache_dir (str) – the directory for storing cached models

  • verbose (bool) – True to increase verbosity

Returns

a BERT offensive language identification model

danlp.models.bert_models.load_bert_tone_model(cache_dir='/home/docs/.danlp', verbose=False)

Loads a BERT Tone model.

Parameters
  • cache_dir (str) – the directory for storing cached models

  • verbose (bool) – True to increase verbosity

Returns

a BERT Tone model

ELECTRA models

class danlp.models.electra_models.ElectraOffensive(cache_dir='/home/docs/.danlp', verbose=False)

Bases: object

Electra Offensive Model.

For detecting whether a comment is offensive or not.

Parameters
  • cache_dir (str) – the directory for storing cached models

  • verbose (bool) – True to increase verbosity

predict(sentence: str)

Predict whether a sentence is offensive or not

Parameters

sentence (str) – raw text

Returns

a class representing whether the sentence is offensive or not (OFF/NOT)

Return type

str

predict_proba(sentence: str)

For a given sentence, return its probabilities of belonging to each class, i.e. OFF or NOT

danlp.models.electra_models.load_electra_offensive_model(cache_dir='/home/docs/.danlp', verbose=False)

Loads an Electra Offensive model.

Parameters
  • cache_dir (str) – the directory for storing cached models

  • verbose (bool) – True to increase verbosity

Returns

an Electra Offensive model

flair models

danlp.models.flair_models.load_flair_ner_model(cache_dir='/home/docs/.danlp', verbose=False)

Loads a flair model for NER.

Parameters
  • cache_dir (str) – the directory for storing cached models

  • verbose (bool) – True to increase verbosity

Returns

an NER flair model

danlp.models.flair_models.load_flair_pos_model(cache_dir='/home/docs/.danlp', verbose=False)

Loads a flair model for Part-of-Speech tagging.

Parameters
  • cache_dir (str) – the directory for storing cached models

  • verbose (bool) – True to increase verbosity

Returns

a POS flair model

spaCy models

class danlp.models.spacy_models.SpacyChunking(model=None, cache_dir='/home/docs/.danlp', verbose=False)

Bases: object

Spacy Chunking Model

Parameters
  • model (spaCy model) – a (preloaded) spaCy model

  • cache_dir (str) – the directory for storing cached models

  • verbose (bool) – True to increase verbosity

predict(text: Union[str, List[str]], bio=True)

Predict NP chunks from raw or tokenized text.

Parameters
  • text – can either be a raw text or a list of tokens

  • bio (bool) – True to return a list of labels in BIO format (same length as the sentence), False to return a list of tuples (start id, end id, chunk label)

Returns

NP chunks - either a list of labels in BIO format or a list of tuples (start id, end id, chunk label)

Example

“Jeg kommer fra en lille by” becomes

  • a list of BIO tags: [‘B-NP’, ‘O’, ‘O’, ‘B-NP’, ‘I-NP’, ‘I-NP’]

  • or a list of tuples: [(0, 1, ‘NP’), (3, 6, ‘NP’)]
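
The relation between the two output formats can be sketched in plain Python. This is an illustrative conversion, not the DaNLP implementation; the helper name bio_to_spans is hypothetical. It treats the end id as exclusive, which matches the example above.

```python
from typing import List, Optional, Tuple


def bio_to_spans(tags: List[str]) -> List[Tuple[int, int, str]]:
    """Convert BIO tags to (start id, end id, chunk label) tuples."""
    spans: List[Tuple[int, int, str]] = []
    start: Optional[int] = None
    label: Optional[str] = None
    for i, tag in enumerate(tags):
        # An open span ends at "O" or at the start of a new chunk.
        if (tag == "O" or tag.startswith("B-")) and start is not None:
            spans.append((start, i, label))
            start, label = None, None
        if tag.startswith("B-"):
            start, label = i, tag[2:]
        elif tag.startswith("I-") and start is None:
            # Tolerate an I- tag without a preceding B- by opening a span.
            start, label = i, tag[2:]
    if start is not None:
        spans.append((start, len(tags), label))
    return spans


tags = ["B-NP", "O", "O", "B-NP", "I-NP", "I-NP"]
print(bio_to_spans(tags))  # [(0, 1, 'NP'), (3, 6, 'NP')]
```
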

danlp.models.spacy_models.load_spacy_chunking_model(spacy_model=None, cache_dir='/home/docs/.danlp', verbose=False)

Loads a spaCy chunking model.

Parameters
  • spacy_model (spaCy model) – a (preloaded) spaCy model

  • cache_dir (str) – the directory for storing cached models

  • verbose (bool) – True to increase verbosity

Returns

a spaCy Chunking model

Note

A spaCy model can be previously loaded using load_spacy_model and given as an argument to load_spacy_chunking_model (for instance, to avoid loading the model twice)
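
The note above could be applied roughly as follows. This is a sketch that uses only the two loader functions and parameters documented in this section; the wrapper name build_chunker is hypothetical, and calling it requires DaNLP to be installed (the models are downloaded to the cache directory on first use).

```python
def build_chunker(cache_dir=None):
    """Load the spaCy model once and hand it to the chunking model."""
    # Imported lazily so that defining this helper does not require DaNLP.
    from danlp.models.spacy_models import (
        load_spacy_chunking_model,
        load_spacy_model,
    )

    # Only forward cache_dir when given, so the library default is kept.
    kwargs = {} if cache_dir is None else {"cache_dir": cache_dir}
    nlp = load_spacy_model(**kwargs)
    chunker = load_spacy_chunking_model(spacy_model=nlp, **kwargs)
    return nlp, chunker
```

With DaNLP installed, nlp, chunker = build_chunker() followed by chunker.predict("Jeg kommer fra en lille by") would reuse the already loaded spaCy model instead of loading it a second time.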

danlp.models.spacy_models.load_spacy_model(cache_dir='/home/docs/.danlp', verbose=False, textcat=None, vectorError=False)

Loads a spaCy model.

Parameters
  • cache_dir (str) – the directory for storing cached models

  • verbose (bool) – True to increase verbosity

  • textcat (str) – ‘sentiment’ for loading the spaCy sentiment analyser

  • vectorError (bool) –

Returns

a spaCy model

Warning

vectorError is a temporary workaround for an error encountered when keeping two models and not being able to find the reference name for the vectors.

XLM-R models

class danlp.models.xlmr_models.XLMRCoref(cache_dir='/home/docs/.danlp', verbose=False)

Bases: object

XLM-Roberta Coreference Resolution Model.

For predicting which expressions (word or group of words) refer to the same entity in a document.

Parameters
  • cache_dir (str) – the directory for storing cached models

  • verbose (bool) – True to increase verbosity

predict(document: List[List[str]])

Predict coreferences in a document

Parameters

document (List[List[str]]) – segmented and tokenized text

Returns

a dictionary

Return type

Dict

predict_clusters(document: List[List[str]])

Predict clusters of entities in the document. Each predicted cluster contains a list of references. A reference is a tuple (ref text, start id, end id). The ids refer to the token ids in the entire document.

Parameters

document (List[List[str]]) – segmented and tokenized text

Returns

a list of clusters

Return type

List[List[Tuple]]
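
Because the ids in each reference are counted over the entire document, it can help to flatten the segmented input and keep the offset of each sentence. The sketch below is illustrative only: flatten_document is a hypothetical helper, the Danish example sentences are made up, and whether the end id is inclusive or exclusive is not stated here and should be checked against actual model output.

```python
from typing import List, Tuple


def flatten_document(document: List[List[str]]) -> Tuple[List[str], List[int]]:
    """Return all tokens of the document and the start offset of each sentence."""
    tokens: List[str] = []
    offsets: List[int] = []
    for sentence in document:
        offsets.append(len(tokens))  # document-level id of the sentence's first token
        tokens.extend(sentence)
    return tokens, offsets


# Segmented and tokenized text: one inner list per sentence.
doc = [
    ["Lotte", "arbejder", "med", "Mads", "."],
    ["Hun", "kender", "ham", "godt", "."],
]
tokens, offsets = flatten_document(doc)
print(offsets)    # [0, 5]
print(tokens[5])  # Hun
```
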

class danlp.models.xlmr_models.XlmrNed(cache_dir='/home/docs/.danlp', verbose=False)

Bases: object

XLM-Roberta for Named Entity Disambiguation.

For predicting whether or not a specific entity (QID) is mentioned in a sentence.

Parameters
  • cache_dir (str) – the directory for storing cached models

  • verbose (bool) – True to increase verbosity

predict(sentence: str, kg_context: str)

Predict whether a QID is mentioned in a sentence or not.

Parameters
  • sentence (str) – raw text

  • kg_context (str) – raw text

Returns

Return type

str

danlp.models.xlmr_models.load_xlmr_coref_model(cache_dir='/home/docs/.danlp', verbose=False)

Loads an XLM-R coreference model.

Parameters
  • cache_dir (str) – the directory for storing cached models

  • verbose (bool) – True to increase verbosity

Returns

an XLM-R coreference model

danlp.models.xlmr_models.load_xlmr_ned_model(cache_dir='/home/docs/.danlp', verbose=False)

Loads an XLM-R model for named entity disambiguation.

Parameters
  • cache_dir (str) – the directory for storing cached models

  • verbose (bool) – True to increase verbosity

Returns

an XLM-R NED model