Models
Embeddings
This module provides functions for loading pretrained Danish word embeddings through several NLP frameworks:
flair
spaCy
Gensim
Available word embeddings:
wiki.da.wv
cc.da.wv
conll17.da.wv
news.da.wv
sketchengine.da.wv
dslreddit.da.wv
Available subword embeddings:
wiki.da.swv
cc.da.swv
sketchengine.da.swv
-
danlp.models.embeddings.
AVAILABLE_EMBEDDINGS
= ['wiki.da.wv', 'cc.da.wv', 'conll17.da.wv', 'news.da.wv', 'sketchengine.da.wv', 'dslreddit.da.wv']¶
-
danlp.models.embeddings.
AVAILABLE_SUBWORD_EMBEDDINGS
= ['wiki.da.swv', 'cc.da.swv', 'sketchengine.da.swv']¶
-
danlp.models.embeddings.
assert_wv_dimensions
(wv: gensim.models.keyedvectors.Word2VecKeyedVectors, pretrained_embedding: str)¶ Checks the dimensions of the word embeddings wv against the values stored in WORD_EMBEDDINGS for the given pretrained embedding.
- Parameters
wv (gensim.models.KeyedVectors) – word embeddings
pretrained_embedding (str) – the name of the pretrained embeddings
-
danlp.models.embeddings.
load_context_embeddings_with_flair
(direction='bi', word_embeddings=None, cache_dir='/home/docs/.danlp', verbose=False)¶ Loads contextual (dynamic) word embeddings with flair.
- Parameters
direction (str) – bidirectional ‘bi’, forward ‘fwd’ or backward ‘bwd’
word_embeddings –
cache_dir (str) – the directory for storing cached models
verbose (bool) – True to increase verbosity
-
danlp.models.embeddings.
load_keras_embedding_layer
(pretrained_embedding: str, cache_dir='/home/docs/.danlp', verbose=False, **kwargs)¶ Loads a Keras Embedding layer.
- Parameters
pretrained_embedding (str) – the name of the pretrained embeddings
cache_dir (str) – the directory for storing cached models
verbose (bool) – True to increase verbosity
kwargs – used to forward arguments to the Keras Embedding layer
- Returns
a Keras Embedding layer and an index-to-word dictionary
-
danlp.models.embeddings.
load_pytorch_embedding_layer
(pretrained_embedding: str, cache_dir='/home/docs/.danlp', verbose=False)¶ Loads a PyTorch embedding layer.
- Parameters
pretrained_embedding (str) – the name of the pretrained embeddings
cache_dir (str) – the directory for storing cached models
verbose (bool) – True to increase verbosity
- Returns
a pytorch Embedding module and a list id2word
-
danlp.models.embeddings.
load_wv_with_gensim
(pretrained_embedding: str, cache_dir='/home/docs/.danlp', verbose: bool = False)¶ Loads word embeddings with Gensim.
- Parameters
pretrained_embedding (str) – the name of the pretrained embeddings
cache_dir – the directory for storing cached data
verbose (bool) – True to increase verbosity
- Returns
KeyedVectors or FastTextKeyedVectors
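As a rough usage sketch, the helper below validates a name against the embedding names listed above before passing it to load_wv_with_gensim. The helper names check_embedding_name and load_danish_vectors are hypothetical and not part of danlp. The danlp import is deferred, so the validation part runs without the library installed.

```python
# Names copied from AVAILABLE_EMBEDDINGS and AVAILABLE_SUBWORD_EMBEDDINGS above.
AVAILABLE_EMBEDDINGS = ['wiki.da.wv', 'cc.da.wv', 'conll17.da.wv',
                        'news.da.wv', 'sketchengine.da.wv', 'dslreddit.da.wv']
AVAILABLE_SUBWORD_EMBEDDINGS = ['wiki.da.swv', 'cc.da.swv', 'sketchengine.da.swv']


def check_embedding_name(name):
    """Return `name` unchanged if it is a documented embedding, else raise."""
    known = AVAILABLE_EMBEDDINGS + AVAILABLE_SUBWORD_EMBEDDINGS
    if name not in known:
        raise ValueError(f"unknown embedding {name!r}; expected one of {known}")
    return name


def load_danish_vectors(name='wiki.da.wv'):
    """Hypothetical wrapper: validate the name, then load with Gensim."""
    # Deferred import so that check_embedding_name works without danlp.
    from danlp.models.embeddings import load_wv_with_gensim
    return load_wv_with_gensim(check_embedding_name(name))
```

The object returned by load_danish_vectors is the KeyedVectors (or FastTextKeyedVectors) instance described above, so it can be queried with the usual Gensim methods such as most_similar.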
-
danlp.models.embeddings.
load_wv_with_spacy
(pretrained_embedding: str, cache_dir: str = '/home/docs/.danlp', verbose=False)¶ Loads a spaCy model with pretrained embeddings.
- Parameters
pretrained_embedding (str) – the name of the pretrained embeddings
cache_dir (str) – the directory for storing cached data
verbose (bool) – True to increase verbosity
- Returns
spaCy model
BERT models
-
class
danlp.models.bert_models.
BertBase
(cache_dir='/home/docs/.danlp', verbose=False)¶ Bases:
object
BERT language model used for embedding tokens or sentences. The model is trained by BotXO (https://github.com/botxo/nordic_bert) and has been converted to a PyTorch version.
Credit for code example: https://mccormickml.com/2019/05/14/BERT-word-embeddings-tutorial/
- Parameters
cache_dir (str) – the directory for storing cached models
verbose (bool) – True to increase verbosity
-
embed_text
(text)¶ Calculates the embedding of each token in a sentence and the embedding of the sentence itself, based on a BERT language model. The embedding of a token is the concatenation of the last four layers, and the sentence embedding is the mean of the second-to-last layer over all tokens in the sentence. The BERT tokenizer splits unknown (UNK) words into subwords, so the tokenized sentence is returned as well. The embeddings of the special tokens are not returned.
- Parameters
text (str) – raw text
- Returns
three lists: token_embeddings (dim: tokens x 3072), sentence_embedding (1x768), tokenized_text
- Return type
list, list, list
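The pooling described for embed_text can be illustrated with plain Python lists. This is a minimal sketch on toy data, not the library's implementation. It assumes a hidden size of 768 (the size at which four concatenated layers give 3072 values), and the function names and the order in which layers are concatenated are illustrative assumptions.

```python
HIDDEN_SIZE = 768  # assumption: hidden size of a BERT-base model (4 * 768 = 3072)


def token_embedding(layer_vectors):
    """Concatenate the vectors of the last four layers for one token."""
    return [value for layer in layer_vectors[-4:] for value in layer]


def sentence_embedding(token_vectors):
    """Average one layer's token vectors component-wise over all tokens."""
    count = len(token_vectors)
    return [sum(column) / count for column in zip(*token_vectors)]


# Toy data: 12 layers for a single token, and one layer of a 3-token sentence.
layers_for_one_token = [[float(layer)] * HIDDEN_SIZE for layer in range(12)]
second_to_last_layer = [[1.0] * HIDDEN_SIZE,
                        [2.0] * HIDDEN_SIZE,
                        [3.0] * HIDDEN_SIZE]

token_vec = token_embedding(layers_for_one_token)
sentence_vec = sentence_embedding(second_to_last_layer)
print(len(token_vec), len(sentence_vec))
```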
-
class
danlp.models.bert_models.
BertEmotion
(cache_dir='/home/docs/.danlp', verbose=False)¶ Bases:
object
BERT Emotion model.
For classifying whether a text expresses emotion and for recognizing which of eight emotions it expresses.
- Parameters
cache_dir (str) – the directory for storing cached models
verbose (bool) – True to increase verbosity
-
predict
(sentence: str, no_emotion=False)¶ Predicts emotion among:
0: Glæde/Sindsro
1: Tillid/Accept
2: Forventning/Interrese
3: Overasket/Målløs
4: Vrede/Irritation
5: Foragt/Modvilje
6: Sorg/trist
7: Frygt/Bekymret
- Parameters
sentence (str) – raw text
no_emotion (bool) – whether there is emotion or not in the text
- Returns
index of the emotion
- Return type
int
-
predict_if_emotion
(sentence)¶ Predicts whether there is emotion in the text.
- Parameters
sentence (str) – raw sentence
- Returns
0 if no emotion else 1
- Return type
int
-
predict_proba
(sentence: str, emotions=True, no_emotion=True)¶ Predicts the probabilities of emotions.
- Parameters
sentence (str) – raw text
emotions (bool) – whether to return the probability of the emotion
no_emotion (bool) – whether to return the probability of the sentence being emotional
- Returns
a list of probabilities
- Return type
List
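A small sketch of how a list of emotion probabilities could be mapped back to the indices and labels listed under predict. It assumes a flat list of eight probabilities ordered like that index list. The actual structure returned by predict_proba is not specified here, so treat the input format and the helper name most_likely_emotion as assumptions.

```python
# Labels copied verbatim from the index list documented for predict.
EMOTIONS = [
    'Glæde/Sindsro',
    'Tillid/Accept',
    'Forventning/Interrese',
    'Overasket/Målløs',
    'Vrede/Irritation',
    'Foragt/Modvilje',
    'Sorg/trist',
    'Frygt/Bekymret',
]


def most_likely_emotion(probabilities):
    """Return (index, label) of the highest probability.

    Assumes `probabilities` is a flat list with one value per emotion,
    in the same order as EMOTIONS.
    """
    if len(probabilities) != len(EMOTIONS):
        raise ValueError("expected one probability per emotion")
    index = max(range(len(probabilities)), key=probabilities.__getitem__)
    return index, EMOTIONS[index]


# Made-up probabilities, for illustration only.
example = [0.05, 0.60, 0.10, 0.05, 0.05, 0.05, 0.05, 0.05]
print(most_likely_emotion(example))
```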
-
class
danlp.models.bert_models.
BertHateSpeech
(cache_dir='/home/docs/.danlp', verbose=False)¶ Bases:
object
BERT HateSpeech Model.
For detecting whether a comment is offensive and, if it is, predicting which type of hate speech it is. The model is meant to help moderate online comments (and therefore also detects and categorizes spam). The model can predict the following categories:
Særlig opmærksomhed
Personangreb
Sprogbrug
Spam & indhold
- Parameters
cache_dir (str) – the directory for storing cached models
verbose (bool) – True to increase verbosity
-
predict
(sentence: str, offensive: bool = True, hatespeech: bool = True)¶ Predict whether a sentence is offensive or not and/or the class of the sentence [Særlig opmærksomhed, Personangreb, Sprogbrug, Spam & indhold]
- Parameters
sentence (str) – raw text
offensive (bool) – if True returns whether the sentence is offensive or not
hatespeech (bool) – if True returns the type of hate speech the sentence belongs to
- Returns
a dictionary for offensive language and hate speech detection results
- Return type
Dict
-
class
danlp.models.bert_models.
BertNer
(cache_dir='/home/docs/.danlp', verbose=False)¶ Bases:
object
BERT NER model
- Parameters
cache_dir (str) – the directory for storing cached models
verbose (bool) – True to increase verbosity
-
predict
(text: Union[str, List[str]], IOBformat=True)¶ Predict NER labels from raw text or tokenized text. If the text is a raw string, this method returns the string tokenized into BERT's subword tokens.
- Parameters
text – can either be a raw text or a list of tokens
IOBformat (bool) – specifies whether the output should be in IOB format (True) or a dictionary (False). Can only be False if the text input is a list of tokens.
- Returns
the tokenized text and the predicted labels in IOB format, or a dictionary with the tags and position
- Example
“varme vafler” becomes [“varme”, “va”, “##fler”]
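Because raw text comes back split into subword tokens, it can be handy to merge the pieces into whole words again. The sketch below assumes the WordPiece convention shown in the example, where a continuation piece starts with "##", and keeps the label of the first piece of each word. The helper merge_subwords is hypothetical and not part of danlp.

```python
def merge_subwords(tokens, labels):
    """Merge '##' continuation pieces into words, keeping the first label."""
    words, word_labels = [], []
    for token, label in zip(tokens, labels):
        if token.startswith("##") and words:
            # Continuation piece: glue it onto the previous word.
            words[-1] += token[2:]
        else:
            words.append(token)
            word_labels.append(label)
    return words, word_labels


# Tokens from the example above, with placeholder 'O' labels.
tokens = ["varme", "va", "##fler"]
labels = ["O", "O", "O"]
print(merge_subwords(tokens, labels))
```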
-
class
danlp.models.bert_models.
BertNextSent
(cache_dir='/home/docs/.danlp', verbose=False)¶ Bases:
object
BERT language model trained for next sentence prediction. The model is trained by BotXO (https://github.com/botxo/nordic_bert) and has been converted to a PyTorch version.
Credit for code example: https://stackoverflow.com/questions/55111360/using-bert-for-next-sentence-prediction
- Parameters
cache_dir (str) – the directory for storing cached models
verbose (bool) – True to increase verbosity
-
predict_if_next_sent
(sent_A: str, sent_B: str)¶ Calculate the probability that sentence B follows sentence A.
Credit for code example: https://stackoverflow.com/questions/55111360/using-bert-for-next-sentence-prediction
- Parameters
sent_A (str) – sentence A
sent_B (str) – sentence B
- Returns
the probability of sentence B following sentence A
- Return type
float
-
class
danlp.models.bert_models.
BertOffensive
(cache_dir='/home/docs/.danlp', verbose=False)¶ Bases:
object
BERT offensive language identification model.
For predicting whether a text is offensive or not.
- Parameters
cache_dir (str) – the directory for storing cached models
verbose (bool) – True to increase verbosity
-
predict
(sentence: str)¶ Predict whether a text is offensive or not.
- Parameters
sentence (str) – raw text
- Returns
a class – OFF (offensive) or NOT (not offensive)
- Return type
str
-
predict_proba
(sentence: str)¶ For a given sentence, return its probabilities of belonging to each class, i.e. OFF or NOT
-
class
danlp.models.bert_models.
BertTone
(cache_dir='/home/docs/.danlp', verbose=False)¶ Bases:
object
BERT Tone model.
For classifying both the tone [subjective, objective] and the polarity [positive, neutral, negative] of sentences.
- Parameters
cache_dir (str) – the directory for storing cached models
verbose (bool) – True to increase verbosity
-
predict
(sentence: str, polarity: bool = True, analytic: bool = True)¶ Predict the polarity [positive, neutral, negative] and/or the tone [subjective, objective] of the sentence.
- Parameters
sentence (str) – raw text
polarity (bool) – returns the polarity if True
analytic (bool) – returns the tone if True
- Returns
a dictionary for polarity and tone results
- Return type
Dict
-
danlp.models.bert_models.
load_bert_base_model
(cache_dir='/home/docs/.danlp', verbose=False)¶ Loads a BERT language model used for embedding tokens or sentences. The model is trained by BotXO: https://github.com/botxo/nordic_bert
- Parameters
cache_dir (str) – the directory for storing cached models
verbose (bool) – True to increase verbosity
- Returns
BERT model
-
danlp.models.bert_models.
load_bert_emotion_model
(cache_dir='/home/docs/.danlp', verbose=False)¶ Loads a BERT Emotion model.
- Parameters
cache_dir (str) – the directory for storing cached models
verbose (bool) – True to increase verbosity
- Returns
a BERT Emotion model
-
danlp.models.bert_models.
load_bert_hatespeech_model
(cache_dir='/home/docs/.danlp', verbose=False)¶ Loads a BERT HateSpeech model.
- Parameters
cache_dir (str) – the directory for storing cached models
verbose (bool) – True to increase verbosity
- Returns
a BERT HateSpeech model
-
danlp.models.bert_models.
load_bert_ner_model
(cache_dir='/home/docs/.danlp', verbose=False)¶ Loads a BERT NER model.
- Parameters
cache_dir (str) – the directory for storing cached models
verbose (bool) – True to increase verbosity
- Returns
a BERT NER model
-
danlp.models.bert_models.
load_bert_nextsent_model
(cache_dir='/home/docs/.danlp', verbose=False)¶ Loads a BERT language model used for next sentence prediction. The model is trained by BotXO: https://github.com/botxo/nordic_bert
- Parameters
cache_dir (str) – the directory for storing cached models
verbose (bool) – True to increase verbosity
- Returns
BERT NextSent model
-
danlp.models.bert_models.
load_bert_offensive_model
(cache_dir='/home/docs/.danlp', verbose=False)¶ Loads a BERT offensive language identification model.
- Parameters
cache_dir (str) – the directory for storing cached models
verbose (bool) – True to increase verbosity
- Returns
a BERT offensive language identification model
-
danlp.models.bert_models.
load_bert_tone_model
(cache_dir='/home/docs/.danlp', verbose=False)¶ Loads a BERT Tone model.
- Parameters
cache_dir (str) – the directory for storing cached models
verbose (bool) – True to increase verbosity
- Returns
a BERT Tone model
ELECTRA models
-
class
danlp.models.electra_models.
ElectraOffensive
(cache_dir='/home/docs/.danlp', verbose=False)¶ Bases:
object
Electra Offensive Model.
For detecting whether a comment is offensive or not.
- Parameters
cache_dir (str) – the directory for storing cached models
verbose (bool) – True to increase verbosity
-
predict
(sentence: str)¶ Predict whether a sentence is offensive or not
- Parameters
sentence (str) – raw text
- Returns
a class representing whether the sentence is offensive or not (OFF/NOT)
- Return type
str
-
predict_proba
(sentence: str)¶ For a given sentence, return its probabilities of belonging to each class, i.e. OFF or NOT
-
danlp.models.electra_models.
load_electra_offensive_model
(cache_dir='/home/docs/.danlp', verbose=False)¶ Loads an Electra Offensive model.
- Parameters
cache_dir (str) – the directory for storing cached models
verbose (bool) – True to increase verbosity
- Returns
an Electra Offensive model
flair models
-
danlp.models.flair_models.
load_flair_ner_model
(cache_dir='/home/docs/.danlp', verbose=False)¶ Loads a flair model for NER.
- Parameters
cache_dir (str) – the directory for storing cached models
verbose (bool) – True to increase verbosity
- Returns
an NER flair model
-
danlp.models.flair_models.
load_flair_pos_model
(cache_dir='/home/docs/.danlp', verbose=False)¶ Loads a flair model for Part-of-Speech tagging.
- Parameters
cache_dir (str) – the directory for storing cached models
verbose (bool) – True to increase verbosity
- Returns
a POS flair model
spaCy models
-
class
danlp.models.spacy_models.
SpacyChunking
(model=None, cache_dir='/home/docs/.danlp', verbose=False)¶ Bases:
object
spaCy chunking model.
- Parameters
model (spaCy model) – a (preloaded) spaCy model
cache_dir (str) – the directory for storing cached models
verbose (bool) – True to increase verbosity
-
predict
(text: Union[str, List[str]], bio=True)¶ Predict NP chunks from raw or tokenized text.
- Parameters
text – can either be a raw text or a list of tokens
bio (bool) – True to return a list of labels in BIO format (same length as the sentence), False to return a list of tuples (start id, end id, chunk label)
- Returns
NP chunks - either a list of labels in BIO format or a list of tuples (start id, end id, chunk label)
- Example
“Jeg kommer fra en lille by” becomes
a list of BIO tags: [‘B-NP’, ‘O’, ‘O’, ‘B-NP’, ‘I-NP’, ‘I-NP’]
or a list of tuples: [(0, 1, ‘NP’), (3, 6, ‘NP’)]
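The two output formats carry the same information. As an illustration, the sketch below converts a list of BIO tags into (start id, end id, chunk label) tuples, with the end id exclusive as in the example above. The helper bio_to_spans is hypothetical and not part of danlp.

```python
def bio_to_spans(tags):
    """Convert BIO tags into (start, end, label) tuples with exclusive end."""
    spans = []
    start, label = None, None
    for i, tag in enumerate(tags):
        # An I- tag continues the open chunk only if the labels match.
        continues = tag.startswith("I-") and label == tag[2:]
        if start is not None and not continues:
            spans.append((start, i, label))
            start, label = None, None
        if tag.startswith("B-"):
            start, label = i, tag[2:]
        elif tag.startswith("I-") and start is None:
            # Lenient choice: a stray I- tag opens a new chunk.
            start, label = i, tag[2:]
    if start is not None:
        spans.append((start, len(tags), label))
    return spans


# BIO tags from the example "Jeg kommer fra en lille by".
tags = ['B-NP', 'O', 'O', 'B-NP', 'I-NP', 'I-NP']
print(bio_to_spans(tags))
```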
-
danlp.models.spacy_models.
load_spacy_chunking_model
(spacy_model=None, cache_dir='/home/docs/.danlp', verbose=False)¶ Loads a spaCy chunking model.
- Parameters
spacy_model (spaCy model) – a (preloaded) spaCy model
cache_dir (str) – the directory for storing cached models
verbose (bool) – True to increase verbosity
- Returns
a spaCy Chunking model
Note
A spaCy model can be loaded beforehand with load_spacy_model and passed as an argument to load_spacy_chunking_model (for instance, to avoid loading the model twice).
-
danlp.models.spacy_models.
load_spacy_model
(cache_dir='/home/docs/.danlp', verbose=False, textcat=None, vectorError=False)¶ Loads a spaCy model.
- Parameters
cache_dir (str) – the directory for storing cached models
verbose (bool) – True to increase verbosity
textcat (str) – ‘sentiment’ for loading the spaCy sentiment analyser
vectorError (bool) –
- Returns
a spaCy model
Warning
vectorError is a temporary workaround for an error encountered when keeping two models loaded and not being able to find the reference name for the vectors.
XLM-R models
-
class
danlp.models.xlmr_models.
XLMRCoref
(cache_dir='/home/docs/.danlp', verbose=False)¶ Bases:
object
XLM-Roberta Coreference Resolution Model.
For predicting which expressions (word or group of words) refer to the same entity in a document.
- Parameters
cache_dir (str) – the directory for storing cached models
verbose (bool) – True to increase verbosity
-
predict
(document: List[List[str]])¶ Predict coreferences in a document
- Parameters
document (List[List[str]]) – segmented and tokenized text
- Returns
a dictionary
- Return type
Dict
-
predict_clusters
(document: List[List[str]])¶ Predict clusters of entities in the document. Each predicted cluster contains a list of references. A reference is a tuple (ref text, start id, end id). The ids refer to the token ids in the entire document.
- Parameters
document (List[List[str]]) – segmented and tokenized text
- Returns
a list of clusters
- Return type
List[List[Tuple]]
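Since the ids in a reference refer to positions in the entire document rather than within a sentence, it can help to see how a (sentence, token) position maps to a document-level id. The sketch below assumes ids are counted from 0 over the flattened document. The helper names and the example sentences are illustrative, and whether the end id of a reference is inclusive is not specified here.

```python
def flatten_document(document):
    """Flatten a segmented, tokenized document into one list of tokens."""
    return [token for sentence in document for token in sentence]


def document_token_id(document, sentence_index, token_index):
    """Map a (sentence, token) position to a document-level token id."""
    preceding = sum(len(sentence) for sentence in document[:sentence_index])
    return preceding + token_index


# Illustrative input in the List[List[str]] shape expected by predict_clusters.
document = [["Lotte", "arbejder", "med", "Mads", "."],
            ["Hun", "kender", "ham", "godt", "."]]

tokens = flatten_document(document)
hun_id = document_token_id(document, 1, 0)
print(hun_id, tokens[hun_id])
```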
-
class
danlp.models.xlmr_models.
XlmrNed
(cache_dir='/home/docs/.danlp', verbose=False)¶ Bases:
object
XLM-Roberta for Named Entity Disambiguation.
For predicting whether or not a specific entity (QID) is mentioned in a sentence.
- Parameters
cache_dir (str) – the directory for storing cached models
verbose (bool) – True to increase verbosity
-
predict
(sentence: str, kg_context: str)¶ Predict whether a QID is mentioned in a sentence or not.
- Parameters
sentence (str) – raw text
kg_context (str) – raw text
- Returns
- Return type
str
-
danlp.models.xlmr_models.
load_xlmr_coref_model
(cache_dir='/home/docs/.danlp', verbose=False)¶ Loads an XLM-R coreference model.
- Parameters
cache_dir (str) – the directory for storing cached models
verbose (bool) – True to increase verbosity
- Returns
an XLM-R coreference model
-
danlp.models.xlmr_models.
load_xlmr_ned_model
(cache_dir='/home/docs/.danlp', verbose=False)¶ Loads an XLM-R model for named entity disambiguation.
- Parameters
cache_dir (str) – the directory for storing cached models
verbose (bool) – True to increase verbosity
- Returns
an XLM-R NED model