Datasets

Dacoref

class danlp.datasets.dacoref.Dacoref(cache_dir: str = '/home/docs/.danlp')

Bases: object

This Danish coreference annotation contains parts of the Copenhagen Dependency Treebank. It was originally annotated as part of the Copenhagen Dependency Treebank (CDT) project but was never finished. This resource extends the annotation by using different mapping techniques and by augmenting with Qcodes from Wiktionary. Read more about it in the danlp docs.

Parameters
  • cache_dir (str) – the directory for storing cached models

  • verbose (bool) – True to increase verbosity

load_as_conllu(predefined_splits: bool = False)
Parameters

predefined_splits (bool) – if True, return the predefined train, dev and test splits

Returns

A single parsed CoNLL-U list, or a list of parsed CoNLL-U lists for the train, dev and test splits, depending on predefined_splits
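
Example

A minimal usage sketch, assuming the dataset is fetched into the default cache_dir on first use:

    from danlp.datasets.dacoref import Dacoref

    dacoref = Dacoref()
    # One combined list of parsed CoNLL-U documents ...
    corpus = dacoref.load_as_conllu()
    # ... or the predefined train/dev/test splits
    train, dev, test = dacoref.load_as_conllu(predefined_splits=True)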

DaNED & DaWikiNED

class danlp.datasets.daned.DaNED(cache_dir: str = '/home/docs/.danlp')

Bases: object

Class for loading the DaNED dataset. The DaNED dataset is derived from the Dacoref dataset, which is itself based on the DDT (and thus divided into train, dev and test in the same way). It is annotated for named entity disambiguation (also referred to as named entity linking).

Each entry is a tuple of a sentence and the QID of an entity. The label represents whether the entity corresponding to the QID is mentioned in the sentence. Each QID is linked to a knowledge graph (wikidata properties) and a description (wikidata description).

The same sentence can appear multiple times in the dataset (associated with different QIDs), but only one of these entries should have the label “1” (which corresponds to the correct entity).

Parameters
  • cache_dir (str) – the directory for storing cached models

  • verbose (bool) – True to increase verbosity

get_kg_context_from_qid(qid: str, output_as_dictionary=False, allow_online_search=False)

Return the knowledge context and the description of an entity from its QID.

Parameters
  • qid (str) – a wikidata QID

  • output_as_dictionary (bool) – whether to return the properties as a dictionary instead of a string (the default)

  • allow_online_search (bool) – whether to search Wikidata online when the QID is not in the local database (default False)

Returns

string (or dictionary) of properties and description
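
Example

A sketch of a QID lookup; the QID "Q1748" is only an illustrative value and is not taken from the reference, and the shape of the result follows the Returns note above:

    from danlp.datasets.daned import DaNED

    daned = DaNED()
    # Wikidata properties and description for an (illustrative) QID;
    # allow_online_search=True would fall back to a live Wikidata query.
    context = daned.get_kg_context_from_qid("Q1748", output_as_dictionary=True)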

load_with_pandas()

Loads the DaNED dataset in dataframes with pandas.

Returns

3 dataframes – train, dev, test
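
Example

A minimal loading sketch, assuming pandas is installed and the dataset is fetched into the default cache_dir:

    from danlp.datasets.daned import DaNED

    train, dev, test = DaNED().load_with_pandas()
    print(len(train), len(dev), len(test))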

class danlp.datasets.dawikined.DaWikiNED(cache_dir: str = '/home/docs/.danlp')

Bases: object

Class for loading the DaWikiNED dataset. The DaWikiNED dataset contains Wikipedia text annotated for named entity disambiguation. It contains only a train set, as it is intended to be used for data augmentation of the DaNED dataset.

Each entry is a tuple of a sentence and the QID of an entity. The label represents whether the entity corresponding to the QID is mentioned in the sentence. Each QID is linked to a knowledge graph (wikidata properties) and a description (wikidata description).

The same sentence can appear multiple times in the dataset (associated with different QIDs), but only one of these entries should have the label “1” (which corresponds to the correct entity).

Parameters
  • cache_dir (str) – the directory for storing cached models

  • verbose (bool) – True to increase verbosity

get_kg_context_from_qid(qid: str, output_as_dictionary=False, allow_online_search=False)

Return the knowledge context and the description of an entity from its QID.

Parameters
  • qid (str) – a wikidata QID

  • output_as_dictionary (bool) – whether to return the properties as a dictionary instead of a string (the default)

  • allow_online_search (bool) – whether to search Wikidata online when the QID is not in the local database (default False)

Returns

string (or dictionary) of properties and description

load_with_pandas()

Loads the DaWikiNED dataset in a dataframe with pandas.

Returns

a dataframe for train data
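
Example

A minimal loading sketch; DaWikiNED only ships a train set, so a single dataframe is returned:

    from danlp.datasets.dawikined import DaWikiNED

    train = DaWikiNED().load_with_pandas()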

DanNet

class danlp.datasets.dannet.DanNet(cache_dir='/home/docs/.danlp', verbose=False)

Bases: object

DanNet wrapper, providing functions to access the main features of DanNet. See also: https://cst.ku.dk/projekter/dannet/.

DanNet consists of 4 databases:

  • words

  • word senses

  • relations

  • synsets

Parameters
  • cache_dir (str) – the directory for storing cached models

  • verbose (bool) – True to increase verbosity

domains(word, pos=None)

Returns the domains of word.

Parameters
  • word – text

  • pos – (list of) part of speech tag(s) (in “Noun”, “Verb”, “Adjective”)

Returns

list of domains

hypernyms(word, pos=None)

Returns the hypernyms of word.

Parameters
  • word – text

  • pos – (list of) part of speech tag(s) (in “Noun”, “Verb”, “Adjective”)

Returns

list of hypernyms

hyponyms(word, pos=None)

Returns the hyponyms of word.

Parameters
  • word – text

  • pos – (list of) part of speech tag(s) (in “Noun”, “Verb”, “Adjective”)

Returns

list of hyponyms

load_with_pandas()

Loads the DanNet databases into 4 dataframes

Returns

4 dataframes: words, wordsenses, relations, synsets
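
Example

A minimal loading sketch, unpacking the four dataframes in the order listed above:

    from danlp.datasets.dannet import DanNet

    dannet = DanNet()
    words, wordsenses, relations, synsets = dannet.load_with_pandas()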

meanings(word, pos=None)

Returns the meanings of word.

Parameters
  • word – text

  • pos – (list of) part of speech tag(s) (in “Noun”, “Verb”, “Adjective”)

Returns

list of meanings

pos(word)

Returns the part-of-speech tags that word can be categorized with, among “Noun”, “Verb” and “Adjective”.

Parameters

word – text

Returns

list of part-of-speech tags

synonyms(word, pos=None)

Returns the synonyms of word.

Parameters
  • word – text

  • pos – (list of) part of speech tag(s) (in “Noun”, “Verb”, “Adjective”)

Returns

list of synonyms

Example

“hav” returns [“sø”, “ocean”]

wordnet_relations(word, pos=None, eurowordnet=True)

Returns the name of the relations word is associated with.

Parameters
  • word – text

  • pos – (list of) part of speech tag(s) (in “Noun”, “Verb”, “Adjective”)

Returns

list of relations
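
Example

A sketch of the word-level queries; “hav” is taken from the synonyms example above, and the exact lists returned depend on the DanNet databases:

    from danlp.datasets.dannet import DanNet

    dannet = DanNet()
    dannet.synonyms("hav")                # e.g. ["sø", "ocean"]
    dannet.hypernyms("hav", pos="Noun")   # hypernyms, restricted to nouns
    dannet.meanings("hav")                # meanings of the word
    dannet.pos("hav")                     # part-of-speech tags for the word
    dannet.wordnet_relations("hav")       # names of relations "hav" appears in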

Danish Dependency Treebank

class danlp.datasets.ddt.DDT(cache_dir: str = '/home/docs/.danlp')

Bases: object

Class for loading the Danish Dependency Treebank (DDT) through several frameworks/formats.

The DDT dataset has been annotated with NER tags in the IOB2 format. The dataset is downloaded in CoNLL-U format, but with this class it can be converted to spaCy format or a simple NER format similar to the CoNLL 2003 NER format.

Parameters
  • cache_dir (str) – the directory for storing cached models

  • verbose (bool) – True to increase verbosity

load_as_conllu(predefined_splits: bool = False)

Load the DDT in CoNLL-U format.

Parameters

predefined_splits (bool) – if True, return the predefined train, dev and test splits

Returns

A single pyconll.Conll or a tuple of (train, dev, test) pyconll.Conll depending on predefined_split
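
Example

A minimal loading sketch, assuming pyconll is installed and the treebank is fetched into the default cache_dir:

    from danlp.datasets.ddt import DDT

    ddt = DDT()
    conll = ddt.load_as_conllu()   # one pyconll.Conll
    train, dev, test = ddt.load_as_conllu(predefined_splits=True)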

load_with_flair(predefined_splits: bool = False)

Load the DDT with flair.

This function is inspired by the “Reading Your Own Sequence Labeling Dataset” section of Flair’s tutorial on reading corpora:

https://github.com/zalandoresearch/flair/blob/master/resources/docs/TUTORIAL_6_CORPUS.md

Parameters

predefined_splits (bool) – if True, return the predefined train, dev and test splits

Returns

ColumnCorpus

Note

TODO: Make a pull request to flair similar to this: https://github.com/zalandoresearch/flair/issues/383

load_with_spacy()

Loads the DDT with spaCy.

This function converts the CoNLL-U files to JSON in the spaCy format.

Returns

GoldCorpus

Note

Not using jsonl because of: https://github.com/explosion/spaCy/issues/3523
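
Example

A sketch of the flair and spaCy loaders, assuming flair and spaCy are installed in versions compatible with danlp:

    from danlp.datasets.ddt import DDT

    ddt = DDT()
    corpus = ddt.load_with_flair()   # flair ColumnCorpus with train/dev/test
    gold = ddt.load_with_spacy()     # spaCy GoldCorpus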

DKHate

class danlp.datasets.dkhate.DKHate(cache_dir: str = '/home/docs/.danlp')

Bases: object

Class for loading the DKHate dataset. The DKHate dataset contains user-generated comments from social media platforms (Facebook and Reddit) annotated for various types and targets of offensive language. The original corpus was used for the OffensEval 2020 shared task. Note that only labels for offensive language identification (sub-task A) are available, which means that each sample in this dataset is labelled as either NOT (Not Offensive) or OFF (Offensive).

Parameters
  • cache_dir (str) – the directory for storing cached models

  • verbose (bool) – True to increase verbosity

load_with_pandas()

Loads the DKHate dataset in a dataframe with pandas.

Returns

a dataframe for test data and a dataframe for train data
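
Example

A minimal loading sketch; the unpacking order (test, then train) follows the Returns note above:

    from danlp.datasets.dkhate import DKHate

    test, train = DKHate().load_with_pandas()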

Sentiment datasets

class danlp.datasets.sentiment.AngryTweets(cache_dir: str = '/home/docs/.danlp')

Bases: object

Class for loading the AngryTweets Sentiment dataset.

Parameters

cache_dir (str) – the directory for storing cached models

load_with_pandas()

Loads the dataset in a dataframe.

Returns

a dataframe

class danlp.datasets.sentiment.EuroparlSentiment1(cache_dir: str = '/home/docs/.danlp')

Bases: object

Class for loading the Europarl Sentiment dataset.

Parameters

cache_dir (str) – the directory for storing cached models

load_with_pandas()

Loads the dataset into a dataframe and drops duplicates and NaN values

Returns

a dataframe

class danlp.datasets.sentiment.EuroparlSentiment2(cache_dir: str = '/home/docs/.danlp')

Bases: object

Class for loading the Europarl Sentiment dataset.

Parameters

cache_dir (str) – the directory for storing cached models

load_with_pandas()

Loads the dataset as a dataframe

Returns

a dataframe

class danlp.datasets.sentiment.LccSentiment(cache_dir: str = '/home/docs/.danlp')

Bases: object

Class for loading the LCC Sentiment dataset.

Parameters

cache_dir (str) – the directory for storing cached models

load_with_pandas()

Loads the dataset into a single dataframe, combining the source files and dropping duplicates and NaN values

Returns

a dataframe

class danlp.datasets.sentiment.TwitterSent(cache_dir: str = '/home/docs/.danlp')

Bases: object

Class for loading the Twitter Sentiment dataset.

Parameters

cache_dir (str) – the directory for storing cached models

load_with_pandas()

Loads the dataset in a dataframe.

Returns

a dataframe of the test set and a dataframe of the train set
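
Example

A sketch of loading two of the sentiment datasets; the (test, train) unpacking order for TwitterSent follows the Returns note above:

    from danlp.datasets.sentiment import AngryTweets, TwitterSent

    angry_tweets = AngryTweets().load_with_pandas()                 # one dataframe
    twitter_test, twitter_train = TwitterSent().load_with_pandas()  # test and train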

DaUnimorph

class danlp.datasets.unimorph.DaUnimorph(cache_dir='/home/docs/.danlp', verbose=False)

Bases: object

Danish Unimorph. See also: https://unimorph.github.io/.

The Danish Unimorph is a database which contains knowledge (lemmas and morphological features) about different forms of nouns and verbs in Danish.

Parameters
  • cache_dir (str) – the directory for storing cached models

  • verbose (bool) – True to increase verbosity

get_inflections(form, pos=None, is_lemma=False, with_features=False)

Returns all possible inflections (forms) of a word (based on its lemma)

Returns

list of words

get_lemmas(form, pos=None, with_features=False)

Returns the lemma(s) of a word

Returns

list of lemmas
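
Example

A sketch of the lookup methods; “huset” and “hus” are illustrative Danish word forms not taken from the reference, and the exact lists returned depend on the Unimorph database:

    from danlp.datasets.unimorph import DaUnimorph

    unimorph = DaUnimorph()
    unimorph.get_lemmas("huset")                    # e.g. ["hus"]
    unimorph.get_inflections("hus", is_lemma=True)  # all forms of the lemma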

load_with_pandas()

Loads the dataset in a dataframe

Returns

a dataframe

WikiANN

class danlp.datasets.wiki_ann.WikiAnn(cache_dir: str = '/home/docs/.danlp')

Bases: object

Class for loading the WikiANN dataset.

Parameters

cache_dir (str) – the directory for storing cached models

load_with_flair(predefined_splits: bool = False)

Loads the dataset with flair.

Parameters

predefined_splits (bool) –

Returns

ColumnCorpus

load_with_spacy()

Loads the dataset with spaCy.

This function converts the CoNLL02/03 format to the JSON format used by spaCy. As the function returns a spacy.gold.GoldCorpus, which needs a dev set, it also splits the dataset into a 70/30 train/dev split, as is done by Pan et al. (2017).

Returns

GoldCorpus
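
Example

A minimal loading sketch, assuming flair and spaCy are installed in versions compatible with danlp:

    from danlp.datasets.wiki_ann import WikiAnn

    wikiann = WikiAnn()
    corpus = wikiann.load_with_flair()   # flair ColumnCorpus
    gold = wikiann.load_with_spacy()     # spaCy GoldCorpus with a 70/30 train/dev split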

Word similarity datasets

class danlp.datasets.word_sim.DSD(cache_dir: str = '/home/docs/.danlp')

Bases: object

Class for loading the Danish Similarity Dataset (DSD).

Parameters

cache_dir (str) – the directory for storing cached models

load_with_pandas()

Loads the dataset in a dataframe.

Returns

a dataframe

words() → set

Loads the vocabulary.

Return type

set

class danlp.datasets.word_sim.WordSim353Da(cache_dir: str = '/home/docs/.danlp')

Bases: object

Class for loading the WordSim-353 dataset.

Parameters

cache_dir (str) – the directory for storing cached models

load_with_pandas()

Loads the dataset in a dataframe.

Returns

a dataframe

words() → set

Loads the vocabulary.

Return type

set
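
Example

A minimal sketch for the two word similarity datasets, assuming pandas is installed:

    from danlp.datasets.word_sim import DSD, WordSim353Da

    dsd = DSD().load_with_pandas()    # similarity data as a dataframe
    vocab = WordSim353Da().words()    # vocabulary of the dataset, as a set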