Datasets

Dacoref

class danlp.datasets.dacoref.Dacoref(cache_dir: str = '/home/docs/.danlp')

Bases: object

This Danish coreference annotation contains parts of the Copenhagen Dependency Treebank. It was originally annotated as part of the Copenhagen Dependency Treebank (CDT) project but was never finished. This resource extends the annotation by using different mapping techniques and by augmenting with Qcodes from Wiktionary. Read more about it in the danlp docs.

Parameters
  • cache_dir (str) – the directory for storing cached models

  • verbose (bool) – True to increase verbosity

load_as_conllu(predefined_splits: bool = False)
Parameters

predefined_splits (bool) – if True, return the predefined train, dev and test splits

Returns

A single parsed CoNLL-U list, or a list of parsed CoNLL-U lists for the train, dev and test splits, depending on predefined_splits
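
Example

A minimal usage sketch, assuming the dataset is fetched into the default cache_dir on first use:

    from danlp.datasets.dacoref import Dacoref

    dacoref = Dacoref()
    # One combined list of parsed CoNLL-U documents ...
    corpus = dacoref.load_as_conllu()
    # ... or the predefined train/dev/test splits
    train, dev, test = dacoref.load_as_conllu(predefined_splits=True)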

DaNED & DaWikiNED

class danlp.datasets.daned.DaNED(cache_dir: str = '/home/docs/.danlp')

Bases: object

Class for loading the DaNED dataset. The DaNED dataset is derived from the Dacoref dataset, which is itself based on the DDT (and thus divided into train, dev and test in the same way). It is annotated for named entity disambiguation (also referred to as named entity linking).

Each entry is a tuple of a sentence and the QID of an entity. The label represents whether the entity corresponding to the QID is mentioned in the sentence. Each QID is linked to a knowledge graph (wikidata properties) and a description (wikidata description).

The same sentence can appear multiple times in the dataset (associated with different QIDs), but only one of these entries should have the label “1” (which corresponds to the correct entity).

Parameters
  • cache_dir (str) – the directory for storing cached models

  • verbose (bool) – True to increase verbosity

get_kg_context_from_qid(qid: str, output_as_dictionary=False, allow_online_search=False)

Return the knowledge context and the description of an entity from its QID.

Parameters
  • qid (str) – a wikidata QID

  • output_as_dictionary (bool) – whether to return the properties as a dictionary instead of a string (the default)

  • allow_online_search (bool) – whether to search Wikidata online when the QID is not in the local database (default False)

Returns

string (or dictionary) of properties and description
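
Example

A sketch of a QID lookup; the QID "Q1748" is only an illustrative value and is not taken from the reference, and the shape of the result follows the Returns note above:

    from danlp.datasets.daned import DaNED

    daned = DaNED()
    # Wikidata properties and description for an (illustrative) QID;
    # allow_online_search=True would fall back to a live Wikidata query.
    context = daned.get_kg_context_from_qid("Q1748", output_as_dictionary=True)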

load_with_pandas()

Loads the DaNED dataset in dataframes with pandas.

Returns

3 dataframes – train, dev, test
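
Example

A minimal loading sketch, assuming pandas is installed and the dataset is fetched into the default cache_dir:

    from danlp.datasets.daned import DaNED

    train, dev, test = DaNED().load_with_pandas()
    print(len(train), len(dev), len(test))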

class danlp.datasets.dawikined.DaWikiNED(cache_dir: str = '/home/docs/.danlp')

Bases: object

Class for loading the DaWikiNED dataset. The DaWikiNED dataset contains Wikipedia text annotated for named entity disambiguation. It contains only a train set, as it is intended to be used for data augmentation of the DaNED dataset.

Each entry is a tuple of a sentence and the QID of an entity. The label represents whether the entity corresponding to the QID is mentioned in the sentence. Each QID is linked to a knowledge graph (wikidata properties) and a description (wikidata description).

The same sentence can appear multiple times in the dataset (associated with different QIDs), but only one of these entries should have the label “1” (which corresponds to the correct entity).

Parameters
  • cache_dir (str) – the directory for storing cached models

  • verbose (bool) – True to increase verbosity

get_kg_context_from_qid(qid: str, output_as_dictionary=False, allow_online_search=False)

Return the knowledge context and the description of an entity from its QID.

Parameters
  • qid (str) – a wikidata QID

  • output_as_dictionary (bool) – whether to return the properties as a dictionary instead of a string (the default)

  • allow_online_search (bool) – whether to search Wikidata online when the QID is not in the local database (default False)

Returns

string (or dictionary) of properties and description

load_with_pandas()

Loads the DaWikiNED dataset in a dataframe with pandas.

Returns

a dataframe for train data
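
Example

A minimal loading sketch; DaWikiNED only ships a train set, so a single dataframe is returned:

    from danlp.datasets.dawikined import DaWikiNED

    train = DaWikiNED().load_with_pandas()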

DanNet

class danlp.datasets.dannet.DanNet(cache_dir='/home/docs/.danlp', verbose=False)

Bases: object

DanNet wrapper, providing functions to access the main features of DanNet. See also: https://cst.ku.dk/projekter/dannet/.

DanNet consists of 4 databases:

  • words

  • word senses

  • relations

  • synsets

Parameters
  • cache_dir (str) – the directory for storing cached models

  • verbose (bool) – True to increase verbosity

domains(word, pos=None)

Returns the domains of word.

Parameters
  • word – text

  • pos – (list of) part of speech tag(s) (in “Noun”, “Verb”, “Adjective”)

Returns

list of domains

hypernyms(word, pos=None)

Returns the hypernyms of word.

Parameters
  • word – text

  • pos – (list of) part of speech tag(s) (in “Noun”, “Verb”, “Adjective”)

Returns

list of hypernyms

hyponyms(word, pos=None)

Returns the hyponyms of word.

Parameters
  • word – text

  • pos – (list of) part of speech tag(s) (in “Noun”, “Verb”, “Adjective”)

Returns

list of hyponyms

load_with_pandas()

Loads the DanNet databases into 4 dataframes

Returns

4 dataframes: words, wordsenses, relations, synsets
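
Example

A minimal loading sketch, unpacking the four dataframes in the order listed above:

    from danlp.datasets.dannet import DanNet

    dannet = DanNet()
    words, wordsenses, relations, synsets = dannet.load_with_pandas()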

meanings(word, pos=None)

Returns the meanings of word.

Parameters
  • word – text

  • pos – (list of) part of speech tag(s) (in “Noun”, “Verb”, “Adjective”)

Returns

list of meanings

pos(word)

Returns the part-of-speech tags that word can be categorized with, among “Noun”, “Verb” and “Adjective”.

Parameters

word – text

Returns

list of part-of-speech tags

synonyms(word, pos=None)

Returns the synonyms of word.

Parameters
  • word – text

  • pos – (list of) part of speech tag(s) (in “Noun”, “Verb”, “Adjective”)

Returns

list of synonyms

Example

“hav” returns [“sø”, “ocean”]

wordnet_relations(word, pos=None, eurowordnet=True)

Returns the name of the relations word is associated with.

Parameters
  • word – text

  • pos – (list of) part of speech tag(s) (in “Noun”, “Verb”, “Adjective”)

Returns

list of relations
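
Example

A sketch of the word-level queries; “hav” is taken from the synonyms example above, and the exact lists returned depend on the DanNet databases:

    from danlp.datasets.dannet import DanNet

    dannet = DanNet()
    dannet.synonyms("hav")                # e.g. ["sø", "ocean"]
    dannet.hypernyms("hav", pos="Noun")   # hypernyms, restricted to nouns
    dannet.meanings("hav")                # meanings of the word
    dannet.pos("hav")                     # part-of-speech tags for the word
    dannet.wordnet_relations("hav")       # names of relations "hav" appears in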

Danish Dependency Treebank

class danlp.datasets.ddt.DDT(cache_dir: str = '/home/docs/.danlp')

Bases: object

Class for loading the Danish Dependency Treebank (DDT) through several frameworks/formats.

The DDT dataset has been annotated with NER tags in the IOB2 format. The dataset is downloaded in CoNLL-U format, but with this class it can be converted to spaCy format or a simple NER format similar to the CoNLL 2003 NER format.

Parameters
  • cache_dir (str) – the directory for storing cached models

  • verbose (bool) – True to increase verbosity

load_as_conllu(predefined_splits: bool = False)

Load the DDT in CoNLL-U format.

Parameters

predefined_splits (bool) – if True, return the predefined train, dev and test splits

Returns

A single pyconll.Conll or a tuple of (train, dev, test) pyconll.Conll depending on predefined_split
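
Example

A minimal loading sketch, assuming pyconll is installed and the treebank is fetched into the default cache_dir:

    from danlp.datasets.ddt import DDT

    ddt = DDT()
    conll = ddt.load_as_conllu()   # one pyconll.Conll
    train, dev, test = ddt.load_as_conllu(predefined_splits=True)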

load_with_flair(predefined_splits: bool = False)

Load the DDT with flair.

This function is inspired by the “Reading Your Own Sequence Labeling Dataset” section of Flair’s tutorial on reading corpora:

https://github.com/zalandoresearch/flair/blob/master/resources/docs/TUTORIAL_6_CORPUS.md

Parameters

predefined_splits (bool) – if True, return the predefined train, dev and test splits

Returns

ColumnCorpus

Note

TODO: Make a pull request to flair similar to this: https://github.com/zalandoresearch/flair/issues/383

load_with_spacy()

Loads the DDT with spaCy.

This function converts the CoNLL-U files to JSON in the spaCy format.

Returns

GoldCorpus

Note

Not using jsonl because of: https://github.com/explosion/spaCy/issues/3523
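
Example

A sketch of the flair and spaCy loaders, assuming flair and spaCy are installed in versions compatible with danlp:

    from danlp.datasets.ddt import DDT

    ddt = DDT()
    corpus = ddt.load_with_flair()   # flair ColumnCorpus with train/dev/test
    gold = ddt.load_with_spacy()     # spaCy GoldCorpus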

DKHate

class danlp.datasets.dkhate.DKHate(cache_dir: str = '/home/docs/.danlp')

Bases: object

Class for loading the DKHate dataset. The DKHate dataset contains user-generated comments from social media platforms (Facebook and Reddit) annotated for various types and targets of offensive language. The original corpus was used for the OffensEval 2020 shared task. Note that only labels for offensive language identification (sub-task A) are available, which means that each sample in this dataset is labelled as either NOT (Not Offensive) or OFF (Offensive).

Parameters
  • cache_dir (str) – the directory for storing cached models

  • verbose (bool) – True to increase verbosity

load_with_pandas()

Loads the DKHate dataset in a dataframe with pandas.

Returns

a dataframe for test data and a dataframe for train data
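
Example

A minimal loading sketch; the unpacking order (test, then train) follows the Returns note above:

    from danlp.datasets.dkhate import DKHate

    test, train = DKHate().load_with_pandas()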

Sentiment datasets

class danlp.datasets.sentiment.AngryTweets(cache_dir: str = '/home/docs/.danlp')

Bases: object

Class for loading the AngryTweets Sentiment dataset.

Parameters

cache_dir (str) – the directory for storing cached models

load_with_pandas()

Loads the dataset in a dataframe.

Returns

a dataframe

class danlp.datasets.sentiment.EuroparlSentiment1(cache_dir: str = '/home/docs/.danlp')

Bases: object

Class for loading the Europarl Sentiment dataset.

Parameters

cache_dir (str) – the directory for storing cached models

load_with_pandas()

Loads the dataset into a dataframe and drops duplicates and NaN values

Returns

a dataframe

class danlp.datasets.sentiment.EuroparlSentiment2(cache_dir: str = '/home/docs/.danlp')

Bases: object

Class for loading the Europarl Sentiment dataset.

Parameters

cache_dir (str) – the directory for storing cached models

load_with_pandas()

Loads the dataset as a dataframe

Returns

a dataframe

class danlp.datasets.sentiment.LccSentiment(cache_dir: str = '/home/docs/.danlp')

Bases: object

Class for loading the LCC Sentiment dataset.

Parameters

cache_dir (str) – the directory for storing cached models

load_with_pandas()

Loads the dataset into a single dataframe, combining the source files and dropping duplicates and NaN values

Returns

a dataframe

class danlp.datasets.sentiment.TwitterSent(cache_dir: str = '/home/docs/.danlp')

Bases: object

Class for loading the Twitter Sentiment dataset.

Parameters

cache_dir (str) – the directory for storing cached models

load_with_pandas()

Loads the dataset in a dataframe.

Returns

a dataframe of the test set and a dataframe of the train set
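
Example

A sketch of loading two of the sentiment datasets; the (test, train) unpacking order for TwitterSent follows the Returns note above:

    from danlp.datasets.sentiment import AngryTweets, TwitterSent

    angry_tweets = AngryTweets().load_with_pandas()                 # one dataframe
    twitter_test, twitter_train = TwitterSent().load_with_pandas()  # test and train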

DaUnimorph

class danlp.datasets.unimorph.DaUnimorph(cache_dir='/home/docs/.danlp', verbose=False)

Bases: object

Danish Unimorph. See also: https://unimorph.github.io/.

The Danish Unimorph is a database which contains knowledge (lemmas and morphological features) about different forms of nouns and verbs in Danish.

Parameters
  • cache_dir (str) – the directory for storing cached models

  • verbose (bool) – True to increase verbosity

get_inflections(form, pos=None, is_lemma=False, with_features=False)

Returns all possible inflections (forms) of a word (based on its lemma)

Returns

list of words

get_lemmas(form, pos=None, with_features=False)

Returns the lemma(s) of a word

Returns

list of lemmas
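
Example

A sketch of the lookup methods; “huset” and “hus” are illustrative Danish word forms not taken from the reference, and the exact lists returned depend on the Unimorph database:

    from danlp.datasets.unimorph import DaUnimorph

    unimorph = DaUnimorph()
    unimorph.get_lemmas("huset")                    # e.g. ["hus"]
    unimorph.get_inflections("hus", is_lemma=True)  # all forms of the lemma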

load_with_pandas()

Loads the dataset in a dataframe

Returns

a dataframe

WikiANN

class danlp.datasets.wiki_ann.WikiAnn(cache_dir: str = '/home/docs/.danlp')

Bases: object

Class for loading the WikiANN dataset.

Parameters

cache_dir (str) – the directory for storing cached models

load_with_flair(predefined_splits: bool = False)

Loads the dataset with flair.

Parameters

predefined_splits (bool) –

Returns

ColumnCorpus

load_with_spacy()

Loads the dataset with spaCy.

This function converts the CoNLL02/03 format to the JSON format used by spaCy. As the function returns a spacy.gold.GoldCorpus, which needs a dev set, it also splits the dataset into a 70/30 train/dev split, as is done by Pan et al. (2017).

Returns

GoldCorpus
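
Example

A minimal loading sketch, assuming flair and spaCy are installed in versions compatible with danlp:

    from danlp.datasets.wiki_ann import WikiAnn

    wikiann = WikiAnn()
    corpus = wikiann.load_with_flair()   # flair ColumnCorpus
    gold = wikiann.load_with_spacy()     # spaCy GoldCorpus with a 70/30 train/dev split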

Word similarity datasets

class danlp.datasets.word_sim.DSD(cache_dir: str = '/home/docs/.danlp')

Bases: object

Class for loading the Danish Similarity Dataset (DSD).

Parameters

cache_dir (str) – the directory for storing cached models

load_with_pandas()

Loads the dataset in a dataframe.

Returns

a dataframe

words() → set

Loads the vocabulary.

Return type

set

class danlp.datasets.word_sim.WordSim353Da(cache_dir: str = '/home/docs/.danlp')

Bases: object

Class for loading the WordSim-353 dataset.

Parameters

cache_dir (str) – the directory for storing cached models

load_with_pandas()

Loads the dataset in a dataframe.

Returns

a dataframe

words() → set

Loads the vocabulary.

Return type

set
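
Example

A minimal sketch for the two word similarity datasets, assuming pandas is installed:

    from danlp.datasets.word_sim import DSD, WordSim353Da

    dsd = DSD().load_with_pandas()    # similarity data as a dataframe
    vocab = WordSim353Da().words()    # vocabulary of the dataset, as a set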