Datasets
Dacoref
- class danlp.datasets.dacoref.Dacoref(cache_dir: str = '/home/docs/.danlp')
Bases: object
This Danish coreference annotation covers parts of the Copenhagen Dependency Treebank (CDT). It was originally annotated as part of the CDT project but never finished. This resource extends the annotation by using different mapping techniques and by augmenting with Qcodes from Wiktionary. Read more about it in the danlp docs.
- Parameters
cache_dir (str) – the directory for storing cached models
verbose (bool) – True to increase verbosity
- load_as_conllu(predefined_splits: bool = False)
- Parameters
predefined_splits (bool) – whether to return the predefined train/dev/test splits
- Returns
A single parsed CoNLL-U list, or a list of train, dev and test parsed CoNLL-U lists, depending on predefined_splits
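For reference, CoNLL-U data is a plain-text format with one tab-separated token line per word. The following stdlib-only sketch parses a hand-written sentence in that shape (the sentence is invented for illustration, not taken from the corpus; actually loading the corpus would go through `Dacoref().load_as_conllu()` once the dataset is downloaded):

```python
# Hand-written CoNLL-U sample illustrating the token-line shape;
# each token line holds 10 tab-separated fields (ID, FORM, LEMMA, UPOS, ...).
sample = (
    "# sent_id = 1\n"
    "1\tHun\thun\tPRON\t_\t_\t2\tnsubj\t_\t_\n"
    "2\tsover\tsove\tVERB\t_\t_\t0\troot\t_\t_\n"
)
# Skip comment lines (starting with '#') and split token lines on tabs.
tokens = [line.split("\t") for line in sample.splitlines()
          if line and not line.startswith("#")]
forms = [cols[1] for cols in tokens]  # FORM is the second field
print(forms)  # ['Hun', 'sover']
```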
DaNED & DaWikiNED
- class danlp.datasets.daned.DaNED(cache_dir: str = '/home/docs/.danlp')
Bases: object
Class for loading the DaNED dataset. The DaNED dataset is derived from the Dacoref dataset, which is itself based on the DDT (and is thus divided into train, dev and test in the same way). It is annotated for named entity disambiguation (also referred to as named entity linking).
Each entry is a tuple of a sentence and the QID of an entity. The label represents whether the entity corresponding to the QID is mentioned in the sentence. Each QID is linked to a knowledge graph (wikidata properties) and a description (wikidata description).
The same sentence can appear multiple times in the dataset (associated with different QIDs), but only one of these entries should have the label “1” (which corresponds to the correct entity).
- Parameters
cache_dir (str) – the directory for storing cached models
verbose (bool) – True to increase verbosity
- get_kg_context_from_qid(qid: str, output_as_dictionary=False, allow_online_search=False)
Return the knowledge context and the description of an entity from its QID.
- Parameters
qid (str) – a wikidata QID
output_as_dictionary (bool) – whether to return the properties as a dictionary instead of a string (default is a string)
allow_online_search (bool) – whether to search Wikidata online when the QID is not in the database (default False)
- Returns
string (or dictionary) of properties and description
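To illustrate the two output shapes that output_as_dictionary toggles between, here is a stdlib-only sketch with invented Wikidata-style properties; the exact flattening the library uses for the string form may differ:

```python
# Mock properties and description (invented values; real ones come from Wikidata).
props = {"instance of": "city", "country": "Denmark"}
description = "city in Denmark"

# Dictionary output keeps the properties structured; the string output
# flattens key/value pairs into one text (a guess at the general idea).
as_string = " ".join(f"{key} {value}" for key, value in props.items())
print(as_string)    # instance of city country Denmark
print(description)  # city in Denmark
```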
- load_with_pandas()
Loads the DaNED dataset in dataframes with pandas.
- Returns
3 dataframes – train, dev, test
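The one-positive-per-sentence property described above can be sketched with a toy sample; the sentence and QIDs below are invented for illustration, not real DaNED rows:

```python
# (sentence, QID, label) rows mimicking the DaNED structure: a sentence
# repeats with several candidate QIDs, and exactly one row is labelled "1".
rows = [
    ("Dronningen besøgte Aarhus.", "Q100001", "0"),
    ("Dronningen besøgte Aarhus.", "Q100002", "1"),
    ("Dronningen besøgte Aarhus.", "Q100003", "0"),
]
positives = [qid for _, qid, label in rows if label == "1"]
assert len(positives) == 1  # one correct entity per sentence group
print(positives[0])  # Q100002
```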
- class danlp.datasets.dawikined.DaWikiNED(cache_dir: str = '/home/docs/.danlp')
Bases: object
Class for loading the DaWikiNED dataset. The DaWikiNED dataset contains Wikipedia text annotated for named entity disambiguation. It contains only a train set, as it is intended for augmenting the DaNED dataset.
Each entry is a tuple of a sentence and the QID of an entity. The label represents whether the entity corresponding to the QID is mentioned in the sentence. Each QID is linked to a knowledge graph (wikidata properties) and a description (wikidata description).
The same sentence can appear multiple times in the dataset (associated with different QIDs), but only one of these entries should have the label “1” (which corresponds to the correct entity).
- Parameters
cache_dir (str) – the directory for storing cached models
verbose (bool) – True to increase verbosity
- get_kg_context_from_qid(qid: str, output_as_dictionary=False, allow_online_search=False)
Return the knowledge context and the description of an entity from its QID.
- Parameters
qid (str) – a wikidata QID
output_as_dictionary (bool) – whether to return the properties as a dictionary instead of a string (default is a string)
allow_online_search (bool) – whether to search Wikidata online when the QID is not in the database (default False)
- Returns
string (or dictionary) of properties and description
- load_with_pandas()
Loads the DaWikiNED dataset in a dataframe with pandas.
- Returns
a dataframe for train data
DanNet
- class danlp.datasets.dannet.DanNet(cache_dir='/home/docs/.danlp', verbose=False)
Bases: object
DanNet wrapper, providing functions to access the main features of DanNet. See also: https://cst.ku.dk/projekter/dannet/.
DanNet consists of four databases:
- words
- word senses
- relations
- synsets
- Parameters
cache_dir (str) – the directory for storing cached models
verbose (bool) – True to increase verbosity
- domains(word, pos=None)
Returns the domains of word.
- Parameters
word – text
pos – (list of) part of speech tag(s) (in “Noun”, “Verb”, “Adjective”)
- Returns
list of domains
- hypernyms(word, pos=None)
Returns the hypernyms of word.
- Parameters
word – text
pos – (list of) part of speech tag(s) (in “Noun”, “Verb”, “Adjective”)
- Returns
list of hypernyms
- hyponyms(word, pos=None)
Returns the hyponyms of word.
- Parameters
word – text
pos – (list of) part of speech tag(s) (in “Noun”, “Verb”, “Adjective”)
- Returns
list of hyponyms
- load_with_pandas()
Loads the datasets in four dataframes.
- Returns
4 dataframes: words, wordsenses, relations, synsets
- meanings(word, pos=None)
Returns the meanings of word.
- Parameters
word – text
pos – (list of) part of speech tag(s) (in “Noun”, “Verb”, “Adjective”)
- Returns
list of meanings
- pos(word)
Returns the part-of-speech tags (among “Noun”, “Verb” and “Adjective”) that word can be categorized with.
- Parameters
word – text
- Returns
list of part-of-speech tags
- synonyms(word, pos=None)
Returns the synonyms of word.
- Parameters
word – text
pos – (list of) part of speech tag(s) (in “Noun”, “Verb”, “Adjective”)
- Returns
list of synonyms
- Example
“hav” returns [“sø”, “ocean”]
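The lookup behind this example can be sketched with mock data: synonyms are the other words whose senses share a synset. The tables below are invented stand-ins for the word and word-sense databases, not real DanNet entries:

```python
# Mock word table (ID -> form) and word-sense table (word ID, synset ID);
# words whose senses share a synset are synonyms of one another.
words = {"w1": "hav", "w2": "sø", "w3": "ocean", "w4": "hus"}
wordsenses = [("w1", "syn1"), ("w2", "syn1"), ("w3", "syn1"), ("w4", "syn2")]

def synonyms(word):
    # Find the synsets the word belongs to, then collect the other members.
    synset_ids = {syn for w, syn in wordsenses if words[w] == word}
    return sorted(words[w] for w, syn in wordsenses
                  if syn in synset_ids and words[w] != word)

print(synonyms("hav"))  # ['ocean', 'sø']
```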
- wordnet_relations(word, pos=None, eurowordnet=True)
Returns the names of the relations word is associated with.
- Parameters
word – text
pos – (list of) part of speech tag(s) (in “Noun”, “Verb”, “Adjective”)
- Returns
list of relations
Danish Dependency Treebank
- class danlp.datasets.ddt.DDT(cache_dir: str = '/home/docs/.danlp')
Bases: object
Class for loading the Danish Dependency Treebank (DDT) through several frameworks/formats.
The DDT dataset has been annotated with NER tags in the IOB2 format. The dataset is downloaded in CoNLL-U format, but with this class it can be converted to spaCy format or a simple NER format similar to the CoNLL 2003 NER format.
- Parameters
cache_dir (str) – the directory for storing cached models
verbose (bool) – True to increase verbosity
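The simple NER format mentioned above puts one token and its IOB2 tag per line, as in CoNLL 2003. A stdlib-only sketch with a hand-made snippet (not taken from the DDT):

```python
# Hand-made CoNLL 2003-style snippet: one token and its IOB2 NER tag per line.
snippet = """Peter B-PER
Andersen I-PER
bor O
i O
København B-LOC"""
pairs = [line.split() for line in snippet.splitlines()]
# B- marks the beginning of an entity, I- its continuation, O no entity.
entity_tokens = [token for token, tag in pairs if tag != "O"]
print(entity_tokens)  # ['Peter', 'Andersen', 'København']
```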
- load_as_conllu(predefined_splits: bool = False)
Load the DDT in CoNLL-U format.
- Parameters
predefined_splits (bool) – whether to return the predefined train/dev/test splits
- Returns
A single pyconll.Conll, or a tuple of (train, dev, test) pyconll.Conll objects, depending on predefined_splits
- load_with_flair(predefined_splits: bool = False)
Load the DDT with flair.
This function is inspired by the “Reading Your Own Sequence Labeling Dataset” section of Flair's tutorial on reading corpora:
https://github.com/zalandoresearch/flair/blob/master/resources/docs/TUTORIAL_6_CORPUS.md
- Parameters
predefined_splits (bool) – whether to return the predefined train/dev/test splits
- Returns
ColumnCorpus
Note
TODO: Make a pull request to flair similar to this: https://github.com/zalandoresearch/flair/issues/383
- load_with_spacy()
Loads the DDT with spaCy.
This function converts the CoNLL-U files to JSON in the spaCy format.
- Returns
GoldCorpus
Note
Not using jsonl because of: https://github.com/explosion/spaCy/issues/3523
DKHate
- class danlp.datasets.dkhate.DKHate(cache_dir: str = '/home/docs/.danlp')
Bases: object
Class for loading the DKHate dataset. The DKHate dataset contains user-generated comments from social media platforms (Facebook and Reddit) annotated for various types and targets of offensive language. The original corpus was used for the OffensEval 2020 shared task. Note that only the labels for offensive language identification (sub-task A) are available, which means that each sample in this dataset is labelled as either NOT (Not Offensive) or OFF (Offensive).
- Parameters
cache_dir (str) – the directory for storing cached models
verbose (bool) – True to increase verbosity
- load_with_pandas()
Loads the DKHate dataset in a dataframe with pandas.
- Returns
a dataframe for test data and a dataframe for train data
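A toy illustration of the two-label scheme; the comments below are invented stand-ins, not real DKHate rows:

```python
from collections import Counter

# Invented (text, label) pairs following the sub-task A label set.
samples = [
    ("fin kommentar", "NOT"),
    ("grov kommentar", "OFF"),
    ("neutral besked", "NOT"),
]
counts = Counter(label for _, label in samples)
print(counts["NOT"], counts["OFF"])  # 2 1
```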
Sentiment datasets
- class danlp.datasets.sentiment.AngryTweets(cache_dir: str = '/home/docs/.danlp')
Bases: object
Class for loading the AngryTweets Sentiment dataset.
- Parameters
cache_dir (str) – the directory for storing cached models
- load_with_pandas()
Loads the dataset in a dataframe.
- Returns
a dataframe
- class danlp.datasets.sentiment.EuroparlSentiment1(cache_dir: str = '/home/docs/.danlp')
Bases: object
Class for loading the Europarl Sentiment dataset.
- Parameters
cache_dir (str) – the directory for storing cached models
- load_with_pandas()
Loads the dataset in a dataframe, dropping duplicates and NaN values.
- Returns
a dataframe
- class danlp.datasets.sentiment.EuroparlSentiment2(cache_dir: str = '/home/docs/.danlp')
Bases: object
Class for loading the Europarl Sentiment dataset.
- Parameters
cache_dir (str) – the directory for storing cached models
- load_with_pandas()
Loads the dataset as a dataframe.
- Returns
a dataframe
- class danlp.datasets.sentiment.LccSentiment(cache_dir: str = '/home/docs/.danlp')
Bases: object
Class for loading the LCC Sentiment dataset.
- Parameters
cache_dir (str) – the directory for storing cached models
- load_with_pandas()
Loads the dataset files in a dataframe, combining them and dropping duplicates and NaN values.
- Returns
a dataframe
- class danlp.datasets.sentiment.TwitterSent(cache_dir: str = '/home/docs/.danlp')
Bases: object
Class for loading the Twitter Sentiment dataset.
- Parameters
cache_dir (str) – the directory for storing cached models
- load_with_pandas()
Loads the dataset in a dataframe.
- Returns
a dataframe of the test set and a dataframe of the train set
DaUnimorph
- class danlp.datasets.unimorph.DaUnimorph(cache_dir='/home/docs/.danlp', verbose=False)
Bases: object
Danish Unimorph. See also: https://unimorph.github.io/.
The Danish Unimorph is a database which contains knowledge (lemmas and morphological features) about different forms of nouns and verbs in Danish.
- Parameters
cache_dir (str) – the directory for storing cached models
verbose (bool) – True to increase verbosity
- get_inflections(form, pos=None, is_lemma=False, with_features=False)
Returns all possible inflections (forms) of a word, based on its lemma.
- Returns
list of words
- get_lemmas(form, pos=None, with_features=False)
Returns the lemma(s) of a word.
- Returns
list of lemmas
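The two lookups above can be sketched with mock Unimorph-style rows of (form, lemma, features); the rows below are invented for illustration, not real database entries:

```python
# Mock (form, lemma, features) rows in the Unimorph style (invented).
rows = [
    ("hus", "hus", "N;INDF;SG"),
    ("huset", "hus", "N;DEF;SG"),
    ("huse", "hus", "N;INDF;PL"),
]

def get_lemmas(form):
    """All lemmas recorded for a surface form."""
    return sorted({lemma for f, lemma, _ in rows if f == form})

def get_inflections(lemma):
    """All surface forms recorded for a lemma."""
    return sorted({f for f, l, _ in rows if l == lemma})

print(get_lemmas("huset"))     # ['hus']
print(get_inflections("hus"))  # ['hus', 'huse', 'huset']
```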
- load_with_pandas()
Loads the dataset in a dataframe.
- Returns
a dataframe
WikiANN
- class danlp.datasets.wiki_ann.WikiAnn(cache_dir: str = '/home/docs/.danlp')
Bases: object
Class for loading the WikiANN dataset.
- Parameters
cache_dir (str) – the directory for storing cached models
- load_with_flair(predefined_splits: bool = False)
Loads the dataset with flair.
- Parameters
predefined_splits (bool) – whether to return the predefined train/dev/test splits
- Returns
ColumnCorpus
- load_with_spacy()
Loads the dataset with spaCy.
This function converts the CoNLL02/03 format to the JSON format used by spaCy. As the function returns a spacy.gold.GoldCorpus, which needs a dev set, it also splits the dataset into a 70/30 train/dev split, as done by Pan et al. (2017).
Pan et al. (2017): https://aclweb.org/anthology/P17-1178
- Returns
GoldCorpus
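The 70/30 split mentioned above amounts to the following stdlib-only sketch; the items stand in for sentences from the corpus:

```python
# Ten placeholder sentences; a 70/30 cut yields train and dev portions.
sentences = [f"sent_{i}" for i in range(10)]
cut = len(sentences) * 7 // 10  # 70% for train, using exact integer math
train, dev = sentences[:cut], sentences[cut:]
print(len(train), len(dev))  # 7 3
```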
Word similarity datasets
- class danlp.datasets.word_sim.DSD(cache_dir: str = '/home/docs/.danlp')
Bases: object
Class for loading the Danish Similarity Dataset (DSD).
- Parameters
cache_dir (str) – the directory for storing cached models
- load_with_pandas()
Loads the dataset in a dataframe.
- Returns
a dataframe
- words() → set
Loads the vocabulary.
- Return type
set
- class danlp.datasets.word_sim.WordSim353Da(cache_dir: str = '/home/docs/.danlp')
Bases: object
Class for loading the WordSim-353 dataset.
- Parameters
cache_dir (str) – the directory for storing cached models
- load_with_pandas()
Loads the dataset in a dataframe.
- Returns
a dataframe
- words() → set
Loads the vocabulary.
- Return type
set