Documentation about the coreference resource: Dacoref
To get an overview of the different datasets, please go to the general dataset docs. This is extra documentation for the coreference resource named Dacoref. The general dataset docs also contain a small snippet showing how to load this resource with the DaNLP package. The resource can also be downloaded directly using the link below:
This documentation provides details about how the resource has been constructed.
The work is conducted by Maria Jung Barrett.
This resource is copyrighted material, licensed under the GNU General Public License version 2.
This Danish coreference annotation contains parts of the Copenhagen Dependency Treebank (Kromann and Lynge, 2004). Please cite this work when using the coreference resource. For the Universal Dependencies conversion included in the file, please cite Johannsen et al. (2015).
Size: 64,076 tokens, 3,403 sentences, 341 documents
The coreference annotation was originally produced as part of the Copenhagen Dependency Treebank (CDT) project but was never finished. The incomplete annotation can be downloaded from the project's GitHub repository.
The CDT documentation contains a description of the coreference classes as well as inter-annotator agreement and confusion matrices.
For this resource, we used the annotation files from the annotator “Lotte” along with the UD syntax, which is an automatic conversion of the CDT syntax annotation by Johannsen et al. (2015). We provide the sentence ID from the UD resource as well as the document ID from CDT. The document ID has been prefixed with a two-letter domain code compatible with the domain codes of the OntoNotes corpus. This is a manual mapping of the sources listed in the CDT. Only nw (newswire), mz (magazine), and bn (broadcast news) were present:
299 nw documents
41 mz documents
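Because the domain code is placed at the very start of the CDT document ID, it can be recovered by slicing off the first two characters. The following is a minimal sketch under that assumption; the helper name `split_doc_id` and the sample IDs are hypothetical and are not taken from the resource, so check the actual ID layout in the file before relying on it.

```python
from collections import Counter

# Domain codes the documentation lists as present in the resource.
DOMAINS = {"nw": "newswire", "mz": "magazine", "bn": "broadcast news"}


def split_doc_id(doc_id):
    """Split a prefixed document ID into (domain code, remainder).

    Assumes the two-letter domain code is the first two characters.
    Any separator that may follow it is left in the remainder untouched.
    """
    code, rest = doc_id[:2], doc_id[2:]
    if code not in DOMAINS:
        raise ValueError(f"unknown domain code in {doc_id!r}")
    return code, rest


# Hypothetical IDs, for illustration only.
sample_ids = ["nw0001", "nw0002", "mz0100"]
per_domain = Counter(split_doc_id(d)[0] for d in sample_ids)
print(per_domain)
```

Run over the `doc_id` values of the full file, the same counter could be compared against the per-domain document counts listed above.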
For the CDT, only the core node of each span was annotated and one annotator manually propagated the label to the entire span. A few systematic errors were corrected in this process, the most important being that plural pronouns “we” and “they” can be coreferent with company names if they refer to the employee group of this company.
For this resource, we have merged the following labels to form uniquely numbered clusters: coref, coref-evol, coref-iden, coref-iden.sb, coref-var, and ref. Coref-res and coref-res.prg are also included as clusters but are not merged with any other label, nor with each other.
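The merging rule can be illustrated with a small union-find sketch. The documentation does not say how the merge was actually implemented, so the link representation `(mention, antecedent, label)`, the function name `build_clusters`, and the sample mentions below are all hypothetical; only the two label sets come from the text above.

```python
# Labels that are merged into one shared family of clusters.
MERGED = {"coref", "coref-evol", "coref-iden", "coref-iden.sb", "coref-var", "ref"}
# Labels kept as clusters of their own, merged with nothing else.
SEPARATE = {"coref-res", "coref-res.prg"}


def build_clusters(links):
    """Group mentions into numbered clusters from (mention, antecedent, label) links.

    Links with a MERGED label share one union-find forest; each SEPARATE
    label gets its own forest, so its clusters never join any other label's.
    """
    parent = {}

    def find(node):
        parent.setdefault(node, node)
        while parent[node] != node:
            parent[node] = parent[parent[node]]  # path halving
            node = parent[node]
        return node

    for mention, antecedent, label in links:
        if label in MERGED:
            family = "merged"
        elif label in SEPARATE:
            family = label
        else:
            continue  # label is not part of the coreference clusters
        a, b = find((family, mention)), find((family, antecedent))
        if a != b:
            parent[a] = b

    groups = {}
    for family, mention in list(parent):
        groups.setdefault(find((family, mention)), []).append(mention)
    # Give every cluster a unique number, starting at 1.
    return {i: sorted(members) for i, members in enumerate(groups.values(), start=1)}


# Hypothetical links, for illustration only.
links = [
    ("m2", "m1", "coref"),
    ("m3", "m2", "coref-iden"),
    ("m5", "m4", "coref-res"),
    ("m6", "m4", "coref-res.prg"),
]
clusters = build_clusters(links)
print(clusters)
```

In this sketch `m1`, `m2`, and `m3` end up in one cluster because their labels are in the merged set, while the two links involving `m4` produce two distinct clusters because coref-res and coref-res.prg are kept apart.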
Some notes about the annotation (see also the CDT documentation): If a conjunction of entities is only referred to as a group, it is marked as one span. For example, if “Lise, Lone og Birthe” are only referred to as a group, e.g. by the plural pronoun “de”, then “Lise, Lone og Birthe” is marked as one span. The spans are generally as long as possible. Example: Det sidste gik ud over politikerne, da de i sin tid præsenterede [det første forslag til den milliard-dyre vandmiljøplan].
Singletons are not annotated. The annotation does not label attributive noun phrases that are connected through copula verbs such as “to be”. Name-initial appositive constructions are part of the same mention as the name. Generic pronouns (mainly “man” and “du”) are not clustered unless they are part of a cluster, e.g. with a reflexive or possessive pronoun.
Furthermore, the resource has been augmented with Qcodes from Wikidata. This was a semi-automatic process conducted in the spring of 2020 with the Wikidata entries available at that time. First, every token (not just named entities) was used as a search query against the Wikidata API. Given the entire list of matches with descriptions for each token, one annotator decided which QID match was correct for each instance. It was decided in each case whether, e.g., “Østre landsret” refers to a building, a governmental administrative unit in Denmark, or the legal process happening there. This was checked by another annotator, who also manually added the Qcode to the correct span in the text. Both were native speakers of Danish. This process also included adding a generic QID for words that matched one of the categories below but did not exist as a specific Wikidata entry. This can be used, e.g., to decide which properties an entity may have or whether a name refers to a feminine or masculine entity.
In total, 7173 tokens were annotated with a Qcode. 2193 unique Qcodes were used.
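The candidate lookup step could look roughly like the sketch below. The documentation only says that “the Wikidata API” was searched; the use of the `wbsearchentities` action, the parameter choices, the `User-Agent` string, and the function names are assumptions made for illustration, not a description of the script that was actually used.

```python
import json
import urllib.parse
import urllib.request

API_URL = "https://www.wikidata.org/w/api.php"


def build_search_url(token, language="da", limit=10):
    """Build a wbsearchentities request URL for one token (assumed endpoint)."""
    params = {
        "action": "wbsearchentities",
        "search": token,
        "language": language,
        "uselang": language,
        "type": "item",
        "limit": limit,
        "format": "json",
    }
    return API_URL + "?" + urllib.parse.urlencode(params)


def search_candidates(token, language="da"):
    """Return (QID, label, description) candidates for a token.

    Needs network access; the annotator would then pick the right QID by hand.
    """
    request = urllib.request.Request(
        build_search_url(token, language),
        headers={"User-Agent": "dacoref-docs-example/0.1"},  # hypothetical client name
    )
    with urllib.request.urlopen(request) as response:
        data = json.load(response)
    return [
        (hit["id"], hit.get("label", ""), hit.get("description", ""))
        for hit in data.get("search", [])
    ]


# Only builds the URL; call search_candidates("Østre landsret") to query Wikidata.
print(build_search_url("Østre landsret"))
```

The manual steps described above (choosing among the candidates, verifying the choice, and attaching the Qcode to the right span) are deliberately left out, since they were done by human annotators.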
The file can be opened with a CoNLL reader that accepts an arbitrary number of fields, e.g. conllu:
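    import conllu

    # The ten standard CoNLL-U columns followed by the four resource-specific ones.
    fields = ["id", "form", "lemma", "upos", "xpos", "feats", "head", "deprel",
              "deps", "misc", "coref_id", "coref_rel", "doc_id", "qid"]
    with open("CDT_coref.conll") as f:
        conlist = conllu.parse(f.read(), fields=fields)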
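If the conllu package is not available, the same fourteen columns can be read with the standard library alone. This is a minimal sketch that assumes the file is tab-separated, that comment lines start with `#`, and that an empty `qid` cell is written as `_`; none of these details are confirmed by this documentation, and the two sample rows are made up.

```python
FIELDS = ["id", "form", "lemma", "upos", "xpos", "feats", "head", "deprel",
          "deps", "misc", "coref_id", "coref_rel", "doc_id", "qid"]


def read_tokens(lines):
    """Yield one dict per token line, skipping blank lines and comments."""
    for line in lines:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue
        yield dict(zip(FIELDS, line.split("\t")))


def count_qcodes(tokens, empty=("_", "")):
    """Return (number of tokens carrying a Qcode, number of distinct Qcodes)."""
    qids = [t["qid"] for t in tokens if t.get("qid", "_") not in empty]
    return len(qids), len(set(qids))


# Two made-up rows in the assumed layout, for illustration only.
sample = [
    "\t".join(["1", "Lise", "Lise", "PROPN", "_", "_", "2", "nsubj",
               "_", "_", "_", "_", "nw0001", "Q11879590"]),
    "\t".join(["2", "sover", "sove", "VERB", "_", "_", "0", "root",
               "_", "_", "_", "_", "nw0001", "_"]),
    "",
]
annotated, unique = count_qcodes(read_tokens(sample))
print(annotated, unique)
```

Passing an open file handle for `CDT_coref.conll` instead of `sample` would give counts that can be checked against the Qcode totals reported above, provided the layout assumptions hold.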
Johannsen, A., Alonso, H. M., & Plank, B. (2015). Universal Dependencies for Danish. In International Workshop on Treebanks and Linguistic Theories (TLT14) (p. 157).
Kromann, M. T., & Lynge, S. K. (2004). Danish Dependency Treebank v. 1.0. Department of Computational Linguistics, Copenhagen Business School. https://github.com/mbkromann/copenhagen-dependency-treebank
Weischedel, R., Palmer, M., Marcus, M., Hovy, E., Pradhan, S., Ramshaw, L., … & El-Bachouti, M. (2013). OntoNotes Release 5.0 LDC2013T19. Linguistic Data Consortium, Philadelphia, PA, 23. https://catalog.ldc.upenn.edu/docs/LDC2013T19/OntoNotes-Release-5.0.pdf
Generic QIDs used for words that match a category but have no specific Wikidata entry:

| Category | Generic QID |
|---|---|
| Female given name | Q11879590 |
| Male given name | Q12308941 |
| Work of art | Q838948 |
| Governmental administrative unit in Denmark | Q21268738 |
| Security (tradeable financial asset) | Q169489 |
| Product / goods | Q2424752 |
| Department within organisation | Q2366457 |
| Dish | Q746549 (only one instance) |
| Project (also Inquiry) | Q170584 |
| Bill (proposed law) | Q686822 |
| People / ethnic group | Q2472587 |