Documentation about the coreference resource: Dacoref

For an overview of the different datasets, see the general dataset docs. This page gives additional documentation for the coreference resource named Dacoref. The general dataset docs also include a short snippet showing how to load this resource with the DaNLP package. The resource can also be downloaded directly using the link below:

Download dacoref

This documentation provides details about how the resource has been constructed.

The work was conducted by Maria Jung Barrett.

LICENCE

This resource is copyrighted material, licensed under the GNU General Public License version 2.

Dacoref

This Danish coreference annotation contains parts of the Copenhagen Dependency Treebank (Kromann and Lynge, 2004). Please cite this work when using the coreference resource. For the Universal Dependencies conversion included in the file, please cite Johannsen et al. (2015).

Size: 64,076 tokens, 3,403 sentences, 341 documents

The coreference annotation was originally created as part of the Copenhagen Dependency Treebank (CDT) project but was never finished. The incomplete annotation can be downloaded from the project's GitHub repository.

The CDT documentation contains a description of the coreference classes as well as inter-annotator agreement and confusion matrices.

For this resource, we used the annotation files from the annotator “Lotte” along with the UD syntax, which is an automatic conversion of the CDT syntax annotation by Johannsen et al. (2015). We provide the sentence ID from the UD resource as well as the document ID from CDT. The document ID has been prepended with a two-letter domain code compatible with the domain codes of the OntoNotes corpus. This is a manual mapping of the sources listed in the CDT. Only nw (newswire), mz (magazine), and bn (broadcast news) were present:

  • 299 nw documents

  • 41 mz documents

  • 1 bn document
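The domain breakdown above can be recomputed from the document IDs. The following is a minimal sketch that assumes only what is stated above, namely that each document ID starts with its two-letter domain code. The sample IDs and the helper name `count_domains` are hypothetical; the exact ID layout in the file may differ.

```python
from collections import Counter

# Hypothetical document IDs. Only the two-letter prefix matters here;
# the part after the prefix is an assumption about the ID layout.
sample_doc_ids = ["nw/0001", "nw/0002", "mz/0003", "bn/0004", "nw/0001"]


def count_domains(doc_ids):
    """Count unique documents per two-letter domain prefix."""
    return Counter(doc_id[:2] for doc_id in set(doc_ids))


domain_counts = count_domains(sample_doc_ids)
print(domain_counts)
```

Duplicates are removed first because the document ID is repeated on every token of a document.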

For the CDT, only the core node of each span was annotated and one annotator manually propagated the label to the entire span. A few systematic errors were corrected in this process, the most important being that plural pronouns “we” and “they” can be coreferent with company names if they refer to the employee group of this company.

For this resource we have merged the following labels to form uniquely numbered clusters: coref, coref-evol, coref-iden, coref-iden.sb, coref-var, and ref. Coref-res and coref-res.prg are also included as clusters but not merged with any other label, nor each other.
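One way to picture the merging is as a union-find pass over coreference links. This is only an illustrative sketch, not the procedure that was actually used to build the resource: the input format (triples of two mention IDs and a label), the mention IDs, the label "other-label", and the function name `build_clusters` are all assumptions.

```python
# Labels merged into shared clusters, and labels kept apart (from the text above).
MERGED_LABELS = {"coref", "coref-evol", "coref-iden", "coref-iden.sb", "coref-var", "ref"}
SEPARATE_LABELS = {"coref-res", "coref-res.prg"}


def build_clusters(links):
    """Turn (mention_a, mention_b, label) links into numbered clusters."""
    parent = {}

    def find(node):
        # Union-find lookup with path halving.
        parent.setdefault(node, node)
        while parent[node] != node:
            parent[node] = parent[parent[node]]
            node = parent[node]
        return node

    for a, b, label in links:
        if label in MERGED_LABELS:
            group = "merged"
        elif label in SEPARATE_LABELS:
            group = label
        else:
            continue  # not one of the labels listed above
        # Nodes are keyed by (group, mention), so groups never merge with each other.
        root_a, root_b = find((group, a)), find((group, b))
        if root_a != root_b:
            parent[root_b] = root_a

    members_by_root = {}
    for group, mention in list(parent):
        members_by_root.setdefault(find((group, mention)), []).append(mention)

    ordered = sorted((root[0], sorted(members)) for root, members in members_by_root.items())
    return {cluster_id: entry for cluster_id, entry in enumerate(ordered, start=1)}


# Hypothetical links between hypothetical mention IDs.
links = [
    ("m1", "m2", "coref"),
    ("m2", "m3", "coref-iden"),
    ("m4", "m5", "coref-res"),
    ("m5", "m6", "coref-res.prg"),
    ("m7", "m8", "other-label"),
]
clusters = build_clusters(links)
print(clusters)
```

In this sketch, a mention such as m5 can belong to a coref-res cluster and a coref-res.prg cluster at the same time, since those two labels are not merged with each other.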

Some notes about the annotation (see also the CDT documentation): If a conjunction of entities is only referred to as a group, it is marked as one span. For example, if “Lise, Lone og Birthe” are only referred to as a group, e.g. by the plural pronoun “de”, then “Lise, Lone og Birthe” is marked as one span. The spans are generally as long as possible. Example: Det sidste gik ud over politikerne, da de i sin tid præsenterede [det første forslag til den milliard-dyre vandmiljøplan].

Singletons are not annotated. The annotation does not label attributive noun phrases that are connected through copula verbs such as “to be”. Name-initial appositive constructions are part of the same mention as the name. Generic pronouns (mainly “man” and “du”) are not clustered unless they are part of a cluster, e.g. with a reflexive or possessive pronoun.

Furthermore, the resource has been augmented with Qcodes (QIDs) from Wikidata. This was a semi-automatic process conducted in the spring of 2020 with the Wikidata entries available at that time. First, all tokens (not just named entities) were used as search terms with the Wikidata API. Given the entire list of matches with a description for each token, one annotator decided which QID match was correct for each instance. It was decided in each case whether, e.g., “Østre landsret” refers to the building, the governmental administrative unit in Denmark, or the legal process happening there. This was checked by another annotator, who also manually added the Qcode to the correct span in the text. Both were native speakers of Danish. Furthermore, this process also included adding a generic QID for words that matched one of the categories listed under Generic QIDs but did not exist as a specific Wikidata entry. This can e.g. be used to decide which properties an entity may have or whether a name refers to a feminine or masculine entity.

In total, 7173 tokens were annotated with a Qcode. 2193 unique Qcodes were used.
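Counts like these can be recomputed from the `qid` column. The sketch below works on any list of sentences whose tokens behave like dictionaries with a `qid` key. The set of placeholder values for "no QID" is an assumption and should be checked against the file, and the sample tokens are made up for illustration.

```python
# Assumed placeholders for tokens without a QID; verify against the actual file.
EMPTY_VALUES = {None, "", "-", "_"}


def qid_stats(sentences):
    """Return (number of tokens with a QID, number of unique QIDs)."""
    annotated = 0
    unique_qids = set()
    for sentence in sentences:
        for token in sentence:
            qid = token.get("qid")
            if qid not in EMPTY_VALUES:
                annotated += 1
                unique_qids.add(qid)
    return annotated, len(unique_qids)


# Made-up sample sentences; the QIDs are taken from the generic QID list.
sample = [
    [
        {"form": "Østre", "qid": "Q21268738"},
        {"form": "Landsret", "qid": "Q21268738"},
        {"form": "afgjorde", "qid": "-"},
    ],
    [{"form": "Politiken", "qid": "Q11032"}],
]
stats = qid_stats(sample)
print(stats)
```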

The file can be opened with any CoNLL reader that accepts an arbitrary number of fields, e.g. conllu:

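import conllu

fields = ["id", "form", "lemma", "upos", "xpos", "feats", "head", "deprel", "deps", "misc", "coref_id", "coref_rel", "doc_id", "qid"]
# Open the file as UTF-8 (assumed encoding) and close it once parsing is done.
with open("CDT_coref.conll", encoding="utf-8") as f:
    conlist = conllu.parse(f.read(), fields=fields)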
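Once parsed, each sentence is a list of tokens that can be indexed by the field names given above. As a minimal sketch of working with the extra columns, the helper below groups sentences by their `doc_id`. It is written against plain lists of dictionaries so that it runs without the data file; the sample tokens and the name `sentences_by_document` are hypothetical.

```python
from collections import defaultdict


def sentences_by_document(sentences):
    """Map each doc_id to the list of its sentences, joined as plain text."""
    documents = defaultdict(list)
    for sentence in sentences:
        if not sentence:
            continue  # skip empty sentences
        # Assumes every token in a sentence carries the same doc_id.
        doc_id = sentence[0]["doc_id"]
        documents[doc_id].append(" ".join(token["form"] for token in sentence))
    return dict(documents)


# Made-up tokens standing in for parsed sentences.
sample = [
    [{"form": "Hun", "doc_id": "nw/0001"}, {"form": "kom", "doc_id": "nw/0001"}],
    [{"form": "De", "doc_id": "nw/0001"}, {"form": "gik", "doc_id": "nw/0001"}],
    [{"form": "Ja", "doc_id": "mz/0002"}],
]
docs = sentences_by_document(sample)
print(docs)
```

With the real data, the parsed list `conlist` can be passed in place of `sample`.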

🎓 References

Johannsen, A., Alonso, H. M., & Plank, B. (2015). Universal Dependencies for Danish. In International Workshop on Treebanks and Linguistic Theories (TLT14) (p. 157).

Kromann, M. T., & Lynge, S. K. (2004). Danish Dependency Treebank v. 1.0. Department of Computational Linguistics, Copenhagen Business School. https://github.com/mbkromann/copenhagen-dependency-treebank

Weischedel, R., Palmer, M., Marcus, M., Hovy, E., Pradhan, S., Ramshaw, L., … & El-Bachouti, M. (2013). OntoNotes Release 5.0 LDC2013T19. Linguistic Data Consortium, Philadelphia, PA, 23. https://catalog.ldc.upenn.edu/docs/LDC2013T19/OntoNotes-Release-5.0.pdf

Generic QIDs

Category QID
Family name Q101352
Unisex nickname Q49614
Female given name Q11879590
Male given name Q12308941
Unisex name Q3409032
Artist name Q483501
Magazine Q41298
Hotel Q27686
Work of art Q838948
Governmental administrative unit in Denmark Q21268738
Municipal Police Q1758690
Road Q34442
Cohousing Q1107167
Postal address Q319608
Museum Q33506
Security (tradeable financial asset) Q169489
Geographic location Q2221906
Radio program Q1555508
TV program Q15416
Product / goods Q2424752
Department within organisation Q2366457
Organization Q43229
Sports venue Q1076486
Dish Q746549 (only one instance)
Event Q1656682
Fleet Q189524
University Q3918
Disease Q12136
Coast Q93352
Ship Q11446
Award Q618779
Automobile model Q3231690
Project (also Inquiry) Q170584
Hospital Q16917
Amusement ride Q1144661
Sports team Q12973014
Building Q41176
Bill (proposed law) Q686822
Restaurant Q11707
People / ethnic group Q2472587
Educational institution Q2385804
Shop Q213441
Publication Q732577
Legislation Q49371
Night club Q622425
Newspaper Q11032
Prison Q40357
Army Q37726
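As a small illustration of how the generic QIDs can be used, the sketch below maps the name-related QIDs from the list above to their categories, for example to check whether a name was tagged as a female or male given name. The dictionary and the function `name_category` are hypothetical helpers, not part of the resource or of DaNLP.

```python
# Name-related generic QIDs, copied from the list above.
NAME_QIDS = {
    "Q11879590": "female given name",
    "Q12308941": "male given name",
    "Q3409032": "unisex name",
    "Q101352": "family name",
}


def name_category(qid):
    """Return the generic name category for a QID, or None if it is not one."""
    return NAME_QIDS.get(qid)


print(name_category("Q11879590"))
print(name_category("Q11032"))
```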