Lemmatization and morphological analysis for the Latin Dependency Treebank

Giuseppe G.A. Celano

doi:10.4454/ssl.v58i1.274

Vol. 58 No 1 (2020)

Articles

Giuseppe G.A. Celano
Abteilung Automatische Sprachverarbeitung, Institut für Informatik, Universität Leipzig

DOI: https://doi.org/10.4454/ssl.v58i1.274

Published 2020-09-02

Keywords

Latin Dependency Treebank,
lemmatization,
PoS tagging

Abstract

This article presents some of the challenges posed by lemmatization and PoS tagging of Latin, with reference to the ongoing revision of the Latin Dependency Treebank. It reviews and discusses the options currently available for lemmatization and morphological analysis of Latin. The pipeline used to annotate the morphological layer of the Latin Dependency Treebank consists of three main steps: (i) tokenization and sentence splitting, performed by a documented rule-based algorithm; (ii) pre-population by means of COMBO, a state-of-the-art joint lemmatizer, PoS tagger, and parser trained on the data of the Latin Dependency Treebank 2.1; and (iii) manual error correction, informed by the attempt to identify and document lemmatization and morphological annotation rules.

Studi e Saggi Linguistici
