Vol. 58 No. 1 (2020)
Articles

Lemmatization and morphological analysis for the Latin Dependency Treebank

Giuseppe G.A. Celano
Abteilung Automatische Sprachverarbeitung, Institut für Informatik, Universität Leipzig

Published 2020-09-02

Keywords

  • Latin Dependency Treebank
  • lemmatization
  • PoS tagging

Abstract

This article discusses some of the challenges posed by lemmatization and PoS tagging of Latin, with reference to the ongoing revision of the Latin Dependency Treebank. It reviews the options currently available for lemmatization and morphological analysis of Latin. The pipeline used to annotate the morphological layer of the Latin Dependency Treebank consists of three main steps: (i) tokenization and sentence splitting, performed by a documented rule-based algorithm; (ii) pre-population by means of COMBO, a state-of-the-art joint lemmatizer, PoS tagger, and parser trained on the data of the Latin Dependency Treebank 2.1; and (iii) manual error correction, informed by the attempt to identify and document lemmatization and morphological annotation rules.
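
The three-step shape of the pipeline can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration: the regular expressions are not the treebank's documented tokenization rules, `toy_tagger` is a hypothetical dictionary lookup standing in for a trained model such as COMBO (whose actual interface is not shown here), and the lemma and PoS labels are placeholders rather than the treebank's tagset.

```python
import re
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple


@dataclass
class Token:
    form: str
    lemma: str = "_"  # "_" marks a value that has not been filled in yet
    pos: str = "_"


# Step (i): rule-based sentence splitting and tokenization.
# Illustrative rules only, NOT the treebank's documented algorithm.
SENTENCE_END = re.compile(r"(?<=[.?!])\s+")
TOKEN = re.compile(r"\w+|[^\w\s]")


def split_sentences(text: str) -> List[str]:
    """Split after '.', '?' or '!' when followed by whitespace."""
    return [s for s in SENTENCE_END.split(text.strip()) if s]


def tokenize(sentence: str) -> List[Token]:
    """Separate word characters from individual punctuation marks."""
    return [Token(form) for form in TOKEN.findall(sentence)]


# Step (ii): automatic pre-population. Any callable with this signature
# could be plugged in; toy_tagger is a hypothetical stand-in for a model.
Tagger = Callable[[List[Token]], List[Token]]

TOY_LEXICON: Dict[str, Tuple[str, str]] = {
    "puella": ("puella", "noun"),
    "rosam": ("rosa", "noun"),
    "amat": ("amo", "verb"),
}


def toy_tagger(tokens: List[Token]) -> List[Token]:
    for tok in tokens:
        if tok.form.lower() in TOY_LEXICON:
            tok.lemma, tok.pos = TOY_LEXICON[tok.form.lower()]
        elif not tok.form.isalnum():
            tok.lemma, tok.pos = tok.form, "punctuation"
    return tokens


# Step (iii): manual correction, modeled as overrides keyed by
# (sentence index, token index), both zero-based.
Corrections = Dict[Tuple[int, int], Tuple[str, str]]


def annotate(text: str, tagger: Tagger, corrections: Corrections) -> List[List[Token]]:
    sentences = [tagger(tokenize(s)) for s in split_sentences(text)]
    for (sent_idx, tok_idx), (lemma, pos) in corrections.items():
        tok = sentences[sent_idx][tok_idx]
        tok.lemma, tok.pos = lemma, pos
    return sentences


if __name__ == "__main__":
    fixes = {(1, 0): ("quis", "pronoun")}  # a hand-made correction
    for sentence in annotate("Puella rosam amat. Quid agis?", toy_tagger, fixes):
        print([(t.form, t.lemma, t.pos) for t in sentence])
```

Keeping the corrections separate from the tagger output mirrors the division between automatic pre-population and manual revision described in the abstract: the automatic step can be rerun without losing the hand-made fixes.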