Lemmatization and morphological analysis for the Latin Dependency Treebank

Giuseppe G.A. Celano

Abstract


This article discusses some challenges posed by lemmatization and PoS tagging of Latin, with reference to the ongoing revision of the Latin Dependency Treebank. It reviews and discusses the options currently available for lemmatization and morphological analysis of Latin. It then shows that the pipeline used to annotate the morphological layer of the Latin Dependency Treebank consists of three main steps: (i) tokenization and sentence splitting, performed by a documented rule-based algorithm; (ii) pre-population by means of COMBO, a state-of-the-art joint lemmatizer, PoS tagger, and parser trained on the data of the Latin Dependency Treebank 2.1; and (iii) manual error correction, informed by the attempt to identify and document rules for lemmatization and morphological annotation.
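The three steps named in the abstract can be pictured with a minimal Python sketch. Everything in it is an assumption made for illustration: the splitting and tokenization rules are simplified stand-ins for the documented rule-based algorithm, `toy_model` is a hypothetical lookup table that takes the place of COMBO (its real interface is not used here), and the part-of-speech labels are illustrative rather than the treebank's tagset.

```python
import re
from dataclasses import dataclass


@dataclass
class Token:
    form: str
    lemma: str = "_"  # "_" marks a field that has not been annotated yet
    pos: str = "_"


def split_sentences(text):
    # Step (i), assumed rule: a sentence ends at . ? ! or ; followed by whitespace.
    parts = re.split(r"(?<=[.?!;])\s+", text.strip())
    return [p for p in parts if p]


def tokenize(sentence):
    # Step (i), assumed rule: each word and each punctuation mark is one token.
    return [Token(form) for form in re.findall(r"\w+|[^\w\s]", sentence)]


def prepopulate(tokens, model):
    # Step (ii): fill lemma and PoS automatically.
    # `model` is any callable mapping a form to (lemma, pos); it stands in for
    # a trained tagger and does not reproduce COMBO's actual API.
    for tok in tokens:
        tok.lemma, tok.pos = model(tok.form)
    return tokens


def apply_corrections(tokens, corrections):
    # Step (iii): manual corrections, given as {token index: (lemma, pos)},
    # override the automatic annotation.
    for index, (lemma, pos) in corrections.items():
        tokens[index].lemma, tokens[index].pos = lemma, pos
    return tokens


def toy_model(form):
    # Hypothetical lexicon; unknown forms fall back to the lowercased form.
    lexicon = {
        "Gallia": ("Gallia", "noun"),
        "est": ("sum", "verb"),
        "omnis": ("omnis", "adjective"),
    }
    return lexicon.get(form, (form.lower(), "unknown"))


text = "Gallia est omnis divisa in partes tres. Quarum unam incolunt Belgae."
sentences = split_sentences(text)
tokens = prepopulate(tokenize(sentences[0]), toy_model)
# A human annotator fixes the token "partes" (index 5), which the toy model missed.
tokens = apply_corrections(tokens, {5: ("pars", "noun")})
```

Keeping the automatic annotation and the manual corrections as separate passes means the corrections can be recorded and documented independently of whichever model produced the first draft.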

Keywords


Latin Dependency Treebank; lemmatization; PoS tagging


DOI: https://doi.org/10.4454/ssl.v58i1.274


Copyright (c) 2020 Studi e Saggi Linguistici

This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License.
eISSN 2281-9142 - ISSN 0085-6827