How to harvest Word Combinations from corpora: Methods, evaluation and perspectives

Alessandro Lenci; Francesca Masini; Malvina Nissim; Sara Castagnoli; Gianluca E. Lebani; Lucia C. Passaro; Marco S.G. Senaldi

doi:10.4454/ssl.v55i2.212

Vol. 55 No 2 (2017): Word Combinations: phenomena, methods of extraction, tools

Articles


Alessandro Lenci
CoLing Lab, Dipartimento di Filologia, Letteratura e Linguistica Università di Pisa

Francesca Masini
Dipartimento di Lingue, Letterature e Culture Moderne Università di Bologna

Malvina Nissim
Faculty of Arts University of Groningen

Sara Castagnoli
Dipartimento di Scienze della Formazione, dei Beni Culturali e del Turismo Università di Macerata

Gianluca E. Lebani
CoLing Lab, Dipartimento di Filologia, Letteratura e Linguistica Università di Pisa

Lucia C. Passaro
CoLing Lab, Dipartimento di Filologia, Letteratura e Linguistica Università di Pisa

Marco S.G. Senaldi
Laboratorio di Linguistica “G. Nencioni” Scuola Normale Superiore di Pisa

DOI: https://doi.org/10.4454/ssl.v55i2.212

Published 2018-02-02

Keywords

word combinations,
computational methods,
idiomatic expressions


Abstract

This paper reports on work carried out within the CombiNet project on the automatic extraction of word combinations from large corpora, with a view to representing the full distributional profile of selected lemmas. We describe two extraction methods, based on part-of-speech sequences (P-method) and syntactic patterns (S-method) respectively, evaluate their performance both contrastively and against external benchmarks, and discuss the relevance of automatic knowledge acquisition for lexicographic purposes. Our results indicate that both approaches provide valuable data and confirm previous claims that P-methods and S-methods are largely complementary, as they tend to retrieve different types of word combinations. In the second part of the paper, we present SYMPAThy, a data representation format devised to merge the two methods by leveraging their respective strengths. To explore SYMPAThy’s potential, we also describe a preliminary investigation of a small set of Italian idioms, focusing on their degree of fixedness/productivity.
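The contrast between the two extraction strategies can be illustrated with a minimal sketch. It is not the CombiNet implementation: the toy sentence, the Universal Dependencies-style tags and relation labels, and the function names `p_method` and `s_method` are all assumptions made for illustration. The P-style function matches contiguous part-of-speech sequences, while the S-style function reads head-dependent pairs off a dependency parse.

```python
# Toy POS-tagged, dependency-parsed sentence (hypothetical data, not from the paper).
# Each token: (index, lemma, pos, head_index, dependency_relation)
SENTENCE = [
    (1, "the", "DET", 2, "det"),
    (2, "committee", "NOUN", 3, "nsubj"),
    (3, "take", "VERB", 0, "root"),
    (4, "a", "DET", 6, "det"),
    (5, "final", "ADJ", 6, "amod"),
    (6, "decision", "NOUN", 3, "obj"),
]


def p_method(tokens, pattern):
    """P-style sketch: lemma sequences whose POS tags match `pattern` contiguously."""
    n = len(pattern)
    hits = []
    for i in range(len(tokens) - n + 1):
        window = tokens[i:i + n]
        if tuple(t[2] for t in window) == tuple(pattern):
            hits.append(tuple(t[1] for t in window))
    return hits


def s_method(tokens, relation):
    """S-style sketch: (head lemma, dependent lemma) pairs linked by `relation`."""
    by_index = {t[0]: t for t in tokens}
    return [
        (by_index[t[3]][1], t[1])
        for t in tokens
        if t[4] == relation and t[3] in by_index  # skip the root (head index 0)
    ]


# A rigid VERB-DET-NOUN pattern misses "take a final decision" (an ADJ intervenes).
p_hits = p_method(SENTENCE, ("VERB", "DET", "NOUN"))
# A contiguous ADJ-NOUN pattern does find the adjacent pair.
adj_noun = p_method(SENTENCE, ("ADJ", "NOUN"))
# The dependency view recovers the verb-object pair regardless of intervening words.
s_hits = s_method(SENTENCE, "obj")
```

In this toy case the surface pattern captures the adjacent adjective-noun pair, while only the dependency-based lookup links the verb to its object across the intervening words, which is one way the two approaches can end up retrieving different types of combinations.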


Studi e Saggi Linguistici
