Vol. 55 No. 2 (2017): Word Combinations: phenomena, methods of extraction, tools
Articles

How to harvest Word Combinations from corpora: Methods, evaluation and perspectives

Alessandro Lenci
CoLing Lab, Dipartimento di Filologia, Letteratura e Linguistica, Università di Pisa
Francesca Masini
Dipartimento di Lingue, Letterature e Culture Moderne, Università di Bologna
Malvina Nissim
Faculty of Arts, University of Groningen
Sara Castagnoli
Dipartimento di Scienze della Formazione, dei Beni Culturali e del Turismo, Università di Macerata
Gianluca E. Lebani
CoLing Lab, Dipartimento di Filologia, Letteratura e Linguistica, Università di Pisa
Lucia C. Passaro
CoLing Lab, Dipartimento di Filologia, Letteratura e Linguistica, Università di Pisa
Marco S.G. Senaldi
Laboratorio di Linguistica “G. Nencioni”, Scuola Normale Superiore di Pisa

Published 2018-02-02

Keywords

  • word combinations
  • computational methods
  • idiomatic expressions

Abstract

This paper reports on work carried out within the CombiNet project on the automatic extraction of word combinations from large corpora, with a view to representing the full distributional profile of selected lemmas. We describe two extraction methods, based on part-of-speech sequences (P-method) and syntactic patterns (S-method) respectively, evaluate their performance both against each other and against external benchmarks, and discuss the relevance of automatic knowledge acquisition for lexicographic purposes. Our results indicate that both approaches provide valuable data and confirm previous claims that P-methods and S-methods are largely complementary, as they tend to retrieve different types of word combinations. In the second part of the paper, we present SYMPAThy, a data representation format devised to merge the two methods by leveraging their respective strengths. To explore SYMPAThy’s potential, we also describe a preliminary investigation of a small set of Italian idioms, focusing on their degree of fixedness/productivity.