
How to harvest Word Combinations from corpora: Methods, evaluation and perspectives

Alessandro Lenci, Francesca Masini, Malvina Nissim, Sara Castagnoli, Gianluca E. Lebani, Lucia C. Passaro, Marco S.G. Senaldi


This paper reports on work carried out within the CombiNet project on the automatic extraction of word combinations from large corpora, with a view to representing the full distributional profile of selected lemmas. We describe two extraction methods, based on part-of-speech sequences (P-method) and syntactic patterns (S-method) respectively. We evaluate their performance, both against each other and with reference to external benchmarks, and discuss the relevance of automatic knowledge acquisition for lexicographic purposes. Our results indicate that both approaches provide valuable data and confirm previous claims that P-methods and S-methods are largely complementary, as they tend to retrieve different types of word combinations. In the second part of the paper, we present SYMPAThy, a data representation format devised to merge the two methods by leveraging their respective strengths. To explore SYMPAThy’s potential, we also describe a preliminary investigation of a small set of Italian idioms, focusing on their degree of fixedness/productivity.


Keywords: word combinations; computational methods; idiomatic expressions

eISSN 2281-9142 - ISSN 0085-6827