Strudel: A corpus-based semantic model based on properties and types

This page contains information and resources pertaining to the Strudel algorithm for the extraction of property-based concept descriptions from corpora.

NB: Please consider also the similar but larger DM resource!

The reference publication is:

M. Baroni, B. Murphy, E. Barbu and M. Poesio. 2010. Strudel: A corpus-based semantic model based on properties and types. Cognitive Science 34 (2): 222-254.

Abstract: Computational models of meaning trained on naturally occurring text successfully model human performance on tasks involving simple similarity measures, but they characterize meaning in terms of undifferentiated bags of words or topical dimensions. This has led some to question their psychological plausibility (Murphy, 2002; Schunn, 1999). We present here a fully automatic method for extracting a structured and comprehensive set of concept descriptions directly from an English part-of-speech-tagged corpus. Concepts are characterized by weighted properties, enriched with concept-property types that approximate classical relations such as hypernymy and function. Our model outperforms comparable algorithms in cognitive tasks pertaining not only to concept-internal structures (discovering properties of concepts, grouping properties by property type) but also to inter-concept relations (clustering into superordinates), suggesting the empirical validity of the property-based approach.

The supplementary materials mentioned in the article are available on the Cognitive Science site indicated in the article; you can also find them here in an easier-to-browse format.

Technical report

Code

We provide here the two Perl scripts we used for pattern extraction and generalization. All other parts of the Strudel procedure are implemented using Unix command line tools or external code (see the Links section).

Once you make them executable, you can see the usage information of the scripts by calling them with the -h option.

Data

The data here (all compressed text files) pertain to the "Baked Strudel" version described in the main article.

Input

Output

The Baked Strudel model, with each line in format:

concept property log-likelihood-ratio (gentype count prop)+

where gentype is a generalized type, count is the number of times the type connects the concept and the property, and prop is the proportion of the type in the overall type sketch distribution of the concept-property pair. This file only contains types that account for at least 10% of the overall type distribution. We can provide a larger file with all types upon request.
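
As a reading aid, here is a minimal Python sketch of how one line in this format could be parsed. It is not part of the Strudel distribution. It assumes that fields are separated by whitespace, that a gentype contains no internal whitespace, and that count is an integer; none of this is confirmed by the description above, so check it against the actual file. The names parse_line, StrudelEntry and TypeEntry, as well as the sample line, are made up for illustration.

```python
from typing import List, NamedTuple


class TypeEntry(NamedTuple):
    gentype: str   # generalized type linking concept and property
    count: int     # times the type connects concept and property
    prop: float    # share of the type in the pair's type distribution


class StrudelEntry(NamedTuple):
    concept: str
    prop_name: str
    llr: float     # log-likelihood ratio of the concept-property pair
    types: List[TypeEntry]


def parse_line(line: str) -> StrudelEntry:
    """Parse 'concept property llr (gentype count prop)+'.

    Assumes whitespace-separated fields (an assumption, not documented).
    """
    fields = line.split()
    # Three leading fields, then one or more (gentype, count, prop) triples.
    if len(fields) < 6 or (len(fields) - 3) % 3 != 0:
        raise ValueError(f"unexpected number of fields: {len(fields)}")
    types = [
        TypeEntry(fields[i], int(fields[i + 1]), float(fields[i + 2]))
        for i in range(3, len(fields), 3)
    ]
    return StrudelEntry(fields[0], fields[1], float(fields[2]), types)


# Invented sample line, NOT taken from the real Baked Strudel file.
sample = "dog bark 123.4 typeA 25 0.8 typeB 5 0.2"
entry = parse_line(sample)
# Pick the type with the largest share for this concept-property pair.
dominant = max(entry.types, key=lambda t: t.prop)
```

Because the distributed file keeps only types covering at least 10% of the distribution, the prop values on a line need not sum to 1.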

Experiments

This archive contains the full input and output data pertaining to the adaptation of the Baked Strudel model to the clustering experiments reported in sections 7.2 and 8.2 of the paper. See the readme.txt file included in the archive for details.

Links

Contact

Write to marco baroni AT unitn it