This page contains information and resources pertaining to the Distributional Memory approach to corpus-based semantics proposed in the following article:
M. Baroni and A. Lenci. 2010. Distributional Memory: A general framework for corpus-based semantics. Computational Linguistics 36(4): 673-721.
Abstract: Research into corpus-based semantics has focused on the development of ad hoc models that treat single tasks, or sets of closely related tasks, as unrelated challenges to be tackled by extracting different kinds of distributional information from the corpus. As an alternative to this "one task, one model" approach, the Distributional Memory framework extracts distributional information once and for all from the corpus, in the form of a set of weighted word-link-word tuples arranged into a third-order tensor. Different matrices are then generated from the tensor, and their rows and columns constitute natural spaces to deal with different semantic problems. In this way, the same distributional information can be shared across tasks such as modeling word similarity judgments, discovering synonyms, concept categorization, predicting selectional preferences of verbs, solving analogy problems, classifying relations between word pairs, harvesting qualia structures with patterns or example pairs, predicting the typical properties of concepts, and classifying verbs into alternation classes. Extensive empirical testing in all these domains shows that a Distributional Memory implementation performs competitively against task-specific algorithms recently reported in the literature for the same tasks, and against our implementations of several state-of-the-art methods. The Distributional Memory approach is thus shown to be tenable despite the constraints imposed by its multi-purpose nature.
We make the full TypeDM labeled tensor available, in the hope that others will find it useful as a semantic resource (see the article for details). We also provide the word-by-link-word matrix in a version compressed by random indexing, which can be used for efficient computation of word-to-word (attributional) similarity. Finally, we distribute (courtesy of Partha Pratim Talukdar) the top 10 nearest neighbours of each word from the random-indexed matrix.
If you are interested in other data discussed in the article (e.g., other DM models), please get in touch with us.
Given the size of the tensor (4GB when uncompressed), we gzipped it and split it into multiple files to facilitate download. Please download each of the following parts (about 127MB each, except the last one):
Once you have downloaded the files, concatenate and decompress them by entering the following command (in the directory where you downloaded them):
cat typedm.part.a* | gzip -d > typedm.txt
To check that the download and decompression went through correctly, compute the md5 checksum of the recreated file:
md5sum typedm.txt
You should get the following string:
7f106b7d388fb06b2746f7f41043fdda
Each line of the resulting file contains one entry of the labeled tensor, with the following tab-delimited fields: word1 link word2 score. For example:
alliance-n against-1 socialism-n 6.4310
Nouns carry the n suffix, verbs the v suffix, and adjectives the j suffix.
Note that for each word1 link word2 entry, the tensor will also contain a word2 inverse-link word1 entry with the same value. For example, given the line above, you can be sure that the file also contains the following line:
socialism-n against alliance-n 6.4310
Please refer to the article to make sense of the TypeDM tensor data.
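As an illustration of the format described above, here is a minimal Python sketch that scans the tensor file and collects all (link, word2, score) tuples for one target word. The target word is just the example from this page; the file name typedm.txt follows the decompression command above. This is not part of the official DM scripts, only a reading example.

    # Minimal sketch: collect all (link, word2, score) tuples for one target word
    # from the TypeDM tensor file (tab-delimited: word1, link, word2, score).
    # "alliance-n" is the example word used on this page.

    target = "alliance-n"
    tuples = []

    with open("typedm.txt", encoding="utf-8") as f:
        for line in f:
            word1, link, word2, score = line.rstrip("\n").split("\t")
            if word1 == target:
                tuples.append((link, word2, float(score)))

    # Print, e.g., the ten highest-scoring tuples for the target word.
    # Note that inverse-link entries (e.g. "against-1") will also show up
    # when the target word occupies the first slot of such tuples.
    for link, word2, score in sorted(tuples, key=lambda t: -t[2])[:10]:
        print(target, link, word2, score)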
Since we envisage that one of the main applications of DM will be computing semantic similarity between words, we provide a matrix in which the 30,686 words of the TypeDM model are represented by 5K-dimensional vectors that approximate their much higher-dimensional vectors in the full DM w-by-lw matricization. The reduced matrix was created with the random indexing method; being considerably smaller than the full matrix, it should make word similarity computations faster and more efficient. (Our informal experimentation indicates that performance on various tasks with the 5K-dimensional matrix is essentially identical to performance with the full w-by-lw matrix.)
Please download the following parts (about 124MB each, except the last one):
Put them back together and decompress:
cat ri.w-lw.part.a* | gzip -d > ri.w-lw.txt
If everything went well, computing the md5 checksum of the reconstructed file...
md5sum ri.w-lw.txt
you should get the following string:
5d815a4061dca0dec38b855536678a57
Each line of the resulting file contains, in tab-delimited fields, one word, followed by its values on the 5,000 reduced dimensions.
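Given this format, word-to-word similarity can be computed directly from the reduced matrix. The following Python sketch loads the vectors and computes the cosine between two words; it assumes the file name ri.w-lw.txt from the command above, and the two query words are placeholders that we assume to be in the model's vocabulary (words carry the POS suffixes described earlier).

    import math

    # Load the reduced word vectors (tab-delimited: word, then 5,000 values).
    vectors = {}
    with open("ri.w-lw.txt", encoding="utf-8") as f:
        for line in f:
            fields = line.rstrip("\n").split("\t")
            vectors[fields[0]] = [float(x) for x in fields[1:]]

    def cosine(u, v):
        # Standard cosine similarity between two equal-length vectors.
        dot = sum(a * b for a, b in zip(u, v))
        norm_u = math.sqrt(sum(a * a for a in u))
        norm_v = math.sqrt(sum(b * b for b in v))
        return dot / (norm_u * norm_v)

    # Example query: similarity between two nouns (placeholder words,
    # assumed to be among the 30,686 words in the model).
    print(cosine(vectors["dog-n"], vectors["cat-n"]))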
Partha Pratim Talukdar constructed the following list of the top 10 nearest neighbours of each word (a "distributional thesaurus") based on the reduced matrix above, using the SSpace package:
Each line of the file contains, in tab-delimited fields, one word followed by one of its top 10 nearest neighbours, followed by the cosine between the word and the neighbour.
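Since each word is spread over 10 lines, it may be convenient to group the entries by target word before using them. A small Python sketch along these lines is shown below; "thesaurus.txt" is a placeholder name for the downloaded file.

    from collections import defaultdict

    # Group the distributional thesaurus lines by target word:
    # each line is tab-delimited (word, neighbour, cosine).
    # "thesaurus.txt" is a placeholder for the actual file name.
    neighbours = defaultdict(list)
    with open("thesaurus.txt", encoding="utf-8") as f:
        for line in f:
            word, neighbour, cos = line.rstrip("\n").split("\t")
            neighbours[word].append((neighbour, float(cos)))

    # E.g., the 10 nearest neighbours of "alliance-n" (the example word above).
    print(neighbours.get("alliance-n", []))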
We provide a set of Perl scripts to manipulate the tensor tuples: building matrices for different views of the data, measuring similarity, etc. Again, refer to the article for details.
Some of the scripts (e.g., the ones to compress a matrix with random indexing or SVD, or to measure cosine similarities) might also be useful to researchers that are not interested in DM per se.
You will find the scripts in this archive. After you download it and decompress it, you will find information about the scripts in the readme.txt file.
Write to marco baroni AT gmail com
Marco's homepage.
Ale's homepage.