COMPOSES Semantic Vectors

From this page, you can download the best-performing count and predict semantic vectors that we evaluated in the following paper:

M. Baroni, G. Dinu and G. Kruszewski. 2014. Don't count, predict! A systematic comparison of context-counting vs. context-predicting semantic vectors. In Proceedings of ACL 2014 (52nd Annual Meeting of the Association for Computational Linguistics), East Stroudsburg PA: ACL, 238-247.

Here is an archive containing results obtained with further models and parameter settings on the same benchmarks, confirming the very high quality of our best predict vectors.

Both the count and predict vectors made available below provide representations for about 300K words (lower-cased, non-lemmatized).

The models are released under a Creative Commons Attribution licence; please cite our paper if you use them in published work.

Best predict vectors

This gzipped archive contains the predict vectors that performed best across the tasks (and also outperformed all count vectors), as reported in Table 4 of the paper. Parameters were set as follows: 5-word context window, 10 negative samples, subsampling, 400 dimensions. See the article for more details.

The vectors are the rows of a dense matrix stored in tab-delimited format (the first element of each line corresponds to a word, followed by the values in the vector representing it).
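
For illustration, the dense format can be loaded into a dictionary of numpy arrays along the following lines (a minimal sketch; the filename is a hypothetical placeholder, so substitute the name of the file you extracted from the archive):

import numpy as np

def load_dense_vectors(path):
    # Each line: word TAB value TAB value ... (one dense row per word)
    vectors = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            parts = line.rstrip("\n").split("\t")
            vectors[parts[0]] = np.array(parts[1:], dtype=float)
    return vectors

vecs = load_dense_vectors("predict_vectors.txt")  # hypothetical filename
print(vecs["deckhand"].shape)  # (400,) for these predict vectors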

Best full-space count vectors

The following (split) gzipped archive contains the count vectors that performed best, as reported in Table 3 of the paper. Parameters were set as follows: 2-word context window, PMI weighting, no compression, 300K dimensions. See the article for more details.

The data are stored in sparse matrix format: each target word (on a separate line) is followed by its non-zero dimensions in tab-delimited context-value format, with one context-value pair per line. For example, the following fragment:

deckhand
guinean 0.284611
trawler 0.250539
cowell 0.247986
housekeeping 0.246947
bud 0.213620
steward 0.206952
drowned 0.205488
chef 0.195067
aboard 0.230511

shows some of the non-zero dimensions for the target word deckhand.
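
Assuming that target-word lines contain no tab character while context-value lines always do, the sparse format can be parsed with a sketch like the following (a minimal illustration, not the DISSECT loader):

def load_sparse_vectors(path):
    # A line without a tab starts a new target word; a tab-delimited
    # line holds one context TAB value pair for the current word.
    vectors = {}
    current = None
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.rstrip("\n")
            if not line:
                continue
            if "\t" not in line:
                current = line
                vectors[current] = {}
            else:
                context, value = line.split("\t")
                vectors[current][context] = float(value)
    return vectors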

Since the archive is large, we split it into multiple files to make the download easier. Please download the following parts, and then concatenate them as shown below:

cat EN-wform.w.2.ppmi.txt.gza* > EN-wform.w.2.ppmi.txt.gz
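
Once the parts have been concatenated, the archive can be decompressed with gunzip, yielding a plain text file in the sparse format described above:

gunzip EN-wform.w.2.ppmi.txt.gz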

Best reduced count vectors

The following (split) gzipped archive contains the reduced count vectors that performed best, as reported in Table 3 of the paper (third row). Parameters were set as follows: 2-word context window, PMI weighting, SVD reduction to 500 dimensions. See the article for more details.

The vectors are the rows of a dense matrix stored in tab-delimited format (the first element of each line corresponds to a word, followed by the values in the vector representing it).

Since the archive is large, we split it into multiple files to make the download easier. Please download the following parts, and then concatenate them as shown below:

cat EN-wform.w.2.ppmi.svd.500.txt.gza* > EN-wform.w.2.ppmi.svd.500.txt.gz
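
After decompressing the concatenated archive with gunzip as above, the reduced vectors can be read with the dense-format sketch given earlier and used, for instance, to compute cosine similarities (assuming both query words are among the roughly 300K targets):

import numpy as np

def cosine(u, v):
    # Cosine similarity: dot product normalized by the vector norms
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

# load_dense_vectors is the sketch shown for the predict vectors above
vecs = load_dense_vectors("EN-wform.w.2.ppmi.svd.500.txt")
print(cosine(vecs["deckhand"], vecs["trawler"]))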

Credits

The predict vectors were constructed with word2vec, and the count vectors with DISSECT. Co-occurrence statistics were extracted from the BNC and WackyPedia/ukWaC.
