Tools and Resources
- EGG, a toolkit to implement language emergence games;
- CommAI-env is an environment to train and benchmark communication-based AIs;
- SCAN, simplified CommAI navigation tasks to evaluate compositional learning and zero-shot generalization;
- We constructed very large Web-derived, POS-tagged and lemmatized corpora of English (with dependency parsing), German, Italian and French: to download them, visit the WaCky project;
- LAMBADA dataset for testing broad context understanding of language models;
- The DISSECT toolkit to construct and compose distributional semantic representations;
- High-performance distributional semantic vectors;
- The SICK data set for large-scale evaluation of compositional semantic models (for other tools and data sets produced by the COMPOSES project, please visit the project page);
- MEN, human-elicited semantic similarity data for the evaluation of computational models;
- BLIND (BLind Italian Norming Data), semantic norms collected from congenitally blind subjects and closely matched sighted subjects;
- BLESS (Baroni-Lenci Evaluation of Semantic Similarity), a large data set to evaluate computational models of semantic similarity;
- DM: trained model (weighted word-link-word tuples) and utility scripts;
- Semantic norms for German and Italian: available in this archive and documented in this article;
- zipfR: a toolkit for lexical statistics in R;
- BootCaT: a toolkit for bootstrapping specialized language corpora and terms from the web;
- Morph-it!, a free Italian morphological lexicon;
- Access to the La Repubblica corpus, a large corpus of Italian newspaper text;
- Web as Corpus post-processing tools for boilerplate stripping and near-duplicate identification;
- The TreeTagger page contains a version of this popular tagger/lemmatizer trained on our Italian resources;
- Trained Italian models for the taggers in the ACOPOST toolkit;
- Knorpora: the Knoppix Linux live CD remastered with tools and resources for students of corpus-based computational linguistics;
- English token and document frequency lists from the LOB and Brown corpora;
- regexp_tokenizer: a simple tokenizer based on regular expressions.
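To illustrate the kind of approach a tool like regexp_tokenizer takes, here is a minimal sketch of regular-expression-based tokenization; the pattern and function below are hypothetical examples, not the tool's actual rules:

```python
import re

# Illustrative pattern: a run of word characters, optionally carrying an
# apostrophe-clitic (as in English "don't"), or any single non-space symbol.
TOKEN_RE = re.compile(r"\w+(?:'\w+)?|[^\w\s]")

def tokenize(text):
    """Split text into word and punctuation tokens via regex matching."""
    return TOKEN_RE.findall(text)

print(tokenize("Don't panic, it's fine!"))
```

Real tokenizers typically layer many such rules (abbreviations, numbers, multi-word units), but the core idea is the same: an ordered regular-expression match over the raw text.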