

Some text similarity measures implemented in our framework operate on lexical-semantic resources, e.g. measures which determine pairwise word similarity on WordNet. In the following, we describe which resources are required by which measures, and how they can be obtained and installed.

Prerequisite: DKPRO_HOME environment variable

Before continuing, please make sure that you have set up an environment variable DKPRO_HOME, either system-wide or per-project in the Eclipse run configuration. The variable should point to a (possibly still empty) directory which is intended to store any resources used by any DKPro component.

Explicit Semantic Analysis: Vector Indexes

Explicit Semantic Analysis (ESA) (Gabrilovich and Markovitch, 2007) is a method which computes similarity based on word occurrences in a given document collection. While it was originally proposed on Wikipedia, other document collections with similar properties have also been found to work well. The vector indexes can be downloaded for Wiktionary or WordNet. The Wikipedia index is much larger (about 900 MB zipped); you can get it here.

After the download has finished, unzip the whole folder into a subdirectory subdir of $DKPRO_HOME/ESA/VectorIndexes/, where subdir is an arbitrary name for each resource.

Please note that the indexes have been created on lemmatized texts. So please also lemmatize your input texts first, before passing them to the similarity measure.

Understanding the difference between VectorIndexReader and LuceneVectorReader

The ESA indexes you have downloaded in the previous step are VectorIndexes that store for each word the full vector over the document space, i.e. when you want to compute the similarity between two words, the similarity measure just retrieves the vectors stored for the words and computes the similarity. You need a VectorIndexReader to load them, and you cannot change the vectors anymore.

A LuceneVectorReader is used to create the above vectors from a Lucene index built from a document collection (more about this in the next section). When a similarity measure wants to get a vector for a word, it queries Lucene for the term frequencies of that word in each document. This is more flexible, as we can do all sorts of weighting and normalizing, but it is also much, much slower.

If you want to create your own index to be used with ESA, you can do so with a few simple steps: You can find an out-of-the-box solution for the ESA index creation in the Maven module de.vsm-asl in the class EsaIndexer. In the method createLuceneWikipediaIndex(), you need to specify where your document collection can be found. As a prerequisite, you need to have an SQL dump of Wikipedia ready, or any other (domain-specific) document collection which is formatted accordingly. Make sure you set the right Language property throughout this method.
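The DKPRO_HOME prerequisite and the ESA/VectorIndexes install location described in this guide can be checked before any measure is run. The following is a minimal sketch using only the Java standard library, not code from the framework itself: the class name EsaIndexLocator, the method resolveIndexDir, and the subdirectory name wordnet_en are hypothetical and chosen purely for illustration.

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical helper (not part of DKPro): locates an unzipped ESA vector index.
public class EsaIndexLocator {

    // Builds <dkproHome>/ESA/VectorIndexes/<subdir>; fails fast if DKPRO_HOME is unset.
    static Path resolveIndexDir(String dkproHome, String subdir) {
        if (dkproHome == null || dkproHome.trim().isEmpty()) {
            throw new IllegalStateException("DKPRO_HOME is not set");
        }
        return Paths.get(dkproHome, "ESA", "VectorIndexes", subdir);
    }

    public static void main(String[] args) {
        // Read the variable as set system-wide or in the Eclipse run configuration.
        String dkproHome = System.getenv("DKPRO_HOME");
        try {
            // "wordnet_en" is an assumed, arbitrary subdirectory name.
            Path indexDir = resolveIndexDir(dkproHome, "wordnet_en");
            if (Files.isDirectory(indexDir)) {
                System.out.println("Index directory found: " + indexDir);
            } else {
                System.out.println("Index directory missing: " + indexDir);
            }
        } catch (IllegalStateException e) {
            System.out.println(e.getMessage());
        }
    }
}
```

Running the check once at startup gives a clearer message than a failure deep inside a similarity measure when the variable or the unzipped folder is missing.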

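To make the idea of comparing two stored word vectors concrete, here is a small self-contained sketch. It assumes that each word vector is a sparse map from document identifiers to weights and that the vectors are compared by cosine similarity; both are illustrative assumptions, and the class EsaCosineSketch is hypothetical rather than the framework's own comparator.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical illustration (not DKPro code): cosine similarity of sparse word vectors.
public class EsaCosineSketch {

    // Euclidean length of a sparse vector.
    static double norm(Map<String, Double> v) {
        double sum = 0.0;
        for (double w : v.values()) {
            sum += w * w;
        }
        return Math.sqrt(sum);
    }

    // Cosine of the angle between two sparse vectors; 0.0 if either vector is all zeros.
    static double cosine(Map<String, Double> a, Map<String, Double> b) {
        double dot = 0.0;
        for (Map.Entry<String, Double> e : a.entrySet()) {
            Double other = b.get(e.getKey());
            if (other != null) {
                dot += e.getValue() * other; // only shared documents contribute
            }
        }
        double na = norm(a);
        double nb = norm(b);
        if (na == 0.0 || nb == 0.0) {
            return 0.0;
        }
        return dot / (na * nb);
    }

    public static void main(String[] args) {
        // Made-up weights over made-up documents, for illustration only.
        Map<String, Double> car = new HashMap<>();
        car.put("doc_vehicles", 3.0);
        car.put("doc_engines", 4.0);

        Map<String, Double> truck = new HashMap<>();
        truck.put("doc_vehicles", 3.0);

        System.out.println("similarity = " + cosine(car, truck));
    }
}
```

With precomputed vectors only this lookup-and-compare step remains at query time, which is why loading stored vectors is fast but leaves no room to change the weighting afterwards.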