The nilsimsa module is a Java implementation of Nilsimsa, an existing
locality-sensitive hashing (LSH) algorithm designed for spam filtering. LSH is a method
of performing probabilistic dimension reduction of high-dimensional data. The
basic idea is to hash the input items so that similar items are mapped to the
same buckets with high probability (the number of buckets being much smaller
than the universe of possible input items). This is different from conventional
hash functions, such as those used in cryptography, because in this case the
goal is to maximize the probability of a collision of similar items rather than
avoid collisions.
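To make the idea concrete, here is a minimal sketch of a different and simpler locality-sensitive hash, a SimHash-style fingerprint over byte trigrams. It is not the Nilsimsa algorithm and is not part of this module; the class and method names are hypothetical and chosen only for illustration. Each trigram is hashed with 64-bit FNV-1a and votes on every bit of the fingerprint, so inputs that share most of their trigrams tend to agree on most bits.

```java
import java.nio.charset.StandardCharsets;

// Illustrative only: a SimHash-style fingerprint, NOT the Nilsimsa algorithm.
public class SimHashSketch {

    // 64-bit FNV-1a hash of a byte range; used to hash each trigram.
    public static long fnv1a(byte[] data, int offset, int length) {
        long hash = 0xcbf29ce484222325L;
        for (int i = offset; i < offset + length; i++) {
            hash ^= (data[i] & 0xff);
            hash *= 0x100000001b3L;
        }
        return hash;
    }

    // Builds a 64-bit fingerprint in which every trigram votes on every bit.
    public static long fingerprint(String text) {
        byte[] bytes = text.getBytes(StandardCharsets.UTF_8);
        int[] votes = new int[64];
        for (int i = 0; i + 3 <= bytes.length; i++) {
            long h = fnv1a(bytes, i, 3);
            for (int bit = 0; bit < 64; bit++) {
                votes[bit] += ((h >>> bit) & 1L) == 1L ? 1 : -1;
            }
        }
        long result = 0L;
        for (int bit = 0; bit < 64; bit++) {
            if (votes[bit] > 0) {
                result |= 1L << bit;
            }
        }
        return result;
    }

    // Number of bit positions in which two fingerprints differ.
    public static int hammingDistance(long a, long b) {
        return Long.bitCount(a ^ b);
    }

    public static void main(String[] args) {
        long first = fingerprint("potatoes are the best");
        long second = fingerprint("tomatoes are really the best");
        long third = fingerprint("bananas taste pretty good");
        // Smaller distances indicate more similar inputs.
        System.out.println(hammingDistance(first, second));
        System.out.println(hammingDistance(first, third));
    }
}
```

A cryptographic hash would aim for the opposite behavior: changing one character should flip roughly half of the output bits, so the distance between digests would say nothing about how similar the inputs are.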
As the original description of the algorithm puts it:
A nilsimsa code is something like a hash, but unlike hashes, a small change in the message results in a small change in the nilsimsa code. Such a function is called a locality-sensitive hash.
To use the module from Maven, add the following dependency:
<dependency>
  <groupId>com.github.rholder</groupId>
  <artifactId>nilsimsa</artifactId>
  <version>1.0.0</version>
</dependency>
Or, from Gradle:
compile "com.github.rholder:nilsimsa:1.0.0"
A minimal sample of some of the functionality would look like:
String first = new Nilsimsa().update("potatoes are the best".getBytes()).toHexDigest();
String second = new Nilsimsa().update("tomatoes are really the best".getBytes()).toHexDigest();
String third = new Nilsimsa().update("bananas taste pretty good".getBytes()).toHexDigest();
System.out.println(Nilsimsa.compare(first, third)); // 3
System.out.println(Nilsimsa.compare(second, third)); // -6
System.out.println(Nilsimsa.compare(first, second)); // 53 -- closest match
System.out.println(Nilsimsa.compare(first, first)); // 128 -- exact match
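The scores printed above can be read as a bit-level similarity between two digests. The sketch below is a standalone illustration, not the module's own code: it assumes that a digest is 256 bits (64 hex characters) and that the score is 128 minus the number of differing bits, which is consistent with an exact match scoring 128 and with negative scores for unrelated inputs. The class and method names are hypothetical.

```java
// Illustrative only: recomputes a Nilsimsa-style score from two hex digests.
// ASSUMPTION: score = 128 - (number of differing bits) over a 256-bit digest.
public class NilsimsaScoreSketch {

    // Counts the differing bits between two equal-length hex strings.
    public static int bitDifference(String hexA, String hexB) {
        if (hexA.length() != hexB.length()) {
            throw new IllegalArgumentException("digests must have the same length");
        }
        int diff = 0;
        for (int i = 0; i < hexA.length(); i++) {
            int a = Character.digit(hexA.charAt(i), 16);
            int b = Character.digit(hexB.charAt(i), 16);
            if (a < 0 || b < 0) {
                throw new IllegalArgumentException("not a hex digit at index " + i);
            }
            diff += Integer.bitCount(a ^ b);
        }
        return diff;
    }

    // Under the stated assumption, maps 0..256 differing bits to 128..-128.
    public static int score(String hexA, String hexB) {
        return 128 - bitDifference(hexA, hexB);
    }

    public static void main(String[] args) {
        String zeros = new String(new char[64]).replace('\0', '0');
        String ones = new String(new char[64]).replace('\0', 'f');
        System.out.println(score(zeros, zeros)); // identical digests
        System.out.println(score(zeros, ones));  // every bit differs
    }
}
```

Under this reading, a score near 0 means that about half of the bits differ, which is what two unrelated messages would be expected to produce, while scores well above 0 suggest related content.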
The nilsimsa module uses a Gradle-based build system. In the instructions
below, ./gradlew is invoked from the root of the source tree and serves as
a cross-platform, self-contained bootstrap mechanism for the build. The only
prerequisites are Git and JDK 1.6+.
git clone https://github.com/rholder/nilsimsa.git
cd nilsimsa
./gradlew build
./gradlew install
This project is a Java port of py-nilsimsa, which is MIT/X11 licensed.
The nilsimsa module is released under version 2.0 of the Apache License.