API Documentation

MinHash

class datasketch.MinHash(num_perm=128, seed=1, hashobj=<built-in function openssl_sha1>, hashvalues=None, permutations=None)[source]

MinHash is a probabilistic data structure for computing Jaccard similarity between sets.

Parameters:
  • num_perm (int, optional) – Number of random permutation functions. It will be ignored if hashvalues is not None.
  • seed (int, optional) – The random seed controls the set of random permutation functions generated for this MinHash.
  • hashobj (optional) – The hash function used by this MinHash. It must implement the digest() method, similar to hashlib hash functions such as hashlib.sha1.
  • hashvalues (numpy.array or list, optional) – The hash values are the internal state of the MinHash. They can be specified for faster initialization using the existing state from another MinHash.
  • permutations (optional) – The permutation function parameters. This argument can be specified for faster initialization using the existing state from another MinHash.

Note

To save memory usage, consider using datasketch.LeanMinHash.

Note

Since version 1.1.1, MinHash will only support serialization using pickle. The serialize and deserialize methods have been removed; they are supported by datasketch.LeanMinHash instead. MinHash objects serialized before version 1.1.1 cannot be deserialized properly in newer versions (need to migrate?).

Note

Since version 1.1.3, MinHash uses Numpy’s random number generator instead of Python’s built-in random package. This change makes the hash values consistent across different Python versions. The side-effect is that MinHash objects created before version 1.1.3 won’t work (i.e., jaccard, merge and union) with those created after.
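
Example

A minimal end-to-end sketch of estimating Jaccard similarity (the data here is illustrative):

from datasketch import MinHash

data1 = ['minhash', 'is', 'a', 'probabilistic', 'data', 'structure']
data2 = ['minhash', 'is', 'a', 'probability', 'data', 'structure']

m1, m2 = MinHash(), MinHash()
for d in data1:
    m1.update(d.encode('utf-8'))
for d in data2:
    m2.update(d.encode('utf-8'))

# Estimated Jaccard similarity; the true value is 5/7 ~ 0.71
print(m1.jaccard(m2))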

__init__(num_perm=128, seed=1, hashobj=<built-in function openssl_sha1>, hashvalues=None, permutations=None)[source]

Initialize self. See help(type(self)) for accurate signature.

update(b)[source]

Update this MinHash with a new value.

Parameters:b (bytes) – The value of type bytes.

Example

To update with a new string value:

minhash.update("new value".encode('utf-8'))
jaccard(other)[source]

Estimate the Jaccard similarity (resemblance) between the sets represented by this MinHash and the other.

Parameters:other (datasketch.MinHash) – The other MinHash.
Returns:The Jaccard similarity, which is between 0.0 and 1.0.
Return type:float
count()[source]

Estimate the cardinality count based on the technique described in this paper.

Returns:The estimated cardinality of the set represented by this MinHash.
Return type:int
merge(other)[source]

Merge the other MinHash with this one, making this one the union of both.

Parameters:other (datasketch.MinHash) – The other MinHash.
digest()[source]

Export the hash values, which are the internal state of the MinHash.

Returns:The hash values as a Numpy array.
Return type:numpy.array
is_empty()[source]
Returns:True if the current MinHash is empty, i.e., at the state of just being initialized.
Return type:bool
clear()[source]

Clear the current state of the MinHash. All hash values are reset.

copy()[source]
Returns:datasketch.MinHash – A copy of this MinHash by exporting its state.
__len__()[source]
Returns:int – The number of hash values.
__eq__(other)[source]
Returns:bool – True if both MinHash have the same seed and hash values (i.e., they are equivalent).
__weakref__

list of weak references to the object (if defined)

classmethod union(*mhs)[source]

Create a MinHash which is the union of the MinHash objects passed as arguments.

Parameters:*mhs – The MinHash objects to be united. The argument list length is variable, but must be at least 2.
Returns:A new union MinHash.
Return type:datasketch.MinHash
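
Example

A minimal sketch, assuming m1, m2 and m3 are MinHash objects created with the same num_perm and seed:

u = MinHash.union(m1, m2, m3)

# Equivalent to merging in place:
u2 = m1.copy()
u2.merge(m2)
u2.merge(m3)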

Lean MinHash

class datasketch.LeanMinHash(minhash)[source]

Lean MinHash is MinHash with a smaller memory footprint and faster deserialization, but with its internal state frozen – no update().

Lean MinHash inherits all methods from datasketch.MinHash. It does not store the permutations and the hashobj needed for updating. If a MinHash does not need further updates, convert it into a lean MinHash to save memory.

Example

To create a lean MinHash from an existing MinHash:

lean_minhash = LeanMinHash(minhash)

# You can compute the Jaccard similarity between two lean MinHash
lean_minhash.jaccard(lean_minhash2)

# Or between a lean MinHash and a MinHash
lean_minhash.jaccard(minhash2)

To create a MinHash from a lean MinHash:

minhash = MinHash(seed=lean_minhash.seed,
                  hashvalues=lean_minhash.hashvalues)

# Or if you want to prevent further updates on minhash
# from affecting the state of lean_minhash
minhash = MinHash(seed=lean_minhash.seed,
                  hashvalues=lean_minhash.digest())
Parameters:minhash – The datasketch.MinHash object used to initialize the LeanMinHash.
__init__(minhash)[source]

Initialize self. See help(type(self)) for accurate signature.

update(b)[source]

This method is not available on a LeanMinHash. Calling it raises a TypeError.

copy()[source]
Returns:datasketch.LeanMinHash – A copy of this LeanMinHash by exporting its state.
bytesize(byteorder='@')[source]

Compute the byte size after serialization.

Parameters:byteorder (str, optional) – This is the byte order of the serialized data. Use one of the byte order characters: @, =, <, >, and !. Default is @ – the native order.
Returns:Size in number of bytes after serialization.
Return type:int
serialize(buf, byteorder='@')[source]

Serialize this lean MinHash and store the result in an allocated buffer.

Parameters:
  • buf (buffer) – buf must implement the buffer interface. One such example is the built-in bytearray class.
  • byteorder (str, optional) –

    This is the byte order of the serialized data. Use one of the byte order characters: @, =, <, >, and !. Default is @ – the native order.

This is preferred over using pickle if the serialized lean MinHash needs to be used by another program in a different programming language.

The serialization schema:
  1. The first 8 bytes is the seed integer
  2. The next 4 bytes is the number of hash values
  3. The rest is the serialized hash values, each uses 4 bytes
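
Since the schema is fixed, the buffer can also be decoded without datasketch, e.g., by a program in another language. A minimal Python sketch using the struct module, assuming the default native byte order, a signed 8-byte seed, and unsigned 4-byte hash values (buf holds the serialized bytes):

import struct

# 8-byte seed, then 4-byte count, then count 4-byte hash values
seed, num = struct.unpack_from('qi', buf)
hashvalues = struct.unpack_from('%dI' % num, buf, struct.calcsize('qi'))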

Example

To serialize a single lean MinHash into a bytearray buffer.

buf = bytearray(lean_minhash.bytesize())
lean_minhash.serialize(buf)

To serialize multiple lean MinHash into a single bytearray buffer:

# Assuming lean_minhashs is a list of LeanMinHash with the same size.
# Serialize into slices of a memoryview: slicing the bytearray directly
# would create copies and leave buf unchanged.
size = lean_minhashs[0].bytesize()
buf = bytearray(size*len(lean_minhashs))
view = memoryview(buf)
for i, lean_minhash in enumerate(lean_minhashs):
    lean_minhash.serialize(view[i*size:(i+1)*size])
classmethod deserialize(buf, byteorder='@')[source]

Deserialize a lean MinHash from a buffer.

Parameters:
  • buf (buffer) – buf must implement the buffer interface. One such example is the built-in bytearray class.
  • byteorder (str, optional) –

    This is the byte order of the serialized data. Use one of the byte order characters: @, =, <, >, and !. Default is @ – the native order.

Returns:The deserialized lean MinHash.
Return type:datasketch.LeanMinHash

Example

To deserialize a lean MinHash from a buffer.

lean_minhash = LeanMinHash.deserialize(buf)
__hash__()[source]

Return hash(self).

classmethod union(*lmhs)[source]

Create a LeanMinHash which is the union of the LeanMinHash objects passed as arguments.

Parameters:*lmhs – The LeanMinHash objects to be united. The argument list length is variable, but must be at least 2.
Returns:A new union LeanMinHash.
Return type:datasketch.LeanMinHash

Weighted MinHash

class datasketch.WeightedMinHashGenerator(dim, sample_size=128, seed=1)[source]

The weighted MinHash generator is used for creating new datasketch.WeightedMinHash objects.

This weighted MinHash implementation is based on Sergey Ioffe’s paper, Improved Consistent Sampling, Weighted Minhash and L1 Sketching.

Parameters:
  • dim (int) – The number of dimensions of the input Jaccard vectors.
  • sample_size (int, optional) – The number of samples to use for creating weighted MinHash.
  • seed (int) – The random seed to use for generating permutation functions.
__init__(dim, sample_size=128, seed=1)[source]

Initialize self. See help(type(self)) for accurate signature.

minhash(v)[source]

Create a new weighted MinHash given a weighted Jaccard vector. Each dimension is an integer frequency of the corresponding element in the multi-set represented by the vector.

Parameters:v (numpy.array) – The Jaccard vector.
Returns:The weighted MinHash of the vector.
Return type:datasketch.WeightedMinHash
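
Example

A minimal sketch of estimating weighted Jaccard similarity between two frequency vectors (the vectors here are illustrative):

import numpy as np
from datasketch import WeightedMinHashGenerator

wmg = WeightedMinHashGenerator(10, sample_size=128, seed=1)
v1 = np.array([1, 2, 0, 0, 3, 0, 1, 0, 0, 4])
v2 = np.array([1, 1, 0, 0, 3, 0, 0, 0, 0, 4])
wm1 = wmg.minhash(v1)
wm2 = wmg.minhash(v2)
print(wm1.jaccard(wm2))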
__weakref__

list of weak references to the object (if defined)

class datasketch.WeightedMinHash(seed, hashvalues)[source]

New weighted MinHash objects are generated by datasketch.WeightedMinHashGenerator. You can also initialize a weighted MinHash using the state from an existing one.

Parameters:
  • seed (int) – The random seed used to generate this weighted MinHash.
  • hashvalues – The internal state of this weighted MinHash.
__init__(seed, hashvalues)[source]

Initialize self. See help(type(self)) for accurate signature.

jaccard(other)[source]

Estimate the weighted Jaccard similarity between the multi-sets represented by this weighted MinHash and the other.

Parameters:other (datasketch.WeightedMinHash) – The other weighted MinHash.
Returns:The weighted Jaccard similarity between 0.0 and 1.0.
Return type:float
digest()[source]

Export the hash values, which are the internal state of the weighted MinHash.

Returns:The hash values as a Numpy array.
Return type:numpy.array
copy()[source]
Returns:A copy of this weighted MinHash by exporting its state.
Return type:datasketch.WeightedMinHash
__len__()[source]
Returns:The number of hash values.
Return type:int
__eq__(other)[source]
Returns:True if both weighted MinHash have the same seed and hash values (i.e., they are equivalent).
Return type:bool
__weakref__

list of weak references to the object (if defined)

MinHash LSH

class datasketch.MinHashLSH(threshold=0.9, num_perm=128, weights=(0.5, 0.5), params=None, storage_config=None, prepickle=None)[source]

The MinHash LSH index. It supports query with Jaccard similarity threshold. Reference: Chapter 3, Mining of Massive Datasets.

Parameters:
  • threshold (float) – The Jaccard similarity threshold between 0.0 and 1.0. The initialized MinHash LSH will be optimized for the threshold by minimizing the false positive and false negative.
  • num_perm (int, optional) – The number of permutation functions used by the MinHash to be indexed. For weighted MinHash, this is the sample size (sample_size).
  • weights (tuple, optional) – Used to adjust the relative importance of minimizing false positive and false negative when optimizing for the Jaccard similarity threshold. weights is a tuple in the format of (false_positive_weight, false_negative_weight).
  • params (tuple, optional) – The LSH parameters (i.e., number of bands and size of each band). This is used to bypass the parameter optimization step in the constructor. threshold and weights will be ignored if this is given.
  • storage_config (dict, optional) – Type of storage service to use for storing hashtables and keys. basename is an optional property whose value will be used as the prefix to stored keys. If this is not set, a random string will be generated instead. If you set this, you will be responsible for ensuring there are no key collisions.
  • prepickle (bool, optional) – If True, all keys are pickled to bytes before insertion. If None, a default value is chosen based on the storage_config.

Note

weights must sum to 1.0, and the format is (false positive weight, false negative weight). For example, if minimizing false negatives (or maintaining high recall) is more important, assign more weight to the false negative: weights=(0.4, 0.6). Keep the difference between the two weights small (i.e. < 0.5).
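
Example

A minimal sketch of indexing one set and querying with another (the data here is illustrative):

from datasketch import MinHash, MinHashLSH

set1 = {'minhash', 'is', 'a', 'probabilistic', 'data', 'structure'}
set2 = {'minhash', 'is', 'a', 'probability', 'data', 'structure'}

m1, m2 = MinHash(num_perm=128), MinHash(num_perm=128)
for d in set1:
    m1.update(d.encode('utf-8'))
for d in set2:
    m2.update(d.encode('utf-8'))

lsh = MinHashLSH(threshold=0.5, num_perm=128)
lsh.insert('set1', m1)

# Returns ['set1'] when the estimated Jaccard similarity exceeds 0.5
print(lsh.query(m2))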

__init__(threshold=0.9, num_perm=128, weights=(0.5, 0.5), params=None, storage_config=None, prepickle=None)[source]

Initialize self. See help(type(self)) for accurate signature.

insert(key, minhash, check_duplication=True)[source]

Insert a key to the index, together with a MinHash (or weighted MinHash) of the set referenced by the key.

Parameters:
  • key (str) – The identifier of the set.
  • minhash (datasketch.MinHash) – The MinHash of the set.
  • check_duplication (bool) – To avoid duplicate keys in the storage (default=True). It’s recommended to not change the default, but if you want to avoid the overhead during insert you can set check_duplication = False.
insertion_session(buffer_size=50000)[source]

Create a context manager for fast insertion into this index.

Parameters:buffer_size (int) – The buffer size for the insertion session (default=50000).
Returns:datasketch.lsh.MinHashLSHInsertionSession
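
Example

A minimal sketch, assuming data is an iterable of (key, minhash) pairs:

with lsh.insertion_session() as session:
    for key, minhash in data:
        session.insert(key, minhash)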
query(minhash)[source]

Given the MinHash of the query set, retrieve the keys that reference sets with Jaccard similarities greater than the threshold.

Parameters:minhash (datasketch.MinHash) – The MinHash of the query set.
Returns:list of unique keys.
__contains__(key)[source]
Parameters:key (hashable) – The unique identifier of a set.
Returns:True only if the key exists in the index.
Return type:bool
remove(key)[source]

Remove the key from the index.

Parameters:key (hashable) – The unique identifier of a set.
is_empty()[source]
Returns:True if the index is empty.
Return type:bool
get_counts()[source]

Returns a list of length self.b with elements representing the number of keys stored under each bucket for the given permutation.

get_subset_counts(*keys)[source]

Returns the bucket allocation counts (see get_counts() above) restricted to the list of keys given.

Parameters:keys (hashable) – the keys for which to get the bucket allocation counts
__weakref__

list of weak references to the object (if defined)

Asynchronous MinHash LSH

class datasketch.experimental.aio.lsh.AsyncMinHashLSH(threshold: float = 0.9, num_perm: int = 128, weights: Tuple[float, float] = (0.5, 0.5), params: Tuple[int, int] = None, storage_config: Dict[KT, VT] = None)[source]

Asynchronous MinHashLSH index.

Parameters: Same as datasketch.MinHashLSH.

For example usage see Asynchronous MinHash LSH at scale.

Example of supported storage configuration:

MONGO = {'type': 'aiomongo', 'basename': 'base_name_1', 'mongo': {'host': 'localhost', 'port': 27017}}

__init__(threshold: float = 0.9, num_perm: int = 128, weights: Tuple[float, float] = (0.5, 0.5), params: Tuple[int, int] = None, storage_config: Dict[KT, VT] = None)[source]

Initialize self. See help(type(self)) for accurate signature.

insertion_session(batch_size=10000)[source]

Create an asynchronous context manager for fast insertion into this index.

Parameters:batch_size (int) – the size of chunks to use in the insertion session (default=10000).
Returns:datasketch.experimental.aio.lsh.AsyncMinHashLSHSession

Example

import asyncio
import random
import string
from itertools import chain, islice

from datasketch.experimental.aio.lsh import AsyncMinHashLSH
from datasketch import MinHash

def chunk(it, size):
    it = iter(it)
    return iter(lambda: tuple(islice(it, size)), ())

# Build a set of random 4-character keys plus a few fixed ones,
# and a MinHash for each key.
_chunked_str = chunk((random.choice(string.ascii_lowercase) for _ in range(10000)), 4)
seq = frozenset(chain((''.join(s) for s in _chunked_str), ('aahhb', 'aahh', 'aahhc', 'aac', 'kld', 'bhg', 'kkd', 'yow', 'ppi', 'eer')))
objs = [MinHash(16) for _ in range(len(seq))]
for e, obj in zip(seq, objs):
    for i in e:
        obj.update(i.encode('utf-8'))
data = [(e, m) for e, m in zip(seq, objs)]

_storage_config_mongo = {'type': 'aiomongo', 'mongo': {'host': 'localhost', 'port': 27017}}

async def func():
    async with AsyncMinHashLSH(storage_config=_storage_config_mongo, threshold=0.5, num_perm=16) as lsh:
        async with lsh.insertion_session(batch_size=1000) as session:
            fs = (session.insert(key, minhash, check_duplication=True) for key, minhash in data)
            await asyncio.gather(*fs)

asyncio.run(func())
delete_session(batch_size=10000)[source]

Create an asynchronous context manager for fast removal of keys from this index.

Parameters:batch_size (int) – the size of chunks to use in the delete session (default=10000).
Returns:datasketch.experimental.aio.lsh.AsyncMinHashLSHSession

Example

import asyncio
import random
import string
from itertools import chain, islice

from datasketch.experimental.aio.lsh import AsyncMinHashLSH
from datasketch import MinHash

def chunk(it, size):
    it = iter(it)
    return iter(lambda: tuple(islice(it, size)), ())

_chunked_str = chunk((random.choice(string.ascii_lowercase) for _ in range(10000)), 4)
seq = frozenset(chain((''.join(s) for s in _chunked_str), ('aahhb', 'aahh', 'aahhc', 'aac', 'kld', 'bhg', 'kkd', 'yow', 'ppi', 'eer')))
objs = [MinHash(16) for _ in range(len(seq))]
for e, obj in zip(seq, objs):
    for i in e:
        obj.update(i.encode('utf-8'))
data = [(e, m) for e, m in zip(seq, objs)]

_storage_config_mongo = {'type': 'aiomongo', 'mongo': {'host': 'localhost', 'port': 27017}}

# A few of the inserted keys to remove afterwards (chosen for illustration).
keys_to_remove = ('aahhb', 'aahh', 'aahhc', 'aac', 'kld')

async def func():
    async with AsyncMinHashLSH(storage_config=_storage_config_mongo, threshold=0.5, num_perm=16) as lsh:
        async with lsh.insertion_session(batch_size=1000) as session:
            fs = (session.insert(key, minhash, check_duplication=True) for key, minhash in data)
            await asyncio.gather(*fs)

        async with lsh.delete_session(batch_size=3) as session:
            fs = (session.remove(key) for key in keys_to_remove)
            await asyncio.gather(*fs)

asyncio.run(func())
__weakref__

list of weak references to the object (if defined)

close()[source]

Cleanup client resources and disconnect from AsyncMinHashLSH storage.

get_counts()[source]

see datasketch.MinHashLSH.

get_subset_counts(*keys)[source]

see datasketch.MinHashLSH.

has_key(key)[source]

see datasketch.MinHashLSH.

insert(key, minhash, check_duplication=True)[source]

see datasketch.MinHashLSH.

is_empty()[source]

see datasketch.MinHashLSH.

query(minhash)[source]

see datasketch.MinHashLSH.

remove(key)[source]

see datasketch.MinHashLSH.

MinHash LSH Forest

class datasketch.MinHashLSHForest(num_perm=128, l=8)[source]

The LSH Forest for MinHash. It supports top-k query in Jaccard similarity. Instead of using prefix trees as in the original paper, this implementation uses a sorted array to store the hash values in every hash table.

Parameters:
  • num_perm (int, optional) – The number of permutation functions used by the MinHash to be indexed. For weighted MinHash, this is the sample size (sample_size).
  • l (int, optional) – The number of prefix trees as described in the paper.

Note

The MinHash LSH Forest also works with weighted Jaccard similarity and weighted MinHash without modification.
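
Example

A minimal sketch of a top-k query (the data here is illustrative):

from datasketch import MinHash, MinHashLSHForest

forest = MinHashLSHForest(num_perm=128)
for key, s in [('set1', {'a', 'b', 'c', 'd'}), ('set2', {'a', 'b', 'e', 'f'})]:
    m = MinHash(num_perm=128)
    for d in s:
        m.update(d.encode('utf-8'))
    forest.add(key, m)

# Keys are not searchable until index() is called
forest.index()

q = MinHash(num_perm=128)
for d in {'a', 'b', 'c'}:
    q.update(d.encode('utf-8'))
print(forest.query(q, 1))  # the approximate closest key, e.g. ['set1']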

__init__(num_perm=128, l=8)[source]

Initialize self. See help(type(self)) for accurate signature.

add(key, minhash)[source]

Add a unique key, together with a MinHash (or weighted MinHash) of the set referenced by the key.

Note

The key won’t be searchable until the datasketch.MinHashLSHForest.index() method is called.

Parameters:
  • key (hashable) – The unique identifier of the set.
  • minhash (datasketch.MinHash) – The MinHash of the set.
index()[source]

Index all the keys added so far and make them searchable.

query(minhash, k)[source]

Return the approximate top-k keys that have the highest Jaccard similarities to the query set.

Parameters:
  • minhash (datasketch.MinHash) – The MinHash of the query set.
  • k (int) – The maximum number of keys to return.
Returns:

list of at most k keys.

is_empty()[source]

Check whether there are any searchable keys in the index. Note that keys won’t be searchable until index is called.

Returns:True if there is no searchable key in the index.
Return type:bool
__contains__(key)[source]
Returns:True only if the key has been added to the index.
Return type:bool
__weakref__

list of weak references to the object (if defined)

MinHash LSH Ensemble

class datasketch.MinHashLSHEnsemble(threshold=0.9, num_perm=128, num_part=16, m=8, weights=(0.5, 0.5))[source]

The MinHash LSH Ensemble index. It supports Containment queries. The implementation is based on E. Zhu et al..

Parameters:
  • threshold (float) – The Containment threshold between 0.0 and 1.0. The initialized LSH Ensemble will be optimized for the threshold by minimizing the false positive and false negative.
  • num_perm (int, optional) – The number of permutation functions used by the MinHash to be indexed. For weighted MinHash, this is the sample size (sample_size).
  • num_part (int, optional) – The number of partitions in LSH Ensemble.
  • m (int, optional) – The memory usage factor: an LSH Ensemble uses approximately m times more memory space than a MinHash LSH with the same number of sets indexed. The higher the m, the better the accuracy.
  • weights (tuple, optional) – Used to adjust the relative importance of minimizing false positive and false negative when optimizing for the Containment threshold. Similar to the weights parameter in datasketch.MinHashLSH.

Note

Using more partitions (num_part) leads to better accuracy, at the expense of slower query performance. This is different from the paper and the Go implementation, in which more partitions leads to better accuracy AND faster query performance, due to parallelism.

Note

More information about the parameter m can be found in the Go implementation of LSH Ensemble, in which m is named MaxK.
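
Example

A minimal sketch of a Containment query (the data here is illustrative):

from datasketch import MinHash, MinHashLSHEnsemble

set1 = {'cat', 'dog', 'fish', 'cow'}
set2 = {'cat', 'dog', 'fish', 'cow', 'pig', 'elephant'}

m1, m2 = MinHash(num_perm=128), MinHash(num_perm=128)
for d in set1:
    m1.update(d.encode('utf-8'))
for d in set2:
    m2.update(d.encode('utf-8'))

lshensemble = MinHashLSHEnsemble(threshold=0.8, num_perm=128)
lshensemble.index([('set2', m2, len(set2))])

# The containment of set1 in set2 is 1.0, above the threshold
for key in lshensemble.query(m1, len(set1)):
    print(key)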

__init__(threshold=0.9, num_perm=128, num_part=16, m=8, weights=(0.5, 0.5))[source]

Initialize self. See help(type(self)) for accurate signature.

index(entries)[source]

Index all sets given their keys, MinHashes, and sizes. It can be called only once after the index is created.

Parameters:entries (iterable of tuple) – An iterable of tuples, each must be in the form of (key, minhash, size), where key is the unique identifier of a set, minhash is the MinHash of the set, and size is the size or number of unique items in the set.

Note

size must be positive.

query(minhash, size)[source]

Given the MinHash and size of the query set, retrieve keys that reference sets whose containment with respect to the query set is greater than the threshold.

Parameters:
  • minhash (datasketch.MinHash) – The MinHash of the query set.
  • size (int) – The size (number of unique items) of the query set.
Returns:

iterator of keys.

__contains__(key)[source]
Parameters:key (hashable) – The unique identifier of a set.
Returns:True only if the key exists in the index.
Return type:bool
is_empty()[source]
Returns:True if the index is empty.
Return type:bool
__weakref__

list of weak references to the object (if defined)

HyperLogLog

class datasketch.HyperLogLog(p=8, reg=None, hashobj=<built-in function openssl_sha1>)[source]

The HyperLogLog data sketch for estimating the cardinality of a very large dataset in a single pass. The original HyperLogLog is described here.

This HyperLogLog implementation is based on: https://github.com/svpcom/hyperloglog

Parameters:
  • p (int, optional) – The precision parameter. It is ignored if reg is given.
  • reg (numpy.array, optional) – The internal state. This argument is for initializing the HyperLogLog from an existing one.
  • hashobj (optional) – The hash function used. It must implement the digest() method, similar to hashlib hash functions such as hashlib.sha1.
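
Example

A minimal sketch of estimating the number of distinct values in a stream:

from datasketch import HyperLogLog

hll = HyperLogLog(p=8)
for v in ['a', 'b', 'a', 'c', 'b', 'a']:
    hll.update(v.encode('utf-8'))
print(hll.count())  # approximately 3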
__init__(p=8, reg=None, hashobj=<built-in function openssl_sha1>)[source]

Initialize self. See help(type(self)) for accurate signature.

update(b)[source]

Update the HyperLogLog with a new data value in bytes.

Parameters:b (bytes) – A value of type bytes.

Example

To update with a new string value:

hyperloglog.update("new value".encode('utf-8'))
count()[source]

Estimate the cardinality of the data values seen so far.

Returns:The estimated cardinality.
Return type:int
merge(other)[source]

Merge the other HyperLogLog with this one, making this the union of the two.

Parameters:other (datasketch.HyperLogLog) – The other HyperLogLog to be merged.
digest(hashobj)[source]
Returns:The current internal state.
Return type:numpy.array
copy()[source]

Create a copy of the current HyperLogLog by exporting its state.

Returns:A copy of this HyperLogLog.
Return type:datasketch.HyperLogLog
is_empty()[source]
Returns:True if the current HyperLogLog is empty, i.e., at the state of just being initialized.
Return type:bool
clear()[source]

Reset the current HyperLogLog to empty.

__len__()[source]
Returns:The size of the HyperLogLog, which is the size of reg.
Return type:int
__eq__(other)[source]

Check equivalence between two HyperLogLogs

Parameters:other (datasketch.HyperLogLog) – The other HyperLogLog to compare with.
Returns:True if both have the same internal state.
Return type:bool

HyperLogLog++

class datasketch.HyperLogLogPlusPlus(p=8, reg=None, hashobj=<built-in function openssl_sha1>)[source]

HyperLogLog++ is an enhanced HyperLogLog from Google. Main changes from the original HyperLogLog:

  1. A 64-bit hash function instead of a 32-bit one
  2. A new small-cardinality estimation scheme
  3. Sparse representation (not implemented here)

This class has the same set of methods as datasketch.HyperLogLog.
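
Example

A minimal sketch; usage is the same as datasketch.HyperLogLog:

from datasketch import HyperLogLogPlusPlus

hpp = HyperLogLogPlusPlus(p=8)
for v in ['a', 'b', 'a', 'c']:
    hpp.update(v.encode('utf-8'))
print(hpp.count())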

__weakref__

list of weak references to the object (if defined)

count()[source]

Estimate the cardinality of the data values seen so far.

Returns:The estimated cardinality.
Return type:int