API Documentation¶
MinHash¶
- class datasketch.MinHash(num_perm=128, seed=1, hashfunc=<function sha1_hash32>, hashobj=None, hashvalues=None, permutations=None)[source]¶
MinHash is a probabilistic data structure for computing Jaccard similarity between sets.
- Parameters
num_perm (int, optional) – Number of random permutation functions. It will be ignored if hashvalues is not None.
seed (int, optional) – The random seed controls the set of random permutation functions generated for this MinHash.
hashfunc (optional) – The hash function used by this MinHash. It takes the input passed to the update method and returns an integer that can be encoded with 32 bits. The default hash function is based on SHA1 from hashlib.
hashobj (deprecated) – This argument is deprecated since version 1.4.0. It is a no-op and has been replaced by hashfunc.
hashvalues (numpy.array or list, optional) – The hash values are the internal state of the MinHash. They can be specified for faster initialization using the existing state from another MinHash.
permutations (optional) – The permutation function parameters. This argument can be specified for faster initialization using the existing state from another MinHash.
Note
To save memory usage, consider using datasketch.LeanMinHash.
Note
Since version 1.1.1, MinHash will only support serialization using pickle. The serialize and deserialize methods are removed, and are supported in datasketch.LeanMinHash instead. MinHash serialized before version 1.1.1 cannot be deserialized properly in newer versions (need to migrate?).
Note
Since version 1.1.3, MinHash uses Numpy’s random number generator instead of Python’s built-in random package. This change makes the hash values consistent across different Python versions. The side-effect is that MinHash created before version 1.1.3 won’t work (i.e., jaccard, merge and union) with those created after.
- __init__(num_perm=128, seed=1, hashfunc=<function sha1_hash32>, hashobj=None, hashvalues=None, permutations=None)[source]¶
Initialize self. See help(type(self)) for accurate signature.
- update(b)[source]¶
Update this MinHash with a new value. The value will be hashed using the hash function specified by the hashfunc argument in the constructor.
- Parameters
b – The value to be hashed using the hash function specified.
Example
To update with a new string value (using the default SHA1 hash function, which requires bytes as input):
minhash = MinHash()
minhash.update("new value".encode('utf-8'))
We can also use a different hash function, for example, pyfarmhash:
import farmhash

def _hash_32(b):
    return farmhash.hash32(b)

minhash = MinHash(hashfunc=_hash_32)
minhash.update("new value")
- jaccard(other)[source]¶
Estimate the Jaccard similarity (resemblance) between the sets represented by this MinHash and the other.
- Parameters
other (datasketch.MinHash) – The other MinHash.
- Returns
The Jaccard similarity, which is between 0.0 and 1.0.
- Return type
float
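For illustration, a minimal end-to-end sketch of estimating Jaccard similarity between two sets; the token lists below are made-up sample data, not part of the library:
from datasketch import MinHash

data1 = ["minhash", "is", "a", "probabilistic", "data", "structure"]
data2 = ["minhash", "is", "a", "probability", "data", "structure"]

m1, m2 = MinHash(), MinHash()
for d in data1:
    m1.update(d.encode('utf-8'))
for d in data2:
    m2.update(d.encode('utf-8'))

# Estimated Jaccard similarity, a float between 0.0 and 1.0
print(m1.jaccard(m2))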
- count()[source]¶
Estimate the cardinality count based on the technique described in this paper.
- Returns
The estimated cardinality of the set represented by this MinHash.
- Return type
int
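As a quick sketch (the sample values are arbitrary), the estimate approaches the number of distinct values added:
from datasketch import MinHash

m = MinHash(num_perm=256)
for d in ["a", "b", "c", "a"]:
    m.update(d.encode('utf-8'))

# Roughly 3 distinct values, subject to approximation error
print(m.count())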
- merge(other)[source]¶
Merge the other MinHash with this one, making this one the union of both.
- Parameters
other (datasketch.MinHash) – The other MinHash.
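A short sketch of in-place union via merge; the sample values are illustrative:
from datasketch import MinHash

m1, m2 = MinHash(), MinHash()
for d in ["a", "b"]:
    m1.update(d.encode('utf-8'))
for d in ["b", "c"]:
    m2.update(d.encode('utf-8'))

m1.merge(m2)       # m1 now represents the union {a, b, c}
print(m1.count())  # estimated cardinality of the union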
- digest()[source]¶
Export the hash values, which is the internal state of the MinHash.
- Returns
The hash values as a Numpy array.
- Return type
numpy.array
- is_empty()[source]¶
- Returns
True if the current MinHash is empty, i.e., in the state of just being initialized.
- Return type
bool
- __eq__(other)[source]¶
- Returns
If their seeds and hash values are both equal then the two are equivalent.
- Return type
bool
Lean MinHash¶
- class datasketch.LeanMinHash(minhash=None, seed=None, hashvalues=None)[source]¶
Lean MinHash is MinHash with a smaller memory footprint and faster deserialization, but with its internal state frozen – no update().
Lean MinHash inherits all methods from datasketch.MinHash. It does not store the permutations and the hashfunc needed for updating. If a MinHash does not need further updates, convert it into a lean MinHash to save memory.
Example
To create a lean MinHash from an existing MinHash:
lean_minhash = LeanMinHash(minhash)

# You can compute the Jaccard similarity between two lean MinHash
lean_minhash.jaccard(lean_minhash2)

# Or between a lean MinHash and a MinHash
lean_minhash.jaccard(minhash2)
To create a lean MinHash from the hash values and seed of an existing MinHash:
lean_minhash = LeanMinHash(seed=minhash.seed, hashvalues=minhash.hashvalues)
To create a MinHash from a lean MinHash:
minhash = MinHash(seed=lean_minhash.seed, hashvalues=lean_minhash.hashvalues)

# Or if you want to prevent further updates on minhash
# from affecting the state of lean_minhash
minhash = MinHash(seed=lean_minhash.seed, hashvalues=lean_minhash.digest())
Note
Lean MinHash can also be used in datasketch.MinHashLSH, datasketch.MinHashLSHForest, and datasketch.MinHashLSHEnsemble.
- Parameters
minhash (optional) – The datasketch.MinHash object used to initialize the LeanMinHash. If this is not set, then seed and hashvalues must be set.
seed (optional) – The random seed that controls the set of random permutation functions generated for this LeanMinHash. This parameter must be used together with hashvalues.
hashvalues (optional) – The hash values used to initialize the state of the LeanMinHash. This parameter must be used together with seed.
- __init__(minhash=None, seed=None, hashvalues=None)[source]¶
Initialize self. See help(type(self)) for accurate signature.
- bytesize(byteorder='@')[source]¶
Compute the byte size after serialization.
- Parameters
byteorder (str, optional) – This is the byte order of the serialized data. Use one of the byte order characters: @, =, <, >, and !. Default is @ – the native order.
- Returns
Size in number of bytes after serialization.
- Return type
int
- serialize(buf, byteorder='@')[source]¶
Serialize this lean MinHash and store the result in an allocated buffer.
- Parameters
buf (buffer) – buf must implement the buffer interface. One such example is the built-in bytearray class.
byteorder (str, optional) – This is the byte order of the serialized data. Use one of the byte order characters: @, =, <, >, and !. Default is @ – the native order.
This is preferred over using pickle if the serialized lean MinHash needs to be used by another program in a different programming language.
- The serialization schema:
The first 8 bytes are the seed integer
The next 4 bytes are the number of hash values
The rest are the serialized hash values, each using 4 bytes
Example
To serialize a single lean MinHash into a bytearray buffer:
buf = bytearray(lean_minhash.bytesize())
lean_minhash.serialize(buf)
To serialize multiple lean MinHash into a bytearray buffer:
# assuming lean_minhashs is a list of LeanMinHash with the same size
size = lean_minhashs[0].bytesize()
buf = bytearray(size*len(lean_minhashs))
for i, lean_minhash in enumerate(lean_minhashs):
    # serialize into a writable view of buf; slicing the bytearray directly
    # would serialize into a temporary copy instead of buf
    lean_minhash.serialize(memoryview(buf)[i*size:])
- classmethod deserialize(buf, byteorder='@')[source]¶
Deserialize a lean MinHash from a buffer.
- Parameters
buf (buffer) – buf must implement the buffer interface. One such example is the built-in bytearray class.
byteorder (str, optional) – This is the byte order of the serialized data. Use one of the byte order characters: @, =, <, >, and !. Default is @ – the native order.
- Returns
The deserialized lean MinHash.
- Return type
datasketch.LeanMinHash
Example
To deserialize a lean MinHash from a buffer:
lean_minhash = LeanMinHash.deserialize(buf)
Weighted MinHash¶
- class datasketch.WeightedMinHashGenerator(dim, sample_size=128, seed=1)[source]¶
The weighted MinHash generator is used for creating new datasketch.WeightedMinHash objects.
This weighted MinHash implementation is based on Sergey Ioffe’s paper, Improved Consistent Sampling, Weighted Minhash and L1 Sketching.
- Parameters
dim (int) – The number of dimensions of the input Jaccard vectors.
sample_size (int, optional) – The number of samples to use for creating weighted MinHash.
seed (int) – The random seed to use for generating permutation functions.
- __init__(dim, sample_size=128, seed=1)[source]¶
Initialize self. See help(type(self)) for accurate signature.
- minhash(v)[source]¶
Create a new weighted MinHash given a weighted Jaccard vector. Each dimension is an integer frequency of the corresponding element in the multi-set represented by the vector.
- Parameters
v (numpy.array) – The Jaccard vector.
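For illustration, a minimal sketch of generating two weighted MinHash objects and estimating their weighted Jaccard similarity; the vectors and generator settings are made-up example values:
import numpy as np
from datasketch import WeightedMinHashGenerator

# Each dimension is the frequency of one element in the multi-set
v1 = np.array([1, 3, 0, 2, 0], dtype=np.float64)
v2 = np.array([1, 2, 1, 2, 0], dtype=np.float64)

wmg = WeightedMinHashGenerator(dim=5, sample_size=256, seed=12)
wm1 = wmg.minhash(v1)
wm2 = wmg.minhash(v2)

# Estimated weighted Jaccard similarity
print(wm1.jaccard(wm2))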
- class datasketch.WeightedMinHash(seed, hashvalues)[source]¶
New weighted MinHash is generated by datasketch.WeightedMinHashGenerator. You can also initialize a weighted MinHash by using the state from an existing one.
- Parameters
seed (int) – The random seed used to generate this weighted MinHash.
hashvalues – The internal state of this weighted MinHash.
- jaccard(other)[source]¶
Estimate the weighted Jaccard similarity between the multi-sets represented by this weighted MinHash and the other.
- Parameters
other (datasketch.WeightedMinHash) – The other weighted MinHash.
- Returns
The weighted Jaccard similarity between 0.0 and 1.0.
- Return type
float
- digest()[source]¶
Export the hash values, which is the internal state of the weighted MinHash.
- Returns
The hash values as a Numpy array.
- Return type
numpy.array
- __eq__(other)[source]¶
- Returns
If their seeds and hash values are both equal then the two are equivalent.
- Return type
bool
MinHash LSH¶
- class datasketch.MinHashLSH(threshold=0.9, num_perm=128, weights=(0.5, 0.5), params=None, storage_config=None, prepickle=None)[source]¶
The MinHash LSH index. It supports query with Jaccard similarity threshold. Reference: Chapter 3, Mining of Massive Datasets.
- Parameters
threshold (float) – The Jaccard similarity threshold between 0.0 and 1.0. The initialized MinHash LSH will be optimized for the threshold by minimizing the false positive and false negative.
num_perm (int, optional) – The number of permutation functions used by the MinHash to be indexed. For weighted MinHash, this is the sample size (sample_size).
weights (tuple, optional) – Used to adjust the relative importance of minimizing false positive and false negative when optimizing for the Jaccard similarity threshold. weights is a tuple in the format of (false_positive_weight, false_negative_weight).
params (tuple, optional) – The LSH parameters (i.e., number of bands and size of each band). This is used to bypass the parameter optimization step in the constructor. threshold and weights will be ignored if this is given.
storage_config (dict, optional) – Type of storage service to use for storing hashtables and keys. basename is an optional property whose value will be used as the prefix to stored keys. If this is not set, a random string will be generated instead. If you set this, you will be responsible for ensuring there are no key collisions.
prepickle (bool, optional) – If True, all keys are pickled to bytes before insertion. If None, a default value is chosen based on the storage_config.
Note
weights must sum to 1.0, and the format is (false positive weight, false negative weight). For example, if minimizing false negative (or maintaining high recall) is more important, assign more weight toward false negative: weights=(0.4, 0.6). Try to live with a small difference between weights (i.e. < 0.5).
- __init__(threshold=0.9, num_perm=128, weights=(0.5, 0.5), params=None, storage_config=None, prepickle=None)[source]¶
Initialize self. See help(type(self)) for accurate signature.
- insert(key, minhash, check_duplication=True)[source]¶
Insert a key to the index, together with a MinHash (or weighted MinHash) of the set referenced by the key.
- Parameters
key (str) – The identifier of the set.
minhash (datasketch.MinHash) – The MinHash of the set.
check_duplication (bool) – To avoid duplicate keys in the storage (default=True). It’s recommended to not change the default, but if you want to avoid the overhead during insert you can set check_duplication = False.
- insertion_session(buffer_size=50000)[source]¶
Create a context manager for fast insertion into this index.
- Parameters
buffer_size (int) – The buffer size for insert_session mode (default=50000).
- Returns
datasketch.lsh.MinHashLSHInsertionSession
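A short sketch of buffered insertion using the returned context manager; the keys, token lists, and threshold below are illustrative:
from datasketch import MinHash, MinHashLSH

# Build some (key, MinHash) pairs
data = []
for key, tokens in [("a", ["x", "y"]), ("b", ["y", "z"])]:
    m = MinHash(num_perm=128)
    for t in tokens:
        m.update(t.encode('utf-8'))
    data.append((key, m))

lsh = MinHashLSH(threshold=0.5, num_perm=128)
with lsh.insertion_session() as session:
    for key, minhash in data:
        session.insert(key, minhash)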
- query(minhash)[source]¶
Given the MinHash of the query set, retrieve the keys that reference sets with Jaccard similarities greater than the threshold.
- Parameters
minhash (datasketch.MinHash) – The MinHash of the query set.
- Returns
list of unique keys.
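For illustration, an end-to-end sketch of indexing one set and querying with another; the sets and the 0.5 threshold are made-up example values:
from datasketch import MinHash, MinHashLSH

set1 = ["minhash", "is", "a", "probabilistic", "data", "structure"]
set2 = ["minhash", "is", "a", "probability", "data", "structure"]

m1, m2 = MinHash(num_perm=128), MinHash(num_perm=128)
for d in set1:
    m1.update(d.encode('utf-8'))
for d in set2:
    m2.update(d.encode('utf-8'))

lsh = MinHashLSH(threshold=0.5, num_perm=128)
lsh.insert("set1", m1)

# Returns the keys of indexed sets estimated to exceed the threshold
print(lsh.query(m2))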
- __contains__(key)[source]¶
- Parameters
key (hashable) – The unique identifier of a set.
- Returns
True only if the key exists in the index.
- Return type
bool
- remove(key)[source]¶
Remove the key from the index.
- Parameters
key (hashable) – The unique identifier of a set.
- get_counts()[source]¶
Returns a list of length self.b with elements representing the number of keys stored under each bucket for the given permutation.
- get_subset_counts(*keys)[source]¶
Returns the bucket allocation counts (see get_counts() above) restricted to the list of keys given.
- Parameters
keys (hashable) – the keys for which to get the bucket allocation counts
Asynchronous MinHash LSH¶
- class datasketch.experimental.aio.lsh.AsyncMinHashLSH(threshold: float = 0.9, num_perm: int = 128, weights: Tuple[float, float] = (0.5, 0.5), params: Tuple[int, int] = None, storage_config: Dict = None)[source]¶
Asynchronous MinHashLSH index.
- Parameters
threshold (float) – see datasketch.MinHashLSH.
num_perm (int) – see datasketch.MinHashLSH.
weights (tuple(float, float)) – see datasketch.MinHashLSH.
params (tuple) – see datasketch.MinHashLSH.
storage_config (dict) – The type of storage service to use for storing hashtables and keys. A new storage type – aiomongo – is implemented. If storage_config is None, aiomongo storage will be used.
For example usage see Asynchronous MinHash LSH at scale.
Example of supported storage configuration:
MONGO = {'type': 'aiomongo', 'basename': 'base_name_1', 'mongo': {'host': 'localhost', 'port': 27017}}
Note
The module supports Python versions >= 3.6 and is currently experimental, so the interface may change slightly in the future.
For the main functionality of the LSH algorithm see datasketch.MinHashLSH.
For additional information see MinHash LSH at scale and Asynchronous MinHash LSH at scale.
- __init__(threshold: float = 0.9, num_perm: int = 128, weights: Tuple[float, float] = (0.5, 0.5), params: Tuple[int, int] = None, storage_config: Dict = None)[source]¶
Initialize self. See help(type(self)) for accurate signature.
- insertion_session(batch_size=10000)[source]¶
Create an asynchronous context manager for fast insertion into this index.
- Parameters
batch_size (int) – the size of chunks to use in insert_session mode (default=10000).
- Returns
datasketch.experimental.aio.lsh.AsyncMinHashLSHSession
Example
import asyncio
import random
import string
from itertools import chain, islice

from datasketch.experimental.aio.lsh import AsyncMinHashLSH
from datasketch import MinHash

def chunk(it, size):
    it = iter(it)
    return iter(lambda: tuple(islice(it, size)), ())

_chunked_str = chunk((random.choice(string.ascii_lowercase) for _ in range(10000)), 4)
seq = frozenset(chain((''.join(s) for s in _chunked_str),
                      ('aahhb', 'aahh', 'aahhc', 'aac', 'kld', 'bhg', 'kkd', 'yow', 'ppi', 'eer')))
objs = [MinHash(16) for _ in range(len(seq))]
for e, obj in zip(seq, objs):
    for i in e:
        obj.update(i.encode('utf-8'))
data = [(e, m) for e, m in zip(seq, objs)]

_storage_config_redis = {'type': 'aiomongo', 'mongo': {'host': 'localhost', 'port': 27017}}

async def func():
    async with AsyncMinHashLSH(storage_config=_storage_config_redis, threshold=0.5, num_perm=16) as lsh:
        async with lsh.insertion_session(batch_size=1000) as session:
            fs = (session.insert(key, minhash, check_duplication=True) for key, minhash in data)
            await asyncio.gather(*fs)
- delete_session(batch_size=10000)[source]¶
Create an asynchronous context manager for fast removal of keys from this index.
- Parameters
batch_size (int) – the size of chunks to use in delete_session mode (default=10000).
- Returns
datasketch.experimental.aio.lsh.AsyncMinHashLSHSession
Example
import asyncio
import random
import string
from itertools import chain, islice

from datasketch.experimental.aio.lsh import AsyncMinHashLSH
from datasketch import MinHash

def chunk(it, size):
    it = iter(it)
    return iter(lambda: tuple(islice(it, size)), ())

_chunked_str = chunk((random.choice(string.ascii_lowercase) for _ in range(10000)), 4)
seq = frozenset(chain((''.join(s) for s in _chunked_str),
                      ('aahhb', 'aahh', 'aahhc', 'aac', 'kld', 'bhg', 'kkd', 'yow', 'ppi', 'eer')))
objs = [MinHash(16) for _ in range(len(seq))]
for e, obj in zip(seq, objs):
    for i in e:
        obj.update(i.encode('utf-8'))
data = [(e, m) for e, m in zip(seq, objs)]
keys_to_remove = ('aahhb', 'aahh', 'aahhc')  # keys that were inserted above

_storage_config_redis = {'type': 'aiomongo', 'mongo': {'host': 'localhost', 'port': 27017}}

async def func():
    async with AsyncMinHashLSH(storage_config=_storage_config_redis, threshold=0.5, num_perm=16) as lsh:
        async with lsh.insertion_session(batch_size=1000) as session:
            fs = (session.insert(key, minhash, check_duplication=True) for key, minhash in data)
            await asyncio.gather(*fs)

        async with lsh.delete_session(batch_size=3) as session:
            fs = (session.remove(key) for key in keys_to_remove)
            await asyncio.gather(*fs)
MinHash LSH Forest¶
- class datasketch.MinHashLSHForest(num_perm=128, l=8)[source]¶
The LSH Forest for MinHash. It supports top-k query in Jaccard similarity. Instead of using prefix trees as in the original paper, I use a sorted array to store the hash values in every hash table.
- Parameters
num_perm (int, optional) – The number of permutation functions used by the MinHash to be indexed. For weighted MinHash, this is the sample size (sample_size).
l (int, optional) – The number of prefix trees as described in the paper.
Note
The MinHash LSH Forest also works with weighted Jaccard similarity and weighted MinHash without modification.
- add(key, minhash)[source]¶
Add a unique key, together with a MinHash (or weighted MinHash) of the set referenced by the key.
Note
The key won’t be searchable until the datasketch.MinHashLSHForest.index() method is called.
- Parameters
key (hashable) – The unique identifier of the set.
minhash (datasketch.MinHash) – The MinHash of the set.
- query(minhash, k)[source]¶
Return the approximate top-k keys that have the (approximately) highest Jaccard similarities to the query set.
- Parameters
minhash (datasketch.MinHash) – The MinHash of the query set.
k (int) – The maximum number of keys to return.
- Returns
list of at most k keys.
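For illustration, a minimal sketch of building a forest and running a top-k query; the document keys and tokens are made up:
from datasketch import MinHash, MinHashLSHForest

docs = {
    "doc1": ["minhash", "lsh", "forest", "top", "k"],
    "doc2": ["minhash", "lsh", "index", "jaccard"],
    "doc3": ["completely", "unrelated", "tokens"],
}

forest = MinHashLSHForest(num_perm=128)
minhashes = {}
for key, tokens in docs.items():
    m = MinHash(num_perm=128)
    for t in tokens:
        m.update(t.encode('utf-8'))
    minhashes[key] = m
    forest.add(key, m)

# Keys become searchable only after index() is called
forest.index()

# At most 2 keys with (approximately) the highest Jaccard similarity to doc1
print(forest.query(minhashes["doc1"], 2))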
- is_empty()[source]¶
Check whether there are any searchable keys in the index. Note that keys won’t be searchable until index is called.
- Returns
True if there is no searchable key in the index.
- Return type
bool
- __contains__(key)[source]¶
- Returns
True only if the key has been added to the index.
- Return type
bool
MinHash LSH Ensemble¶
- class datasketch.MinHashLSHEnsemble(threshold=0.9, num_perm=128, num_part=16, m=8, weights=(0.5, 0.5), storage_config=None, prepickle=None)[source]¶
The MinHash LSH Ensemble index. It supports Containment queries. The implementation is based on E. Zhu et al.
- Parameters
threshold (float) – The Containment threshold between 0.0 and 1.0. The initialized LSH Ensemble will be optimized for the threshold by minimizing the false positive and false negative.
num_perm (int, optional) – The number of permutation functions used by the MinHash to be indexed. For weighted MinHash, this is the sample size (sample_size).
num_part (int, optional) – The number of partitions in LSH Ensemble.
m (int, optional) – The memory usage factor: an LSH Ensemble uses approximately m times more memory space than a MinHash LSH with the same number of sets indexed. The higher the m the better the accuracy.
weights (tuple, optional) – Used to adjust the relative importance of minimizing false positive and false negative when optimizing for the Containment threshold. Similar to the weights parameter in datasketch.MinHashLSH.
storage_config (dict, optional) – Type of storage service to use for storing hashtables and keys. basename is an optional property whose value will be used as the prefix to stored keys. If this is not set, a random string will be generated instead. If you set this, you will be responsible for ensuring there are no key collisions.
prepickle (bool, optional) – If True, all keys are pickled to bytes before insertion. If None, a default value is chosen based on the storage_config.
Note
Using more partitions (num_part) leads to better accuracy, at the expense of slower query performance. This is different from the paper and the Go implementation, in which more partitions leads to better accuracy AND faster query performance, due to parallelism.
Note
More information about the parameter m can be found in the Go implementation of LSH Ensemble, in which m is named MaxK.
- __init__(threshold=0.9, num_perm=128, num_part=16, m=8, weights=(0.5, 0.5), storage_config=None, prepickle=None)[source]¶
Initialize self. See help(type(self)) for accurate signature.
- index(entries)[source]¶
Index all sets given their keys, MinHashes, and sizes. It can be called only once after the index is created.
- Parameters
entries (iterable of tuple) – An iterable of tuples, each must be in the form of (key, minhash, size), where key is the unique identifier of a set, minhash is the MinHash of the set, and size is the size or number of unique items in the set.
Note
size must be positive.
- query(minhash, size)[source]¶
Given the MinHash and size of the query set, retrieve the keys that reference sets whose containment with respect to the query set is greater than the threshold.
- Parameters
minhash (datasketch.MinHash) – The MinHash of the query set.
size (int) – The size (number of unique items) of the query set.
- Returns
iterator of keys.
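For illustration, a minimal sketch of indexing one set and running a containment query; the sets and the 0.8 threshold are made-up example values:
from datasketch import MinHash, MinHashLSHEnsemble

set1 = ["cat", "dog", "fish", "cow"]
set2 = ["cat", "dog", "fish", "cow", "pig", "elephant", "lion", "tiger"]

m1, m2 = MinHash(num_perm=128), MinHash(num_perm=128)
for d in set1:
    m1.update(d.encode('utf-8'))
for d in set2:
    m2.update(d.encode('utf-8'))

lshensemble = MinHashLSHEnsemble(threshold=0.8, num_perm=128, num_part=16)

# index() takes (key, minhash, size) tuples and can be called only once
lshensemble.index([("set2", m2, len(set2))])

# set1 is fully contained in set2, so "set2" should be returned
print(list(lshensemble.query(m1, len(set1))))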
- __contains__(key)[source]¶
- Parameters
key (hashable) – The unique identifier of a set.
- Returns
True only if the key exists in the index.
- Return type
bool
HyperLogLog¶
- class datasketch.HyperLogLog(p=8, reg=None, hashfunc=<function sha1_hash32>, hashobj=None)[source]¶
The HyperLogLog data sketch for estimating the cardinality of a very large dataset in a single pass. The original HyperLogLog is described here.
This HyperLogLog implementation is based on: https://github.com/svpcom/hyperloglog
- Parameters
p (int, optional) – The precision parameter. It is ignored if the reg is given.
reg (numpy.array, optional) – The internal state. This argument is for initializing the HyperLogLog from an existing one.
hashfunc (optional) – The hash function used by this HyperLogLog. It takes the input passed to the update method and returns an integer that can be encoded with 32 bits. The default hash function is based on SHA1 from hashlib.
hashobj (deprecated) – This argument is deprecated since version 1.4.0. It is a no-op and has been replaced by hashfunc.
- __init__(p=8, reg=None, hashfunc=<function sha1_hash32>, hashobj=None)[source]¶
Initialize self. See help(type(self)) for accurate signature.
- update(b)[source]¶
Update the HyperLogLog with a new data value in bytes. The value will be hashed using the hash function specified by the hashfunc argument in the constructor.
- Parameters
b – The value to be hashed using the hash function specified.
Example
To update with a new string value (using the default SHA1 hash function, which requires bytes as input):
hll = HyperLogLog()
hll.update("new value".encode('utf-8'))
We can also use a different hash function, for example, pyfarmhash:
import farmhash

def _hash_32(b):
    return farmhash.hash32(b)

hll = HyperLogLog(hashfunc=_hash_32)
hll.update("new value")
- count()[source]¶
Estimate the cardinality of the data values seen so far.
- Returns
The estimated cardinality.
- Return type
int
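A short sketch of cardinality estimation (the sample values are arbitrary):
from datasketch import HyperLogLog

hll = HyperLogLog(p=12)
for v in ["alpha", "beta", "gamma", "alpha", "beta"]:
    hll.update(v.encode('utf-8'))

# Roughly 3 distinct values, subject to approximation error
print(hll.count())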
- merge(other)[source]¶
Merge the other HyperLogLog with this one, making this the union of the two.
- Parameters
other (datasketch.HyperLogLog) –
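A short sketch of merging two HyperLogLog sketches created with the same precision; the sample values are illustrative:
from datasketch import HyperLogLog

h1, h2 = HyperLogLog(p=12), HyperLogLog(p=12)
for v in ["a", "b"]:
    h1.update(v.encode('utf-8'))
for v in ["b", "c"]:
    h2.update(v.encode('utf-8'))

h1.merge(h2)       # h1 now estimates the union {a, b, c}
print(h1.count())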
- copy()[source]¶
Create a copy of the current HyperLogLog by exporting its state.
- Returns
A copy of the current HyperLogLog.
- Return type
datasketch.HyperLogLog
- is_empty()[source]¶
- Returns
True if the current HyperLogLog is empty, i.e., in the state of just being initialized.
- Return type
bool
- __eq__(other)[source]¶
Check equivalence between two HyperLogLogs.
- Parameters
other (datasketch.HyperLogLog) –
- Returns
True if both have the same internal state.
- Return type
bool
HyperLogLog++¶
- class datasketch.HyperLogLogPlusPlus(p=8, reg=None, hashfunc=<function sha1_hash64>, hashobj=None)[source]¶
HyperLogLog++ is an enhanced HyperLogLog from Google. Main changes from the original HyperLogLog:
Use 64 bits instead of 32 bits for hash function
A new small-cardinality estimation scheme
Sparse representation (not implemented here)
- Parameters
p (int, optional) – The precision parameter. It is ignored if the reg is given.
reg (numpy.array, optional) – The internal state. This argument is for initializing the HyperLogLog from an existing one.
hashfunc (optional) – The hash function used by this HyperLogLog++. It takes the input passed to the update method and returns an integer that can be encoded with 64 bits. The default hash function is based on SHA1 from hashlib.
hashobj (deprecated) – This argument is deprecated since version 1.4.0. It is a no-op and has been replaced by hashfunc.
- __init__(p=8, reg=None, hashfunc=<function sha1_hash64>, hashobj=None)[source]¶
Initialize self. See help(type(self)) for accurate signature.
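HyperLogLog++ exposes the same update and count interface as HyperLogLog; a minimal sketch with arbitrary sample values:
from datasketch import HyperLogLogPlusPlus

hpp = HyperLogLogPlusPlus(p=12)
for v in ["alpha", "beta", "gamma"]:
    hpp.update(v.encode('utf-8'))

# Estimated number of distinct values, using the 64-bit hash function
print(hpp.count())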