API Documentation¶
MinHash¶
-
class
datasketch.
MinHash
(num_perm=128, seed=1, hashfunc=<function sha1_hash32>, hashobj=None, hashvalues=None, permutations=None)[source]¶ MinHash is a probabilistic data structure for computing Jaccard similarity between sets.
Parameters: - num_perm (int, optional) – Number of random permutation functions. It will be ignored if hashvalues is not None.
- seed (int, optional) – The random seed controls the set of random permutation functions generated for this MinHash.
- hashfunc (optional) – The hash function used by this MinHash. It takes the input passed to the update method and returns an integer that can be encoded with 32 bits. The default hash function is based on SHA1 from hashlib.
- hashobj (deprecated) – This argument is deprecated since version 1.4.0. It is a no-op and has been replaced by hashfunc.
- hashvalues (numpy.array or list, optional) – The hash values is the internal state of the MinHash. It can be specified for faster initialization using the existing state from another MinHash.
- permutations (optional) – The permutation function parameters. This argument can be specified for faster initialization using the existing state from another MinHash.
Note
To save memory usage, consider using
datasketch.LeanMinHash
.Note
Since version 1.1.1, MinHash will only support serialization using pickle.
serialize
anddeserialize
methods are removed, and are supported indatasketch.LeanMinHash
instead. MinHash serialized before version 1.1.1 cannot be deserialized properly in newer versions (need to migrate?).Note
Since version 1.1.3, MinHash uses Numpy’s random number generator instead of Python’s built-in random package. This change makes the hash values consistent across different Python versions. The side-effect is that now MinHash created before version 1.1.3 won’t work (i.e.,
jaccard
,merge
andunion
) with those created after.-
__init__
(num_perm=128, seed=1, hashfunc=<function sha1_hash32>, hashobj=None, hashvalues=None, permutations=None)[source]¶ Initialize self. See help(type(self)) for accurate signature.
-
update
(b)[source]¶ Update this MinHash with a new value. The value will be hashed using the hash function specified by the hashfunc argument in the constructor.
Parameters: b – The value to be hashed using the hash function specified. Example
To update with a new string value (using the default SHA1 hash function, which requires bytes as input):
minhash = Minhash() minhash.update("new value".encode('utf-8'))
We can also use a different hash function, for example, pyfarmhash:
import farmhash def _hash_32(b): return farmhash.hash32(b) minhash = MinHash(hashfunc=_hash_32) minhash.update("new value")
-
jaccard
(other)[source]¶ Estimate the Jaccard similarity (resemblance) between the sets represented by this MinHash and the other.
Parameters: other (datasketch.MinHash) – The other MinHash. Returns: The Jaccard similarity, which is between 0.0 and 1.0. Return type: float
-
count
()[source]¶ Estimate the cardinality count based on the technique described in this paper.
Returns: The estimated cardinality of the set represented by this MinHash. Return type: int
-
merge
(other)[source]¶ Merge the other MinHash with this one, making this one the union of both.
Parameters: other (datasketch.MinHash) – The other MinHash.
-
digest
()[source]¶ Export the hash values, which is the internal state of the MinHash.
Returns: The hash values which is a Numpy array. Return type: numpy.array
-
is_empty
()[source]¶ Returns: - If the current MinHash is empty - at the state of just
- initialized.
Return type: bool
-
__eq__
(other)[source]¶ Returns: bool – If their seeds and hash values are both equal then two are equivalent.
-
__weakref__
¶ list of weak references to the object (if defined)
-
classmethod
union
(*mhs)[source]¶ Create a MinHash which is the union of the MinHash objects passed as arguments.
Parameters: *mhs – The MinHash objects to be united. The argument list length is variable, but must be at least 2. Returns: A new union MinHash. Return type: datasketch.MinHash
Lean MinHash¶
-
class
datasketch.
LeanMinHash
(minhash)[source]¶ Lean MinHash is MinHash with a smaller memory footprint and faster deserialization, but with its internal state frozen – no update().
Lean MinHash inherits all methods from
datasketch.MinHash
. It does not store the permutations and the hashfunc needed for updating. If a MinHash does not need further updates, convert it into a lean MinHash to save memory.Example
To create a lean MinHash from an existing MinHash:
lean_minhash = LeanMinHash(minhash) # You can compute the Jaccard similarity between two lean MinHash lean_minhash.jaccard(lean_minhash2) # Or between a lean MinHash and a MinHash lean_minhash.jaccard(minhash2)
To create a MinHash from a lean MinHash:
minhash = MinHash(seed=lean_minhash.seed, hashvalues=lean_minhash.hashvalues) # Or if you want to prevent further updates on minhash # from affecting the state of lean_minhash minhash = MinHash(seed=lean_minhash.seed, hashvalues=lean_minhash.digest())
Note
Lean MinHash can also be used in
datasketch.MinHashLSH
,datasketch.MinHashLSHForest
, anddatasketch.MinHashLSHEnsemble
.Parameters: minhash – The datasketch.MinHash
object used to initialize the LeanMinHash.-
bytesize
(byteorder='@')[source]¶ Compute the byte size after serialization.
Parameters: byteorder (str, optional) – This is byte order of the serialized data. Use one of the byte order characters: @
,=
,<
,>
, and!
. Default is@
– the native order.Returns: Size in number of bytes after serialization. Return type: int
-
serialize
(buf, byteorder='@')[source]¶ Serialize this lean MinHash and store the result in an allocated buffer.
Parameters: - buf (buffer) – buf must implement the buffer interface. One such example is the built-in bytearray class.
- byteorder (str, optional) –
This is byte order of the serialized data. Use one of the byte order characters:
@
,=
,<
,>
, and!
. Default is@
– the native order.
This is preferred over using pickle if the serialized lean MinHash needs to be used by another program in a different programming language.
- The serialization schema:
- The first 8 bytes is the seed integer
- The next 4 bytes is the number of hash values
- The rest is the serialized hash values, each uses 4 bytes
Example
To serialize a single lean MinHash into a bytearray buffer.
buf = bytearray(lean_minhash.bytesize()) lean_minhash.serialize(buf)
To serialize multiple lean MinHash into a bytearray buffer.
# assuming lean_minhashs is a list of LeanMinHash with the same size size = lean_minhashs[0].bytesize() buf = bytearray(size*len(lean_minhashs)) for i, lean_minhash in enumerate(lean_minhashs): lean_minhash.serialize(buf[i*size:])
-
classmethod
deserialize
(buf, byteorder='@')[source]¶ Deserialize a lean MinHash from a buffer.
Parameters: - buf (buffer) – buf must implement the buffer interface. One such example is the built-in bytearray class.
- byteorder (str. optional) –
This is byte order of the serialized data. Use one of the byte order characters:
@
,=
,<
,>
, and!
. Default is@
– the native order.
Returns: The deserialized lean MinHash
Return type: Example
To deserialize a lean MinHash from a buffer.
lean_minhash = LeanMinHash.deserialize(buf)
-
classmethod
union
(*lmhs)[source]¶ Create a MinHash which is the union of the MinHash objects passed as arguments.
Parameters: *mhs – The MinHash objects to be united. The argument list length is variable, but must be at least 2. Returns: A new union MinHash. Return type: datasketch.MinHash
-
Weighted MinHash¶
-
class
datasketch.
WeightedMinHashGenerator
(dim, sample_size=128, seed=1)[source]¶ The weighted MinHash generator is used for creating new
datasketch.WeightedMinHash
objects.This weighted MinHash implementation is based on Sergey Ioffe’s paper, Improved Consistent Sampling, Weighted Minhash and L1 Sketching
Parameters: - dim (int) – The number of dimensions of the input Jaccard vectors.
- sample_size (int, optional) – The number of samples to use for creating weighted MinHash.
- seed (int) – The random seed to use for generating permutation functions.
-
__init__
(dim, sample_size=128, seed=1)[source]¶ Initialize self. See help(type(self)) for accurate signature.
-
minhash
(v)[source]¶ Create a new weighted MinHash given a weighted Jaccard vector. Each dimension is an integer frequency of the corresponding element in the multi-set represented by the vector.
Parameters: v (numpy.array) – The Jaccard vector.
-
__weakref__
¶ list of weak references to the object (if defined)
-
class
datasketch.
WeightedMinHash
(seed, hashvalues)[source]¶ New weighted MinHash is generated by
datasketch.WeightedMinHashGenerator
. You can also initialize weighted MinHash by using the state from an existing one.Parameters: - seed (int) – The random seed used to generate this weighted MinHash.
- hashvalues – The internal state of this weighted MinHash.
-
jaccard
(other)[source]¶ Estimate the weighted Jaccard similarity between the multi-sets represented by this weighted MinHash and the other.
Parameters: other (datasketch.WeightedMinHash) – The other weighted MinHash. Returns: The weighted Jaccard similarity between 0.0 and 1.0. Return type: float
-
digest
()[source]¶ Export the hash values, which is the internal state of the weighted MinHash.
Returns: The hash values which is a Numpy array. Return type: numpy.array
-
copy
()[source]¶ Returns: A copy of this weighted MinHash by exporting its state. Return type: datasketch.WeightedMinHash
-
__eq__
(other)[source]¶ Returns: If their seeds and hash values are both equal then two are equivalent. Return type: bool
-
__weakref__
¶ list of weak references to the object (if defined)
MinHash LSH¶
-
class
datasketch.
MinHashLSH
(threshold=0.9, num_perm=128, weights=(0.5, 0.5), params=None, storage_config=None, prepickle=None)[source]¶ The MinHash LSH index. It supports query with Jaccard similarity threshold. Reference: Chapter 3, Mining of Massive Datasets.
Parameters: - threshold (float) – The Jaccard similarity threshold between 0.0 and 1.0. The initialized MinHash LSH will be optimized for the threshold by minizing the false positive and false negative.
- num_perm (int, optional) – The number of permutation functions used by the MinHash to be indexed. For weighted MinHash, this is the sample size (sample_size).
- weights (tuple, optional) – Used to adjust the relative importance of
minimizing false positive and false negative when optimizing
for the Jaccard similarity threshold.
weights is a tuple in the format of
(false_positive_weight, false_negative_weight)
. - params (tuple, optional) – The LSH parameters (i.e., number of bands and size of each bands). This is used to bypass the parameter optimization step in the constructor. threshold and weights will be ignored if this is given.
- storage_config (dict, optional) – Type of storage service to use for storing hashtables and keys. basename is an optional property whose value will be used as the prefix to stored keys. If this is not set, a random string will be generated instead. If you set this, you will be responsible for ensuring there are no key collisions.
- prepickle (bool, optional) – If True, all keys are pickled to bytes before insertion. If None, a default value is chosen based on the storage_config.
Note
weights must sum to 1.0, and the format is (false positive weight, false negative weight). For example, if minimizing false negative (or maintaining high recall) is more important, assign more weight toward false negative: weights=(0.4, 0.6). Try to live with a small difference between weights (i.e. < 0.5).
-
__init__
(threshold=0.9, num_perm=128, weights=(0.5, 0.5), params=None, storage_config=None, prepickle=None)[source]¶ Initialize self. See help(type(self)) for accurate signature.
-
insert
(key, minhash, check_duplication=True)[source]¶ Insert a key to the index, together with a MinHash (or weighted MinHash) of the set referenced by the key.
Parameters: - key (str) – The identifier of the set.
- minhash (datasketch.MinHash) – The MinHash of the set.
- check_duplication (bool) – To avoid duplicate keys in the storage (default=True). It’s recommended to not change the default, but if you want to avoid the overhead during insert you can set check_duplication = False.
-
insertion_session
(buffer_size=50000)[source]¶ Create a context manager for fast insertion into this index.
Parameters: buffer_size (int) – The buffer size for insert_session mode (default=50000). Returns: datasketch.lsh.MinHashLSHInsertionSession
-
query
(minhash)[source]¶ Giving the MinHash of the query set, retrieve the keys that references sets with Jaccard similarities greater than the threshold.
Parameters: minhash (datasketch.MinHash) – The MinHash of the query set. Returns: list of unique keys.
-
__contains__
(key)[source]¶ Parameters: key (hashable) – The unique identifier of a set. Returns: True only if the key exists in the index. Return type: bool
-
remove
(key)[source]¶ Remove the key from the index.
Parameters: key (hashable) – The unique identifier of a set.
-
get_counts
()[source]¶ Returns a list of length
self.b
with elements representing the number of keys stored under each bucket for the given permutation.
-
get_subset_counts
(*keys)[source]¶ Returns the bucket allocation counts (see
get_counts()
above) restricted to the list of keys given.Parameters: keys (hashable) – the keys for which to get the bucket allocation counts
-
__weakref__
¶ list of weak references to the object (if defined)
Asynchronous MinHash LSH¶
-
class
datasketch.experimental.aio.lsh.
AsyncMinHashLSH
(threshold: float = 0.9, num_perm: int = 128, weights: Tuple[float, float] = (0.5, 0.5), params: Tuple[int, int] = None, storage_config: Dict = None)[source]¶ Asynchronous MinHashLSH index.
Parameters: - threshold (float) – see
datasketch.MinHashLSH
. - num_perm (int) – see
datasketch.MinHashLSH
. - weights (tuple(float, float)) – see
datasketch.MinHashLSH
. - params (tuple) – see
datasketch.MinHashLSH
. - storage_config (dict) – New type of storage service - aiomongo - to use for storing hashtables and keys are implemented. If storage_config is None aiomongo storage will be used.
For example usage see Asynchronous MinHash LSH at scale.
Example of supported storage configuration:
MONGO = {'type': 'aiomongo', 'basename': 'base_name_1', 'mongo': {'host': 'localhost', 'port': 27017}}
Note
- The module supports Python version >=3.6, and is currently experimental. So the interface may change slightly in the future.
- For main functionality of LSH algorithm see
datasketch.MinHashLSH
. - For additional information see MinHash LSH at scale and Asynchronous MinHash LSH at scale
-
__init__
(threshold: float = 0.9, num_perm: int = 128, weights: Tuple[float, float] = (0.5, 0.5), params: Tuple[int, int] = None, storage_config: Dict = None)[source]¶ Initialize self. See help(type(self)) for accurate signature.
-
insertion_session
(batch_size=10000)[source]¶ Create a asynchronous context manager for fast insertion in index.
Parameters: batch_size (int) – the size of chunks to use in insert_session mode (default=10000). Returns: datasketch.experimental.aio.lsh.AsyncMinHashLSHSession Example
from datasketch.experimental.aio.lsh import AsyncMinHashLSH from datasketch import MinHash def chunk(it, size): it = iter(it) return iter(lambda: tuple(islice(it, size)), ()) _chunked_str = chunk((random.choice(string.ascii_lowercase) for _ in range(10000)), 4) seq = frozenset(chain((''.join(s) for s in _chunked_str), ('aahhb', 'aahh', 'aahhc', 'aac', 'kld', 'bhg', 'kkd', 'yow', 'ppi', 'eer'))) objs = [MinHash(16) for _ in range(len(seq))] for e, obj in zip(seq, objs): for i in e: obj.update(i.encode('utf-8')) data = [(e, m) for e, m in zip(seq, objs)] _storage_config_redis = {'type': 'aiomongo', 'mongo': {'host': 'localhost', 'port': 27017}} async def func(): async with AsyncMinHashLSH(storage_config=_storage_config_redis, threshold=0.5, num_perm=16) as lsh: async with lsh.insertion_session(batch_size=1000) as session: fs = (session.insert(key, minhash, check_duplication=True) for key, minhash in data) await asyncio.gather(*fs)
-
delete_session
(batch_size=10000)[source]¶ Create a asynchronous context manager for fast removal of keys from index.
Parameters: batch_size (int) – the size of chunks to use in insert_session mode (default=10000). Returns: datasketch.experimental.aio.lsh.AsyncMinHashLSHSession Example
from datasketch.experimental.aio.lsh import AsyncMinHashLSH from datasketch import MinHash def chunk(it, size): it = iter(it) return iter(lambda: tuple(islice(it, size)), ()) _chunked_str = chunk((random.choice(string.ascii_lowercase) for _ in range(10000)), 4) seq = frozenset(chain((''.join(s) for s in _chunked_str), ('aahhb', 'aahh', 'aahhc', 'aac', 'kld', 'bhg', 'kkd', 'yow', 'ppi', 'eer'))) objs = [MinHash(16) for _ in range(len(seq))] for e, obj in zip(seq, objs): for i in e: obj.update(i.encode('utf-8')) data = [(e, m) for e, m in zip(seq, objs)] _storage_config_redis = {'type': 'aiomongo', 'mongo': {'host': 'localhost', 'port': 27017}} async def func(): async with AsyncMinHashLSH(storage_config=_storage_config_redis, threshold=0.5, num_perm=16) as lsh: async with lsh.insertion_session(batch_size=1000) as session: fs = (session.insert(key, minhash, check_duplication=True) for key, minhash in data) await asyncio.gather(*fs) async with lsh.delete_session(batch_size=3) as session: fs = (session.remove(key) for key in keys_to_remove) await asyncio.gather(*fs)
-
__weakref__
¶ list of weak references to the object (if defined)
- threshold (float) – see
MinHash LSH Forest¶
-
class
datasketch.
MinHashLSHForest
(num_perm=128, l=8)[source]¶ The LSH Forest for MinHash. It supports top-k query in Jaccard similarity. Instead of using prefix trees as the original paper, I use a sorted array to store the hash values in every hash table.
Parameters: - num_perm (int, optional) – The number of permutation functions used by the MinHash to be indexed. For weighted MinHash, this is the sample size (sample_size).
- l (int, optional) – The number of prefix trees as described in the paper.
Note
The MinHash LSH Forest also works with weighted Jaccard similarity and weighted MinHash without modification.
-
add
(key, minhash)[source]¶ Add a unique key, together with a MinHash (or weighted MinHash) of the set referenced by the key.
Note
The key won’t be searchbale until the
datasketch.MinHashLSHForest.index()
method is called.Parameters: - key (hashable) – The unique identifier of the set.
- minhash (datasketch.MinHash) – The MinHash of the set.
-
query
(minhash, k)[source]¶ Return the approximate top-k keys that have the highest Jaccard similarities to the query set.
Parameters: - minhash (datasketch.MinHash) – The MinHash of the query set.
- k (int) – The maximum number of keys to return.
Returns: list of at most k keys.
-
is_empty
()[source]¶ Check whether there is any searchable keys in the index. Note that keys won’t be searchable until index is called.
Returns: True if there is no searchable key in the index. Return type: bool
-
__contains__
(key)[source]¶ Returns: True only if the key has been added to the index. Return type: bool
-
__weakref__
¶ list of weak references to the object (if defined)
MinHash LSH Ensemble¶
-
class
datasketch.
MinHashLSHEnsemble
(threshold=0.9, num_perm=128, num_part=16, m=8, weights=(0.5, 0.5))[source]¶ The MinHash LSH Ensemble index. It supports Containment queries. The implementation is based on E. Zhu et al..
Parameters: - threshold (float) – The Containment threshold between 0.0 and 1.0. The initialized LSH Ensemble will be optimized for the threshold by minizing the false positive and false negative.
- num_perm (int, optional) – The number of permutation functions used by the MinHash to be indexed. For weighted MinHash, this is the sample size (sample_size).
- num_part (int, optional) – The number of partitions in LSH Ensemble.
- m (int, optional) – The memory usage factor: an LSH Ensemble uses approximately m times more memory space than a MinHash LSH with the same number of sets indexed. The higher the m the better the accuracy.
- weights (tuple, optional) – Used to adjust the relative importance of
minizing false positive and false negative when optimizing
for the Containment threshold. Similar to the weights parameter
in
datasketch.MinHashLSH
.
Note
Using more partitions (num_part) leads to better accuracy, at the expense of slower query performance. This is different from the paper and the Go implementation, in which more partitions leads to better accuracy AND faster query performance, due to parallelism.
Note
More information about the parameter m can be found in the Go implementation of LSH Ensemble, in which m is named MaxK.
-
__init__
(threshold=0.9, num_perm=128, num_part=16, m=8, weights=(0.5, 0.5))[source]¶ Initialize self. See help(type(self)) for accurate signature.
-
index
(entries)[source]¶ Index all sets given their keys, MinHashes, and sizes. It can be called only once after the index is created.
Parameters: entries (iterable of tuple) – An iterable of tuples, each must be in the form of (key, minhash, size), where key is the unique identifier of a set, minhash is the MinHash of the set, and size is the size or number of unique items in the set. Note
size must be positive.
-
query
(minhash, size)[source]¶ Giving the MinHash and size of the query set, retrieve keys that references sets with containment with respect to the query set greater than the threshold.
Parameters: - minhash (datasketch.MinHash) – The MinHash of the query set.
- size (int) – The size (number of unique items) of the query set.
Returns: iterator of keys.
-
__contains__
(key)[source]¶ Parameters: key (hashable) – The unique identifier of a set. Returns: True only if the key exists in the index. Return type: bool
-
__weakref__
¶ list of weak references to the object (if defined)
HyperLogLog¶
-
class
datasketch.
HyperLogLog
(p=8, reg=None, hashfunc=<function sha1_hash32>, hashobj=None)[source]¶ The HyperLogLog data sketch for estimating cardinality of very large dataset in a single pass. The original HyperLogLog is described here.
This HyperLogLog implementation is based on: https://github.com/svpcom/hyperloglog
Parameters: - p (int, optional) – The precision parameter. It is ignored if the reg is given.
- reg (numpy.array, optional) – The internal state. This argument is for initializing the HyperLogLog from an existing one.
- hashfunc (optional) – The hash function used by this MinHash. It takes the input passed to the update method and returns an integer that can be encoded with 32 bits. The default hash function is based on SHA1 from hashlib.
- hashobj (deprecated) – This argument is deprecated since version 1.4.0. It is a no-op and has been replaced by hashfunc.
-
__init__
(p=8, reg=None, hashfunc=<function sha1_hash32>, hashobj=None)[source]¶ Initialize self. See help(type(self)) for accurate signature.
-
update
(b)[source]¶ Update the HyperLogLog with a new data value in bytes. The value will be hashed using the hash function specified by the hashfunc argument in the constructor.
Parameters: b – The value to be hashed using the hash function specified. Example
To update with a new string value (using the default SHA1 hash function, which requires bytes as input):
hll = HyperLogLog() hll.update("new value".encode('utf-8'))
We can also use a different hash function, for example, pyfarmhash:
import farmhash def _hash_32(b): return farmhash.hash32(b) hll = HyperLogLog(hashfunc=_hash_32) hll.update("new value")
-
count
()[source]¶ Estimate the cardinality of the data values seen so far.
Returns: The estimated cardinality. Return type: int
-
merge
(other)[source]¶ Merge the other HyperLogLog with this one, making this the union of the two.
Parameters: other (datasketch.HyperLogLog) –
-
copy
()[source]¶ Create a copy of the current HyperLogLog by exporting its state.
Returns: Return type: datasketch.HyperLogLog
-
is_empty
()[source]¶ Returns: True if the current HyperLogLog is empty - at the state of just initialized. Return type: bool
-
__eq__
(other)[source]¶ Check equivalence between two HyperLogLogs
Parameters: other (datasketch.HyperLogLog) – Returns: True if both have the same internal state. Return type: bool
HyperLogLog++¶
-
class
datasketch.
HyperLogLogPlusPlus
(p=8, reg=None, hashfunc=<function sha1_hash64>, hashobj=None)[source]¶ HyperLogLog++ is an enhanced HyperLogLog from Google. Main changes from the original HyperLogLog:
- Use 64 bits instead of 32 bits for hash function
- A new small-cardinality estimation scheme
- Sparse representation (not implemented here)
Parameters: - p (int, optional) – The precision parameter. It is ignored if the reg is given.
- reg (numpy.array, optional) – The internal state. This argument is for initializing the HyperLogLog from an existing one.
- hashfunc (optional) – The hash function used by this MinHash. It takes the input passed to the update method and returns an integer that can be encoded with 64 bits. The default hash function is based on SHA1 from hashlib.
- hashobj (deprecated) – This argument is deprecated since version 1.4.0. It is a no-op and has been replaced by hashfunc.
-
__init__
(p=8, reg=None, hashfunc=<function sha1_hash64>, hashobj=None)[source]¶ Initialize self. See help(type(self)) for accurate signature.
-
__weakref__
¶ list of weak references to the object (if defined)