Software Open Access

ekzhu/datasketch: Improved performance for MinHash and MinHashLSH

Eric Zhu; Vadim Markovtsev; aastafiev; Wojciech Łukasiewicz; ae-foster; Jordan Martin; Ekevoo; Kevin Mann; Keyur Joshi; Spandan Thakur; Stefano Ortolani; Titusz; Vojtech Letal; Zac Bentley; fpug

  • Performance improvement for MinHash's update method.
  • Make MinHash updates 4.5X faster by using the update_batch method for bulk updates. See the API doc: http://ekzhu.com/datasketch/documentation.html#datasketch.MinHash.update_batch
  • Further performance gain from bulk generation of MinHash objects with MinHash.bulk or MinHash.generator. See API doc and pull request.
  • Optional compression for the MinHash LSH index by hashing the bucket key produced by MinHashLSH._H. This reduces the memory/storage space used by the index. See pull request.
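
The idea behind the bucket-key compression item can be illustrated with a small standard-library sketch. It is not datasketch's implementation: ToyBandIndex, band_key and compress_key are hypothetical names, and blake2b with an 8-byte digest is an assumed choice of hash function. Each band of 16 hash values serializes to a 128-byte key; hashing that key down to 8 bytes keeps identical bands in the same bucket while storing far fewer key bytes.

```python
import hashlib
import struct


def band_key(hashvalues):
    """Serialize one band of 64-bit hash values into a bytes key (8 bytes each)."""
    return struct.pack(f"<{len(hashvalues)}Q", *hashvalues)


def compress_key(key, digest_size=8):
    """Hypothetical compression step: hash the key to a fixed-size digest."""
    return hashlib.blake2b(key, digest_size=digest_size).digest()


class ToyBandIndex:
    """One LSH band: maps an (optionally compressed) bucket key to item ids."""

    def __init__(self, compress=False):
        self.compress = compress
        self.buckets = {}

    def _key(self, hashvalues):
        key = band_key(hashvalues)
        return compress_key(key) if self.compress else key

    def insert(self, item_id, hashvalues):
        self.buckets.setdefault(self._key(hashvalues), set()).add(item_id)

    def query(self, hashvalues):
        return self.buckets.get(self._key(hashvalues), set())

    def key_bytes(self):
        """Total number of bytes spent on bucket keys."""
        return sum(len(k) for k in self.buckets)


# Toy data: doc-a and doc-b share the same band values, doc-c differs.
bands = {
    "doc-a": list(range(1000, 1016)),
    "doc-b": list(range(1000, 1016)),
    "doc-c": list(range(5000, 5016)),
}

plain = ToyBandIndex(compress=False)
small = ToyBandIndex(compress=True)
for item_id, values in bands.items():
    plain.insert(item_id, values)
    small.insert(item_id, values)

print("key bytes without compression:", plain.key_bytes())
print("key bytes with compression:", small.key_bytes())
print("candidates for doc-a:", sorted(small.query(bands["doc-a"])))
```

The trade-off in this sketch is that two different bands can, with small probability, hash to the same short digest and share a bucket, which would show up as extra false-positive candidates.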

Thank you @Sinusoidal36!

Files (795.7 kB)
Name: ekzhu/datasketch-1.5.2.zip
Size: 795.7 kB
md5: a3d3bce4aa309dab4bcd0ed08870cbc8
                   All versions   This version
Views              1,421          26
Downloads          226            1
Data volume        273.1 MB       795.7 kB
Unique views       1,230          25
Unique downloads   110            1
