Software Open Access

ekzhu/datasketch: Improved performance for MinHash and MinHashLSH

Eric Zhu; Vadim Markovtsev; aastafiev; Wojciech Łukasiewicz; ae-foster; Jordan Martin; Ekevoo; Kevin Mann; Keyur Joshi; Spandan Thakur; Stefano Ortolani; Titusz; Vojtech Letal; Zac Bentley; fpug


DataCite XML Export

<?xml version='1.0' encoding='utf-8'?>
<resource xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns="http://datacite.org/schema/kernel-4" xsi:schemaLocation="http://datacite.org/schema/kernel-4 http://schema.datacite.org/meta/kernel-4.1/metadata.xsd">
  <identifier identifierType="DOI">10.5281/zenodo.4323502</identifier>
  <creators>
    <creator>
      <creatorName>Eric Zhu</creatorName>
    </creator>
    <creator>
      <creatorName>Vadim Markovtsev</creatorName>
      <affiliation>@athenianco</affiliation>
    </creator>
    <creator>
      <creatorName>aastafiev</creatorName>
    </creator>
    <creator>
      <creatorName>Wojciech Łukasiewicz</creatorName>
    </creator>
    <creator>
      <creatorName>ae-foster</creatorName>
    </creator>
    <creator>
      <creatorName>Jordan Martin</creatorName>
    </creator>
    <creator>
      <creatorName>Ekevoo</creatorName>
    </creator>
    <creator>
      <creatorName>Kevin Mann</creatorName>
      <affiliation>Six Five Design</affiliation>
    </creator>
    <creator>
      <creatorName>Keyur Joshi</creatorName>
      <affiliation>University of Illinois, Urbana-Champaign</affiliation>
    </creator>
    <creator>
      <creatorName>Spandan Thakur</creatorName>
      <affiliation>Adobe</affiliation>
    </creator>
    <creator>
      <creatorName>Stefano Ortolani</creatorName>
    </creator>
    <creator>
      <creatorName>Titusz</creatorName>
    </creator>
    <creator>
      <creatorName>Vojtech Letal</creatorName>
      <affiliation>@blindspot-ai</affiliation>
    </creator>
    <creator>
      <creatorName>Zac Bentley</creatorName>
      <affiliation>Klaviyo</affiliation>
    </creator>
    <creator>
      <creatorName>fpug</creatorName>
    </creator>
  </creators>
  <titles>
    <title>ekzhu/datasketch: Improved performance for MinHash and MinHashLSH</title>
  </titles>
  <publisher>Zenodo</publisher>
  <publicationYear>2020</publicationYear>
  <dates>
    <date dateType="Issued">2020-12-15</date>
  </dates>
  <resourceType resourceTypeGeneral="Software"/>
  <alternateIdentifiers>
    <alternateIdentifier alternateIdentifierType="url">https://zenodo.org/record/4323502</alternateIdentifier>
  </alternateIdentifiers>
  <relatedIdentifiers>
    <relatedIdentifier relatedIdentifierType="URL" relationType="IsSupplementTo">https://github.com/ekzhu/datasketch/tree/1.5.2</relatedIdentifier>
    <relatedIdentifier relatedIdentifierType="DOI" relationType="IsVersionOf">10.5281/zenodo.598238</relatedIdentifier>
  </relatedIdentifiers>
  <version>1.5.2</version>
  <rightsList>
    <rights rightsURI="info:eu-repo/semantics/openAccess">Open Access</rights>
  </rightsList>
  <descriptions>
    <description descriptionType="Abstract">&lt;ul&gt;
&lt;li&gt;Performance improvement for MinHash's update method.&lt;/li&gt;
&lt;li&gt;Make MinHash updates 4.5X faster by using the &lt;code&gt;update_batch&lt;/code&gt; method for bulk updates on MinHash. See the &lt;a href="http://ekzhu.com/datasketch/documentation.html#datasketch.MinHash.update_batch"&gt;API doc&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;Further performance gains from bulk generation of MinHash objects using &lt;code&gt;MinHash.bulk&lt;/code&gt; or &lt;code&gt;MinHash.generator&lt;/code&gt;. See &lt;a href="http://ekzhu.com/datasketch/documentation.html#datasketch.MinHash.bulk"&gt;API doc&lt;/a&gt; and &lt;a href="https://github.com/ekzhu/datasketch/pull/142"&gt;pull request&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;Optional compression for the MinHash LSH index by hashing the bucket key produced by &lt;code&gt;MinHashLSH._H&lt;/code&gt;. This reduces the memory and storage space used by the index. See &lt;a href="https://github.com/ekzhu/datasketch/pull/143"&gt;pull request&lt;/a&gt;.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Thank you @Sinusoidal36!&lt;/p&gt;</description>
  </descriptions>
</resource>