Skip to main navigation Skip to search Skip to main content

Improving the Sensitivity of MinHash Through Hash-Value Analysis

  • Université Gustave Eiffel

Research output: Chapter in Book/Report/Conference proceedingConference contributionpeer-review

Abstract

MinHash sketching is an important algorithm for efficient document retrieval and bioinformatics. We show that the value of the matching MinHash codes convey additional information about the Jaccard similarity of S and T over and above the fact that the MinHash codes agree. This observation holds the potential to increase the sensitivity of minhash-based retrieval systems. We analyze the expected Jaccard similarity of two sets as a function of observing a matching MinHash value a under a reasonable prior distribution on intersection set sizes, and present a practical approach to using MinHash values to improve the sensitivity of traditional Jaccard similarity estimation, based on the Kolmogorov-Smirnov statistical test for sample distributions. Experiments over a wide range of hash function counts and set similarities show a small but consistent improvement over chance at predicting over/under-estimation, yielding an average accuracy of 61% over the range of experiments.

Original languageEnglish
Title of host publication34th Annual Symposium on Combinatorial Pattern Matching, CPM 2023
EditorsLaurent Bulteau, Zsuzsanna Liptak
PublisherSchloss Dagstuhl- Leibniz-Zentrum fur Informatik GmbH, Dagstuhl Publishing
ISBN (Electronic)9783959772761
DOIs
StatePublished - Jun 2023
Event34th Annual Symposium on Combinatorial Pattern Matching, CPM 2023 - Marne-la-Vallee, France
Duration: Jun 26 2023Jun 28 2023

Publication series

NameLeibniz International Proceedings in Informatics, LIPIcs
Volume259

Conference

Conference34th Annual Symposium on Combinatorial Pattern Matching, CPM 2023
Country/TerritoryFrance
CityMarne-la-Vallee
Period06/26/2306/28/23

Keywords

  • MinHash sketching
  • hashing
  • sequence similarity

Fingerprint

Dive into the research topics of 'Improving the Sensitivity of MinHash Through Hash-Value Analysis'. Together they form a unique fingerprint.

Cite this