Efficient entity resolution for heterogeneous datasets

Research output: Chapter in Book/Report/Conference proceeding > Conference contribution > peer-review

Abstract

Entity resolution (ER) is the process of determining which records in one or more collections represent the same entity. It has emerged as a challenge in the big data era. A number of techniques have been developed to improve duplicate elimination; most of them employ blocking methods and focus on homogeneous data collections. Applied to heterogeneous data collections, these approaches can suffer in both precision and efficiency. We present a new technique for heterogeneous data that is more efficient than the existing techniques. It uses a selective comparison algorithm that provides not only a blocking scheme but also a two-phase comparison selection process. First, a finer token selection process avoids building oversized blocks. Then, blocks containing records that are unlikely to match are filtered out. Finally, the comparisons within the remaining blocks are processed so that only those likely to be duplicates are resolved. As a result, the number of comparisons drops significantly while the number of detected duplicates increases. The results of our experimental studies demonstrate the usefulness of the algorithm with respect to both effectiveness and efficiency.
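
The pipeline the abstract outlines (token-based blocking, discarding oversized blocks, then pruning comparisons inside the surviving blocks) can be illustrated with a minimal sketch. This is not the authors' algorithm: the abstract does not specify its token selection rule, its block filter, or its use of Bloom filters, so the sketch substitutes a plain block-size cap and a Jaccard overlap threshold as stand-ins. The names `tokens`, `build_blocks`, `candidate_pairs`, `max_block_size`, and `min_jaccard` are all hypothetical, as are the sample records.

```python
from collections import defaultdict
from itertools import combinations


def tokens(record):
    """Lowercased whitespace tokens from all attribute values.

    Attribute names are ignored, so records with different schemas
    (heterogeneous data) can still land in the same block.
    """
    return {tok for value in record.values() for tok in str(value).lower().split()}


def build_blocks(records, max_block_size):
    """Token blocking: one block per token.

    Singleton blocks yield no comparisons and oversized blocks yield too
    many, so both are dropped (an assumed stand-in for the paper's filter).
    """
    blocks = defaultdict(set)
    for rid, record in records.items():
        for tok in tokens(record):
            blocks[tok].add(rid)
    return {tok: ids for tok, ids in blocks.items() if 2 <= len(ids) <= max_block_size}


def candidate_pairs(records, max_block_size=3, min_jaccard=0.3):
    """Return pairs that share a block and whose token sets overlap enough."""
    token_sets = {rid: tokens(rec) for rid, rec in records.items()}
    pairs = set()
    for ids in build_blocks(records, max_block_size).values():
        for a, b in combinations(sorted(ids), 2):
            if (a, b) in pairs:
                continue  # already selected via another shared token
            ta, tb = token_sets[a], token_sets[b]
            # Jaccard similarity as an assumed comparison-selection test.
            if len(ta & tb) / len(ta | tb) >= min_jaccard:
                pairs.add((a, b))
    return pairs


# Hypothetical records with differing attribute names.
records = {
    1: {"name": "John Smith", "city": "New Orleans"},
    2: {"full_name": "john smith", "location": "new orleans la"},
    3: {"title": "Jane Doe", "town": "Boston"},
    4: {"name": "Jane Roe", "city": "New York"},
}
print(candidate_pairs(records))  # {(1, 2)}
```

With the defaults, records 1 and 2 share four of their five distinct tokens and are kept, while pairs that share only a single common token such as "new" or "jane" fall below the threshold and are never compared in detail.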

Original language: English
Title of host publication: 23rd International Conference on Software Engineering and Data Engineering, SEDE 2014
Editors: Ling Ding, Yan Shi
Publisher: International Society of Computers and Their Applications (ISCA)
Pages: 111-118
Number of pages: 8
ISBN (Electronic): 9781880843963
State: Published - 2014
Event: 23rd International Conference on Software Engineering and Data Engineering, SEDE 2014 - New Orleans, United States
Duration: Oct 13 2014 to Oct 15 2014

Keywords

  • Bloom filters
  • Duplicate elimination
  • Entity resolution
