TY - GEN
T1 - DeepClean
T2 - 5th IEEE International Conference on Data Science and Advanced Analytics, DSAA 2018
AU - Zhang, Xinyang
AU - Ji, Yujie
AU - Nguyen, Chanh
AU - Wang, Ting
N1 - Publisher Copyright: © 2018 IEEE.
PY - 2018/7/2
Y1 - 2018/7/2
N2 - As one critical task in the data analysis pipeline, data cleaning is notoriously human labor-intensive and error-prone. Knowledge base-assisted data cleaning has proved a powerful tool for finding and fixing data defects; however, its applicability is inevitably bounded by the natural limitations of knowledge bases. Meanwhile, although a vast number of knowledge sources exist in the form of free-text corpora (e.g., Wikipedia), transforming them into formats usable by existing data cleaning tools can be prohibitively costly and error-prone, if not at all impossible. Here, we present DeepClean, the first end-to-end data cleaning framework powered by free-text knowledge sources. At a high level, DeepClean leverages a knowledge source through its question-answering (QA) interface and achieves high-quality cleaning via iterative question asking. Specifically, DeepClean detects and repairs data defects in three stages: (i) Pattern extraction - it automatically discovers the semantic types of the data attributes as well as their correlations; (ii) Question generation - it translates each data tuple into a minimal set of validation questions; (iii) Completion and repair - by checking the answers returned by the knowledge source against the data values, it identifies erroneous cases and suggests possible fixes. Through extensive empirical studies, we demonstrate that DeepClean is applicable to a range of domains, and can effectively repair a variety of data defects, highlighting data cleaning powered by free-text knowledge sources as a promising direction for future research.
KW - Data clean
KW - Free text knowledge source
KW - Question asking
UR - https://www.scopus.com/pages/publications/85062867879
U2 - 10.1109/DSAA.2018.00039
DO - 10.1109/DSAA.2018.00039
M3 - Conference contribution
T3 - Proceedings - 2018 IEEE 5th International Conference on Data Science and Advanced Analytics, DSAA 2018
SP - 283
EP - 292
BT - Proceedings - 2018 IEEE 5th International Conference on Data Science and Advanced Analytics, DSAA 2018
A2 - Bonchi, Francesco
A2 - Provost, Foster
A2 - Eliassi-Rad, Tina
A2 - Wang, Wei
A2 - Cattuto, Ciro
A2 - Ghani, Rayid
PB - Institute of Electrical and Electronics Engineers Inc.
Y2 - 1 October 2018 through 4 October 2018
ER -