TY - GEN
T1 - HEDC: A Histogram Estimator for Data in the Cloud
T2 - 3rd ACM International Workshop on Cloud Data Management, CloudDB 2012 - Co-located with CIKM 2012
AU - Shi, Yingjie
AU - Meng, Xiaofeng
AU - Wang, Fusheng
AU - Gan, Yantao
PY - 2012
Y1 - 2012
N2 - With the increasing popularity of cloud-based data management, improving query performance in the cloud is an urgent problem. Summaries of data distribution and statistical information have been commonly used in traditional databases to support query optimization, and histograms are of particular interest. Naturally, histograms could be used to support query optimization and efficient utilization of computing resources in the cloud. Histograms can provide helpful reference information for generating optimal query plans, and basic statistics useful for balancing the load of query processing in the cloud. Since it is too expensive to construct an exact histogram on massive data, building an approximate histogram is a more feasible solution. This problem, however, is challenging in the cloud environment because of the special data organization and processing mode in the cloud. In this paper, we present HEDC, a Histogram Estimator for Data in the Cloud. We design a histogram estimation workflow based on an extended MapReduce framework, and propose novel sampling mechanisms to balance sampling efficiency and estimation accuracy. We experimentally validate our techniques on Hadoop, and the results demonstrate that HEDC can provide promising histogram estimates for massive data in the cloud.
AB - With the increasing popularity of cloud-based data management, improving query performance in the cloud is an urgent problem. Summaries of data distribution and statistical information have been commonly used in traditional databases to support query optimization, and histograms are of particular interest. Naturally, histograms could be used to support query optimization and efficient utilization of computing resources in the cloud. Histograms can provide helpful reference information for generating optimal query plans, and basic statistics useful for balancing the load of query processing in the cloud. Since it is too expensive to construct an exact histogram on massive data, building an approximate histogram is a more feasible solution. This problem, however, is challenging in the cloud environment because of the special data organization and processing mode in the cloud. In this paper, we present HEDC, a Histogram Estimator for Data in the Cloud. We design a histogram estimation workflow based on an extended MapReduce framework, and propose novel sampling mechanisms to balance sampling efficiency and estimation accuracy. We experimentally validate our techniques on Hadoop, and the results demonstrate that HEDC can provide promising histogram estimates for massive data in the cloud.
KW - Cloud computing
KW - Histogram estimate
KW - MapReduce
KW - Sampling
UR - https://www.scopus.com/pages/publications/84870506246
U2 - 10.1145/2390021.2390032
DO - 10.1145/2390021.2390032
M3 - Conference contribution
SN - 9781450317085
T3 - International Conference on Information and Knowledge Management, Proceedings
SP - 51
EP - 58
BT - CloudDB'12 - Proceedings of the 3rd ACM International Workshop on Cloud Data Management, Co-located with CIKM 2012
Y2 - 29 October 2012
ER -