Approximate Partition Selection for Big-Data Workloads using Summary Statistics
Published in Proceedings of the VLDB Endowment, July 2020.
Abstract
Many big-data clusters store data in large partitions that support access at a coarse, partition-level granularity. As a result, approximate query processing via row-level sampling is inefficient, often requiring reads of many partitions. In this work, we seek to answer queries quickly and approximately by reading a subset of the data partitions and combining partial answers in a weighted manner without modifying the data layout. We illustrate how to efficiently per- form this query processing using a set of precomputed summary statistics, which inform the choice of partition subset and weights. We develop novel means of using the statistics to assess the similarity and importance of partitions. Our experiments on several datasets and data layouts demonstrate that to achieve the same relative error compared to uniform partition sampling, our techniques offer from 2.7x to 70x reduction in the number of partitions read, and the statistics stored per partition require fewer than 100KB.
Paper (1MB)
BibTeX entry
@inproceedings{approx-vldb20,
author = "Kexin Rong and Yao Lu and Peter Bailis and Srikanth Kandula and Philip Levis",
title = "{Approximate Partition Selection for Big-Data Workloads using Summary Statistics}",
booktitle = "{Proceedings of the VLDB Endowment}",
year = {2020},
month = {July}
}