POSH: A Data-Aware Shell
Published in Proceedings of the 2020 USENIX Annual Technical Conference (ATC '20), June 2020.
Abstract
We present POSH, a framework that accelerates shell applications with I/O-heavy components, such as data analytics with command-line utilities. Remote storage such as networked filesystems can severely limit the performance of these applications: data makes a round trip over the network for relatively little computation at the client. Reducing the data movement by moving the code to the data can improve performance.
POSH automatically optimizes unmodified I/O-intensive shell applications running over remote storage by offloading the I/O-intensive portions to proxy servers closer to the data. A proxy can run directly on a storage server, or on a machine closer to the storage layer than the client. POSH intercepts shell pipelines and uses metadata called annotations to decide where to run each command within the pipeline. We address three principal challenges with three components: an annotation language that allows POSH to understand which files a command will access, a scheduling algorithm that places commands to minimize data movement, and a system runtime that executes a distributed schedule while retaining local semantics.
We benchmark POSH on real shell pipelines such as image processing, network security analysis, log analysis, distributed system debugging, and git. We find that POSH provides speedups ranging from 1.6x to 15x compared to NFS, without requiring any modifications to the applications.
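As a rough illustration of the placement idea described above, the Python sketch below picks, for a simple linear pipeline, how many leading commands to run on a proxy so that the fewest bytes cross the network. It is not POSH's scheduler or annotation language: the Command class, the output_ratio field, and the choose_cut function are hypothetical names introduced here, and the per-command size ratios are assumed inputs rather than values POSH is known to derive.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Command:
    name: str
    # Hypothetical estimate: bytes written per byte read by this command.
    output_ratio: float


def choose_cut(pipeline: List[Command], input_bytes: float) -> Tuple[int, float]:
    """Return (k, bytes_moved): run the first k commands near the data.

    Assumes the input lives on remote storage and the final output is
    needed at the client, so exactly one intermediate stream crosses
    the network. k = 0 means everything runs at the client.
    """
    best_k, best_bytes = 0, input_bytes
    size = input_bytes
    for k, cmd in enumerate(pipeline, start=1):
        size *= cmd.output_ratio  # estimated stream size after command k
        if size < best_bytes:     # strict '<' prefers offloading fewer commands on ties
            best_k, best_bytes = k, size
    return best_k, best_bytes


# Example: cat logfile | grep ERROR | sort, with made-up ratios.
pipeline = [Command("cat", 1.0), Command("grep", 0.01), Command("sort", 1.0)]
k, moved = choose_cut(pipeline, 1_000_000)
offloaded = [c.name for c in pipeline[:k]]
```

With these made-up ratios the sketch keeps cat and grep on the proxy and ships only the filtered stream to the client. Placing commands across several storage servers, or in pipelines that are not linear, would need a more general search than this single cut point.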
BibTeX entry
@inproceedings{posh-atc20,
  author    = {Deepti Raghavan and Sadjad Fouladi and Philip Levis and Matei Zaharia},
  title     = {{POSH: A Data-Aware Shell}},
  booktitle = {{Proceedings of the 2020 USENIX Annual Technical Conference (ATC '20)}},
  year      = {2020},
  month     = {June}
}