Decoupling the Control Plane from Program Control Flow for Flexibility and Performance in Cloud Computing
Published in Proceedings of the 13th European Conference on Computer Systems (EuroSys '18), April 2018.
Abstract
Existing cloud computing control planes do not scale to more than a few hundred cores, while frameworks without control planes scale but take seconds to reschedule a job. We propose an asynchronous control plane for cloud computing systems, in which a central controller can dynamically reschedule jobs but worker nodes never block on communication with the controller. By decoupling control plane traffic from program control flow in this way, an asynchronous control plane can scale to run millions of computations per second while being able to reschedule computations within milliseconds.
We show that an asynchronous control plane can match the scalability and performance of TensorFlow and MPI-based programs while rescheduling individual tasks in milliseconds. Scheduling an individual task takes 1 µs, so that a 1,152-core cluster can schedule over 120 million tasks/second, and this rate scales linearly with the number of cores. The ability to schedule huge numbers of tasks allows jobs to be divided into very large numbers of tiny tasks, whose improved load balancing can speed up computations by 2.1-2.3x.
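The load-balancing claim can be illustrated with a toy model. The sketch below is not the paper's system or benchmark: the task costs, the worker count, and the helper names `makespan` and `split` are all hypothetical, and it assumes a simple greedy policy in which each task goes to whichever worker becomes free first. It only shows why splitting uneven work into many small tasks can shorten the overall completion time.

```python
import heapq


def makespan(task_costs, num_workers):
    """Greedy list scheduling: assign each task, in order, to the worker
    that becomes free first, and return the time the last worker finishes."""
    finish_times = [0.0] * num_workers
    heapq.heapify(finish_times)
    for cost in task_costs:
        earliest = heapq.heappop(finish_times)
        heapq.heappush(finish_times, earliest + cost)
    return max(finish_times)


def split(task_costs, pieces):
    """Divide every task into `pieces` equal-cost subtasks (an idealized
    assumption: real tasks rarely split perfectly evenly)."""
    return [cost / pieces for cost in task_costs for _ in range(pieces)]


# Hypothetical workload: eight partitions with uneven costs, four workers.
coarse_tasks = [8.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 2.0]
num_workers = 4

coarse_time = makespan(coarse_tasks, num_workers)
fine_time = makespan(split(coarse_tasks, 16), num_workers)

print("coarse-grained makespan:", coarse_time)
print("fine-grained makespan:  ", fine_time)
print("speedup from finer tasks:", coarse_time / fine_time)
```

With these made-up costs, the coarse schedule is limited by the single expensive partition, while the fine-grained schedule finishes close to the ideal of total work divided by the number of workers. The gain depends entirely on how skewed the workload is, and it is only achievable when per-task scheduling overhead is small relative to task length, which is the condition the abstract's 1 µs scheduling cost addresses.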
BibTeX entry
@inproceedings{canary-eurosys18,
  author    = {Hang Qu and Omid Mashayekhi and Chinmayee Shah and Philip Levis},
  title     = {{Decoupling the Control Plane from Program Control Flow for Flexibility and Performance in Cloud Computing}},
  booktitle = {{Proceedings of the 13th European Conference on Computer Systems (EuroSys '18)}},
  year      = {2018},
  month     = {April}
}