ALTO: An Efficient Network Orchestrator for Compound AI Systems
Keshav Santhanam, Deepti Raghavan, Muhammad Shahir Rahman, Thejas Venkatesh, Neha Kunjal, Pratiksha Thaker, Philip Levis, and Matei Zaharia
Published in EuroMLSys '24: Proceedings of the 4th Workshop on Machine Learning and Systems, April 2024.
Abstract
We present ALTO, a network orchestrator for efficiently serving compound AI systems such as pipelines of language models. ALTO exploits an optimization opportunity specific to generative language models: streaming intermediate outputs from a language model to downstream stages as they are produced. We highlight two challenges that emerge when serving these applications at scale: some stages are stateful across partial outputs, and language models produce variable amounts of text. To address these challenges, we motivate the need for an aggregation-aware routing interface and distributed prompt-aware scheduling. On a complex chatbot verification pipeline, ALTO's partial output streaming increases throughput by up to 3x at a fixed latency target of 4 seconds per request and reduces tail latency by 1.8x compared to a baseline serving approach.
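As an illustration of the streaming idea described in the abstract, the following is a minimal Python sketch (hypothetical stage names, not ALTO's actual interface): an upstream language-model stage yields tokens as they are generated, and a downstream stage consumes those partial outputs immediately rather than waiting for the complete response.

    from typing import Iterator

    def language_model_stage(prompt: str) -> Iterator[str]:
        # Simulate token-by-token generation from a language model.
        for token in ["The", " answer", " is", " 42", "."]:
            yield token  # each partial output is forwarded downstream as soon as it exists

    def downstream_stage(tokens: Iterator[str]) -> str:
        # Consume partial outputs incrementally, e.g. for filtering or verification.
        pieces = []
        for token in tokens:
            pieces.append(token)  # process each token on arrival, not after completion
        return "".join(pieces)

    print(downstream_stage(language_model_stage("What is the answer?")))  # "The answer is 42."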
Paper (608KB)
BibTeX entry
@inproceedings{euromlsys24-alto,
  author    = "Keshav Santhanam and Deepti Raghavan and Muhammad Shahir Rahman and Thejas Venkatesh and Neha Kunjal and Pratiksha Thaker and Philip Levis and Matei Zaharia",
  title     = "{ALTO: An Efficient Network Orchestrator for Compound AI Systems}",
  booktitle = "{EuroMLSys '24: Proceedings of the 4th Workshop on Machine Learning and Systems}",
  year      = {2024},
  month     = {April}
}