PARALLEL DATA LAB 

PDL Abstract

Helix: Serving Large Language Models over Heterogeneous GPUs and Network via Max-Flow

ACM International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS), March 30–April 3, 2025, Rotterdam, Netherlands.

Yixuan Mei, Yonghao Zhuang, Xupeng Miao, Juncheng Yang, Zhihao Jia, Rashmi Vinayak

Carnegie Mellon University

http://www.pdl.cmu.edu/

This paper introduces Helix, a distributed system for high-throughput, low-latency large language model (LLM) serving in heterogeneous GPU clusters. The key idea behind Helix is to formulate the inference computation of LLMs over heterogeneous GPUs and network connections as a max-flow problem on a directed, weighted graph whose nodes represent GPU instances and whose edge capacities capture both GPU and network heterogeneity. Helix then uses a mixed integer linear programming (MILP) algorithm to discover highly optimized strategies for serving LLMs on heterogeneous GPUs. This approach allows Helix to jointly optimize model placement and request scheduling, two highly entangled tasks in heterogeneous LLM serving. Our evaluation on several heterogeneous clusters ranging from 24 to 42 GPU nodes shows that Helix improves serving throughput by up to 3.3× and reduces prompting and decoding latency by up to 66% and 24%, respectively, compared to existing approaches. Helix is available at https://github.com/Thesys-lab/Helix-ASPLOS25.
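To make the max-flow view concrete, below is a minimal sketch in Python using networkx. It is not the paper's actual model: the node names, capacity numbers, and two-stage topology are invented for illustration, and the real Helix graph additionally encodes per-layer model placement, which its MILP selects. The sketch shows only the core idea that edge capacities abstract both per-GPU throughput and network bandwidth, so the maximum source-to-sink flow bounds the cluster's aggregate serving throughput.

    # Illustrative only: node names and capacities are invented, not from Helix.
    import networkx as nx

    G = nx.DiGraph()
    # Edges from the source model how fast requests can enter each GPU node;
    # capacities abstract compute throughput (tokens/s) and link bandwidth.
    G.add_edge("source", "gpu_a100", capacity=300)  # fast GPU, fast link
    G.add_edge("source", "gpu_t4", capacity=80)     # slower GPU, slower link
    # Cross-node network links between pipeline stages.
    G.add_edge("gpu_a100", "gpu_l4", capacity=150)
    G.add_edge("gpu_t4", "gpu_l4", capacity=60)
    # Final stages deliver completed tokens to the sink.
    G.add_edge("gpu_a100", "sink", capacity=200)
    G.add_edge("gpu_l4", "sink", capacity=120)

    # The max-flow value bounds aggregate serving throughput; the flow
    # assignment suggests how requests should be routed across nodes.
    flow_value, flow_dict = nx.maximum_flow(G, "source", "sink")
    print("throughput bound:", flow_value)
    print("per-edge flows:", flow_dict)

In the paper's framing, this abstraction is what lets placement and scheduling be optimized jointly: a candidate model placement induces a flow network, and the max flow of that network measures the serving throughput the placement can support.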

KEYWORDS: large language model serving, system for ML, distributed systems, cloud computing

FULL PAPER: pdf