publications | Yanying Lin (林彦颖)

2026

EuroSys
FlexPipe: Adapting Dynamic LLM Serving Through Inflight Pipeline Refactoring in Fragmented Serverless Clusters

Yanying Lin, Shijie Peng, Chengzhi Lu, and 2 more authors

In Proceedings of the 21st European Conference on Computer Systems, 2026

Insight: We find that for pipeline inference, the optimal online pipeline structure varies greatly depending on the incoming request distribution. Longer pipelines can absorb load spikes by dispersing demand, while shorter pipelines reduce communication overhead. Dynamically adjusting the pipeline structure according to the request distribution is an effective approach in serverless environments.
@inproceedings{lin2026flexpipe,
  title = {FlexPipe: Adapting Dynamic LLM Serving Through Inflight Pipeline Refactoring in Fragmented Serverless Clusters},
  author = {Lin, Yanying and Peng, Shijie and Lu, Chengzhi and Xu, Chengzhong and Ye, Kejiang},
  booktitle = {Proceedings of the 21st European Conference on Computer Systems},
  year = {2026},
}

2025

SoCC
Understanding Diffusion Model Serving in Production: A Top-Down Analysis of Workload, Scheduling, and Resource Efficiency

Yanying Lin, Shuaipeng Wu, Shutian Luo, and 8 more authors

In Proceedings of the 2025 ACM Symposium on Cloud Computing, 2025

Insight: The insights in this work come from production systems. We examine the performance of the online inference, middleware, and hardware layers, and how the three interact. We find that skewed request distributions stress the network and cache layers, resulting in poor resource locality and reduced system performance.
@inproceedings{lin2025understanding,
  title = {Understanding Diffusion Model Serving in Production: A Top-Down Analysis of Workload, Scheduling, and Resource Efficiency},
  author = {Lin, Yanying and Wu, Shuaipeng and Luo, Shutian and Xu, Hong and Shen, Haiying and Ma, Chong and Shen, Min and Chen, Le and Xu, Chengzhong and Qu, Lin and others},
  booktitle = {Proceedings of the 2025 ACM Symposium on Cloud Computing},
  pages = {1--15},
  year = {2025},
}

Cluster
ROCK: Serving Multimodal Model in Cloud with Heterogeneous-Aware Resource Orchestration for Thousands of LoRA Adapters

Shuaipeng Wu, Yanying Lin*, Shijie Peng, and 4 more authors

In Proceedings of the 2025 IEEE International Conference on Cluster Computing, 2025

Insight: This work analyzes how different request workloads perform in production-level environments and proposes a heterogeneous graph matching algorithm that adapts to the resource characteristics and inference performance of each workload. ROCK is a system-level resource orchestration framework that optimizes the deployment of thousands of LoRA adapters in the cloud through resource allocation and workload-aware scheduling.
@inproceedings{wu2025rock,
  title = {ROCK: Serving Multimodal Model in Cloud with Heterogeneous-Aware Resource Orchestration for Thousands of LoRA Adapters},
  author = {Wu, Shuaipeng and Lin, Yanying and Peng, Shijie and Li, Yanbo and Chen, Wenyan and Xu, Chengzhong and Ye, Kejiang},
  booktitle = {Proceedings of the 2025 IEEE International Conference on Cluster Computing},
  year = {2025},
}

IEEE TSC
Serving LLM in Distributed GPU Cluster with Fine-Grain Pipeline Constraints

Yanying Lin, Shijie Peng, Shuaipeng Wu, and 4 more authors

IEEE Transactions on Services Computing, 2025

Insight: LLMs are increasingly deployed in distributed GPU clusters using pipeline parallelism for efficient inference. However, existing approaches treat the entire pipeline as a monolithic unit, resulting in coarse-grained resource management and suboptimal SLO compliance. Each pipeline stage has distinct workload characteristics, but current systems cannot leverage this for better performance control. We propose fine-grained pipeline constraints that manage each stage independently, enabling precise SLO allocation at the stage level.
@article{lin2025serving,
  title = {Serving LLM in Distributed GPU Cluster with Fine-Grain Pipeline Constraints},
  author = {Lin, Yanying and Peng, Shijie and Wu, Shuaipeng and Li, Yanbo and Lu, Chengzhi and Ye, Kejiang and Xu, Chengzhong},
  journal = {IEEE Transactions on Services Computing},
  year = {2025},
}

2024

ICDCS
Quart: Latency-Aware FaaS System for Pipelining Large Model Inference

Yanying Lin, Yanbo Li, Shijie Peng, and 5 more authors

In Proceedings of the 44th IEEE International Conference on Distributed Computing Systems, 2024

Insight: We find that when using pipeline inference for LLMs, each stage exhibits significant load differences due to varying resource allocation and throughput across different workloads. For example, under stable or bursty traffic patterns, throughput disparities between upstream and downstream stages can create performance-critical bottlenecks in the pipeline.
@inproceedings{lin2024quart,
  title = {Quart: Latency-Aware FaaS System for Pipelining Large Model Inference},
  author = {Lin, Yanying and Li, Yanbo and Peng, Shijie and Tang, Yingfei and Luo, Shutian and Shen, Haiying and Xu, Chengzhong and Ye, Kejiang},
  booktitle = {Proceedings of the 44th IEEE International Conference on Distributed Computing Systems},
  year = {2024},
}

ICWS

Plank: Optimizing LLM Inference Performance in Pipeline Parallelism with Fine-Grained SLO Constraint

Yanying Lin, Shijie Peng, Shuaipeng Wu, and 4 more authors

In Proceedings of the 31st IEEE International Conference on Web Services, 2024

Insight: This work proposes preliminary ideas for SLO allocation in LLM pipeline parallelism. We explore fine-grained SLO constraint mechanisms to optimize inference performance by intelligently distributing service level objectives across different pipeline stages, enabling more efficient resource utilization and better meeting diverse application requirements.

IEEE TCC

Understanding Serverless Inference in Mobile-Edge Networks: a Benchmark Approach

Junhong Chen, Yanying Lin*, Shijie Peng, and 5 more authors

IEEE Transactions on Cloud Computing, 2024

2023

ICPADS

FLASH: Low-Latency Serverless Model Inference with Multi-Core Parallelism in Edge

Yanbo Li, Yanying Lin*, Shijie Peng, and 4 more authors

In Proceedings of the 29th IEEE International Conference on Parallel and Distributed Systems, 2023

2022

HPCC

ESBench: Understanding Deep Learning Inference Overheads for Edge Serverless

Yanying Lin*, Junhong Chen, Yang Wang, and 1 more author

In Proceedings of the 24th IEEE International Conference on High Performance Computing and Communications, 2022

ISPA

System-level Implications of Serverless: Workload Characterizing and Performance Understanding

Deshi Deng, Yanying Lin*, and Kejiang Ye

In Proceedings of the 20th IEEE International Symposium on Parallel and Distributed Processing with Applications, 2022

IEEE TSC

Serverless Computing: State-of-the-Art, Challenges and Opportunities

Yongkang Li, Yanying Lin, Kejiang Ye, and 2 more authors

IEEE Transactions on Services Computing, 2022

Insight: This work presents a comprehensive survey of serverless computing, providing insights into the current state-of-the-art and identifying key research challenges and future opportunities. We systematically analyze the serverless paradigm from multiple perspectives including architecture, performance, security, and application domains. The survey contributes to the community by outlining promising research directions.

2021

CLOUD

BBServerless: Burst Traffic Benchmark for Serverless

Yanying Lin, Kejiang Ye, and Chengzhong Xu

In Proceedings of the 14th IEEE International Conference on Cloud Computing, 2021

Insight: This work presents an early-stage performance benchmark for serverless computing, evaluating a range of emerging workloads on open-source FaaS frameworks. The primary objective is to identify which serverless architectures are best suited to deployment in private cloud environments. It is a laboratory-level exploratory study of how different FaaS platforms perform under diverse workload conditions.

HPCC

PEAN: A Packet-Level End-To-End Attentive Network for Encrypted Traffic Identification

Peng Lin, Yishen Hu, Yanying Lin, and 2 more authors

In Proceedings of the 23rd IEEE International Conference on High Performance Computing and Communications, 2021

IEEE/ACM ToN

A Novel End-to-end Deep Learning Framework for Encrypted Traffic Identification

Peng Lin, Kejiang Ye, Yishen Hu, and 2 more authors

IEEE/ACM Transactions on Networking, 2021

2020

ICPADS

LBNN: Perceives the Change of Core Telecommunications Network State via Linear Bayesian Neural Network

Yanying Lin, Kejiang Ye, and Chengzhong Xu

In Proceedings of the 26th IEEE International Conference on Parallel and Distributed Systems, 2020