Songwei Liu
MLSys Optimization Expert · ByteDance

I build efficient LLM/AIGC inference systems for edge-cloud computing platforms.

Songwei Liu is an MLSys optimization expert in the Data-AML Heterogeneous Hardware team at ByteDance. He obtained his bachelor's degree from Huazhong University of Science and Technology, and his master's degree from Zhejiang University.

His research focuses on efficient model architecture design and foundation model training, algorithm/model optimization and software-hardware co-optimization, and inference optimization for multi-end heterogeneous platforms.

Efficient AIGC

Quantization/sparsity-driven software-hardware co-optimization, cache/MoE-token/resolution compression, and efficient foundation model training.

Efficient LLM

Quantized and sparse inference/training, speculative decoding, long-context acceleration, and deployment-oriented compression.

Heterogeneous Inference

Long-context inference systems, agentic workload serving, KVCache systems, and multi-end edge-cloud deployment.

At ByteDance, Songwei Liu leads a model optimization team that provides post-training optimization, algorithm/model optimization, and software-hardware co-optimization for Seedance, Seedream, and Volcengine open-source LLM/VLM models, substantially reducing cloud inference costs for these model families.

His academic work spans ICML, ICLR, ACL, AAAI, IJCNLP-AACL, ACM-MM, CVPRW, and Nature Communications, with a focus on practical efficiency methods that transfer from papers to production systems.

He is interested in academic cooperation around efficient AIGC/LLM systems, foundation model optimization, and software-hardware co-design. His team regularly recruits interns; interested candidates can apply through the ByteDance referral link or contact him by email.

03 / News

News

May 2026

MotionCache is accepted by ICML 2026.

Apr 2026

TCEC is accepted by ICML 2026 as a Spotlight + Oral paper (top 0.7%).

Apr 2026

S2O is accepted by ACL 2026 as an Oral paper.

Mar 2026

DreamLite is accepted by ECCV 2026; it is a SOTA on-device unified image generation and editing model.

Dec 2025

GQSA is published at IJCNLP-AACL 2025 and receives a Best Paper Honorable Mention.

Aug 2025

ERTACache is accepted by ICLR 2026.

Dec 2024

ABQ-LLM is accepted by AAAI 2025.

Jul 2023

UOE is accepted by ACM-MM 2023.

May 2022

RLFN is accepted by CVPRW 2022.

Mar 2022

RLFN won the championship at the NTIRE 2022 Efficient Super-Resolution Challenge.

May 2021

Songwei Liu joined ByteDance as an AI Infra Engineer in Shanghai, China.

04 / Publications

Selected Publications

Google Scholar snapshot on 2026-06-06: 17 publications, 562 citations, h-index 9.

2026
ECCV 2026

DreamLite: A Lightweight On-Device Unified Model for Image Generation and Editing

Kailai Feng, Yuxiang Wei, Bo Chen, Yang Pan, Hu Ye, Songwei Liu, Chenqian Yan, Yuan Gao

Code

Status: ECCV 2026 · Topic: on-device image generation and editing

Presents a SOTA on-device unified model for image generation and editing, targeting practical mobile deployment with strong quality-efficiency trade-offs.

ICML 2026 Spotlight+Oral

TCEC: Error Propagation Mechanisms and Compensation Strategies for Quantized Diffusion

Songwei Liu, Chao Zeng, Chenqian Yan, Xurui Peng, Xing Wang, Fangmin Chen, Xing Mei

Status: ICML 2026 Spotlight + Oral, Top 0.7% · Topic: diffusion quantization

Studies how quantization errors propagate across diffusion timesteps and proposes timestep-aware compensation strategies for efficient low-bit generation.

ICML 2026

Motion-Aware Caching for Efficient Autoregressive Video Generation

Jing Xu, Yuexiao Ma, Xuzhe Zheng, Xing Wang, Shiwei Liu, Chenqian Yan, Xiawu Zheng, Rongrong Ji, Fei Chao, Songwei Liu

Status: ICML 2026 · Topic: video generation cache reuse

Uses motion-aware token update scheduling to reduce redundant computation in autoregressive video generation while preserving temporal quality.

ACL 2026 Oral

S2O: Early Stopping for Sparse Attention via Online Permutation

Yu Zhang*, Songwei Liu*, Chenqian Yan, Sheng Lin, Beichen Ning, Fangmin Chen, Xing Wang

Status: ACL 2026 Oral · Role: co-first author; Project Lead (LD) · * equal contribution

Introduces online permutation and early-stopping mechanisms for sparse attention, reducing attention computation while keeping model quality stable.

ICLR 2026

ERTACache: Error Rectification and Timesteps Adjustment for Efficient Diffusion

Xurui Peng, Chenqian Yan, Hong Liu, Rui Ma, Fangmin Chen, Xing Wang, Zhihua Wu, Songwei Liu†, Mingbao Lin

Status: ICLR 2026 · Role: co-corresponding author (marked †)

Combines timestep adjustment with online error rectification to make diffusion cache reuse more robust under aggressive acceleration settings.

2025
arXiv 2025

Seedance 1.5 Pro: A Native Audio-Visual Joint Generation Foundation Model

Team Seedance, Heyi Chen, Siyan Chen, et al., Songwei Liu, et al.

Status: arXiv 2025 · Topic: audio-visual generation foundation model

Reports a native audio-visual joint generation foundation model and the production-oriented optimization stack behind efficient deployment.

05 / Projects

Projects

Production-facing optimization work across AIGC system-algorithm co-design, model optimization, and edge-cloud inference systems.

AIGC System-Algorithm Co-design

Seedance / Seedream Inference and Training Optimization

Led algorithm optimization and software-hardware co-optimization for Seedance 1.0-2.0 and Seedream 4.0-5.0 on heterogeneous NPU/GPU hardware, including non-NVIDIA backends.

Inference

Designed quantization/sparsity algorithms and operator stacks compatible with dynamic LoRA and distributed FSDP/TP/EP architectures. These supported the production migration of Seedance and Seedream from full BF16 to INT8/FP8, and then to full INT4/MXFP4 online deployment.

Training

Designed hierarchical quantized training strategies and rebuilt the FSDP communication path around quantized weights to reduce distributed training communication overhead. This was the first production deployment of quantized training for ByteDance generative models.

AIGC Algorithm Model Optimization

Cache/MoE/Token Compression and Distillation

Developed cache reuse methods for diffusion and autoregressive generation, including timestep correction, offline policy search, online error rectification, and motion-aware token update scheduling.

Built lightweight model optimization pipelines for DynamicRes, 2D/3D VAE compression, and distillation-oriented generative model deployment across image/video generation scenarios.

Taken together, these model-compression capabilities further accelerate low-NFE (number of function evaluations) step-distilled models by 35% to 50% at inference time.

Edge-Cloud Collaborative Inference

Lightweight Foundation Models, Extreme Compression, and Efficient Engines

Lightweight foundation models: developed SOTA lightweight LLM/VLM foundation models that are scheduled for open-source release, the lightweight unified generation-editing model DreamLite, and the edge-cloud inference framework HybridSD.

Extreme model compression: built ultra-low-bit quantization solutions for edge-side NPU/GPU platforms, achieving lossless inference at an equivalent 2-bit precision while supporting products used by billions of users.

Inference engine: participated in designing the architecture of ByteNN-LLM, an on-device LLM/AIGC inference engine. Its 1+N on-device serving architecture lets a single foundation model support multiple business needs. Also delivered the industry's first PC-CUDA arbitrary-precision quantized inference solution.

06 / Skills

Skills

Languages

Python, C++, CUDA C/PTX

MLSys / AIGC / LLM

System-algorithm co-design, model compression, PTQ/QAT, sparse and quantized kernels, cache reuse, distributed inference/training optimization.

Frameworks

vLLM, CUTLASS, Triton, distributed serving/training stacks, heterogeneous NPU/GPU deployment toolchains.

Kernels

GEMM, Attention, Dense/Sparse operator tuning with MMA/WMMA and PTX assembly, quantized GEMM for compute- and memory-bound workloads.

07 / Background

Background

Education
2018.09 - 2021.03

Master, Zhejiang University, Hangzhou, Zhejiang.

2014.09 - 2018.06

Bachelor, Huazhong University of Science and Technology, Wuhan, Hubei.

2011.09 - 2014.06

Shangqiu No.1 Senior Middle School, Shangqiu, Henan.

Internships
2020.06 - 2020.09

HikVision Research Center, Hangzhou.

2019.08 - 2020.06

Tencent JARVIS Research Center, network architecture search, Shenzhen.

2019.04 - 2019.07

FaBu, autonomous driving and model compression, Hangzhou.

08 / Invited Talks

Invited Talks

2024.11

Quantization and Sparsity Optimization for AIGC Models

Public presentation at ML-Summit 2024.

09 / Contact

Get in Touch

If you are seeking academic cooperation, invited talks, or technical discussion around efficient AIGC/LLM systems, the best way to reach Songwei Liu is via email.

Current Collaboration Interests

Open to research collaborations on efficient foundation model training, AIGC/LLM inference optimization, cache reuse, sparse/quantized computation, and software-hardware co-design.

Efficient AIGC foundation models
Long-context and agentic serving systems
Edge-cloud collaborative inference