Streamlining the Collaborative Chain of Models into A Single Forward Pass in Generation-Based Tasks
Yuanjie Lyu, Chao Zhang, Yuhao Chen, Yong Chen, Tong Xu

TL;DR
This paper introduces FTHSS, a prompt-tuning method that enables sharing of KV hidden states across models, reducing redundant computations and resource usage in generation tasks.
Contribution
FTHSS allows models to share KV hidden states through prompt tuning, eliminating the need for recomputation and improving efficiency in model chains.
Findings
FTHSS matches the performance of traditional model chains on four tasks.
It reduces inference time and KV cache storage.
It enables efficient multi-round generation.
Abstract
In Retrieval-Augmented Generation (RAG) and agent-based frameworks, the "Chain of Models" approach is widely used, where multiple specialized models work sequentially on distinct sub-tasks. This approach is effective but increases resource demands as each model must be deployed separately. Recent advancements attempt to address this by applying prompt tuning, which allows a shared base model to adapt to multiple tasks with minimal parameter changes. However, a key challenge remains: intermediate outputs, passed between models as plain text, require recomputation of hidden states (i.e., Key and Value (KV) states in Transformers) during inference. In this paper, we introduce FTHSS, a novel prompt-tuning method that enables models to share KV hidden states, eliminating redundant forward passes and reducing KV cache storage. By modifying input and attention masks during training, FTHSS…
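To make the recomputation cost concrete, the following is a minimal NumPy sketch of the general idea of reusing cached Key and Value states instead of re-encoding a prefix. It is not the authors' FTHSS implementation: it uses a single attention layer with one shared set of weights, and it does not model prompt tuning or the training-time changes to inputs and attention masks. All names (`causal_attention`, `prefix`, `suffix`, `k_cache`, `v_cache`) are hypothetical and chosen only for illustration.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def causal_attention(q, k, v, offset):
    """Single-head causal attention.

    q holds queries for absolute positions offset..offset+n_q-1;
    k and v hold keys/values for absolute positions 0..n_k-1.
    """
    n_q, d = q.shape
    n_k = k.shape[0]
    scores = q @ k.T / np.sqrt(d)
    q_pos = offset + np.arange(n_q)[:, None]
    k_pos = np.arange(n_k)[None, :]
    # A query may only attend to keys at or before its own position.
    scores = np.where(k_pos <= q_pos, scores, -np.inf)
    return softmax(scores) @ v

rng = np.random.default_rng(0)
d = 8
W_q, W_k, W_v = (rng.standard_normal((d, d)) for _ in range(3))

# Toy embeddings: tokens already processed by an earlier stage (prefix)
# and new tokens handled by the next stage (suffix).
prefix = rng.standard_normal((5, d))
suffix = rng.standard_normal((3, d))

# Text hand-off: the next stage re-encodes the whole sequence,
# recomputing K and V for every prefix token.
full = np.concatenate([prefix, suffix])
out_recompute = causal_attention(full @ W_q, full @ W_k, full @ W_v, offset=0)
out_recompute = out_recompute[len(prefix):]

# KV hand-off: the next stage reuses the prefix K/V computed earlier
# and only projects the new suffix tokens.
k_cache, v_cache = prefix @ W_k, prefix @ W_v
k_all = np.concatenate([k_cache, suffix @ W_k])
v_all = np.concatenate([v_cache, suffix @ W_v])
out_shared = causal_attention(suffix @ W_q, k_all, v_all, offset=len(prefix))

outputs_match = np.allclose(out_recompute, out_shared)
```

In this toy setting both paths produce the same outputs for the new tokens, while the second path skips the Key and Value projections for the prefix. The equivalence holds here only because both stages use identical weights and inputs; with task-specific prompts, the stages would need to be trained so that shared states remain usable, which is the problem the abstract says FTHSS addresses.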
Taxonomy
Topics: Model-Driven Software Engineering Techniques · Business Process Modeling and Analysis · Simulation Techniques and Applications
Methods: Softmax · Attention Is All You Need · Balanced Selection
