FiRST: Finetuning Router-Selective Transformers for Input-Adaptive Latency Reduction
Akriti Jain, Saransh Sharma, Koyel Mukherjee, Soumyabrata Pal

TL;DR
FiRST is an input-adaptive layer-skipping method for transformer-based LLMs that reduces inference latency while maintaining or improving quality, which makes it suited to resource-constrained environments.
Contribution
The paper proposes a model-agnostic, input-adaptive layer selection algorithm that remains compatible with KV caching and improves the latency-quality trade-off in large language models.
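The idea of input-adaptive layer selection can be illustrated with a toy sketch. This is not the authors' implementation: the mean pooling, the linear router with a sigmoid, the 0.5 threshold, and the names `mean_pool`, `router_score`, and `run_with_routers` are all assumptions made for illustration. The sketch decides once per input sequence whether each layer runs, which is one plausible way to keep every token on the same set of layers so that cached keys and values stay consistent.

```python
import math
from typing import Callable, List, Sequence, Tuple

Vector = List[float]
Layer = Callable[[Vector], Vector]   # toy stand-in for a transformer block
Router = Tuple[Vector, float]        # hypothetical linear router: (weights, bias)


def mean_pool(hidden: Sequence[Vector]) -> Vector:
    """Average the token vectors of one sequence (an assumed pooling choice)."""
    dim = len(hidden[0])
    return [sum(tok[i] for tok in hidden) / len(hidden) for i in range(dim)]


def router_score(hidden: Sequence[Vector], router: Router) -> float:
    """Sigmoid of a linear function of the pooled sequence representation."""
    weights, bias = router
    logit = sum(w * x for w, x in zip(weights, mean_pool(hidden))) + bias
    return 1.0 / (1.0 + math.exp(-logit))


def run_with_routers(
    hidden: Sequence[Vector],
    layers: Sequence[Layer],
    routers: Sequence[Router],
    threshold: float = 0.5,          # assumed cut-off, not taken from the paper
) -> Tuple[List[Vector], List[int]]:
    """Run only the layers whose router score reaches the threshold."""
    state = [list(tok) for tok in hidden]
    executed: List[int] = []
    for idx, (layer, router) in enumerate(zip(layers, routers)):
        if router_score(state, router) >= threshold:
            state = [layer(tok) for tok in state]
            executed.append(idx)
        # Otherwise the layer is skipped and the state passes through unchanged.
    return state, executed


def make_layer(delta: float) -> Layer:
    """Toy layer that adds a constant to every coordinate."""
    return lambda tok: [x + delta for x in tok]


toy_layers = [make_layer(1.0), make_layer(10.0), make_layer(100.0)]
toy_routers = [([1.0, 1.0], 0.0), ([-1.0, -1.0], 0.0), ([0.0, 0.0], 5.0)]
toy_prompt = [[1.0, 2.0], [3.0, 4.0]]

output, executed_layers = run_with_routers(toy_prompt, toy_layers, toy_routers)
print(executed_layers)  # [0, 2]: the middle layer is skipped for this input
```

In this toy setup a different prompt can change which layers run, which is what distinguishes an input-adaptive scheme from a fixed, input-agnostic skipping pattern.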
Findings
Significantly reduces inference latency.
Outperforms existing layer skipping strategies in quality metrics.
Maintains or improves model performance with adaptive layer selection.
Abstract
Autoregressive Large Language Models (LLMs) demonstrate remarkable performance across different domains such as vision and language processing. However, due to sequential processing through a stack of transformer layers, autoregressive decoding faces significant computation and latency challenges, particularly in resource-constrained environments like mobile and edge devices. Existing approaches in the literature that aim to improve latency by skipping layers come in two distinct flavors: 1) early exit, and 2) input-agnostic heuristics, where tokens exit at pre-determined layers irrespective of the input sequence. Both strategies have limitations: the former cannot handle the KV caching necessary for speed-ups in modern frameworks, and the latter does not capture the variation in layer importance across tasks or, more generally, across input sequences. To address both limitations,…
Peer Reviews
Decision: ICLR 2025 Conference Withdrawn Submission
- The writing is clear.
- FiRST is evaluated on two different tasks: machine translation and summarization.

- An ablation study is missing.
- Evaluation is insufficient. (1) More than one dataset should be evaluated for each task (machine translation and summarization). (2) Besides the Llama-3-8B base model, more model architectures should be evaluated. (3) The baselines are the base model (Llama-3-8B) with and without LoRA fine-tuning, which is not sufficient. In addition, besides unified skipping, more related methods should be compared with the proposed method.
- Figures should be plotte

1. The paper adopts the idea of using a router to skip specific layers for efficient LLM inference while maintaining performance.
2. The paper adopts the LoRA module to maintain the model's performance on different tasks.
3. The paper is well-written and easy to follow.

1. Experiments were only conducted on fine-tuned datasets. No zero-shot performance is shown for the proposed method, for example, PPL and accuracy on the `lm-evaluation-harness` benchmark.
2. Baselines are lacking: FiRST is compared only with Unified Layer Skipping. Beyond layer skipping, there are many methods targeting LLM inference efficiency, e.g., [1][2][3].
3. The overhead of the proposed method is not discussed.
[1] SLEB: Streamlining LLMs through Redundancy Verification and Elimination of

1. The idea of using routers to dynamically skip layers based on input characteristics can help address the latency issues prevalent in deploying LLMs on resource-constrained devices.
2. The use of LoRA adapters for fine-tuning while skipping layers helps mitigate the potential degradation in model performance.

1. The method is evaluated only on the Llama-3-8B model and on classical machine translation and summarization tasks, omitting newer, more challenging benchmarks such as commonsense reasoning, MMLU, and BIG-bench hard, which are critical for evaluating the generality of LLMs.
2. Given that current popular LLM benchmarks often contain many subtasks, the proposed method's adaptability and generalization across such varied tasks might be limited.
3. The necessity for a two-step training proc

1. This paper proposes a router for layer selection, which allows fine-tuning to optimize the selection strategy.
2. LoRA is adopted for further optimization, which proves helpful for optimizing the selection strategy.

1. The paper's novelty is limited. Layer-wise redundancy has been explored by many works [1] [2] [3], but none of those works is included for comparison in the results tables.
2. The proposed routers require additional parameters and a corresponding training process, which is less efficient and effective than those works [1] [2] [3], which require no additional parameters or training. Meanwhile, router selection has already been adopted for DiT models in work [4].
3. The meth
Taxonomy
Topics: Network Packet Processing and Optimization · Algorithms and Data Compression · Speech Recognition and Synthesis
Methods: Balanced Selection
