RevFFN: Memory-Efficient Full-Parameter Fine-Tuning of Mixture-of-Experts LLMs with Reversible Blocks

Ningyuan Liu; Jing Yang; Kaitong Cai; Keze Wang

arXiv:2512.20920·cs.LG·December 25, 2025

Open Access

TL;DR

RevFFN introduces reversible Transformer blocks for mixture-of-experts (MoE) LLMs. Because each block's input activations can be reconstructed from its outputs during backpropagation, most intermediate activations need not be stored, which makes full-parameter fine-tuning feasible on a single GPU.

Contribution

The paper presents a reversible Transformer block design that reduces activation memory when fine-tuning large language models with a mixture-of-experts architecture.

Findings

01. Reduces peak memory consumption during fine-tuning by avoiding the need to cache most intermediate activations.

02. Enables full-parameter fine-tuning on a single GPU.

03. Maintains the model's expressive capacity while improving memory efficiency.

Abstract

Full-parameter fine-tuning is a key technique for adapting large language models (LLMs) to downstream tasks, but it incurs substantial memory overhead because extensive intermediate activations must be cached for backpropagation. This bottleneck makes full fine-tuning of contemporary large-scale LLMs challenging in practice. Existing distributed training frameworks such as DeepSpeed alleviate this issue using techniques like ZeRO and FSDP, which rely on multi-GPU memory or CPU offloading, but these often require additional hardware resources and reduce training speed. We introduce RevFFN, a memory-efficient fine-tuning paradigm for mixture-of-experts (MoE) LLMs. RevFFN employs carefully designed reversible Transformer blocks that allow layer input activations to be reconstructed from outputs during backpropagation, eliminating the need to store most intermediate activations in memory. While…
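To illustrate what "reconstruction of layer input activations from outputs" means, here is a minimal sketch of a generic additive-coupling reversible block in the style of RevNet. It is not the RevFFN block itself, whose exact design is not given in the text above. The functions `f` and `g` are hypothetical stand-ins for the attention and expert/FFN sub-layers, and the split of the hidden state into two halves `x1`, `x2` is an assumption of this illustration.

```python
import math

def f(v):
    # Hypothetical stand-in for an attention-like sub-layer.
    return [math.tanh(0.5 * a) for a in v]

def g(v):
    # Hypothetical stand-in for an FFN / expert sub-layer.
    return [0.1 * a * a for a in v]

def rev_forward(x1, x2):
    # Additive coupling: each half is updated using only the other half.
    y1 = [a + b for a, b in zip(x1, f(x2))]
    y2 = [a + b for a, b in zip(x2, g(y1))]
    return y1, y2

def rev_inverse(y1, y2):
    # Undo the two updates in reverse order; no stored inputs are needed.
    x2 = [a - b for a, b in zip(y2, g(y1))]
    x1 = [a - b for a, b in zip(y1, f(x2))]
    return x1, x2

x1, x2 = [1.0, -2.0, 0.5], [0.3, 0.7, -1.2]
y1, y2 = rev_forward(x1, x2)
r1, r2 = rev_inverse(y1, y2)

# Reconstruction is exact up to floating-point rounding.
max_err = max(abs(a - b) for a, b in zip(x1 + x2, r1 + r2))
print("max reconstruction error:", max_err)
```

In a training loop built on this idea, the backward pass would call the inverse to recover each block's inputs on the fly and then recompute the sub-layer outputs to obtain gradients. That trades extra compute for not having to keep those activations in memory.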

Taxonomy

Topics: Advanced Neural Network Applications · Generative Adversarial Networks and Image Synthesis · Topic Modeling