Long-Context Aware Upcycling: A New Frontier for Hybrid LLM Scaling
Parsa Ashrafi Fashi, Utkarsh Saxena, Mehdi Rezagholizadeh, Aref Jafari, Akash Haridas, Mingyu Yang, Vansh Bhatia, Guihong Li, Vikram Appia, Emad Barsoum

TL;DR
This paper introduces HyLo, a method for converting pretrained Transformer LLMs into hybrid models that extend usable context length and reduce KV-cache memory, making long-context decoding practical.
Contribution
HyLo provides a practical, staged approach for upcycling pretrained Transformers into hybrid architectures with enhanced long-context capabilities and reduced memory usage.
Findings
Extends context length up to 32 times with post-training adaptation.
Reduces KV-cache memory by over 90%, enabling 2M-token decoding.
Outperforms state-of-the-art upcycled hybrid models on long-context benchmarks.
Abstract
Hybrid sequence models that combine efficient Transformer components with linear sequence modeling blocks are a promising alternative to pure Transformers, but most are still pretrained from scratch and therefore fail to reuse existing Transformer checkpoints. We study upcycling as a practical path to convert pretrained Transformer LLMs into hybrid architectures while preserving short-context quality and improving long-context capability. We call our solution HyLo (HYbrid LOng-context): a long-context upcycling recipe that combines architectural adaptation with efficient Transformer blocks, Multi-Head Latent Attention (MLA), and linear blocks (Mamba2 or Gated DeltaNet), together with staged long-context training and teacher-guided distillation for stable optimization. HyLo extends usable context length by up to 32× through efficient post-training and reduces KV-cache…
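To make the KV-cache claim concrete, here is a rough back-of-the-envelope sketch of how cache size scales when most attention layers are replaced by linear blocks and the remaining ones cache a compressed latent. Every number below (layer counts, head sizes, latent width) is a hypothetical chosen for illustration and is not taken from the paper; the helper name `kv_cache_bytes` is also an assumption. The sketch ignores the fixed-size recurrent state of the linear blocks, which does not grow with sequence length.

```python
def kv_cache_bytes(num_attn_layers, elems_per_token_per_layer, seq_len, bytes_per_elem=2):
    """Cache size in bytes for layers whose cache grows linearly with seq_len."""
    return num_attn_layers * elems_per_token_per_layer * seq_len * bytes_per_elem


SEQ_LEN = 2_000_000  # decoding length mentioned in the findings

# Hypothetical baseline Transformer: 32 attention layers,
# 8 KV heads of dimension 128, storing both K and V per token.
baseline_elems = 2 * 8 * 128
baseline = kv_cache_bytes(32, baseline_elems, SEQ_LEN)

# Hypothetical hybrid: only 8 attention layers remain, each caching
# a 512-dim latent plus 64 positional dims per token (MLA-style).
hybrid_elems = 512 + 64
hybrid = kv_cache_bytes(8, hybrid_elems, SEQ_LEN)

reduction = 1 - hybrid / baseline

print(f"baseline: {baseline / 2**30:.1f} GiB")
print(f"hybrid:   {hybrid / 2**30:.1f} GiB")
print(f"reduction: {reduction:.1%}")
```

Under these assumed settings the reduction comes from two multiplicative factors: fewer layers that keep a growing cache, and fewer cached elements per token in each of those layers.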
