Long-Context Aware Upcycling: A New Frontier for Hybrid LLM Scaling

Parsa Ashrafi Fashi; Utkarsh Saxena; Mehdi Rezagholizadeh; Aref Jafari; Akash Haridas; Mingyu Yang; Vansh Bhatia; Guihong Li; Vikram Appia; Emad Barsoum

arXiv:2604.24715 · cs.CL · April 28, 2026

TL;DR

This paper introduces HyLo, a recipe for converting pretrained Transformer LLMs into hybrid models that extends usable context length by up to 32 times and cuts KV-cache memory by over 90%.

Contribution

HyLo provides a practical, staged approach for upcycling pretrained Transformers into hybrid architectures with enhanced long-context capabilities and reduced memory usage.

Findings

1. Extends context length up to 32 times with post-training adaptation.
2. Reduces KV-cache memory by over 90%, enabling 2M-token decoding.
3. Outperforms state-of-the-art upcycled hybrid models on long-context benchmarks.
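
To give a feel for why replacing attention layers shrinks the KV cache, here is a minimal back-of-the-envelope sketch. It is not the paper's configuration: the layer count, KV-head count, head dimension, and the number of attention layers kept are all hypothetical, and the constant-size state of the linear blocks is ignored. It also does not model the extra compression that MLA provides.

```python
def kv_cache_bytes(num_attn_layers, num_kv_heads, head_dim, seq_len,
                   bytes_per_elem=2, batch_size=1):
    """Bytes needed to cache keys and values for standard attention layers.

    The leading factor of 2 accounts for storing both K and V per token.
    bytes_per_elem=2 assumes fp16/bf16 storage (an assumption, not from the paper).
    """
    return (2 * num_attn_layers * num_kv_heads * head_dim
            * seq_len * bytes_per_elem * batch_size)


SEQ_LEN = 2_000_000  # the 2M-token decoding length mentioned in the findings

# Hypothetical pure-Transformer baseline: 32 attention layers, 8 KV heads, head_dim 128.
full = kv_cache_bytes(32, 8, 128, SEQ_LEN)

# Hypothetical hybrid: only 2 of the 32 layers keep a growing KV cache;
# the other 30 are assumed to be linear blocks with constant-size state.
hybrid = kv_cache_bytes(2, 8, 128, SEQ_LEN)

reduction = 1 - hybrid / full

print(f"full cache:   {full / 2**30:.1f} GiB")
print(f"hybrid cache: {hybrid / 2**30:.1f} GiB")
print(f"reduction:    {reduction:.2%}")
```

Under these assumed numbers the cache scales linearly with the number of layers that still use attention, so the reduction is simply the fraction of attention layers removed.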

Abstract

Hybrid sequence models that combine efficient Transformer components with linear sequence modeling blocks are a promising alternative to pure Transformers, but most are still pretrained from scratch and therefore fail to reuse existing Transformer checkpoints. We study upcycling as a practical path to convert pretrained Transformer LLMs into hybrid architectures while preserving short-context quality and improving long-context capability. We call our solution HyLo (HYbrid LOng-context): a long-context upcycling recipe that combines architectural adaptation with efficient Transformer blocks, Multi-Head Latent Attention (MLA), and linear blocks (Mamba2 or Gated DeltaNet), together with staged long-context training and teacher-guided distillation for stable optimization. HyLo extends usable context length by up to 32× through efficient post-training and reduces KV-cache…
