Gecko: An Efficient Neural Architecture Inherently Processing Sequences with Arbitrary Lengths

Xuezhe Ma; Shicheng Wen; Linghao Jin; Bilge Acun; Ruihang Lai; Bohan Hou; Will Lin; Hao Zhang; Songlin Yang; Ryan Lee; Mengxi Wu; Jonathan May; Luke Zettlemoyer; Carole-Jean Wu

arXiv:2601.06463·cs.LG·January 13, 2026


Open Access

TL;DR

Gecko is a neural architecture for efficiently processing extremely long sequences. In a controlled comparison at the 7B-parameter scale it outperforms Llama2 and Megalodon in efficiency and long-context scalability, and it captures long-range dependencies without additional context-extension techniques.

Contribution

The paper introduces Gecko, a new neural architecture that inherently handles arbitrarily long sequences with improved efficiency and long-context capabilities, building upon and extending prior models like Mega and Megalodon.

Findings

01. Achieves better training loss than Llama2-7B and Megalodon-7B.

02. Handles sequences up to 4 million tokens.

03. Retrieves information from contexts four times longer than its attention window.

Abstract

Designing a unified neural network to efficiently and inherently process sequential data with arbitrary lengths is a central and challenging problem in sequence modeling. The design choices in the Transformer, including quadratic complexity and weak length extrapolation, have limited its ability to scale to long sequences. In this work, we propose Gecko, a neural architecture that inherits the design of Mega and Megalodon (exponential moving average with gated attention), and further introduces multiple technical components to improve its capability to capture long-range dependencies, including timestep decay normalization, a sliding chunk attention mechanism, and adaptive working memory. In a controlled pretraining comparison with Llama2 and Megalodon at the scale of 7 billion parameters and 2 trillion training tokens, Gecko achieves better efficiency and long-context scalability. Gecko…
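
The abstract names the exponential moving average (EMA) as one ingredient inherited from Mega and Megalodon. As a rough illustration only, the sketch below implements the textbook scalar EMA recurrence, not Gecko's actual formulation. The function names `ema` and `ema_chunked`, the zero initial state, and the single scalar `alpha` are all assumptions made for this example. It shows why a recurrence with a fixed-size state can, in principle, process a sequence of any length chunk by chunk.

```python
def ema(xs, alpha):
    """Textbook EMA: y_t = alpha * x_t + (1 - alpha) * y_{t-1}.

    Illustrative only; the zero initial state is an assumption.
    """
    if not 0.0 < alpha <= 1.0:
        raise ValueError("alpha must be in (0, 1]")
    ys = []
    state = 0.0  # assumed initial state y_0 = 0
    for x in xs:
        state = alpha * x + (1.0 - alpha) * state
        ys.append(state)
    return ys


def ema_chunked(xs, alpha, chunk_size):
    """Same recurrence, consumed in fixed-size chunks.

    Only the scalar `state` is carried between chunks, so memory for
    the carried state does not grow with sequence length.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    ys = []
    state = 0.0  # assumed initial state, as above
    for start in range(0, len(xs), chunk_size):
        for x in xs[start:start + chunk_size]:
            state = alpha * x + (1.0 - alpha) * state
            ys.append(state)
    return ys


if __name__ == "__main__":
    signal = [1.0, 1.0, 1.0, 0.0, 0.0]
    print(ema(signal, 0.5))
    print(ema_chunked(signal, 0.5, chunk_size=2))
```

Both functions return the same values for any chunk size, because the chunk boundary only changes how the input is sliced, not the order of the updates.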


Taxonomy

Topics: Advanced Neural Network Applications · Neural Networks and Reservoir Computing · Advanced Memory and Neural Computing