Can Custom Models Learn In-Context? An Exploration of Hybrid Architecture Performance on In-Context Learning Tasks
Ryan Campbell, Nelson Lojo, Kesava Viswanadha, Christoffer Grondal Tryggestad, Derrick Han Sun, Sriteja Vijapurapu, August Rolfsen, Anant Sahai

TL;DR
This paper investigates how different hybrid architectures of language models affect in-context learning performance, revealing architectural impacts on efficiency and proposing a new performance metric.
Contribution
It extends previous work to hybrid GPT-2/LLaMa and LLaMa/Mamba models, analyzing architectural effects on in-context learning and introducing the ICL regression score metric.
Findings
Certain architectural changes degrade ICL accuracy and training efficiency.
Some hybrid models show improved ICL performance, indicating potential for architecture optimization.
The ICL regression score provides a comprehensive performance measure.
Abstract
In-Context Learning (ICL) is a phenomenon where task learning occurs through a prompt sequence without the necessity of parameter updates. ICL in Multi-Headed Attention (MHA) with absolute positional embedding has been studied more than ICL in other sequence model varieties. We examine implications of architectural differences between GPT-2 and LLaMa as well as LLaMa and Mamba. We extend work done by Garg et al. (2022) and Park et al. (2024) to GPT-2/LLaMa hybrid and LLaMa/Mamba hybrid models, examining the interplay between sequence transformation blocks and regressive performance in-context. We note that certain architectural changes degrade training efficiency and ICL accuracy, either by converging to suboptimal predictors or by converging more slowly. We also find that certain hybrids show promising performance improvements, informing potential future ICL-focused architecture modifications.…
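To make "regressive performance in-context" concrete, the sketch below builds one regression prompt and scores a simple baseline on it. It assumes a noiseless linear function class, a common choice in this line of work, and uses ordinary least squares as a stand-in predictor rather than any of the hybrid models discussed here. The names `sample_linear_prompt` and `least_squares_in_context_errors` are hypothetical and not taken from the paper's code.

```python
import numpy as np


def sample_linear_prompt(n_points, dim, rng):
    """Draw one task w and a prompt of (x_i, y_i) pairs with y_i = <w, x_i>.

    Assumption: noiseless linear targets with standard normal inputs and weights.
    """
    w = rng.standard_normal(dim)
    xs = rng.standard_normal((n_points, dim))
    ys = xs @ w
    return xs, ys


def least_squares_in_context_errors(xs, ys):
    """Squared error when predicting y_k from only the first k (x, y) pairs.

    The predictor is refit from the context at every position, so nothing
    is learned across prompts: all "learning" happens inside the prompt.
    """
    n_points = len(xs)
    errors = np.empty(n_points)
    for k in range(n_points):
        if k == 0:
            prediction = 0.0  # no context yet, so predict zero
        else:
            # Minimum-norm least-squares fit on the k context examples.
            w_hat, *_ = np.linalg.lstsq(xs[:k], ys[:k], rcond=None)
            prediction = xs[k] @ w_hat
        errors[k] = (prediction - ys[k]) ** 2
    return errors


rng = np.random.default_rng(0)
dim, n_points = 5, 12
xs, ys = sample_linear_prompt(n_points, dim, rng)
errors = least_squares_in_context_errors(xs, ys)
for k, err in enumerate(errors):
    print(f"context size {k:2d}: squared error {err:.3e}")
```

In this noiseless setting the baseline's error falls to numerical zero once the context holds at least `dim` examples. A sequence model evaluated on the same prompts would be compared against an error curve of this kind, position by position.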
Taxonomy
Topics: Context-Aware Activity Recognition Systems
Methods: Linear Layer · Cosine Annealing · Dense Connections · Layer Normalization · Residual Connection · Focus · Linear Warmup With Cosine Annealing · Adam · Attention Is All You Need
