LANTERN: Accelerating Visual Autoregressive Models with Relaxed Speculative Decoding
Doohyuk Jang, Sihwan Park, June Yong Yang, Yeonsung Jung, Jihun Yun, Souvik Kundu, Sung-Yub Kim, Eunho Yang

TL;DR
LANTERN introduces a relaxed decoding method that significantly accelerates visual autoregressive models by addressing token selection ambiguity, enabling more flexible token use without sacrificing image quality.
Contribution
The paper proposes LANTERN, a novel relaxed decoding approach that improves speculative decoding efficiency in visual AR models by leveraging token interchangeability in latent space.
Findings
LANTERN achieves 1.75x speed-up over naive speculative decoding.
LANTERN achieves 1.82x speed-up over greedy decoding.
The method maintains image quality and semantic coherence.
Abstract
Auto-Regressive (AR) models have recently gained prominence in image generation, often matching or even surpassing the performance of diffusion models. However, one major limitation of AR models is their sequential nature, which processes tokens one at a time, slowing down generation compared to models like GANs or diffusion-based methods that operate more efficiently. While speculative decoding has proven effective for accelerating LLMs by generating multiple tokens in a single forward pass, its application in visual AR models remains largely unexplored. In this work, we identify a challenge in this setting, which we term "token selection ambiguity", wherein visual AR models frequently assign uniformly low probabilities to tokens, hampering the performance of speculative decoding. To overcome this challenge, we propose a relaxed acceptance condition referred to as LANTERN that…
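To make the acceptance issue concrete, the following is a minimal sketch, not the paper's implementation. The first function is the standard speculative decoding acceptance test, which accepts a drafted token with probability min(1, p/q). The second is a hypothetical relaxation that pools target probability over tokens treated as interchangeable. The function names, the `neighbors` table, and the toy distributions are all assumptions made for illustration.

```python
def accept_prob(p_target, q_draft, token):
    """Standard speculative decoding: accept the drafted token with prob min(1, p/q)."""
    return min(1.0, p_target[token] / q_draft[token])


def relaxed_accept_prob(p_target, q_draft, token, neighbors):
    """Hypothetical relaxation (an assumption, not the paper's exact rule):
    pool the target probability of the drafted token with that of tokens
    treated as interchangeable with it before applying the same test."""
    group = {token, *neighbors.get(token, ())}
    pooled = sum(p_target[t] for t in group)
    return min(1.0, pooled / q_draft[token])


# Toy 4-token vocabulary. The target model spreads probability uniformly
# (the "token selection ambiguity" case); the drafter favours token 0.
p_target = [0.25, 0.25, 0.25, 0.25]
q_draft = [0.5, 0.25, 0.125, 0.125]

# Hypothetical neighbor table: token 1 is treated as interchangeable with token 0.
neighbors = {0: [1]}

strict = accept_prob(p_target, q_draft, 0)
relaxed = relaxed_accept_prob(p_target, q_draft, 0, neighbors)
print(strict, relaxed)
```

In this toy case the strict test accepts the drafted token only half the time, while the pooled test always accepts it. Any relaxation of this kind no longer samples exactly from the target distribution, so it trades some distributional fidelity for a higher acceptance rate.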
Taxonomy
Topics: Advanced Image and Video Retrieval Techniques · Image Retrieval and Classification Techniques · Advanced Vision and Imaging
Methods: SPEED: Separable Pyramidal Pooling EncodEr-Decoder for Real-Time Monocular Depth Estimation on Low-Resource Settings · Diffusion
