IVRA: Improving Visual-Token Relations for Robot Action Policy with Training-Free Hint-Based Guidance

Jongwoo Park; Kanchana Ranasinghe; Jinhyeok Jang; Cristina Mata; Yoo Sung Jang; Michael S Ryoo

arXiv:2601.16207 · cs.RO · January 23, 2026

Open Access

TL;DR

IVRA is a training-free method that enhances spatial understanding in vision-language-action models by leveraging built-in affinity hints, leading to improved robot manipulation success across various benchmarks without retraining.

Contribution

The paper introduces IVRA, an inference-time technique that improves visual-token relations in VLA models using signals already present in the model, without additional training or external encoders.

Findings

01. IVRA improves success rates by 4.2% on the 2D VIMA benchmark.

02. It yields consistent gains on 3D manipulation tasks (LIBERO).

03. The method better preserves geometric structure during inference.

Abstract

Many Vision-Language-Action (VLA) models flatten image patches into a 1D token sequence, weakening the 2D spatial cues needed for precise manipulation. We introduce IVRA, a lightweight, training-free method that improves spatial understanding by exploiting affinity hints already available in the model's built-in vision encoder, without requiring any external encoder or retraining. IVRA selectively injects these affinity signals into a language-model layer in which instance-level features reside. This inference-time intervention realigns visual-token interactions and better preserves geometric structure while keeping all model parameters fixed. We demonstrate the generality of IVRA by applying it to diverse VLA architectures (LLaRA, OpenVLA, and FLOWER) across simulated benchmarks spanning both 2D and 3D manipulation (VIMA and LIBERO) and on various real-robot tasks. On 2D VIMA, IVRA…
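The injection step described in the abstract can be pictured with a small sketch. This is an illustration under stated assumptions, not the authors' implementation: the abstract does not say how affinities are computed or how they are injected, so the cosine affinity, the softmax temperature, the blending weight `alpha`, and the function names `cosine_affinity`, `row_softmax`, and `inject_affinity` are all hypothetical choices made here for clarity. No parameters are trained; the hidden states of visual tokens are only re-mixed at inference time.

```python
import math

def cosine_affinity(feats):
    """Pairwise cosine similarity between patch features (assumed affinity hint)."""
    norms = [math.sqrt(sum(x * x for x in f)) or 1.0 for f in feats]
    n = len(feats)
    return [
        [sum(a * b for a, b in zip(feats[i], feats[j])) / (norms[i] * norms[j])
         for j in range(n)]
        for i in range(n)
    ]

def row_softmax(mat, temperature=0.1):
    """Turn each affinity row into mixing weights that sum to 1 (assumed normalization)."""
    out = []
    for row in mat:
        m = max(row)
        exps = [math.exp((v - m) / temperature) for v in row]
        total = sum(exps)
        out.append([e / total for e in exps])
    return out

def inject_affinity(hidden, weights, alpha=0.5):
    """Blend each visual token's hidden state with an affinity-weighted average.

    hidden:  visual-token hidden states at one language-model layer
    weights: row-normalized affinities from the vision encoder
    alpha:   hypothetical blending strength; 0 leaves the states unchanged
    """
    n, dim = len(hidden), len(hidden[0])
    mixed = []
    for i in range(n):
        pooled = [sum(weights[i][j] * hidden[j][d] for j in range(n)) for d in range(dim)]
        mixed.append([(1 - alpha) * hidden[i][d] + alpha * pooled[d] for d in range(dim)])
    return mixed

# Toy data: patches 0 and 1 look alike, patches 2 and 3 look alike.
feats = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
# Toy hidden states: tokens 1 and 3 carry no information of their own.
hidden = [[1.0, 0.0], [0.0, 0.0], [0.0, 1.0], [0.0, 0.0]]

weights = row_softmax(cosine_affinity(feats))
mixed = inject_affinity(hidden, weights, alpha=0.5)
```

In this toy setup, token 1 picks up most of its new state from token 0, the patch it resembles in the vision encoder, which is the kind of realignment of visual-token interactions the abstract describes. Which layer to intervene on and which tokens to treat as visual are left out of the sketch because the abstract only says the layer is one "in which instance-level features reside".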

Taxonomy

Topics: Multimodal Machine Learning Applications · Robot Manipulation and Learning · Reinforcement Learning in Robotics