RLS3: RL-Based Synthetic Sample Selection to Enhance Spatial Reasoning in Vision-Language Models for Indoor Autonomous Perception

Joshua R. Waite; Md. Zahid Hasan; Qisai Liu; Zhanhong Jiang; Chinmay Hegde; Soumik Sarkar

arXiv:2501.18880 · cs.CV · February 3, 2025

TL;DR

This paper introduces a reinforcement learning-based framework that generates synthetic data to fine-tune vision-language models, significantly improving their spatial reasoning abilities in indoor autonomous perception tasks.

Contribution

It presents a novel RL-guided synthetic data generation method that enhances VLM fine-tuning by targeting specific vulnerabilities and improving spatial reasoning performance.

Findings

1. Improved spatial reasoning accuracy in VLMs.
2. Effective synthetic data generation via RL agent.
3. Enhanced fine-tuning process for task-specific performance.

Abstract

Vision-language model (VLM) fine-tuning for application-specific visual grounding based on natural language instructions has become one of the most popular approaches for learning-enabled autonomous systems. However, such fine-tuning relies heavily on high-quality datasets to achieve successful performance in various downstream tasks. Additionally, VLMs often encounter limitations due to insufficient and imbalanced fine-tuning data. To address these issues, we propose a new generalizable framework to improve VLM fine-tuning by integrating it with a reinforcement learning (RL) agent. Our method utilizes the RL agent to manipulate objects within an indoor setting to create synthetic data for fine-tuning to address certain vulnerabilities of the VLM. Specifically, we use the performance of the VLM to provide feedback to the RL agent to generate informative data that efficiently fine-tune…
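
The feedback loop described in the abstract (VLM performance steering which synthetic samples get generated) can be illustrated with a toy sketch. This is not the authors' implementation: it swaps in a simple epsilon-greedy bandit for the RL agent, replaces the VLM with a mock that fails at made-up rates, and assumes the reward is 1 whenever the VLM answers wrongly. The relation names, error rates, and function names are all hypothetical.

```python
import random

# Hypothetical spatial relations the agent can choose to render (assumption).
ACTIONS = ["left_of", "right_of", "in_front_of", "behind"]

# Made-up failure rates for a mock VLM; a real setup would query the model.
MOCK_VLM_ERROR_RATE = {
    "left_of": 0.1,
    "right_of": 0.1,
    "in_front_of": 0.6,
    "behind": 0.7,
}


def mock_vlm_fails(relation, rng):
    """Stand-in for evaluating a VLM on one synthetic sample.

    Returns True when the (mock) VLM answers incorrectly.
    """
    return rng.random() < MOCK_VLM_ERROR_RATE[relation]


def select_samples(steps=2000, epsilon=0.1, seed=0):
    """Epsilon-greedy bandit that favors relations the mock VLM gets wrong."""
    rng = random.Random(seed)
    value = {a: 0.0 for a in ACTIONS}  # running mean reward per action
    count = {a: 0 for a in ACTIONS}    # how often each action was tried
    selected = []                      # samples kept for fine-tuning
    for _ in range(steps):
        if rng.random() < epsilon:
            action = rng.choice(ACTIONS)                      # explore
        else:
            action = max(ACTIONS, key=lambda a: value[a])     # exploit
        # Assumed reward: 1 when the VLM fails, so hard cases are sought out.
        reward = 1.0 if mock_vlm_fails(action, rng) else 0.0
        count[action] += 1
        value[action] += (reward - value[action]) / count[action]
        if reward:
            selected.append(action)
    return value, count, selected


if __name__ == "__main__":
    value, count, selected = select_samples()
    print("estimated failure rate per relation:", value)
    print("times each relation was generated:", count)
    print("samples kept for fine-tuning:", len(selected))
```

In this sketch the kept samples would then form the fine-tuning set; the paper's agent instead manipulates objects in an indoor scene, which a bandit over four fixed relations does not capture.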

Taxonomy

Topics: Constraint Satisfaction and Optimization · Advanced Image and Video Retrieval Techniques · Multimodal Machine Learning Applications