RLS3: RL-Based Synthetic Sample Selection to Enhance Spatial Reasoning in Vision-Language Models for Indoor Autonomous Perception
Joshua R. Waite, Md. Zahid Hasan, Qisai Liu, Zhanhong Jiang, Chinmay Hegde, Soumik Sarkar

TL;DR
This paper introduces a framework in which a reinforcement learning agent manipulates objects in indoor scenes to generate synthetic fine-tuning data for vision-language models, using the VLM's own performance as feedback to improve spatial reasoning in indoor autonomous perception tasks.
Contribution
It presents a novel RL-guided synthetic data generation method that enhances VLM fine-tuning by targeting specific vulnerabilities and improving spatial reasoning performance.
Findings
Improved spatial reasoning accuracy in VLMs.
Effective synthetic data generation via RL agent.
Enhanced fine-tuning process for task-specific performance.
Abstract
Vision-language model (VLM) fine-tuning for application-specific visual grounding based on natural language instructions has become one of the most popular approaches for learning-enabled autonomous systems. However, such fine-tuning relies heavily on high-quality datasets to achieve successful performance in various downstream tasks. Additionally, VLMs often encounter limitations due to insufficient and imbalanced fine-tuning data. To address these issues, we propose a new generalizable framework to improve VLM fine-tuning by integrating it with a reinforcement learning (RL) agent. Our method utilizes the RL agent to manipulate objects within an indoor setting to create synthetic data for fine-tuning to address certain vulnerabilities of the VLM. Specifically, we use the performance of the VLM to provide feedback to the RL agent to generate informative data that efficiently fine-tune…
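The feedback loop described in the abstract can be illustrated with a toy sketch. Everything here is an assumption made for illustration and is not taken from the paper: the `ToyVLM` class, the four spatial relations, the accuracy numbers, the binary reward, and the epsilon-greedy bandit that stands in for the RL agent. The only idea it borrows from the abstract is that the VLM's performance is the signal that steers which synthetic samples get generated.

```python
import random

# Hypothetical scene configurations the agent can produce by moving objects.
RELATIONS = ["left_of", "right_of", "behind", "in_front_of"]


class ToyVLM:
    """Stand-in for a VLM: one made-up accuracy value per spatial relation."""

    def __init__(self):
        self.accuracy = {
            "left_of": 0.9, "right_of": 0.9, "behind": 0.4, "in_front_of": 0.5,
        }

    def answers_correctly(self, relation, rng):
        # Simulate a query about a synthetic scene with this relation.
        return rng.random() < self.accuracy[relation]

    def fine_tune(self, relation, step=0.02):
        # Pretend one fine-tuning sample nudges accuracy upward, capped at 1.0.
        self.accuracy[relation] = min(1.0, self.accuracy[relation] + step)


def select_samples(vlm, rounds=300, epsilon=0.1, seed=0):
    """Epsilon-greedy selection: reward is 1 when the VLM fails on the sample."""
    rng = random.Random(seed)
    # Optimistic initial values so every relation is tried early on.
    value = {r: 1.0 for r in RELATIONS}
    counts = {r: 0 for r in RELATIONS}
    for _ in range(rounds):
        if rng.random() < epsilon:
            relation = rng.choice(RELATIONS)          # explore
        else:
            relation = max(RELATIONS, key=value.get)  # exploit current estimate
        reward = 0.0 if vlm.answers_correctly(relation, rng) else 1.0
        counts[relation] += 1
        # Incremental mean of observed rewards for this relation.
        value[relation] += (reward - value[relation]) / counts[relation]
        if reward == 1.0:
            # A failure marks the sample as informative, so it is used for tuning.
            vlm.fine_tune(relation)
    return counts


if __name__ == "__main__":
    vlm = ToyVLM()
    print(select_samples(vlm))
    print(vlm.accuracy)
```

In this sketch the selector spends samples where its estimate of the VLM's failure rate is highest, and only failed samples feed the mock fine-tuning step. A real system would replace the bandit with a full RL policy over object placements and the mock accuracies with actual VLM evaluation.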
Taxonomy
Topics: Constraint Satisfaction and Optimization · Advanced Image and Video Retrieval Techniques · Multimodal Machine Learning Applications
