Scene-Aware Vectorized Memory Multi-Agent Framework with Cross-Modal Differentiated Quantization VLMs for Visually Impaired Assistance
Xiangxiang Wang, Xuanyu Wang, YiJia Luo, Yongbin Yu, Manping Fan, Jingtao Zhang, Liyong Ren

TL;DR
This paper introduces a memory-efficient quantized vision-language model and a multi-agent system that together improve environmental perception for visually impaired individuals while reducing computational requirements.
Contribution
It presents a cross-modal differentiated quantization framework and a scene-aware vectorized memory multi-agent system, enhancing efficiency and integrated scene understanding in assistive technology.
Findings
Reduces memory usage from 38GB to 11.3GB with minimal performance loss.
Achieves 2.83-3.52s latency for initial speech output.
Maintains high accuracy on OCR-VQA and MMBench benchmarks.
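To put the reported memory figures in perspective, the short calculation below derives the absolute saving, the percentage reduction, and the compression ratio from the two numbers given above (38GB and 11.3GB). The function name `memory_savings` is only an illustrative helper, not something taken from the paper.

```python
def memory_savings(before_gb: float, after_gb: float) -> dict:
    """Summarize a memory reduction from `before_gb` to `after_gb`."""
    saved = before_gb - after_gb
    return {
        "saved_gb": saved,
        "reduction_pct": 100.0 * saved / before_gb,
        "compression_ratio": before_gb / after_gb,
    }


# Figures as reported in the findings: 38GB before, 11.3GB after.
stats = memory_savings(38.0, 11.3)
print(
    f"saved {stats['saved_gb']:.1f} GB, "
    f"{stats['reduction_pct']:.1f}% reduction, "
    f"{stats['compression_ratio']:.2f}x smaller"
)
# prints: saved 26.7 GB, 70.3% reduction, 3.36x smaller
```

In other words, the quantized model needs roughly 30% of the original footprint.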
Abstract
Visually impaired individuals face significant challenges in environmental perception. Traditional assistive technologies often lack adaptive intelligence, focusing on individual components rather than integrated systems. While Vision-Language Models (VLMs) offer a promising path to richer, integrated understanding, their deployment is severely limited by substantial computational requirements, demanding dozens of gigabytes of memory. To address these gaps in computational efficiency and integrated design, this study proposes a dual technological innovation framework: a cross-modal differentiated quantization framework for VLMs and a scene-aware vectorized memory multi-agent system. The quantization framework implements differentiated strategies, reducing memory from 38GB to 11.3GB. The multi-agent system uses vectorized memory and perception-memory-reasoning workflows to provide…
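The abstract as shown here does not spell out what the differentiated strategies are. As a rough illustration of the general idea only, the sketch below assumes that "differentiated" means assigning different bit widths to different modalities, and applies plain symmetric uniform quantization to each. The bit widths, the toy weight values, and every function name are hypothetical and are not taken from the paper.

```python
def quantize(values, bits):
    """Symmetric uniform quantization to signed integers of width `bits`."""
    qmax = 2 ** (bits - 1) - 1
    max_abs = max(abs(v) for v in values)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    codes = [max(-qmax, min(qmax, round(v / scale))) for v in values]
    return codes, scale


def dequantize(codes, scale):
    """Map integer codes back to approximate floating-point values."""
    return [c * scale for c in codes]


def max_error(values, bits):
    """Largest absolute reconstruction error after a quantize/dequantize round trip."""
    codes, scale = quantize(values, bits)
    restored = dequantize(codes, scale)
    return max(abs(v - r) for v, r in zip(values, restored))


# Hypothetical assignment: keep the vision branch at higher precision
# than the language branch. These widths are assumptions for illustration.
BITS_PER_MODALITY = {"vision": 8, "language": 4}

# Toy weights, identical for both branches so only the bit width differs.
toy_weights = [0.12, -0.5, 0.33, 0.9]

errors = {
    name: max_error(toy_weights, bits)
    for name, bits in BITS_PER_MODALITY.items()
}
for name, err in errors.items():
    print(f"{name}: {BITS_PER_MODALITY[name]}-bit, max error {err:.4f}")
```

With identical weights, the branch given fewer bits shows the larger reconstruction error, which is the trade-off a per-modality scheme would try to balance against memory savings.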
