Vision-Language Integration for Zero-Shot Scene Understanding in Real-World Environments
Manjunath Prasad Holenarasipura Rajiv, B. M. Vidyavathi

TL;DR
This paper introduces a vision-language integration framework that combines visual encoders and large language models to improve zero-shot scene understanding in complex real-world environments, enabling better recognition and interpretation without prior labeled data.
Contribution
The paper presents a unified model that aligns visual and textual data in a shared embedding space, enhancing zero-shot scene understanding through multimodal fusion and reasoning.
Findings
Up to 18% improvement in top-1 accuracy on benchmark datasets
Significant gains in semantic coherence metrics
Enhanced generalization for object recognition and scene captioning
Abstract
Zero-shot scene understanding in real-world settings presents major challenges due to the complexity and variability of natural scenes, where models must recognize new objects, actions, and contexts without prior labeled examples. This work proposes a vision-language integration framework that unifies pre-trained visual encoders (e.g., CLIP, ViT) and large language models (e.g., GPT-based architectures) to achieve semantic alignment between visual and textual modalities. The goal is to enable robust zero-shot comprehension of scenes by using natural language as a bridge to generalize to unseen categories and contexts. Our approach embeds visual inputs and textual prompts into a shared space, then applies multimodal fusion and reasoning layers for contextual interpretation. Experiments on Visual Genome, COCO, ADE20K, and custom real-world datasets…
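The shared-space idea in the abstract can be illustrated with a minimal sketch of zero-shot classification by cosine similarity. This is not the authors' implementation: the embeddings are hand-made toy vectors standing in for encoder outputs, the function names are hypothetical, the temperature of 0.07 is an assumed value, and the paper's fusion and reasoning layers are not represented.

```python
import math

def l2_normalize(v):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in v))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in v]

def cosine_similarity(a, b):
    a, b = l2_normalize(a), l2_normalize(b)
    return sum(x * y for x, y in zip(a, b))

def softmax(scores, temperature=0.07):
    """Turn similarity scores into probabilities (temperature is an assumed value)."""
    scaled = [s / temperature for s in scores]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def zero_shot_classify(image_embedding, prompt_embeddings, temperature=0.07):
    """Pick the text prompt whose embedding lies closest to the image embedding.

    prompt_embeddings maps a prompt string to its vector; no labeled
    training examples of any class are used, only the prompts themselves.
    """
    labels = list(prompt_embeddings)
    sims = [cosine_similarity(image_embedding, prompt_embeddings[name])
            for name in labels]
    probs = softmax(sims, temperature)
    best = max(range(len(labels)), key=lambda i: probs[i])
    return labels[best], dict(zip(labels, probs))

# Toy 3-d vectors (assumption): real encoders emit hundreds of dimensions.
image_embedding = [0.9, 0.1, 0.2]
prompt_embeddings = {
    "a photo of a kitchen": [1.0, 0.0, 0.1],
    "a photo of a street": [0.0, 1.0, 0.0],
    "a photo of a forest": [0.1, 0.2, 1.0],
}

label, probs = zero_shot_classify(image_embedding, prompt_embeddings)
print(label)  # the kitchen prompt has the highest cosine similarity
```

Because classes are defined only by their text prompts, adding an unseen category in this sketch amounts to adding one more prompt embedding; no retraining step is involved.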
