Vision-Language Integration for Zero-Shot Scene Understanding in Real-World Environments

Manjunath Prasad Holenarasipura Rajiv; B. M. Vidyavathi

arXiv:2510.25070 · cs.CV · October 30, 2025

TL;DR

This paper introduces a vision-language integration framework that combines pre-trained visual encoders with large language models to improve zero-shot scene understanding in complex real-world environments, so that scenes can be recognized and interpreted without prior labeled examples.

Contribution

The paper presents a unified model that embeds visual inputs and textual prompts into a shared space, then applies multimodal fusion and reasoning layers to interpret scenes in context without labeled examples.

Findings

01. Up to 18% improvement in top-1 accuracy on benchmark datasets
02. Significant gains in semantic coherence metrics
03. Enhanced generalization for object recognition and scene captioning

Abstract

Zero-shot scene understanding in real-world settings presents major challenges due to the complexity and variability of natural scenes, where models must recognize new objects, actions, and contexts without prior labeled examples. This work proposes a vision-language integration framework that unifies pre-trained visual encoders (e.g., CLIP, ViT) and large language models (e.g., GPT-based architectures) to achieve semantic alignment between visual and textual modalities. The goal is to enable robust zero-shot comprehension of scenes by leveraging natural language as a bridge to generalize over unseen categories and contexts. Our approach develops a unified model that embeds visual inputs and textual prompts into a shared space, followed by multimodal fusion and reasoning layers for contextual interpretation. Experiments on Visual Genome, COCO, ADE20K, and custom real-world datasets…
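
The shared-space idea in the abstract can be illustrated with a minimal sketch of zero-shot labeling by embedding similarity. This is not the authors' model: the fusion and reasoning layers are omitted, the three-dimensional vectors are hand-made stand-ins for real encoder outputs (e.g., from CLIP), and the function names, the prompt strings, and the temperature of 0.07 are all assumptions chosen for illustration.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length so a dot product equals cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vec]


def cosine_similarity(a, b):
    """Cosine similarity between two vectors of equal length."""
    a_n, b_n = l2_normalize(a), l2_normalize(b)
    return sum(x * y for x, y in zip(a_n, b_n))


def softmax(scores, temperature):
    """Temperature-scaled softmax, shifted by the max for numerical stability."""
    scaled = [s / temperature for s in scores]
    peak = max(scaled)
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def zero_shot_classify(image_embedding, prompt_embeddings, temperature=0.07):
    """Rank text prompts by similarity to one image embedding.

    Returns (prompt, probability) pairs sorted from most to least likely.
    The temperature value is an assumption, not a figure from the paper.
    """
    labels = list(prompt_embeddings)
    sims = [cosine_similarity(image_embedding, prompt_embeddings[label])
            for label in labels]
    probs = softmax(sims, temperature)
    return sorted(zip(labels, probs), key=lambda pair: pair[1], reverse=True)


# Hypothetical toy embeddings; a real system would obtain these from a
# visual encoder and a text encoder that share one embedding space.
PROMPT_EMBEDDINGS = {
    "a photo of a kitchen": [0.9, 0.1, 0.0],
    "a photo of a street": [0.1, 0.9, 0.1],
    "a photo of a forest": [0.0, 0.2, 0.9],
}
IMAGE_EMBEDDING = [0.8, 0.2, 0.1]

ranked = zero_shot_classify(IMAGE_EMBEDDING, PROMPT_EMBEDDINGS)
for label, prob in ranked:
    print(f"{label}: {prob:.3f}")
```

Because no prompt requires labeled training images, adding a new category in this sketch only means adding a new text prompt and its embedding, which is the sense in which natural language acts as a bridge to unseen categories.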
