Enhancing Vision Language Models with Logic Reasoning for Situational Awareness
Pavana Pradeep, Krishna Kant, Suya Yu

TL;DR
This paper introduces a method that combines vision-language models with logic reasoning to improve situational awareness by extracting detailed event information, enhancing accuracy through intelligent fine-tuning, and providing justifications for outputs.
Contribution
It presents an integrated approach that enhances VLMs with explicit logic reasoning, improving event detail extraction, accuracy, and interpretability in situational awareness tasks.
Findings
Intelligent fine-tuning achieves substantially higher accuracy than uninformed selection
Improved extraction of fine-grained event details
Generated justifications increase interpretability
Abstract
Vision-Language Models (VLMs) offer the ability to generate high-level, interpretable descriptions of complex activities from images and videos, making them valuable for situational awareness (SA) applications. In such settings, the focus is on identifying infrequent but significant events with high reliability and accuracy, while also extracting fine-grained details and assessing recognition quality. In this paper, we propose an approach that integrates VLMs with traditional computer vision methods through explicit logic reasoning to enhance SA in three key ways: (a) extracting fine-grained event details, (b) employing an intelligent fine-tuning (FT) strategy that achieves substantially higher accuracy than uninformed selection, and (c) generating justifications for VLM outputs during inference. We demonstrate that our intelligent FT mechanism improves the accuracy and provides a…
Taxonomy
Topics: Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques · Domain Adaptation and Few-Shot Learning
