ViLLa: A Neuro-Symbolic approach for Animal Monitoring
Harsha Koduri

TL;DR
ViLLa is a neuro-symbolic framework that combines visual detection, natural language understanding, and logical reasoning to interpret animal images and answer human queries transparently.
Contribution
It introduces a modular neuro-symbolic approach for animal monitoring that enhances interpretability and reasoning over visual data and language queries.
Findings
Counts animals in images effectively.
Locates animals accurately in response to queries.
Provides a transparent reasoning process.
Abstract
Monitoring animal populations in natural environments requires systems that can interpret both visual data and human language queries. This work introduces ViLLa (Vision-Language-Logic Approach), a neuro-symbolic framework designed for interpretable animal monitoring. ViLLa integrates three core components: a visual detection module for identifying animals and their spatial locations in images, a language parser for understanding natural language queries, and a symbolic reasoning layer that applies logic-based inference to answer those queries. Given an image and a question such as "How many dogs are in the scene?" or "Where is the buffalo?", the system grounds visual detections into symbolic facts and uses predefined rules to compute accurate answers related to count, presence, and location. Unlike end-to-end black-box models, ViLLa separates perception, understanding, and reasoning,…
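The pipeline described in the abstract (detections grounded into symbolic facts, then rules for count, presence, and location) can be illustrated with a minimal sketch. Everything named here is an assumption for illustration and not taken from the paper: the `Detection` type, the fact predicates `animal` and `at`, the left/center/right region scheme, the regex-based `parse_query`, and the naive plural handling. The detector itself is stubbed out with hand-written boxes.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Detection:
    """Hypothetical detector output: a class label and a pixel bounding box."""
    label: str
    box: tuple  # (x_min, y_min, x_max, y_max)


def ground(detections, image_width):
    """Ground detections into symbolic facts (assumed predicates: animal, at)."""
    facts = set()
    for i, det in enumerate(detections):
        x_center = (det.box[0] + det.box[2]) / 2
        if x_center < image_width / 3:
            region = "left"
        elif x_center < 2 * image_width / 3:
            region = "center"
        else:
            region = "right"
        facts.add(("animal", i, det.label))
        facts.add(("at", i, region))
    return facts


def parse_query(text):
    """Map a question to (kind, label) with simple patterns; a stand-in for a real parser."""
    q = text.lower()
    patterns = [
        ("count", r"how many (\w+)"),
        ("locate", r"where (?:is|are) (?:the )?(\w+)"),
        ("exists", r"(?:is|are) there (?:a |an |any )?(\w+)"),
    ]
    for kind, pattern in patterns:
        match = re.search(pattern, q)
        if match:
            word = match.group(1)
            # Naive singularization: strip one trailing "s" (assumption, not robust).
            label = word[:-1] if word.endswith("s") else word
            return kind, label
    raise ValueError(f"unsupported query: {text!r}")


def answer(facts, query):
    """Apply one predefined rule per query kind over the grounded facts."""
    kind, label = query
    ids = {i for (pred, i, val) in facts if pred == "animal" and val == label}
    if kind == "count":
        return len(ids)
    if kind == "exists":
        return bool(ids)
    if kind == "locate":
        return sorted(val for (pred, i, val) in facts if pred == "at" and i in ids)
    raise ValueError(f"unknown query kind: {kind!r}")


# Hand-written detections standing in for the visual module's output.
detections = [
    Detection("dog", (10, 50, 110, 150)),
    Detection("dog", (400, 60, 500, 160)),
    Detection("buffalo", (250, 40, 380, 200)),
]
facts = ground(detections, image_width=600)

dog_count = answer(facts, parse_query("How many dogs are in the scene?"))
buffalo_location = answer(facts, parse_query("Where is the buffalo?"))
cat_present = answer(facts, parse_query("Is there a cat?"))
```

Because each answer is computed from an explicit fact set by a named rule, the facts that support an answer can be inspected directly, which is the kind of transparency the separation of perception, understanding, and reasoning is meant to provide.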
Taxonomy
Topics: Neural dynamics and brain function
