Hallucinate, Ground, Repeat: A Framework for Generalized Visual Relationship Detection

Shanmukha Vellamcheti; Sanjoy Kundu; Sathyanarayanan N. Aakur

arXiv:2506.05651 · cs.CV · June 9, 2025

TL;DR

This paper presents an iterative framework that combines large language models and visual grounding to improve generalization in visual relationship detection, especially for unseen predicates, by bootstrapping relational understanding from external knowledge.

Contribution

It introduces a novel expectation-maximization (EM)-inspired method that uses LLMs as relational priors for open-world visual relationship detection (VRD), and provides a new benchmark with unseen predicates for evaluation.

Findings

01 Outperforms baselines on the open-world VRD benchmark

02 Achieves higher mean recall on predicate classification

03 Enables generalization to unseen predicates

Abstract

Understanding relationships between objects is central to visual intelligence, with applications in embodied AI, assistive systems, and scene understanding. Yet, most visual relationship detection (VRD) models rely on a fixed predicate set, limiting their generalization to novel interactions. A key challenge is the inability to visually ground semantically plausible, but unannotated, relationships hypothesized from external knowledge. This work introduces an iterative visual grounding framework that leverages large language models (LLMs) as structured relational priors. Inspired by expectation-maximization (EM), our method alternates between generating candidate scene graphs from detected objects using an LLM (expectation) and training a visual model to align these hypotheses with perceptual evidence (maximization). This process bootstraps relational understanding beyond annotated data…
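
The alternation described in the abstract can be illustrated with a toy loop. This is a minimal sketch under stated assumptions, not the authors' implementation: the LLM is replaced by a hypothetical lookup table (`toy_prior`), the visual model by a hypothetical fixed score function (`toy_visual_score`), and training by a simple moving-average update. All names, the learning rate, and the keep threshold are illustrative, and the hypothesis step is held fixed across rounds for simplicity.

```python
from collections import defaultdict
from typing import Callable, Dict, List, Tuple

Triple = Tuple[str, str, str]  # (subject, predicate, object)


def hallucinate(objects: List[str],
                prior: Callable[[str, str], List[str]]) -> List[Triple]:
    """Hypothesis step: propose candidate triples for every ordered object pair."""
    return [(s, p, o)
            for s in objects
            for o in objects
            if s != o
            for p in prior(s, o)]


def ground(candidates: List[Triple],
           visual_score: Callable[[Triple], float],
           weights: Dict[Triple, float],
           lr: float = 0.5) -> Dict[Triple, float]:
    """Grounding step: move each candidate's weight toward its visual evidence."""
    for triple in candidates:
        weights[triple] += lr * (visual_score(triple) - weights[triple])
    return weights


def run(objects: List[str],
        prior: Callable[[str, str], List[str]],
        visual_score: Callable[[Triple], float],
        rounds: int = 3,
        keep: float = 0.5) -> List[Triple]:
    """Alternate the two steps, then keep triples with enough accumulated support."""
    weights: Dict[Triple, float] = defaultdict(float)
    for _ in range(rounds):
        candidates = hallucinate(objects, prior)
        weights = ground(candidates, visual_score, weights)
    return sorted(t for t, w in weights.items() if w >= keep)


# Hypothetical stand-in for an LLM relational prior (assumption, not from the paper).
TOY_PRIOR = {("person", "horse"): ["riding", "feeding"]}


def toy_prior(subject: str, obj: str) -> List[str]:
    return TOY_PRIOR.get((subject, obj), [])


def toy_visual_score(triple: Triple) -> float:
    # Hypothetical stand-in for perceptual evidence: only "riding" is visible.
    return 0.9 if triple == ("person", "riding", "horse") else 0.1


if __name__ == "__main__":
    print(run(["person", "horse"], toy_prior, toy_visual_score))
```

In this sketch the prior proposes both "riding" and "feeding" for the person-horse pair, and only the hypothesis with consistent visual support accumulates enough weight to be kept, which mirrors the idea of grounding plausible but unannotated relationships.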

Taxonomy

Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Advanced Graph Neural Networks