CAGE-SGG: Counterfactual Active Graph Evidence for Open-Vocabulary Scene Graph Generation
Suiyang Guang, Chenyu Liu, Ruohan Zhang, Siyuan Chen

TL;DR
This paper introduces CAGE-SGG, a framework for open-vocabulary scene graph generation that verifies relations based on visual evidence, improving reliability and interpretability over prior methods.
Contribution
It proposes a counterfactual relation verification approach with evidence decomposition, relation-conditioned encoding, and graph-level optimization for more accurate scene graphs.
Findings
Improves recall-based metrics across benchmarks.
Enhances unseen predicate generalization.
Provides more reliable, evidence-grounded scene graphs.
Abstract
Open-vocabulary scene graph generation (SGG) aims to describe visual scenes with flexible and fine-grained relation phrases beyond a fixed predicate vocabulary. While recent vision-language models greatly expand the semantic coverage of SGG, they also introduce a critical reliability issue: predicted relations may be driven by language priors or object co-occurrence rather than grounded visual evidence. In this paper, we propose an evidence-grounded open-vocabulary SGG framework based on counterfactual relation verification. Instead of directly accepting plausible relation proposals, our method verifies whether each candidate relation is supported by relation-specific visual, geometric, and contextual evidence. Specifically, we first generate open-vocabulary relation candidates with a vision-language proposer, then decompose predicate phrases into soft evidence bases such as support,…
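The verification idea described in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: the weighted-sum scoring rule, the zero-out ablation, the threshold, and every name below (`relation_score`, `counterfactual_drop`, `verify_relation`, `placeholder_basis`) are assumptions made for illustration. Only the evidence basis "support" is named in the abstract. The sketch accepts a candidate relation only if removing at least one of its evidence bases noticeably lowers its score, that is, only if the prediction actually depends on the evidence.

```python
from typing import Dict

# Hypothetical representation: evidence basis name -> strength in [0, 1].
Evidence = Dict[str, float]


def relation_score(weights: Dict[str, float], evidence: Evidence) -> float:
    """Weighted sum of evidence strengths under a soft predicate decomposition.

    `weights` stands in for the soft decomposition of a predicate phrase
    into evidence bases (an assumed form, not taken from the paper).
    """
    return sum(w * evidence.get(name, 0.0) for name, w in weights.items())


def counterfactual_drop(
    weights: Dict[str, float], evidence: Evidence, ablated_basis: str
) -> float:
    """Score lost when one evidence basis is counterfactually set to zero."""
    factual = relation_score(weights, evidence)
    counterfactual = relation_score(weights, {**evidence, ablated_basis: 0.0})
    return factual - counterfactual


def verify_relation(
    weights: Dict[str, float], evidence: Evidence, threshold: float = 0.2
) -> bool:
    """Accept the relation only if ablating some basis lowers the score enough.

    The threshold value is an arbitrary illustrative choice.
    """
    largest_drop = max(counterfactual_drop(weights, evidence, b) for b in weights)
    return largest_drop >= threshold


# Hypothetical candidate relation whose predicate leans mostly on "support".
weights = {"support": 0.7, "placeholder_basis": 0.3}

grounded = {"support": 0.9, "placeholder_basis": 0.5}    # strong visual evidence
ungrounded = {"support": 0.1, "placeholder_basis": 0.1}  # weak visual evidence

print(verify_relation(weights, grounded))
print(verify_relation(weights, ungrounded))
```

In this toy setting, a relation proposed purely from a language prior would show weak measured evidence, so ablating any basis barely changes the score and the candidate is rejected.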
