Revisiting Weakly-Supervised Video Scene Graph Generation via Pair Affinity Learning

Minseok Kang; Minhyeok Lee; Minjung Kim; Jungho Lee; Donghyeong Kim; Sungmin Woo; Inseok Jeon; Sangyoun Lee

arXiv:2603.21559·cs.CV·March 24, 2026

Open Access

TL;DR

This paper introduces Pair Affinity Learning and Scoring (PALS) with Relation-Aware Matching (RAM) to improve weakly-supervised video scene graph generation by filtering noninteractive object pairs, leading to state-of-the-art results.

Contribution

It proposes a novel pair affinity estimation method and a relation-aware pseudo-labeling technique to enhance weakly-supervised scene graph generation in videos.

Findings

01

Achieves state-of-the-art performance on the Action Genome dataset.

02

Significantly improves relation detection accuracy.

03

Effectively suppresses noninteractive object pairs.

Abstract

Weakly-supervised video scene graph generation (WS-VSGG) aims to parse video content into structured relational triplets without bounding box annotations and with only sparse temporal labeling, significantly reducing annotation costs. Without ground-truth bounding boxes, these methods rely on off-the-shelf detectors to generate object proposals, yet largely overlook a fundamental discrepancy from fully-supervised pipelines. Fully-supervised detectors implicitly filter out noninteractive objects, while off-the-shelf detectors indiscriminately detect all visible objects, overwhelming relation models with noisy pairs. We address this by introducing a learnable pair affinity that estimates the likelihood of interaction between subject-object pairs. Through Pair Affinity Learning and Scoring (PALS), pair affinity is incorporated into inference-time ranking and further integrated into contextual…
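
To make the idea of affinity-aware ranking concrete, the following is a minimal sketch, not the paper's implementation. It assumes that a pair affinity in [0, 1] is simply multiplied into a conventional triplet score (detector confidences times relation probability). The names `PairCandidate`, `triplet_score`, and `rank_triplets`, the multiplicative form, and all numeric values are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass
class PairCandidate:
    subject: str
    obj: str
    relation: str
    relation_prob: float  # relation classifier confidence (illustrative value)
    subject_conf: float   # detector confidence for the subject box
    object_conf: float    # detector confidence for the object box
    affinity: float       # hypothetical pair affinity in [0, 1]


def triplet_score(c: PairCandidate) -> float:
    # Assumed scoring rule: the affinity down-weights pairs that are
    # unlikely to interact, even when the detector is confident about them.
    return c.affinity * c.relation_prob * c.subject_conf * c.object_conf


def rank_triplets(candidates: list[PairCandidate], top_k: int) -> list[PairCandidate]:
    # Sort candidates by the affinity-weighted score, highest first.
    return sorted(candidates, key=triplet_score, reverse=True)[:top_k]


candidates = [
    # An interacting pair: moderate relation confidence, high affinity.
    PairCandidate("person", "cup", "holding", 0.8, 0.9, 0.9, 0.9),
    # A background pair: confidently detected, but low affinity.
    PairCandidate("person", "shelf", "in_front_of", 0.9, 0.9, 0.95, 0.1),
]

ranked = rank_triplets(candidates, top_k=1)
print(ranked[0].subject, ranked[0].relation, ranked[0].obj)
```

In this toy example the background pair would rank first if the affinity term were dropped, because its detector and relation confidences are higher. With the affinity term it falls below the interacting pair, which is the filtering effect the abstract describes.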

Taxonomy

Topics: Multimodal Machine Learning Applications · Generative Adversarial Networks and Image Synthesis · Domain Adaptation and Few-Shot Learning