SemCo: Toward Semantic Coherent Visual Relationship Forecasting

Yangjun Ou; Yao Liu; Li Mi; Zhenzhong Chen

arXiv:2107.01181 · cs.CV · November 19, 2025

Open Access

TL;DR

This paper introduces SemCoBench, a benchmark emphasizing semantic coherence in visual relationship forecasting, and proposes SemCoFormer, a transformer-based model with modules to improve understanding of object interactions in videos.

Contribution

It presents a new benchmark for semantic coherence in VRF and a novel transformer-based model with modules to better distinguish relationships and focus on their dynamics.

Findings

01. Model achieves improved accuracy on SemCoBench.

02. Semantic coherence modeling enhances relationship prediction.

03. Modules effectively distinguish similar relationships.

Abstract

Visual Relationship Forecasting (VRF) aims to anticipate relations among objects without observing future visual content. The task relies on capturing and modeling the semantic coherence in object interactions, as it underpins the evolution of events and scenes in videos. However, existing VRF datasets offer limited support for learning such coherence because of noisy annotations and weak correlations between different actions and relationship transitions in subject-object pairs. Furthermore, existing methods struggle to distinguish similar relationships and overfit to unchanging relationships in consecutive frames. To address these challenges, we present SemCoBench, a benchmark that emphasizes semantic coherence for visual relationship forecasting. Based on action labels and short-term subject-object pairs, SemCoBench decomposes relationship categories and dynamics by…

Taxonomy

Topics: Human Pose and Action Recognition · Multimodal Machine Learning Applications · Video Surveillance and Tracking Methods

Methods: Multi-Head Attention · Attention Is All You Need · Linear Layer · Absolute Position Encodings · Position-Wise Feed-Forward Layer · Adam · Layer Normalization · Byte Pair Encoding · Dropout · Label Smoothing