Affordance-First Decomposition for Continual Learning in Video-Language Understanding

Mengzhu Xu; Hanzhi Liu; Ningkang Peng; Qianyu Chen; Canran Xiao

arXiv:2512.00694 · cs.CV · December 2, 2025

TL;DR

This paper introduces Affordance-First Decomposition (AFD), a novel continual learning approach for video-language understanding that explicitly separates stable affordance representations from adaptable components, achieving state-of-the-art results.

Contribution

AFD is the first method to explicitly decompose video representations into stable affordance tokens and a lightweight, query-driven adaptation mechanism under realistic constraints.

Findings

01. Achieves 51.6% accuracy with -1.8% forgetting on domain-incremental VideoQA.

02. Attains 29.6% on MQ and 20.7% on NLQ in ViLCo, measured as recall (R@k) at a temporal IoU threshold.

03. Reaches 39.5% accuracy with -1.6% forgetting on time-incremental iVQA.
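
The findings above report an accuracy and a forgetting value for each protocol. This page does not state how the paper computes them, so the sketch below assumes the definitions commonly used in continual learning: final average accuracy, and forgetting as the mean drop from each earlier task's best past accuracy to its final accuracy. The matrix layout `acc[i][j]` and both function names are hypothetical. Under this definition a negative value means earlier tasks improved after later training, which is one way to read figures such as -1.8%.

```python
from typing import List


def average_accuracy(acc: List[List[float]]) -> float:
    """Mean accuracy over all tasks after the final training stage.

    acc[i][j] is the accuracy on task j measured after training stage i.
    """
    final = acc[-1]
    return sum(final) / len(final)


def average_forgetting(acc: List[List[float]]) -> float:
    """Mean over earlier tasks of (best earlier accuracy - final accuracy).

    Assumed definition, not confirmed by the source. The last task is
    excluded because it has no later stage in which to be forgotten.
    """
    num_stages = len(acc)
    if num_stages < 2:
        return 0.0
    drops = []
    for j in range(num_stages - 1):
        # Best accuracy on task j at any stage from j up to the second-to-last.
        best_before = max(acc[i][j] for i in range(j, num_stages - 1))
        drops.append(best_before - acc[-1][j])
    return sum(drops) / len(drops)


# Toy numbers for illustration only; they are not results from the paper.
toy = [
    [50.0, 0.0, 0.0],
    [52.0, 40.0, 0.0],
    [53.0, 41.0, 60.0],
]
print(average_accuracy(toy), average_forgetting(toy))
```

In the toy matrix both earlier tasks end one point above their best earlier score, so the forgetting value is negative.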

Abstract

Continual learning for video-language understanding is increasingly important as models face non-stationary data, domains, and query styles, yet prevailing solutions blur what should stay stable versus what should adapt, rely on static routing/capacity, or require replaying past videos. We aim to explicitly specify where stability lives and where plasticity should be focused under realistic memory and privacy constraints. We introduce Affordance-First Decomposition (AFD): videos are mapped to slowly varying affordance tokens that form a shared, time-aligned substrate, while a lightweight, query-routed, conflict-aware scheduler concentrates adaptation and grows capacity only when needed. The substrate is stabilized via weak alignment and teacher consistency, and training uses question-only replay. AFD achieves state-of-the-art across protocols: 51.6% average accuracy with -1.8%…
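
The abstract states that training uses question-only replay, meaning past question text is retained rather than past videos. The sketch below is one possible way to keep such a memory under a fixed budget. It is not the authors' implementation: the class name `QuestionReplayBuffer`, its fields, and the use of reservoir sampling to choose which questions to keep are all assumptions made for illustration.

```python
import random
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class QuestionReplayBuffer:
    """Fixed-size memory holding question strings only (no video data).

    Hypothetical helper: uses reservoir sampling so every question seen so
    far has an equal chance of remaining in the buffer.
    """
    capacity: int
    seed: Optional[int] = None
    items: List[str] = field(default_factory=list)
    seen: int = 0

    def __post_init__(self) -> None:
        self._rng = random.Random(self.seed)

    def add(self, question: str) -> None:
        """Offer one question from the current task's stream."""
        self.seen += 1
        if len(self.items) < self.capacity:
            self.items.append(question)
            return
        # Reservoir step: overwrite a stored slot with probability capacity / seen.
        slot = self._rng.randrange(self.seen)
        if slot < self.capacity:
            self.items[slot] = question

    def sample(self, k: int) -> List[str]:
        """Draw up to k stored questions, without replacement, for replay."""
        k = min(k, len(self.items))
        return self._rng.sample(self.items, k)


buffer = QuestionReplayBuffer(capacity=3, seed=0)
for i in range(100):
    buffer.add(f"question {i}")
replayed = buffer.sample(2)  # mix these into the next task's batches
```

Because only text is stored, the memory cost per item is small and no past video frames are retained, which matches the memory and privacy constraints the abstract mentions.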

Taxonomy

Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Generative Adversarial Networks and Image Synthesis