TRoVe: Discovering Error-Inducing Static Feature Biases in Temporal Vision-Language Models
Maya Varma, Jean-Benoit Delbrouck, Sophie Ostmeier, Akshay Chaudhari, Curtis Langlotz

TL;DR
TRoVe is an automated method that identifies static feature biases in temporal vision-language models, helping to understand and mitigate systematic errors in visual change prediction tasks.
Contribution
We introduce TRoVe, a novel automated approach for discovering error-inducing static feature biases in temporal VLMs, and validate it using a comprehensive evaluation framework.
Findings
TRoVe accurately identifies static feature biases, outperforming baselines by 28.6%.
Applying TRoVe reveals previously unknown biases in off-the-shelf models.
Knowledge of biases improves model performance at test time.
Abstract
Vision-language models (VLMs) have made great strides in addressing temporal understanding tasks, which involve characterizing visual changes across a sequence of images. However, recent works have suggested that when making predictions, VLMs may rely on static feature biases, such as background or object features, rather than dynamic visual changes. Static feature biases are a type of shortcut and can contribute to systematic prediction errors on downstream tasks; as a result, identifying and characterizing error-inducing static feature biases is critical prior to real-world model deployment. In this work, we introduce TRoVe, an automated approach for discovering error-inducing static feature biases learned by temporal VLMs. Given a trained VLM and an annotated validation dataset associated with a downstream classification task, TRoVe extracts candidate static features from the dataset…
Taxonomy
Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Machine Learning in Healthcare
