Video and Language Alignment in 2D Systems for 3D Multi-object Scenes with Multi-Information Derivative-Free Control

Jason Armitage; Rico Sennrich

arXiv:2512.24826 · cs.CV · January 1, 2026

TL;DR

This paper presents a derivative-free control method that lets off-the-shelf cross-modal systems trained on 2D visual inputs adapt online to object occlusions and differentiate features in multi-object 3D scenes.

Contribution

The paper introduces a method that improves multivariate mutual information estimates through regret minimisation with derivative-free optimisation, and uses those estimates to control an in-scene camera without pretraining or finetuning.
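
To make "multivariate mutual information" concrete, here is a minimal sketch that computes two common multivariate measures, total correlation and three-way interaction information, from a small discrete joint distribution. This listing does not state which measure or estimator the paper uses, so both choices are assumptions, and every function name below is hypothetical.

```python
import math
from itertools import product


def entropy(dist):
    """Shannon entropy in bits of a dict mapping outcomes to probabilities."""
    return -sum(q * math.log2(q) for q in dist.values() if q > 0)


def marginal(joint, axes):
    """Marginalise a joint distribution (dict keyed by tuples) onto the given axes."""
    out = {}
    for outcome, q in joint.items():
        key = tuple(outcome[a] for a in axes)
        out[key] = out.get(key, 0.0) + q
    return out


def total_correlation(joint):
    """Sum of the single-variable entropies minus the joint entropy."""
    n = len(next(iter(joint)))
    singles = sum(entropy(marginal(joint, (i,))) for i in range(n))
    return singles - entropy(joint)


def interaction_information(joint):
    """Three-variable case, I(X;Y) - I(X;Y|Z), by inclusion-exclusion over entropies."""
    def h(axes):
        return entropy(marginal(joint, axes))
    return (h((0,)) + h((1,)) + h((2,))
            - h((0, 1)) - h((0, 2)) - h((1, 2))
            + h((0, 1, 2)))


# Toy joint distribution (an assumption for illustration): X and Y are fair
# independent bits and Z = X XOR Y, so no pair is informative on its own.
xor_joint = {(x, y, x ^ y): 0.25 for x, y in product((0, 1), repeat=2)}
tc = total_correlation(xor_joint)
ii = interaction_information(xor_joint)
```

In the XOR toy case every pairwise mutual information is zero, yet the three variables are jointly dependent, which is the kind of structure a pairwise measure would miss.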

Findings

01 Enhanced adaptation to occlusions in 3D scenes

02 Improved cross-modal task performance

03 No need for pretraining or finetuning

Abstract

Cross-modal systems trained on 2D visual inputs are presented with a dimensional shift when processing 3D scenes. An in-scene camera bridges the dimensionality gap but requires learning a control module. We introduce a new method that improves multivariate mutual information estimates by regret minimisation with derivative-free optimisation. Our algorithm enables off-the-shelf cross-modal systems trained on 2D visual inputs to adapt online to object occlusions and differentiate features. The pairing of expressive measures and value-based optimisation assists control of an in-scene camera to learn directly from the noisy outputs of vision-language models. The resulting pipeline improves performance in cross-modal tasks on multi-object 3D scenes without resorting to pretraining or finetuning.
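
The idea of steering an in-scene camera from noisy model outputs, without gradients, can be sketched with a UCB-style bandit over a few candidate viewpoints. This is a generic regret-minimising, derivative-free stand-in and not the authors' algorithm. The scoring function, the candidate azimuths, the noise level, and all names are hypothetical.

```python
import math
import random


def noisy_alignment_score(azimuth_deg, rng):
    """Hypothetical stand-in for a vision-language model's score of one camera view.

    The peak at 120 degrees and the noise level are made up for illustration.
    """
    clean = math.cos(math.radians(azimuth_deg - 120.0))
    return clean + rng.gauss(0.0, 0.2)


def ucb_camera_search(azimuths, score_fn, rounds, rng, c=2.0):
    """Pick camera azimuths with a UCB-style rule; needs only noisy score values."""
    counts = [0] * len(azimuths)
    means = [0.0] * len(azimuths)
    for t in range(1, rounds + 1):
        if t <= len(azimuths):
            arm = t - 1  # visit every candidate viewpoint once
        else:
            # Running mean plus an exploration bonus that shrinks with visits.
            arm = max(
                range(len(azimuths)),
                key=lambda i: means[i] + math.sqrt(c * math.log(t) / counts[i]),
            )
        reward = score_fn(azimuths[arm], rng)
        counts[arm] += 1
        means[arm] += (reward - means[arm]) / counts[arm]  # incremental mean
    most_visited = max(range(len(azimuths)), key=lambda i: counts[i])
    return azimuths[most_visited], counts, means


candidate_azimuths = [0.0, 60.0, 120.0, 180.0, 240.0, 300.0]
best_azimuth, visit_counts, mean_scores = ucb_camera_search(
    candidate_azimuths, noisy_alignment_score, rounds=2000, rng=random.Random(0)
)
```

Viewpoints with lower scores are sampled less and less often as evidence accumulates, which is what keeps cumulative regret low. The search uses only score values, so the vision-language model can stay a black box.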

Taxonomy

Topics: Multimodal Machine Learning Applications · Generative Adversarial Networks and Image Synthesis · Domain Adaptation and Few-Shot Learning