Joint-Centric Dual Contrastive Alignment with Structure-Preserving and Information-Balanced Regularization

Habibeh Naderi; Behrouz Haji Soleimani; Stan Matwin

arXiv:2604.16247·cs.LG·April 20, 2026

TL;DR

HILBERT is a multimodal framework that uses reciprocal contrastive training and regularization to learn structured, balanced audio-text document representations from long sequences, improving performance in low-resource, imbalanced tasks.

Contribution

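The paper introduces a reciprocal dual contrastive objective, which aligns audio-to-joint and text-to-joint representations instead of contrasting audio and text directly, together with two auxiliary regularizers that stabilize long-sequence multimodal fusion under severe audio-text dimensional imbalance.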

Findings

01 HILBERT achieves superior performance on imbalanced multi-class audio-text tasks.

02 The reciprocal contrastive objective effectively aligns audio and text modalities.

03 Regularizers stabilize long-sequence fusion and preserve structural consistency.

Abstract

We propose HILBERT (HIerarchical Long-sequence Balanced Embedding with Reciprocal contrastive Training), a cross-attentive multimodal framework for learning document-level audio-text representations from long, segmented sequences in low-resource data settings. HILBERT leverages frozen pre-trained speech and language encoders to extract segment-level features, which are aggregated via cross-modal attention and self-attentive pooling to form modality-specific document representations and a joint cross-attentive embedding. To align modalities while preserving modality-specific structure under severe audio-text dimensional imbalance, we introduce a reciprocal dual contrastive objective that simultaneously aligns audio-to-joint and text-to-joint representations, rather than directly contrasting audio and text alone. Two auxiliary regularizers further stabilize long-sequence fusion: a…
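The joint-centric alignment described in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the abstract does not give the loss formula, so the InfoNCE form, the symmetric averaging used here to model "reciprocal", the temperature value, and the function names `info_nce` and `dual_contrastive_loss` are all assumptions. The sketch also assumes the audio, text, and joint document embeddings have already been projected to a shared dimension.

```python
import numpy as np

def l2_normalize(x, eps=1e-12):
    # Scale each row to unit length so dot products are cosine similarities.
    return x / (np.linalg.norm(x, axis=1, keepdims=True) + eps)

def info_nce(anchor, target, temperature=0.1):
    # One-directional InfoNCE: row i of `anchor` should match row i of
    # `target`; every other row in the batch serves as a negative.
    logits = l2_normalize(anchor) @ l2_normalize(target).T / temperature
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-np.mean(np.diag(log_probs)))

def dual_contrastive_loss(audio, text, joint, temperature=0.1):
    # Assumption: "reciprocal" is modeled as averaging both directions
    # (modality -> joint and joint -> modality) for each modality.
    audio_joint = 0.5 * (info_nce(audio, joint, temperature)
                         + info_nce(joint, audio, temperature))
    text_joint = 0.5 * (info_nce(text, joint, temperature)
                        + info_nce(joint, text, temperature))
    # Audio and text are never contrasted with each other directly;
    # both are pulled toward the joint embedding.
    return audio_joint + text_joint

# Toy batch: 8 documents with 16-dimensional embeddings (hypothetical sizes).
rng = np.random.default_rng(0)
joint = rng.normal(size=(8, 16))
audio = joint + 0.05 * rng.normal(size=(8, 16))
text = joint + 0.05 * rng.normal(size=(8, 16))
loss = dual_contrastive_loss(audio, text, joint)
```

In this toy batch the audio and text embeddings are small perturbations of the joint embedding, so the loss is low; pairing each document with the wrong joint embedding raises it. The two regularizers mentioned in the abstract are not sketched because their definitions are cut off in this listing.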
