Joint-Centric Dual Contrastive Alignment with Structure-Preserving and Information-Balanced Regularization
Habibeh Naderi, Behrouz Haji Soleimani, Stan Matwin

TL;DR
HILBERT is a multimodal framework that uses reciprocal contrastive training and regularization to learn structured, balanced audio-text document representations from long sequences, improving performance in low-resource, imbalanced tasks.
Contribution
HILBERT introduces a reciprocal dual contrastive objective, together with auxiliary regularizers, for stable long-sequence multimodal embedding, targeting cross-modal alignment under severe audio-text dimensional imbalance.
Findings
HILBERT achieves superior performance on imbalanced multi-class audio-text tasks.
The reciprocal contrastive objective effectively aligns audio and text modalities.
Regularizers stabilize long-sequence fusion and preserve structural consistency.
Abstract
We propose HILBERT (HIerarchical Long-sequence Balanced Embedding with Reciprocal contrastive Training), a cross-attentive multimodal framework for learning document-level audio-text representations from long, segmented sequences in low-resource data settings. HILBERT leverages frozen pre-trained speech and language encoders to extract segment-level features, which are aggregated via cross-modal attention and self-attentive pooling to form modality-specific document representations and a joint cross-attentive embedding. To align modalities while preserving modality-specific structure under severe audio-text dimensional imbalance, we introduce a reciprocal dual contrastive objective that simultaneously aligns audio-to-joint and text-to-joint representations, rather than directly contrasting audio and text alone. Two auxiliary regularizers further stabilize long-sequence fusion: a…
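To make the idea of aligning each modality to the joint embedding (rather than audio directly to text) concrete, here is a minimal NumPy sketch. It is not the authors' implementation. It assumes an InfoNCE-style loss, reads "reciprocal" as applying the loss in both directions of each pair, and assumes the audio, text, and joint document embeddings have already been projected to a common dimension. The function names, the temperature value, and the equal weighting of the two terms are all hypothetical.

```python
import numpy as np


def l2_normalize(x, eps=1e-8):
    """Scale each row to unit length so dot products are cosine similarities."""
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + eps)


def info_nce(anchor, target, temperature=0.1):
    """InfoNCE over a batch: row i of `anchor` should match row i of `target`.

    All other rows of `target` in the batch act as negatives.
    """
    logits = l2_normalize(anchor) @ l2_normalize(target).T / temperature
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_probs))


def dual_contrastive_loss(audio, text, joint, temperature=0.1):
    """Align audio-to-joint and text-to-joint; audio and text are never contrasted directly.

    Assumption: each pair is contrasted in both directions and the two
    modality terms are summed with equal weight.
    """
    loss_audio = 0.5 * (info_nce(audio, joint, temperature)
                        + info_nce(joint, audio, temperature))
    loss_text = 0.5 * (info_nce(text, joint, temperature)
                       + info_nce(joint, text, temperature))
    return loss_audio + loss_text


# Toy batch: 8 documents, 16-dimensional embeddings (hypothetical sizes).
rng = np.random.default_rng(0)
joint = rng.normal(size=(8, 16))
audio = joint + 0.05 * rng.normal(size=(8, 16))
text = joint + 0.05 * rng.normal(size=(8, 16))

aligned = dual_contrastive_loss(audio, text, joint)
# Mismatch documents by shifting the joint embeddings one row.
mismatched = dual_contrastive_loss(audio, text, np.roll(joint, 1, axis=0))
print(f"aligned: {aligned:.4f}  mismatched: {mismatched:.4f}")
```

In this toy setup the loss is low when each document's audio and text embeddings sit close to its own joint embedding, and high when the joint embeddings are shifted to the wrong documents. Using the joint embedding as the shared anchor is what lets the two modalities be aligned without being contrasted against each other.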
