Zipper: A Multi-Tower Decoder Architecture for Fusing Modalities
Vicky Zayats, Peter Chen, Melissa Ferrari, Dirk Padfield

TL;DR
Zipper is a multi-tower decoder architecture that fuses independently pre-trained unimodal decoders through cross-attention. In experiments fusing speech and text, it performs competitively even when little aligned data is available.
Contribution
The paper presents Zipper, a novel multi-tower decoder architecture that enables flexible fusion of independently pre-trained unimodal models for multimodal generation tasks.
Findings
Competitive performance with limited aligned data in speech-text fusion
Maintains unimodal capabilities by freezing the corresponding unimodal tower
Pre-trained speech backbone improves text-to-speech generation
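
To make the fusion idea concrete, the following is a minimal sketch of one tower attending to another through cross-attention. It is an illustration under stated assumptions, not the paper's implementation: the function names (`cross_attention`, `fuse`), the single attention head, the residual connection, and the tiny dimensions are all hypothetical. Causal masking, multiple heads, and the choice of which layers are connected are left out. In this sketch the hidden states stand in for the outputs of two frozen towers, and only the projection matrices in `params` would be trained.

```python
import math
import random


def rand_matrix(rows, cols, rng):
    """Small random matrix standing in for learned weights or hidden states."""
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]


def matmul(a, b):
    """Plain matrix product of two lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(query_states, context_states, w_q, w_k, w_v):
    """Single-head cross-attention: one tower's states attend to the other's."""
    q = matmul(query_states, w_q)
    k = matmul(context_states, w_k)
    v = matmul(context_states, w_v)
    scale = math.sqrt(len(q[0]))
    out = []
    for q_row in q:
        scores = [sum(a * b for a, b in zip(q_row, k_row)) / scale for k_row in k]
        weights = softmax(scores)
        out.append([sum(w * v_row[j] for w, v_row in zip(weights, v))
                    for j in range(len(v[0]))])
    return out


def fuse(text_hidden, speech_hidden, params):
    """Add speech information to the text tower's states via a residual path.

    Assumption: text_hidden and speech_hidden come from frozen towers;
    only the matrices in `params` are meant to be trainable.
    """
    attended = cross_attention(text_hidden, speech_hidden,
                               params["w_q"], params["w_k"], params["w_v"])
    return [[t + a for t, a in zip(t_row, a_row)]
            for t_row, a_row in zip(text_hidden, attended)]


rng = random.Random(0)
d_text, d_speech, d_attn = 4, 6, 3            # hypothetical sizes
text_hidden = rand_matrix(3, d_text, rng)      # 3 text positions
speech_hidden = rand_matrix(5, d_speech, rng)  # 5 speech positions
params = {
    "w_q": rand_matrix(d_text, d_attn, rng),
    "w_k": rand_matrix(d_speech, d_attn, rng),
    "w_v": rand_matrix(d_speech, d_text, rng),  # project back to text width
}
fused = fuse(text_hidden, speech_hidden, params)
```

Because the towers have different hidden widths in this sketch, the value projection maps speech states back to the text width so the result can be added to the text states. Keeping the fusion on a residual path means that a zero value projection leaves the text tower's states unchanged, which is one simple way to see how a frozen tower's behaviour can be preserved.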
Abstract
Integrating multiple generative foundation models, especially those trained on different modalities, into something greater than the sum of its parts poses significant challenges. Two key hurdles are the availability of aligned data (concepts that carry similar meaning but are expressed differently in different modalities), and effectively leveraging unimodal representations in cross-domain generative tasks without compromising their original unimodal capabilities. We propose Zipper, a multi-tower decoder architecture that addresses these concerns by using cross-attention to flexibly compose multimodal generative models from independently pre-trained unimodal decoders. In our experiments fusing speech and text modalities, we show the proposed architecture performs very competitively in scenarios with limited aligned text-speech data. We also showcase the flexibility of our model to…
Taxonomy
Topics: Coding theory and cryptography · Error Correcting Code Techniques · graph theory and CDMA systems
