Learning the joint distribution of two sequences using little or no paired data
Soroosh Mariooryad, Matt Shannon, Siyuan Ma, Tom Bagby, David Kao, Daisy Stanton, Eric Battenberg, RJ Skerry-Ryan

TL;DR
This paper introduces a variational noisy channel model for learning joint distributions of two sequences, like text and speech, with minimal paired data, enabling effective cross-modal association in low-resource settings.
Contribution
It proposes a novel variational inference method with a KL encoder loss for training on unpaired categorical data, guiding sequence-to-sequence modeling with limited paired samples.
Findings
Tiny amounts of paired data (5 minutes) suffice for learning associations.
The model effectively leverages large unpaired datasets.
The conditional independence assumptions under which the joint distribution can be identified from unpaired data guide the architecture design.
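The role of the independence assumptions can be illustrated with a small sketch: marginals alone do not determine a joint distribution. The two 2x2 tables below are made-up toy values, not data from the paper, and the helper `marginals` is a hypothetical name used only for this illustration.

```python
# Toy illustration (hypothetical numbers): two different joint distributions
# over binary (x, y) that share exactly the same marginals.

def marginals(joint):
    """Return (p(x), p(y)) for a joint table indexed as joint[x][y]."""
    px = [sum(row) for row in joint]
    py = [sum(joint[x][y] for x in range(len(joint)))
          for y in range(len(joint[0]))]
    return px, py

independent = [[0.25, 0.25],
               [0.25, 0.25]]   # x and y are independent
correlated = [[0.5, 0.0],
              [0.0, 0.5]]      # x always equals y

print(marginals(independent))
print(marginals(correlated))
```

Both tables yield uniform marginals for x and for y, so unpaired samples cannot tell them apart without extra structural assumptions or some paired data.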
Abstract
We present a noisy channel generative model of two sequences, for example text and speech, which enables uncovering the association between the two modalities when limited paired data is available. To address the intractability of the exact model under a realistic data setup, we propose a variational inference approximation. To train this variational model with categorical data, we propose a KL encoder loss approach which has connections to the wake-sleep algorithm. Identifying the joint or conditional distributions by only observing unpaired samples from the marginals is only possible under certain conditions in the data distribution, and we discuss under what type of conditional independence assumptions that might be achieved, which guides the architecture designs. Experimental results show that even a tiny amount of paired data (5 minutes) is sufficient to learn to relate the two…
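As a rough illustration of the variational approximation mentioned in the abstract, the sketch below evaluates the textbook evidence lower bound (ELBO) for a tiny noisy channel with a binary latent x and a binary observation y. This is the generic bound, not the paper's KL encoder loss, and every number and function name here (`prior`, `channel`, `log_marginal`, `elbo`, `posterior`) is a hypothetical choice for the example. The exact marginal is computed by enumeration, which is only feasible because x takes two values; for real sequences that sum is intractable.

```python
import math

# Hypothetical toy noisy channel: x is the unobserved source, y the observation.
prior = [0.7, 0.3]                    # p(x)
channel = [[0.9, 0.1],                # channel[x][y] = p(y | x)
           [0.2, 0.8]]

def log_marginal(y):
    """Exact log p(y), summing over both values of x."""
    return math.log(sum(prior[x] * channel[x][y] for x in range(2)))

def elbo(y, q):
    """E_q[log p(x) + log p(y|x) - log q(x)] for an approximate posterior q."""
    return sum(q[x] * (math.log(prior[x]) + math.log(channel[x][y]) - math.log(q[x]))
               for x in range(2) if q[x] > 0)

def posterior(y):
    """Exact posterior p(x | y) via Bayes' rule."""
    z = sum(prior[x] * channel[x][y] for x in range(2))
    return [prior[x] * channel[x][y] / z for x in range(2)]

y = 1
print(log_marginal(y), elbo(y, [0.5, 0.5]), elbo(y, posterior(y)))
```

With a uniform q the bound lies strictly below log p(y); with q set to the exact posterior the bound is tight, which is the gap a learned encoder tries to close.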
Taxonomy
Topics: Music and Audio Processing · Speech Recognition and Synthesis · Algorithms and Data Compression
Methods: Tanh Activation · Sigmoid Activation · Long Short-Term Memory · Variational Inference · Sequence to Sequence
