Comparing the Max and Noisy-Or Pooling Functions in Multiple Instance Learning for Weakly Supervised Sequence Learning Tasks
Yun Wang, Juncheng Li, Florian Metze

TL;DR
This paper compares the max and noisy-or pooling functions in multiple instance learning for weakly supervised sequence tasks. On a speech recognition task and a sound event detection task, max pooling localizes phonemes and sound events, while noisy-or pooling does not.
Contribution
The paper provides a theoretical explanation for why the max and noisy-or pooling functions behave differently on sequence learning tasks.
Findings
Max pooling effectively localizes phonemes and sound events.
Noisy-or pooling fails to localize events.
A theoretical analysis explains why the two pooling functions behave differently.
Abstract
Many sequence learning tasks require the localization of certain events in sequences. Because it can be expensive to obtain strong labeling that specifies the starting and ending times of the events, modern systems are often trained with weak labeling without explicit timing information. Multiple instance learning (MIL) is a popular framework for learning from weak labeling. In a common scenario of MIL, it is necessary to choose a pooling function to aggregate the predictions for the individual steps of the sequences. In this paper, we compare the "max" and "noisy-or" pooling functions on a speech recognition task and a sound event detection task. We find that max pooling is able to localize phonemes and sound events, while noisy-or pooling fails. We provide a theoretical explanation of the different behavior of the two pooling functions on sequence learning tasks.
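To make the two pooling functions concrete, here is a minimal sketch using their standard definitions: max pooling takes the largest frame-level probability, and noisy-or pooling computes one minus the product of the frame-level "off" probabilities. The function names, the 100-frame sequences, and the probability values are illustrative assumptions, not taken from the paper, and the example is not a reproduction of the authors' analysis.

```python
import numpy as np


def max_pool(frame_probs):
    """Sequence-level probability = the largest frame-level probability."""
    return float(np.max(np.asarray(frame_probs, dtype=float)))


def noisy_or_pool(frame_probs):
    """Sequence-level probability = 1 - prod_i (1 - p_i).

    Treats frames as independent detectors: the event is absent from the
    sequence only if every frame independently says it is absent.
    """
    p = np.asarray(frame_probs, dtype=float)
    return float(1.0 - np.prod(1.0 - p))


# Hypothetical frame-level predictions for a 100-frame sequence.
diffuse = np.full(100, 0.05)   # every frame weakly "on", no clear location
peaked = np.zeros(100)
peaked[40] = 0.95              # one confident frame, all others "off"

for name, probs in [("diffuse", diffuse), ("peaked", peaked)]:
    print(name, "max:", max_pool(probs), "noisy-or:", noisy_or_pool(probs))
```

In this toy setting, noisy-or pooling assigns a sequence-level probability above 0.99 to the diffuse sequence even though no single frame exceeds 0.05, whereas max pooling needs at least one confident frame to produce a high sequence-level probability. This is one way to see how a sequence-level prediction under noisy-or pooling can be high without any frame pointing to where the event occurs.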
Taxonomy
Topics: Music and Audio Processing · Speech and Audio Processing · Speech Recognition and Synthesis
