Weakly Supervised Multiple Instance Learning for Whale Call Detection and Temporal Localization in Long-Duration Passive Acoustic Monitoring
Ragib Amin Nihal, Benjamin Yen, Runwu Shi, Kazuhiro Nakadai

TL;DR
This paper presents DSMIL-LocNet, a weakly supervised learning framework that detects and temporally localizes whale calls in long-duration acoustic recordings using only segment-level labels, which reduces the annotation burden that limits the scalability of marine monitoring.
Contribution
Introduction of DSMIL-LocNet, a dual-stream Multiple Instance Learning (MIL) framework that processes long audio segments to detect and localize whale calls using only segment-level labels.
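In MIL, each long segment is treated as a "bag" that carries a single label, and the bag is divided into shorter, unlabeled "instances". The page does not describe how DSMIL-LocNet builds its bags, so the following is a minimal sketch, assuming the raw waveform is cut into fixed-length, optionally overlapping windows. The function name `make_bag`, the 250 Hz sample rate, and the 10 s instance length are hypothetical values chosen for illustration, not parameters taken from the paper.

```python
from typing import Optional

import numpy as np


def make_bag(
    audio: np.ndarray,
    sample_rate: int,
    instance_seconds: float,
    hop_seconds: Optional[float] = None,
) -> np.ndarray:
    """Split one long recording into fixed-length instances (one MIL bag).

    Hypothetical helper: only the bag as a whole carries a label; the
    instances are unlabeled. Returns shape (n_instances, samples_per_instance).
    """
    # Default to non-overlapping windows.
    if hop_seconds is None:
        hop_seconds = instance_seconds
    win = int(round(instance_seconds * sample_rate))
    hop = int(round(hop_seconds * sample_rate))
    if win <= 0 or hop <= 0:
        raise ValueError("instance and hop lengths must be positive")
    if len(audio) < win:
        raise ValueError("recording is shorter than one instance")
    # Number of full windows that fit; a trailing partial window is dropped.
    n = 1 + (len(audio) - win) // hop
    return np.stack([audio[i * hop : i * hop + win] for i in range(n)])


# Hypothetical example: a 2-minute segment at 250 Hz cut into 10 s instances.
recording = np.zeros(2 * 60 * 250)
bag = make_bag(recording, sample_rate=250, instance_seconds=10.0)
print(bag.shape)  # (12, 2500)
```

The instance length fixes the temporal resolution of any localization derived from instance scores, which is why instance size matters separately from the overall segment length.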
Findings
Longer audio contexts improve classification performance (F1: 0.8-0.9).
Medium-sized instances yield localization precision of 0.65-0.70.
The results suggest that the MIL approach can enable scalable marine acoustic monitoring.
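Localization in this setting means turning per-instance scores back into time spans within the bag. The page does not define the paper's localization metric, so the sketch below assumes non-overlapping, equal-length instances, a hypothetical 0.5 threshold, and one plausible instance-level precision (true-positive instances over predicted-positive instances). The helper names `scores_to_intervals` and `instance_precision` are illustrative, not from the DSMIL-Loc repository.

```python
import numpy as np


def scores_to_intervals(scores, instance_seconds, threshold=0.5):
    """Merge consecutive above-threshold instances into (start_s, end_s) spans.

    Assumes non-overlapping instances of equal length, in time order.
    """
    intervals = []
    start = None
    for i, s in enumerate(scores):
        if s >= threshold and start is None:
            start = i  # a run of positive instances begins
        elif s < threshold and start is not None:
            intervals.append((start * instance_seconds, i * instance_seconds))
            start = None
    if start is not None:  # close a run that reaches the end of the bag
        intervals.append((start * instance_seconds, len(scores) * instance_seconds))
    return intervals


def instance_precision(scores, labels, threshold=0.5):
    """Fraction of predicted-positive instances that are truly positive.

    Returns 0.0 by convention when nothing is predicted positive.
    """
    pred = np.asarray(scores) >= threshold
    labels = np.asarray(labels, dtype=bool)
    n_pred = int(pred.sum())
    return float((pred & labels).sum() / n_pred) if n_pred else 0.0


scores = [0.1, 0.7, 0.9, 0.2, 0.6]
print(scores_to_intervals(scores, instance_seconds=10.0))
# [(10.0, 30.0), (40.0, 50.0)]
print(instance_precision(scores, labels=[0, 1, 1, 0, 0]))  # about 0.667
```

Larger instances give fewer, coarser spans; smaller ones give finer spans but each instance sees less of the call, which is one way to read why an intermediate instance size could localize best.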
Abstract
Marine ecosystem monitoring via Passive Acoustic Monitoring (PAM) generates vast amounts of data, but deep learning approaches often require precise annotations and short audio segments. We introduce DSMIL-LocNet, a Multiple Instance Learning framework for whale call detection and localization that uses only bag-level labels. Our dual-stream model processes 2-30 minute audio segments, leveraging spectral and temporal features with attention-based instance selection. Tests on Antarctic whale data show that longer contexts improve classification (F1: 0.8-0.9), while medium-length instances yield localization precision of 0.65-0.70. These results suggest that MIL can enable scalable marine monitoring. Code: https://github.com/Ragib-Amin-Nihal/DSMIL-Loc
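The abstract mentions attention-based instance selection but does not spell out the layers. As an illustration only, the sketch below follows the pooling of the original DSMIL (Li et al., 2021), which the model name suggests but which has not been checked against the DSMIL-Loc code: one stream scores every instance and picks the top-scoring "critical" instance, and a second stream attends over all instances by query similarity to that critical instance. In this paper, "dual-stream" may instead refer to the spectral and temporal feature branches, which are not modeled here. The function `dsmil_style_pool`, all weight matrices, and all dimensions are hypothetical.

```python
import numpy as np


def softmax(x: np.ndarray) -> np.ndarray:
    e = np.exp(x - x.max())  # subtract max for numerical stability
    return e / e.sum()


def dsmil_style_pool(H, w_inst, W_q, W_v, w_bag):
    """Score one bag of instance embeddings H with shape (n_instances, d).

    Stream 1 scores each instance and picks the top one (the critical
    instance). Stream 2 attends over all instances by query similarity to
    that critical instance and scores the weighted bag embedding.
    """
    inst_logits = H @ w_inst              # (n,) per-instance logits
    m = int(np.argmax(inst_logits))       # index of the critical instance
    Q = H @ W_q                           # (n, dq) query vectors
    attn = softmax(Q @ Q[m])              # (n,) weights that sum to 1
    bag_embedding = attn @ (H @ W_v)      # (dv,) attention-weighted values
    bag_logit = float(bag_embedding @ w_bag)
    # Average the two streams' logits, then map to a probability.
    logit = 0.5 * (inst_logits[m] + bag_logit)
    prob = 1.0 / (1.0 + np.exp(-logit))
    return prob, attn, inst_logits


# Hypothetical example: 12 instances with 16-dim embeddings, random weights.
rng = np.random.default_rng(0)
d, dq, dv = 16, 8, 8
H = rng.normal(size=(12, d))
prob, attn, inst_logits = dsmil_style_pool(
    H,
    w_inst=rng.normal(size=d),
    W_q=rng.normal(size=(d, dq)),
    W_v=rng.normal(size=(d, dv)),
    w_bag=rng.normal(size=dv),
)
print(round(float(prob), 3), attn.shape)
```

Because the pooling uses only max, softmax, and weighted sums over instances, the bag score does not depend on instance order, and the per-instance logits or attention weights can serve as a localization signal under the bag-level-only supervision described above.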
Taxonomy
Topics: Marine animal studies overview · Underwater Acoustics Research · Animal Vocal Communication and Behavior
