Objective Soups: Multilingual Multi-Task Modeling for Speech Processing

A F M Saif, Lisha Chen, Xiaodong Cui, Songtao Lu, Brian Kingsbury, and Tianyi Chen

arXiv:2508.09228 · eess.AS · August 14, 2025

TL;DR

This paper explores hierarchical multi-objective optimization strategies for multilingual multi-task speech processing, demonstrating that separating recognition and translation tasks improves model performance and scalability.

Contribution

The paper introduces three multi-objective formulations, together with a lightweight layer-selection mechanism, and shows that hierarchical optimization outperforms flat optimization on speech processing tasks.

Findings

01. Hierarchical MOO outperforms flat optimization in speech tasks.

02. Layer-selection reduces computational overhead.

03. Bi-level separation improves model accuracy.

Abstract

Training a single model for multilingual, multi-task speech processing (MSP) is severely hampered by conflicting objectives between tasks like speech recognition and translation. While multi-objective optimization (MOO) aims to align gradient updates, its effectiveness diminishes as the number of tasks grows, making it difficult to find a common descent direction. This raises a fundamental question: should highly conflicting objectives be optimized jointly or separated into a hierarchical structure? To address this question, this paper investigates three multi-objective MSP formulations, which we refer to as objective soup recipes. These formulations apply multi-objective optimization at different optimization levels to mitigate potential conflicts among all objectives. To ensure efficiency, we introduce a lightweight layer-selection mechanism that computes the…
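
The idea of a "common descent direction" for conflicting objectives can be illustrated with a small sketch. The code below is not the paper's method: it assumes the classic two-objective min-norm rule (as used in MGDA-style multi-objective optimization), and the gradient vectors `g_asr` and `g_st` are made-up toy values standing in for recognition and translation gradients. It checks whether the two gradients conflict (negative inner product) and computes the minimum-norm convex combination, whose negative decreases both objectives to first order.

```python
def dot(u, v):
    """Inner product of two equal-length vectors given as lists."""
    return sum(a * b for a, b in zip(u, v))


def min_norm_combination(g1, g2):
    """Return (alpha, v) with v = alpha*g1 + (1-alpha)*g2 of minimum norm, alpha in [0, 1].

    Closed form for two objectives: alpha = ((g2 - g1) . g2) / ||g1 - g2||^2, clipped to [0, 1].
    """
    diff = [a - b for a, b in zip(g1, g2)]
    denom = dot(diff, diff)
    if denom == 0.0:
        alpha = 0.5  # identical gradients: every weighting yields the same vector
    else:
        alpha = dot([b - a for a, b in zip(g1, g2)], g2) / denom
        alpha = max(0.0, min(1.0, alpha))
    v = [alpha * a + (1.0 - alpha) * b for a, b in zip(g1, g2)]
    return alpha, v


# Hypothetical toy gradients (assumed values, not taken from the paper).
g_asr = [1.0, 0.0]   # stand-in for a speech recognition gradient
g_st = [-0.5, 1.0]   # stand-in for a speech translation gradient

conflict = dot(g_asr, g_st) < 0          # negative inner product signals conflict
alpha, v = min_norm_combination(g_asr, g_st)

# Stepping along -v lowers both losses to first order when both products are positive.
decreases_asr = dot(v, g_asr) > 0
decreases_st = dot(v, g_st) > 0
print(conflict, round(alpha, 4), decreases_asr, decreases_st)
```

With only two objectives this combination has a closed form; with many objectives it requires solving a small quadratic program, and a nonzero common direction may be hard to find, which is the difficulty the abstract points to as the number of tasks grows.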
