Performance and Complexity Trade-off Optimization of Speech Models During Training

Esteban Gómez; Tom Bäckström

arXiv:2601.13704 · cs.SD · January 22, 2026

Open Access

TL;DR

This paper introduces a novel reparameterization method that allows simultaneous optimization of speech model performance and computational complexity during training, avoiding heuristic pruning.

Contribution

It proposes a feature noise injection technique that enables joint training for performance and computational complexity, in contrast to traditional post hoc pruning methods.
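
As a rough illustration of how noise injection can make a complexity term trainable, the sketch below adds reparameterized Gaussian noise to each feature and attaches a smooth penalty to the noise levels. This is a generic sketch, not the paper's formulation: the function names, the per-feature `log_sigmas` parameters, the `trade_off` weight, and the bit-rate proxy (borrowed from the Gaussian channel capacity formula) are all assumptions made for illustration.

```python
import math
import random


def inject_feature_noise(features, log_sigmas, rng):
    """Add zero-mean Gaussian noise to each feature.

    Written as x + sigma * eps with eps ~ N(0, 1), so that for a fixed
    noise draw the output is a differentiable function of sigma.
    `log_sigmas` are hypothetical per-feature parameters (sigma = exp(log_sigma)).
    """
    return [x + math.exp(ls) * rng.gauss(0.0, 1.0)
            for x, ls in zip(features, log_sigmas)]


def complexity_proxy(log_sigmas, signal_power=1.0):
    """Assumed smooth stand-in for complexity, in bits.

    Sums 0.5 * log2(1 + signal_power / sigma^2) over features, so a feature
    drowned in noise contributes almost nothing to the total.
    """
    return sum(0.5 * math.log2(1.0 + signal_power / math.exp(2.0 * ls))
               for ls in log_sigmas)


def joint_loss(task_loss, log_sigmas, trade_off):
    """Task loss plus a weighted complexity term (`trade_off` is hypothetical)."""
    return task_loss + trade_off * complexity_proxy(log_sigmas)


if __name__ == "__main__":
    rng = random.Random(0)
    log_sigmas = [0.0, 2.0]  # second feature is far noisier than the first
    print(inject_feature_noise([1.0, -1.0], log_sigmas, rng))
    print(joint_loss(task_loss=1.0, log_sigmas=log_sigmas, trade_off=0.1))
```

In a sketch like this, raising a feature's noise level lowers the penalty but tends to hurt the task loss, so a gradient-based optimizer can trade one against the other. Features whose noise level grows very large would be natural candidates for removal after training.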

Findings

01. Effective in voice activity detection
02. Improves audio anti-spoofing models
03. Dynamically balances model size and accuracy

Abstract

In speech machine learning, neural network models are typically designed by choosing an architecture with fixed layer sizes and structure. These models are then trained to maximize performance on metrics aligned with the task's objective. While the overall architecture is usually guided by prior knowledge of the task, the sizes of individual layers are often chosen heuristically. However, this approach does not guarantee an optimal trade-off between performance and computational complexity; consequently, post hoc methods such as weight quantization or model pruning are typically employed to reduce computational cost. This occurs because stochastic gradient descent (SGD) methods can only optimize differentiable functions, while factors influencing computational complexity, such as layer sizes and floating-point operations per second (FLOP/s), are non-differentiable and require modifying…

Taxonomy

Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Stochastic Gradient Optimization Techniques