Neural Architecture Search For LF-MMI Trained Time Delay Neural Networks
Shoukang Hu, Xurong Xie, Mingyu Cui, Jiajun Deng, Shansong Liu, Jianwei Yu, Mengzhe Geng, Xunying Liu, Helen Meng

TL;DR
This paper applies neural architecture search (NAS) to automatically optimize factored time delay neural networks (TDNN-Fs) for speech recognition, achieving word error rate (WER) reductions of up to 1.2% absolute and a 31% reduction in model size over traditional systems.
Contribution
It introduces NAS methods tailored for TDNN-Fs, integrating architecture learning with LF-MMI training, and demonstrates substantial performance improvements and resource efficiency.
Findings
Up to 1.2% absolute WER reduction
31% reduction in model size
State-of-the-art WERs on benchmark datasets
Abstract
State-of-the-art automatic speech recognition (ASR) system development is data and computation intensive. The optimal design of deep neural networks (DNNs) for these systems often requires expert knowledge and empirical evaluation. In this paper, a range of neural architecture search (NAS) techniques are used to automatically learn two types of hyper-parameters of factored time delay neural networks (TDNN-Fs): i) the left and right splicing context offsets; and ii) the dimensionality of the bottleneck linear projection at each hidden layer. These techniques include the differentiable neural architecture search (DARTS) method integrating architecture learning with lattice-free MMI training; Gumbel-Softmax and pipelined DARTS methods reducing the confusion over candidate architectures and improving the generalization of architecture selection; and Penalized DARTS incorporating resource…
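As a rough illustration of the relaxation the abstract describes, the sketch below shows how a DARTS-style softmax over architecture logits can blend candidate bottleneck projections, and how Gumbel-Softmax perturbs those logits before normalizing. It is a minimal sketch, not the authors' implementation: the candidate dimensions, the logit values, the stand-in outputs, and all function names are hypothetical, and LF-MMI training of the weights is not shown.

```python
import math
import random


def softmax(logits, temperature=1.0):
    """Numerically stable softmax over architecture logits."""
    scaled = [value / temperature for value in logits]
    peak = max(scaled)
    exps = [math.exp(value - peak) for value in scaled]
    total = sum(exps)
    return [value / total for value in exps]


def gumbel_softmax(logits, temperature, rng):
    """Softmax over logits perturbed by Gumbel(0, 1) noise."""
    noisy = []
    for value in logits:
        u = max(rng.random(), 1e-12)  # guard against log(0)
        noisy.append(value - math.log(-math.log(u)))
    return softmax(noisy, temperature)


def mix_candidates(weights, candidate_outputs):
    """Weighted sum of equally sized candidate output vectors."""
    size = len(candidate_outputs[0])
    return [
        sum(w * out[i] for w, out in zip(weights, candidate_outputs))
        for i in range(size)
    ]


def select_choice(logits, choices):
    """After the search, keep the candidate with the largest logit."""
    best = max(range(len(logits)), key=lambda i: logits[i])
    return choices[best]


# Hypothetical search space and logits, for illustration only.
bottleneck_dims = [64, 128, 256]
alpha = [0.1, 0.9, 0.2]

# Stand-in outputs of the three candidate projections for one frame,
# assumed to be mapped back to a common hidden size so they can be summed.
candidate_outputs = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]

darts_weights = softmax(alpha)
gumbel_weights = gumbel_softmax(alpha, temperature=0.5, rng=random.Random(0))
mixed_output = mix_candidates(darts_weights, candidate_outputs)
chosen_dim = select_choice(alpha, bottleneck_dims)
```

In this sketch, a lower Gumbel-Softmax temperature pushes the sampled weights toward a one-hot vector, which is one way to read the abstract's point about reducing confusion over candidate architectures. The same mixing could in principle be applied to candidate context offsets instead of bottleneck dimensions.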
Taxonomy
Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Neural Networks and Applications
Methods: SPEED: Separable Pyramidal Pooling EncodEr-Decoder for Real-Time Monocular Depth Estimation on Low-Resource Settings · Differentiable Architecture Search
