A Differentiable Alignment Framework for Sequence-to-Sequence Modeling via Optimal Transport
Yacouba Kaloga, Shashi Kumar, Petr Motlicek, Ina Kodrasi

TL;DR
This paper introduces a differentiable sequence alignment framework using optimal transport, improving alignment accuracy in end-to-end speech recognition systems and offering a new loss function for better sequence modeling.
Contribution
The paper presents a novel optimal transport-based alignment method and a new loss function for sequence-to-sequence models, enhancing alignment accuracy in ASR tasks.
Findings
Significant improvement in alignment accuracy over CTC methods.
The proposed SOTD metric has desirable theoretical properties.
Experimental results demonstrate better alignment performance on multiple datasets.
Abstract
Accurate sequence-to-sequence (seq2seq) alignment is critical for applications like medical speech analysis and language learning tools relying on automatic speech recognition (ASR). State-of-the-art end-to-end (E2E) ASR systems, such as the Connectionist Temporal Classification (CTC) and transducer-based models, suffer from peaky behavior and alignment inaccuracies. In this paper, we propose a novel differentiable alignment framework based on one-dimensional optimal transport, enabling the model to learn a single alignment and perform ASR in an E2E manner. We introduce a pseudo-metric, called Sequence Optimal Transport Distance (SOTD), over the sequence space and discuss its theoretical properties. Based on the SOTD, we propose Optimal Temporal Transport Classification (OTTC) loss for ASR and contrast its behavior with CTC. Experimental results on the TIMIT, AMI, and LibriSpeech…
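As a rough illustration of the one-dimensional optimal transport idea named in the abstract, the sketch below builds a monotone coupling between frame weights and label weights. It is not the paper's SOTD or OTTC definition. It relies on a standard fact: for sorted points on a line and a convex ground cost, the optimal plan is the monotone one, which can be built greedily. The function name `monotone_coupling` and the example weights are hypothetical, and both weight vectors are assumed to be nonnegative with equal totals.

```python
def monotone_coupling(a, b, tol=1e-12):
    """Greedy monotone (north-west corner) coupling of two weight vectors.

    a: weights of the source items (e.g. audio frames), in temporal order.
    b: weights of the target items (e.g. labels), in temporal order.
    Returns a list of (i, j, mass) triples; indices never decrease,
    so the resulting alignment is monotone.
    Illustrative assumption: sum(a) == sum(b) and all weights >= 0.
    """
    if abs(sum(a) - sum(b)) > 1e-9:
        raise ValueError("weight vectors must have the same total mass")

    plan = []
    i = j = 0
    remaining_a, remaining_b = a[0], b[0]
    while i < len(a) and j < len(b):
        # Move as much mass as both current items still allow.
        mass = min(remaining_a, remaining_b)
        if mass > tol:
            plan.append((i, j, mass))
        remaining_a -= mass
        remaining_b -= mass
        # Advance whichever side has been exhausted (possibly both).
        if remaining_a <= tol:
            i += 1
            if i < len(a):
                remaining_a = a[i]
        if remaining_b <= tol:
            j += 1
            if j < len(b):
                remaining_b = b[j]
    return plan


# Four equally weighted frames aligned to two equally weighted labels.
example_plan = monotone_coupling([0.25, 0.25, 0.25, 0.25], [0.5, 0.5])
```

In this toy case, frames 0 and 1 send their mass to label 0 and frames 2 and 3 to label 1. A learned approach would presumably make the weights themselves trainable, but that detail is not given in the text above.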
Peer Reviews
Decision: Submitted to NeurIPS 2025
Strengths:
* extremely well written paper, with clear mathematical presentation
* good motivation for the work
* experimental results are promising
Weaknesses:
* the proposed method lags on the ASR task

Strengths: (1) The paper has a good theoretical foundation. The reviewer hasn't found a noticeable flaw so far, although the reviewer may not have enough knowledge to fully check the correctness of all the theoretical parts. (2) The paper conducted enough experiments on TIMIT, AMI, and LibriSpeech, showing its ASR and alignment performance. For a theoretical paper, this experimental scale is enough (although further scaling could be more convincing, it's not necessary). Weakness: (1) The proposed me

The strengths of the paper are: 1. Novel method and interesting idea. 2. Good performance in terms of the alignment. The weakness of the paper is: 1. Degraded performance on ASR.

**Strength**
* The idea of framing the alignment in ASR in terms of Optimal Transport (OT) is novel.
* The alignment results are indeed better, and the peaky behavior in CTC-based ASR is mitigated.
**Weakness**
* The ASR results generally lag behind the baseline (CTC).
* There is no explanation of why "the alignments are better, but the classification results are not".
* The proposed framework is evaluated on only one sequence-to-sequence task (ASR). If the authors could demonstrate it
Taxonomy
Topics: Algorithms and Data Compression
Methods: Tanh Activation · Sigmoid Activation · Long Short-Term Memory · Sequence to Sequence
