A Differentiable Alignment Framework for Sequence-to-Sequence Modeling via Optimal Transport

Yacouba Kaloga; Shashi Kumar; Petr Motlicek; Ina Kodrasi

arXiv:2502.01588·cs.LG·November 24, 2025

Open Access · 4 Reviews

TL;DR

This paper introduces a differentiable sequence alignment framework using optimal transport, improving alignment accuracy in end-to-end speech recognition systems and offering a new loss function for better sequence modeling.

Contribution

The paper presents a novel optimal transport-based alignment method and a new loss function for sequence-to-sequence models, enhancing alignment accuracy in ASR tasks.

Findings

01 Significant improvement in alignment accuracy over CTC methods.

02 The proposed SOTD metric has desirable theoretical properties.

03 Experimental results demonstrate better alignment performance on multiple datasets.

Abstract

Accurate sequence-to-sequence (seq2seq) alignment is critical for applications like medical speech analysis and language learning tools relying on automatic speech recognition (ASR). State-of-the-art end-to-end (E2E) ASR systems, such as the Connectionist Temporal Classification (CTC) and transducer-based models, suffer from peaky behavior and alignment inaccuracies. In this paper, we propose a novel differentiable alignment framework based on one-dimensional optimal transport, enabling the model to learn a single alignment and perform ASR in an E2E manner. We introduce a pseudo-metric, called Sequence Optimal Transport Distance (SOTD), over the sequence space and discuss its theoretical properties. Based on the SOTD, we propose Optimal Temporal Transport Classification (OTTC) loss for ASR and contrast its behavior with CTC. Experimental results on the TIMIT, AMI, and LibriSpeech…
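To make the phrase "one-dimensional optimal transport" concrete, the following is a minimal sketch and not the authors' OTTC implementation. It relies on a standard fact: when both distributions live on ordered points of a line and the ground cost is convex, the optimal plan is the monotone coupling, which a greedy two-pointer sweep builds in linear time. The function name `monotone_coupling`, the uniform frame weights, and the label weights are all illustrative assumptions; in a trainable model such weights would presumably be predicted rather than fixed.

```python
import numpy as np

def monotone_coupling(a, b, tol=1e-12):
    """Greedy (north-west corner) coupling of two ordered 1D mass vectors.

    Hypothetical helper for illustration only. `a` holds mass per audio
    frame, `b` holds mass per target label; both must sum to the same total.
    """
    a = np.asarray(a, dtype=float).copy()
    b = np.asarray(b, dtype=float).copy()
    plan = np.zeros((len(a), len(b)))
    i = j = 0
    while i < len(a) and j < len(b):
        mass = min(a[i], b[j])   # move as much mass as both sides allow
        plan[i, j] += mass
        a[i] -= mass
        b[j] -= mass
        if a[i] <= tol:          # frame i is exhausted, go to the next frame
            i += 1
        if b[j] <= tol:          # label j is filled, go to the next label
            j += 1
    return plan

# Assumed toy input: four frames with uniform mass, two labels with equal mass.
frame_weights = [0.25, 0.25, 0.25, 0.25]
label_weights = [0.5, 0.5]
plan = monotone_coupling(frame_weights, label_weights)

# Reading off the heaviest label per frame gives one monotone alignment.
alignment = plan.argmax(axis=1)
print(alignment)  # prints [0 0 1 1]
```

Because the plan never moves backwards in either index, it encodes a single left-to-right alignment of frames to labels, in contrast to CTC, which sums over all valid alignments.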

Peer Reviews

Decision·Submitted to NeurIPS 2025

Reviewer 01 · Rating 5 · Confidence 4

Strengths:
* Extremely well-written paper, with clear mathematical presentation
* Good motivation for the work
* Experimental results are promising

Weaknesses:
* The proposed method lags on the ASR task

Reviewer 02 · Rating 5 · Confidence 3

Strengths:
(1) The paper has a good theoretical foundation. The reviewer hasn't found a noticeable flaw so far, although the reviewer may not have enough knowledge to fully check the correctness of all the theoretical parts.
(2) The paper conducted enough experiments on TIMIT, AMI, and LibriSpeech, showing its ASR and alignment performance. For a theoretical paper, this experimental scale is enough (further scaling could be more convincing, but it is not necessary).

Weakness:
(1) The proposed me

Reviewer 03 · Rating 4 · Confidence 5

The strengths of the paper are:
1. Novel method and interesting idea.
2. Good performance in terms of the alignment.

The weakness of the paper is:
1. Degraded performance on ASR.

Reviewer 04 · Rating 4 · Confidence 4

**Strength**
* The idea of framing alignment in ASR as an Optimal Transport (OT) problem is novel.
* The alignment results are indeed better, and the peaky behavior in CTC-based ASR is mitigated.

**Weakness**
* The ASR results generally lag behind the baseline (CTC).
* There is no explanation of why “the alignments are better, but the classification results are not”.
* The proposed framework is evaluated on only one sequence-to-sequence task (ASR). If the authors could demonstrate it

Taxonomy

Topics: Algorithms and Data Compression

Methods: Tanh Activation · Sigmoid Activation · Long Short-Term Memory · Sequence to Sequence