CoT2Align: Cross-Chain of Thought Distillation via Optimal Transport Alignment for Language Models with Different Tokenizers

Anh Duc Le; Tu Vu; Nam Le Hai; Nguyen Thi Ngoc Diep; Linh Ngo Van; Trung Le; Thien Huu Nguyen

arXiv:2502.16806 · cs.CL · March 4, 2025

TL;DR

CoT2Align introduces a universal knowledge distillation framework that leverages optimal transport and reasoning-aware alignment to improve language model training across different tokenizers and vocabularies.

Contribution

The paper proposes a cross-chain-of-thought distillation method that uses optimal transport for sequence and layer alignment, addressing vocabulary mismatch between teacher and student models.

Findings

01 Outperforms existing KD methods in reasoning tasks

02 Enhances robustness in domain-specific NLP applications

03 Effective across models with different tokenizers

Abstract

Large Language Models (LLMs) achieve state-of-the-art performance across various NLP tasks but face deployment challenges due to high computational costs and memory constraints. Knowledge distillation (KD) is a promising solution, transferring knowledge from large teacher models to smaller student models. However, existing KD methods often assume shared vocabularies and tokenizers, limiting their flexibility. While approaches like Universal Logit Distillation (ULD) and Dual-Space Knowledge Distillation (DSKD) address vocabulary mismatches, they overlook the critical reasoning-aware distillation aspect. To bridge this gap, we propose CoT2Align, a universal KD framework that integrates Chain-of-Thought (CoT) augmentation and introduces Cross-CoT Alignment to enhance reasoning transfer. Additionally, we extend Optimal Transport beyond token-wise alignment to a sequence-level and…
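
To make the idea of sequence-level optimal transport alignment concrete, the sketch below computes an entropic OT distance between a teacher sequence and a student sequence of different lengths, which is the situation that arises when the two models use different tokenizers. It is an illustration only, not the authors' implementation: the cosine cost, the uniform marginals, the Sinkhorn solver, the assumption that both sets of hidden states already live in a shared dimension, and every function and parameter name (cosine_cost, sinkhorn_plan, ot_distance, eps, n_iters) are assumptions introduced here.

```python
import math


def cosine_cost(u, v):
    """Cosine distance between two vectors (assumed cost; not from the paper)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v + 1e-12)


def sinkhorn_plan(cost, eps=0.1, n_iters=200):
    """Entropic-regularized OT plan with uniform marginals (Sinkhorn iterations)."""
    m, n = len(cost), len(cost[0])
    a = [1.0 / m] * m  # uniform mass over teacher positions (assumption)
    b = [1.0 / n] * n  # uniform mass over student positions (assumption)
    kernel = [[math.exp(-c / eps) for c in row] for row in cost]
    u = [1.0] * m
    v = [1.0] * n
    for _ in range(n_iters):
        # Alternately rescale rows and columns to match the marginals.
        u = [a[i] / sum(kernel[i][j] * v[j] for j in range(n)) for i in range(m)]
        v = [b[j] / sum(kernel[i][j] * u[i] for i in range(m)) for j in range(n)]
    return [[u[i] * kernel[i][j] * v[j] for j in range(n)] for i in range(m)]


def ot_distance(teacher_states, student_states, eps=0.1, n_iters=200):
    """Return (transport cost, plan) between two sequences of same-dimension vectors."""
    cost = [[cosine_cost(t, s) for s in student_states] for t in teacher_states]
    plan = sinkhorn_plan(cost, eps, n_iters)
    total = sum(
        plan[i][j] * cost[i][j]
        for i in range(len(cost))
        for j in range(len(cost[0]))
    )
    return total, plan


# Toy example: the teacher tokenizer yields 3 positions, the student's yields 2.
teacher = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
student = [[1.0, 0.0], [0.0, 1.0]]
distance, plan = ot_distance(teacher, student)
```

Because the plan is a soft matching between positions, no one-to-one token correspondence is required, so the two sequences may differ in length. In a distillation setting a quantity like distance could serve as one term of a training loss, but how the paper weights or combines such terms is not stated in the text above.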

Taxonomy

Methods: Knowledge Distillation