SASFT: Sparse Autoencoder-guided Supervised Finetuning to Mitigate Unexpected Code-Switching in LLMs

Boyi Deng; Yu Wan; Baosong Yang; Fei Huang; Wenjie Wang; Fuli Feng

arXiv:2507.14894·cs.CL·March 3, 2026

Open Access 3 Reviews

TL;DR

This paper introduces SASFT, a novel fine-tuning method using sparse autoencoders to significantly reduce unexpected code-switching in large language models while preserving their multilingual performance.

Contribution

The paper presents a mechanistic analysis of code-switching in LLMs and proposes SASFT, a new supervised fine-tuning approach that effectively mitigates language mixing by controlling feature pre-activation values.

Findings

01 · SASFT reduces unexpected code-switching by over 50%.

02 · Complete elimination of code-switching achieved in one model.

03 · Maintains or improves performance on multilingual benchmarks.

Abstract

Large Language Models (LLMs) have impressive multilingual capabilities, but they suffer from unexpected code-switching, also known as language mixing, which involves switching to unexpected languages in the model response. This problem leads to poor readability and degrades the usability of model responses. However, existing work on this issue lacks a mechanistic analysis and shows limited effectiveness. In this paper, we first provide an in-depth analysis of unexpected code-switching using sparse autoencoders and find that when LLMs switch to a language, the features of that language exhibit excessive pre-activation values. Based on our findings, we propose Sparse Autoencoder-guided Supervised Finetuning (SASFT), which teaches LLMs to maintain appropriate pre-activation values of specific language features during training.…
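The idea of keeping language-feature pre-activations in check during finetuning can be sketched as an auxiliary penalty added to the usual SFT loss. The abstract as shown here does not give the actual objective, so everything below is an assumption made for illustration: the linear encoder form, the hinge-style penalty, and the names `pre_activations`, `pre_activation_penalty`, `total_loss`, `threshold`, and `weight` are all hypothetical, and the paper's formulation may differ.

```python
from typing import List, Sequence

Matrix = List[List[float]]


def pre_activations(hidden: Matrix, w_enc: Matrix, b_enc: Sequence[float]) -> Matrix:
    """SAE encoder output before the nonlinearity, one row per token.

    Assumed shapes: hidden is (tokens, d_model), w_enc is (d_model, n_features),
    b_enc is (n_features,). This linear form is an assumption, not the paper's spec.
    """
    n_features = len(b_enc)
    return [
        [
            sum(h * w_enc[i][j] for i, h in enumerate(token)) + b_enc[j]
            for j in range(n_features)
        ]
        for token in hidden
    ]


def pre_activation_penalty(
    hidden: Matrix,
    w_enc: Matrix,
    b_enc: Sequence[float],
    feature_ids: Sequence[int],
    threshold: float = 0.0,
) -> float:
    """Mean amount by which the selected language features exceed `threshold`.

    Hypothetical hinge penalty: values at or below the threshold cost nothing.
    """
    pre = pre_activations(hidden, w_enc, b_enc)
    excess = [max(row[j] - threshold, 0.0) for row in pre for j in feature_ids]
    return sum(excess) / len(excess)


def total_loss(sft_loss: float, penalty: float, weight: float = 0.1) -> float:
    """Standard SFT loss plus the weighted auxiliary penalty (weight is assumed)."""
    return sft_loss + weight * penalty


# Toy usage: two tokens, two hidden dimensions, three SAE features.
hidden = [[1.0, 0.0], [0.0, 2.0]]
w_enc = [[1.0, 0.0, -1.0], [0.0, 1.0, 1.0]]
b_enc = [0.0, 0.0, 0.0]

# Treat features 1 and 2 as the features of the language to suppress.
penalty = pre_activation_penalty(hidden, w_enc, b_enc, feature_ids=[1, 2], threshold=0.5)
loss = total_loss(sft_loss=2.0, penalty=penalty, weight=0.1)
```

In a real training loop the same quantity would be computed on model activations with an autodiff framework so that gradients flow back into the LLM; the pure-Python version here only shows the arithmetic.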

Peer Reviews

Decision·ICLR 2026 Poster

Reviewer 01 · Rating 6 · Confidence 3

Strengths

1. Clear story from SAE analysis to a concrete training modification. 2. The results demonstrate the proposed method's effectiveness. 3. The paper is well-written and easy to follow.

Weaknesses

1. GRPO is run with only 10k samples (1k per language). That seems light for a control-behavior objective. Consider stronger RL baselines, or simple supervised baselines that directly penalize language-ID tokens. Without stronger baselines, it’s hard to attribute gains purely to SASFT. Can the authors ensure that GRPO has truly converged? 2. The study focuses on zh/ru/ko. It’s unclear if results hold for low-resource scripts (e.g., Amharic, Khmer), closely-related Latin languages where

Reviewer 02 · Rating 6 · Confidence 5

Strengths

1. The paper proposes a Sparse Autoencoder-guided Supervised Finetuning (SASFT) approach that combines sparse autoencoders with supervised fine-tuning. 2. The effectiveness of SASFT is validated through extensive experiments across multiple language pairs and model families. The results demonstrate consistent mitigation of code-switching phenomena in diverse multilingual settings. 3. Experimental evidence indicates that the proposed method effectively reduces unintended language switches, there

Weaknesses

1. Limited Baseline Comparison: The paper only compares SASFT with GRPO, which, although relevant, is insufficient to establish the method’s relative advantage. 2. Lack of Mechanistic or Causal Analysis: While the paper empirically observes that the pre-activation values of target-language features increase prior to code-switching and validates this via directional ablation, this evidence remains correlational. The work does not provide a mechanistic explanation of why such activation leads to lan

Reviewer 03 · Rating 6 · Confidence 3

Strengths

1. The paper addresses an underexplored yet practically important problem, unexpected code-switching, which impacts user experience and model usability. 2. Demonstrates consistent reductions in code-switching across various models and languages, outperforming previous methods (e.g., GRPO) in most settings. 3. The paper offers detailed analysis on factors such as layer depth and feature selection. 4. The paper is clearly written and easy to follow.

Weaknesses

1. The method for identifying language-specific features relies on rankings without justification. This introduces sensitivity to hyperparameter selection and limits the interpretability of the results, especially in multilingual settings where features may vary across tasks. 2. The paper reports a substantial +327% increase in Korean code-switching under the GRPO method, but does not provide sufficient explanation for this anomaly. A deeper analysis is needed to clarify the cause of such a dra

Taxonomy

Topics: Topic Modeling · Natural Language Processing Techniques · Text Readability and Simplification