SafeMath: Inference-time Safety improves Math Accuracy
Sagnik Basu, Subhrajit Mitra, Aman Juneja, Somnath Banerjee, Rima Hazra, Animesh Mukherjee

TL;DR
This paper introduces SafeMath, a safety alignment method for large language models that reduces harmful content in responses to mathematical word problems without sacrificing mathematical reasoning accuracy, supported by a new dataset (ToxicGSM) and an audit of existing LLMs.
Contribution
The paper presents SafeMath, a safety alignment technique for LLMs in mathematical contexts, along with ToxicGSM, a dataset of 1.9k arithmetic problems that embed harmful or sensitive context while keeping the underlying reasoning task mathematically well-defined.
Findings
SafeMath reduces harmful outputs effectively.
Safety enforcement does not significantly harm mathematical accuracy.
Disentangling linguistic harm from math reasoning is crucial.
Abstract
Recent research points toward LLMs being manipulated through adversarial and seemingly benign inputs, resulting in harmful, biased, or policy-violating outputs. In this paper, we study an underexplored issue concerning harmful and toxic mathematical word problems. We show that math questions, particularly those framed as natural language narratives, can serve as a subtle medium for propagating biased, unethical, or psychologically harmful content, with heightened risks in educational settings involving children. To support a systematic study of this phenomenon, we introduce ToxicGSM, a dataset of 1.9k arithmetic problems in which harmful or sensitive context is embedded while preserving mathematically well-defined reasoning tasks. Using this dataset, we audit the behaviour of existing LLMs and analyse the trade-offs between safety enforcement and mathematical correctness. We further…
Taxonomy
Topics: Adversarial Robustness in Machine Learning · Intelligent Tutoring Systems and Adaptive Learning · Hate Speech and Cyberbullying Detection
