Mechanistic Interpretability of LoRA-Adapted Language Models for Nuclear Reactor Safety Applications
Yoon Pyo Lee

TL;DR
This paper introduces a novel interpretability methodology for LLMs in nuclear safety, identifying key neurons involved in domain-specific reasoning and demonstrating their causal role in model performance.
Contribution
It presents a new approach to understanding and verifying neural circuits in LLMs adapted for nuclear safety applications, enhancing transparency and trust.
Findings
Identified a sparse set of neurons whose behavior was significantly altered during domain adaptation.
Silencing all specialized neurons degraded model performance.
The degradation appeared as impaired generation of technical, contextually accurate information.
Abstract
The integration of Large Language Models (LLMs) into safety-critical domains, such as nuclear engineering, necessitates a deep understanding of their internal reasoning processes. This paper presents a novel methodology for interpreting how an LLM encodes and utilizes domain-specific knowledge, using a Boiling Water Reactor system as a case study. We adapted a general-purpose LLM (Gemma-3-1b-it) to the nuclear domain using a parameter-efficient fine-tuning technique known as Low-Rank Adaptation. By comparing the neuron activation patterns of the base model to those of the fine-tuned model, we identified a sparse set of neurons whose behavior was significantly altered during the adaptation process. To probe the causal role of these specialized neurons, we employed a neuron silencing technique. Our results demonstrate that while silencing most of these specialized neurons individually did…
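The pipeline the abstract describes (adapt with a low-rank update, compare base and fine-tuned activations, then silence the neurons that changed) can be illustrated with a minimal sketch. It assumes a toy one-layer ReLU MLP in NumPy rather than Gemma-3-1b-it. Every name here (`hidden`, `w_base`, `w_tuned`, `specialized`), the layer sizes, and the simple mean-shift threshold are hypothetical choices for illustration, not the paper's actual selection criterion or code.

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_hidden, rank = 8, 16, 2

# Toy "base model" weights (assumption: a single MLP layer stands in for the LLM).
w_base = rng.normal(size=(d_in, d_hidden))

# Hypothetical LoRA factors. A is built so the update only touches
# hidden neurons 3 and 7, which makes the expected result easy to check.
b = rng.normal(size=(d_in, rank))
a = np.zeros((rank, d_hidden))
a[0, 3] = 1.5
a[1, 7] = -2.0
alpha = 4.0
w_tuned = w_base + (alpha / rank) * (b @ a)  # W' = W + (alpha / r) * B A


def hidden(x, w, silence=None):
    """ReLU hidden activations; optionally zero out the listed neurons."""
    h = np.maximum(x @ w, 0.0)
    if silence is not None:
        h = h.copy()
        h[:, silence] = 0.0
    return h


# Stand-in for domain prompts: random input vectors.
prompts = rng.normal(size=(64, d_in))

# Step 1: mean absolute activation shift per neuron, base vs. fine-tuned.
shift = np.abs(hidden(prompts, w_tuned) - hidden(prompts, w_base)).mean(axis=0)

# Step 2: keep neurons whose shift exceeds an illustrative threshold.
specialized = np.flatnonzero(shift > 1e-6)

# Step 3: silence those neurons and measure the change in a toy output.
w_out = rng.normal(size=(d_hidden, 4))
y_tuned = hidden(prompts, w_tuned) @ w_out
silenced = hidden(prompts, w_tuned, silence=specialized)
y_silenced = silenced @ w_out
effect = float(np.linalg.norm(y_tuned - y_silenced))

print("specialized neurons:", specialized.tolist())
print("output change after silencing:", effect)
```

In a real model the same idea would typically be applied with forward hooks on the MLP layers, and the output change would be measured on task quality rather than on a raw vector norm.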
Taxonomy
Topics: Risk and Safety Analysis · Topic Modeling
