FISHER: A Foundation Model for Multi-Modal Industrial Signal Comprehensive Representation
Pingyi Fan, Anbai Jiang, Shuwei Zhang, Zhiqiang Lv, Bing Han, Xinhu Zheng, Wenrui Liang, Junjie Li, Wei-Qiang Zhang, Yanmin Qian, Xie Chen, Cheng Lu, Jia Liu

TL;DR
FISHER is a unified foundation model designed to comprehensively represent multi-modal industrial signals, enabling improved analysis and abnormality detection across heterogeneous SCADA system data.
Contribution
The paper introduces FISHER, a novel model that unifies the modeling of diverse industrial signals using a teacher-student SSL framework and develops the RMIS benchmark for evaluation.
Findings
FISHER outperforms top SSL models with up to 4.2% performance gain.
FISHER demonstrates versatile capabilities across multiple health management tasks.
The study reveals effective scaling laws for industrial signal representations.
Abstract
With the rapid deployment of SCADA systems, how to effectively analyze industrial signals and detect abnormal states is an urgent need for the industry. Due to the significant heterogeneity of these signals, which we summarize as the M5 problem, previous works only focus on small sub-problems and employ specialized models, failing to utilize the synergies between modalities and the powerful scaling law. However, we argue that the M5 signals can be modeled in a unified manner due to the intrinsic similarity. As a result, we propose FISHER, a Foundation model for multi-modal Industrial Signal compreHEnsive Representation. To support arbitrary sampling rates, FISHER considers the increment of sampling rate as the concatenation of sub-band information. Specifically, FISHER takes the STFT sub-band as the modeling unit and adopts a teacher-student SSL framework for pre-training. We also…
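The sub-band idea in the abstract can be illustrated with a small sketch. It assumes an STFT window of fixed duration, so every frequency bin covers the same bandwidth at any sampling rate, and it cuts the frequency axis into equal-width sub-bands. The function names, the 25 ms window, the 10 ms hop, and the 50-bin sub-band width are hypothetical choices for illustration, not the settings used by FISHER.

```python
import numpy as np

def stft_magnitude(signal, sample_rate, win_seconds=0.025, hop_seconds=0.010):
    """Magnitude STFT with a fixed-duration window (assumed values),
    so Hz-per-bin is identical at every sampling rate."""
    win = int(round(win_seconds * sample_rate))
    hop = int(round(hop_seconds * sample_rate))
    n_frames = 1 + (len(signal) - win) // hop
    window = np.hanning(win)
    frames = np.stack(
        [signal[i * hop : i * hop + win] * window for i in range(n_frames)]
    )
    return np.abs(np.fft.rfft(frames, axis=1))  # (n_frames, win // 2 + 1)

def split_into_subbands(spec, bins_per_band):
    """Cut the frequency axis into equal-width sub-bands,
    dropping leftover bins at the top of the spectrum."""
    n_frames = spec.shape[0]
    n_bands = spec.shape[1] // bins_per_band
    trimmed = spec[:, : n_bands * bins_per_band]
    # (n_frames, n_bands, bins_per_band) -> (n_bands, n_frames, bins_per_band)
    return trimmed.reshape(n_frames, n_bands, bins_per_band).transpose(1, 0, 2)

# One second of noise at two sampling rates (synthetic data, illustration only).
rng = np.random.default_rng(0)
low = split_into_subbands(stft_magnitude(rng.standard_normal(16000), 16000), 50)
high = split_into_subbands(stft_magnitude(rng.standard_normal(48000), 48000), 50)
print(low.shape)   # (4, 98, 50)
print(high.shape)  # (12, 98, 50)
```

Under these assumptions, tripling the sampling rate triples the number of sub-bands while each sub-band keeps the same shape, which is one way to read "the increment of sampling rate as the concatenation of sub-band information".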
Peer Reviews
Decision: ICLR 2026 Conference Withdrawn Submission
The paper introduces a foundation model for multi-modal signal representation. This is the first attempt to learn a multi-modal, multi-sampling-rate model across various signal datasets.
My primary concern lies in the lack of novelty. I could not identify any clear novel contribution beyond the use of multi-modal training. **[No Ablation Study]** The paper does not include any ablation studies, making it difficult to understand why the proposed model outperforms others. It would be beneficial to provide ablation results for factors such as multi-modality, multi-scale datasets, model architecture, and sequence length. **[Input Size]** During training, FISHER uses a fixed 10-sec
- Good problem formulation: clearly formalizing the M5 problem is a valuable contribution.
- Limited novelty: many modules in this paper can be found in references. For example, ViT, features emerging, and EMA are off-the-shelf modules. It would help if the authors explained the unique contribution and novelty of this paper in more detail.
The authors should be commended for tackling a problem of significant practical importance. Creating a single model for the diverse and heterogeneous world of industrial signals is a valuable goal, and this work makes a convincing step in that direction. The key strength of the paper, in my view, is the conceptual novelty of the sub-band approach. It's an elegant and intuitive way to handle the variable-length nature of signals sampled at different rates. This strong methodological contribution
While the paper is promising, its clarity could be significantly improved in several areas, which currently hinders a full appreciation of the work. My primary concern is that the paper is not fully self-contained. For example, a key component of the training, the "mask cloning strategy," is mentioned without explanation, forcing readers to consult the external EAT paper to understand the methodology. Similarly, the crucial process of how signals yielding a variable number of sub-bands are actua
1. Industrial signal modeling is an underexplored domain compared to speech/audio; addressing the M5 heterogeneity problem with a scalable foundation model is meaningful.
2. The idea of modeling higher sampling rates as concatenated sub-band information is simple yet effective.
3. The proposed RMIS benchmark covers diverse tasks and modalities, potentially serving as a valuable testbed for future research.
1. The core architecture and SSL setup (teacher-student distillation via EMA, ViT backbone) are nearly identical to existing work, raising concerns that the novelty lies mainly in the data preprocessing stage.
2. The novelty of the approach is limited, since the method essentially reweights gradients in a way similar to prior balancing strategies, without a clearly distinguished contribution.
3. Figures could better illustrate sub-band composition to enhance interpretability.
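For readers unfamiliar with the teacher-student distillation via EMA that the reviews refer to, the generic weight update can be sketched as follows. This is the standard exponential-moving-average rule, not FISHER's training code; the parameter dictionaries and the decay value of 0.9 are hypothetical.

```python
import numpy as np

def ema_update(teacher, student, decay):
    """In-place EMA step: teacher <- decay * teacher + (1 - decay) * student."""
    for name, t in teacher.items():
        t *= decay
        t += (1.0 - decay) * student[name]

# Toy parameters (hypothetical): the teacher drifts slowly toward the student.
student = {"w": np.ones(3)}
teacher = {"w": np.zeros(3)}
ema_update(teacher, student, decay=0.9)
print(teacher["w"])  # every entry is approximately 0.1
```

In this kind of setup the student is trained by gradient descent to match the teacher's outputs, while the teacher receives no gradients and only follows the student through the update above.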
Taxonomy
Topics: Image Processing and 3D Reconstruction
