FISHER: A Foundation Model for Multi-Modal Industrial Signal Comprehensive Representation

Pingyi Fan; Anbai Jiang; Shuwei Zhang; Zhiqiang Lv; Bing Han; Xinhu Zheng; Wenrui Liang; Junjie Li; Wei-Qiang Zhang; Yanmin Qian; Xie Chen; Cheng Lu; Jia Liu

arXiv:2507.16696·cs.LG·February 16, 2026

TL;DR

FISHER is a unified foundation model designed to comprehensively represent multi-modal industrial signals, enabling improved analysis and abnormality detection across heterogeneous SCADA system data.

Contribution

The paper introduces FISHER, a novel model that unifies the modeling of diverse industrial signals using a teacher-student SSL framework and develops the RMIS benchmark for evaluation.
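The reviews on this page describe the teacher-student setup as distillation via EMA (exponential moving average). As a rough illustration only, an EMA teacher update can be sketched as below. The parameter layout (a dict of float lists), the function name `ema_update`, and the decay value are assumptions made for this example, not details taken from the paper.

```python
def ema_update(teacher, student, decay=0.999):
    """Move each teacher parameter toward the matching student parameter.

    teacher <- decay * teacher + (1 - decay) * student
    Both arguments are dicts mapping a parameter name to a list of floats
    (a hypothetical stand-in for real model weights).
    """
    for name, student_values in student.items():
        teacher[name] = [
            decay * t + (1.0 - decay) * s
            for t, s in zip(teacher[name], student_values)
        ]
    return teacher


# Toy usage: the decay of 0.9 is chosen only to make the effect visible.
teacher_params = {"w": [0.0, 0.0]}
student_params = {"w": [1.0, 2.0]}
ema_update(teacher_params, student_params, decay=0.9)
print(teacher_params["w"])
```

In this kind of scheme only the student receives gradients; the teacher changes solely through the averaging step, which is what keeps its targets slowly varying.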

Findings

1. FISHER outperforms top SSL models, with a performance gain of up to 4.2%.
2. FISHER demonstrates versatile capabilities across multiple health management tasks.
3. The study reveals effective scaling laws for industrial signal representations.

Abstract

With the rapid deployment of SCADA systems, how to effectively analyze industrial signals and detect abnormal states is an urgent need for the industry. Due to the significant heterogeneity of these signals, which we summarize as the M5 problem, previous works only focus on small sub-problems and employ specialized models, failing to utilize the synergies between modalities and the powerful scaling law. However, we argue that the M5 signals can be modeled in a unified manner due to the intrinsic similarity. As a result, we propose FISHER, a Foundation model for multi-modal Industrial Signal compreHEnsive Representation. To support arbitrary sampling rates, FISHER considers the increment of sampling rate as the concatenation of sub-band information. Specifically, FISHER takes the STFT sub-band as the modeling unit and adopts a teacher-student SSL framework for pre-training. We also…
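The sub-band idea in the abstract can be illustrated with a small sketch. It assumes an STFT window of fixed duration, so that every frequency bin spans the same number of Hz at any sampling rate, and then cuts the spectrogram into bands of a fixed number of bins. The window and hop lengths, the band width of 25 bins, the dropped Nyquist bin, and the function names are all assumptions chosen for the example; the paper's actual settings are not given on this page.

```python
import cmath
import math


def stft_magnitude(signal, sample_rate, window_seconds=0.025, hop_seconds=0.010):
    """Magnitude STFT using a naive DFT and a Hann window of fixed duration.

    With a fixed window duration, each bin spans 1 / window_seconds Hz
    regardless of the sampling rate (assumed setup, for illustration).
    """
    win = int(round(sample_rate * window_seconds))
    hop = int(round(sample_rate * hop_seconds))
    n_bins = win // 2  # simplification: Nyquist bin dropped
    hann = [0.5 - 0.5 * math.cos(2 * math.pi * n / win) for n in range(win)]
    frames = []
    for start in range(0, len(signal) - win + 1, hop):
        frame = [signal[start + n] * hann[n] for n in range(win)]
        frames.append([
            abs(sum(frame[n] * cmath.exp(-2j * math.pi * k * n / win)
                    for n in range(win)))
            for k in range(n_bins)
        ])
    return frames  # indexed as frames[time][frequency_bin]


def split_subbands(frames, bins_per_band):
    """Cut a spectrogram into consecutive sub-bands of equal bin count."""
    n_bands = len(frames[0]) // bins_per_band
    return [
        [row[b * bins_per_band:(b + 1) * bins_per_band] for row in frames]
        for b in range(n_bands)
    ]


def tone(sample_rate, freq_hz=500.0, seconds=0.1):
    """Synthetic sine wave used as a stand-in for a sensor signal."""
    n = int(round(sample_rate * seconds))
    return [math.sin(2 * math.pi * freq_hz * i / sample_rate) for i in range(n)]


# The same 500 Hz tone sampled at two rates (hypothetical values).
bands_4k = split_subbands(stft_magnitude(tone(4000), 4000), bins_per_band=25)
bands_8k = split_subbands(stft_magnitude(tone(8000), 8000), bins_per_band=25)
print(len(bands_4k), len(bands_8k))
```

Under these assumptions each bin is 40 Hz wide and each sub-band covers 1 kHz. The 4 kHz signal yields two sub-bands (0 to 2 kHz) and the 8 kHz signal yields four (0 to 4 kHz), so doubling the sampling rate appends sub-bands rather than changing the shape of the existing ones.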

Peer Reviews

Decision·ICLR 2026 Conference Withdrawn Submission

Reviewer 01 · Rating 2 · Confidence 2

Strengths

The paper introduces a foundation model for multi-modal signal representation. This is the first attempt to learn a multi-modal, multi-sampling-rate representation across various signal datasets.

Weaknesses

My primary concern lies in the lack of novelty. I could not identify any clear novel contribution beyond the use of multi-modal training.

**[No Ablation Study]** The paper does not include any ablation studies, making it difficult to understand why the proposed model outperforms others. It would be beneficial to provide ablation results for factors such as multi-modality, multi-scale datasets, model architecture, and sequence length.

**[Input Size]** During training, FISHER uses a fixed 10-sec

Reviewer 02 · Rating 4 · Confidence 3

Strengths

- Good problem formulation: clearly formalizing the M5 problem is a valuable contribution.

Weaknesses

- Limited novelty: many modules in this paper can be found in references. For example, ViT, features emerging, and EMA are off-the-shelf modules. It would help if the authors explained more clearly what the unique contribution and novelty of this paper are.

Reviewer 03 · Rating 4 · Confidence 2

Strengths

The authors should be commended for tackling a problem of significant practical importance. Creating a single model for the diverse and heterogeneous world of industrial signals is a valuable goal, and this work makes a convincing step in that direction. The key strength of the paper, in my view, is the conceptual novelty of the sub-band approach. It's an elegant and intuitive way to handle the variable-length nature of signals sampled at different rates. This strong methodological contribution

Weaknesses

While the paper is promising, its clarity could be significantly improved in several areas, which currently hinders a full appreciation of the work. My primary concern is that the paper is not fully self-contained. For example, a key component of the training, the "mask cloning strategy," is mentioned without explanation, forcing readers to consult the external EAT paper to understand the methodology. Similarly, the crucial process of how signals yielding a variable number of sub-bands are actua

Reviewer 04 · Rating 4 · Confidence 4

Strengths

1. Industrial signal modeling is an underexplored domain compared to speech/audio; addressing the M5 heterogeneity problem with a scalable foundation model is meaningful.
2. The idea of modeling higher sampling rates as concatenated sub-band information is simple yet effective.
3. The proposed RMIS benchmark covers diverse tasks and modalities, potentially serving as a valuable testbed for future research.

Weaknesses

1. The core architecture and SSL setup (teacher-student distillation via EMA, ViT backbone) are nearly identical to existing work, raising concerns that the novelty lies mainly in the data preprocessing stage.
2. The novelty of the approach is limited, since the method essentially reweights gradients in a way similar to prior balancing strategies, without a clearly distinguished contribution.
3. Figures could better illustrate sub-band composition to enhance interpretability.

Taxonomy

Topics: Image Processing and 3D Reconstruction