Edge-Deployable Fish Feeding-State Quantification and Recognition via Frame-Pair Motion Encoding and EfficientFeedingNet
Yuchen Xiao, Weijia Ren, Yining Wang, Kaijian Zheng, Chunwei Bi, Shubin Zhang, Xinxing You, Liuyi Huang

TL;DR
This paper introduces a lightweight video-based system to detect when farmed fish are feeding, aiming to reduce waste and improve fish welfare in aquaculture.
Contribution
The novel contribution is a motion-based, edge-deployable framework with EfficientFeedingNet for real-time feeding-state recognition in aquaculture.
Findings
EfficientFeedingNet achieved 96.53% accuracy in feeding-state recognition.
Models trained on automatically labeled data outperformed those trained on human-labeled data by 13.13–18.46 percentage points.
The system runs at 143.24 fps on a Jetson Orin NX, enabling real-time deployment.
Abstract
This study presents a lightweight video-based method to recognize when a school of farmed fish is actively feeding, helping farmers avoid overfeeding that can waste feed, degrade water quality, and compromise fish welfare. We recorded overhead videos of juvenile black rockfish in farm tanks and measured how movement changes between two nearby video frames. We turned these movement changes into a single color image that shows where fish move and how strongly they move. Using the overall movement strength over time, we automatically determined the period when fish respond to feed and created an automatically labeled set of Feeding and Non-feeding examples. We then compared several lightweight image-based computer models and developed a fast model named EfficientFeedingNet for real-time use on low-cost farm devices. Models trained with the automatic labels achieved over 90 percent accuracy…
| Pipeline Category | Reference | Optical-Flow Source | Model Input | Temporal | Pros/Cons |
|---|---|---|---|---|---|
| Handcrafted flow indicators | [ | Classical sparse/dense optical flow | Motion statistics (speed/direction/activity), aggregated over space/time | None (simple smoothing/ | Pros: interpretable, low computation. Cons: feature- and scene-sensitive |
| Flow-based temporal deep models | [ | Learned optical flow network | Flow (±RGB) clips | 3D CNN/two-stage pipeline | Pros: models long-horizon dynamics. Cons: higher memory/latency |
| Frame-pair motion encoding | This work | Classical dense Farnebäck (frame-pair, interval Δ) | 2D motion–spatial map encoding direction/magnitude/layout | No explicit temporal network; short-term motion encoded in 2D | Pros: low-latency and memory-efficient. Cons: limited long-horizon modeling |
- National Key Research and Development Program of China
- Key R&D Program of Shandong Province, China
- Qingdao marine science and technology innovation demonstration project
- Postdoctoral Fellowship Program of CPSF
- Postdoctoral Innovation Program of Shandong Province
Taxonomy
Topics: Water Quality Monitoring Technologies · Optical Wireless Communication Technologies · Innovations in Aquaponics and Hydroponics Systems
1. Introduction
In aquaculture, feed is typically the largest operating cost and can account for more than 50% of total production costs [1,2]. Fish-demand-based feeding strategies are therefore critical for improving production efficiency and reducing feed waste. However, fish appetite is influenced by physiological status and environmental conditions, which makes it difficult to implement accurate, fish-demand-based feeding in practice [3]. This often leads to overfeeding, resulting in feed waste, deterioration of culture water quality, compromised fish welfare, and increased production costs and management burden [1,2]. Quantitative evidence from related behavior-based automated feeding control studies suggests that closing the loop with real-time visual feedback can yield practically meaningful gains. For example, a near-infrared machine-vision feeding decision system reported a ~10.77% reduction in feed conversion ratio (FCR; 1.95 ± 0.06 to 1.74 ± 0.06) compared with a feeding-table method, together with lower ammonia nitrogen levels (stabilizing at ~0.45 mg/L versus ~0.5–0.6 mg/L) [4]. In another study on juvenile Micropterus salmoides, an intelligent feeding method reduced FCR by ~17.07% (1.23 ± 0.09 to 1.02 ± 0.06) while increasing the specific growth rate by ~8.33% (2.64 ± 0.60 to 2.86 ± 0.71) [5]. These reported magnitudes help contextualize why accurate, edge-deployable feeding-state monitoring can be an important enabler for precision feeding and reduced waste in intensive aquaculture.
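The percentage changes quoted above follow directly from the reported means. The short check below recomputes them; the helper name `relative_change_percent` is illustrative only, and the ± spreads are ignored.

```python
def relative_change_percent(baseline: float, treated: float) -> float:
    """Signed change from baseline to treated, as a percentage of baseline."""
    return (treated - baseline) / baseline * 100.0


# Mean values quoted in the text for references [4] and [5].
fcr_nir = relative_change_percent(1.95, 1.74)    # FCR, feeding table vs. NIR vision system [4]
fcr_bass = relative_change_percent(1.23, 1.02)   # FCR, juvenile Micropterus salmoides [5]
sgr_bass = relative_change_percent(2.64, 2.86)   # specific growth rate [5]

print(f"{fcr_nir:.2f}%  {fcr_bass:.2f}%  {sgr_bass:.2f}%")
# -10.77%  -17.07%  8.33%
```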
In this context, accurate feeding-state assessment is essential for developing data-driven feeding management in aquaculture [6,7,8]. In farm practice, feeding states are commonly assessed by behavioral observation. Farmers or experts infer feeding state/intensity from school-level cues such as aggregation toward the feeding area, elevated swimming activity, and changes in the shoal’s spatial distribution after feed delivery [9,10]. However, the criteria used for behavioral classification and the magnitude of feeding-related behaviors can be both species- and context-dependent [10,11]. As a result, observation-based grading (e.g., ordinal intensity levels or binary feeding/non-feeding labels) is often subjective and difficult to standardize across operators and farms [12,13,14,15]. Throughout this paper, “feeding-state quantification” refers to deriving a measurable, time-resolved indicator from video to delineate feeding-state intervals. “Spatial distribution” refers to the image-plane organization of school activity/aggregation patterns (i.e., where and how the shoal moves and aggregates).
To reduce subjectivity and enable scalable monitoring, deep-learning-based approaches have been widely explored for recognizing feeding-related states of fish schools from images or videos. Representative pipelines include (i) lightweight CNN-based image classifiers for feeding-state recognition (e.g., LeNet-style and MobileNet-style models) [12,13]; (ii) multi-task frameworks that jointly infer feeding activity and residual feed to support feeding evaluation [16]; and (iii) spatiotemporal attention models that exploit short video clips to capture temporal patterns [17]. These studies demonstrate the potential of deep learning for feeding-state recognition in controlled aquaculture settings.
Despite these advances, practical on-farm use remains challenging. First, many existing methods rely on static, single-frame appearance cues or computationally heavy clip-based temporal models. These approaches can be fragile under variable imaging conditions (e.g., illumination changes, turbidity, occlusion, and background disturbances) and are often difficult to deploy in real time on farm-side edge devices. Second, ground-truth labels are often assigned by human observation or manual thresholding, making them subjective and hard to reproduce across farms and operators. Third, deployment feasibility (e.g., latency and memory footprint on edge hardware) is rarely quantified in a standardized manner. These gaps motivate three research questions (RQs):
- (RQ1) Can short-term inter-frame motion cues encoded from a frame pair support accurate binary feeding/non-feeding recognition using efficient 2D inference under edge constraints?
- (RQ2) Can a measurable, time-resolved motion-intensity signal derived from video provide an objective basis to delineate feeding-state intervals and generate more reproducible labels than purely observer-based grading?
- (RQ3) What accuracy–latency trade-offs can be achieved across representative lightweight backbones and on a farm-side edge device?
To address RQ1–RQ3, we present an edge-oriented, motion-aware framework for fish feeding-state recognition with three contributions:
- (1) We propose a frame-pair motion–spatial encoding that maps dense optical-flow information into a single 2D representation capturing motion direction, magnitude, and spatial distribution, enabling efficient inference with lightweight 2D classifiers.
- (2) We introduce a reproducible perception-based dataset construction and labeling protocol (Perceptual Dataset) that automatically delineates the feeding-response interval for each feeding event and generates objective Feeding/Non-feeding labels, providing an objective benchmark alongside an observer-labeled Intuitive Dataset.
- (3) We develop EfficientFeedingNet tailored to the proposed representation and report systematic evaluations, including edge-device benchmarking, to quantify the accuracy–latency trade-off under resource constraints.
The remainder of this paper is organized as follows. Section 2 reviews related work on fish feeding-state recognition and optical-flow-based motion representations and positions our approach. Section 3 describes the experimental setup, dataset construction, frame-pair motion encoding, and EfficientFeedingNet design. Section 4 reports controlled benchmarks with statistical tests and edge-device deployment results. Section 5 discusses implications, limitations, and future directions. Section 6 concludes the paper.
2. Related Works
Building on the practical challenges and research questions in Section 1, this section reviews prior work on fish feeding-state recognition and optical-flow-based motion representations. Table 1 summarizes representative optical-flow-based pipelines as a concise taxonomy to highlight key trade-offs and position our frame-pair motion encoding design.
In recent years, computer-vision and machine-learning techniques have been widely used to recognize feeding states and quantify feeding-related behaviors in diverse aquaculture scenarios, including recirculating aquaculture systems [6,12,13], sea-cage aquaculture [14,15], underwater aquaculture cabins on vessels [17,18], and pond-based aquaculture [19]. Existing approaches broadly fall into two groups: traditional image-processing pipelines combined with machine-learning models and more recent deep-learning-based methods.
In traditional image-processing and machine-learning pipelines, foreground segmentation is typically performed first, after which handcrafted descriptors—such as school aggregation, spatial distribution/position, and activity level—are extracted and mapped to feeding states [4,9,20,21]. While such handcrafted features can correlate with feeding responses in specific settings, their performance often depends on segmentation quality and scenario-specific thresholds or tuning. As a result, these pipelines can be sensitive to changing imaging conditions (e.g., illumination, turbidity, occlusion, and background disturbances) and may generalize poorly across farms, species, and rearing systems.
To reduce the dependence on segmentation and handcrafted feature engineering, deep learning learns representations directly from data and often improves accuracy in controlled environments. Nevertheless, two practical limitations remain common in deep-learning-based feeding-state recognition. First, many models still operate on single-frame or appearance-dominant inputs, which may under-represent the motion and aggregation dynamics closely coupled with feeding responses. Second, dataset labels are frequently assigned by human observation or manual thresholding. For example, some studies grade feeding intensity into categories such as “none”, “weak”, “medium”, and “strong” [6,12,13] based on qualitative criteria [22,23] or study-specific rules [16,17,18], which can be subjective and difficult to reproduce.
These limitations motivate incorporating explicit motion cues and seeking more reproducible labeling signals. Optical-flow estimation describes object motion between consecutive frames and provides a quantitative, pixel-wise motion field. In particular, dense optical flow captures fine-grained motion patterns across the entire image, which is useful for characterizing short-term swimming activity and school aggregation dynamics in aquaculture scenes [24,25]. Such motion information can be encoded into an image-like representation compatible with efficient 2D classifiers. Moreover, motion-intensity signals derived from optical flow provide an objective basis for delineating feeding responses in a more reproducible manner. Beyond aquaculture, optical flow has long been used in video understanding/action recognition as an explicit motion modality complementary to RGB appearance, supporting motion-driven recognition when static cues are insufficient. This provides a general computer-vision rationale for using optical flow to capture feeding-related motion cues and motivates efficient motion encodings compatible with lightweight 2D backbones.
Within optical-flow-based approaches, existing pipelines can be broadly grouped into three categories (Table 1). First, flow-derived handcrafted-indicator pipelines compute sparse/dense optical flow and summarize it into a small set of interpretable statistics (e.g., speed, direction, and activity intensity) for subsequent thresholding or conventional models. Representative examples include Zhao et al. [21], Wei et al. [6], and Zheng et al. [17] using Lucas–Kanade/sparse flow, as well as Måløy et al. [26] using Farnebäck dense optical flow. This line is lightweight and explainable, but its representational capacity is constrained by task-specific feature engineering and can be sensitive to scene changes.
Second, flow-based temporal deep models treat optical flow (often together with RGB) as an explicit motion modality and feed it into clip-based architectures (e.g., 3D CNNs or multi-stage pipelines) to capture longer temporal dependencies, as in Ubina et al. [14]. This design choice is consistent with the flow-stream practice in mainstream action-recognition models but typically incurs higher latency/memory due to clip buffering and temporal inference.
Third, short-term motion encoding for 2D inference compresses inter-frame motion into an image-like representation that preserves motion direction, magnitude, and spatial distribution, enabling single-stream 2D classification with lower latency. Overall, this taxonomy highlights a practical trade-off between representational capacity, temporal modeling horizon, and deployment cost; therefore, given our focus on edge deployment and reproducible monitoring, we adopt the third category.
Specifically, we compute classical dense optical flow between a frame pair (Farnebäck) and encode it as a compact motion–spatial representation that can be processed by lightweight 2D backbones. Compared with flow-derived handcrafted indicators (Category i), our encoding retains the dense spatial layout of motion patterns instead of reducing them to a few manually selected statistics, thereby improving representational capacity while keeping computation modest. Compared with flow+3D or multi-stage pipelines (Category ii) [14], our design avoids clip buffering and 3D inference, trading long-horizon temporal modeling for real-time, memory-efficient streaming inference that is more compatible with farm-side edge devices. Importantly, the same flow-derived motion-intensity signals are also used to support reproducible feeding-response delineation in our perceptual labeling protocol, linking the pipeline choice to both motion representation and labeling reproducibility. The next section details our Farnebäck-based frame-pair motion–spatial encoding and the corresponding EfficientFeedingNet design.
3. Materials and Methods
3.1. Experiment Setup and Dataset Preprocessing
3.1.1. Experiment Setup
Experiments aimed at identifying the feeding state of fish schools were conducted at an aquaculture farm located in Yantai City, Shandong Province. Juvenile Sebastes schlegelii (black rockfish) from the same batch at the same aquaculture farm were used as the experimental fish. Prior to the experiment, all fish were strictly screened, and only healthy individuals without external injuries or abnormal swimming behavior were included. In total, 100 fish were used (total length (TL) = 12.0 ± 0.72 cm; body weight (BW) = 30.8 ± 5.30 g; mean ± standard deviation (SD), n = 100). Fish were maintained in five existing breeding tanks (189 × 70 × 70 cm; water depth 60 cm), with 20 fish per tank. Water temperature (T, °C) and dissolved oxygen (DO, mg/L) were measured daily using a portable water-quality meter and averaged 9.0 ± 1.61 °C and 6.1 ± 0.41 mg/L (mean ± SD), respectively. Fish were fed daily at 13:00 with floating pellets (MARUBA, Tokyo, Japan) at a ration of 5% of the fish biomass per tank. Throughout the experiment, water changes were conducted daily at 08:00, and bottom sediments were removed at 20:00 to maintain stable rearing conditions. This routine management was implemented to help stabilize water quality and reduce the accumulation of suspended solids, thereby minimizing uncontrolled environmental variation and potential visual disturbances for video-based monitoring [27,28].
3.1.2. Video Data Acquisition
As shown in Figure 1, the video acquisition system consisted of an E27 spiral white energy-saving lamp (30 W, constant-current) and an infrared surveillance camera (fluorite EZVIZ CIHC) with a 4-megapixel resolution and a 4 mm focal length. Each camera was positioned approximately 1.8 m above the tank. The camera recorded videos at a fixed frame rate of 10 frames per second (fps). Illumination was provided by the E27 lamp, and the lighting setup (lamp type/power and mounting geometry) was kept constant throughout all recordings; illuminance (lux) at the water surface was not recorded. Video recordings were scheduled at 08:00, 13:00, and 21:00 daily to capture the swimming and feeding states of the fish. The 13:00 recording window coincided with the scheduled daily feeding, whereas the 08:00 and 21:00 recordings captured non-feeding baseline behaviors. Each session lasted 10 min, and data were collected over 30 consecutive days. A biofilter system provided aeration to maintain dissolved oxygen during the experiment.
3.1.3. Data Preprocessing
Using the OpenCV module [29], frames were extracted at 10 fps from the recorded videos. The images, originally at a resolution of 1920 × 1080 pixels, were resized to 800 × 500 pixels using the PIL library in Python. After cropping, extraneous background elements were removed to better highlight the fish activity regions. Additionally, images with significant noise or blurring were excluded from the datasets to maintain data quality. Data augmentation for training consisted of RandomResizedCrop (output size = 224, default scale range = 0.08–1.00, aspect-ratio range = 3/4–4/3) and RandomHorizontalFlip (p = 0.5) [30]. Images were then converted to tensors and normalized using ImageNet statistics (mean = [0.485, 0.456, 0.406], std = [0.229, 0.224, 0.225]). The official documentation links for the open-source software used in this study, as well as the reference preprocessing scripts and minimal example data, are provided in the Supplementary Materials.
After preprocessing, we obtained standardized top-view frames (region-of-interest (ROI)-cropped and quality-controlled) from all recordings. To ensure fair sampling between the Intuitive and Perceptual datasets, we used the same source videos and the same temporal sampling scheme. Specifically, we uniformly sampled candidate timestamps at a fixed stride of Δ = 10 frames (≈1 s at 10 fps). At each sampled timestamp t, we extracted the corresponding preprocessed RGB frame as an Intuitive candidate sample (Section 3.1.4). In parallel, we paired the frame at t with the frame at t + Δ to compute dense optical flow and generate a motion–spatial map as a Perceptual candidate sample (Section 3.2). The two datasets therefore share the same recordings and temporal sampling, while differing in input representation (RGB frames vs. optical-flow maps) and labeling protocol (observer-labeled vs. perception-based). The choice of the frame interval/stride (Δ = 10 frames, ≈1 s at 10 fps) was empirical and reflects a trade-off between motion salience and optical-flow stability at our frame rate. A smaller Δ tends to yield weaker inter-frame displacement and lower contrast in motion–spatial maps, whereas a larger Δ increases displacement/occlusion and may degrade flow estimation and reduce temporal resolution.
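The sampling scheme can be sketched as follows. This is a minimal illustration, assuming frame indices start at 0 and that a pair is kept only when both frames exist; the function name `frame_pairs` is hypothetical and not taken from the authors' scripts.

```python
def frame_pairs(n_frames: int, delta: int = 10) -> list[tuple[int, int]]:
    """Return (t, t + delta) index pairs sampled at a stride of delta frames.

    The RGB frame at t would serve as the Intuitive candidate, and the pair
    (t, t + delta) as the input to dense optical flow for the Perceptual candidate.
    """
    # Stop early enough that t + delta is still a valid frame index.
    return [(t, t + delta) for t in range(0, n_frames - delta, delta)]


FPS = 10
n_frames = 10 * 60 * FPS          # one 10-min session at 10 fps = 6000 frames
pairs = frame_pairs(n_frames)
print(len(pairs), pairs[0], pairs[-1])
```

Under these assumptions, one session yields 599 candidate pairs, roughly one per second of video.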
3.1.4. Intuitive Dataset Setup (Observer-Labeled Baseline)
The Intuitive Dataset was constructed as a practice-driven baseline using expert visual observation of the top-view videos. Labels were assigned following a standardized set of behavioral criteria observable from overhead footage, grounded in established descriptions of feeding-related behavioral responses in fish [10,11] and our farming observations. In brief, the “Feeding” state corresponds to an active feeding-response pattern characterized by directed aggregation toward the feeding area together with a sustained elevation of school activity after feed delivery, whereas the “Non-feeding” state corresponds to baseline swimming without a persistent feeding-response pattern.
For each feeding event, we identified the feed-delivery onset time $t_{\text{feed}}$ from the video timestamp (when pellets first entered the water/feeding operation started). We then annotated the start and end of the feeding-response phase, $t_{\text{start}}$ and $t_{\text{end}}$, based on the criteria in Note 1. Frames sampled within [$t_{\text{start}}$, $t_{\text{end}}$] were labeled as “Feeding”, and frames outside this interval were labeled as “Non-feeding”. A schematic timeline of the event-level labeling procedure and the key time points is provided in Supplementary Figure S1. Candidate samples were extracted from all videos at a fixed temporal stride (Δ = 10 frames). We then randomly selected samples from the Feeding and Non-feeding intervals to obtain a class-balanced dataset (5000 per class). Ambiguous boundary segments were resolved according to Note 1 and assigned to one of the two labels (Feeding/Non-feeding); no separate transition class was defined. All labels were first assigned by one experienced annotator, and a second annotator conducted a random-sample review for consistency; any disagreements were resolved by discussion and consensus, after which the final labeling protocol was fixed and applied to the dataset. This observer-labeled dataset is used only as a baseline for comparison with the perception-based dataset construction described in Section 3.2, and it is not claimed to be universally transferable across species or farming conditions. The Intuitive Dataset contains 10,000 images in total (5000 Feeding and 5000 Non-feeding); we used 4500 images per class for training and 500 images per class for fixed testing (Section 3.5).
Note 1. Observer-labeling criteria for the Intuitive Dataset.
- Feeding onset ($t_{\text{feed}}$): defined as the video timestamp when pellets first enter the water/feeding operation starts.
- Feeding: assigned when a coherent feeding-response pattern is observed after $t_{\text{feed}}$, i.e., directed aggregation toward the feeding area together with a sustained elevation of school activity (e.g., frequent burst/turning).
- Non-feeding: assigned when the persistent feeding-response pattern is absent (including prefeeding periods and postfeeding periods where fish may linger near the feeding area but activity has returned to baseline).
- Confounds: high-motion events caused by chasing/fighting/play are labeled as Non-feeding unless they also show directed aggregation consistent with feeding.
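The class-balanced selection and the 4500/500 split per class described in this section could be reproduced along the following lines. This is a sketch under stated assumptions: the text does not say how the fixed test samples were drawn, so the seeded random draw, the function name `balanced_split`, and the tuple layout are all hypothetical.

```python
import random


def balanced_split(feeding, non_feeding, per_class=5000, test_per_class=500, seed=0):
    """Draw per_class samples from each pool and hold out test_per_class of them.

    Returns (train, test) as lists of (sample, label) tuples.
    """
    rng = random.Random(seed)     # fixed seed so the test split stays fixed (assumption)
    train, test = [], []
    for label, pool in (("Feeding", feeding), ("Non-feeding", non_feeding)):
        chosen = rng.sample(list(pool), per_class)   # sampling without replacement
        test += [(s, label) for s in chosen[:test_per_class]]
        train += [(s, label) for s in chosen[test_per_class:]]
    return train, test


# Toy candidate pools standing in for sampled frame identifiers.
train, test = balanced_split(range(6000), range(100000, 120000))
print(len(train), len(test))
```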
3.2. Motion–Spatial Frame-Pair Encoding Feature Extraction and Quantification
3.2.1. Feature Extraction Using Optical Flow
The Perceptual Dataset proposed in this study was created using the optical flow method, primarily to capture motion features related to the feeding state of fish. Optical flow analyzes changes in pixel positions between consecutive frames of images to determine how pixels move from one frame to the next. Therefore, we first employed the Farnebäck dense optical-flow method [31] to derive a high-dimensional, approximate representation of the school’s collective motion dynamics. This approach enables an effective characterization of the fish group’s motion and spatial patterns at the global scale.
We assume that the image intensity within a local window can be approximated by a quadratic polynomial. We formulate a weighted least-squares problem over a local window $\Omega(x)$ around each pixel $x$, using a Gaussian weighting function to emphasize the center of the patch. Denoting the polynomial coefficient differences induced by the displacement as $\Delta A$ and $\Delta \mathbf{b}$, the objective becomes:
$$E(\mathbf{d}) = \sum_{x' \in \Omega(x)} w(x') \left\lVert J(x')\,\mathbf{d} - \mathbf{r}(x') \right\rVert^{2}$$
Here $\mathbf{d}$ denotes the displacement to be estimated, $w(x')$ is the Gaussian weight at pixel $x'$, $J(x')$ is the Jacobian matrix of partial derivatives of the quadratic model with respect to $\mathbf{d}$, and $\mathbf{r}(x')$ collects the constant residual terms after first-order (Taylor) expansion.
Minimizing $E(\mathbf{d})$ yields a small linear system for $\mathbf{d}$ at each pixel, $\left(\sum w\, J^{\top} J\right)\mathbf{d} = \sum w\, J^{\top}\mathbf{r}$, which Farnebäck solves in closed form via normal equations. Embedded in a multi-scale pyramid with iterative refinement, this procedure produces a dense and smooth motion field suitable for subsequent feeding-state analysis. We visualize the dense optical flow as an image-like motion–spatial map, where hue encodes motion direction, brightness encodes normalized motion magnitude, and the pixel layout preserves the spatial distribution of motion. We refer to this frame-pair-derived representation as frame-pair motion encoding.
Dense optical flow was computed using OpenCV cv2.calcOpticalFlowFarneback [29] with pyr_scale = 0.5, levels = 3, winsize = 15, iterations = 3, poly_n = 5, poly_sigma = 1.2, and flags = 0. These settings were chosen empirically to balance motion sensitivity, robustness, and runtime at our resolution and frame rate.
These Farnebäck hyperparameters were kept fixed for all experiments to ensure fair comparisons across models and datasets. We did not perform an exhaustive parameter sweep; rather, we adopted a single practical setting and verified qualitatively that it produced stable motion–spatial maps on representative videos. A quantitative sensitivity study of optical-flow hyperparameters is left for future work.
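For reference, the fixed settings can be collected in one place and passed to OpenCV as sketched below. The dictionary name `FARNEBACK_PARAMS` and the wrapper `dense_flow` are illustrative, not the authors' code. The wrapper assumes two 8-bit single-channel (grayscale) frames of equal size and imports OpenCV lazily, so the settings can be inspected without it installed.

```python
# Hyperparameters as reported in the text, keyed by OpenCV's argument names.
FARNEBACK_PARAMS = {
    "pyr_scale": 0.5,   # image scale between pyramid levels
    "levels": 3,        # number of pyramid levels
    "winsize": 15,      # averaging window size
    "iterations": 3,    # iterations per pyramid level
    "poly_n": 5,        # pixel neighborhood for the polynomial expansion
    "poly_sigma": 1.2,  # Gaussian std used to smooth derivatives
    "flags": 0,
}


def dense_flow(prev_gray, next_gray):
    """Return an H x W x 2 array of per-pixel (u, v) displacements."""
    import cv2  # lazy import: only needed when flow is actually computed

    p = FARNEBACK_PARAMS
    return cv2.calcOpticalFlowFarneback(
        prev_gray, next_gray, None,
        p["pyr_scale"], p["levels"], p["winsize"],
        p["iterations"], p["poly_n"], p["poly_sigma"], p["flags"],
    )
```

In this sketch, `prev_gray` and `next_gray` would be the frames at t and t + Δ after grayscale conversion.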
Frame-pair motion encoding produces a 2D representation: spatiotemporal information is not modeled by a clip-based temporal network. Instead, each sample is constructed from a frame pair by computing dense optical flow with the Farnebäck method. The resulting flow field (u(x), v(x)) is converted into an image-like motion map by encoding the vector orientation (θ(x) = arctan2(v(x), u(x))) and magnitude into a three-channel HSV-style representation, which is then used as input to the CNN. In this representation, motion direction is encoded by the hue/angle component, motion intensity is encoded by the brightness/magnitude component, and the spatial distribution of motion is preserved by the pixel layout (e.g., where high-activity regions appear in the tank view). Each motion–spatial map (optical-flow visualization) was resized/cropped to 224 × 224, converted to a tensor, and normalized using the ImageNet statistics described in Section 3.1.3, consistent with the preprocessing of RGB-frame inputs. Frame-pair motion encoding therefore intentionally sacrifices long-horizon temporal modeling in exchange for real-time, memory-efficient streaming inference, which better matches edge deployment constraints in aquaculture. Because it does not explicitly model multi-second temporal dependencies, we discuss this limitation and lightweight temporal aggregation extensions in Section 5.4.
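The direction/magnitude encoding can be illustrated with a small pure-Python sketch. It is not the authors' implementation. It assumes full saturation and per-map normalization of magnitude by its maximum, neither of which is specified in the text, and the function name `encode_flow` is hypothetical. A practical pipeline would vectorize these steps with NumPy/OpenCV.

```python
import colorsys
import math


def encode_flow(flow):
    """Map rows of (u, v) displacements to rows of (R, G, B) tuples in 0..255.

    Hue encodes direction, value encodes normalized magnitude, and the
    pixel layout is preserved. Saturation is fixed at 1 (assumption).
    """
    mags = [[math.hypot(u, v) for (u, v) in row] for row in flow]
    max_mag = max((m for row in mags for m in row), default=0.0)
    image = []
    for flow_row, mag_row in zip(flow, mags):
        out_row = []
        for (u, v), mag in zip(flow_row, mag_row):
            theta = math.atan2(v, u) % (2.0 * math.pi)      # direction in [0, 2*pi)
            hue = theta / (2.0 * math.pi)                   # rescaled to [0, 1)
            value = mag / max_mag if max_mag > 0 else 0.0   # normalized magnitude
            r, g, b = colorsys.hsv_to_rgb(hue, 1.0, value)
            out_row.append((round(r * 255), round(g * 255), round(b * 255)))
        image.append(out_row)
    return image


# A 2 x 2 toy flow field: rightward, static, strong downward, leftward motion.
toy_flow = [[(1.0, 0.0), (0.0, 0.0)],
            [(0.0, 2.0), (-1.0, 0.0)]]
toy_map = encode_flow(toy_flow)
```

Static pixels come out black and the fastest pixel gets full brightness, so brighter maps correspond to more vigorous school activity.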
3.2.2. Quantification of Frame-Pair Encoding Feature
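To further interpret the motion–spatial maps and examine how their visual patterns relate to feeding states, we analyzed their color statistics in the HSV color space. Optical-flow maps were converted from BGR to hue, saturation, and value (HSV). After normalizing RGB intensities to [0, 1], we computed the HSV components using the standard formulation. Hue H encodes motion direction (in degrees, H ∈ [0°, 360°)), saturation S reflects chromatic purity, and value V corresponds to brightness (i.e., motion magnitude), as defined in Equations (4)–(6).
$$H = \begin{cases} 0^{\circ}, & \delta = 0 \\ 60^{\circ} \times \left( \dfrac{G' - B'}{\delta} \bmod 6 \right), & C_{\max} = R' \\ 60^{\circ} \times \left( \dfrac{B' - R'}{\delta} + 2 \right), & C_{\max} = G' \\ 60^{\circ} \times \left( \dfrac{R' - G'}{\delta} + 4 \right), & C_{\max} = B' \end{cases} \tag{4}$$
$$S = \begin{cases} 0, & C_{\max} = 0 \\ \dfrac{\delta}{C_{\max}}, & \text{otherwise} \end{cases} \tag{5}$$
$$V = C_{\max} \tag{6}$$
where $R'$, $G'$, and $B'$ denote the normalized RGB components, $C_{\max} = \max(R', G', B')$, $C_{\min} = \min(R', G', B')$, and $\delta = C_{\max} - C_{\min}$.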
3.2.3. V-Value Definition and Perceptual Labeling Rule
For each optical-flow sample, we convert the optical-flow visualization image from BGR to HSV and extract its value (V) channel $V_k(x, y)$. We define the scalar V-Value of sample k as the spatial mean of the V channel over the ROI:
$$\bar{V}_k = \frac{1}{|\Omega|} \sum_{(x, y) \in \Omega} V_k(x, y) \tag{7}$$
where $V_k(x, y)$ denotes the HSV value-channel intensity at pixel $(x, y)$ of the k-th cropped motion–spatial map and $\Omega$ is the full pixel domain of that cropped map (i.e., the entire image after cropping), with $|\Omega|$ being the number of pixels.
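A minimal sketch of the V-Value computation follows, using the fact that the HSV value channel equals the per-pixel maximum of the color components. Scaling V to [0, 1] is an assumption here (8-bit OpenCV images would give 0 to 255), and the function name `v_value` is illustrative.

```python
def v_value(rgb_image):
    """Spatial mean of the HSV value channel over all pixels of one map.

    rgb_image: rows of (R, G, B) tuples with components in 0..255.
    V = max(R, G, B) / 255, so channel order (RGB vs. BGR) does not matter.
    """
    values = [max(pixel) / 255.0 for row in rgb_image for pixel in row]
    if not values:
        raise ValueError("empty image")
    return sum(values) / len(values)


# Toy 2 x 2 map: one bright pixel, one mid-brightness pixel, two static pixels.
toy_map = [[(255, 0, 0), (0, 0, 0)],
           [(0, 128, 0), (0, 0, 0)]]
print(v_value(toy_map))
```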
For each feeding event, the V-Value sequence is smoothed using a centered moving-average filter (window w = 11 samples). Optical flow is computed between frame pairs separated by Δ = 10 frames (≈1 s at 10 fps), so each V-Value sample corresponds to approximately 1 s. We then apply 1D k-means clustering with two clusters (k = 2) [32] to the smoothed sequence to obtain two cluster centers ($c_{\text{low}}$ and $c_{\text{high}}$). The higher-center cluster is treated as the feeding-response state, and the corresponding mid-point threshold $T = (c_{\text{low}} + c_{\text{high}})/2$ is used for visualization. To enforce temporal continuity, the feeding-response phase is defined as the longest contiguous segment assigned to the feeding-response state, with start and end indices $t_{\text{start}}$ and $t_{\text{end}}$. Optical-flow samples within [$t_{\text{start}}$, $t_{\text{end}}$] are labeled as “Feeding”, and samples outside this interval are labeled as “Non-feeding”, forming the Perceptual Dataset. To match the Intuitive Dataset size, we constructed a balanced Perceptual Dataset of 10,000 samples (5000 Feeding and 5000 Non-feeding) and used 4500 images per class for training and 500 images per class for fixed testing. To facilitate replication, a reproducibility package is provided as Supplementary Materials, including a minimal example dataset and the scripts used for frame extraction and preprocessing, motion–spatial map generation, and feeding-state interval delineation.
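The labeling rule can be sketched in a few lines of standard-library Python. This is an illustration, not the supplementary script. The shrinking window at sequence edges, the min/max initialization of the two cluster centers, and all function names are assumptions made for the example.

```python
def moving_average(values, window=11):
    """Centered moving average; the window shrinks at the sequence edges (assumption)."""
    half = window // 2
    out = []
    for i in range(len(values)):
        seg = values[max(0, i - half): i + half + 1]
        out.append(sum(seg) / len(seg))
    return out


def kmeans_1d_two(values, iters=100):
    """1D k-means with two clusters, initialized at the min and max (assumption)."""
    c_low, c_high = min(values), max(values)
    for _ in range(iters):
        low = [v for v in values if abs(v - c_low) <= abs(v - c_high)]
        high = [v for v in values if abs(v - c_low) > abs(v - c_high)]
        if not high:              # constant sequence: nothing to separate
            break
        new_low, new_high = sum(low) / len(low), sum(high) / len(high)
        if (new_low, new_high) == (c_low, c_high):
            break
        c_low, c_high = new_low, new_high
    return c_low, c_high


def feeding_interval(v_values, window=11):
    """Return ((t_start, t_end), threshold); the interval is None if no sample is 'high'."""
    smooth = moving_average(v_values, window)
    c_low, c_high = kmeans_1d_two(smooth)
    threshold = (c_low + c_high) / 2.0
    is_high = [abs(v - c_high) < abs(v - c_low) for v in smooth]
    best, start = None, None
    for i, flag in enumerate(is_high + [False]):   # sentinel closes a trailing run
        if flag and start is None:
            start = i
        elif not flag and start is not None:
            if best is None or (i - start) > (best[1] - best[0] + 1):
                best = (start, i - 1)              # keep the longest contiguous run
            start = None
    return best, threshold


# Synthetic V-Value trace: baseline, a 40-sample feeding response, baseline,
# a shorter 12-sample high-motion burst, baseline.
trace = [0.1] * 30 + [0.8] * 40 + [0.1] * 30 + [0.8] * 12 + [0.1] * 30
interval, threshold = feeding_interval(trace)
labels = ["Feeding" if interval and interval[0] <= i <= interval[1] else "Non-feeding"
          for i in range(len(trace))]
```

Keeping only the longest contiguous run means the shorter burst in the synthetic trace is labeled Non-feeding even though it crosses the threshold, which is how the rule enforces temporal continuity.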
3.3. EfficientFeedingNet
We adopt EfficientNet as the baseline because it achieves a favorable accuracy–efficiency trade-off via compound scaling of network depth, width, and input resolution [33]. EfficientNet is built from MBConv blocks with depthwise separable convolutions and inverted residual connections, which reduce parameters and floating-point operations (FLOPs) while maintaining representational capacity. The network consists of a stem convolution, seven MBConv stages, and a classification head (Table 2).
Building on the baseline model, to better adapt the network to our Perceptual Dataset, the original SE module [34] was replaced by a HybridAttention module at each MBConv stage. As shown in Figure 2, the proposed module adopts ECA [35] for efficient channel-wise modeling and, on its output, further introduces a lightweight spatial attention composed of channel-wise mean and max aggregation followed by a 3 × 3 convolution [36], yielding a progressive channel to spatial fusion. Compared with SE, this design maintains discriminative power while substantially reducing parameter count and computational overhead, and it enhances channel selectivity and spatial localization, making it well-suited for lightweight network scenarios. The complete mathematical expression of the HybridAttention module is shown in Equations (8)–(10).
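One way to write the channel-then-spatial fusion described above is sketched below. This is a reading of the prose under stated assumptions, not a reproduction of the paper's Equations (8)–(10). The symbols $X$ (input feature map), $\mathbf{w}_c$ (channel weights), $X_c$ (channel-refined features), $M_s$ (spatial attention map), $Y$ (output), and the ECA kernel size $k$ are notation chosen for this sketch.

```latex
\begin{aligned}
\mathbf{w}_c &= \sigma\!\left(\mathrm{Conv1D}_{k}\big(\mathrm{GAP}(X)\big)\right),
  \qquad X_c = \mathbf{w}_c \odot X
  && \text{(ECA channel attention)} \\
M_s &= \sigma\!\left(\mathrm{Conv}_{3\times 3}\big(\big[\mathrm{Mean}_{c}(X_c);\ \mathrm{Max}_{c}(X_c)\big]\big)\right)
  && \text{(spatial attention on the ECA output)} \\
Y &= M_s \odot X_c
  && \text{(progressive channel-to-spatial fusion)}
\end{aligned}
```

Here $\mathrm{GAP}$ is global average pooling over the spatial dimensions, $\mathrm{Mean}_c$ and $\mathrm{Max}_c$ aggregate across channels at each pixel, $[\cdot\,;\cdot]$ is channel concatenation, $\sigma$ is the sigmoid, and $\odot$ is element-wise multiplication with broadcasting.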
We replaced the SiLU (Swish) activation in the MBConv modules with Mish [37] (Equation (11)). Mish is smoother around x ≈ 0 and retains small negative responses, which can improve gradient flow and sensitivity to weak, boundary-blurred motion patterns in motion–spatial maps. In our experiments, this change contributed to more stable optimization and improved feeding-state discrimination.
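For reference, the standard definitions are Mish(x) = x · tanh(softplus(x)) and SiLU(x) = x · sigmoid(x). The scalar sketch below evaluates both near zero; it is meant only to make the activation shapes concrete and is not the layer implementation used in training.

```python
import math


def softplus(x: float) -> float:
    """Numerically stable ln(1 + e^x)."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def mish(x: float) -> float:
    """Mish(x) = x * tanh(softplus(x))."""
    return x * math.tanh(softplus(x))


def silu(x: float) -> float:
    """SiLU / Swish(x) = x * sigmoid(x); fine for the moderate inputs used here."""
    return x / (1.0 + math.exp(-x))


# Compare both activations on a few inputs around zero.
samples = [-3.0, -1.0, -0.1, 0.0, 0.1, 1.0, 3.0]
table = [(x, round(mish(x), 4), round(silu(x), 4)) for x in samples]
```

Both functions pass small negative values through rather than zeroing them, and both approach the identity for large positive inputs.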
With these modifications, the standard MBConv block was converted into an Improved MBConv block tailored to motion–spatial maps. As shown in Figure 3, the Improved MBConv preserves the lightweight design while enhancing channel selectivity and spatial localization. Replacing SiLU with Mish yields smoother feature distributions and improves sensitivity to subtle, boundary-ambiguous motion cues. The ECA branch models channel interactions with negligible overhead, reducing the parameter and memory-access costs of SE, while the 3 × 3 spatial branch adds local spatial selectivity with minimal additional computation, improving discrimination between Feeding and Non-feeding states in our setting.
The overall architecture of EfficientFeedingNet is shown in Figure 4. The network follows the standard EfficientNet design with a stem convolution, seven MBConv stages, and a classification head. The primary modifications are: (1) Improved MBConv blocks in each stage; (2) Mish activations replacing SiLU throughout; (3) a reduced dropout rate (0.05); and (4) truncated normal initialization (std = 0.02) for linear layers to improve training stability. The network maintains the same width/depth coefficients and stage configurations as the baseline, ensuring architectural consistency while achieving improved feature learning capabilities.
3.4. Model Comparison
Several representative backbones were selected for comparison with EfficientFeedingNet, including the CNN-based ResNet101 [38], the transformer-based ViT-B/16 (PureViT) [39], MobileViT [40], the recent Efficient Vision Mamba [41], and MobileViT-SENet [42]. Beyond being standard computer-vision baselines, these architectures were chosen because the same model families (or close variants) have been increasingly applied to aquaculture monitoring and fish-behavior analysis, including feeding-state recognition [13,17], transformer-based appetite/starvation grading [18], and MobileViT-based feeding-behavior recognition from video streaming and industrial deployments [43,44]. Therefore, they provide task-relevant benchmarks for the present application.
3.5. Model Configuration
The model was trained with the following settings: 100 epochs, batch size of 16, and an initial learning rate of 1 × 10⁻³. A cosine learning-rate schedule [45] was used to smoothly decay the learning rate from 1 × 10⁻³ to 1 × 10⁻⁵ (final LR fraction lrf = 0.01). We employed AdamW [46] as the optimizer (weight decay 1 × 10⁻²) and cross-entropy as the loss, with gradient clipping applied to all trainable parameters (max_norm = 1.0) for stability. For data preprocessing, the training set used RandomResizedCrop (224) and RandomHorizontalFlip (p = 0.5) followed by normalization; the validation set used Resize and CenterCrop with the same normalization. We monitored top-1 accuracy on the validation set and saved the checkpoint with the best validation accuracy for testing.
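The decay described above can be illustrated with a commonly used cosine form in which the learning rate falls from lr0 to lrf × lr0. The exact expression used in training is not stated, so the formula and the function name `cosine_lr` below are assumptions chosen to match the reported endpoints (1 × 10⁻³ and 1 × 10⁻⁵).

```python
import math


def cosine_lr(epoch: int, total_epochs: int = 100,
              lr0: float = 1e-3, lrf: float = 0.01) -> float:
    """Cosine decay from lr0 (epoch 0) to lrf * lr0 (epoch == total_epochs)."""
    cosine = 0.5 * (1.0 + math.cos(math.pi * epoch / total_epochs))
    return lr0 * (lrf + (1.0 - lrf) * cosine)


# Learning rate at the start, midpoint, and end of a 100-epoch run.
schedule = {epoch: cosine_lr(epoch) for epoch in (0, 50, 100)}
```

If epochs are indexed 0 to 99, the last epoch actually trained uses a rate slightly above the 1 × 10⁻⁵ floor.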
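Evaluation metrics and statistical analysis: For binary classification, Feeding was treated as the positive class. Accuracy (Acc) was computed as Acc = (TP + TN)/(TP + TN + FP + FN), where TP, TN, FP, and FN denote the numbers of true positives, true negatives, false positives, and false negatives, respectively.
Precision, recall, and F1-score were computed for the Feeding class as Precision = TP/(TP + FP), Recall = TP/(TP + FN), and F1 = 2 × Precision × Recall/(Precision + Recall).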
Metrics were computed on the fixed test set and reported as mean ± standard deviation (SD) over three independent runs. For key comparisons reported in Section 4, Welch’s t-test (two-sided) was used on test accuracy across runs, with p < 0.05 considered statistically significant.
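The evaluation protocol can be sketched as follows, using the standard definitions of the metrics with Feeding as the positive class. The confusion counts and per-run accuracies are hypothetical placeholders, not values from this study. The sketch computes Welch's t statistic and the Welch–Satterthwaite degrees of freedom only; obtaining the two-sided p-value additionally requires the Student t distribution (for example via `scipy.stats.ttest_ind` with `equal_var=False`).

```python
from math import sqrt
from statistics import mean, stdev, variance


def binary_metrics(tp: int, fp: int, tn: int, fn: int) -> dict:
    """Accuracy, precision, recall, and F1 with Feeding as the positive class."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {"acc": (tp + tn) / (tp + fp + tn + fn),
            "precision": precision, "recall": recall, "f1": f1}


def welch_t(a, b):
    """Welch's t statistic and Welch-Satterthwaite degrees of freedom."""
    va, vb = variance(a) / len(a), variance(b) / len(b)
    t = (mean(a) - mean(b)) / sqrt(va + vb)
    df = (va + vb) ** 2 / (va ** 2 / (len(a) - 1) + vb ** 2 / (len(b) - 1))
    return t, df


# Hypothetical confusion counts on a 1000-image balanced test set.
metrics = binary_metrics(tp=480, fp=20, tn=480, fn=20)

# Hypothetical test accuracies (%) of two models over three runs each.
runs_a = [96.0, 97.0, 98.0]
runs_b = [93.0, 94.0, 95.0]
summary_a = (mean(runs_a), stdev(runs_a))  # reported as mean ± SD
t_stat, dof = welch_t(runs_a, runs_b)
```

With only three runs per model the degrees of freedom are small, so even differences of several percentage points can fail to reach p < 0.05 when run-to-run variance is high.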
Both the Intuitive and Perceptual datasets contain 10,000 images (5000 Feeding and 5000 Non-feeding). We used a fixed, class-balanced test set of 1000 images (500 per class); the remaining 9000 images (4500 per class) formed the training pool, from which a validation subset was sampled during training. Importantly, splitting was performed at the feeding-event (video-session/day) level rather than at the frame level: all samples from the same day were assigned to either the training/validation pool or the test set and never to both, preventing temporal leakage. The test set was constructed exclusively from held-out days. For consistency, the same day-wise separation was applied to the Non-feeding sessions (08:00 and 21:00), so both Feeding and Non-feeding test samples come from unseen days.
For each run, 10% of the training pool (450 images per class) was randomly selected as the validation set and the remaining 90% (4050 images per class) was used for training; the best checkpoint on the validation set was used for final evaluation on the fixed test set. All models were trained on a Windows 11 workstation equipped with an NVIDIA RTX 4080 GPU (16 GB memory). The training environment used Python 3.9.7, PyTorch 2.6.0 [30], CUDA 12.6, and OpenCV 4.11 [29].
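A minimal sketch of the day-level partitioning is given below: whole days are held out for testing, and the validation subset is then drawn at random, per class, from the remaining pool. The tuple layout `(day, label, path)`, the function name, and the toy data are assumptions for illustration.

```python
import random


def split_by_day(samples, test_days, val_fraction=0.1, seed=0):
    """Hold out whole days for testing, then sample validation per class.

    samples: list of (day, label, path) tuples; test_days: set of held-out days.
    """
    rng = random.Random(seed)
    test = [s for s in samples if s[0] in test_days]
    pool = [s for s in samples if s[0] not in test_days]
    train, val = [], []
    for label in sorted({s[1] for s in pool}):
        group = [s for s in pool if s[1] == label]
        rng.shuffle(group)
        n_val = round(len(group) * val_fraction)
        val.extend(group[:n_val])
        train.extend(group[n_val:])
    return train, val, test


# Toy data: 10 days, 10 images per class per day (file names are made up).
samples = [(day, label, f"day{day:02d}_{label}_{i}.png")
           for day in range(1, 11)
           for label in ("Feeding", "Non-feeding")
           for i in range(10)]
train, val, test = split_by_day(samples, test_days={9, 10})
```

Because the split key is the day rather than the frame, no frame pair from a test day can appear in the training or validation sets, while the validation subset may share days with the training set, as in the protocol described above.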
4. Results
4.1. Results of Motion–Spatial Frame-Pair Encoding Feature Extraction and Quantification
4.1.1. Results of Frame-Pair Encoding Feature Extraction
Figure 5 illustrates the original frames and the corresponding optical-flow images, together with an example of the transformation process from the former to the latter. In the original images, the background remains static, while the primary changes involve the positions and aggregation levels of the fish within the aquaculture tank, reflecting their spatial characteristics under Feeding and Non-feeding states.
Optical-flow images are generated from the changes between the two frames of a pair (as illustrated in Figure 5, from frame t_0 to frame t_1, which are separated by 10 frames in this study). As discussed in Section 3.2.1, fish activity (e.g., swimming speed, acceleration, tailbeat frequency, and turning angles) is mapped to variations in optical flow. The colors in the image represent the direction of optical flow. For example, red indicates movement to the right, green indicates movement downward, blue indicates movement to the left, and yellow indicates movement upward. The brightness represents the magnitude or speed of the optical flow, with brighter areas indicating faster motion and darker areas indicating slower motion. Similarly, optical-flow variations capture information on the spatial distribution and aggregation of the fish. Therefore, the Perceptual Dataset constructed from motion–spatial maps can be regarded as a frame-pair encoding of both motion and spatial features of the fish.
4.1.2. Results of Frame-Pair Encoding Feature Quantification
Figure 6 compares image features between Feeding and Non-feeding states. Optical-flow maps are shown in Figure 6a,b, and the corresponding HSV channels are shown in Figure 6c–h (H: c,d; S: e,f; V: g,h). The optical-flow maps reveal clear differences in the spatial extent and intensity of motion between Feeding and Non-feeding periods, supporting the association between feeding state and activity level.
To facilitate quantitative analysis, the optical-flow images (BGR) were converted to the HSV color space, and the three channels were visualized separately to examine their contributions (Figure 6c–h). The hue (H) channel mainly reflects changes in motion direction in the color-coded flow map. The saturation (S) channel can highlight salient motion regions, but it is also more sensitive to background artifacts and low-confidence flow in near-static areas, which may introduce additional noise. In contrast, the value (V) channel corresponds to image brightness and, under our optical-flow visualization scheme, increases with flow magnitude; therefore, it provides a more direct and stable proxy for overall motion intensity. Accordingly, we selected the V channel for feature quantification and used it to derive the scalar V-Value for subsequent analysis.
We then used the V-Value as the basis for distinguishing between Feeding and Non-feeding states of the fish group. Figure 7 shows the V-Value time series, which visualizes the difference between the two states.
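As a concrete illustration of the quantification step, the sketch below reduces a color-coded flow image to one scalar by averaging its HSV value channel, which for an 8-bit BGR pixel equals max(B, G, R). Treating the V-Value as the spatial mean of the V channel is an assumption made here for illustration; the function name and the tiny 2 × 2 image are hypothetical.

```python
def v_value(bgr_image):
    """Mean HSV value channel of an 8-bit BGR image given as rows of (b, g, r)."""
    values = [max(pixel) for row in bgr_image for pixel in row]
    return sum(values) / len(values)


# Tiny 2 x 2 flow map: dark pixels mean little motion, bright pixels strong motion.
flow_map = [
    [(0, 0, 0), (0, 0, 255)],
    [(10, 20, 30), (0, 128, 0)],
]
score = v_value(flow_map)

# Applying v_value to each flow map in temporal order yields the time series.
series = [v_value(frame) for frame in (flow_map, flow_map)]
```

Because V ignores hue, the resulting score reflects how strongly the school moves but not in which direction.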
4.2. Model Comparison Experiment
4.2.1. Model Comparison on Intuitive and Perceptual Datasets
To compare representative architectures on the Intuitive and Perceptual datasets, we conducted controlled model-comparison experiments. For all models, we fixed the input resolution (224 × 224) and the training budget (100 epochs). We also used the same data partitioning and evaluation protocol (Section 3.5), including a fixed, class-balanced test set and reporting mean ± standard deviation (SD) over three independent runs. Beyond these controlled settings, the remaining training details (e.g., data augmentation strategy and optimization hyperparameters such as optimizer and learning-rate schedule) followed the default or recommended configuration of each model’s reference implementation, and no additional per-model hyperparameter tuning was performed in this study.
Figure 8 presents the average test accuracy of the different models, where test accuracy was computed using the checkpoint with the best validation accuracy for each run. All models achieved substantially higher test accuracy on the Perceptual Dataset than on the Intuitive Dataset. As indicated in Table 3, the test-accuracy improvement from the Intuitive to the Perceptual Dataset ranges from 13.13 to 18.46 percentage points, with Efficient Vision Mamba showing the largest gain (18.46 percentage points). These improvements are statistically significant for all models (Welch’s t-test, two-sided; all p ≤ 0.002).
Table 3 (rows 1–6) summarizes model performance on the Intuitive Dataset. PureViT achieves the highest mean validation accuracy (97.05 ± 0.11%), whereas EfficientFeedingNet achieves the highest mean test accuracy (80.33 ± 0.91%). Table 3 (rows 7–12) reports results on the Perceptual Dataset. All models exceed 95% mean validation accuracy and 90% mean test accuracy. EfficientFeedingNet performs best overall, achieving 99.44 ± 0.15% validation accuracy and 96.53 ± 0.09% test accuracy, with strong recall (95.80 ± 1.20%) and F1-score (96.51 ± 0.86%). MobileViT yields the highest mean precision (98.07 ± 1.33%).
Following Section 3.5, we performed Welch’s t-test (two-sided) on test accuracy across three runs. On the Perceptual Dataset, EfficientFeedingNet achieved the highest mean test accuracy (96.53 ± 0.09%). It significantly outperformed PureViT (Δ = 3.83 percentage points, p = 0.013) and Efficient Vision Mamba (Δ = 5.60 percentage points, p = 0.017). Compared with ResNet101, MobileViT, and MobileViT-SENet, EfficientFeedingNet achieved numerically higher accuracy, but the differences were not statistically significant (all p > 0.05).
4.2.2. Comparison on Performance of Models
Figure 9 visualizes the performance metrics of different models on the Perceptual Dataset, and Table 4 summarizes test accuracy, parameter count, FLOPs, and per-image inference time. The inference latency (ms/image) was measured as forward-pass time only on a workstation with an NVIDIA GeForce RTX 4080 GPU (excluding data loading and pre/postprocessing). EfficientFeedingNet achieves the highest mean test accuracy (96.53 ± 0.09%) while also having the fewest parameters (3.37 M) and the lowest measured latency (7.90 ms per image). Although the MobileViT family attains competitive accuracy with low parameter counts and FLOPs, it is slower in practical inference. ResNet-101 and PureViT have substantially larger parameter counts (86.75 M and 85.80 M) and FLOPs (32.87 G and 35.14 G), increasing deployment cost. Efficient Vision Mamba has the lowest FLOPs (0.47 G) but the slowest measured inference (75.01 ms per image), likely due to architectural factors. Overall, EfficientFeedingNet offers a favorable trade-off between accuracy and efficiency.
4.2.3. Edge-Device Benchmarking
To evaluate on-farm deployability, we benchmarked the inference efficiency of EfficientFeedingNet on a workstation GPU and an embedded edge device (batch size = 1; input = 224 × 224). We report latency and throughput for transfer and forward pass (input binding/copy plus network inference) on precomputed motion–spatial maps, together with peak memory usage and power when available. These numbers, including the Jetson latency (7.0 ms/img), exclude optical-flow computation, motion encoding, preprocessing, and data loading, and they may vary with the software stack and I/O binding. Note that Table 4 reports backbone-level forward-pass latency for model comparison, whereas Table 5 reports a separate deployment-oriented benchmark for EfficientFeedingNet; therefore, the absolute latency values are not directly comparable. All edge-device numbers reported in Table 5 were experimentally measured on the corresponding hardware (not projected); specifically, the Jetson Orin NX results were obtained by running an INT8 EfficientFeedingNet model with ONNX Runtime (CUDA execution provider; batch size = 1). The Jetson software stack was Ubuntu 22.04.5 LTS (aarch64) with JetPack 6.2.1 (L4T R36.4.7; Linux 5.15.148-tegra), CUDA 12.6 (nvcc 12.6.85), and cuDNN 9.3.0.75.
As shown in Table 5, EfficientFeedingNet achieves 7.0 ms/img latency (143.24 FPS) on Jetson Orin NX (INT8, batch size = 1), with a peak memory footprint of 647.3 MB, indicating that model inference is well within typical embedded constraints and supports real-time on-device feeding-state recognition.
Notably, latency/throughput were measured on precomputed and preprocessed motion–spatial maps and include input binding/copy and network inference only; optical-flow computation, motion encoding, data loading, and other pre/postprocessing are excluded. INT8 inference was performed with ONNX Runtime (CUDA execution provider), without TensorRT-based acceleration/engine compilation.
In this work, “real-time” refers to sustained streaming operation at the effective decision rate of our frame-pair pipeline: one motion–spatial sample (one Feeding/Non-feeding prediction) is generated from a frame pair separated by Δ = 10 frames at 10 fps (≈1 s), i.e., ~1 sample/s per camera. Accordingly, the end-to-end per-sample latency in a streaming implementation includes optical-flow computation and motion encoding, any required preprocessing/I/O transfer, and network inference; Table 5 reports the experimentally measured transfer + inference time on precomputed motion–spatial maps, while the remaining time budget within the ~1 s sampling interval provides headroom for online motion encoding and sustained streaming.
Practical implications for on-farm monitoring: each motion–spatial sample is constructed from a non-overlapping frame pair separated by Δ = 10 frames at 10 fps (≈1 s; Section 3.1.3), corresponding to ~1 inference per second per camera. On the Jetson Orin NX, EfficientFeedingNet requires 7.0 ms per sample (143.24 inferences/s), i.e., ~0.7% of the 1 s time budget, leaving ample headroom for motion encoding and I/O. The peak memory usage (647.3 MB; Table 5) is modest relative to device capacity, supporting practical edge deployment.
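The time-budget arithmetic above can be checked directly from the reported throughput and sampling parameters. The variable names are illustrative; the only inputs are the frame gap, the camera frame rate, and the reported 143.24 inferences/s.

```python
FRAME_GAP = 10          # frames between the two images of a pair
CAMERA_FPS = 10.0       # camera frame rate
REPORTED_FPS = 143.24   # measured throughput on Jetson Orin NX (Table 5)

sample_interval_s = FRAME_GAP / CAMERA_FPS               # one decision per interval
latency_ms = 1000.0 / REPORTED_FPS                       # about 6.98 ms, reported as 7.0
budget_fraction = (latency_ms / 1000.0) / sample_interval_s
headroom_ms = sample_interval_s * 1000.0 - latency_ms    # left for flow and I/O
```

The remaining interval is the budget available for optical-flow computation, motion encoding, and I/O, none of which are included in the Table 5 measurement.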
4.3. Ablation Experiment on Models
To analyze the contribution of each component (Table 6), we evaluated four ablation variants: (1) EfficientNet-B0 (base): base EfficientNet-B0 with SE and SiLU; (2) w/o HybridAttention: removing HybridAttention; (3) w/o Mish (SiLU): replacing Mish with SiLU; and (4) w/o Mish & HybridAttention: using SiLU and removing HybridAttention.
Relative to the base model (93.37 ± 0.57% accuracy; 4.01 M parameters; 0.82 GFLOPs; 7.2 ms per image), our design reduces the parameter count by 15.8% with nearly unchanged compute (+0.5%) while improving accuracy by 3.16 percentage points and reducing run-to-run variability. Compared with the no-attention variant (95.00 ± 0.80%; 3.37 M; 0.82 GFLOPs; 5.2 ms per image), HybridAttention provides an additional 1.53 percentage-point gain at the cost of +2.7 ms per image, with the standard deviation decreasing from 0.80 to 0.09. Replacing SiLU with Mish further improves accuracy by 1.08 percentage points (SiLU baseline: 95.45 ± 0.65%; 0.83 GFLOPs; 7.2 ms per image) and stabilizes training. Overall, HybridAttention and Mish improve accuracy and stability with minimal overhead.
Following Section 3.5, we applied Welch’s t-test (two-sided) to test accuracy across three runs. EfficientFeedingNet significantly outperformed the EfficientNet-B0 base model (Δ = 3.16 percentage points, p = 0.009) and the variant without both Mish and HybridAttention (Δ = 3.70 percentage points, p = 0.035). Compared with the single-module removal variants (w/o HybridAttention and w/o Mish), EfficientFeedingNet achieved numerically higher accuracy, but the differences were not statistically significant (all p > 0.05).
4.4. Interpretability of the Evaluation Results
4.4.1. Evaluation Based on Confusion Matrix
To more accurately evaluate the classification performance of different models in distinguishing between Feeding and Non-feeding fish behaviors, we further analyzed the classification outcomes using normalized confusion matrices, shown in Figure 10 for the Intuitive Dataset and Figure 11 for the Perceptual Dataset.
Overall, compared with the Intuitive Dataset, all models exhibited a notable improvement in classification accuracy on the Perceptual Dataset for both Feeding and Non-feeding states. In particular, the accuracy for classifying Feeding states exceeded 96% across all models, and the classification accuracy for Non-feeding states also improved to varying degrees. Notably, EfficientFeedingNet achieves approximately 97% accuracy for the Feeding state and 96% for the Non-feeding state, the most balanced per-class performance among all models.
4.4.2. Evaluation Based on Grad-CAM
To interpret model decisions, we used Grad-CAM to generate class-discriminative attention heatmaps. In Figure 12, rows correspond to model types and columns correspond to dataset conditions (Intuitive vs. Perceptual) and prediction outcomes (correct vs. incorrect). From left to right, the columns show correctly classified Intuitive samples, misclassified Intuitive samples, correctly classified Perceptual samples, and misclassified Perceptual samples. For each example, the first label line indicates whether the prediction is correct (“true”) or incorrect (“false”), and the second line indicates the ground-truth class. All examples were selected from the test set. In the heatmaps, warmer/brighter colors indicate stronger contribution to the predicted class.
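For readers unfamiliar with Grad-CAM, its core computation is small: each feature-map channel is weighted by the spatial mean of the class-score gradient for that channel, the weighted maps are summed, and a ReLU keeps only positive evidence. The sketch below applies this to tiny hand-made arrays; in practice the activations and gradients come from a chosen convolutional layer of the trained network, and the max-normalization used here for display is an assumption.

```python
def grad_cam(activations, gradients):
    """Grad-CAM heatmap from [K][H][W] activations and class-score gradients."""
    height, width = len(activations[0]), len(activations[0][0])
    cam = [[0.0] * width for _ in range(height)]
    for act, grad in zip(activations, gradients):
        # Channel weight: global average of the gradient over spatial positions.
        weight = sum(map(sum, grad)) / (height * width)
        for i in range(height):
            for j in range(width):
                cam[i][j] += weight * act[i][j]
    cam = [[max(v, 0.0) for v in row] for row in cam]  # ReLU
    peak = max(map(max, cam))
    return [[v / peak for v in row] for row in cam] if peak > 0 else cam


# Two 2 x 2 channels: the first supports the class, the second opposes it.
activations = [[[1.0, 0.0], [0.0, 2.0]],
               [[0.0, 3.0], [1.0, 0.0]]]
gradients = [[[1.0, 1.0], [1.0, 1.0]],
             [[-1.0, -1.0], [-1.0, -1.0]]]
heatmap = grad_cam(activations, gradients)
```

Locations dominated by the opposing channel are clipped to zero, which is why the heatmaps highlight only regions that increase the predicted class score.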
As shown by the Grad-CAM heatmaps on the Intuitive Dataset, most models attend to broad regions. Except for (c-1) and (e-1), correct classifications—such as (a-1), (b-1), (d-1), and (f-1)—and misclassifications—(b-2), (e-2), and (f-2)—often include irrelevant background alongside the fish shoal. Some misclassified examples, for instance (a-2), reveal insufficient attention coverage, while (c-2) and (d-2) exhibit spatial bias in the attended regions. In contrast, on the Perceptual Dataset the correctly classified cases focus precisely on the motion-activity regions of the fish shoal, with minimal attention to background areas. However, misclassifications such as (c-4), (d-4), (e-4), and (f-4) still show biased attention, and cases like (a-4) and (b-4) demonstrate correct attention yet incorrect classification. Overall, the models’ ability to localize fused motion–spatial features on the Perceptual Dataset explains the marked improvement in evaluation performance: by accurately capturing the fish activity regions, the models make more reliable feeding-state predictions.
5. Discussion
5.1. Efficiency of the Quantification of Feeding State Method
We propose a frame-pair motion encoding method for classifying fish feeding states using dense optical flow. Using the Farnebäck algorithm [31], we capture the school’s motion direction, intensity, and spatial aggregation as a compact motion–spatial map and feed this representation to standard 2D backbones. Across model-comparison experiments, all evaluated architectures achieve substantial performance gains on the Perceptual Dataset relative to the Intuitive Dataset, suggesting that the proposed representation and labeling protocol capture feeding-related behavioral cues in our experimental setting and integrate naturally with common deep-learning pipelines.
Compared with optical-flow-based pipelines that couple flow with clip-level temporal deep models (e.g., the two-stage design that combines an optical-flow network and a 3D network in [14]), our choice of “Farnebäck optical flow and 2D classification” is an explicit engineering trade-off for real-time deployment. Clip-based 3D/temporal models can exploit longer-horizon dynamics, but they typically require buffering multiple frames and introduce additional computation and memory overhead. In contrast, we compute classical dense Farnebäck flow between a frame pair and perform frame-pair motion encoding into a single 2D motion–spatial map, which can be processed by a single-stream lightweight 2D backbone. This design reduces the recognition module’s memory footprint and supports streaming inference (frame pairs can be processed online), while retaining the key cues needed for feeding-state recognition in our setting—namely activity intensity changes and the spatial aggregation of the shoal. Compared with optical-flow-derived handcrafted indicators that reduce the motion field to a few summary statistics [6,10,22], our frame-pair encoding retains the dense spatial layout of motion patterns, improving representational capacity while maintaining efficient 2D inference.
Furthermore, the proposed method is grounded in established behavioral observations. For swimming fish species, the Intuitive Dataset provides relatively distinct appearance cues related to the spatial positioning of the school; however, motion during feeding provides critical information that should not be overlooked. Research has demonstrated that fish exhibit heightened activity when hungry, with swimming speeds peaking at the onset of feeding [10]. After feeding, satiated fish experience a notable reduction in swimming speed, which becomes markedly lower than that of unsatiated fish. This underscores the strong correlation between swimming behavior and feeding state [11].
Recent work has assessed feeding state and related quantities from complementary perspectives, including tracking/graph-based appetite assessment [47], feed-pellet detection for intake quantification on CPU devices [48], lightweight instance segmentation for feeding-intensity estimation [49], and clustering-assisted benchmark construction to reduce label subjectivity [50]. These studies reflect a broader trend toward measurable and reproducible feeding indicators. In this context, our Perceptual labeling protocol derives Feeding/Non-feeding intervals from an optical-flow-based motion-intensity time series, and our frame-pair motion–spatial encoding enables efficient exploitation of motion cues with lightweight 2D classifiers under edge constraints.
Compared with clip-based spatiotemporal architectures [17] or flow+3D/multi-stage pipelines [14], our design avoids frame buffering and reduces computation/memory, trading long-horizon temporal modeling for lower latency and streaming inference. Unlike splash-based assessments that rely on above-water surface events [15], our method targets underwater school-motion dynamics and remains applicable when splashes are weak or not observable. Compared with tracking-based approaches [47], it avoids explicit individual tracking and is therefore simpler in crowded scenes but provides school-level (rather than individual-level) indicators. Video-only motion encoding also complements multi-modal edge-oriented pipelines (e.g., video + water-quality distillation) [43,44]. Because study settings and protocols vary widely, direct numerical comparisons across papers are not meaningful; thus, we focus on controlled in-paper benchmarks (Section 4) and qualitative trade-off analysis when positioning our approach [42,50].
In industrial recirculating systems, automatic feeders typically provide controllable parameters such as feed rate, pulse duration, and inter-pulse interval; therefore, feeding-state recognition is most useful when used as a real-time feedback signal to close the loop. Our edge-oriented pipeline outputs a feeding/non-feeding probability for each frame pair, which can be smoothed over a short window and combined with hysteresis thresholds to robustly detect feeding onset and cessation. These time points naturally define feeding duration, while ration size can be implemented through discrete feed pulses and adjusted online (e.g., continue/reduce/stop) as the predicted feeding probability decays. This control logic is consistent with on-demand strategies that translate recognized feeding states/behaviors into feeding-amount decisions [13] and with dynamic adjustment of feeding intervals and feeding endpoints [16]. Moreover, event-level summaries (e.g., time-to-cessation, integrated feeding probability, or response profiles derived from the motion-intensity signal) can inform subsequent feeding frequency/interval scheduling across meals/days, aligning with existing camera-based feeding decision systems [4] and recent intelligent feeding decision-making frameworks [51]. Although closed-loop farm-scale trials are beyond the scope of this study, these considerations highlight a practical pathway for integrating video-only feeding-state monitoring into existing feeder controllers. In practice, the proposed vision module can be integrated as a non-intrusive feedback layer (from camera to edge device to feeder controller), where the edge device streams feeding/non-feeding probabilities and/or simple control signals to existing feeder/PLC units via standard interfaces (e.g., Ethernet/serial or digital I/O), without changing the feeder hardware. 
A minimal closed-loop policy can map the smoothed feeding probability to actions such as continue/reduce/stop: for example, maintain pulsed feeding while the probability remains above an upper threshold for several consecutive samples, reduce pulse size or increase inter-pulse intervals in an intermediate zone, and stop when it falls below a lower threshold for several samples. For operational safety and robustness, the controller can enforce hard limits (maximum feeding duration/ration per event) and fall back to a preset schedule or manual confirmation when visual confidence is low (e.g., severe occlusion/lighting disturbance), while logging predictions and actions for post hoc review and parameter tuning.
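The continue/reduce/stop policy outlined above can be sketched as a small state machine. Everything specific in this sketch is hypothetical: the class name, the thresholds (0.7 and 0.3), the smoothing window, the number of consecutive samples required, and the hard limit on event length are placeholder values that would need tuning on site, and no such controller was evaluated in this study.

```python
from collections import deque


class FeedingController:
    """Hysteresis policy mapping feeding probabilities to feeder actions."""

    def __init__(self, upper=0.7, lower=0.3, patience=2, window=3, max_samples=600):
        self.upper, self.lower = upper, lower
        self.patience, self.max_samples = patience, max_samples
        self.history = deque(maxlen=window)
        self.above = self.below = self.seen = 0
        self.action = "continue"  # assume the feeder has just started a meal

    def update(self, p_feeding: float) -> str:
        if self.action == "stop":
            return "stop"  # latched until the next feeding event
        self.seen += 1
        self.history.append(p_feeding)
        smoothed = sum(self.history) / len(self.history)
        self.above = self.above + 1 if smoothed >= self.upper else 0
        self.below = self.below + 1 if smoothed <= self.lower else 0
        if self.seen >= self.max_samples or self.below >= self.patience:
            self.action = "stop"      # cessation detected or hard limit reached
        elif self.above >= self.patience:
            self.action = "continue"
        elif self.lower < smoothed < self.upper:
            self.action = "reduce"    # intermediate zone
        return self.action


# Hypothetical probability stream: strong response, decay, cessation.
controller = FeedingController()
stream = [0.9, 0.9, 0.9, 0.5, 0.5, 0.5, 0.1, 0.1, 0.1, 0.1]
actions = [controller.update(p) for p in stream]
```

A single sample outside a threshold does not change the action; the smoothed probability must stay beyond the threshold for `patience` consecutive samples, which is what makes the policy robust to transient bursts.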
Our approach is well suited to controlled recirculating/industrial aquaculture systems where stable imaging conditions can be maintained [6,28]. Nevertheless, broader deployment (e.g., outdoor ponds or offshore cages) may face additional challenges due to variable turbidity/illumination, reflections/bubbles, and viewpoint constraints [3,17,19,44]. These limitations and future directions are summarized in Section 5.4. Building on the above motion–spatial quantification and the Perceptual labeling protocol, we next discuss the design of our lightweight classifier under edge constraints.
5.2. Efficiency of EfficientFeedingNet
EfficientFeedingNet achieves the highest mean test accuracy among evaluated backbones on the Perceptual Dataset while remaining lightweight for edge deployment. We attribute its performance to three design choices. First, we use an EfficientNet-style MBConv backbone that provides a strong accuracy–efficiency prior [33]. Second, the proposed HybridAttention module replaces SE with low-cost channel interaction and lightweight spatial gating to better emphasize feeding-relevant regions [34,35,36]. Third, Mish improves sensitivity to weak, boundary-blurred motion cues in motion–spatial maps [37]. Together, these modifications improve accuracy with minimal additional computation.
In our ablation studies, the gains stem not from increased computation but from lightweight inductive biases better aligned with the data. HybridAttention replaces SE’s two-layer MLP with ECA’s convolutional channel interaction [35], avoiding the information bottleneck caused by channel reduction, and adds a 3 × 3 spatial gate on a two-channel (avg/max) map to introduce location-aware selectivity [36]. This design reduces parameters while more effectively amplifying local, low-amplitude, boundary-ambiguous feeding cues with virtually unchanged FLOPs. Compared with no attention, HybridAttention not only specifies what to emphasize (channel reweighting) but also where to emphasize (spatial gating), yielding greater robustness on boundary cases; the small latency increase is a reasonable trade-off for higher accuracy and lower variance. Moreover, Mish, being smoother near zero and preserving small negative responses, improves gradient flow and weak-signal separability and complements the mean/max statistics of HybridAttention, enabling higher and more stable accuracy under nearly identical computational cost and a lightweight model size.
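The parameter argument can be made concrete with a per-block count. The sketch compares an SE module (two 1 × 1 convolutions with biases) against an ECA convolution plus a 3 × 3 spatial gate on a two-channel map. The example sizes (480 expanded channels, 20 squeeze channels), the ECA kernel size of 5, and the bias-free gate are illustrative assumptions, not values reported in the paper.

```python
def se_params(channels: int, squeeze_channels: int) -> int:
    """SE: two 1x1 convolutions with biases, C -> C_sq -> C."""
    reduce = channels * squeeze_channels + squeeze_channels
    expand = squeeze_channels * channels + channels
    return reduce + expand


def eca_params(kernel_size: int) -> int:
    """ECA: one bias-free 1D convolution over the pooled channel descriptor."""
    return kernel_size


def spatial_gate_params(kernel_size: int = 3, in_maps: int = 2,
                        bias: bool = False) -> int:
    """Spatial gate: one k x k convolution from the mean/max maps to one map."""
    return kernel_size * kernel_size * in_maps + (1 if bias else 0)


# Illustrative MBConv block: 480 expanded channels, 20 SE squeeze channels.
se_count = se_params(channels=480, squeeze_channels=20)
hybrid_count = eca_params(kernel_size=5) + spatial_gate_params()
```

The SE count grows with the product of the channel widths, whereas the ECA and spatial-gate counts are independent of the stage width, which is the source of the parameter saving discussed above.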
At the same time, the model-comparison results show that, for this task, recognition accuracy does not necessarily increase with model size and computational complexity. Among the models evaluated, ResNet-101 [38] is the deepest backbone. Its multi-stage residual (skip) connections mitigate vanishing-gradient issues and stabilize optimization across 101 layers, enabling the extraction of rich hierarchical representations that capture fine-grained cues relevant to the Feeding and Non-feeding decision. Consistent with these architectural advantages, ResNet-101 attains nearly 95% accuracy in our datasets. However, these gains come with non-trivial deployment costs: ResNet-101 has 86.75 M parameters and 32.87 GFLOPs (Table 4), resulting in higher inference latency and memory requirements than lighter backbones.
PureViT has the highest computational complexity (35.14 GFLOPs) and one of the largest parameter counts (85.80 M) among the evaluated models, but it does not achieve the best test performance in our setting. This may reflect the higher data and pretraining requirements of transformer-based models, which can be less data-efficient than CNN backbones on limited datasets [39,52]. We also observe that PureViT is more prone to confusing subtle non-feeding behaviors with feeding, suggesting that task-specific inductive biases or targeted attention may be needed. Future work will evaluate cross-farm generalization (e.g., changes in lighting, turbidity, and camera viewpoints) and explore lightweight architectural refinements better suited to fish-behavior recognition.
The MobileViT series [40] combines convolutional layers with lightweight vision transformers, enabling it to integrate local spatial features and global contextual information. This is advantageous for our task because feeding-related motion and aggregation cues can be spatially distributed across the motion–spatial map. By capturing longer-range spatial dependencies, MobileViT can better represent collective movement patterns of the shoal. However, transformer-style global attention can increase practical inference latency compared with purely convolutional backbones, despite the relatively small parameter count.
Recent work also emphasizes practical edge deployment in aquaculture feeding monitoring. For example, Zhang et al. [43] proposed an improved MobileViTv3-based architecture with a lightweight footprint for feeding-behavior recognition from video streams, and Zhang et al. [44] developed a multi-modal knowledge distillation framework that fuses video and water-quality signals to obtain a compact student model for embedded deployment under turbid and crowded conditions. These directions are complementary to our design: our approach remains video-only and focuses on explicit short-term motion encoding and systematic edge benchmarking, whereas multi-modal/distillation pipelines may provide additional robustness when auxiliary sensors are available.
Efficient Vision Mamba [41] shows the lowest accuracy on the Intuitive (RGB) Dataset but exhibits the largest improvement when evaluated on the Perceptual Dataset (18.46 percentage points). This suggests that its mixer and multi-stage fusion mechanisms are particularly effective at exploiting dense motion patterns in optical-flow-based inputs, whereas appearance-only cues in raw frames are less informative for this architecture. Nevertheless, a small number of failure cases remain, especially postfeeding lingering and non-feeding high-activity interactions that resemble feeding; these patterns are further analyzed in Section 5.3.
5.3. Analysis of Misclassification
Confusion matrices show that all evaluated models make more errors on the Intuitive Dataset than on the Perceptual Dataset. The remaining errors are dominated by false positives (Non-feeding to Feeding), indicating that distinguishing non-feeding high-activity behaviors and postfeeding transition periods remains the main challenge under the current binary formulation.
This false-positive bias holds on both datasets. Grad-CAM visualizations suggest that the model often attends to high-activity regions of the shoal, and a qualitative review of misclassified test samples and their corresponding video segments reveals four recurring challenging patterns:
(E1) Non-feeding high-activity social interactions (e.g., chasing/fighting/play) can produce strong, localized motion fields and transient aggregation that resemble feeding responses; similar confusions have been reported in aquaculture behavior-recognition studies [53]. Sudden jumping events, which have also been treated as a distinct non-feeding behavior in related work [8], can further increase false positives under a binary scheme.
(E2) Postfeeding lingering near the feeding area can preserve aggregation patterns even after the feeding response subsides, yielding borderline cases around the Feeding/Non-feeding transition [10,21].
(E3) Weak or dispersed feeding responses (e.g., low appetite or low density) may exhibit only subtle motion/aggregation changes and can be misclassified as Non-feeding.
(E4) Visual confounds (e.g., bubbles, reflections, or illumination fluctuations) may introduce optical-flow artifacts that obscure behavior-related motion cues.
These error modes suggest several directions to improve robustness. First, the binary labels can be extended toward a multi-class setting (e.g., adding a transition/confound category) [6,12,42]. Second, lightweight temporal aggregation (e.g., sliding-window smoothing/majority voting over consecutive frame pairs) can suppress transient bursts. Third, complementary cues such as multi-modal signals [44] or tracking/graph-based interaction descriptors can be leveraged when feasible [47].
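As an illustration of the temporal-aggregation idea, the following is a minimal sketch of causal sliding-window majority voting over per-frame-pair predictions. It is not part of the evaluated pipeline: the function name `majority_vote_smooth`, the window length of 5, and the tie-handling rule are assumptions chosen for illustration.

```python
from collections import deque


def majority_vote_smooth(predictions, window=5):
    """Smooth a stream of binary labels (1 = Feeding, 0 = Non-feeding).

    Each output is the majority label among the current prediction and up to
    `window - 1` preceding ones, so the filter is causal and can run online.
    On a tie, the previous smoothed label is kept (initially Non-feeding).
    """
    if window < 1:
        raise ValueError("window must be >= 1")
    recent = deque(maxlen=window)  # oldest prediction is dropped automatically
    smoothed = []
    previous = 0
    for label in predictions:
        recent.append(label)
        feeding_votes = sum(recent)
        non_feeding_votes = len(recent) - feeding_votes
        if feeding_votes > non_feeding_votes:
            previous = 1
        elif feeding_votes < non_feeding_votes:
            previous = 0
        # Tie: leave `previous` unchanged.
        smoothed.append(previous)
    return smoothed


# Hypothetical prediction stream: an isolated burst at index 2 and a
# one-sample dropout at index 9 are both suppressed.
raw = [0, 0, 1, 0, 0, 1, 1, 1, 1, 0, 1, 1]
print(majority_vote_smooth(raw, window=5))
# [0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1]
```

The trade-off is latency: a longer window suppresses more transient bursts but delays the detected feeding onset and cessation by roughly half the window length.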
We also clarify the scope of the “quantification” in this work. The V-Value and the motion–spatial map provide a school-level description of motion intensity and spatial aggregation patterns for feeding-state recognition. They are not intended to quantify fine-grained ethological measurements such as social communication, inter-individual distance, or group-cohesion indices, which typically require reliable individual identification and multi-object tracking. Incorporating tracking-derived cohesion/interaction metrics (e.g., nearest-neighbor distance, dispersion/polarization, or graph-based interaction descriptors) is a promising direction for future work and may further help disambiguate non-feeding high-activity social interactions from true feeding responses [9,47].
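To make the tracking-derived metrics mentioned above concrete, the sketch below computes two of them, mean nearest-neighbor distance and polarization, from per-fish positions and headings. It assumes that a multi-object tracker (not provided in this work) already supplies these values; the function names and input layout are hypothetical.

```python
import math


def mean_nearest_neighbor_distance(positions):
    """Mean distance from each individual to its closest neighbor.

    `positions` is a list of (x, y) tuples in any consistent unit.
    """
    if len(positions) < 2:
        raise ValueError("need at least two individuals")
    total = 0.0
    for i, (xi, yi) in enumerate(positions):
        total += min(
            math.hypot(xi - xj, yi - yj)
            for j, (xj, yj) in enumerate(positions)
            if j != i
        )
    return total / len(positions)


def polarization(headings_rad):
    """Length of the mean unit heading vector.

    Returns 1.0 when all individuals swim in the same direction and values
    near 0.0 when headings cancel out.
    """
    if not headings_rad:
        raise ValueError("need at least one heading")
    n = len(headings_rad)
    mean_cos = sum(math.cos(h) for h in headings_rad) / n
    mean_sin = sum(math.sin(h) for h in headings_rad) / n
    return math.hypot(mean_cos, mean_sin)


# Hypothetical tracker output for three fish.
print(mean_nearest_neighbor_distance([(0, 0), (3, 4), (3, 0)]))
print(polarization([0.0, 0.1, -0.1]))
```

Descriptors of this kind could complement the school-level V-Value, for example by separating tightly aggregated feeding from chasing events that involve only a few individuals.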
5.4. Limitations and Future Work
Despite the encouraging results, several limitations should be noted. First, videos were collected in a controlled recirculating aquaculture setting with a fixed viewpoint and illumination. Generalization across farms, species, and challenging conditions (e.g., turbidity/lighting changes, reflections/bubbles, varying stocking densities, and camera angles) requires further validation [3,17,19,28,44]. Second, Δ and the optical-flow hyperparameters were fixed empirically; systematic sensitivity analysis and tuning will be explored in future work. Third, our current formulation targets school-level binary recognition and single frame pairs (Δ ≈ 1 s). It therefore does not explicitly model multi-second dynamics or individual-level behavioral metrics (e.g., cohesion/interaction), which may benefit from lightweight temporal aggregation and tracking-based measurements [9,47]. Fourth, because Perceptual labels are derived from the event-level V-Value time series (after smoothing and k-means clustering [32]), we will report a V-Value-only baseline and perform cross-condition validation to quantify potential feature–label coupling. Finally, downstream management outcomes (e.g., feeding efficiency, uneaten feed, water quality, and welfare indicators) were not directly measured; future farm-scale closed-loop trials are needed to evaluate practical impacts [1,22,27,28].
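The feature–label coupling noted in the fourth limitation can be illustrated with a simplified sketch of labeling from a V-Value series alone. The code below smooths a series with a moving average and splits it with a two-cluster, one-dimensional k-means. The window size, the centroid initialization, the midpoint threshold, and all function names are assumptions for illustration and are not the exact procedure used to build the Perceptual Dataset.

```python
def moving_average(values, window=3):
    """Centered moving average; the window shrinks at the series boundaries."""
    half = window // 2
    out = []
    for i in range(len(values)):
        lo = max(0, i - half)
        hi = min(len(values), i + half + 1)
        out.append(sum(values[lo:hi]) / (hi - lo))
    return out


def two_means_1d(values, iterations=100):
    """Two-cluster k-means on scalars; returns (low_centroid, high_centroid)."""
    low, high = min(values), max(values)  # assumed initialization
    if low == high:
        raise ValueError("values are constant; cannot form two clusters")
    for _ in range(iterations):
        low_members = [v for v in values if abs(v - low) <= abs(v - high)]
        high_members = [v for v in values if abs(v - low) > abs(v - high)]
        new_low = sum(low_members) / len(low_members)
        new_high = sum(high_members) / len(high_members)
        if new_low == low and new_high == high:
            break  # assignments are stable
        low, high = new_low, new_high
    return low, high


def label_feeding(v_values, window=3):
    """Label each time step 1 (Feeding) or 0 (Non-feeding) from V-Values only."""
    smoothed = moving_average(v_values, window)
    low, high = two_means_1d(smoothed)
    threshold = (low + high) / 2  # midpoint between the two centroids
    return [1 if v > threshold else 0 for v in smoothed]


# Hypothetical V-Value series with one high-activity episode.
series = [1.0, 1.2, 0.9, 5.0, 6.0, 5.5, 1.1, 1.0]
print(label_feeding(series))
# [0, 0, 0, 1, 1, 1, 0, 0]
```

Because a threshold on this single scalar already reproduces the labels in the toy series, a V-Value-only baseline is a natural reference point: the accuracy gap between it and the image-based models indicates how much the motion–spatial map contributes beyond overall motion intensity.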
6. Conclusions
In this work, we introduced a frame-pair motion encoding framework and constructed a Perceptual Dataset that together shift fish feeding-state analysis from subjective, spatial-only classification toward objective, data-driven quantification. Compared with the Intuitive Dataset, all evaluated models achieved higher test accuracy on the Perceptual Dataset, with an improvement of 13.13–18.46 percentage points. These results indicate that perception-based labeling and motion–spatial representations can reduce subjectivity and better capture feeding-related dynamics in our experimental setting.
Moreover, we developed EfficientFeedingNet, a lightweight neural network tailored to motion–spatial map inputs. The proposed model achieves a favorable accuracy–efficiency trade-off and shows potential for real-time feeding-state recognition on edge devices in fish-farm scenarios.
From a practical aquaculture perspective, the proposed framework can be deployed to provide real-time, camera-based feedback for feeding management in captive systems, supporting decisions such as detecting feeding onset/cessation and adjusting or stopping feed delivery when the feeding response subsides. We note that downstream outcomes (e.g., feeding efficiency, uneaten feed, water-quality changes, and welfare indicators) were not directly measured in this study and should be validated in future farm-scale closed-loop trials.
References
- [1] de Verdal H., Komen H., Quillet E., Chatain B., Allal F., Benzie J.A.H., Vandeputte M. Improving Feed Efficiency in Fish Using Selective Breeding: A Review. Rev. Aquac. 2018, 10, 833–851. doi:10.1111/raq.12202
- [2] Yang X., Zhang S., Liu J., Gao Q., Dong S., Zhou C. Deep learning for smart fish farming: Applications, opportunities and challenges. Rev. Aquac. 2021, 13, 66–90. doi:10.1111/raq.12464
- [3] Dong S., Dong Y., Huang L., Zhou Y., Cao L., Tian X., Han L., Li D. Advancements and hurdles of deeper-offshore aquaculture in China. Rev. Aquac. 2023, 16, 644–655. doi:10.1111/raq.12858
- [4] Zhou C., Lin K., Xu D., Chen L., Guo Q., Sun C., Yang X. Near infrared computer vision and neuro-fuzzy model-based feeding decision system for fish in aquaculture. Comput. Electron. Agric. 2018, 146, 114–124. doi:10.1016/j.compag.2018.02.006
- [5] Wei D., Zhang F., Ye Z., Zhu S., Ji D., Zhao J., Zhou F., Ding X. Effects of intelligent feeding method on the growth, immunity and stress of juvenile Micropterus salmoides. Artif. Intell. Agric. 2021, 5, 118–124. doi:10.1016/j.aiia.2021.04.001
- [6] Wei D., Bao E., Wen Y., Zhu S., Ye Z., Zhao J. Behavioral Spatiotemporal Characteristics-Based Appetite Assessment for Fish School in Recirculating Aquaculture Systems. Aquaculture 2021, 545, 737215. doi:10.1016/j.aquaculture.2021.737215
- [7] Xiao Y., Huang L., Zhang S., Bi C., You X., He S., Guan J. Feeding state quantification and recognition for intelligent fish farming application: A review. Appl. Anim. Behav. Sci. 2025, 285, 106588. doi:10.1016/j.applanim.2025.106588
- [8] Yang Y., Yu H., Zhang X., Zhang P., Tu W., Gu L. Fish behavior recognition based on an audio-visual multimodal interactive fusion network. Aquac. Eng. 2024, 107, 102471. doi:10.1016/j.aquaeng.2024.102471
