Research on Intelligent Wood Species Identification Method Based on Multimodal Texture-Dominated Features and Deep Learning Fusion

Yuxiang Huang; Tianqi Zhu; Zhihong Liang; Hongxu Li; Mingming Qin; Ruicheng Niu; Yuanyuan Ma; Qi Feng; Mingbo Chen

PMC · DOI:10.3390/plants15010108·December 30, 2025

Research on Intelligent Wood Species Identification Method Based on Multimodal Texture-Dominated Features and Deep Learning Fusion

Yuxiang Huang, Tianqi Zhu, Zhihong Liang, Hongxu Li, Mingming Qin, Ruicheng Niu, Yuanyuan Ma, Qi Feng, Mingbo Chen

PDF

Open Access

TL;DR

This paper introduces a deep learning method that uses texture and spectral features to accurately identify wood species, improving on traditional manual methods.

Contribution

A novel fusion method combining multimodal texture-dominated features and deep learning for wood species identification is proposed.

Findings

01

The proposed joint model achieved an overall classification accuracy of 90.27%.

02

The model outperformed single-modal approaches by approximately 8% on average.

03

The method uses hyperspectral imaging and deep learning fusion for robust wood identification.

Abstract

Aimed at the problems of traditional wood species identification relying on manual experience, slow identification speed, and insufficient robustness, this study takes hyperspectral images of cross-sections of 10 typical wood species commonly found in Puer, Yunnan, China, as the research object. It comprehensively applies various spectral and texture feature extraction technologies and proposes an intelligent wood species identification method based on the fusion of multimodal texture-dominated features and deep learning. Firstly, an SOC710-VP hyperspectral imager is used to collect hyperspectral data under standard laboratory lighting conditions, and a hyperspectral database of wood cross-sections is constructed through reflectance calibration. Secondly, in the spectral space construction stage, a comprehensive similarity matrix is built based on four types of spectral similarity…

Linked entities

Genes, proteins, chemicals, diseases, species, mutations and cell lines named across the full text — each resolved to its canonical identifier and authoritative record.

Species11

Homo sapiens(human · species)Quercus aliena(species)Quercus acutissima(sawtooth oak · species)Betula pendula(European white birch · species)Eucalyptus rudis(flooded gum · species)Cinnamomum parthenoxylon(species)Pterocarpus santalinus(species)Senna siamea(species)Betula platyphylla(Asian white birch · species)Phoebe rufescens(species)Fraxinus malacophylla(species)

Chemicals4

oxygen CLAHE hydrogen CO2

Diseases1

injury to

Figures24

Click any figure to enlarge with its caption.

Funding3

—Key R&D Program Project of Yunnan Province, China
—Yunnan Fundamental Research Projects
—Research on Intelligent Perception and Mending Methods for Rotary-Cut Veneer in Plywood

Keywords

wood cross-sectionhyperspectral imageST-formermultimodal fusioncomplementary collaborative learning

Peer Reviews

No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.

Videos

No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.

Taxonomy

TopicsWood and Agarwood Research · Advanced Neural Network Applications · Remote Sensing and LiDAR Applications

Full text

1. Introduction

Wood has always been one of the essential renewable materials in people’s lives [1]. Due to its natural characteristics such as beautiful texture and color, good elasticity, light weight, and easy processing, wood plays a pivotal role in fields such as structural materials, decoration, and energy fuels and is one of the main substrates in people’s production and life [2,3,4]. However, driven by interests, counterfeiting and shoddy phenomena often occur in the wood trading process. This behavior of passing off inferior products as high-quality ones and passing off fake products as genuine ones has brought very serious negative impacts on the wood trading market [5,6,7]. At the same time, wood illegally logged accounts for as high as 10% of the total wood cut worldwide, causing great damage to the environment [8]. To protect the legitimate rights and interests of consumers, the use of convenient and accurate intelligent methods for wood species identification has become one of the research directions of many scholars [9].

Wood species identification methods are mainly divided into macro and micro categories. Macro identification judges the species through the color, texture, and smell of wood. Although this method is simple, it is prone to misjudgment because the features may not be significant or may be easy to counterfeit. Micro identification identifies tree species by means of the microscopic structure of wood (such as pore characteristics, wood ray characteristics, etc.). According to the different magnification, it can be further divided into high-magnification identification and low-magnification identification [10,11,12,13]. In recent years, innovative technologies such as near-infrared spectroscopy (NIRS), isotope and mineral element methods, and DNA molecular markers have provided important approaches for wood species identification [14].

As an analytical technology based on molecular vibration and rotational energy level transitions, NIRS achieves qualitative and quantitative analysis of substance components by detecting the absorption spectrum of molecules in the near-infrared region [15,16]. Richardson et al. [17] collected near-infrared spectral data of wood powder from different regions and converted the data with a mathematical relationship calibration model, which realized the spectral conversion between intact wood and wood powder. Zhuang et al. [18] conducted a feasibility study on the origin traceability of Pterocarpus santalinus based on peak and valley feature extraction. The main component content of the same wood from different origins is different. The vibrational absorption of hydrogen-containing groups in the near-infrared spectrum can reflect this difference information. The experimental results show the high sensitivity and accuracy of this method. Raobelina et al. [19] extracted key spectral feature variables of wood samples and used principal component analysis to reduce the dimensionality of the original high-dimensional wavenumber variables, which significantly reduced the number of wavenumber variables involved in modeling. Experimental results show that compared with traditional discriminant models, this method greatly reduces the model complexity while maintaining high classification accuracy, providing a reliable analytical method for wood species identification. Peng et al. [20] used three methods to preprocess the spectrum to eliminate interference information and then used a variety of machine learning algorithms to achieve good stability of the neural network model and high accuracy of wood species identification.

As a natural organic material, wood experiences isotope fractionation during its growth process under the combined effects of climatic conditions, ecological environment, and metabolic activities in organisms [21]. This fractionation phenomenon is manifested in significant differences in the abundance of isotopes from different pathways in wood [22]. Chen et al. [23] proposed that C3 and C4 plants have different fractionations when fixing CO_2_ through photosynthesis and are affected by environmental factors (such as drought and light intensity), thereby distinguishing wood from arid and humid areas (stomatal closure of plants under drought conditions leads to enrichment). Landry et al. [24] showed that oxygen isotopes (δ^18^O) reflect the isotopic composition of water sources (precipitation, groundwater), which is affected by temperature, altitude, and latitude, distinguishing wood from different river basins or altitudes. He et al. [25] revealed that plant water comes from precipitation, which is related to latitude and the continental effect. Tonouewaj et al. [26] confirmed, through the study of nearly 1000 wood samples collected from different regions such as Cameroon, Congo, Gabon, and Indonesia, that the elemental fingerprint method is highly feasible and accurate in distinguishing the origin of wood at national and even more refined regional scales. Kanbayashi et al. [27] adopted inductively coupled plasma mass spectrometry (ICP-MS) technology to carry out high-sensitivity detection and analysis of trace elements. Stukonyte et al. [28] proposed a rapid non-destructive testing method for wood based on X-ray fluorescence spectroscopy (XRF). Hao et al. [29] proposed an analytical method based on laser ablation inductively coupled plasma mass spectrometry (LA-ICP-MS), which can realize high-precision and high-resolution detection of the distribution characteristics of mineral elements in tree growth rings.

The DNA of different wood species shows significant similarity on the whole. It is these seemingly small differences that provide an effective way to distinguish woods that are almost indistinguishable in structure and appearance. Yan et al. [30] proposed a more refined wood classification method, aiming to break through the limitation of the traditional classification system that only stays at the “genus” level. Tran et al. [31] focused on selecting three DNA barcode sequences with important identification value, namely, rpoC1, trnH-psbA, and ITS, as molecular markers, established a complete molecular biological detection system, and conducted scientific and accurate traceability identification of the origin information of wood samples. Mo et al. [32] designed and constructed a wood species traceability system based on high-resolution barcode technology, highlighting the application value of DNA barcode technology in the field of wood identification. Ajdary et al. [33] constructed a wood DNA fingerprint database by analyzing the unique DNA sequence polymorphism, single nucleotide polymorphisms (SNPs), and microsatellite markers of wood samples, combined with bioinformatics analysis and machine learning algorithms, so as to realize the accurate identification of different woods.

The advantages and disadvantages of the commonly used methods for identifying wood species are shown in Table 1.

In view of the above problems, this study proposes an intelligent wood species identification method based on the fusion of multimodal texture-dominated features and deep learning. The overall flow of the research method in this paper is shown in Figure 1. Firstly, the SOC710-VP hyperspectral imaging instrument was used to collect hyperspectral data under standard laboratory lighting conditions, and a wood cross-sectional hyperspectral database was constructed through reflectance calibration. In the spectral space construction stage, two Max–Min strategies are used to screen representative bands, and multi-scale wavelet fusion is performed to generate high-resolution fused images. In the texture space construction stage, three types of texture feature matrices are generated based on the PCA first principal component map. In the single model training phase, spectral branch SpectralFormer++and texture branch TextureFormer are constructed, and a multi-mode interest point guidance mechanism is introduced to realize fine modeling and feature complementation in texture domain. Finally, in the complementary collaborative learning stage, the ST former model is constructed, and the trained SpectralFormer++and TextureFormer weights are imported. Only the fusion weights are optimized to achieve category-adaptive spectral–texture feature fusion.

2. Data Collection and Preprocessing

2.1. Wood Samples

This study selects hyperspectral images of cross-sections of 10 different tree species widely distributed in Puer, Yunnan, China, as the research objects. These samples are collected from different tree bodies to ensure the diversity and representativeness of the research data.

2.2. Collection Method

The specific steps are as follows. First, a circular saw is used to cut along the cross-section from the breast height of the tree (about 1.3 m high) to form a cylindrical wood sample with a thickness of about 30 cm. Subsequently, an SOC710-VP hyperspectral imager and a microscope are used to collect hyperspectral images of the selected samples, and the magnification of the microscope is set to 40–50×. Figure 2 shows the completed image collection environment.

2.3. Dataset Preprocessing

Since the samples have circular cross-sections, the acquired images display rectangular frames, and the four-corner areas usually do not contain effective wood cross-sectional texture information, which is likely to introduce background noise. To ensure the reliability and practicality of the experimental data, ENVI Classic 5.6 is used to crop the collected wood cross-section images to the inscribed rectangular region of the wood cross-section. Figure 3 shows a comparison between the original image (42 bands) and the cropped image (42 bands). Figure 4 presents the collected hyperspectral images of the 10 wood species samples and their corresponding RGB images.

3. Methods and Analysis

In digital image processing, the structure of image data exhibits a significant hierarchical organization depending on the image type [34]. During hyperspectral image processing, each pixel at a given spatial location contains complete spectral information, manifested as a spectral reflectance curve composed of L discrete sampling points. This curve accurately characterizes the spectral properties of the spatial location across different wavelengths [35,36,37].

3.1. Multimodal Texture-Dominated Spectral Space Construction

3.1.1. A Hyperspectral Band Selection Method Based on Four Types of Indicators

Before processing hyperspectral image data, dimensionality reduction is usually required. Currently, dimensionality reduction methods for hyperspectral images are mainly divided into two categories: feature extraction and feature selection. In terms of feature extraction, the main algorithms include principal component analysis (PCA), Fisher Linear Discriminant Analysis (FLD), and Minimum Noise Fraction (MNF) [38,39,40].

To comprehensively consider the image features of wood hyperspectral images, this study proposes a hyperspectral band selection method based on four categories of indicators. Compared with traditional methods, this method shows significant advantages in subsequent feature extraction and classification. To further investigate the characteristics of the 3028 cropped images, this study conducted a band similarity analysis on a hyperspectral cube consisting of 128 bands for each sample. Four analytical techniques were employed in the analysis, namely, Fast Frequency Domain Differential Mapping (FDDM), Difference Hash Similarity (DHashM), Mutual Information (MIM), and Structural Similarity Index (SSIM). The integrated use of these techniques aims to provide a detailed description and evaluation of the differences between bands from multiple dimensions, ensuring the accuracy and completeness of the analysis results.

To address the inconsistency in value scales among different indicators, the similarity results of each indicator are first normalized, and a comprehensive similarity matrix $[eqn]$ is then constructed through weighted fusion to characterize the overall similarity between any two spectral bands.

After obtaining the comprehensive similarity matrix $[eqn]$ , a multi-strategy band selection mechanism is introduced. This mechanism includes two forms: Strategy A (regional quota + Max–Min) and Strategy B (coverage reward Max–Min), to ensure that the selected bands cover the visible, red, and near-infrared regions, while achieving a balance between information redundancy and diversity.

(1)Strategy A: Partition quota and maximum–min distance method (Max–Min)

First, based on the wavelength range calibrated by the spectrometer (372.53–1038.57 nm), the entire spectrum is divided into three segments: the visible light region (VIS), the red light region (RED), and the near-infrared region (NIR). Then, based on the number of bands and their distribution characteristics in each segment, a quota ratio of 3:3:4 is set, and within each segment, several bands are selected using the maximum–min distance method (Max–Min).

Let the comprehensive difference matrix $[eqn]$ , and the selected band set be $[eqn]$ . Then, in each iteration, the band that satisfies the condition of having the largest minimum distance is selected:

[eqn]

This continues until the set number of bands $[eqn]$ is reached. This strategy effectively avoids redundancy caused by excessively dense bands and ensures the representativeness of each spectral segment.

(2)Strategy B: Coverage-aware Max–Min method

To further improve the coverage of the selected bands across the three major spectral ranges, a coverage reward factor $[eqn]$ is introduced, and a scoring function is defined:

[eqn]

where $[eqn]$ denotes the spectral region (VIS, RED, or NIR) to which band $[eqn]$ belongs, and $[eqn]$ denotes the set of spectral regions that have already been covered by the selected band set.

3.1.2. Band Fusion Method Based on Wavelet Transform

After obtaining the most representative set of bands from the spectral data of each sample, this study designs a multi-band fusion algorithm based on multi-scale wavelet transform to further integrate the complementary information contained in different bands, performing joint spatial and frequency domain optimization of the spectral information.

Let the hyperspectral cube obtained after band filtering be $[eqn]$ , where $[eqn]$ represents the number of bands. Based on the band index sets obtained from strategies A and B, the corresponding band images $[eqn]$ are extracted, where $[eqn]$ . This study uses the Daubechies-4 wavelet basis function and sets the decomposition level $[eqn]$ to ensure a balance between resolution and stability.

3.1.3. Spectral Feature Extraction

After successfully acquiring wavelet fusion images, this study employs an interest point detection algorithm based on frequency domain demeshing suppression and multi-feature fusion scoring to extract the most representative texture regions from wood cross-sectional images.

(1)Frequency domain notch degrid

Because periodic interference fringes often appear during the imaging process of experimental equipment, this paper first performs notch filtering on the input fused image $[eqn]$ in the frequency domain.

The image is Fourier transformed into $[eqn]$ , the grid peak center $[eqn]$ is automatically estimated on the amplitude spectrum, and a Gaussian notch mask is constructed:

[eqn]

where $[eqn]$ is the suppression intensity (0.85) and $[eqn]$ is the notch radius (3.5). After mask multiplication and inverse transformation, a smooth image $[eqn]$ is obtained, which effectively eliminates regular noise.

(2)Local enhancement and denoising

In the spatial domain, nonlocal means denoising (NL-Means, intensity 4.0) is employed, followed by joint enhancement using CLAHE (contrast-limited adaptive histogram equalization) and an anti-sharpening mask.

CLAHE adaptively enhances the brightness distribution using an $[eqn]$ local mesh, with a clip limit of 2.0 to prevent excessive noise amplification. Subsequently, an anti-sharpening mask (radius 1.2, enhancement 1.3) and $[eqn]$ correction $[eqn]$ are applied to improve local texture contrast. For samples with uneven illumination, a homomorphic filtering mode can be selected, decomposing the logarithmic domain into low-frequency illuminance and high-frequency reflectance components and amplifying the reflectance term to correct brightness.

(3)Multi-feature scoring map construction

To comprehensively measure the texture saliency of local regions, a weighted scoring map $[eqn]$ is defined:

[eqn]

where $[eqn]$ represents the linear normalization operation.

(4)Point of interest extraction

After selecting a multi-feature scoring map $[eqn]$ , a candidate point set is generated using Harris corner detection combined with local peak search. Let there be a relative threshold $[eqn]$ and a minimum spacing $[eqn]$ . While ensuring uniform spatial distribution, the $[eqn]$ highest-scoring points are selected as the final interest points, sorted from highest to lowest significance value based on the scoring map.

3.2. Multimodal Texture-Dominated Texture Space Construction

To fully explore the differential information of wood cross-sectional texture structure, this paper establishes a two-dimensional feature space dominated by multimodal texture based on hyperspectral data. First, using the first principal component of the original hyperspectral image as the grayscale base image, three types of complementary texture features are extracted from it: Sobel edge features, second-order geometric moment features, and Gabor energy features.

(1)Principal component grayscale background image generation

The first principal component was extracted from the original hyperspectral cube $[eqn]$ using principal component analysis (PCA):

[eqn]

PCA employs a randomized singular value decomposition solver, retaining only the principal variance directions to obtain a standardized grayscale image $[eqn]$ , which serves as the underlying image for texture feature extraction.

(2)Sobel edge features

The gradients are calculated in the horizontal and vertical directions using the Sobel operator:

[eqn]

After normalizing the gradient magnitudes in both directions, the gradients are stacked to form a two-channel edge feature map:

[eqn]

(3)Second-order geometric moment characteristics

To characterize the local structural distribution of wood grain, a two-dimensional weight kernel of window size $[eqn]$ is constructed, and the second-order geometric moments are calculated:

[eqn]

where

[eqn]

After convolution, the matrices are normalized and stacked to form a three-channel moment feature map:

[eqn]

This mode can reflect the anisotropy and geometric distribution of texture.

(4)Gabor energy characteristics

To characterize the texture frequency characteristics at different directions and scales, a Gabor filter $[eqn]$ with six directions is constructed, and its kernel function is defined as follows:

[eqn]

where

[eqn]

The filtered energy for each direction is normalized and stacked to form a six-channel feature map:

[eqn]

This feature can capture directional, periodic textures such as wood rings and rays.

(5)Intramodal Interest Point Filtering and Saving

Calculate the comprehensive score matrix on each modal feature map (gradient magnitude for Sobel modes, L2 norm for moment modes, and maximum energy response for Gabor modes) and select the top $[eqn]$ salient points from each matrix.

3.3. Single Model Training

3.3.1. Spectralformer++

To further improve the physical interpretability and classification robustness of wood spectral feature modeling, this paper proposes SpectralFormer++ based on the SpectralFormer model [41]. This model optimizes the input modeling and feature normalization based on the characteristics of spectral data while keeping the Transformer backbone unchanged, thereby significantly improving the model performance and stability. As shown in Figure 5, the spectral derivative prior is extended through the first-order and second-order difference spectral input channels, enabling the model to perceive changes in spectral morphology at the input stage, thus enhancing its ability to represent fluctuations in curvature, peak shape, and slope.

To further standardize the numerical scale of input features and stabilize the training process, SpectralFormer++ introduces a lightweight linear embedding layer on top of the differential multi-channel input, uniformly projecting the three-channel input into a 64-dimensional embedding space:

[eqn]

Layer normalization is applied at the output to eliminate amplitude deviations between different bands. This structure enables the model to express the dynamic features of the spectrum at the input stage, simultaneously sensing the slope changes at the edges of the absorption band and the curvature features of the peak region, thereby more accurately distinguishing the differences in spectral response of different wood components.

3.3.2. TextureFormer

To further enhance the spatial representation and structural robustness of wood cross-sectional texture features, this paper proposes the TextureFormer model based on the hierarchical shift window architecture of the Swin Transformer [42,43,44]. This model optimizes input modeling, front-end fusion, and attention guidance while maintaining the core hierarchical structure and local attention mechanism. The improved model significantly enhances local texture modeling accuracy and overall classification robustness without increasing computational cost.

(1)Multi-channel input modeling based on Fusion–Sobel–Moments–Gabor

Wood cross-sections exhibit significant anisotropy and multi-scale texture features. A single modality is insufficient to simultaneously describe complex structures such as vessel outlines, ray distribution, and periodic textures. Therefore, TextureFormer constructs four-modal complementary features at the input and stacks them along the channel dimension to form a 12-channel tensor $[eqn]$ :

[eqn]

This multimodal construction allows the model to explicitly obtain complementary information on “global layering, edge orientation, geometry, and periodic texture” at the input layer, providing rich priors for subsequent fusion and attention allocation. After normalization and alignment, all modalities correspond completely in space, providing geometric consistency for subsequent interest point guidance.

(2)Texture Stem: Lightweight Blending and Channel Recalibration

To achieve high-quality fusion of 12-channel features without reducing spatial resolution, this paper designs the Texture Stem module shown in Figure 6.

[eqn]

Local fusion: The first layer of $[eqn]$ convolution mixes the 12 modal channels into a 64-dimensional feature space, completing the initial coupling of multi-source features:

[eqn]

Spatial refinement: Depthwise separable $[eqn]$ convolution enhances spatial consistency while maintaining low parameter count:

[eqn]

Channel recalibration via SE mechanism:

[eqn]

High-dimensional embedding: The final $[eqn]$ convolution increases the feature dimension to $[eqn]$ :

[eqn]

(3)Interest-guided group attention mechanism

Although the Texture Stem fuses 12-channel multimodal texture features into a unified high-dimensional representation, the translation invariance of convolutional operations tends to weaken or even eliminate the spatial saliency distributions of different texture modalities. As a result, the model may fail to preserve the key spatial locations of structural texture features, such as edge orientations, frequency responses, and geometric abrupt changes. Therefore, multimodal texture fusion performed solely in the pixel domain is insufficient to explicitly model spatial saliency information that is critical for structural discrimination.

To compensate for the lack of explicit spatial saliency priors in convolution-based feature fusion, we propose a Modality-Guided Grouped Multi-Head Attention (MG-GMH) mechanism to explicitly model the spatial importance of key texture regions. Rather than encoding texture content itself, MG-GMH introduces structured spatial saliency constraints to guide the attention mechanism toward regions that are more critical for wood structural discrimination.

(1)Interest-Based Saliency Modeling

For each sample, candidate interest points are extracted independently from four texture modalities, namely, fusion texture, Sobel, moments, and Gabor features. Each interest point is represented as $[eqn]$ , where $[eqn]$ denotes the spatial coordinate and $[eqn]$ denotes the original confidence score. To jointly consider confidence strength and spatial dispersion, a True Score is defined as

[eqn]

where $[eqn]$ is the Euclidean distance between point $[eqn]$ and a previously selected point $[eqn]$ , $[eqn]$ is a normalization constant, and $[eqn]$ controls the trade-off between confidence and spatial diversity. Based on the True Score, a fixed number of interest points are selected for each modality.

The selected interest points are then projected onto the Transformer patch grid with stride $[eqn]$ , and a sparse token-level saliency map is constructed as

[eqn]

where $[eqn]$ denotes the token-level saliency distribution of modality $[eqn]$ . This process aligns interest points from continuous spatial coordinates to discrete Transformer tokens, providing modality-specific saliency priors for subsequent attention guidance.

(2)Modality-Based Attention Head Grouping

To prevent saliency interference across different texture modalities, MG-GMH introduces a modality-level grouping constraint at the attention head level. Given an attention head index $[eqn]$ , its corresponding modality is determined as

[eqn]

This grouping strategy ensures that each group of attention heads is guided exclusively by the saliency prior of its assigned modality, thereby achieving modality-specific structural decoupling within the attention mechanism.

(3)Saliency Bias Injection

Within each attention window, the token-level saliency vector of modality $[eqn]$ is denoted as

[eqn]

An additive saliency bias matrix is constructed as

[eqn]

and injected into the attention logits of the $[eqn]$ -th attention head:

[eqn]

The bias is broadcast along the key dimension, allowing tokens with higher saliency values to receive larger attention weights after normalization. In this manner, MG-GMH explicitly guides the attention mechanism to focus on structurally salient texture regions, as illustrated in Figure 7.

3.4. Complementary and Collaborative Learning Training

3.4.1. Method Overview

To fully leverage the complementarity of SpectralFormer++ and TextureFormer, this section adopts the complementary collaborative learning paradigm ST-former, which trains only the inter-class fusion weights. After the two branches have fully converged on their respective data and tasks, their parameters are fixed, and only the one-dimensional fusion coefficient set $[eqn]$ for each class is trained to complete the subsequent complementary fusion.

Given the spectral branch logit $[eqn]$ and the texture branch logit $[eqn]$ of class $[eqn]$ , the fused logit is defined as

[eqn]

3.4.2. Training Objective and Loss Function

Only $[eqn]$ are updated by minimizing the cross-entropy loss:

[eqn]

To prevent $[eqn]$ from collapsing to extreme values (i.e., overly biased toward 0 or 1), a mild regularization term is introduced:

[eqn]

The final objective is as follows:

[eqn]

In practice, we found that $[eqn]$ stabilizes training while avoiding suppression of personalized bias. If the majority of $[eqn]$ values approach approximately 0.5 after training, $[eqn]$ can be appropriately decreased.

4. Experiment and Results

4.1. Experimental Analysis of Wood Spectral Space Construction

4.1.1. Experimental Analysis of Hyperspectral Band Selection

Taking the hyperspectral cube of the cross-section of Quercus aliena Blume wood as an example, similarity matrices are first calculated via four types of metrics (FDDM, DHashM, MIM, SSIM), and heatmaps are generated, as shown in Figure 8.

In this study, the equal weighting method is employed to allocate weights to various evaluation indicators, specifically assigning identical weight coefficients to the four core indicator categories. This weighting design aims to ensure that the characteristic attributes of each dimension receive equally important consideration during the analysis, $[eqn]$ thereby preventing features of certain dimensions from being weakened or exaggerated in the final evaluation results due to weight discrepancies. With this balanced weight allocation strategy, the contribution of features from different dimensions to the overall evaluation system can be maintained relatively balanced, making the final evaluation results more comprehensive and objective, as shown in Figure 9.

The multi-strategy band selection mechanism applies appropriate weighting to uncovered spectral regions based on differences, ensuring that the selected bands are more evenly distributed across the full spectrum and provide more comprehensive information. Finally, the top 10 band pairs with the lowest similarity are extracted from the comprehensive similarity matrix, as shown in Table 2.

Overall, the bands selected by Strategy A are relatively evenly distributed across the visible, red, and near-infrared regions, effectively reflecting the main reflection characteristics of wood in different bands. In contrast, Strategy B, due to the introduction of a coverage reward factor, yields selected bands that are more dispersed across the full spectrum, with stronger information complementarity.

4.1.2. Experimental Analysis of Wavelet Transform Band Fusion

During image reconstruction, the fused subband set $[eqn]$ is processed via inverse wavelet transform to obtain a single fused image $[eqn]$ . To eliminate brightness differences between various samples and improve visualization effects, the fusion results were subjected to linear normalization. The results are converted to uint8 format, with the pixel range normalized to $[eqn]$ . Finally, fused images are generated according to Strategies A and B, respectively, as shown in Figure 10.

4.1.3. Experimental Analysis of Spectral Feature Extraction

To avoid the influence of a single fusion scheme on subsequent detection, the wavelet fusion results of Strategies A and B are, respectively, enhanced and subjected to multi-feature weighted evaluation, yielding two score maps, $[eqn]$ and $[eqn]$ , as shown in Figure 11.

It should be noted that the Harris corner response is not only used as a corner strength feature for weighted calculation in the multi-feature score map but also employed for generating candidate points during the interest point extraction stage. The former reflects the saliency of local intensity changes, while the latter determines candidate locations; the two cooperate with each other to achieve high-precision interest point detection, as shown in Figure 12.

Finally, the 128-dimensional reflectance spectral features of the corresponding pixels are extracted from the original hyperspectral image at the interest point locations. The resulting spectral features possess strong physical interpretability and can reflect differences in the spectral responses of different wood components. Figure 13 shows the average spectral reflectance curves of ten tree species.

4.2. Experimental Analysis of Wood Texture Space Construction

The Sobel modality can effectively characterize structural features such as wood pores, fiber boundaries, and vessel distributions. As shown in Figure 14, the left panel displays the multi-feature score distribution map of the Sobel modality, while the right panel presents the corresponding interest point detection results.

The Moments modality characterizes the local texture energy distribution and shape orientation through second-order geometric moments, effectively capturing the scale differences and geometric configuration features of wood structure. As shown in Figure 15, the left image shows the score distribution map of the Moments modality, and the right image shows the results of its interest point detection.

The Gabor modality reflects the directional and frequency characteristics of textures and can effectively characterize the periodic radial distribution of wood textures. As shown in Figure 16, the left panel shows the multi-feature score distribution map of the Gabor modality, while the right panel shows the corresponding interest point detection results; the interest points are mainly concentrated in the high-frequency regions of texture energy.

4.3. Experimental Analysis of Single Model Training

(1)Training and Performance Evaluation of the SpectralFormer++

The AdamW optimizer (learning rate $[eqn]$ ), batch size = 256, and cross-entropy loss function are adopted for end-to-end training, with the number of training epochs set to 300. As shown in Figure 17, the training loss continuously decreases, while the training accuracy and validation accuracy steadily increase and tend to converge in the later stages, indicating that the model exhibits a stable training process and excellent generalization performance.

On the test set, SpectralFormer++ achieves an overall classification accuracy of 81.55%. As shown in Figure 18 (confusion matrix), categories such as Ailanthus, Eucalyptus Wild, and Red-Stemmed Nan exhibit high recognition accuracy on the diagonal, indicating that the model has excellent discriminative ability for tree species with distinct spectral features. However, for woods with similar spectral characteristics, such as Iron Knife Wood and Birch, White Birch, White Gun Barrel and Yellow Camphor Tree, a certain degree of confusion still exists, with their diagonal accuracies ranging from 0.54 to 0.74. This is mainly due to the highly similar reflectance characteristics of these tree species in the visible and near-infrared bands.

(2)Training and Performance Evaluation of the TextureFormer

To verify the effectiveness of the TextureFormer in multimodal texture feature recognition tasks, training and testing are conducted on the fused multimodal texture dataset. The AdamW optimizer (learning rate $[eqn]$ , batch size = 32, number of training epochs = 300) is adopted. During training, the model input is a 12-channel texture stacked image composed of four modalities (Fusion, Sobel, Moments, and Gabor), combined with interest point-guided coordinates as auxiliary input. As shown in Figure 19, the model’s loss continuously decreases during training, while the accuracy improves rapidly in the early stages of iteration and enters a stable phase after approximately 100 epochs.

On the test set, the overall classification accuracy of the TextureFormer model reaches 86.27%. Its confusion matrix is shown in Figure 20, demonstrating that the model performs well overall in recognition. From the figure, it can be seen that Quercus acutissima, White Gun Barrel, and Simao Pine achieved high classification accuracy, indicating that the model could effectively learn the key features of these tree species with distinctly different texture structures. However, there was still some confusion among certain samples. Particularly, categories such as Red Stemmed Nan, Eucalyptus wild, and White Birch—characterized by similar textures or grayscale distributions—showed relatively lower accuracy, indicating that the model still faced difficulty in distinguishing these subtle texture differences.

4.4. Complementary Collaborative Learning Parameter Training and Performance Evaluation

Figure 21 shows the $[eqn]$ evolution curves of ten major tree species. The results indicated that all categories of $[eqn]$ experienced a phase of rapid fluctuation at the beginning of training, followed by gradual convergence and stabilization in different ranges, with the overall distribution concentrated between 0.4 and 0.6. This suggested that the model achieves a relatively balanced weight allocation between the spectral and texture modalities. Among them, the $[eqn]$ values for Eucalyptus wild, Red Stemmed Nan, and other species are slightly lower, indicating that the model tends to rely more on SpectralFormer++ when distinguishing woods with prominent spectral features, whereas the higher $[eqn]$ values for Yellow Camphor Tree, Betula, and similar samples suggest that their recognition mainly depends on the texture branch TextureFormer. Overall, the convergence of $[eqn]$ confirms that the strategy of training only the fusion parameters effectively avoids overfitting and oscillation issues and provides good interpretability at the category level.

The test results of the joint model are shown in Figure 22. Compared with single-modality models, the diagonals of the row-normalized confusion matrix is more concentrated, and the overall classification accuracy increases to 92.27%. For categories such as Ailanthus and Quercus acutissima, the diagonal values exceed 0.9, indicating that the model can fully utilize the complementary information of spectral morphology and texture structure to achieve high-confidence discrimination. For categories like Iron Knife Wood and White Birch—characterized by similar textures and adjacent spectral features—partial confusion still exists, but their error rates are significantly reduced compared with single-modality models.

In addition, classes such as Birch, White Gun Barrel, and Simao Pine, which are easily confused in single-branch models, show significant improvement after joint training, with the proportion of off-diagonal elements reduced by more than half, indicating that the adaptive weight adjustment mechanism effectively enhances the efficiency of modality collaboration. Overall, the complementary collaborative model significantly improves the model’s generalization capability and classification robustness while considering both spectral and texture features, providing reliable experimental support for subsequent multimodal wood recognition. In summary, the ST-former model, through learnable inter-class fusion weights, effectively balances the importance of spectral and texture features, not only improving overall recognition accuracy but also maintaining interpretability and stability in the model structure, offering a reliable fusion approach for future multimodal wood recognition.

5. Discussion

The research method of this article is the perfect cross application of computer vision, spectroscopy, and artificial intelligence in wood science. This research method has unique advantages over wood species identification methods such as NIRS, stable isotopes and mineral elements, DNA, etc. Compared with the NIRS method [19,45,46], the method studied in this paper can learn the most discriminative microscopic features from the hyperspectral images of wood cross-sections, with higher recognition accuracy and more robust models. Compared with stable isotope and mineral element methods [24,47,48], the research method in this paper can establish a complex nonlinear mapping model between isotope/element data and tree species and origin, which can explore deeper and more complex patterns and achieve more accurate identification of wood species. Compared with the DNA method [32,49,50], the research method in this article is completely non-destructive and has a faster testing time. However, DNA sampling usually requires drilling or cutting a small amount of wood chips for testing.

The classification results shown in Figure 22 indicate that, even under the spectral–texture joint modeling framework, noticeable confusion still exists among certain tree species, suggesting that the joint model retains inherent limitations when distinguishing specific categories. Further analysis reveals that species with lower classification accuracy generally exhibit highly similar spectral reflectance curve shapes, while their cross-sectional textures show only subtle differences in vessel distribution, directional patterns, and structural scale. Under such conditions, neither spectral nor texture modalities can provide sufficiently discriminative information, and multimodal fusion alone is unable to fundamentally overcome the limited separability of the original features. As a result, the performance improvement of the joint model for highly similar categories remains constrained.

From a modeling perspective, the current joint framework primarily relies on one-dimensional spectral sequences and two-dimensional cross-sectional texture features for classification. While this design is effective for many tree species, it imposes an upper bound on representational capacity when involving species that are highly similar in structure and composition. In particular, higher-level structural differences within wood, such as cell-scale organization, micro-anatomical characteristics, or three-dimensional spatial distributions, are not explicitly modeled in the current approach. This limitation further restricts the discriminative capability of the joint model for certain species.

In addition, environmental factors associated with tree growth are not incorporated into the current recognition framework, although such factors may introduce discriminative information that is not captured by spectral and texture features alone. Previous studies have shown that growth environment conditions, including climate factors, temperature variation, precipitation, and soil nutrient availability, can influence wood formation processes and are partially reflected in wood structural and material characteristics [21,51]. For tree species with highly similar spectral and texture features, variations induced by growth environments may provide supplementary cues that help reduce inter-class ambiguity.

Based on the above analysis, future work aimed at improving classification accuracy across all categories—particularly for highly similar species—may consider extending the current spectral–texture joint framework in several directions. First, incorporating samples collected from diverse growth regions and environmental conditions could enhance the model’s ability to capture intra-class variability. Second, integrating higher-resolution texture information, micro-anatomical features, or three-dimensional structural representations may help compensate for the limitations of one-dimensional spectral and two-dimensional texture features. Finally, environmental parameters such as climatic and nutrient indicators could be introduced as sample-level conditional information or as auxiliary constraints at the decision stage, thereby complementing existing features from a growth-mechanism perspective and improving discrimination among highly similar tree species.

6. Conclusions

This paper addresses the issues associated with traditional wood identification methods, such as low efficiency, susceptibility to noise interference, and difficulty in distinguishing similar tree species, and proposes an intelligent wood species identification method based on the integration of multimodal texture-dominant features and deep learning. Based on hyperspectral imaging data, this method constructs deep feature representations from two perspectives (spectral domain and texture domain) and combines learnable modal fusion weights to achieve the unity of high precision and interpretability in wood recognition. In terms of spectral feature modeling, this paper designs the SpectralFormer++ model and introduces a spectral derivative prior and a front-end embedding normalization mechanism, effectively alleviating the instability caused by band redundancy and amplitude differences and improving the morphological expression ability of spectral features and the convergence stability of the model. For texture feature modeling, the TextureFormer model is proposed, which fully captures the fine-grained texture structures and directional patterns of wood cross-sections through a fusion-based Texture Stem and a grouping attention mechanism for multimodal input. Based on the output results of the two models, the ST-former complementary collaborative fusion framework is further constructed, and inter-class adaptive weight learning is realized, enabling the model to dynamically balance the contributions of spectral and texture features for different wood categories.

The experimental results show that the overall classification accuracy of the proposed joint model on 10 typical wood datasets from Yunnan reaches 90.27%, which is approximately 8% higher than that of single-modal models. Among these species, Eucalyptus rudis Endl, Phoebe rufescens H. W. Li, and other tree species with significant texture differences achieve the highest recognition accuracy, while Betula platyphylla Sukaczev, Senna siamea (Lam.) H. S. Irwin & Barneby, and other tree species with similar spectra exhibit a significantly lower confusion rate. This proves the effectiveness of the complementary synergistic mechanism in feature fusion and modal decoupling. The inter-class distribution of parameters further elucidates the model’s interpretability: the values for texture-dominated tree species are significantly higher than those for spectrum-dominated classes, indicating that the model can adaptively learn the feature dependencies of different tree species.

In general, the spectral–texture collaborative learning framework proposed in this paper achieves a balance between recognition accuracy, stability, and interpretability for wood species recognition tasks and provides a novel and promising idea for multimodal intelligent recognition with potential for application in the field of hyperspectral wood recognition. Future work can extend this framework to larger-scale datasets with more tree species and integrate 3D spectral structures or microstructural images to develop a more versatile and robust wood recognition system.

Bibliography51

The reference list from the paper itself. Each links out to its DOI / PubMed record.

1Chen Y. Meng Y. Zhang J. Xie Y. Guo H. He M. Shi X. Mei Y. Sheng X. Xie D. Leakage Proof, Flame-Retardant, and Electromagnetic Shield Wood Morphology Genetic Composite Phase Change Materials for Solar Thermal Energy Harvesting Nano-Micro Lett.20241619610.1007/s 40820-024-01414-4PMC 1109900238753068 · doi ↗ · pubmed ↗
2Mo L. Crowther T.W. Maynard D.S. Johan V.D.H. Ma H.Z. Lalasia B.M. Liang J.J. Sergio D.M. Nabuurs G.J. Reich P.B. The global distribution and drivers of wood density and their impact on forest carbon stocks Nat. Ecol. Evol.202482195221210.1038/s 41559-024-02564-939406932 PMC 11618071 · doi ↗ · pubmed ↗
3Khoo P.S. Rizal M.A.M. Ilyas R.A. Yajid M.A.M. Shukur A.H. Yahya M.Y. Wahit M.U. Unveiling Favorable Mechanical Properties of Lignocellulosic Wood—Reinforced Thermoplastic Composites as Future Green and Sustainable Materials Fibers Polym.2025261425144810.1007/s 12221-025-00874-8 · doi ↗
4Korjakins A. Sahmenko G. Lapkovskis V. A Short Review of Recent Innovations in Acoustic Materials and Panel Design: Emphasizing Wood Composites for Enhanced Performance and Sustainability Appl. Sci.202515464410.3390/app 15094644 · doi ↗
5De Ligne L. Fredriksson M. Thygesen L.G. Thybring E.E. Influence of degradation products from thermal wood modification on wood-water interactions J. Mater. Sci.2025603346336410.1007/s 10853-025-10675-2 · doi ↗
6Zhu Y. Li L. Wood of trees: Cellular structure, molecular formation, and genetic engineering J. Integr. Plant Biol.20246644346710.1111/jipb.1358938032010 · doi ↗ · pubmed ↗
7Fu Z. Lu Y. Wu G. Bai L. Daniel B.R. Lyu J. Liu S. Rojas O. Wood elasticity and compressible wood-based materials: Functional design and applications Prog. Mater. Sci.202514710135410.1016/j.pmatsci.2024.101354 · doi ↗
8Hussein A. Bulbul A. Wu Q. Lin H. Kameshwar H. Mohammad S. Effect of partial delignification and densification on chemical, morphological, and mechanical properties of wood: Structural property evolution Ind. Crops Prod.2024213118430