Streamlining Multimodal Data Fusion in Wireless Communication and Sensor Networks
Mohammud J. Bocus, Xiaoyang Wang, Robert J. Piechocki

TL;DR
This paper introduces a VQVAE-based multimodal data fusion method that effectively compresses and reconstructs diverse data types, including in 5G scenarios, with minimal performance loss and low computational requirements.
Contribution
It develops a simple, effective VQVAE-based model for multimodal data fusion and extends it to 5G CSI feedback, enabling efficient data compression for wireless communication.
Findings
High-quality reconstruction of paired MNIST-SVHN and WiFi spectrogram data
Effective data compression in 5G CSI feedback with minimal performance loss
Learned discriminative feature space for various input data types
Abstract
This paper presents a novel approach for multimodal data fusion based on the Vector-Quantized Variational Autoencoder (VQVAE) architecture. The proposed method is simple yet effective in achieving excellent reconstruction performance on paired MNIST-SVHN data and WiFi spectrogram data. Additionally, the multimodal VQVAE model is extended to the 5G communication scenario, where an end-to-end Channel State Information (CSI) feedback system is implemented to compress data transmitted between the base-station (eNodeB) and User Equipment (UE), without significant loss of performance. The proposed model learns a discriminative compressed feature space for various types of input data (CSI, spectrograms, natural images, etc), making it a suitable solution for applications with limited computational resources.
| Component | WiFi CSI (2 receivers) | Paired MNIST-SVHN | CSI feedback (2 receivers) |
|---|---|---|---|
| Input to each encoder | Spectrogram data: (B×1×224×224) | Encoder 1, MNIST: (B×1×32×32); Encoder 2, SVHN: (B×3×32×32) | Pre-processed CSI feedback data: (B×2×28×8). No. of channels = 2 because of real and imaginary components. 28×8 corresponds to 28 subcarriers and 8 transmit antennas after pre-processing. |
| Encoder | Conv2d 64(4,4), stride=(2,2), padding=(1,1); ReLU; Conv2d 128(4,4), stride=(2,2), padding=(1,1); ReLU; Conv2d 128(3,3), stride=(1,1), padding=(1,1) | Conv2d 64(4,4), stride=(2,2), padding=(1,1); ReLU; Conv2d 128(4,4), stride=(2,2), padding=(1,1); ReLU; Conv2d 128(3,3), stride=(1,1), padding=(1,1) | Conv2d 64(4,3), stride=(2,1), padding=(1,1); ReLU; Conv2d 128(2,3), stride=(2,1), padding=(1,1) |
| Decoder (first layer) | Conv2d 128(3,3), stride=(1,1), padding=(1,1) | Conv2d 128(3,3), stride=(1,1), padding=(1,1) | Conv2d 128(3,3), stride=(1,1), padding=(1,1) |
| Residual Stack (no. of residual blocks=2) | Per block: ReLU; Conv2d 32(3,3), stride=(1,1), padding=(1,1), bias=False; ReLU; Conv2d 128(1,1), stride=(1,1), bias=False | Same as WiFi CSI | Same as WiFi CSI |
| Encoder output dimension | B×128×56×56 | B×128×8×8 | B×128×8×8 |
| Pre-VQ-Conv layer | Conv2d 128(1,1), stride=(1,1) | Conv2d 128(1,1), stride=(1,1) | Conv2d 128(1,1), stride=(1,1) |
| Decoder (transposed convolutions) | ConvTranspose2d 64(4,4), stride=(2,2), padding=(1,1); ReLU; ConvTranspose2d 2(4,4), stride=(2,2), padding=(1,1) | ConvTranspose2d 64(4,4), stride=(2,2), padding=(1,1); ReLU; ConvTranspose2d 2(4,4), stride=(2,2), padding=(1,1) | ConvTranspose2d 64(3,3), stride=(2,1), padding=(1,1); ReLU; ConvTranspose2d 2(2,3), stride=(2,1), padding=(1,1) |
| Compression rate | 2×(1×224×224×8) / (56×56×9) = 28.44 (considering 512 embeddings in the codebook, 2 input spectrograms and 8 bits per channel) | [(1×32×32×8) + (3×32×32×8)] / (8×8×9) = 56.89 (considering 512 embeddings in the codebook, MNIST data with 1 channel, SVHN data with 3 channels, and 8 bits per channel) | 2×(2×28×8×(4×8)) / (8×8×9) = 49.78 (considering pre-processed CSI feedback data, 2 receivers each with 2 channels (real and imag), 512 embeddings in the codebook, and one float number occupying 4 bytes = 32 bits) |
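The latent sizes and compression rates quoted in the table can be checked with a few lines of arithmetic. The sketch below is only a sanity check, not the authors' code: the helper names are hypothetical, the layer formulas are the standard Conv2d / ConvTranspose2d output-size rules with dilation 1 and no output padding, and each latent position is assumed to cost log2(512) = 9 bits.

```python
import math

def conv2d_out(size, kernel, stride, padding):
    """Output length of one spatial axis after a Conv2d layer (dilation 1)."""
    return (size + 2 * padding - kernel) // stride + 1

def conv_transpose2d_out(size, kernel, stride, padding):
    """Output length of one spatial axis after a ConvTranspose2d layer."""
    return (size - 1) * stride - 2 * padding + kernel

def square_encoder(size):
    """Two (4,4) stride-2 convolutions followed by a (3,3) stride-1 convolution."""
    size = conv2d_out(size, 4, 2, 1)
    size = conv2d_out(size, 4, 2, 1)
    return conv2d_out(size, 3, 1, 1)

wifi_latent = square_encoder(224)    # spectrogram side length after encoding
image_latent = square_encoder(32)    # MNIST / SVHN side length after encoding

# CSI feedback encoder: kernels (4,3) then (2,3), both with stride (2,1).
csi_h = conv2d_out(conv2d_out(28, 4, 2, 1), 2, 2, 1)
csi_w = conv2d_out(conv2d_out(8, 3, 1, 1), 3, 1, 1)

# CSI feedback decoder height: transposed kernels (3,3) then (2,3), stride 2 on this axis.
csi_rec_h = conv_transpose2d_out(conv_transpose2d_out(8, 3, 2, 1), 2, 2, 1)

bits_per_index = math.log2(512)      # 9 bits address one of 512 codebook entries

def compression_rate(input_bits, latent_h, latent_w):
    """Ratio of raw input bits to the bits needed for the codebook indices."""
    return input_bits / (latent_h * latent_w * bits_per_index)

rate_wifi = compression_rate(2 * (1 * 224 * 224 * 8), wifi_latent, wifi_latent)
rate_images = compression_rate((1 * 32 * 32 * 8) + (3 * 32 * 32 * 8),
                               image_latent, image_latent)
rate_csi = compression_rate(2 * (2 * 28 * 8 * 32), csi_h, csi_w)

print(round(rate_wifi, 2), round(rate_images, 2), round(rate_csi, 2))
```

Under these assumptions the three rates come out at 28.44, 56.89 and 49.78, matching the table, and the CSI decoder restores the 28-subcarrier axis from the 8-row latent map.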
| Data input type | F1-macro |
|---|---|
| | 0.916205 |
| | 0.936949 |
| | 0.947357 |
| | 0.925473 |
| Multimodal VQVAE latent vector | 0.905211 |
| Method | Compression rate | Cosine similarity (Indoor) | Cosine similarity (Outdoor) | NMSE (Indoor) | NMSE (Outdoor) |
|---|---|---|---|---|---|
| CsiNet [38] | 4 | 0.99 | 0.91 | -17.36 | -8.75 |
| CsiNet [38] | 16 | 0.93 | 0.79 | -8.65 | -4.51 |
| CsiNet [38] | 32 | 0.89 | 0.67 | -6.24 | -2.81 |
| CsiNet [38] | 64 | 0.87 | 0.59 | -5.84 | -1.93 |
| ConvCsiNet [39] | 16 | 0.98 | 0.85 | -13.79 | -6.00 |
| ConvCsiNet [39] | 32 | 0.95 | 0.82 | -10.10 | -5.21 |
| ShuffleCsiNet [39] | 16 | 0.97 | 0.82 | -12.14 | -5.00 |
| ShuffleCsiNet [39] | 32 | 0.94 | 0.74 | -9.41 | -3.50 |
| Ours | 32 | 0.98 | 0.93 | -14.52 | -9.99 |
| Ours | 114 | 0.96 | 0.85 | -10.63 | -6.06 |
| Ours | 128 | 0.95 | 0.84 | -10.41 | -5.56 |
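The two reconstruction metrics in the table can be illustrated with a short NumPy sketch. It is not the evaluation code behind the reported numbers: it assumes NMSE is expressed in dB (the negative values suggest this, but the page does not state the unit) and computes cosine similarity over the whole flattened complex channel array, whereas other papers average it per subcarrier. The function names and the synthetic channel are hypothetical.

```python
import numpy as np

def nmse_db(h_true, h_rec):
    """Normalized mean squared error between two complex arrays, in dB."""
    err = np.sum(np.abs(h_true - h_rec) ** 2)
    ref = np.sum(np.abs(h_true) ** 2)
    return 10.0 * np.log10(err / ref)

def cosine_similarity(h_true, h_rec):
    """Magnitude of the normalized inner product of the flattened arrays."""
    a = h_true.ravel()
    b = h_rec.ravel()
    return np.abs(np.vdot(a, b)) / (np.linalg.norm(a) * np.linalg.norm(b))

# Synthetic 28x8 complex channel and a noisy "reconstruction" (illustrative only).
rng = np.random.default_rng(0)
h = rng.standard_normal((28, 8)) + 1j * rng.standard_normal((28, 8))
noise = rng.standard_normal((28, 8)) + 1j * rng.standard_normal((28, 8))
h_hat = h + 0.1 * noise

print(nmse_db(h, h_hat), cosine_similarity(h, h_hat))
```

A lower (more negative) NMSE and a cosine similarity closer to 1 both indicate a better reconstruction, which is the sense in which the rows of the table should be compared.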
Mohammud J. Bocus¹, Xiaoyang Wang², Robert J. Piechocki¹
¹Department of Electrical and Electronic Engineering, University of Bristol, UK
²Department of Computer Science, University of Exeter, UK
Index Terms:
VQVAE, WiFi CSI, CSI feedback, deep learning, multimodal data fusion.
I Introduction
Multimodal fusion is an important aspect of modern artificial intelligence and machine learning systems. It is a process of combining data from multiple sensors to create a comprehensive understanding of the environment. In various applications, such as robotics, autonomous vehicles, and Internet of Things (IoT), multiple sensors are used to capture information from the environment, including vision, audio, lidar, radar, sonar, GPS and more. By combining this data, a more accurate and robust representation of the environment can be created. Multimodal sensor fusion is important because it helps to overcome the limitations of individual sensors and allows for more reliable and robust decision-making. However, compression of multimodal data is also needed for increasing efficiency, decreasing the cost of storage and transmission, and facilitating real-time processing of substantial datasets in a variety of applications.
For example, in 5G networks, Channel State Information (CSI) feedback plays a critical role in the communication system. To enhance communication performance, 5G networks make use of sophisticated multi-antenna techniques such as massive MIMO, which necessitate accurate CSI feedback. This feedback is utilized to modify the transmission parameters at the transmitter to account for the fluctuating wireless channel conditions and improve communication quality. Due to the large number of antennas employed in 5G networks, significant amounts of CSI data are generated. To maintain efficient operation, 5G networks need to apply advanced and smart compression techniques to minimize the size of the CSI feedback data, thereby reducing the latency and overhead of the feedback process.
In the scope of multimodal sensor fusion and compression, we propose a multimodal Vector-Quantized Variational Autoencoder (VQVAE) model that can handle multiple modalities within a single model. We first evaluate our straightforward yet highly efficient model on paired MNIST-SVHN data as a feasibility check for fusion and reconstruction. We then extend our model to two different use cases. In the first, we apply the multimodal VQVAE model to WiFi spectrogram data to obtain a compressed and discriminative feature space for passive sensing and Human Activity Recognition (HAR) applications. In the second, we evaluate the proposed model in a 5G communication network setting: we use it to efficiently compress the CSI feedback data transmitted from a User Equipment (UE) to a gNodeB (base-station), while maintaining excellent reconstructed channel estimate quality.
This paper is organised as follows: Related works on multimodal data fusion are presented in Section II. The background, methodology and system design are described in Section III. Section IV provides detailed information on the experimental setup and corresponding results. Finally, conclusions are drawn in Section V.
II Related Work
In this section, we review previous research on the topic of multimodal generative modeling and sensor fusion. The goal of Multimodal Variational Autoencoders (MVAEs) is to learn a joint representation for different kinds of modalities in a self-supervised way, without the need for manual labeling of large amounts of data [1]. However, obtaining a unified representation from multiple modalities can be challenging as they are often of different data types, having different distributions, levels of sparsity, and dimensions [2]. To learn a shared representation across multiple modalities, the authors of [3] employ a joint inference network. To tackle the challenge of a missing modality, they train individual (single-modal) inference networks for each modality as well as a bi-modal inference network to learn the joint posterior through the use of the Product-of-Experts (PoE) method. MVAE [4], which is similarly based on PoE, only takes into account a partial combination of the observed modalities, resulting in a smaller number of parameters and increased computing efficiency. On the other hand, the Mixture-of-Experts (MoE) technique is used in [5] to learn the shared representation across several modalities. To get the best of both worlds, [6] attempts to integrate the benefits of both MoE and PoE in their model, which is referred to as MoPoE (Mixture-of-Products-of-Experts)-VAE. In [1], the authors proposed a technique for multimodal sensor fusion. The method consists of a two-stage process, whereby a multimodal generative model is trained on unlabelled data in the first stage. Then, in the second stage, this trained generative model serves as a reconstruction prior and the search manifold for various sensor fusion tasks such as multisensory classification, denoising, and recovery from subsampled (compressed) observations.
When it comes to multimodal sensor fusion for human activity detection employing Radio-Frequency (RF), inertial, and/or vision sensors, the bulk of publications have either investigated feature-level fusion or decision-level fusion [1]. For example, [7] uses a hybrid Deep Neural Network (DNN) model to perform multimodal fusion at the decision level by exploiting the advantages of both WiFi and vision-based sensors. [8] describes a method for activity recognition that makes use of four different sensor modalities, including WiFi fingerprints, inertial and motion capture measurements, and skeletal sequences. The measurements from each modality are transformed into images and the fusion of the multimodal data is formulated as a matrix concatenation. A multimodal Human Activity Recognition (HAR) system that uses WiFi and wearable sensor modalities to jointly infer human behaviours was proposed by the authors in [9]. They gather measurements of the user’s body motions through a wearable Inertial Measurement Unit (IMU) and WiFi CSI data. Their method consists of calculating the magnitude of the inertial data for each sensor of the IMU and the time-variant Mean Doppler Shift (MDS) from the processed CSI data. The magnitude and the MDS are then independently used to extract different temporal and frequency domain features. The authors adopt the feature-level fusion, whereby feature vectors from the same activity sample are concatenated sequentially. Finally, supervised machine learning methods are employed to classify human activities. The authors of [10] leverage the transformer architecture for multimodal sensor fusion. They use different signal processing techniques to extract multiple image-based features from Passive WiFi Radar (PWR) and CSI data such as spectrograms, scalograms and Markov Transition Field (MTF). 
As compared to the conventional transformer architecture which divides an image into small patches, the authors instead use a different technique whereby each patch represents a different image-based feature. They developed both supervised and self-supervised models and demonstrated their excellent performance on the HAR task compared to traditional Convolutional Neural Networks (CNNs). Other approaches used in HAR applications include contrastive learning methods [11, 12, 13], which necessitate either multiple views per sensor modality or robust data augmentation methods to generate pairs of negative and positive samples.
Recently, some models that leverage discrete representations through vector quantisation have been proposed, such as VQVAE [14] and VQVAE-2 [15]. VQVAE is a type of generative model that combines the principles of autoencoders and vector quantization to generate high-quality, compact representations of data, which can then be used for further analysis or manipulation in applications such as image and audio synthesis. The VQVAE model is made up of three parts: an encoder that converts an image into latent variables; a shared codebook that is used to quantize these continuous latent vectors into a set of discrete latent variables (each vector is replaced with the nearest vector from the codebook); and a decoder that uses the indices of the vectors from the codebook to reconstruct the image. VQVAE-2 extends the original VQVAE by implementing a two-level hierarchical encoder-decoder model (with multi-scale latent maps) which uses an autoregressive prior, namely PixelCNN, to generate diverse high-resolution samples [15].
In this work, we extend the VQVAE model to a multimodal setting. More specifically, our proposed model takes multimodal data as input, compresses the data into a shared low-dimensional discrete latent representation space, and reconstructs the data from the quantized output of the vector quantizer with low reconstruction error using the corresponding decoders.
III Methodology and System Design
III-A VQVAE
The VQVAE model consists of an encoder and decoder network, a Vector Quantization (VQ) layer and a reconstruction loss function [16]. The encoder takes as input the data sample $x$ and outputs the vector $z_e(x)$. The VQ layer maintains an embedding table, $e \in \mathbb{R}^{K \times D}$, which consists of $K$ vectors of dimension $D$, to quantize the encoder outputs. The VQ layer outputs an index $k$ and the corresponding embedding $e_k$, which is closest to the input vector $z_e(x)$ in Euclidean distance. Given an input signal $x$ (e.g. an RGB image), the encoder first encodes it as an $H \times W \times D$ tensor, where $H$ and $W$ denote the height and width of the latent representation, respectively. Then, every $D$-dimensional vector is quantized using a nearest-neighbor lookup on the embedding table [17] as per the following:
$$z_q(x)_{ij} = e_k, \qquad k = \arg\min_{m \in \{1, \dots, K\}} \lVert z_e(x)_{ij} - e_m \rVert_2$$
where $(i, j)$ refers to a spatial location. The decoder then uses the embedding to reconstruct the input data, $\hat{x}$. Since the quantization operation is non-differentiable, the gradient is approximated using the straight-through estimator. That is, the gradient from the first layer of the decoder is passed directly to the last layer of the encoder, bypassing the codebook. The latter is updated via an exponential moving average of the encoder outputs. The loss function is represented as
$$\mathcal{L} = \lVert x - \hat{x} \rVert_2^2 + \beta \, \lVert z_e(x) - \mathrm{sg}[e] \rVert_2^2$$
where $\mathrm{sg}[\cdot]$ denotes the stop gradient function and $\beta$ weights the second term. The first term is the reconstruction loss, while the second term is the commitment loss, which regularizes the encoder to produce vectors close to the embeddings in order to minimize the quantization error [16]. The embedding table is updated independently from the encoder and decoder by minimizing $\lVert \mathrm{sg}[z_e(x)] - e \rVert_2^2$ [17]. It should be pointed out that only the indices are required to reconstruct the input, which is what yields a high compression rate. Compression ratio can be defined as the ratio of the codeword dimension to the original data dimension [18]. For RGB images of height $H_{\mathrm{in}}$ and width $W_{\mathrm{in}}$ with 8 bits per channel, the compression rate is given by $\dfrac{3 \times 8 \times H_{\mathrm{in}} W_{\mathrm{in}}}{H \, W \log_2 K}$.
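The nearest-neighbour lookup and the two loss terms described in this subsection can be sketched in NumPy as follows. This is an illustration under stated assumptions rather than the authors' implementation: the function names are hypothetical, the demo uses a 512-entry codebook with a small embedding dimension of 16, the commitment weight of 0.25 is a commonly used value and not taken from this paper, and the loss uses means rather than sums. Because NumPy has no autograd, only the forward value of the loss is computed; the stop-gradient operator affects backpropagation only.

```python
import numpy as np

def quantize(z_e, codebook):
    """Nearest-neighbour lookup on the embedding table.

    z_e:      (H, W, D) encoder output.
    codebook: (K, D) embedding table.
    Returns the index map (H, W) and the quantized tensor (H, W, D).
    """
    flat = z_e.reshape(-1, z_e.shape[-1])                 # (H*W, D)
    # Squared Euclidean distance from every latent vector to every embedding.
    dist = (np.sum(flat ** 2, axis=1, keepdims=True)
            - 2.0 * flat @ codebook.T
            + np.sum(codebook ** 2, axis=1))              # (H*W, K)
    idx = np.argmin(dist, axis=1)
    z_q = codebook[idx].reshape(z_e.shape)
    return idx.reshape(z_e.shape[:-1]), z_q

def vq_loss_value(x, x_hat, z_e, z_q, beta=0.25):
    """Forward value of the reconstruction loss plus weighted commitment loss."""
    reconstruction = np.mean((x - x_hat) ** 2)
    commitment = np.mean((z_e - z_q) ** 2)
    return reconstruction + beta * commitment

# Demo: encoder outputs that sit very close to known codebook entries.
rng = np.random.default_rng(1)
codebook = rng.standard_normal((512, 16))
true_idx = rng.integers(0, 512, size=(8, 8))
z_e = codebook[true_idx] + 1e-3 * rng.standard_normal((8, 8, 16))

idx, z_q = quantize(z_e, codebook)
loss = vq_loss_value(z_e, z_e, z_e, z_q)   # perfect reconstruction, tiny commitment term
```

Only `idx` would need to be stored or transmitted: an 8×8 map of 9-bit indices is enough for a decoder holding the same codebook to rebuild `z_q`.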
III-B Proposed Multimodal VQVAE Model
Our proposed multimodal VQVAE model is shown in Fig. 1. It follows the same principles as the original VQVAE model. Each input modality has its own encoder and decoder, and the data of each modality pass through the corresponding encoder. In the VQ stage, each encoder's output is reshaped and flattened, and its distance to the codebook entries is computed. The mean distance is then computed across all input modalities. After the VQ stage, the quantized output serves as input to each modality's decoder, which reconstructs the data of that modality. This framework for multimodal data fusion is simple yet effective, as shown in Section IV, where we carry out various experiments on paired MNIST-SVHN data and real WiFi spectrogram data, as well as simulated 5G communication CSI feedback data.
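One possible reading of the VQ stage described above is sketched below in NumPy: per-modality distances to a shared codebook are averaged, and the entry with the smallest mean distance is selected at each latent position. This is an interpretation, not the authors' code. It assumes all encoder outputs share the same latent shape, and the function names and demo sizes are hypothetical.

```python
import numpy as np

def pairwise_sq_dist(flat, codebook):
    """Squared Euclidean distances between rows of flat (N, D) and codebook (K, D)."""
    return (np.sum(flat ** 2, axis=1, keepdims=True)
            - 2.0 * flat @ codebook.T
            + np.sum(codebook ** 2, axis=1))

def multimodal_quantize(encoder_outputs, codebook):
    """Quantize several modalities against one shared codebook.

    encoder_outputs: list of (H, W, D) arrays, one per modality, identical shapes.
    Returns a shared index map (H, W) and the quantized tensor (H, W, D).
    """
    shape = encoder_outputs[0].shape
    dists = [pairwise_sq_dist(z.reshape(-1, shape[-1]), codebook)
             for z in encoder_outputs]
    mean_dist = np.mean(dists, axis=0)          # average over modalities, (H*W, K)
    idx = np.argmin(mean_dist, axis=1)
    z_q = codebook[idx].reshape(shape)
    return idx.reshape(shape[:-1]), z_q

# Demo: two modalities whose encoder outputs are noisy views of the same codes.
rng = np.random.default_rng(2)
codebook = rng.standard_normal((512, 16))
true_idx = rng.integers(0, 512, size=(8, 8))
z_a = codebook[true_idx] + 0.05 * rng.standard_normal((8, 8, 16))
z_b = codebook[true_idx] + 0.05 * rng.standard_normal((8, 8, 16))

shared_idx, z_q = multimodal_quantize([z_a, z_b], codebook)
```

In this reading, the same `z_q` would be passed to every modality's decoder, so a single index map represents all modalities jointly.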
III-C WiFi CSI-based Sensing
WiFi CSI-based sensing systems rely on the fact that human motion within an indoor environment affects the propagation of the wireless signals transmitted by passive WiFi sensors [19], and they have been implemented for many applications. These applications include HAR [20, 21, 22], fall detection [23, 24], sign language recognition [25], gesture recognition [26, 27], occupancy detection [28], crowd counting [29], and respiration monitoring [30], among others. Specific IEEE 802.11 Network Interface Cards (NICs), such as the Intel 5300 [31] or Atheros [32], can be used to retrieve CSI data. These WiFi devices leverage Orthogonal Frequency Division Multiplexing (OFDM) at the physical layer. The CSI is represented as a 3D matrix of complex values holding information about the wireless signal characteristics, including propagation delay, amplitude attenuation, and phase shift of multiple propagation paths [10]. WiFi-based sensing is an active area of research. In future real-world large-scale applications, CSI data will be transmitted to a cloud server for computation and data recording, thereby creating new challenges for WiFi sensing in terms of reducing communication cost while simultaneously performing model inference [33].
The OPERAnet dataset [34], which contains freely accessible data from WiFi-based systems, is used in this study. The dataset also contains Kinect and ultra-wideband data. It was gathered with the goal of analysing HAR and localization methods using data from synchronised RF devices and vision-based sensors. The various sensors recorded measurements for six human activities carried out by six participants: sitting down on a chair, standing up from a chair, lying down on the floor, standing up from the floor, rotating the body, and walking. The participants performed these activities in two separate rooms at different locations. The CSI data were collected across 3 transmit and 3 receive antennas over 30 subcarriers, giving rise to 270 complex CSI values per packet, and the sampling rate was set at 1.6 kHz. The data were also captured using two synchronised WiFi CSI receivers. As a result, a substantial volume of data must be processed. The computational cost of processing such data may be reduced by using dimensionality reduction techniques like Principal Component Analysis (PCA). Additionally, the resulting data may be subjected to the Short Time Fourier Transform (STFT) to produce spectrograms that resemble those produced by Doppler radars. The interested reader can learn more about the signal processing pipeline for WiFi CSI in our earlier studies [20, 21, 22]. The conversion of raw WiFi CSI data to spectrograms can be regarded as a pre-processing step before data compression. Using the multimodal VQVAE model, the data can be further compressed and used in downstream tasks like human activity classification. Future CSI-based sensing systems will require both a compressed and discriminative feature space for sensing and recognition applications [33].
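The PCA-then-STFT idea mentioned above can be illustrated with a minimal NumPy sketch. It is not the pipeline of [20, 21, 22]: the synthetic input, the use of only the first principal component, and the window length and hop size are all assumptions chosen for illustration, and the function names are hypothetical. Only the 270 values per packet and the 1.6 kHz sampling rate come from the text.

```python
import numpy as np

def first_principal_component(csi_amp):
    """Project (num_packets, num_features) CSI amplitudes onto their first PC."""
    centered = csi_amp - csi_amp.mean(axis=0)
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[0]

def spectrogram(signal, win_len=256, hop=64):
    """Power spectrogram from a Hann-windowed STFT, shape (freq bins, time frames)."""
    window = np.hanning(win_len)
    starts = range(0, len(signal) - win_len + 1, hop)
    frames = np.stack([signal[s:s + win_len] * window for s in starts])
    return np.abs(np.fft.rfft(frames, axis=1)).T ** 2

fs = 1600                                   # packets per second, as in the dataset
t = np.arange(2 * fs) / fs                  # two seconds of synthetic data
rng = np.random.default_rng(0)

# Synthetic stand-in for CSI amplitudes: a 40 Hz component with random per-feature
# gains across 270 features, plus noise. Real CSI would replace this array.
gains = rng.uniform(0.5, 1.5, size=270)
csi_amp = (np.outer(np.sin(2 * np.pi * 40 * t), gains)
           + 0.05 * rng.standard_normal((t.size, 270)))

pc = first_principal_component(csi_amp)
spec = spectrogram(pc)
freqs = np.fft.rfftfreq(256, d=1 / fs)
peak_hz = freqs[np.argmax(spec.mean(axis=1))]
```

With a 256-sample window the frequency resolution is 6.25 Hz, so the strongest bin falls next to the injected 40 Hz component. The resulting 2D array is the kind of image-like input that a convolutional encoder can consume.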
III-D CSI Compression in Communication
Massive Multiple-Input Multiple-Output (MIMO) technology has been extensively embraced as a top-tier solution for 5G connectivity. A MIMO system can greatly reduce multi-user interference by utilising CSI at the base station [33]. To do this, the CSI is collected at the UE and sent back to the base station via a feedback communication link [35]. The CSI feedback overhead consumes a sizable portion of the uplink bandwidth, especially when there are many transmit antennas. Many studies have sought to decrease the feedback overhead of CSI encoding and decoding in a MIMO system, for example, using a LASSO L1-solver [36] or compressive sensing [37]. However, because the channel matrix is only approximately sparse, such a simple prior cannot completely recover the compressed CSI [33, 38]. In [38], an unsupervised deep learning algorithm (closely related to the autoencoder) is proposed to effectively use the channel structure from training samples in the contexts of CSI sensing and recovery. The algorithm, named CsiNet, learns to transform CSI into a near-optimal number of representations (or codewords), and vice versa (inverse transformation). Compared with existing compressed sensing (CS)-based approaches, CsiNet recovers the CSI with much better reconstruction quality. In [39], the authors introduce two new structures, ConvCsiNet (based on a CNN autoencoder) and ShuffleCsiNet (based on ConvCsiNet), for CSI compression. Both structures outperform the previously proposed CsiNet in terms of reconstruction quality as measured by the Normalized Mean Squared Error (NMSE). While ShuffleCsiNet has a lower complexity than ConvCsiNet, it remains more complex than CsiNet.
III-E CSI Feedback System
In this section, we introduce the concept of CSI feedback for a conventional 5G radio network, as illustrated in Fig. 2. Our objective is to show how our multimodal VQVAE model in Fig. 1 can be adapted to compress CSI feedback information (raw channel estimate) over a Clustered Delay Line (CDL) channel. In typical 5G radio networks, CSI parameters are values linked to a channel's status that are extracted from the channel estimate array. These parameters include the Rank Indicator (RI), Precoding Matrix Indices (PMI) with different codebook sets, and the Channel Quality Indication (CQI) [40]. In Fig. 2, the UE uses the CSI Reference Signal (CSI-RS) to measure and calculate the CSI parameters. The UE then sends the CSI parameters as feedback to the base-station (gNodeB) so that the latter can adapt the downlink data transmission in terms of MIMO precoding, number of transmission layers, code rate, and modulation scheme [40]. In order to reduce the amount of overhead in the CSI feedback data, the UE must process the raw channel estimate. Whereas the authors of [38, 39] assume a single receiving antenna at the UE, we propose a model which is applicable to MIMO contexts. In our approach, the UE compresses the channel estimate array using a multimodal VQVAE model and then feeds it back to the gNodeB. The latter then decompresses and processes the received channel estimate to schedule the downlink data transmission parameters accordingly.
III-E1 5G channel generation
We use the MATLAB 5G Toolbox to generate a 5G downlink channel, following the example from [40]. The main parameters used to generate the CDL channel are as follows: RMS delay spread of 300 ns, maximum Doppler shift of 5 Hz, 52 resource blocks each consisting of 12 subcarriers, subcarrier spacing of 15 kHz, 14 symbols per slot, 8 transmit antennas and 2 receive antennas. After simulating the channel, the perfect channel estimate matrix is represented as an $[N_{\mathrm{sc}}, N_{\mathrm{sym}}, N_{\mathrm{rx}}, N_{\mathrm{tx}}]$ array for each slot, where $N_{\mathrm{sc}}$, $N_{\mathrm{sym}}$, $N_{\mathrm{rx}}$, and $N_{\mathrm{tx}}$ correspond to the number of subcarriers, symbols, receive antennas and transmit antennas, respectively.
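To give a sense of scale, the size of the raw per-slot channel estimate implied by these parameters can be worked out directly. The sketch below is plain arithmetic on the values listed above. The variable names are hypothetical, and the 32-bit storage per real or imaginary part is an assumption, chosen to match the 4-byte float used in the compression-rate table.

```python
import math

resource_blocks = 52
subcarriers_per_rb = 12
symbols_per_slot = 14
rx_antennas = 2
tx_antennas = 8
subcarrier_spacing_hz = 15_000

n_subcarriers = resource_blocks * subcarriers_per_rb            # 624
estimate_shape = (n_subcarriers, symbols_per_slot, rx_antennas, tx_antennas)

complex_values = math.prod(estimate_shape)                      # 139776 per slot
# Assumption: real and imaginary parts are each stored as a 32-bit float.
raw_bits_per_slot = complex_values * 2 * 32

occupied_bandwidth_hz = n_subcarriers * subcarrier_spacing_hz   # 9.36 MHz

print(estimate_shape, complex_values, raw_bits_per_slot, occupied_bandwidth_hz)
```

Feeding back this array unprocessed would cost roughly 8.9 Mbit per slot under the stated assumption, which is why the estimate is pre-processed and compressed before being returned to the gNodeB.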
References
- [1] R. Piechocki, X. Wang, and M. Bocus, “Multimodal sensor fusion in the latent representation space,” Scientific Reports, vol. 13, Feb. 2023.
- [2] J. Gao, P. Li, Z. Chen, and J. Zhang, “A survey on deep learning for multimodal data fusion,” Neural Computation, vol. 32, no. 5, pp. 829–864, 2020.
- [3] M. Suzuki, K. Nakayama, and Y. Matsuo, “Joint multimodal learning with deep generative models,” Preprint at https://arxiv.org/abs/1611.01891, 2016.
- [4] M. Wu and N. Goodman, “Multimodal generative models for scalable weakly-supervised learning,” in Proceedings of the 32nd International Conference on Neural Information Processing Systems, ser. NIPS’18. Red Hook, NY, USA: Curran Associates Inc., 2018, pp. 5580–5590.
- [5] Y. Shi, S. N, B. Paige, and P. Torr, “Variational mixture-of-experts autoencoders for multi-modal deep generative models,” in Advances in Neural Information Processing Systems, H. Wallach, H. Larochelle, A. Beygelzimer, F. d'Alché-Buc, E. Fox, and R. Garnett, Eds., vol. 32. Curran Associates, Inc., 2019.
- [6] T. M. Sutter, I. Daunhawer, and J. E. Vogt, “Generalized multimodal ELBO,” Preprint at https://arxiv.org/abs/2105.02470, 2021.
- [7] H. Zou, J. Yang, H. P. Das, H. Liu, Y. Zhou, and C. J. Spanos, “WiFi and vision multimodal learning for accurate and robust device-free human activity recognition,” in 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), 2019, pp. 426–433.
- [8] R. Memmesheimer, N. Theisen, and D. Paulus, “Gimme signals: Discriminative signal encoding for multimodal activity recognition,” in 2020 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), 2020, pp. 10394–10401.
