Task-oriented Explainable Semantic Communications

Shuai Ma; Weining Qiao; Youlong Wu; Hang Li; Guangming Shi; Dahua Gao; Yuanming Shi; Shiyin Li; and Naofal Al-Dhahir

arXiv:2302.13560·eess.SP·February 28, 2023·IEEE Trans. Wirel. Commun.

Open Access

TL;DR

This paper introduces an explainable, robust semantic communication framework that enhances transmission efficiency by extracting and transmitting only task-relevant, interpretable features, supported by theoretical bounds and a practical prototype.

Contribution

It proposes a novel explainable semantic communication system integrating bit-level design, feature disentanglement, and task relevance, with theoretical analysis and a real-time prototype.

Findings

01 Significant improvement in transmission efficiency.

02 Effective feature selection enhances robustness.

03 Theoretical bounds on semantic channel capacity are established.

Abstract

Semantic communications utilize the transceiver computing resources to alleviate scarce transmission resources, such as bandwidth and energy. Although the conventional deep learning (DL) based designs may achieve certain transmission efficiency, the uninterpretability issue of extracted features is the major challenge in the development of semantic communications. In this paper, we propose an explainable and robust semantic communication framework by incorporating the well-established bit-level communication system, which not only extracts and disentangles features into independent and semantically interpretable features, but also only selects task-relevant features for transmission, instead of all extracted features. Based on this framework, we derive the optimal input for rate-distortion-perception theory, and derive both lower and upper bounds on the semantic channel capacity.…

Tables

Table 1. TABLE I: Key Notations and Meanings

Variables	Meanings
$S=\{s_{k}\}_{k=1}^{K}$	Semantic information with $K$ features
$s_{k}$	The $k$-th semantic feature
$X$	Source data
$Z=\{z_{l}\}_{l=1}^{L}$	The extracted semantic feature vector with $L$ features
$z_{l}$	The $l$-th extracted semantic feature
$\mathcal{L}$	Semantic feature index set
$\mathcal{L}_{\rm sel}$	Selected semantic feature index set
$X_{\rm s}=\{z_{l}\}_{l\in\mathcal{L}_{\rm sel}}$	Selected semantic features
$Y_{\rm s}=\{\widehat{z}_{l}\}_{l\in\mathcal{L}_{\rm sel}}$	Estimated semantic features
$\widehat{Z}$	Reconstructed feature set
$\widehat{X}$	Decoded data

Table 2. TABLE II: Key Acronyms and Meanings

Acronyms	Meanings
JSCC	Joint source-channel coding
VAE	Variational autoencoder
KL	Kullback-Leibler
ANGC	Additive non-Gaussian noise channel
ELBO	Evidence lower bound
GPU	Graphics processing unit
PSNR	Peak signal-to-noise ratio

Table 3. TABLE III: Hardware parameters of the semantic communication prototype.

GPU	500 MHz VideoCore VI
CPU	Quad-core Cortex-A72
System on Chip	Broadcom BCM2711 @ 1.5 GHz
Memory	4 GB DDR4
Wi-Fi	2.4/5.0 GHz IEEE 802.11ac wireless
Screen	$800 \times 480$ display

Table 4. TABLE IV: Transmission performance comparison over ANGC

Table 5. TABLE V: Transmission performance comparison over Rayleigh fading channel

Table 6. TABLE VI: Proposed semantic communication with feature selection

Table 7. TABLE VII: Performance of the proposed semantic communication prototype on MNIST dataset

Table 8. TABLE VIII: Performance of the proposed semantic communication prototype on CelebA dataset

Equations

$$H(S) = -\sum_{k=1}^{K} p_{\rm sou}(s_{k}) \log_{2} p_{\rm sou}(s_{k}).$$

$$p_{\rm data}(x) = \sum_{s_{1},\ldots,s_{K}} p_{\rm s2d}\left(x \mid \{s_{k}\}_{k=1}^{K}\right) \prod_{k=1}^{K} p_{\rm sou}(s_{k}).$$

$$H(X) = -\sum_{x} p_{\rm data}(x) \log_{2} p_{\rm data}(x).$$

$$H(X) = H(S) + H(X \mid S) - H(S \mid X).$$

$$p_{\rm fea}(z) = \sum_{x} p_{\rm d2f}(z \mid x)\, p_{\rm data}(x).$$

$$p_{\rm fea}(z) = \prod_{l=1}^{L} p_{\rm fea}(z_{l}),$$

$$X_{\rm s} = \{z_{l}\}_{l \in \mathcal{L}_{\rm sel}}.$$

$$x_{\rm b} = {\rm Quan}(x_{\rm s}),$$

$$H(x_{\rm b}) \leq \dim(x_{\rm b}) \log_{2} M.$$

$$p_{N_{\rm Q}}(x) = \frac{1}{b-a}, \quad a \leq x \leq b,$$

$$p_{N_{\rm P}}(x) = \frac{1}{\sigma_{\rm P}\sqrt{2\pi}} \exp\left(-\frac{x^{2}}{2\sigma_{\rm P}^{2}}\right).$$

$$p_{\rm rdata}(\widehat{x}) = \sum_{\widehat{z}} p_{\rm f2d}(\widehat{x} \mid \widehat{z})\, p_{\rm rfea}(\widehat{z}).$$

$$p_{\rm des}(\widehat{s}) = p_{\rm d2s}(\widehat{s} \mid \widehat{x})\, p_{\rm rdata}(\widehat{x}).$$

$$\begin{aligned}
R(D,P) = \min_{q(\widehat{x} \mid x)}\ & I(X;\widehat{X}) \\
\text{s.t.}\ & \mathbb{E}\left[\Delta(x,\widehat{x})\right] \leq D, \\
& d\big(p(x), r(\widehat{x})\big) \leq P, \\
& \sum_{\widehat{x}} q(\widehat{x} \mid x) = 1,\ \forall x \in \mathcal{X}.
\end{aligned}$$

$$q^{*}(\widehat{x} \mid x) = \frac{r(\widehat{x})}{\widetilde{\gamma}(x)} \exp\left(\mu \frac{p(x)}{r(\widehat{x})} - \alpha (x-\widehat{x})^{2}\right),$$

$$r^{*}(\widehat{x}) = \sum_{x} p(x)\, q(\widehat{x} \mid x).$$

$$C_{\rm s} = \max_{p(x_{\rm s})} I(X_{\rm s}; Y_{\rm s}).$$

$$Y_{\rm s}$$

$$\overline{Y}_{\rm s} = X_{\rm s} + \overline{N}_{\rm s}.$$

$$C_{\rm s,eq} = \frac{1}{2} \log\left(1 + \frac{P_{x_{\rm s}}}{\sigma_{\rm s}^{2}}\right),$$

$$C_{\rm s,eq} \leq C_{\rm s} \leq C_{\rm s,eq} + d_{\rm KL}\left(p_{n_{\rm s}}(x), p_{\overline{n}_{\rm s}}(x)\right),$$

$$\max_{\phi,\theta} \log p_{\theta}(x).$$

$$\begin{aligned}
\log p_{\theta}(x) &= \int_{z} q_{\phi}(z \mid x) \log \frac{p_{\theta}(z,x)}{p_{\theta}(z \mid x)}\, dz \\
&= \int_{z} q_{\phi}(z \mid x) \log \frac{p_{\theta}(z,x)}{q_{\phi}(z \mid x)}\, dz + \int_{z} q_{\phi}(z \mid x) \log \frac{q_{\phi}(z \mid x)}{p_{\theta}(z \mid x)}\, dz \\
&= \int_{z} q_{\phi}(z \mid x) \log \frac{p_{\theta}(z,x)}{q_{\phi}(z \mid x)}\, dz + d_{\rm KL}\left(q_{\phi}(z \mid x) \,\|\, p_{\theta}(z \mid x)\right) \\
&\geq \int_{z} q_{\phi}(z \mid x) \log \frac{p_{\theta}(z,x)}{q_{\phi}(z \mid x)}\, dz \\
&= \int_{z} q_{\phi}(z \mid x) \log \frac{p_{\theta}(x \mid z)\, p_{\theta}(z)}{q_{\phi}(z \mid x)}\, dz \\
&= \int_{z} q_{\phi}(z \mid x) \log p_{\theta}(x \mid z)\, dz + \int_{z} q_{\phi}(z \mid x) \log \frac{p_{\theta}(z)}{q_{\phi}(z \mid x)}\, dz \\
&= \mathbb{E}_{q_{\phi}(z \mid x)}\left[\log p_{\theta}(x \mid z)\right] - d_{\rm KL}\left(q_{\phi}(z \mid x) \,\|\, p_{\theta}(z)\right)
\end{aligned}$$

$$\max_{\phi,\theta}\ \mathbb{E}_{q_{\phi}(z \mid x)}\left[\log p_{\theta}(x \mid z)\right] - \beta\, d_{\rm KL}\left(q_{\phi}(z \mid x) \,\|\, p_{\theta}(z)\right),$$

$$c_{\eta}\left(p(x) \,\|\, p_{\theta}(x \mid \widehat{z})\right) = -\frac{\eta+1}{\eta} \int p(x)^{\eta}\, dx + \int p_{\theta}(x \mid \widehat{z})^{1+\eta}\, dx.$$


Taxonomy

Topics: COVID-19 diagnosis using AI · Wireless Signal Modulation Classification · Anomaly Detection Techniques and Applications

Full text

Task-oriented Explainable Semantic Communications

Shuai Ma, Weining Qiao, Youlong Wu, Hang Li, Guangming Shi, Dahua Gao, Yuanming Shi, Shiyin Li, and Naofal Al-Dhahir

Shuai Ma is with Pengcheng Laboratory, Shenzhen, 518066, China (e-mail: [email protected]).

Abstract

Semantic communications utilize the transceiver computing resources to alleviate scarce transmission resources, such as bandwidth and energy. Although the conventional deep learning (DL) based designs may achieve certain transmission efficiency, the uninterpretability issue of extracted features is the major challenge in the development of semantic communications. In this paper, we propose an explainable and robust semantic communication framework by incorporating the well-established bit-level communication system, which not only extracts and disentangles features into independent and semantically interpretable features, but also only selects task-relevant features for transmission, instead of all extracted features. Based on this framework, we derive the optimal input for rate-distortion-perception theory, and derive both lower and upper bounds on the semantic channel capacity. Furthermore, based on the $\beta$ -variational autoencoder ( $\beta$ -VAE), we propose a practical explainable semantic communication system design, which simultaneously achieves semantic features selection and is robust against semantic channel noise. We further design a real-time wireless mobile semantic communication proof-of-concept prototype. Our simulations and experiments demonstrate that our proposed explainable semantic communications system can significantly improve transmission efficiency, and also verify the effectiveness of our proposed robust semantic transmission scheme.

Index Terms:

Explainable semantic communications, feature selection, semantic communications prototype

I Introduction

With the advent of augmented reality (AR), virtual reality (VR), holographic communications, autonomous vehicular networks, and industrial Internet of Things (IIoT), it is envisioned that existing networks may soon reach a resource bottleneck due to stringent requirements [1, 2], such as ultra-high data rate, ultra-reliability, and low latency. To meet the above-mentioned requirements, investigations on the sixth generation communications (6G) are well underway and promise more powerful capacities than the fifth-generation communications (5G) [3]. From the first generation communications (1G) to 5G, the communication networks primarily focus on finding new resources and technologies to expand the channel capacity[4]. One approach is to seek the usage of large bandwidth, such as terahertz (THz) communications and visible light communication (VLC). Another approach is to explore the spatial domain, like ultra-massive MIMO and intelligent metasurfaces. However, given the hardware and physical limitations, the channel capacity may not keep increasing at the rate we desire to satisfy the aforementioned beyond-5G applications [5, 6].

In recent years, semantic communications, in which only task-relevant information is extracted and transmitted to the receiver, have received increasing attention from both academia and industry [7, 8, 9, 10, 11, 12, 13]. Rather than increasing the channel capacity as conventional techniques do, semantic communications exploit the computing power at the transceivers to reduce the cost of transmission resources. Classic Shannon information theory focuses on “How accurately can the symbols be transmitted?” and ignores the meaning of the transmitted messages. Semantic communications [14] instead consider “How precisely do the transmitted symbols convey the desired meaning?” Thus, it is possible to improve system efficiency at the semantic level, not only at the bit level.

The classical separation theorem [15] states that, as the data size goes to infinity, separating source coding and channel coding can achieve the optimal performance over a memoryless communication channel. However, when only a finite number of bits is transmitted, the performance of such a separated structure degrades. This issue also arises in semantic communications. Various deep learning (DL) based joint source-channel coding (JSCC) schemes have been investigated for text [12, 16, 17], image [18, 19, 20, 21, 22, 23], speech [24, 25], and multimodal data [26] transmission. Specifically, for text semantic transmission, JSCC schemes have been designed by exploiting architectures such as the recurrent neural network (RNN) [16], Transformer [17, 27], autoencoder (AE) [28], adaptive Universal Transformer [29], and deep neural network (DNN) [30]. For image semantic transmission, a masked autoencoder (MAE) architecture with Transformer was designed in [18] to combat adversarial sample noise. Convolutional neural network (CNN) based JSCC schemes were designed for time-invariant and fading wireless channels in [19]. Neural error correcting and source trimming (NECST) codes were studied in [22]. For finite bit transmission, an attention DL based JSCC method was designed in [23]. By exploiting channel output feedback, an AE-based JSCC scheme was developed in [20] to improve the quality of image transmission. By combining an AE with orthogonal frequency division multiplexing (OFDM), a JSCC wireless image transmission scheme over multipath fading channels was presented in [21]. By leveraging reinforcement learning (RL), a joint semantics-noise coding (JSNC) mechanism was designed in [31]. A DNN based JSCC scheme was designed in [32] for adaptive rate control in wireless image transmission. Based on an AE, an SNR-adaptive deep JSCC scheme was proposed in [33] for multi-user wireless image transmission. To tackle the variational information bottleneck, the authors in [34] investigated task-oriented communication for edge inference, where a low-end edge device extracts the feature vector of a local data sample and transmits it to a powerful edge server for processing. In addition, for speech semantic transmission, an AE based wave-to-vector architecture and a squeeze-and-excitation (SE) attention network were developed in [24] and [25], respectively. For visual question answering, a memory-attention-composition neural network was designed in [26] for multi-modal data semantic communications.

However, most of the existing works on semantic communications [16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26] are based on DL techniques, in which the DL model is basically a black box. Thus, the extracted semantic feature vectors in these works are unexplainable (hidden) representations, and the uninterpretability of the extracted features restricts further processing and exploitation of semantic features. For example, due to the uninterpretability, the unintended features will also be transmitted to the receiver, which wastes transmission resources and reduces the efficiency of semantic communications.

Moreover, most of the existing semantic communication investigations [16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26] completely redesign the source and channel modules of the conventional system, which is impractical and incompatible with existing communication networks. Because a large number of practical standards and hardware already exist for the 5G physical layer, replacing physical-layer techniques with DL-based semantic JSCC techniques would waste substantial resources and incur high costs. Therefore, how to design efficient, practical, and 5G-compatible semantic communications is a critical issue.

To address the above two key challenges of semantic communications, we propose an explainable and robust semantic communication framework in this paper, which is compatible with existing communication systems. We show that the proposed framework can achieve a higher transmission efficiency than existing unexplainable semantic communication systems. The main contributions of this paper are summarized as follows:

•

We propose an explainable and easy-to-implement semantic communication framework based on bit-level communication systems, which includes a novel semantic encoder together with the corresponding decoder, feature selection, and semantic channel. The innovation of the proposed framework is threefold: i) The semantic encoder/decoder aims not only to extract independent and explainable semantic information, acting as semantic source coding, but also to alleviate the ambiguity of the semantic information caused by quantization and channel noise, acting as semantic channel coding; ii) The feature selection module follows the semantic encoder and chooses only the task-relevant features for transmission, which further reduces the transmission load; iii) The framework has an explicit definition of the semantic channel, which incorporates the key modules of bit-level communication systems. Specifically, the semantic channel takes both quantization error (or noise) and physical channel noise into account, since these noise sources may lead to semantic information ambiguity, and it encapsulates the conventional bit-level communication system, which implies that the proposed framework can be implemented more easily than JSCC schemes.

•

Then, we propose two information-theoretic metrics for our semantic communication framework. For the information compression of the semantic encoder, we derive the optimal distribution of the reconstruction signal of the rate-distortion-perception function for semantic information extraction. Moreover, to quantify the semantic information transmission, we derive both upper and lower bounds on the semantic channel capacity, which are shown to be tight when the quantization noise tends to zero.

•

Based on our framework, we further propose a feasible design of the explainable semantic communication system. Specifically, this design includes a robust, lightweight $\beta$ -VAE unsupervised learning network, where a weighting parameter $\beta$ is applied to the Kullback-Leibler (KL) divergence term of the variational autoencoder (VAE) loss function to make the latent representations effectively disentangled. Moreover, to enhance transmission robustness, semantic channel noise is added to the extracted features during training of the semantic networks.

•

Finally, we implement the above semantic communication design as a wireless mobile semantic communication proof-of-concept prototype. Using the portable Raspberry Pi 4 Model B and Wi-Fi, the developed prototype can run the proposed robust $\beta$ -VAE semantic system in real time. Our experiments demonstrate that the proposed semantic communication system achieves better performance than existing benchmarks.

The rest of this paper is organized as follows. The explainable semantic communications framework is presented in Section II. Section III provides the information-theoretic metrics of semantic communications. In Section IV, we propose a $\beta$ -VAE based robust and explainable semantic communications system. In Section V, we present the semantic communication system prototype design and implementation. In Section VI, we evaluate the proposed explainable semantic communication system. Finally, we conclude the paper in Section VII. Tables I and II list the meanings of the key notations and key acronyms used in this paper, respectively.

II Explainable Semantic Communication Framework

Most existing studies replace the traditional source coding and channel coding modules with deep learning-based joint source-channel coding, which greatly changes the structure of existing communication systems. In this paper, we propose a semantic communication framework incorporating the key modules of the conventional communication system (e.g., 5G).

As shown in Fig. 1, the proposed explainable semantic communication framework includes a semantic source, sender knowledge base, semantic encoder, semantic channel, receiver knowledge base, semantic decoder, and semantic destination. Note that the proposed framework introduces semantic-level transmission on top of bit-level transmission. Such a framework does not require redesigning existing physical-layer standards, protocols, and products, which makes the application of semantic communications more practical. Next, we describe each module in detail.

II-A Knowledge Bases

The knowledge base contains all the necessary information that can facilitate the communication at the semantic level. Specifically, the knowledge base includes background knowledge and training dataset. The background knowledge is used to facilitate the semantic feature extraction and selection in the semantic transmitter. The training dataset is used for training the parameters of the semantic encoder and decoder. The sender may choose different semantic knowledge bases according to different tasks, scenarios and recipients. For example, when the communication is triggered between people in different countries, it may be necessary to sample multiple language databases. In general, the sender and the receiver share some common knowledge, which may act as a special kind of side information to improve coding efficiency.

II-B Semantic Sources

The semantic source produces original data, such as pictures, videos, voices, and texts. The generated data contains certain semantic information to be shared with the semantic destination. Assume that the semantic information includes $K$ features $S=\left\{{{s_{k}}}\right\}_{k=1}^{K}\sim{p_{{\rm{sou}}}}\left(s\right)$ (data generative factors), where ${s_{k}}$ denotes the $k$ th semantic feature, and the joint probability distribution is ${p_{{\rm{sou}}}}\left({{s}}\right)$ . Further, assume that the $K$ features $\left\{{{{s}_{k}}}\right\}_{k=1}^{K}$ are independent, i.e., ${p_{{\rm{sou}}}}\left({{s}}\right)=\prod\limits_{k=1}^{K}{{p_{{\rm{sou}}}}\left({{s_{k}}}\right)}$ , where ${{p_{{\rm{sou}}}}\left({{s_{k}}}\right)}$ denotes the probability distribution of ${{s_{k}}}$ . Thus, the entropy of the semantic source is given as

$$H(S) = -\sum_{k=1}^{K} p_{\rm sou}(s_{k}) \log_{2} p_{\rm sou}(s_{k}). \tag{1}$$

The semantic source can generate the data $X\sim{p_{{\rm{data}}}}\left(x\right)$ , which can be images, text, sound or video. Generally, the generated data need to include both the intended features and some redundant features to make the whole semantic data complete. Thus, the data generation is defined as ${p_{{\rm{s2d}}}}\left({{{x}}|\left\{{{{s}_{k}}}\right\}_{k=1}^{K}}\right)$ , and the probability distribution function (PDF) of data $X$ is given as

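$$p_{\rm data}(x) = \sum_{s_{1},\ldots,s_{K}} p_{\rm s2d}\left(x \mid \{s_{k}\}_{k=1}^{K}\right) \prod_{k=1}^{K} p_{\rm sou}(s_{k}). \tag{2}$$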

The entropy of the semantic data $x$ is given as

$$H(X) = -\sum_{x} p_{\rm data}(x) \log_{2} p_{\rm data}(x). \tag{3}$$

Based on (1) and (2), $H\left({{X}}\right)$ can be further expressed as

$$H(X) = H(S) + H(X \mid S) - H(S \mid X). \tag{4}$$
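The relations (1)-(4) can be checked numerically on a small discrete example. The sketch below is only illustrative: the two binary features, the three-symbol data alphabet, and the matrix `p_s2d` are hypothetical values chosen for the demonstration, not quantities from the paper.

```python
import numpy as np

def entropy(p):
    """Shannon entropy in bits of a probability array (zero entries are skipped)."""
    p = np.asarray(p, dtype=float).ravel()
    p = p[p > 0]
    return float(-(p * np.log2(p)).sum())

# Hypothetical toy source: K = 2 independent binary features, so S has 4 states.
p_s1 = np.array([0.5, 0.5])
p_s2 = np.array([0.9, 0.1])
p_sou = np.outer(p_s1, p_s2).ravel()      # p_sou(s) = prod_k p_sou(s_k)

# Hypothetical data generation p_s2d(x | s): 4 semantic states -> 3 data symbols.
p_s2d = np.array([[0.8, 0.1, 0.1],
                  [0.1, 0.8, 0.1],
                  [0.1, 0.1, 0.8],
                  [0.3, 0.3, 0.4]])

joint = p_sou[:, None] * p_s2d            # joint distribution p(s, x)
p_data = joint.sum(axis=0)                # Eq. (2): marginalise over s

H_S = entropy(p_sou)                      # Eq. (1)
H_X = entropy(p_data)                     # Eq. (3)
H_joint = entropy(joint)
H_X_given_S = H_joint - H_S               # chain rule
H_S_given_X = H_joint - H_X               # chain rule

rhs = H_S + H_X_given_S - H_S_given_X     # right-hand side of Eq. (4)
print(f"H(S)={H_S:.4f}  H(X)={H_X:.4f}  rhs of (4)={rhs:.4f}")
```

Because the conditional entropies are obtained from the chain rule, the agreement between `H_X` and `rhs` is expected by construction; the example mainly shows how the independence assumption on the features makes $H(S)$ the sum of the per-feature entropies.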

II-C Semantic Encoder

Based on the knowledge base, the generated message $X$ will be processed by the semantic encoder, which is a joint semantic source and channel encoder. More specifically, it extracts semantic information or the semantic features of the message $X$ , and outputs the disentangled and explainable features $Z$ , which can be viewed as a semantic source encoder. On the other hand, in order to reduce the ambiguity incurred by the quantization error and channel noise, the semantic encoder needs to improve the robustness against the semantic channel noise, which can be viewed as a semantic channel encoder.

The semantic encoder extracts a low-dimensional semantic features vector $Z\sim{p_{{\rm{fea}}}}\left(z\right)$ from the data ${X}$ . Let ${p_{{\rm{d2f}}}}\left({z|x}\right)$ denote the conditional PDF of the feature $z$ given data $x$ . Thus, the PDF of the extracted feature (sub-vectors) ${p_{{\rm{fea}}}}\left(z\right)$ is given as

$$p_{\rm fea}(z) = \sum_{x} p_{\rm d2f}(z \mid x)\, p_{\rm data}(x). \tag{5}$$

The encoder is required to regulate the extracted features into $L$ independent features $Z=\left\{{{z_{l}}}\right\}_{l=1}^{L}$ , which satisfy

$$p_{\rm fea}(z) = \prod_{l=1}^{L} p_{\rm fea}(z_{l}), \tag{6}$$

where $p_{\rm fea}\left({{{z}_{l}}}\right)$ denotes the PDF of feature ${{z}_{l}}$ . In summary, the extracted feature vector $Z$ is required to consist of $L$ disentangled, interpretable semantic features $\left\{{{{{z}}_{l}}}\right\}_{l=1}^{L}$ , whose corresponding neural network outputs are explainable and understandable by humans. For convenience, we let ${\mathcal{L}}\triangleq\left\{{1,...,L}\right\}$ denote the index set of the disentangled semantic features. Note that such a requirement can be met if the semantic encoder is carefully designed. In Section IV, we will introduce a feasible system design that has this capability.

II-D Feature Selection

It should be noted that the obtained features $Z$ could contain more information than the receiver is interested in. Thus, after extracting the disentangled features $\{z_{l}\}_{l=1}^{L}$ , only the subset of these features that is of interest to the receiver should be transmitted, and the remaining features can be viewed as “redundancy”. We discuss this issue further via experiments in Section VI.

Given the task requirement, let ${\mathcal{L}_{{\rm{sel}}}}\subseteq{\mathcal{L}}$ denote the selected feature index set, then the selected set of features is given as

$$X_{\rm s} = \{z_{l}\}_{l \in \mathcal{L}_{\rm sel}}. \tag{7}$$

Thus, feature selection will reduce the amount of data sent, and the corresponding reduction is ${\left\{{{z_{l}}}\right\}_{l\in{\cal L}\backslash{{\cal L}_{{\rm{sel}}}}}}$ . Then, ${X_{\rm{s}}}$ will be sent to the semantic channel.
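As a minimal sketch of this selection step, assuming the disentangled features are stored as the last axis of an array and using zero-based indices (the paper's index set starts at 1), the function name `select_features` and the example values are hypothetical:

```python
import numpy as np

def select_features(z, selected_idx):
    """Return X_s = {z_l : l in L_sel} and the list of indices that are dropped."""
    L = z.shape[-1]
    selected_idx = sorted(set(selected_idx))
    if any(l < 0 or l >= L for l in selected_idx):
        raise ValueError("selected index outside the feature index set")
    dropped_idx = [l for l in range(L) if l not in selected_idx]
    return z[..., selected_idx], dropped_idx

# Hypothetical example: L = 10 extracted features, 3 of them task-relevant.
z = np.arange(10, dtype=float)
x_s, dropped = select_features(z, [0, 3, 7])
print(x_s, dropped)
```

Only `x_s` is handed to the semantic channel; the dropped indices correspond to the set $\mathcal{L}\backslash\mathcal{L}_{\rm sel}$ that the receiver later fills in locally.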

II-E Semantic Channel

After the feature selection module, the task-oriented features are selected and ready to send. Since the quantization error and channel noise both could incur semantic information ambiguity, we define the semantic channel with channel law $p({y_{s}|x_{s}})$ as a virtual channel including the signal quantizer and the bit-level communication system, as shown in Fig. 1. Here, ${Y_{\rm{s}}}=\{{\widehat{z}}_{l}\}_{l\in{{\cal L}_{{\rm{sel}}}}}\sim{p_{{\rm{r}}}}\left({{y_{\rm{s}}}}\right)$ represents the set of estimated features after the transmissions over the semantic channel, and ${\widehat{z}}_{l}$ denotes the estimated feature of $z_{l}$ .

Generally, the semantic noise could include various factors including source errors, feature extraction errors, knowledge base ambiguities, adversarial injections, quantization noise, physical channel noise, etc. In our framework, the semantic channel noise $N_{s}$ is the distortion between the selected semantic feature ${X_{\rm{s}}}={\left\{{{z_{l}}}\right\}_{l\in{\mathcal{L}_{\rm{sel}}}}}$ and the estimated semantic feature ${Y_{\rm{s}}}=\{{\widehat{z}}_{l}\}_{l\in{{\cal L}_{{\rm{sel}}}}}$ , which mainly depends on the quantization noise and physical channel noise.

II-E1 Quantization Noise

The quantization noise is caused by the traditional communication operation modules, such as the source encoder (or decoder) or channel encoder (or decoder), and may also lead to semantic ambiguity. To reduce the number of transmitted bits, the semantic feature ${{{x}}_{\rm{s}}}$ is converted to a compressible binary stream using a small number of bits. To represent ${{{x}}_{\rm{s}}}$ with a finite number of bits, we need to map it to a discrete space. Specifically, a finite quantizer maps the semantic feature ${{{x}}_{\rm{s}}}$ to ${{{x}}_{\rm{b}}}$ , whose values are quantized to $M$ levels $C=\left\{{{c_{1}},...,{c_{M}}}\right\}$ , i.e.,

$$x_{\rm b} = {\rm Quan}(x_{\rm s}), \tag{8}$$

where ${\rm{Quan}}\left(\cdot\right)$ is a quantization operator. Since the number of dimensions $\dim\left({{{{x}}_{\rm{b}}}}\right)$ and the number of levels $M$ are finite, the entropy of the quantized semantic data is bounded as

$$H(x_{\rm b}) \leq \dim(x_{\rm b}) \log_{2} M. \tag{9}$$

In this paper, we consider uniformly distributed quantization noise ${{{N}}_{\rm{Q}}}$ , i.e.,

$$p_{N_{\rm Q}}(x) = \frac{1}{b-a}, \quad a \leq x \leq b, \tag{10}$$

where $a$ and $b$ are the lower and upper bounds of quantization noise ${{{N}}_{\rm{Q}}}$ .
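To make the quantization step concrete, the following sketch uses a uniform mid-point quantizer as one possible choice of ${\rm Quan}(\cdot)$. The range $[-1, 1]$, the number of levels $M = 8$, and the synthetic features are assumptions made for illustration. For this particular quantizer the noise bounds are $a = -\Delta/2$ and $b = \Delta/2$, where $\Delta$ is the spacing between adjacent levels.

```python
import numpy as np

def uniform_quantize(x_s, M, lo=-1.0, hi=1.0):
    """Map each entry of x_s to the nearest of M evenly spaced levels in [lo, hi]."""
    levels = np.linspace(lo, hi, M)               # C = {c_1, ..., c_M}
    step = (hi - lo) / (M - 1)
    idx = np.rint((np.clip(x_s, lo, hi) - lo) / step).astype(int)
    return levels[idx], idx

rng = np.random.default_rng(0)
x_s = rng.uniform(-1.0, 1.0, size=5000)           # hypothetical selected features
M = 8
x_b, idx = uniform_quantize(x_s, M)

n_q = x_b - x_s                                   # quantization noise samples
step = 2.0 / (M - 1)                              # spacing between adjacent levels

# Empirical per-dimension entropy of the quantizer output, in bits.
counts = np.bincount(idx, minlength=M)
p = counts[counts > 0] / counts.sum()
H_emp = float(-(p * np.log2(p)).sum())
print(f"max |n_q| = {np.abs(n_q).max():.4f}, step/2 = {step / 2:.4f}")
print(f"empirical entropy = {H_emp:.3f} bits, bound log2(M) = {np.log2(M):.3f} bits")
```

The per-dimension empirical entropy stays below $\log_2 M$, which is the single-dimension case of the bound in (9).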

II-E2 Physical Channel Noise

The physical channel noise exists ubiquitously in physical communications and is caused by physical channel impairments, such as additive white Gaussian noise (AWGN), interference, etc. It is noted that the errors caused by channel propagation usually occur before channel decoding and can be corrected by channel decoding. Assume that the physical channel noise ${{N}}_{\rm{P}}$ follows a Gaussian distribution with zero-mean and variance $\sigma_{\rm{P}}^{2}$ , i.e.,

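$$p_{N_{\rm P}}(x) = \frac{1}{\sigma_{\rm P}\sqrt{2\pi}} \exp\left(-\frac{x^{2}}{2\sigma_{\rm P}^{2}}\right). \tag{11}$$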

II-F Feature Completion

After obtaining the estimated features ${{{Y}}_{\rm{s}}}={\left\{{{{\widehat{z}}_{l}}}\right\}_{l\in{{\cal L}_{{\rm{sel}}}}}}$ through the semantic channel transmission, the destination will use the estimated features and side information in the knowledge base, to compute the target function of the task. Although the unintended features subset ${\left\{{{z_{l}}}\right\}_{l\in\mathcal{L}\backslash{\mathcal{L}_{{\rm{sel}}}}}}$ are not transmitted, the receiver may generate the corresponding unintended features ${\left\{{{{\widehat{z}}_{l}}}\right\}_{l\in\mathcal{L}\backslash{\mathcal{L}_{{\rm{sel}}}}}}$ by exploiting the knowledge base. Then, by combining intended features ${\left\{{{{\widehat{z}}_{l}}}\right\}_{l\in{\mathcal{L}_{{\rm{sel}}}}}}$ and unintended features ${\left\{{{{\widehat{z}}_{l}}}\right\}_{l\in\mathcal{L}\backslash{\mathcal{L}_{{\rm{sel}}}}}}$ , we may obtain the completed semantic features $\widehat{Z}={\left\{{{{\hat{z}}_{l}}}\right\}_{l\in\mathcal{L}}}$ with distribution ${{p_{{\rm{rfea}}}}\left({\widehat{z}}\right)}$ . For example, considering a semantic communication system for staff clothing image transmission, the intended semantic features of the receiver are clothing features, and the receiver is not interested in the staff’s gender, skin color, and hairstyle. Therefore, the receiver can generate unintended semantic features based on the shared knowledge base, such as the staff’s gender, skin color and hairstyle. Note that, the generated unintended semantic features at the receiver may be different from the corresponding features of the image at the transmitter. Then, the receiver combines the received clothing features with its own generated unintended features.
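The completion step can be sketched as follows. This is an illustration under a strong simplifying assumption: the receiver's knowledge base is summarised by a standard normal prior per feature, which stands in for whatever generative side information is actually available. The names `complete_features` and `prior_sampler` are hypothetical.

```python
import numpy as np

def complete_features(y_s, selected_idx, L, prior_sampler):
    """Build Z_hat: received estimates at L_sel, locally generated values elsewhere."""
    z_hat = prior_sampler(L)                  # receiver-side guess for every feature
    z_hat[list(selected_idx)] = y_s           # overwrite with the received estimates
    return z_hat

rng = np.random.default_rng(1)
# Assumption: the knowledge base reduces to a standard normal prior per feature.
prior = lambda n: rng.standard_normal(n)

selected = [0, 3, 7]                          # zero-based version of L_sel
y_s = np.array([0.2, -1.1, 0.5])              # hypothetical estimated features
z_hat = complete_features(y_s, selected, L=10, prior_sampler=prior)
print(z_hat)
```

As the text notes, the locally generated entries generally differ from the transmitter's values for those features; only the entries indexed by $\mathcal{L}_{\rm sel}$ carry transmitted information.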

II-G Semantic Decoder

The semantic decoder aims to recover the data from the disentangled, semantically explainable features $\widehat{Z}$ , and acts as the inverse of the semantic encoder. Again, this inverse mapping needs the help of the knowledge base for model training so that the decoder can “understand” the features $\widehat{Z}$ .

Similar to the encoding process, we use conditional PDF ${p_{{\rm{f2d}}}}\left({\widehat{x}|\widehat{{z}}}\right)$ to describe the semantic decoding process. The PDF of the decoded data is ${p_{{\rm{rdata}}}}\left({\widehat{x}}\right)$ , and the decoded data is ${\widehat{{x}}}$ . The data reconstruction for a given feature vector is given as

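$$p_{\rm rdata}(\widehat{x}) = \sum_{\widehat{z}} p_{\rm f2d}(\widehat{x} \mid \widehat{z})\, p_{\rm rfea}(\widehat{z}). \tag{12}$$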

II-H Semantic Destination

Finally, the receiver recovers the semantic information based on the decoded data ${\widehat{{X}}}$ , and the corresponding process can be described by ${p_{{\rm{d2s}}}}\left({\widehat{{s}}|\widehat{x}}\right)$ , where the final semantic information is denoted by ${\widehat{s}}$ . At last, the probability of such semantic information can be written as

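$$p_{\rm des}(\widehat{s}) = p_{\rm d2s}(\widehat{s} \mid \widehat{x})\, p_{\rm rdata}(\widehat{x}). \tag{13}$$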

So far, we have presented the complete semantic communication framework. The key modules are the semantic encoder and the feature selection module, whose functions can be realized through careful model design. We will present a detailed system design in Section IV, which is a feasible realization of this framework.

III Information-Theoretic Metrics of Semantic Communications

In this section, we propose two metrics for the framework illustrated by Fig. 1. Here, we focus on two procedures: the encoding and the transmission.

III-A Rate-Distortion-Perception Function

The semantic encoding may include many different tasks, and these tasks may have relevant or different criteria. For example, there is data distortion for the traditional data reconstruction task, and distribution distortion for generative learning tasks.

Let $p(x)$ be the distribution of the input source, $r(\widehat{x})$ be the distribution of the reconstruction signal, and $q(\widehat{x}|x)$ be a conditional distribution on $\mathcal{X}\times{\mathcal{X}}$ . The information rate-distortion-perception function $R(D,P)$ [35] for a source $X\sim p(x)$ is defined as

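$$\begin{aligned}
R(D,P) = \min_{q(\widehat{x} \mid x)}\ & I(X;\widehat{X}) \\
\text{s.t.}\ & \mathbb{E}\left[\Delta(x,\widehat{x})\right] \leq D, \\
& d\big(p(x), r(\widehat{x})\big) \leq P, \\
& \sum_{\widehat{x}} q(\widehat{x} \mid x) = 1,\ \forall x \in \mathcal{X},
\end{aligned} \tag{14}$$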

where the distortion function $\Delta:\mathcal{X}\times{\mathcal{X}}\to\mathbf{R}^{+}$ satisfies $\Delta(x,\widehat{x})=0$ if $x=\widehat{x}$ , and the perception function $d\big(p(x),r(\widehat{x})\big)$ is a non-negative divergence between the probability distributions $p(x)$ and $r(\widehat{x})$ that satisfies $d(p,r)=0$ if $p(x)=r(x)$ .

For a general source, the optimal distribution of the reconstruction signal $r(\widehat{x})$ in problem (14) has not yet been derived. For a binary source, the three-way tradeoff between rate, distortion, and perception was investigated in [35] with Hamming distance distortion and total-variation distance perception. For a Gaussian source, the achievable distortion-perception region was established in [36] under squared error distortion and squared Wasserstein-2 distance.

Hence, we investigate how to find the optimal distributions for $R({D,P})$ for a general source under the mean square distortion (i.e., $\Delta(x,\widehat{x})=|x-\widehat{x}|^{2}$ ) and KL divergence perception (i.e., ${d}\left({p\left(x\right),r({\widehat{x}})}\right)={d}_{\text{KL}}\left({p\left(x\right),r({\widehat{x}})}\right)\triangleq\sum_{x}p\left(x\right)\log\frac{p(x)}{r({x})}$ ). We first introduce the following lemma.

Lemma 1.

Consider the mean square distortion (i.e., $\Delta(x,\widehat{x})=|x-\widehat{x}|^{2}$ ) and KL divergence perception (i.e., ${d}\left({p\left(x\right),r({\widehat{x}})}\right)=\sum_{x}p\left(x\right)\log\frac{p(x)}{r({x})}$ ). The corresponding optimal distribution $q^{*}(\widehat{x}|x)$ to problem (14) for a given output distribution $r(\widehat{x})>0$ is

[TABLE]

where $\widetilde{\gamma}\left(x\right)=\sum\limits_{\widehat{x}}{r({\widehat{x}})\exp\left({\mu\frac{{p\left(x\right)}}{{r({\widehat{x}})}}-\alpha{{\left({x-\widehat{x}}\right)}^{2}}}\right)}$ . The corresponding optimal distribution $r^{*}(\widehat{x})$ to (14) for a given conditional distribution $q(\widehat{x}|x)>0$ is

[TABLE]

Proof: Please find the proof in Appendix A.∎

Using Lemma 1, we can apply an alternating minimization procedure in the style of the Blahut–Arimoto algorithm [37]. Specifically, in the initialization step, choose positive values $\alpha,\mu$ and an initial output distribution ${r}^{(0)}(\widehat{x})$ . In each iteration $k$ , compute the optimal $q^{(k)}(\widehat{x}|x)$ according to (15) for the given $r^{(k-1)}(\widehat{x})$ , and then compute the optimal $r^{(k)}(\widehat{x})$ according to (16) for the given $q^{(k)}(\widehat{x}|x)$ .
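The alternating updates can be sketched on a finite alphabet as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the $q$-update is taken to be $r(\widehat{x})\exp\big(\mu\, p(x)/r(\widehat{x})-\alpha(x-\widehat{x})^{2}\big)$ normalized by $\widetilde{\gamma}(x)$, as the definition of $\widetilde{\gamma}(x)$ in Lemma 1 implies, and the $r$-update is the marginal $\sum_{x}p(x)q(\widehat{x}|x)$ stated in Appendix A. The multipliers $\alpha,\mu$ are held fixed rather than tuned to meet particular $D$ and $P$, and the function name, the small floor `eps`, and the example alphabet are hypothetical.

```python
import math

def rdp_alternating_minimization(p, xs, alpha, mu, iters=200, eps=1e-12):
    """Alternate the two updates of Lemma 1 on a finite alphabet xs.

    p[i]  : source probability of xs[i]
    alpha : multiplier on the squared-error distortion (held fixed here)
    mu    : multiplier on the KL perception term (held fixed here)
    Returns (q, r) with q[i][j] = q(x_hat = xs[j] | x = xs[i]).
    """
    n = len(xs)
    r = [1.0 / n] * n                      # r^(0): uniform initial output
    q = [[1.0 / n] * n for _ in range(n)]
    for _ in range(iters):
        # q-update for the current r, computed in the log domain.
        for i in range(n):
            logw = [math.log(r[j]) + mu * p[i] / r[j]
                    - alpha * (xs[i] - xs[j]) ** 2 for j in range(n)]
            m = max(logw)                  # subtract the max for stability
            w = [math.exp(v - m) for v in logw]
            norm = sum(w)                  # gamma_tilde(x) up to exp(m)
            q[i] = [v / norm for v in w]
        # r-update: marginal of x_hat under p(x) q(x_hat | x).
        r = [max(sum(p[i] * q[i][j] for i in range(n)), eps)
             for j in range(n)]
        total = sum(r)                     # renormalise after the floor
        r = [v / total for v in r]
    return q, r

# Hypothetical four-point source with uniform probabilities.
xs = [0.0, 1.0, 2.0, 3.0]
p = [0.25, 0.25, 0.25, 0.25]
q, r = rdp_alternating_minimization(p, xs, alpha=1.0, mu=0.05)
```

The floor `eps` only keeps $r(\widehat{x})$ strictly positive, which Lemma 1 requires; it is not part of the lemma itself.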

III-B Lower and Upper Bounds on Semantic Channel Capacity

The channel capacity quantifies the maximum rate of information transmission for the considered system. According to the framework in Fig. 1, we define the semantic channel capacity as the maximum semantic information that can be transferred through the semantic channel $p({y_{\rm{s}}|x_{\rm{s}}})$ . Following the standard achievability and converse proof techniques, we obtain the semantic channel capacity in our framework as:

[TABLE]

In the conventional bit-level wireless communication system, the channel capacity is usually represented by the Shannon capacity formula with additive Gaussian noise. In our framework, the semantic channel noise $N_{\rm{s}}$ mainly depends on the quantization noise and the physical channel noise, and in general follows a non-Gaussian distribution. Thus, the semantic channel is an additive non-Gaussian noise channel (ANGC), and we assume that the estimated semantic features ${Y_{\rm{s}}}$ can be represented as

$$Y_{\rm{s}}=X_{\rm{s}}+N_{\rm{s}}. \tag{18}$$

Although the specific distribution of ${N_{\rm{s}}}$ is unknown, the variance of the semantic noise ${n_{\rm{s}}}$ can be obtained by measurement. In this paper, we denote the variance of the semantic noise ${n_{\rm{s}}}$ by $\sigma_{\rm{s}}^{2}$ .

Because the semantic noise ${n_{\rm{s}}}$ is non-Gaussian, the classic Shannon capacity formula (based on Gaussian distributed noise) cannot be directly applied to the semantic channel. To derive the semantic channel capacity, we first define an equivalent Gaussian distributed semantic channel noise ${\overline{{N}}_{\rm{s}}}\sim\mathcal{N}\left({0,\sigma_{\rm{s}}^{2}}\right)$ with the same variance as ${{{N}}_{\rm{s}}}$ . Then, based on the equivalent semantic channel noise ${\overline{{N}}_{\rm{s}}}$ , the received signal of the equivalent semantic channel ${\overline{{Y}}_{\rm{s}}}$ is given as

$$\overline{Y}_{\rm{s}}=X_{\rm{s}}+\overline{N}_{\rm{s}}. \tag{19}$$

Therefore, the channel capacity of the equivalent semantic channel is given as

[TABLE]

where ${P_{{{\rm{x}}_{\rm{s}}}}}$ denotes the power of the transmitted semantic data ${X_{\rm{s}}}$ .

Proposition 1 (Lower and upper bounds on the semantic channel capacity).

With the non-Gaussian distributed channel noise, the semantic channel capacity ${C_{\rm{s}}}$ is bounded by [38]

[TABLE]

where ${d_{{\rm{KL}}}}\left({{p_{{{\rm{n}}_{s}}}}\left(x\right),{p_{{{\overline{\rm{n}}}_{\rm{s}}}}}\left(x\right)}\right)=\int_{-\infty}^{\infty}{{p_{{{\rm{n}}_{\rm{s}}}}}\left(x\right)\log\frac{{{p_{{{\rm{n}}_{\rm{s}}}}}\left(x\right)}}{{{p_{{{\overline{\rm{n}}}_{\rm{s}}}}}\left(x\right)}}}{\rm{d}}x$ .

Finally, we illustrate our theoretical results on the semantic channel capacity via numerical simulation. Fig. 2 (a) and (b) show the lower bound and the upper bound in (21) on the semantic channel capacity versus SNR with semantic noise parameters $a=-1$ , $b=1$ and $\sigma_{\rm{P}}^{2}=0.01$ , and with semantic noise parameters $a=-0.3$ , $b=0.3$ and $\sigma_{\rm{P}}^{2}=0.01$ , respectively. The gap between the upper bound and the lower bound in Fig. 2 (b) is smaller than that in Fig. 2 (a). The reason is that as the KL divergence between the semantic noise ${N_{\rm{s}}}$ and the equivalent semantic channel noise ${\overline{N}_{\rm{s}}}$ tends to $0$ , i.e., ${d_{{\rm{KL}}}}\left({{p_{{{\rm{n}}_{s}}}}\left(x\right),{p_{{{\overline{\rm{n}}}_{\rm{s}}}}}\left(x\right)}\right)\to 0$ , the gap between the lower bound and the upper bound in (21) tends to 0.
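To make the gap term concrete, the sketch below evaluates the two bounds for one simple noise model. It rests on several assumptions that are not taken from the equations of this section: that Proposition 1 has the form $\overline{C}_{\rm{s}}\leq C_{\rm{s}}\leq\overline{C}_{\rm{s}}+d_{\rm{KL}}$, which is what the gap discussion above suggests; that the equivalent Gaussian capacity is $\frac{1}{2}\ln\left(1+P_{{\rm{x}}_{\rm{s}}}/\sigma_{\rm{s}}^{2}\right)$ nats per channel use; and that the noise is purely uniform on $[-c,c]$, which is a simplification of the semantic noise used for Fig. 2. All function names are illustrative.

```python
import math

def gaussian_capacity(snr_linear):
    """Equivalent Gaussian channel capacity in nats per use,
    assuming the real-valued form 0.5 * ln(1 + P / sigma^2)."""
    return 0.5 * math.log(1.0 + snr_linear)

def kl_uniform_vs_gaussian(c, steps=20000):
    """d_KL between Uniform[-c, c] noise and a zero-mean Gaussian with
    the same variance c^2 / 3, by midpoint-rule integration."""
    var = c * c / 3.0
    pu = 1.0 / (2.0 * c)                  # uniform density on [-c, c]
    dx = 2.0 * c / steps
    total = 0.0
    for k in range(steps):
        x = -c + (k + 0.5) * dx
        log_gauss = (-0.5 * math.log(2.0 * math.pi * var)
                     - x * x / (2.0 * var))
        total += pu * (math.log(pu) - log_gauss) * dx
    return total

kl = kl_uniform_vs_gaussian(c=1.0)
# Closed form: h(Gaussian) - h(uniform) for matched mean and variance.
closed_form = 0.5 * math.log(math.pi * math.e / 6.0)

for snr_db in (0, 10, 20):
    lower = gaussian_capacity(10 ** (snr_db / 10))
    upper = lower + kl                    # assumed form of the upper bound
    print(f"SNR {snr_db:2d} dB: {lower:.3f} <= C_s <= {upper:.3f} nats")
```

For zero-mean noise with the same variance as the Gaussian reference, the KL term equals the difference of the two differential entropies, which is why the numerical value can be checked against `closed_form`. In this purely uniform case the gap is the same for every $c$, because rescaling the noise rescales the matched Gaussian with it.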

IV $\beta$ -VAE based Robust and Explainable Semantic Communication System

In this section, we present a feasible and efficient system design based on the proposed framework given in Fig. 1. Here, we propose a robust $\beta$-VAE based semantic communications system, as shown in Fig. 3, which disentangles the hidden representation vector into multiple independent and semantically interpretable features.

IV-A Robust $\beta$-VAE based Semantic Encoder/Decoder

By exploiting a generative VAE model [39], we first optimize the semantic encoder ${{q_{\phi}}\left({{{z}}|{{x}}}\right)}$ with parameter set $\phi$ , and the semantic decoder ${p_{\theta}}\left({\widehat{x}|\widehat{z}}\right)$ for the receiver with parameter set $\theta$ . Mathematically, we aim to jointly optimize parameters $\phi$ and $\theta$ to maximize the log-likelihood of data $X$ as follows

[TABLE]

To efficiently handle optimization problem (22), we optimize the lower bound of the objective function $\log{p_{\theta}}\left({x}\right)$ [40]. Specifically, $\log{p_{\theta}}\left({x}\right)$ is lower bounded by

[TABLE]

where equation (23a) holds for an arbitrary distribution ${{q_{\phi}}\left({{z}|{x}}\right)}$ , and inequality (23e) holds due to ${d_{{\rm{KL}}}}\left({{q_{\phi}}\left({{z}|{x}}\right)||{p_{\theta}}\left({{z}|{x}}\right)}\right)\geq 0$ .
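The chain of steps behind this bound is the standard evidence lower bound derivation, sketched below in this section's notation. The line labels (23a) to (23h) of the original are not reproduced, so the grouping of steps here is an assumption rather than the authors' exact presentation.

```latex
\begin{align}
\log p_\theta(x)
&= \mathbb{E}_{q_\phi(z|x)}\left[\log p_\theta(x)\right] \\
&= \mathbb{E}_{q_\phi(z|x)}\left[\log \frac{p_\theta(x,z)}{q_\phi(z|x)}\right]
   + d_{\rm KL}\left(q_\phi(z|x)\,\|\,p_\theta(z|x)\right) \\
&\geq \mathbb{E}_{q_\phi(z|x)}\left[\log \frac{p_\theta(x|z)\,p_\theta(z)}{q_\phi(z|x)}\right] \\
&= \mathbb{E}_{q_\phi(z|x)}\left[\log p_\theta(x|z)\right]
   - d_{\rm KL}\left(q_\phi(z|x)\,\|\,p_\theta(z)\right).
\end{align}
```

The first equality uses the fact that $\log p_\theta(x)$ does not depend on $z$, and the inequality drops the non-negative KL term between the approximate and the true posterior.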

Unfortunately, directly maximizing the lower bound in (23h) does not yield an interpretable and robust semantic communication system design. To address this challenge, we multiply ${d_{{\rm{KL}}}}\left({{q_{\phi}}\left({{z}|{x}}\right)||{p_{\theta}}\left({z}\right)}\right)$ by a weighting parameter $\beta>1$ to obtain a disentangled and explainable semantic representation ${z}$ [39]. Furthermore, to combat semantic noise and achieve a robust design, we replace ${{p_{\theta}}\left({{x}|{z}}\right)}$ with ${p_{\theta}}\left({x|\widehat{z}}\right)$ , where $\widehat{z}={\rm{g}}z+{n_{\rm{s}}}$ , ${\rm{g}}$ denotes the fading channel gain, and ${{{{n}}_{\rm{s}}}}$ denotes the semantic noise. Specifically, the log-likelihood maximization problem (22) is reformulated as follows

[TABLE]

where the prior distribution ${p_{\theta}}\left(z\right)$ is assumed to follow a standard Gaussian distribution, i.e., ${p_{\theta}}\left(z\right)=\mathcal{N}\left({{\bf{0}},{\bf{I}}}\right)$ . Note that, in (24), the first term ${{\mathbb{E}}_{{q_{\phi}}\left({{z}|{x}}\right)}}\left[{\log{p_{\theta}}\left({{x}|{z}+{{{n}}_{\rm{s}}}}\right)}\right]$ is the expected log-likelihood in cross-entropy form, which can be regarded as the reconstruction loss, while the second term regularizes ${q_{\phi}}\left({{z}|{x}}\right)$ to be close to the prior ${p_{\theta}}\left({z}\right)$ , which can be regarded as the regularization loss. To further enhance the robustness of the variational inference, we use the $\eta$ -cross entropy ${{c}_{\eta}}\left({p\left(x\right)||{p_{\theta}}\left({x|\hat{z}}\right)}\right)$ [41, 42] as the reconstruction loss, instead of the cross entropy ${{\mathbb{E}}_{{q_{\phi}}\left({{z}|{x}}\right)}}\left[{\log{p_{\theta}}\left({{x}|{z}+{{{n}}_{\rm{s}}}}\right)}\right]$ , where

[TABLE]

Specifically, the objective function of the proposed robust semantic communication system is given as

[TABLE]

Thus, the robust $\beta$ -VAE training objective (26) encourages the latent distribution ${{q_{\phi}}\left({z|x}\right)}$ to efficiently represent semantic information about the data $x$ by jointly maximizing the $\eta$ -cross entropy ${{c}_{\eta}}\left({p\left(x\right)||{p_{\theta}}\left({x|\hat{z}}\right)}\right)$ and minimizing the $\beta$ -weighted KL term ${d_{{\rm{KL}}}}\left({{q_{\phi}}\left({{z}|{x}}\right)||{p_{\theta}}\left({z}\right)}\right)$ via unsupervised learning.

More specifically, we jointly optimize the semantic encoder parameter $\phi$ and the semantic decoder parameter $\theta$ to maximize the objective function (26). The first term of (26) measures how well the input data $x$ is reconstructed, and corresponds to the reconstruction loss. The second term penalizes the KL divergence between the approximate posterior ${{q_{\phi}}\left({z|x}\right)}$ and the fixed Gaussian prior ${p_{\theta}}\left(z\right)=\mathcal{N}\left({{\bf{0}},{\bf{I}}}\right)$ . With a well-chosen value of the parameter $\beta$ (usually $\beta>1$ ), the posterior ${{q_{\phi}}\left({z|x}\right)}$ is encouraged to match the Gaussian prior, which disentangles the hidden representation into multiple independent and semantically meaningful features $\left\{{{{z}_{l}}}\right\}_{l\in{\mathcal{L}}}$ . The parameter $\beta$ balances reconstruction accuracy against disentanglement quality. In general, a higher value of $\beta$ produces a more disentangled representation, but may lead to lower reconstruction accuracy [39].

Note that, in the robust $\beta$ -VAE network, we let $\left\{{{\mu_{l}}}\right\}_{l=1}^{L}$ and $\left\{{{\sigma_{l}}}\right\}_{l=1}^{L}$ denote the mean and the corresponding standard deviation of the approximate posterior ${q_{\phi}}\left({{z}|{x}}\right)$ , respectively. Moreover, a reparametrization trick [39] is applied to estimate gradients of the objective function (24) with respect to the parameter $\phi$ , where random independent variables $\left\{{{\varepsilon_{l}}}\right\}_{l=1}^{L}$ are sampled from a standard Gaussian distribution, i.e., ${{\varepsilon_{l}}}\sim\mathcal{N}\left({{{0}},{\bf{1}}}\right)$ . Then, the output features of the semantic encoder $\left\{{{z_{l}}}\right\}_{l=1}^{L}$ are given as follows

$$z_{l}=\mu_{l}+\sigma_{l}\varepsilon_{l},\quad l=1,\ldots,L. \tag{27}$$

Thus, the feature $z_{l}$ is equivalent to being sampled from distribution ${\cal N}\left({{\mu_{l}},\sigma_{l}^{2}}\right)$ , where $l=1,...,L$ .
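The sampling step and the KL regularizer can be sketched with the Python standard library as follows. This is an illustration under assumptions rather than the authors' code: the reparameterization is written as $z_{l}=\mu_{l}+\sigma_{l}\varepsilon_{l}$, the KL term uses the standard closed form for a diagonal Gaussian against $\mathcal{N}({\bf{0}},{\bf{I}})$, and the channel $\widehat{z}={\rm{g}}z+n_{\rm{s}}$ uses Gaussian noise purely as a stand-in for the non-Gaussian semantic noise. The function names and the value of `beta` are hypothetical.

```python
import math
import random

def reparameterize(mu, sigma, rng):
    """z_l = mu_l + sigma_l * eps_l with eps_l ~ N(0, 1)."""
    return [m + s * rng.gauss(0.0, 1.0) for m, s in zip(mu, sigma)]

def kl_to_standard_normal(mu, sigma):
    """Closed-form d_KL( N(mu, diag(sigma^2)) || N(0, I) )."""
    return 0.5 * sum(m * m + s * s - 1.0 - 2.0 * math.log(s)
                     for m, s in zip(mu, sigma))

def semantic_channel(z, g, noise_std, rng):
    """z_hat = g * z + n_s; Gaussian n_s is only a placeholder here."""
    return [g * v + rng.gauss(0.0, noise_std) for v in z]

rng = random.Random(0)                    # fixed seed for repeatability
mu = [0.5, -1.0, 0.0]                     # encoder means (example values)
sigma = [1.0, 0.5, 2.0]                   # encoder standard deviations
z = reparameterize(mu, sigma, rng)
z_hat = semantic_channel(z, g=1.0, noise_std=0.1, rng=rng)

beta = 4.0                                # hypothetical weight, beta > 1
kl = kl_to_standard_normal(mu, sigma)
kl_penalty = beta * kl
```

Because the randomness enters only through $\varepsilon_{l}$, the sample $z_{l}$ is a deterministic function of $\mu_{l}$ and $\sigma_{l}$ once $\varepsilon_{l}$ is drawn, which is what allows gradients to flow back to the encoder parameters.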

IV-B Feature Selection and Completion

With the disentangled and explainable features, the proposed semantic communications system further performs feature selection and completion at the transmitter and receiver, respectively. Specifically, since the receiver may only be interested in some of the features, the transmitter only sends the intended features ${\left\{{{{z}_{l}}}\right\}_{l\in{\mathcal{L}_{\rm{sel}}}}}$ according to their semantic meanings, rather than all of the extracted features $\left\{{{{z}_{l}}}\right\}_{l\in{\mathcal{L}}}$ , which can further reduce the amount of information transmission.

At the receiver, the proposed semantic source and channel decoder includes semantic feature completion and feature reconstruction. Specifically, for the subset of unintended features that are not transmitted, ${\left\{{{z_{l}}}\right\}_{l\in\mathcal{L}\backslash{\mathcal{L}_{{\rm{sel}}}}}}$ , the receiver generates the corresponding features ${\left\{{{{\widehat{z}}_{l}}}\right\}_{l\in\mathcal{L}\backslash{\mathcal{L}_{{\rm{sel}}}}}}$ based on the receiver knowledge base, where the dimensions and value ranges of ${z_{l}}$ and ${{{\widehat{z}}_{l}}}$ are the same.

Then, according to the completed semantic features $\widehat{Z}={\left\{{{{\hat{z}}_{l}}}\right\}_{l\in\mathcal{L}}}$ , the feature reconstruction module recovers the original data $\widehat{X}$ .
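The selection and completion steps can be sketched as follows. This is a minimal illustration, assuming that the receiver knowledge base can be represented as one stored value per feature index; how the knowledge base actually generates the missing features is not specified in this section, and the function names and numbers are hypothetical.

```python
def select_features(z, selected):
    """Transmitter: keep only the features whose indices are selected."""
    return {l: z[l] for l in sorted(selected)}

def complete_features(received, knowledge_base):
    """Receiver: use received features where available and fill the
    remaining indices from the receiver knowledge base."""
    return [received.get(l, kb) for l, kb in enumerate(knowledge_base)]

z = [0.9, -0.2, 1.4, 0.0]                 # extracted features, L = 4
knowledge_base = [0.0, 0.5, -1.0, 0.3]    # hypothetical receiver values
sent = select_features(z, selected={2})   # only feature 2 is intended
z_hat = complete_features(sent, knowledge_base)
```

Only the entries of `sent` cross the channel, so the transmitted payload scales with the number of selected features rather than with $L$.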

IV-C Proposed Architecture

The proposed lightweight semantic communication architecture includes a semantic encoder network and a semantic decoder network, as shown in Fig. 4, where the notation Conv2D 32 $@$ $32\times 32$ means that the network has 32 2-D convolutional filters of size $32\times 32$ , and Dense $1\times 256$ represents a dense layer with 256 neurons. The details of the semantic encoder and decoder network architectures are as follows:

IV-C1 Semantic encoder architecture

Conv2D 32 $@$ $32\times 32$ $\to$ Conv2D 32 $@$ $16\times 16$ $\to$ Conv2D 64 $@$ $8\times 8$ $\to$ Conv2D 64 $@$ $4\times 4$ $\to$ Dense $1\times 256$ $\to$ 2 parallel Dense $1\times 32$ $\to$ $\left\{{{z_{l}}}\right\}_{l=1}^{32}$ $\to$ ${\left\{{{{z}_{l}}}\right\}_{l\in{\mathcal{L}_{\rm{sel}}}}}$ ;

IV-C2 Semantic decoder architecture

${\left\{{{{\widehat{z}}_{l}}}\right\}_{l\in{{\cal L}_{{\rm{sel}}}}}}$ $\to$ $\left\{{{{\widehat{z}}_{l}}}\right\}_{l=1}^{32}$ $\to$ Dense $1\times 32$ $\to$ Dense $1\times 256$ $\to$ ConvT2D 64 $@$ $4\times 4$ $\to$ ConvT2D 32 $@$ $8\times 8$ $\to$ ConvT2D 32 $@$ $16\times 16$ $\to$ ConvT2D 3 $@$ $32\times 32$ .

Note that, based on the feature selection, the proposed semantic communication system only needs to send the features ${\left\{{{{z}_{l}}}\right\}_{l\in{\mathcal{L}_{\rm{sel}}}}}$ that the receiver is interested in, instead of sending all features $\left\{{{{z}_{l}}}\right\}_{l=1}^{32}$ .

V Prototype and Implementations

The architecture and hardware platform design of the semantic communication system prototype are shown in Fig. 5 (a) and (b); the prototype implements the proposed robust and explainable semantic communications system in Fig. 3. It includes two semantic communication mobile users, A and B. The trained robust $\beta$-VAE network runs on portable Raspberry Pi 4 Model B boards to realize the semantic encoding/decoding and the feature selection/completion functions. The integrated Wi-Fi module handles the bit-level transmission. The decoded data is shown on the display.

The detailed parameters of the prototype are provided in Table III. Each Raspberry Pi has a quad-core ARM Cortex-A72 CPU running at 1.5 GHz and 4 GB of DDR4 RAM, and runs the CPU build of PyTorch together with torchvision. Communication between Raspberry Pi A and B takes place over Wi-Fi, where sockets are used to send and receive data, and Visdom is used for visualization.

VI Experiments and Discussions

In this section, we evaluate the proposed explainable semantic communications system on a graphics processing unit (GPU) platform and on the Raspberry Pi prototype. The GPU experiments were performed on a machine with an i5-12600H CPU, 32 GB of RAM, and an Nvidia GeForce RTX 3060 Ti graphics card with 8 GB of memory, using PyTorch with CUDA 11.3. The experiments use two standard datasets: MNIST and CelebA.

VI-A Demonstration via GPU

First, we evaluate the robustness of the proposed semantic communication system. Specifically, the peak signal-to-noise ratio (PSNR) performance of the proposed robust $\beta$-VAE scheme with ${\rm{SNR}}_{{\rm{train}}}=4{\rm{dB}}$ and ${\rm{SNR}}_{{\rm{train}}}=8{\rm{dB}}$ is evaluated over two channel models: the ANGC and a slow Rayleigh fading channel, where ${\rm{SNR}}_{{\rm{train}}}=4{\rm{dB}}$ and ${\rm{SNR}}_{{\rm{train}}}=8{\rm{dB}}$ mean that the schemes are trained at SNRs of $4{\rm{dB}}$ and $8{\rm{dB}}$ , respectively. Moreover, the PSNR performance of the deep joint source-channel coding (Deep-JSCC) scheme [19], the $\beta$-VAE scheme, and the JPEG compression scheme is presented for comparison.
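For readers who want to reproduce the metric, PSNR can be computed from the mean squared error between the original and the received image. The sketch below uses the standard definition $10\log_{10}(\mathrm{MAX}^{2}/\mathrm{MSE})$; the peak value of 255 (8-bit pixels) is an assumption, since the pixel range used in the experiments is not stated in this section, and the function name and sample values are illustrative.

```python
import math

def psnr(original, received, max_value=255.0):
    """PSNR in dB between two equal-length sequences of pixel values."""
    if len(original) != len(received):
        raise ValueError("images must have the same number of pixels")
    mse = sum((a - b) ** 2 for a, b in zip(original, received)) / len(original)
    if mse == 0:
        return math.inf                   # identical images
    return 10.0 * math.log10(max_value ** 2 / mse)

# Tiny flattened example "images" with four pixels each.
example = psnr([0, 128, 255, 64], [0, 130, 250, 64])
```

A higher PSNR means a smaller mean squared error relative to the peak pixel value, which is how the curves in the following comparisons should be read.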

Fig. 6 (a) shows PSNR versus test SNR for the five schemes over the ANGC, with semantic noise parameters $a=-0.1$ , $b=0.1$ and $\sigma_{\rm{P}}^{2}=1$ . We observe that the PSNR of JPEG compression is the lowest among the five schemes, and the PSNRs of the robust $\beta$-VAE schemes are higher than those of both Deep-JSCC and $\beta$-VAE. In the low SNR region, the PSNR of the robust $\beta$-VAE with ${\rm{SNR}}_{{\rm{train}}}=4{\rm{dB}}$ is the highest, and the PSNR of the robust $\beta$-VAE with ${\rm{SNR}}_{{\rm{train}}}=8{\rm{dB}}$ is higher than that of $\beta$-VAE, which verifies the robustness of our proposed design. Since the training noise at ${\rm{SNR}}_{{\rm{train}}}=4{\rm{dB}}$ is stronger than that at ${\rm{SNR}}_{{\rm{train}}}=8{\rm{dB}}$ , the model trained at ${\rm{SNR}}_{{\rm{train}}}=4{\rm{dB}}$ is more robust, and thus its PSNR is higher. In the high SNR region, the PSNRs of the $\beta$-VAE and robust $\beta$-VAE models tend to be the same, because the effect of noise at high SNR is negligible.

Fig. 6 (b) illustrates PSNR versus test SNR for the five schemes over the Rayleigh fading channel. Similar to Fig. 6 (a), the PSNR of JPEG compression is the lowest among the five schemes, and the PSNRs of the robust $\beta$-VAE with ${\rm{SNR}}_{{\rm{train}}}=4{\rm{dB}}$ and ${\rm{SNR}}_{{\rm{train}}}=8{\rm{dB}}$ are higher than those of both Deep-JSCC and $\beta$-VAE. Note that for ${\rm{SN}}{{\rm{R}}_{{\rm{test}}}}=8{\rm{dB}}$ , the PSNR of the robust $\beta$-VAE with ${\rm{SNR}}_{{\rm{train}}}=8{\rm{dB}}$ is higher than that with ${\rm{SNR}}_{{\rm{train}}}=4{\rm{dB}}$ . This is because its training SNR matches the test SNR of $8$ dB. Compared with the ANGC results in Fig. 6 (a), the PSNRs of the schemes in Fig. 6 (b) are lower due to the random Rayleigh fading.

Table IV illustrates the transmission performance of JPEG compression, $\beta$ -VAE, and robust $\beta$ -VAE with ${\rm{SNR}}_{{\rm{train}}}=4{\rm{dB}}$ and ${\rm{SNR}}_{{\rm{train}}}=8{\rm{dB}}$ over ANGC with semantic noise parameters $a=-0.1$ , $b=0.1$ and $\sigma_{\rm{P}}^{2}=1$ . The second column of Table IV shows the transmission performance of the JPEG compression scheme, where the transmitted semantics cannot be recognized from the received image. The third column shows the results of the $\beta$ -VAE scheme, where the transmission semantics can be recognized from the received image. The fourth and fifth columns show received images of the robust $\beta$ -VAE scheme with ${\rm{SNR}}_{{\rm{train}}}=4{\rm{dB}}$ and ${\rm{SNR}}_{{\rm{train}}}=8{\rm{dB}}$ , and the quality is better than that of the $\beta$ -VAE scheme.

Table V illustrates the transmission performance of the four schemes over the Rayleigh fading channel with semantic noise parameters $a=-0.1$ , $b=0.1$ and $\sigma_{\rm{P}}^{2}=1$ . The second column of Table V shows the transmission performance of the JPEG compression scheme, where the transmitted semantics cannot be recognized from the received image. The third column shows the transmission performance of the $\beta$-VAE scheme, where the transmitted semantics can be recognized from the received image. The fourth and fifth columns show the transmission performance of the robust $\beta$-VAE scheme with ${\rm{SNR}}_{{\rm{train}}}=4{\rm{dB}}$ and ${\rm{SNR}}_{{\rm{train}}}=8{\rm{dB}}$ , and the received image quality is better than that of the $\beta$-VAE scheme.

VI-B Demonstration via Prototype

In this subsection, we demonstrate that the proposed explainable semantic communication system with feature selection can improve the transmission efficiency via our prototype.

Table VI shows the performance of the proposed explainable semantic communication system with feature selection. From Column 2 to Column 5, we present four examples to show how the explainable encoder and feature selection work in the transmission. In the second column, the intended feature is skin color. The proposed semantic communication system performs feature extraction on the input picture of a white-skinned woman, and then selects only the white skin color feature for transmission. Although the woman in the receiver knowledge base has darker skin, the reconstructed image shows white skin. In the third column, the intended feature is face orientation. The proposed semantic communication system successfully reconstructs a picture with the same face orientation at the receiver. Similarly, the intended features of the fourth and fifth columns are gender and hairstyle, respectively, and the proposed semantic communication system also recovers the correct feature at the receiver.

Table VII compares the compression ratio, transmission time, PSNR, and reconstructed image of the original image transmission scheme, the JPEG compression scheme, the $\beta$-VAE scheme, and the robust $\beta$-VAE scheme on our semantic communication prototype, using the MNIST dataset at high SNR. From Table VII, we observe that the compression ratio of the $\beta$-VAE scheme and the robust $\beta$-VAE scheme is $78.4$ , which is significantly higher than those of the JPEG compression scheme (1.81) and the original image transmission scheme. Accordingly, the transmission time of the $\beta$-VAE scheme and the robust $\beta$-VAE scheme is about $0.3$ ms, which is significantly lower than those of the JPEG compression scheme (10.88 ms) and the original image transmission scheme (18.86 ms). Therefore, the proposed semantic communication system can significantly reduce the transmission load and time. Moreover, the PSNR of the robust $\beta$-VAE scheme is close to that of the JPEG compression scheme, and is higher than that of the $\beta$-VAE scheme. Comparing the reconstructed images, we can clearly and accurately identify the digit “ $7$ ” from the image recovered by the proposed robust $\beta$-VAE scheme.

Table VIII compares the compression ratio, transmission time, PSNR, and reconstructed image of the original image transmission scheme, the JPEG compression scheme, the $\beta$-VAE scheme, and the robust $\beta$-VAE scheme on our semantic communication prototype, using the CelebA dataset at high SNR. Similar to Table VII, the compression ratio of the $\beta$-VAE scheme and the robust $\beta$-VAE scheme is $384$ , which is significantly higher than those of the JPEG compression scheme (4.49) and the original image transmission scheme. Accordingly, the transmission time of the $\beta$-VAE scheme and the robust $\beta$-VAE scheme is about $0.18$ ms, which is significantly lower than those of the JPEG compression scheme (9.53 ms) and the original image transmission scheme (28.38 ms). Therefore, the proposed semantic communication system can significantly reduce the transmission load and time. Moreover, the PSNR of the robust $\beta$-VAE scheme is close to that of the JPEG compression scheme, and is higher than that of the $\beta$-VAE scheme. Note that, although the image reconstructed by the proposed robust $\beta$-VAE scheme is slightly blurry, the three main semantic features of the original image (female, white skin color, and long hair) are all accurately transmitted, which verifies the validity and accuracy of the proposed task-oriented semantic communication scheme.
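The reported compression ratios can be reproduced with simple arithmetic if the ratio is read as the number of source values divided by the number of transmitted features, with the same number of bits per value on both sides. That reading is an assumption, as are the image sizes used below ($28\times 28\times 1$ for MNIST, $64\times 64\times 3$ for CelebA) and the 10-feature latent for MNIST; only the 32-feature latent appears in the architecture of Section IV-C. The function name is illustrative.

```python
def compression_ratio(height, width, channels, num_features):
    """Source values per transmitted feature (equal bits per value assumed)."""
    return height * width * channels / num_features

# Assumed sizes: MNIST 28x28 grayscale with 10 features,
# CelebA 64x64 RGB with 32 features.
mnist_ratio = compression_ratio(28, 28, 1, 10)
celeba_ratio = compression_ratio(64, 64, 3, 32)
```

Under these assumptions the two values match the ratios quoted from Tables VII and VIII, which suggests that the gain comes directly from sending a short latent vector instead of the pixel array.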

VII Conclusions

In this paper, we propose an explainable and easy-to-implement semantic communication framework that is compatible with conventional communication systems. In this new framework, the semantic encoder can extract feature vectors, disentangle the semantic information, and improve robustness against semantic information ambiguity. To further reduce the communication cost, we apply feature selection to choose only task-related semantic information to transmit. Then, we present two information theoretic metrics, namely, the rate-distortion-perception function and semantic channel capacity to characterize the semantic information compression and transmission, respectively. To quantify the semantic information transmission with the additive quantization noise and physical channel noise, we further derive upper and lower bounds on the semantic channel capacity. Then, we propose a feasible design of the explainable semantic communication system, which includes a robust $\beta$ -VAE lightweight unsupervised learning network. Finally, we develop a wireless mobile semantic communication proof-of-concept prototype to implement the semantic communication design. Our experiments demonstrate that the proposed semantic communication system significantly outperforms the state-of-the-art methods, and shows robustness against various noise levels on two benchmark datasets. This work attempts to provide frameworks and theoretic metrics to explain and analyze the black-box semantic communications problem, and to provide guidelines on implementing the semantic communication in practical communication systems.

VIII Appendices

Appendix A Proof of Lemma 1

We first derive the optimal conditional distribution $q(\widehat{x}|x)$ in (14) for a given output distribution $r(\widehat{x})$ . The mutual information $I({X;\widehat{X}})=\sum\limits_{x}{\sum\limits_{\widehat{x}}{p\left(x\right)q({\widehat{x}|x})\log\frac{{q({\widehat{x}|x})}}{{r({\widehat{x}})}}}}$ is convex in ${q({\widehat{x}|x})}$ for fixed ${p\left(x\right)}$ , and the KL divergence ${d_{KL}}\left({p\left(x\right),r({\widehat{x}})}\right)=\sum\limits_{x}{p\left(x\right)\log\frac{{p\left(x\right)}}{{r({x})}}}$ is also convex in ${q({\widehat{x}|x})}$ for fixed ${p\left(x\right)}$ . Thus, problem (14) is convex in ${q({\widehat{x}|x})}$ , and its Lagrangian function is given by

[TABLE]

where $\alpha\geq 0$ , $\mu\geq 0$ and ${\gamma\left(x\right)}\geq 0$ are the Lagrange multipliers associated with constraints (14b), (14c) and (14d), respectively. For a given $r(\widehat{x})$ , the derivative of (28) with respect to ${q({\widehat{x}|x})}$ is given as

[TABLE]

Let $\frac{{\partial L\left({q({\widehat{x}|x})}\right)}}{{\partial q({\widehat{x}|x})}}=0$ , then we obtain the optimal $q({\widehat{x}|x})$ as

[TABLE]

where $\widetilde{\gamma}\left(x\right)\buildrel\Delta\over{=}\exp\left({\frac{{\gamma\left(x\right)}}{{p\left(x\right)}}}\right)$ .

Since $\sum\limits_{\widehat{x}}{q({\widehat{x}|x})}=1$ , we have

[TABLE]

Furthermore, we obtain

[TABLE]

Substituting (32) into (30b), we obtain the optimal $q^{*}({\widehat{x}|x})$ as given in Lemma 1.

From [37], we find that, given a fixed conditional distribution $q({\widehat{x}|x})$ , the optimal output distribution is $r^{*}(\widehat{x})\triangleq\sum_{x}p(x)q({\widehat{x}|x})$ . We restate the proof below.

[TABLE]

where the last inequality holds because of the non-negative property of KL divergence.

Bibliography


[1] L. Knud, “State of the IoT 2020: 12 billion IoT connections, surpassing non-IoT for the first time,” https://iot-analytics.com/state-of-the-iot-2020-12-billion-iot-connections-surpassing-non-iot/ , 2020.
[2] J. Antoniou, “Quality of experience and emerging technologies: Considering features of 5G, IoT, cloud and AI,” in Quality of Experience and Learning in Information Systems , pp. 1–8. Springer, 2021.
[3] W. Saad, M. Bennis, and M. Chen, “A vision of 6G wireless systems: Applications, trends, technologies, and open research problems,” IEEE Netw. , vol. 34, no. 3, pp. 134–142, Oct. 2020.
[4] E. Calvanese Strinati, S. Barbarossa, J. L. Gonzalez-Jimenez, D. Ktenas, N. Cassiau, L. Maret, and C. Dehos, “6G: The next frontier: From holographic messaging to artificial intelligence using subterahertz and visible light communication,” IEEE Veh. Technol. Mag. , vol. 14, no. 3, pp. 42–50, Oct. 2019.
[5] B. Mao, F. Tang, Y. Kawamoto, and N. Kato, “AI models for green communications towards 6G,” IEEE Commun. Surveys Tuts. , vol. 24, no. 1, pp. 210–247, Nov. 2022.
[6] K. Niu, J. Dai, S. Yao, S. Wang, Z. Si, X. Qin, and P. Zhang, “Towards semantic communications: A paradigm shift,” arXiv preprint arXiv:2203.06692 , 2022.
[7] P. Zhang, W. Xu, H. Gao, K. Niu, X. Xu, X. Qin, C. Yuan, Z. Qin, H. Zhao, J. Wei, et al., “Toward wisdom-evolutionary and primitive-concise 6G: A new paradigm of semantic communication networks,” Engineering , 2022.
[8] M. Kountouris and N. Pappas, “Semantics-empowered communication for networked intelligent systems,” IEEE Commun. Mag. , vol. 59, no. 6, pp. 96–102, Jan. 2021.
