TL;DR
This paper introduces a neural-network speech steganography method that conceals one or more spoken messages inside a speech carrier with changes that listeners could not detect in ABX tests, recovers messages more accurately than image-oriented deep baselines, and remains robust to several channel distortions.
Contribution
The paper proposes a novel neural network model incorporating Fourier transforms for speech steganography, addressing limitations of vision-based models and enabling multi-message concealment.
Findings
Effective concealment of multiple messages in speech
Minimal perceptible changes to human listeners
Robustness under various channel distortions

Table 1: Absolute error (loss) and SNR for the carrier and the message when concealing a single message.

| Dataset | Model | Car. loss | Car. SNR | Msg. loss | Msg. SNR |
|---|---|---|---|---|---|
| TIMIT | Freq. Chop | 0.0770 | 0.22 | 0.046 | 6.85 |
| TIMIT | Baluja et al. [1] | 0.0023 | 27.11 | 0.096 | 0.14 |
| TIMIT | Zhu et al. [2] | 0.0027 | 32.70 | 0.078 | 0.71 |
| TIMIT | Ours | 0.0016 | 28.27 | 0.035 | 8.76 |
| TIMIT | Ours + Adv. | 0.0022 | 34.54 | 0.051 | 4.02 |
| YOHO | Freq. Chop | 0.0550 | 0.24 | 0.038 | 7.08 |
| YOHO | Baluja et al. [1] | 0.0021 | 26.35 | 0.072 | 0.53 |
| YOHO | Zhu et al. [2] | 0.0047 | 27.99 | 0.066 | 1.05 |
| YOHO | Ours | 0.0016 | 27.86 | 0.028 | 8.16 |
| YOHO | Ours + Adv. | 0.0016 | 31.18 | 0.033 | 6.00 |
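
Table 2: Carrier and message loss and SNR when concealing 3 or 5 messages, using multiple decoders (multi) or a single conditional decoder (cond). Message values are averaged over the messages.

| Dataset | Model | Carrier loss | Carrier SNR | Message loss | Message SNR |
|---|---|---|---|---|---|
| TIMIT | multi-3 | 0.0042 | 25.13 | 0.0458 | 6.16 |
| TIMIT | cond-3 | 0.0043 | 24.08 | 0.0463 | 6.08 |
| TIMIT | multi-5 | 0.0058 | 23.64 | 0.0550 | 4.42 |
| TIMIT | cond-5 | 0.0063 | 22.70 | 0.0516 | 4.87 |
| YOHO | multi-3 | 0.0042 | 23.80 | 0.0349 | 6.29 |
| YOHO | cond-3 | 0.0038 | 23.53 | 0.0344 | 6.49 |
| YOHO | multi-5 | 0.0046 | 23.33 | 0.0428 | 4.17 |
| YOHO | cond-5 | 0.0051 | 22.30 | 0.0392 | 4.85 |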

Table 3: Message reconstruction results after distorting the carrier.

| Noise | Msg. loss | Msg. SNR |
|---|---|---|
| Down-sampling to 8k | 0.046 | 7.72 |
| MP3 compression 128k | 0.045 | 6.88 |
| MP3 compression 64k | 0.062 | 5.34 |
| MP3 compression 32k | 0.089 | 2.15 |
| AWGN, | 0.077 | -12.52 |
| AWGN, | 0.044 | 8.50 |
| Speckle, | 0.035 | 8.26 |
| Speckle, | 0.035 | 8.76 |
| Prec. reduction 8-bit | 0.160 | 0.25 |
Hide and Speak: Towards Deep Neural Networks for Speech Steganography
Abstract
Steganography is the science of hiding a secret message within an ordinary public message, which is referred to as the carrier. Traditionally, digital signal processing techniques, such as least significant bit encoding, were used for hiding messages. In this paper, we explore the use of deep neural networks as steganographic functions for speech data. We show that steganography models proposed for vision are less suitable for speech, and propose a new model that includes the short-time Fourier transform and the inverse short-time Fourier transform as differentiable layers within the network, thus imposing a vital constraint on the network outputs. We empirically demonstrate the effectiveness of the proposed method compared with deep-learning baselines on several speech datasets and analyze the results quantitatively and qualitatively. Moreover, we show that the proposed approach can be applied to conceal multiple messages in a single carrier using multiple decoders or a single conditional decoder. Lastly, we evaluate our model under different channel distortions. Qualitative experiments suggest that modifications to the carrier are unnoticeable by human listeners and that the decoded messages are highly intelligible.
1 Introduction
Steganography (“steganos” – concealed or covered, “graphein” – writing) is the science of concealing messages inside other messages. It is generally used to convey concealed “secret” messages to recipients who are aware of their presence, while keeping even their existence hidden from other unaware parties who only see the “public” or “carrier” message.
Recently, [1, 2] proposed to use deep neural networks as a steganographic function for hiding an image inside another image. Unlike traditional steganography methods [3, 4], in this line of work, the network learns to conceal a hidden message inside the carrier without manually specifying a particular redundancy to exploit.
Although these studies presented impressive results on image data, the applicability of such models to speech data was not explored. As opposed to working with raw images in the domain of vision processing, the common approach when learning from speech data is to work in the frequency domain, specifically using the short-time Fourier transform (STFT) to capture spectral changes over time. The STFT output is a complex matrix composed of the Fourier transforms of different time frames. The common practice is to use the absolute values (magnitudes) of the STFT measurements and to maintain a substantial overlap between adjacent frames [5]. Consequently, the original signal cannot be losslessly recovered from the STFT. Moreover, as only the magnitude is considered, the phase needs to be recovered, which complicates the restoration of the time-domain signal even further.
In this study, we show that steganography models proposed for vision are less suitable for speech. We build on the work by [1, 2] and propose a new model that includes the STFT and inverse-STFT as differentiable layers within the network, thus imposing a vital constraint on the network outputs.
Although one can simply hide written text inside audio files and convey the same lexical content, concealing audio inside audio preserves additional features. For instance, the secret message may convey the speaker identity, the sentiment of the speaker, prosody, etc. These features can be used for later identification and authentication of the message.
Similarly to [1, 2], the proposed model comprises three parts. The first learns to encode a hidden message inside the carrier. The second consists of differentiable STFT and inverse-STFT layers that simulate transformations between the frequency and time domains. Lastly, the third component learns to decode a hidden message from a generated carrier. Additionally, we demonstrate for the first time that this scheme permits hiding multiple secret messages in a single carrier, each potentially with a different intended recipient who is the only person who can recover it.
Further analysis shows that the addition of STFT layers yields a method which is robust to various channel distortions and compression methods, such as MP3 encoding, Additive White Gaussian Noise, sample rate reduction, etc. Qualitative experiments suggest that modifications to the carrier are unnoticeable by human listeners and that the decoded messages are highly intelligible and preserve other semantic content, such as speaker identity.
Our contribution:
- We empirically show that steganographic vision-oriented models are less suitable for the audio domain.
- We augment vision-oriented models with differentiable STFT/inverse-STFT layers during training to account for noise introduced when converting signals from the frequency domain to the time domain and back.
- We embed multiple speech messages in a single speech carrier.
- We provide extensive empirical and subjective analysis of the reconstructed signals and show that the produced carriers are indistinguishable from the original carriers, while keeping the decoded messages highly intelligible.
The paper is organized as follows. Section 2 sets up the notation we use throughout the paper. In Section 3 we describe the proposed model. Section 4 and Section 5 present the results together with objective and subjective analysis. Section 6 summarizes the related work. We conclude the paper in Section 7 with a discussion and future work.
2 Notation and representation
In this section, we rigorously set the notation we use throughout the paper.
Steganography notations.
Recall that in steganography the goal is to conceal a hidden message within a carrier segment. Specifically, the steganography system is a function that gets as input a carrier utterance, denoted by $\mathbf{c}$, and a hidden message, denoted by $\mathbf{m}$. The outputs of the system are the embedded carrier $\hat{\mathbf{c}}$ and, consequently, the recovered message $\hat{\mathbf{m}}$, such that the following constraints are satisfied: (i) both $\hat{\mathbf{c}}$ and $\hat{\mathbf{m}}$ should be perceptually similar to $\mathbf{c}$ and $\mathbf{m}$, respectively, as judged by a human evaluator; (ii) the message $\hat{\mathbf{m}}$ should be recoverable from the carrier $\hat{\mathbf{c}}$ and should be intelligible; and lastly (iii) a human evaluator should not be able to detect the presence of a hidden message embedded in $\hat{\mathbf{c}}$.
Audio notations.
Let $\mathbf{x} = (x[0], \ldots, x[N-1])$ be a speech signal that is composed of $N$ samples. The spectral content of the signal changes over time, therefore it is often represented by the short-time Fourier transform, commonly known as the spectrogram, rather than by the Fourier transform.
The STFT, $\mathcal{F}\{\mathbf{x}\}$, is a matrix of complex numbers: each column is the Fourier transform of one time frame, so columns are indexed by frame and rows by frequency bin. In speech processing we are often interested in the absolute value of the STFT, or the magnitude, which is denoted by $\mathbf{X} = |\mathcal{F}\{\mathbf{x}\}|$. Similarly, we denote the phase of the STFT by $\angle\mathbf{X}$. Furthermore, we denote by $\mathcal{S}$ the operator that gets as input a real signal and outputs the magnitude matrix of its STFT, $\mathbf{X} = \mathcal{S}(\mathbf{x})$, and denote by $\mathcal{S}^{\dagger}$ the operator that gets as input the magnitude and phase matrices of the STFT and returns a recovered version of the speech waveform, $\tilde{\mathbf{x}} = \mathcal{S}^{\dagger}(\mathbf{X}, \angle\mathbf{X})$. Here $\mathcal{S}^{\dagger}$ is computed by taking the inverse Fourier transform of each column of the complex matrix formed from $\mathbf{X}$ and $\angle\mathbf{X}$, and then reconstructing the waveform by combining the outputs with the overlap-and-add method. Note that this reconstruction is imperfect, since there is a substantial overlap between adjacent windows when using the STFT in speech processing, hence part of the signal at each window is lost [6].
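The two operators described above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: the function names `stft` and `istft`, the periodic Hann window, the 16-sample frame, and the 8-sample hop are all assumptions chosen to keep the example small, and a naive DFT is used instead of an FFT.

```python
import cmath
import math

def hann(n):
    # Periodic Hann window; with hop = n // 2 the shifted windows sum to 1.
    return [0.5 - 0.5 * math.cos(2 * math.pi * i / n) for i in range(n)]

def stft(x, n_fft=16, hop=8):
    """Waveform -> (magnitude, phase); rows are frequency bins, columns are frames."""
    w = hann(n_fft)
    n_frames = 1 + (len(x) - n_fft) // hop
    n_bins = n_fft // 2 + 1
    mag = [[0.0] * n_frames for _ in range(n_bins)]
    phase = [[0.0] * n_frames for _ in range(n_bins)]
    for t in range(n_frames):
        frame = [x[t * hop + i] * w[i] for i in range(n_fft)]
        for k in range(n_bins):
            z = sum(frame[i] * cmath.exp(-2j * math.pi * k * i / n_fft)
                    for i in range(n_fft))
            mag[k][t], phase[k][t] = abs(z), cmath.phase(z)
    return mag, phase

def istft(mag, phase, n_fft=16, hop=8):
    """(magnitude, phase) -> waveform via inverse DFT of each column and overlap-add."""
    n_bins, n_frames = len(mag), len(mag[0])
    out = [0.0] * (n_fft + hop * (n_frames - 1))
    for t in range(n_frames):
        half = [cmath.rect(mag[k][t], phase[k][t]) for k in range(n_bins)]
        # Rebuild the full spectrum of a real signal using Hermitian symmetry.
        full = half + [half[n_fft - k].conjugate() for k in range(n_bins, n_fft)]
        for i in range(n_fft):
            v = sum(full[k] * cmath.exp(2j * math.pi * k * i / n_fft)
                    for k in range(n_fft)) / n_fft
            out[t * hop + i] += v.real
    return out

# Toy signal: a single click at sample 20.
x = [0.0] * 64
x[20] = 1.0
mag, phase = stft(x)
same_phase = istft(mag, phase)                                   # magnitude + true phase
zero_phase = istft(mag, [[0.0] * len(row) for row in phase])     # magnitude only
```

In this toy setting, supplying the true phase brings the click back at sample 20 (only the first and last hop, which are covered by a single window, are attenuated), whereas discarding the phase moves the click's energy to the starts of the frames that contained it. This illustrates why the magnitude alone does not determine the waveform.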
3 Model
Similarly to the models proposed in [1, 2], our architecture is composed of the following components: (i) an Encoder Network, denoted $E_c$; (ii) a Carrier Decoder Network, denoted $D_c$; and (iii) a Message Decoder Network, denoted $D_m$. The model is schematically depicted in Figure 1A. The Encoder Network $E_c$ gets as input a carrier $\mathbf{C}$ and outputs a latent representation of the carrier, $E_c(\mathbf{C})$. Then, we compose a joint representation $\mathbf{H} = [E_c(\mathbf{C}); \mathbf{M}; \mathbf{C}]$ of the encoded carrier $E_c(\mathbf{C})$, the message $\mathbf{M}$, and the original carrier $\mathbf{C}$ by concatenating all three along the convolutional channel axis, as proposed in [2], where we denote the concatenation operator by $;$.
The Carrier Decoder Network, $D_c$, gets as input the aforementioned representation $\mathbf{H}$ and outputs $\hat{\mathbf{C}}$, the carrier embedded with the hidden message. Lastly, the Message Decoder Network, $D_m$, gets as input $\hat{\mathbf{C}}$ and outputs $\hat{\mathbf{M}}$, the reconstructed hidden message. Each of the above components is a neural network whose parameters are found by minimizing the absolute error between the carrier and the embedded carrier and between the original message and the reconstructed message.
At this point our architecture diverges from the one proposed in [1] by the addition of differentiable STFT layers. Recall that our goal is to transmit the embedded carrier waveform $\hat{\mathbf{c}}$, which means we need to recover the time-domain waveform from the magnitude $\hat{\mathbf{C}}$. Unfortunately, recovering $\hat{\mathbf{c}}$ from the STFT magnitude alone is an ill-posed problem in general [7, 6]. Ideally, we would like to reconstruct $\hat{\mathbf{c}}$ using the inverse-STFT operator, $\mathcal{S}^{\dagger}(\hat{\mathbf{C}}, \angle\hat{\mathbf{C}})$. However, the phase $\angle\hat{\mathbf{C}}$ is unknown and therefore must be approximated.
One way to overcome this phase recovery obstacle is to use the classical alternating projection algorithm of Griffin and Lim [8]. Unfortunately, this method produces a carrier with noticeable artifacts. The message, however, can be recovered that way and is intelligible.
Another way to reconstruct the time-domain signal is to use the magnitude of the embedded carrier, $\hat{\mathbf{C}}$, and the phase of the original carrier, $\angle\mathbf{C}$. In subjective tests we found that the reconstructed carrier, denoted as $\tilde{\mathbf{c}}$, sounds acoustically similar to the original carrier $\mathbf{c}$. However, when recovering the hidden message we get unintelligible output. This is because we used a mismatched phase.
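To mitigate that, we turn to a third solution, where we constrain the loss function by the STFT operator $\mathcal{S}$ and the inverse-STFT operator $\mathcal{S}^{\dagger}$. Formally, we minimize:

$$\mathcal{L}(\mathbf{C}, \mathbf{M}) = \lambda_c \, \| \mathbf{C} - \hat{\mathbf{C}} \|_1 + \lambda_m \, \| \mathbf{M} - D_m(\tilde{\mathbf{C}}) \|_1 ,$$

where $\tilde{\mathbf{C}} = \mathcal{S}(\tilde{\mathbf{c}})$, and $\tilde{\mathbf{c}} = \mathcal{S}^{\dagger}(\hat{\mathbf{C}}, \angle\mathbf{C})$. Here $\mathbf{C}$ and $\mathbf{M}$ are the carrier and message magnitudes, $\hat{\mathbf{C}}$ is the embedded carrier produced by the carrier decoder, $D_m$ is the message decoder, and $\lambda_c$ and $\lambda_m$ weight the carrier and message terms. Practically, we added the $\mathcal{S}$ and $\mathcal{S}^{\dagger}$ operators as differentiable 1D-convolution layers, as illustrated in Figure 1B. In words, we jointly optimize the model to generate $\hat{\mathbf{C}}$, which will preserve the hidden message after $\mathcal{S}^{\dagger}$ and $\mathcal{S}$ and will also resemble $\mathbf{C}$.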
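The objective described above, a weighted sum of two absolute-error terms, can be sketched as follows. This is an illustration under stated assumptions: the names `mean_abs_error` and `stego_loss` are hypothetical, the error is averaged over spectrogram entries (the normalization is a guess), the default weights are placeholders, and `M_hat` is assumed to have already been decoded from the carrier after its round trip through the inverse STFT and the STFT.

```python
def mean_abs_error(a, b):
    """Mean absolute error between two equally shaped 2-D lists (bins x frames)."""
    total = sum(abs(x - y) for row_a, row_b in zip(a, b) for x, y in zip(row_a, row_b))
    count = sum(len(row) for row in a)
    return total / count

def stego_loss(C, C_hat, M, M_hat, lambda_c=1.0, lambda_m=1.0):
    """Weighted carrier term plus weighted message term.

    lambda_c and lambda_m are placeholder weights, not values from the paper.
    """
    carrier_term = mean_abs_error(C, C_hat)   # embedded carrier should resemble the carrier
    message_term = mean_abs_error(M, M_hat)   # decoded message should resemble the message
    return lambda_c * carrier_term + lambda_m * message_term
```

Raising `lambda_m` relative to `lambda_c` would favor message fidelity over carrier fidelity, which is the trade-off the two weights control.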
The above approach can be naturally extended to conceal multiple messages. In that case, the model is provided with a single carrier $\mathbf{C}$ and a set of $k$ messages, $\{\mathbf{M}_i\}_{i=1}^{k}$. We explored two settings: (i) multiple message decoders, in which we use $k$ different message decoders, denoted by $\{D_m^{i}\}_{i=1}^{k}$, one for each message; and (ii) a single conditional decoder, in which we condition a single decoder on a set of codes $\{\mathbf{q}_i\}_{i=1}^{k}$. Each code $\mathbf{q}_i$ is represented as a one-hot vector of size $k$.
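The conditioning codes for the single-decoder setting are simple to construct. The sketch below builds them and shows one hypothetical way to feed a code to a convolutional decoder, by broadcasting it into constant channels that could be concatenated with the carrier. The text does not say how the conditioning is applied, so `condition_channels` is purely an assumption for illustration.

```python
def one_hot(index, k):
    """One-hot code of size k selecting message number `index` (0-based)."""
    if not 0 <= index < k:
        raise ValueError("index must satisfy 0 <= index < k")
    return [1.0 if j == index else 0.0 for j in range(k)]

def condition_channels(code, n_bins, n_frames):
    """Hypothetical conditioning: one constant (n_bins x n_frames) plane per code entry."""
    return [[[value] * n_frames for _ in range(n_bins)] for value in code]

# Codes for hiding three messages in one carrier.
codes = [one_hot(i, 3) for i in range(3)]
planes = condition_channels(codes[1], n_bins=4, n_frames=5)
```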
4 Experimental results
We evaluated our approach on the TIMIT [9] and YOHO [10] datasets using the standard train/val/test splits. Using both datasets lets us assess the model under various recording conditions. Each utterance was sampled at 16kHz and represented as its power spectrum by applying the STFT with FFT frequency bins and sliding window with a shift . Training examples were generated by randomly selecting one utterance as the carrier and $k$ other utterances as messages, for $k \in \{1, 3, 5\}$. Thus, the matching of carrier and message is completely arbitrary and not fixed. Further, the carrier and the messages may originate from different speakers.
All models were trained using Adam for 80 epochs with an initial learning rate of and a decaying factor of 10 every 20 epochs. We balanced between the carrier and message reconstruction losses using , . Each component in our model is implemented as a Gated Convolutional Neural Network as proposed by [11]. Specifically, $E_c$ is composed of three blocks of gated convolutions, $D_c$ of four blocks, and $D_m$ of six blocks. Each block contained 64 kernels of size $3 \times 3$. Sample waveforms of different models and experiments as well as the source code are available at http://hyperurl.co/ab7c3g.
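The step schedule described above (divide the learning rate by 10 every 20 epochs, for 80 epochs) can be written as a one-line function. The initial learning rate is not recoverable from the text, so `initial_lr` is left as a required argument; the function name is hypothetical.

```python
def lr_at_epoch(epoch, initial_lr, step=20, factor=10.0):
    """Learning rate after dividing by `factor` once per `step` completed epochs."""
    return initial_lr / factor ** (epoch // step)

# Relative schedule over the 80 training epochs, expressed as a multiple of the
# (unknown) initial learning rate.
relative_schedule = [lr_at_epoch(epoch, 1.0) for epoch in range(80)]
```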
We report results for the proposed approach together with [1, 2]. Additionally, we included a naive baseline, denoted Frequency Chop, in which we concatenated the lower half of the frequencies of the message $\mathbf{M}$ above the lower half of the frequencies of the carrier $\mathbf{C}$ to form $\hat{\mathbf{C}}$. Message decoding was performed by extracting the upper half of the frequencies from $\hat{\mathbf{C}}$ and zero-padding to the original size.
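The Frequency Chop baseline is easy to state in code. In this sketch a spectrogram is a list of rows ordered from the lowest to the highest frequency bin, and an even number of bins is assumed; the function names are hypothetical.

```python
def freq_chop_encode(carrier, message):
    """Keep the carrier's lower half and place the message's lower half above it."""
    half = len(carrier) // 2
    return [row[:] for row in carrier[:half]] + [row[:] for row in message[:half]]

def freq_chop_decode(embedded):
    """Take the upper half as the message's low band and zero-pad the missing high band."""
    half = len(embedded) // 2
    n_frames = len(embedded[0])
    low_band = [row[:] for row in embedded[half:]]
    padding = [[0.0] * n_frames for _ in range(len(embedded) - len(low_band))]
    return low_band + padding

# 4 frequency bins x 2 frames.
carrier = [[1.0, 1.0], [2.0, 2.0], [3.0, 3.0], [4.0, 4.0]]
message = [[5.0, 5.0], [6.0, 6.0], [7.0, 7.0], [8.0, 8.0]]
embedded = freq_chop_encode(carrier, message)
recovered = freq_chop_decode(embedded)
```

Both signals lose their upper half of the spectrum, and the message sits untouched in the carrier's high band, which is consistent with this baseline keeping message content while producing an audibly altered carrier.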
Results for concealing a single message are reported in Table 1: the Absolute-Error (AE) and Signal-to-Noise-Ratio (SNR) for both carrier and message of all baselines and the proposed models on TIMIT and YOHO.
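For reference, the two reported metrics could be computed as in the sketch below. The paper's exact definitions are not given in the text, so the formulas here are assumptions: the absolute error is averaged over samples, and the SNR is the ratio of reference energy to error energy, expressed in decibels.

```python
import math

def absolute_error(reference, estimate):
    """Mean absolute difference between a reference and its reconstruction."""
    return sum(abs(r - e) for r, e in zip(reference, estimate)) / len(reference)

def snr_db(reference, estimate):
    """10 * log10(signal energy / error energy); infinite for a perfect reconstruction."""
    signal = sum(r * r for r in reference)
    noise = sum((r - e) ** 2 for r, e in zip(reference, estimate))
    if noise == 0:
        return float("inf")
    return 10.0 * math.log10(signal / noise)

reference = [1.0, 1.0, 1.0, 1.0]
estimate = [0.9, 0.9, 0.9, 0.9]
ae = absolute_error(reference, estimate)
snr = snr_db(reference, estimate)
```

Under these definitions a lower absolute error and a higher SNR both indicate a closer reconstruction.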
Notice that while both [1] and [2] yield low carrier errors, their direct application to speech data produces unintelligible messages with a low SNR. This is because these models were not constrained to retain the same carrier content after the conversion to the time domain and back. Figure 2 depicts the training process of the proposed model and baselines. It can be seen that without any constraints, the baseline message decoders diverge. Lastly, Frequency Chop retains much of the message content after decoding, but creates a carrier with noticeable artifacts. This is because the hidden message is audible, as it resides in the carrier’s high frequencies.
Moreover, we explored adding adversarial loss terms between the carrier $\mathbf{C}$ and the embedded carrier $\hat{\mathbf{C}}$ to the optimization problem, as suggested by [2]. Similarly to the effect on images, incorporating the adversarial loss improved the carrier quality and reduced artifacts, but at the cost of less accurate message reconstruction.
Overall, the above results highlight the importance of modeling the time-frequency transformations in the context of steganographic models for the audio-domain.
Multiple messages.
Next, we further explore the capability of the proposed model for concealing several hidden messages. We analyzed the two settings described in Section 3, namely multiple decoders and single conditional decoder. Table 2 summarizes the results. The reported loss and SNR are averaged over the messages. Interestingly, both settings achieved comparable results for embedding 3 and 5 messages in a single carrier. An increase in the number of messages translates to higher loss values both for carrier and for messages. These results are to be expected as the model is forced to work at higher compression rates due to concealing and recovering more messages while keeping the carrier dimension the same.
5 Analysis
In this section we provide several evaluations regarding the quality of the embedded carrier and the recovered message. We start with a subjective analysis of the resulting waveforms.
5.1 Carrier ABX testing
To validate that the difference between the original carrier $\mathbf{c}$ and the embedded carrier $\hat{\mathbf{c}}$ is not detectable by humans, we performed ABX testing. We present each participant with two audio samples, A and B. Each of these two samples is either the original carrier or the carrier embedded with a hidden message. These two samples are followed by a third sample, X, randomly selected to be either A or B. Next, the participant must choose whether X is the same as A or B. We generated 50 audio samples (25 from TIMIT and 25 from YOHO), and for each audio sample we recorded 20 answers from Amazon Mechanical Turk (AMT), 1000 answers overall. Only 51.2% (48.8% for TIMIT and 53.6% for YOHO) of the carriers embedded with hidden messages could be distinguished from the original ones by humans (the optimal ratio is 50%). Therefore we conclude that the modifications made by the steganographic function are not detectable by the human ear.
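One way to check how close the ABX result is to chance is an exact two-sided binomial test. The sketch below treats the 1000 answers as independent fair-coin trials, which is a simplifying assumption (answers from the same listener or for the same sample may be correlated), and reads 51.2% of 1000 answers as 512 correct responses. The function name is hypothetical.

```python
from math import comb

def abx_two_sided_p(correct, total):
    """Exact two-sided binomial p-value against the chance rate of 0.5."""
    deviation = abs(correct - total / 2)
    # Sum the probability of every outcome at least as far from total / 2.
    tail = sum(comb(total, k) for k in range(total + 1)
               if abs(k - total / 2) >= deviation)
    return tail / 2 ** total

p_value = abx_two_sided_p(512, 1000)
```

A p-value well above conventional significance thresholds means the observed rate is statistically compatible with guessing, which supports the reading that listeners could not tell the carriers apart.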
5.2 Message intelligibility
A major metric in evaluating a speech steganography system is the intelligibility of the reconstructed messages. To quantify this measure we conducted an additional subjective experiment on AMT. We generated 40 samples from the TIMIT dataset: 20 original messages and 20 messages reconstructed by our model. We used TIMIT for this task since it contains a richer vocabulary than YOHO. We recorded 20 answers for each sample (800 answers overall). The participants were instructed to transcribe the presented samples, and the Word Error Rate (WER) and Character Error Rate (CER) were measured. While WER is a coarse measure, CER provides a finer evaluation of transcription error. The CER/WER measured on original and reconstructed messages were 5.1%/2.86% and 5.15%/2.78%, respectively. We therefore deduce that our system does not degrade the intelligibility of the speech signal.
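WER and CER are both edit distances normalized by the reference length, computed over words and characters respectively. The sketch below uses the standard Levenshtein distance; the example sentences are made up for illustration and are not taken from the experiment.

```python
def edit_distance(reference, hypothesis):
    """Levenshtein distance (substitutions, insertions, deletions) between two sequences."""
    previous = list(range(len(hypothesis) + 1))
    for i, ref_item in enumerate(reference, 1):
        current = [i]
        for j, hyp_item in enumerate(hypothesis, 1):
            current.append(min(previous[j] + 1,                            # deletion
                               current[j - 1] + 1,                         # insertion
                               previous[j - 1] + (ref_item != hyp_item)))  # substitution
        previous = current
    return previous[-1]

def wer(reference, hypothesis):
    """Word error rate: word-level edit distance divided by the reference word count."""
    ref_words = reference.split()
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)

def cer(reference, hypothesis):
    """Character error rate: character-level edit distance divided by the reference length."""
    return edit_distance(list(reference), list(hypothesis)) / len(reference)

example_wer = wer("she had your dark suit", "she had dark suit")
example_cer = cer("suit", "suite")
```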
5.3 Speaker recognition
An advantage of concealing speech rather than text is the preservation of non-lexical content such as speaker identity. To evaluate this we conducted both human and automatic evaluations, adhering to the Speaker Verification Protocol [12]. Given four speech segments, where the first three were uttered by a single speaker and the fourth was uttered by either the same speaker or a different one, the goal is to verify whether the speaker in the fourth sample is the same as in the first three (we use speakers of the same gender to make the task of speaker differentiation more challenging). For the human evaluation, we recorded 400 human answers; in 82% of cases, listeners were able to tell whether the speaker in the fourth sample matched the speaker in the first three. In the automatic evaluation setup, we used the automatic speaker verification system proposed by [12]. The Equal Error Rate (EER) of the system is 18% (82% accuracy) on the generated messages and 15% (85% accuracy) on the original messages. Hence, we deduce that much of the speaker identity information is preserved in the generated messages.
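The EER is the operating point at which the false-acceptance and false-rejection rates of a verification system are equal. A minimal sketch of how it could be estimated from similarity scores follows; the threshold sweep over observed scores and the made-up score lists are assumptions for illustration, not the protocol of [12].

```python
def equal_error_rate(genuine_scores, impostor_scores):
    """Approximate EER: sweep thresholds and return the point where FAR and FRR are closest."""
    best = None
    for threshold in sorted(set(genuine_scores) | set(impostor_scores)):
        # A trial is accepted when its score is at or above the threshold.
        frr = sum(s < threshold for s in genuine_scores) / len(genuine_scores)
        far = sum(s >= threshold for s in impostor_scores) / len(impostor_scores)
        candidate = (abs(far - frr), (far + frr) / 2)
        if best is None or candidate < best:
            best = candidate
    return best[1]

# Made-up scores: higher means "more likely the same speaker".
genuine = [0.9, 0.8, 0.7, 0.4]
impostor = [0.1, 0.2, 0.3, 0.6]
eer = equal_error_rate(genuine, impostor)
```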
5.4 Robustness to channel distortion
Another critical evaluation is performance under noisy conditions. To explore this we applied different channel distortion and compression techniques to the reconstructed carrier $\tilde{\mathbf{c}}$. In Table 3 we describe message reconstruction results after distorting the carrier using: 16kHz to 8kHz down-sampling, MP3 compression (using different bit rates), 16-bit to 8-bit precision reduction, Additive White Gaussian Noise (AWGN), and Speckle noise. Results suggest that our method is robust to carrier down-sampling, MP3 compression, and noise addition. In contrast, the model is sensitive to a change in bit precision, but this is to be expected as the message decoder relies on minuscule carrier modifications in order to reconstruct the hidden message.
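Several of the distortions listed above are simple to simulate on a waveform stored as floats in [-1, 1]. The sketch below is illustrative only: the noise levels are free parameters (the values used in the experiments are not recoverable from the text), speckle noise is modeled as multiplicative Gaussian noise, and the down-sampling is naive decimation without the anti-aliasing filter a real resampler would apply. MP3 compression needs an external codec and is omitted.

```python
import random

def awgn(x, sigma, seed=0):
    """Additive white Gaussian noise with standard deviation sigma."""
    rng = random.Random(seed)
    return [v + rng.gauss(0.0, sigma) for v in x]

def speckle(x, sigma, seed=0):
    """Multiplicative (speckle) noise: each sample is scaled by 1 + Gaussian noise."""
    rng = random.Random(seed)
    return [v * (1.0 + rng.gauss(0.0, sigma)) for v in x]

def reduce_precision(x, bits=8):
    """Quantize samples in [-1, 1] to signed integers with the given bit depth."""
    levels = 2 ** (bits - 1)
    return [max(-levels, min(levels - 1, round(v * levels))) / levels for v in x]

def decimate_by_two(x):
    """Naive 16 kHz -> 8 kHz down-sampling (no anti-aliasing filter)."""
    return x[::2]

waveform = [0.0, 0.3, -0.7, 0.5]
quantized = reduce_precision(waveform, bits=8)
```

With 8 bits the quantization step is 1/128, so any carrier modification smaller than about half that step can be rounded away, which is consistent with the decoder's sensitivity to precision reduction.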
6 Related work
A large variety of steganography methods have been proposed over the years, most of which are applied to images [3, 4]. Traditionally, steganographic functions exploited actual or perceptual redundancies in the carrier signal. The most common approach is to encode the secret message in the least significant bits of individual signal samples [13]. Other methods include concealing the secret message in the phase of the frequency components of the carrier [14] or in the form of the parameters of a minuscule echo that is introduced into the carrier signal [15].
Recently, neural networks have been widely used for steganography [1, 2, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25]. The authors in [1] first suggested training neural networks to hide an entire image within another image (similarly to Figure 1A). [2] extended the work of [1] by adding an adversarial loss term to the objective. [16] suggested using generative adversarial learning to generate steganographic images. However, none of the above approaches explored speech data, and all focused on hiding a single message only.
A closely related task is watermarking. Both approaches aim to encode a secret message into a data file. However, in steganography the goal is to perform secret communication, while in watermarking the goal is verification and ownership protection. Several watermarking techniques use LSB encoding [26, 27]. Recently, [28, 29] suggested embedding watermarks into neural network parameters.
7 Discussion and future work
In this work we show that the recently proposed deep learning models for image steganography are less suitable for audio data. We show that in order to utilize such models, time-domain transformations must be addressed during training. Moreover, we extend the general deep-learning steganography approach to hide multiple messages. We evaluated our model under several noisy conditions and showed empirically that such modifications to carriers are indistinguishable by humans and the messages recovered by our model are highly intelligible. Finally, we demonstrated that voice speaker verification is a viable means of authentication for hidden speech messages.
For future work we would like to explore the ability of such steganographic methods to evade detection by steganalysis algorithms, and incorporate such evasion capabilities as part of the training pipeline.
References
- [1] Shumeet Baluja, “Hiding images in plain sight: Deep steganography,” in Advances in Neural Information Processing Systems, 2017, pp. 2069–2079.
- [2] Jiren Zhu, Russell Kaplan, Justin Johnson, and Li Fei-Fei, “HiDDeN: Hiding data with deep networks,” in European Conference on Computer Vision. Springer, 2018, pp. 682–697.
- [3] Tayana Morkel, Jan HP Eloff, and Martin S Olivier, “An overview of image steganography,” in ISSA, 2005, pp. 1–11.
- [4] GC Kessler, “An overview of steganography for the computer forensics examiner,” 2004 (retrieved February 26, 2006).
- [5] Jae Soo Lim and Alan V Oppenheim, “Enhancement and bandwidth compression of noisy speech,” Proceedings of the IEEE, vol. 67, no. 12, pp. 1586–1604, 1979.
- [6] Kishore Jaganathan, Yonina C Eldar, and Babak Hassibi, “STFT phase retrieval: Uniqueness guarantees and recovery algorithms,” IEEE Journal of Selected Topics in Signal Processing, vol. 10, no. 4, pp. 770–781, 2016.
- [7] E Hofstetter, “Construction of time-limited functions with specified autocorrelation functions,” IEEE Transactions on Information Theory, vol. 10, no. 2, pp. 119–126, 1964.
- [8] Daniel Griffin and Jae Lim, “Signal estimation from modified short-time Fourier transform,” IEEE Transactions on Acoustics, Speech, and Signal Processing, vol. 32, no. 2, pp. 236–243, 1984.
