2019 Guide to Deep Learning for Speech Synthesis
Text | Li Feng
Editor's note from Leifeng.com AI Technology Review:
The artificial production of human speech is known as speech synthesis.
This machine learning-based technology is applicable to text-to-speech conversion, music generation, speech generation, speech-enabled devices, navigation systems, and accessibility services for the visually impaired.
In this article, we will look at deep learning-based research and model frameworks for speech synthesis.
Before we get started, we need to briefly outline some specific, traditional speech synthesis strategies: concatenation and parameterization.
The concatenative approach stitches together speech segments from a large recorded database to produce new audible speech. Producing a different speech style requires a new audio database, which greatly limits the scalability of this approach.
The parametric approach uses recorded human speech together with a parametric function, changing the output voice by adjusting the function's parameters.
These two methods represent traditional approaches to speech synthesis. Now let’s look at newer approaches using deep learning. To explore currently popular approaches to speech synthesis, we looked at these:
- WaveNet: A Generative Model for Raw Audio
- Tacotron: Towards End-to-End Speech Synthesis
- Deep Voice 1: Real-time Neural Text-to-Speech
- Deep Voice 2: Multi-Speaker Neural Text-to-Speech
- Deep Voice 3: Scaling Text-to-Speech with Convolutional Sequence Learning
- Parallel WaveNet: Fast High-Fidelity Speech Synthesis
- Neural Voice Cloning with a Few Samples
- VoiceLoop: Voice Fitting and Synthesis via a Phonological Loop
- Natural TTS Synthesis by Conditioning WaveNet on Mel Spectrogram Predictions
The authors of the WaveNet paper are from Google DeepMind. They proposed a neural network that generates raw audio waveforms. Their model is fully probabilistic and autoregressive, and achieved state-of-the-art results in text-to-speech for both English and Mandarin Chinese.
Article link: https://arxiv.org/abs/1609.03499
WaveNet is an audio generation model based on PixelCNN that can produce sounds similar to those made by humans.
In this generative model, each audio sample is conditioned on the previous audio samples. The conditional probabilities are modeled using a set of convolutional layers. This network has no pooling layers, and the output of the model has the same temporal dimension as the input.
The use of causal convolutions in the model architecture ensures that the model does not violate the ordering of the data. In this model, each predicted audio sample is fed back into the network to help predict the next one. Because causal convolutions have no recurrent connections, they are faster to train than RNNs.
One of the main challenges of causal convolutions is that they require many layers to increase the receptive field. To address this, the authors used dilated convolutions, which allow a network with only a few layers to have a much larger receptive field. The model uses a softmax distribution to model the conditional distribution over each audio sample.
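To make the idea of dilated causal convolutions concrete, here is a minimal sketch in PyTorch. It is not the official WaveNet implementation; the channel counts, layer count, and toy input are illustrative assumptions. Left-padding each convolution keeps it causal, and doubling the dilation at every layer grows the receptive field exponentially.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DilatedCausalStack(nn.Module):
    """Stack of dilated causal 1-D convolutions (toy sizes, not the real WaveNet)."""
    def __init__(self, channels=32, layers=6, kernel_size=2):
        super().__init__()
        self.convs = nn.ModuleList()
        self.left_pads = []
        for i in range(layers):
            dilation = 2 ** i  # 1, 2, 4, ... -> receptive field grows exponentially
            self.left_pads.append((kernel_size - 1) * dilation)
            self.convs.append(nn.Conv1d(channels, channels, kernel_size, dilation=dilation))
        # 256-way softmax over 8-bit (mu-law) quantized sample values
        self.to_logits = nn.Conv1d(channels, 256, kernel_size=1)

    def forward(self, x):  # x: (batch, channels, time), e.g. embedded audio samples
        for pad, conv in zip(self.left_pads, self.convs):
            # pad only on the left so each output depends solely on past samples (causality)
            x = torch.relu(conv(F.pad(x, (pad, 0))))
        return self.to_logits(x)  # (batch, 256, time): logits for the next sample

logits = DilatedCausalStack()(torch.randn(1, 32, 16000))  # toy input
```

During generation, each sample would be drawn from the softmax over these 256 logits and fed back into the network to predict the next one.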
The model was evaluated on multi-speaker speech generation, text-to-speech conversion, music audio modeling, and more. The tests used the mean opinion score (MOS), a measure of sound quality that is essentially equivalent to a human rating of how the audio sounds. It is a number between 1 and 5, where 5 indicates the best quality.
The figure below shows the speech quality of WaveNet on the 1-5 MOS scale.
The authors of the Tacotron paper are from Google. Tacotron is an end-to-end generative text-to-speech model that synthesizes speech directly from characters, given pairs of text and audio for training. Tacotron achieved a mean opinion score of 3.82 on US English. Because Tacotron generates speech at the frame level, it is faster than sample-level autoregressive methods.
Article link: https://arxiv.org/abs/1703.10135
The model is trained on pairs of audio and text, so it can easily be applied to new datasets. Tacotron is a seq2seq model that includes an encoder, an attention-based decoder, and a post-processing net. As shown in the framework diagram below, the model takes characters as input and outputs a raw spectrogram, which is then converted into a waveform.
The figure below shows the structure of the CBHG module. It consists of a bank of 1-D convolutional filters, highway networks, and a bidirectional GRU (gated recurrent unit).
The character sequence is fed into the encoder, which extracts a sequential representation of the text. Each character is represented as a one-hot vector and embedded into a continuous vector. A non-linear transformation (a "pre-net") is then applied, followed by dropout to reduce overfitting. In practice, this helps reduce the mispronunciation of words.
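Here is a minimal sketch of that character embedding plus pre-net step in PyTorch. The vocabulary size is an assumption and the layer sizes only follow the general shape of the Tacotron pre-net; this is not the authors' code.

```python
import torch
import torch.nn as nn

class CharPrenet(nn.Module):
    """Character embedding followed by a pre-net (non-linearity + dropout)."""
    def __init__(self, n_chars=70, embed_dim=256, p_drop=0.5):
        super().__init__()
        self.embed = nn.Embedding(n_chars, embed_dim)   # one-hot index -> continuous vector
        self.prenet = nn.Sequential(
            nn.Linear(embed_dim, 256), nn.ReLU(), nn.Dropout(p_drop),
            nn.Linear(256, 128), nn.ReLU(), nn.Dropout(p_drop),
        )

    def forward(self, char_ids):        # char_ids: (batch, seq_len) of character indices
        return self.prenet(self.embed(char_ids))   # (batch, seq_len, 128)

reps = CharPrenet()(torch.randint(0, 70, (2, 20)))  # toy batch of 2 sequences, 20 chars each
```

The output sequence would then be passed to the CBHG encoder described above.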
The decoder is a content-based tanh attention decoder. The Griffin-Lim algorithm is then used to generate a waveform from the predicted spectrogram; a usage sketch follows below. The hyperparameters used in the model are shown below.
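Since the waveform is reconstructed with Griffin-Lim, here is a minimal usage sketch with librosa. The test tone, STFT parameters, and iteration count are illustrative assumptions; the 1.5 magnitude exponent is the value the Tacotron paper mentions for reducing artifacts before inversion.

```python
import numpy as np
import librosa

# Any magnitude spectrogram works for illustration; here we build one from a test tone.
y = librosa.tone(440, sr=22050, duration=1.0)
S = np.abs(librosa.stft(y, n_fft=1024, hop_length=256))   # magnitude spectrogram

# Griffin-Lim iteratively estimates the missing phase and inverts the STFT.
# Tacotron raises the predicted magnitudes to a power (~1.5) before inversion.
y_rec = librosa.griffinlim(S ** 1.5, n_iter=60, hop_length=256, n_fft=1024)
```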
The figure below shows the performance advantage of Tacotron compared to other alternatives.
The authors of this article are from Baidu's Silicon Valley Artificial Intelligence Lab. Deep Voice is a text-to-speech system built with deep neural networks.
Article link: https://arxiv.org/abs/1702.07825
It has five important components:
- A segmentation model for locating phoneme boundaries (a deep neural network trained with the connectionist temporal classification (CTC) loss);
- A grapheme-to-phoneme conversion model (grapheme-to-phoneme conversion generates word pronunciations according to certain rules);
- A phoneme duration prediction model;
- A fundamental frequency prediction model;
- An audio synthesis model (a WaveNet variant with fewer parameters).
The grapheme-to-phoneme model converts English characters into phonemes. The segmentation model identifies where each phoneme begins and ends in the audio file. The phoneme duration model predicts the duration of each phoneme in a phoneme sequence.
The fundamental frequency model predicts whether a phoneme is voiced and, if so, its fundamental frequency. The audio synthesis model combines the outputs of the grapheme-to-phoneme model, the phoneme duration model, and the fundamental frequency prediction model to synthesize the audio; a sketch of how these pieces fit together at inference time is shown below.
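The following is a hypothetical sketch of how the components could be chained at inference time. The stub functions and their return values are placeholders standing in for the trained models described above, not the paper's actual API.

```python
# Placeholder stubs standing in for the trained Deep Voice models.
def grapheme_to_phoneme(text):
    lexicon = {"HELLO": ["HH", "AH", "L", "OW"]}           # toy lexicon
    return [p for w in text.upper().split() for p in lexicon.get(w, ["UNK"])]

def predict_durations(phonemes):
    return [0.08 for _ in phonemes]                         # seconds per phoneme (dummy)

def predict_f0(phonemes, durations):
    return [(True, 120.0) for _ in phonemes]                # (voiced?, F0 in Hz) (dummy)

def synthesize_audio(phonemes, durations, f0):
    return b""                                              # a real system returns a waveform

def synthesize(text):
    phonemes = grapheme_to_phoneme(text)                    # grapheme-to-phoneme model
    durations = predict_durations(phonemes)                 # phoneme duration model
    f0 = predict_f0(phonemes, durations)                    # fundamental frequency model
    # (the segmentation model is used only at training time, to align phonemes
    #  with audio and provide duration targets for the models above)
    return synthesize_audio(phonemes, durations, f0)        # WaveNet-variant vocoder

synthesize("hello")
```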
Here is how it compares to other models:
This article is the second iteration of Baidu Silicon Valley AI Lab's work on Deep Voice. They introduced a method to enhance neural text-to-speech using low-dimensional trainable speaker embeddings, which can generate different voices from a single model.
The model has a pipeline similar to Deep Voice 1, but it offers significant improvements in audio quality. The model is able to learn hundreds of unique voices from less than half an hour of speech data per speaker.
Article link: https://arxiv.org/abs/1705.08947
The authors also introduce a WaveNet-based neural vocoder for spectrogram-to-audio conversion and combine it with Tacotron, replacing Griffin-Lim audio generation. The focus of this paper is handling multiple speakers with very little data per speaker. The architecture is similar to Deep Voice 1, and the training process is shown in the figure below.
The main difference between Deep Voice 2 and Deep Voice 1 is the separation of the phoneme duration model and the frequency model. Deep Voice 1 had a single model for jointly predicting phoneme duration and frequency profiles; whereas in Deep Voice 2, phoneme durations are predicted first and then used as input to the frequency model.
The segmentation model in Deep Voice 2 uses a convolutional-recurrent architecture (trained with the connectionist temporal classification (CTC) loss) to classify phoneme pairs. The main modifications in Deep Voice 2 are the addition of batch normalization and residual connections in the convolutional layers. Its vocal (audio synthesis) model is based on the WaveNet architecture.
Speech synthesis for multiple speakers is achieved by augmenting each model with a single low-dimensional speaker embedding vector per speaker. Weight sharing between speakers is achieved by storing the speaker-dependent parameters in a very low-dimensional vector.
The initial state of the recurrent neural network (RNN) is generated from the speaker embedding. Speaker embeddings are randomly initialized from a uniform distribution and trained jointly via backpropagation. They are incorporated into multiple parts of the model to ensure that each speaker's voice characteristics are taken into account. A minimal sketch of this kind of speaker conditioning follows.
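The sketch below shows one way a low-dimensional, jointly trained speaker embedding can set the initial state of a recurrent layer. It is an assumed illustration in PyTorch, not the paper's code; all sizes are placeholders.

```python
import torch
import torch.nn as nn

class SpeakerConditionedGRU(nn.Module):
    """Recurrent layer whose initial state comes from a trainable speaker embedding."""
    def __init__(self, n_speakers, embed_dim=16, hidden=128, feat_dim=80):
        super().__init__()
        self.speaker_embed = nn.Embedding(n_speakers, embed_dim)  # trained jointly by backprop
        self.to_init_state = nn.Linear(embed_dim, hidden)         # embedding -> initial RNN state
        self.gru = nn.GRU(feat_dim, hidden, batch_first=True)

    def forward(self, features, speaker_id):
        e = self.speaker_embed(speaker_id)                        # (batch, embed_dim)
        h0 = torch.tanh(self.to_init_state(e)).unsqueeze(0)       # (1, batch, hidden)
        out, _ = self.gru(features, h0)
        return out

model = SpeakerConditionedGRU(n_speakers=100)
out = model(torch.randn(4, 50, 80), torch.tensor([0, 3, 7, 9]))   # 4 utterances, 4 speakers
```

The same embedding vector can also be injected into other parts of the pipeline (duration, frequency, and vocal models), which is how the paper shares most weights across speakers.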
Next, let's see how it performs compared to other models:
Article link: https://arxiv.org/abs/1710.07654
The authors of this article propose a fully convolutional character-to-spectrogram architecture that enables fully parallel computation. The architecture is an attention-based sequence-to-sequence model, trained on the LibriSpeech ASR dataset.
The model converts textual features such as characters, phonemes, and stress into various vocoder parameters, including mel-band spectrograms, linear-scale log-magnitude spectrograms, fundamental frequency, spectral envelope, and aperiodicity parameters. These vocoder parameters are then used as input to an audio waveform synthesis model.
The model consists of the following parts:
- Encoder: a fully convolutional encoder that converts textual features into an internal learned representation.
- Decoder: a fully convolutional causal decoder that decodes the learned representation in an autoregressive manner.
- Converter: a fully convolutional post-processing network that predicts the final vocoder parameters.
For text preprocessing, the authors uppercase all input characters, remove intermediate punctuation marks, end every utterance with a period or question mark, and replace spaces between words with special separator characters that indicate pause length.
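Here is a minimal sketch of that normalization step. The regular expression and the "%" pause marker are illustrative assumptions; the paper uses several separator characters to encode different pause durations.

```python
import re

def normalize(text, pause="%"):
    """Toy text normalization in the spirit of Deep Voice 3's preprocessing."""
    text = text.upper()                           # uppercase all characters
    text = re.sub(r"[^A-Z0-9\s.?']", "", text)    # remove intermediate punctuation
    if not text.rstrip().endswith((".", "?")):
        text = text.rstrip() + "."                # end with a period or question mark
    return re.sub(r"\s+", pause, text)            # spaces -> pause marker

print(normalize("Hello, world"))                  # -> "HELLO%WORLD."
```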
The figure below compares the performance of this model with other alternative models.
The authors of this article are from Google DeepMind. They introduce a method called probability density distillation, which trains a parallel feed-forward network from a trained WaveNet. The method combines the best features of Inverse Autoregressive Flows (IAFs) and WaveNet: the efficient training of WaveNet and the efficient sampling of IAF networks.
Article link: https://arxiv.org/abs/1711.10433
To train efficiently, the authors use an already-trained WaveNet as a "teacher" from which a parallel "student" WaveNet learns. The goal is for the student to match the probability of its own samples under the distribution learned by the teacher.
The authors also propose additional loss functions to guide the student to generate high-quality audio streams:
- Power loss: ensures that the power in different frequency bands of the generated speech matches that of human speech (see the sketch after this list).
- Perceptual loss: the authors tried a feature reconstruction loss (the Euclidean distance between feature maps of a classifier) and a style loss (the Euclidean distance between Gram matrices), and found that the style loss produced better results.
- Contrastive loss: penalizes waveforms that have high likelihood regardless of the conditioning vector.
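As referenced in the first item above, here is a minimal sketch of a power-style loss: the average spectral power per frequency band of the student's output is matched to that of the reference audio. The STFT parameters are assumptions and this is an illustration of the idea, not DeepMind's implementation.

```python
import torch

def power_loss(student_wave, reference_wave, n_fft=1024, hop=256):
    """Match average power per frequency band between generated and reference audio."""
    def band_power(wave):
        spec = torch.stft(wave, n_fft=n_fft, hop_length=hop,
                          window=torch.hann_window(n_fft, device=wave.device),
                          return_complex=True)
        return (spec.abs() ** 2).mean(dim=-1)     # average power per frequency band
    return torch.mean((band_power(student_wave) - band_power(reference_wave)) ** 2)

loss = power_loss(torch.randn(2, 16000), torch.randn(2, 16000))   # toy batch of waveforms
```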
The following figure shows the performance of this model:
According to Leifeng.com, the authors of this article are from Baidu Research. They introduce a neural voice cloning system that learns to synthesize a person's voice from only a few audio samples.
The system uses two approaches: speaker adaptation, achieved by fine-tuning a multi-speaker generative model, and speaker encoding, achieved by training a separate model to infer a new speaker embedding that is fed directly into the multi-speaker generative model.
Article link: https://arxiv.org/abs/1802.06006v3
The paper uses Deep Voice 3 as the baseline multi-speaker model. Voice cloning extracts a speaker's voice characteristics and then generates audio for a given text based on those characteristics.
The quality of the generated audio is measured by the naturalness of the speech and its similarity to the original speaker's voice. The authors propose a speaker encoding method that predicts the speaker embedding directly from audio samples of unseen speakers. A toy sketch contrasting the two cloning approaches follows.
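The sketch below contrasts the two strategies using toy stand-ins: a placeholder "generator" and "speaker encoder" built from small linear layers. None of this is the paper's model; it only illustrates that adaptation optimizes an embedding (and/or the generator) per new speaker, while encoding predicts the embedding in one forward pass.

```python
import torch
import torch.nn as nn

# Toy stand-ins: "tts" maps (text features + speaker embedding) -> acoustic features,
# and "speaker_encoder" maps cloning audio -> a speaker embedding.
tts = nn.Linear(8 + 16, 32)
speaker_encoder = nn.Sequential(nn.Linear(100, 64), nn.ReLU(), nn.Linear(64, 16))

cloning_audio = torch.randn(5, 100)        # a few short samples of the new voice
text_feats = torch.randn(5, 8)
target = torch.randn(5, 32)                # stands in for ground-truth acoustic features

# 1) Speaker adaptation: start from a blank embedding and fine-tune it by gradient
#    descent on the cloning samples (the generator could be fine-tuned too).
embedding = torch.zeros(1, 16, requires_grad=True)
opt = torch.optim.Adam([embedding], lr=1e-2)
for _ in range(100):
    pred = tts(torch.cat([text_feats, embedding.expand(5, -1)], dim=1))
    loss = ((pred - target) ** 2).mean()
    opt.zero_grad(); loss.backward(); opt.step()

# 2) Speaker encoding: a separately trained encoder predicts the embedding directly
#    from the cloning audio; no per-speaker optimization is needed at cloning time.
embedding2 = speaker_encoder(cloning_audio).mean(dim=0, keepdim=True)
```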
The figure below shows the performance of voice cloning:
The authors of this article are from Facebook AI Research. They introduce a neural text-to-speech (TTS) technique that can convert text to speech using voices sampled in the wild.
Article link: https://arxiv.org/abs/1707.06588
VoiceLoop is inspired by a working memory model called the phonological loop, which stores language information for a short period of time. It consists of two parts: a phonological store that is constantly replaced and a rehearsal process that maintains longer-term representations in the phonological store.
VoiceLoop constructs a phonological store by treating a shifting memory buffer as a matrix. A sentence is represented as a list of phonemes, each of which is encoded as a short vector. At each time step, the current context vector is generated by weighting the phoneme encodings and summing them.
Some of the properties that make VoiceLoop stand out include its use of a memory buffer instead of a conventional RNN, memory sharing between all processes, and the use of shallow fully connected networks for all computations. A minimal sketch of the buffer update is shown below.
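The following NumPy sketch illustrates the moving buffer and the attention-weighted context described above. The dimensions, the uniform attention weights, and the update function are assumptions for illustration; in VoiceLoop the new buffer entry and the attention weights are produced by shallow fully connected networks.

```python
import numpy as np

def attend(phoneme_codes, weights):
    # context vector = attention-weighted sum of the phoneme encodings
    return (weights[:, None] * phoneme_codes).sum(axis=0)

def buffer_step(buffer, phoneme_codes, weights, new_vector_fn):
    context = attend(phoneme_codes, weights)                  # (d,)
    u = new_vector_fn(buffer, context)                        # new entry for the buffer
    # shift the buffer: drop the oldest column, append the new vector (FIFO matrix)
    return np.concatenate([buffer[:, 1:], u[:, None]], axis=1)

d, k, n_phonemes = 64, 10, 20
buffer = np.zeros((d, k))                                     # d-dim slots, k time steps
phoneme_codes = np.random.randn(n_phonemes, d)                # one short vector per phoneme
weights = np.full(n_phonemes, 1.0 / n_phonemes)               # uniform attention (toy)
new_vec = lambda buf, ctx: np.tanh(buf.mean(axis=1) + ctx)    # stand-in for the shallow net
buffer = buffer_step(buffer, phoneme_codes, weights, new_vec)
```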
The following figure shows how the model performs compared to other alternatives:
The authors are from Google and UC Berkeley. They introduce Tacotron 2, a neural network architecture for text-to-speech synthesis.
Article link: https://arxiv.org/abs/1712.05884
It consists of a recurrent sequence-to-sequence feature prediction network that maps character embeddings to mel-scale spectrograms, followed by a modified WaveNet model acting as a vocoder that synthesizes time-domain waveforms from those spectrograms. The model achieved a mean opinion score (MOS) of 4.53. A sketch of the mel-spectrogram intermediate representation follows.
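To show what the intermediate representation looks like, here is a minimal sketch that computes a log mel-scale spectrogram with librosa. The 80 mel channels match the paper, but the test tone, sample rate, FFT size, and hop length here are assumptions rather than the paper's exact settings.

```python
import numpy as np
import librosa

y = librosa.tone(440, sr=22050, duration=1.0)                 # toy audio signal
mel = librosa.feature.melspectrogram(y=y, sr=22050, n_fft=1024,
                                     hop_length=256, n_mels=80)
log_mel = np.log(np.clip(mel, a_min=1e-5, a_max=None))        # log dynamic range compression
print(log_mel.shape)                                          # (80, frames): vocoder input
```

In Tacotron 2, the feature prediction network outputs frames like these from text, and the WaveNet vocoder turns them back into a waveform.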
This model combines the best features of Tacotron and WaveNet. Here is a comparison of its performance with other models:
The current speech synthesis technology is developing very fast, and we hope to catch up with the most cutting-edge research as soon as possible.
The above articles are the most important representatives of the current progress in the field of speech synthesis. The papers and their code implementations can be found online. We hope you can download them for testing and get the expected results.
Let us create a colorful voice world together.
Original link:
https://heartbeat.fritz.ai/a-2019-guide-to-speech-synthesis-with-deep-learning-630afcafb9dd