Self-Supervised Solution to the Control Problem of Articulatory Synthesis
Interspeech 2023
Paul Konstantin Krug, Peter Birkholz, Branislav Gerazov, Daniel Rudolph van Niekerk, Anqi Xu, Yi Xu
Abstract
Given an articulatory-to-acoustic forward model, it is a priori unknown how its motor control must be operated to achieve a desired acoustic result. This control problem is a fundamental issue of articulatory speech synthesis and the cradle of acoustic-to-articulatory inversion, a discipline that attempts to address the issue by a variety of methods. This work presents an end-to-end solution to the articulatory control problem, in which synthetic motor trajectories of Monte-Carlo-generated artificial speech are linked to input modalities (such as natural speech recordings or phoneme sequence input) via speaker-independent latent representations of a vector-quantized variational autoencoder. The proposed method is self-supervised and thus, in principle, independent of the synthesizer and speaker model.
TensorTract Functionality
TensorTract is a compound model of multiple deep neural networks. It can perform a number of tasks, including acoustic-to-articulatory inversion (AAI), phoneme-to-articulatory conversion (P2A), and articulatory-to-acoustic neural speech synthesis. For AAI and P2A, TensorTract produces articulatory trajectories that are compatible with the state-of-the-art articulatory speech synthesizer VocalTractLab (VTL).
The centerpiece of TensorTract is a vector-quantized variational autoencoder (VQ-VAE), which maps log-mel spectrograms via a temporal convolutional network (TCN) encoder to a speaker-independent quantized latent representation, and then back to log-mel features via a speaker- and pitch-conditioned TCN decoder. The VQ-VAE is trained on both natural and synthetic speech, where the latter is generated randomly with VTL. In subsequent training stages, the synthetic articulatory trajectories are mapped to the quantized latent via a multi-head attention (MHA)-BiGRU forward model M2L (motor-to-latent) and a corresponding inverse model L2M (latent-to-motor). This creates a link between natural speech data and synthetic articulatory trajectories. Further, phoneme annotations of the natural speech are mapped to the latent via an MHA+TCN-based forward model P2L (phoneme-to-latent). This effectively enables control of VTL via natural speech and/or phoneme sequence input.
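As a rough, framework-agnostic illustration of the quantization step at the heart of a VQ-VAE (the function name, shapes, and toy data below are ours, not part of TensorTract), each continuous encoder frame is snapped to its nearest codebook vector, yielding the discrete latent that the M2L, L2M, and P2L models read and write:

```python
import numpy as np

def vq_quantize(z_e, codebook):
    """Map continuous encoder outputs to their nearest codebook vectors.

    z_e:      (T, D) continuous latent frames from the encoder
    codebook: (K, D) learned embedding vectors
    Returns the quantized frames (T, D) and the chosen code indices (T,).
    """
    # Squared Euclidean distance between every frame and every code vector.
    d = ((z_e[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)  # (T, K)
    idx = d.argmin(axis=1)          # nearest code per frame
    return codebook[idx], idx       # discrete latent + index sequence

# Toy example: 4 frames, 2-dim latent, 3 codes.
rng = np.random.default_rng(0)
codebook = rng.normal(size=(3, 2))
z_e = codebook[[0, 2, 2, 1]] + 0.01 * rng.normal(size=(4, 2))
z_q, idx = vq_quantize(z_e, codebook)
print(idx.tolist())  # → [0, 2, 2, 1] (small noise keeps frames near their codes)
```

In training, the codebook and encoder are learned jointly (with a straight-through gradient estimator); the sketch only shows the inference-time lookup.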
The following abbreviations have been used in the tables below:
- P2L+V: The phoneme sequence was mapped to the latent using the P2L model, then the latent was decoded using the L2M model. The resulting articulatory trajectories were synthesized with VTL.
- VTL (Rule-based): Speech was synthesized using the rule-based phoneme-to-speech functionality of VTL, which is based on vocal tract state presets derived from magnetic resonance imaging data.
- M2L+H: The motor trajectories obtained from the VTL (Rule-based) configuration were synthesized using the M2L model and a pretrained HiFi-GAN. The speaker identity of the original natural speaker was used in the decoder.
- M2L+H (V-ID): Similar to M2L+H, but the speaker identity of the VTL speaker was used in the decoder.
- L2M+V: Audio input features were mapped to the latent using the VQ-VAE encoder, then the latent was decoded using the L2M model. The resulting articulatory trajectories were synthesized with VTL.
- L2M+H: Audio input features were mapped to the latent using the VQ-VAE encoder, then the latent was decoded using the L2M model. The resulting articulatory trajectories were synthesized using the M2L model and a pretrained HiFi-GAN. The speaker identity of the original natural speaker was used in the decoder.
- L2M+H (V-ID): Similar to L2M+H, but the speaker identity of the VTL speaker was used in the decoder.
- VQV+H: Audio input features were encoded and decoded using the VQ-VAE model. The decoded features were synthesized using HiFi-GAN.
Note that HiFi-GAN was not fine-tuned on the reconstructed features; the HiFi-GAN syntheses therefore do not sound as natural as they could.
Phoneme-to-Articulatory Conversion
Tested on natural phoneme sequences (KIEL data set, Berlin sentences).
Please note: The audio files play correctly in Firefox and Google Chrome, but playback problems were observed in Safari.
[Audio sample table: models Reference, P2L+V, VTL (Rule-based), M2L+H, and M2L+H (V-ID); utterances k02be001–k02be010 and k61be001–k61be010.]
Zero-shot phoneme-to-speech (unseen speakers, unseen utterances from KIEL data set, Siemens sentences).
[Audio sample table: models Reference, P2L+V, and VTL (Rule-based); utterances dlmsi038, dlmsi063, dlmsi064, dlmsi072, and dlmsi092.]
Acoustic-to-Articulatory Inversion
Tested on natural audio samples (KIEL data set, Berlin sentences).
[Audio sample table: models Reference, L2M+V, L2M+H, L2M+H (V-ID), and VQV+H; utterances k61be011, k61be018, k61be023, k61be030, k61be037, k61be061, k62be005, k62be024, k62be086, k62be095, k65be002, k65be013, k65be017, k65be075, k65be077, k66be008, k66be041, k66be060, k66be062, and k66be063.]
Tested on natural utterances (MLS German data set).
Here we demonstrate how the intelligibility of the produced samples increases when the acoustic noise (which is induced by articulatory noise) is suppressed. Future work should address how to regularize the neural network to output smooth, non-noisy articulatory trajectories for natural speech input, where no motor loss is available. In the tables below, NR indicates that noise reduction was applied.
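The exact noise-reduction procedure is not described on this page; one plausible post-hoc approach is to low-pass filter each predicted articulatory parameter track, for example with a centred moving average. The sketch below uses an assumed window length and is only an illustration of the idea, not the method used in the paper:

```python
import numpy as np

def smooth_trajectories(motor, window=9):
    """Moving-average smoothing of articulatory trajectories.

    motor:  (T, P) matrix, one column per articulatory parameter
    window: odd number of frames to average (assumed hyperparameter)
    """
    assert window % 2 == 1, "use an odd window so the filter is centred"
    kernel = np.ones(window) / window
    pad = window // 2
    # Edge-pad each parameter track, then convolve column by column.
    padded = np.pad(motor, ((pad, pad), (0, 0)), mode="edge")
    return np.stack(
        [np.convolve(padded[:, p], kernel, mode="valid")
         for p in range(motor.shape[1])],
        axis=1,
    )

# Toy check: a noisy constant trajectory moves closer to its true value.
rng = np.random.default_rng(1)
noisy = 0.5 + 0.1 * rng.normal(size=(200, 3))
smooth = smooth_trajectories(noisy)
print(np.abs(noisy - 0.5).mean() > np.abs(smooth - 0.5).mean())  # → True
```

A learned alternative, as the text suggests, would be a smoothness regularizer (e.g. penalizing large frame-to-frame deltas) on the L2M output during training, removing the need for this post-processing step.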
[Audio sample table: models Reference, L2M+V (NR), and L2M+V; utterances 252_1614_000008, 2497_1614_000000, and 10148_10119_000697.]
Tested on unseen languages (LJSpeech, LibriSpeech, aishell-3).
Here we demonstrate the acoustic-to-articulatory inversion on other languages. This is challenging, as the audio encoder has never seen the speakers, utterances, or languages during training.
English:
[Audio sample table: models Reference, L2M+V (NR), and L2M+V; utterances LJ001_0001, 1320_122612_0000, LJ001_0024, and 1320_122612_0012.]
Mandarin:
[Audio sample table: models Reference, L2M+V (NR), and L2M+V; utterances A, B, and C.]