
Self-Supervised Solution to the Control Problem of Articulatory Synthesis

Interspeech 2023

Paul Konstantin Krug, Peter Birkholz, Branislav Gerazov, Daniel Rudolph van Niekerk, Anqi Xu, Yi Xu


Abstract

Given an articulatory-to-acoustic forward model, it is a priori unknown how its motor control must be operated to achieve a desired acoustic result. This control problem is a fundamental issue of articulatory speech synthesis and the cradle of acoustic-to-articulatory inversion, a discipline that attempts to address the issue by means of various methods. This work presents an end-to-end solution to the articulatory control problem, in which synthetic motor trajectories of Monte-Carlo-generated artificial speech are linked to input modalities (such as natural speech recordings or phoneme sequence input) via speaker-independent latent representations of a vector-quantized variational autoencoder. The proposed method is self-supervised and thus, in principle, independent of the synthesizer and the speaker model.
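To make the data-generation idea concrete, the following is a minimal sketch of how Monte-Carlo articulatory trajectories could be drawn and rendered with a synthesizer. The smoothed random-walk sampler, the parameter bounds, and the `vtl_synthesize` call are illustrative assumptions for this page, not the authors' exact procedure.

```python
# Illustrative Monte-Carlo generation of artificial training speech:
# random articulatory trajectories are drawn within the synthesizer's
# parameter bounds and then rendered with VocalTractLab.
import numpy as np

def sample_random_trajectory(n_frames, lower, upper, smoothing=0.95, seed=None):
    """Smoothed random walk between per-parameter bounds (lower/upper arrays)."""
    rng = np.random.default_rng(seed)
    traj = np.empty((n_frames, len(lower)))
    state = rng.uniform(lower, upper)
    for t in range(n_frames):
        target = rng.uniform(lower, upper)        # draw a new random target
        state = smoothing * state + (1.0 - smoothing) * target
        traj[t] = state
    return traj

# Example with made-up bounds for 30 VTL control parameters:
lower, upper = -np.ones(30), np.ones(30)
motor = sample_random_trajectory(n_frames=500, lower=lower, upper=upper)
# audio = vtl_synthesize(motor)   # hypothetical call into VocalTractLab
```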


TensorTract Functionality

TensorTract is a compound model of multiple deep neural networks. It is capable of performing a number of tasks, including acoustic-to-articulatory inversion (AAI), phoneme-to-articulatory conversion (P2A), and articulatory-to-acoustic neural speech synthesis. In the case of AAI and P2A, TensorTract provides articulatory trajectories that are compatible with the state-of-the-art articulatory speech synthesizer VocalTractLab (VTL).


Schematic of the TensorTract model.

The centerpiece of TensorTract is a vector-quantized variational autoencoder (VQ-VAE), which maps log-mel spectrograms via a TCN-based encoder to a speaker-independent quantized latent representation and then back to log-mel features via a speaker- and pitch-conditioned TCN decoder. The VQ-VAE is trained on both natural and synthetic speech, whereby the latter is generated randomly using VTL. In subsequent training stages, the synthetic articulatory trajectories are mapped to the quantized latent via a multi-head attention (MHA)-BiGRU forward model M2L (motor-to-latent) and a corresponding inverse model L2M (latent-to-motor). This creates a link between natural speech data and synthetic articulatory trajectories. Furthermore, phoneme annotations of the natural speech are mapped to the latent via an MHA+TCN-based forward model P2L (phoneme-to-latent). This effectively enables the control of VTL via natural speech and/or phoneme sequence input.
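For orientation, the block below sketches how these components could be laid out in PyTorch. Layer widths, the codebook size, the 80 mel bins, and the 30 VTL control parameters are illustrative assumptions, not the authors' implementation; the straight-through gradient and all training losses are omitted.

```python
# Minimal PyTorch sketch of the component layout described above.
import torch
import torch.nn as nn

class TCN(nn.Module):
    """Toy temporal convolutional stack (dilated 1-D convolutions)."""
    def __init__(self, in_dim, out_dim, hidden=256, layers=4):
        super().__init__()
        blocks, d = [], in_dim
        for i in range(layers):
            blocks += [nn.Conv1d(d, hidden, kernel_size=3,
                                 dilation=2**i, padding=2**i), nn.ReLU()]
            d = hidden
        blocks += [nn.Conv1d(d, out_dim, kernel_size=1)]
        self.net = nn.Sequential(*blocks)

    def forward(self, x):                       # x: (batch, channels, frames)
        return self.net(x)

class VectorQuantizer(nn.Module):
    """Nearest-neighbour codebook lookup (straight-through estimator omitted)."""
    def __init__(self, codes=256, dim=64):
        super().__init__()
        self.codebook = nn.Embedding(codes, dim)

    def forward(self, z):                       # z: (batch, dim, frames)
        z_t = z.permute(0, 2, 1)                                  # (B, T, D)
        dist = (z_t.unsqueeze(2) - self.codebook.weight).pow(2).sum(-1)
        idx = dist.argmin(-1)                                     # (B, T)
        return self.codebook(idx).permute(0, 2, 1), idx           # quantized latent

class MHABiGRU(nn.Module):
    """Shared shape of the M2L / L2M sequence-to-sequence regressors."""
    def __init__(self, in_dim, out_dim, hidden=256, heads=4):
        super().__init__()
        self.proj = nn.Linear(in_dim, hidden)
        self.attn = nn.MultiheadAttention(hidden, heads, batch_first=True)
        self.gru = nn.GRU(hidden, hidden, bidirectional=True, batch_first=True)
        self.out = nn.Linear(2 * hidden, out_dim)

    def forward(self, x):                       # x: (batch, frames, features)
        h = self.proj(x)
        h, _ = self.attn(h, h, h)
        h, _ = self.gru(h)
        return self.out(h)

# Assumed feature sizes: 80 mel bins, 64-dim latent, 30 VTL control parameters.
encoder = TCN(80, 64)            # log-mel -> latent (speaker-independent after VQ)
quantizer = VectorQuantizer(dim=64)
decoder = TCN(64 + 1 + 32, 80)   # latent + pitch + speaker embedding -> log-mel
m2l = MHABiGRU(30, 64)           # motor trajectories -> latent
l2m = MHABiGRU(64, 30)           # latent -> motor trajectories
```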
The following abbreviations have been used in the tables below:
  • P2L+V: The phoneme sequence was mapped to the latent using the P2L model, then the latent was decoded using the L2M model. The resulting articulatory trajectories were synthesized with VTL.
  • VTL (Rule-based): Speech was synthesized using the rule-based phoneme-to-speech functionality of VTL, which is based on vocal tract state presets derived from magnetic resonance imaging data.
  • M2L+H: The motor trajectories obtained from the VTL (Rule-based) configuration were mapped to the latent using the M2L model and synthesized with a pretrained Hifi-GAN. The speaker identities of the original natural speakers were used in the decoder.
  • M2L+H (V-ID): Similar to M2L+H, but the speaker identity of the VTL speaker was used in the decoder.
  • L2M+V: Audio input features were mapped to the latent using the VQ-VAE encoder, then the latent was decoded using the L2M model. The resulting articulatory trajectories were synthesized with VTL.
  • L2M+H: Audio input features were mapped to the latent using the VQ-VAE encoder, then the latent was decoded using the L2M model. The resulting articulatory trajectories were mapped back to the latent using the M2L model and synthesized with a pretrained Hifi-GAN. The speaker identities of the original natural speakers were used in the decoder.
  • L2M+H (V-ID): Similar to L2M+H, but the speaker identity of the VTL speaker was used in the decoder.
  • VQV+H: Audio input features were encoded and decoded using the VQ-VAE model. The decoded features were synthesized using Hifi-GAN.
Note that Hifi-GAN was not fine-tuned to the reconstructed features. Thus, the syntheses with Hifi-GAN do not sound as natural as they could.
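For clarity, the sketch below shows how the listed conditions chain the components at inference time. Every component is passed in as a callable placeholder (`p2l`, `l2m`, `m2l`, `encoder`, `quantizer`, `decoder`, `vtl_synthesize`, `hifigan`); the names and call signatures are assumptions for illustration, not the released code's API.

```python
# Inference-time pipelines corresponding to the conditions above.
# All components are placeholder callables; signatures are assumed.

def p2l_plus_v(phonemes, p2l, l2m, vtl_synthesize):
    """P2L+V: phoneme sequence -> latent -> motor trajectories -> VTL audio."""
    motor = l2m(p2l(phonemes))
    return vtl_synthesize(motor)

def l2m_plus_v(logmel, encoder, quantizer, l2m, vtl_synthesize):
    """L2M+V: audio features -> quantized latent -> motor trajectories -> VTL audio."""
    latent, _ = quantizer(encoder(logmel))
    return vtl_synthesize(l2m(latent))

def m2l_plus_h(motor, m2l, decoder, hifigan, speaker, pitch):
    """M2L+H: rule-based VTL trajectories -> latent -> decoded log-mel -> Hifi-GAN audio."""
    return hifigan(decoder(m2l(motor), speaker, pitch))

def l2m_plus_h(logmel, encoder, quantizer, l2m, m2l, decoder, hifigan, speaker, pitch):
    """L2M+H: audio -> quantized latent -> motor -> M2L latent -> log-mel -> Hifi-GAN audio."""
    latent, _ = quantizer(encoder(logmel))
    return hifigan(decoder(m2l(l2m(latent)), speaker, pitch))

def vqv_plus_h(logmel, encoder, quantizer, decoder, hifigan, speaker, pitch):
    """VQV+H: plain VQ-VAE reconstruction, vocoded with Hifi-GAN."""
    latent, _ = quantizer(encoder(logmel))
    return hifigan(decoder(latent, speaker, pitch))
```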

Phoneme-to-Articulatory Conversion

Tested on natural phoneme sequences (KIEL data set, Berlin sentences).

Please note: The audio files can be played with Firefox and Google Chrome, but problems were noticed when playing on Safari.

Model Utterances
k02be001 k02be002 k02be003 k02be004 k02be005 k02be006 k02be007 k02be008 k02be009 k02be010 k61be001 k61be002 k61be003 k61be004 k61be005 k61be006 k61be007 k61be008 k61be009 k61be010
Reference
P2L+V
VTL (Rule-based)
M2L+H
M2L+H (V-ID)

Zero-shot phoneme-to-speech (unseen speakers, unseen utterances from KIEL data set, Siemens sentences).

Model Utterances
dlmsi038 dlmsi063 dlmsi064 dlmsi072 dlmsi092
Reference
P2L+V
VTL (Rule-based)

Acoustic-to-Articulatory Inversion

Tested on natural audio samples (KIEL data set, Berlin sentences).

Model Utterances
k61be011 k61be018 k61be023 k61be030 k61be037 k61be061 k62be005 k62be024 k62be086 k62be095 k65be002 k65be013 k65be017 k65be075 k65be077 k66be008 k66be041 k66be060 k66be062 k66be063
Reference
L2M+V
L2M+H
L2M+H (V-ID)
VQV+H

Tested on natural utterances (MLS German data set).

Here we demonstrate how the intelligibility of the produced samples increases when the acoustic noise (which is induced by articulatory noise) is suppressed. Future work should address how to regularize the neural network so that it outputs smooth, non-noisy articulatory trajectories for natural speech input, where no motor loss is available. In the table below, NR indicates that noise reduction was applied; a simple smoothing sketch follows the table.
Model Utterances
252_1614_000008 2497_1614_000000 10148_10119_000697
Reference
L2M+V (NR)
L2M+V
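As a point of reference, the block below shows one simple way to suppress frame-to-frame jitter in predicted VTL trajectories before synthesis. The page does not state which noise-reduction method produced the NR samples above; the Savitzky-Golay filter here is purely an illustrative smoother with made-up window settings.

```python
# Illustrative trajectory smoothing before VTL synthesis (not the method used
# for the NR samples above).
import numpy as np
from scipy.signal import savgol_filter

def smooth_trajectories(motor, window=9, polyorder=3):
    """motor: (frames, n_vtl_params) array of predicted VTL control parameters."""
    return savgol_filter(motor, window_length=window, polyorder=polyorder, axis=0)

# Toy example: a slowly rising trajectory corrupted by frame-level noise.
noisy = np.random.randn(200, 30) * 0.05 + np.linspace(0, 1, 200)[:, None]
smooth = smooth_trajectories(noisy)
```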

Tested on unseen languages (LJSpeech, LibriSpeech, aishell-3).

Here we demonstrate the acoustic-to-articulatory inversion on other languages. This is challenging, as the audio encoder has never seen the speakers, utterances, or languages during training.

English:
Model Utterances
LJ001_0001 1320_122612_0000 LJ001_0024 1320_122612_0012
Reference
L2M+V (NR)
L2M+V

Mandarin:
Model Utterances
A B C
Reference
L2M+V (NR)
L2M+V