Self-Supervised Solution to the Control Problem of Articulatory Synthesis
Interspeech 2023
Paul Konstantin Krug, Peter Birkholz, Branislav Gerazov, Daniel Rudolph van Niekerk, Anqi Xu, Yi Xu
Abstract
Given an articulatory-to-acoustic forward model, it is a priori unknown how its motor control must be operated to achieve a desired acoustic result. This control problem is a fundamental issue of articulatory speech synthesis and the cradle of acoustic-to-articulatory inversion, a discipline that attempts to address the issue by a variety of methods. This work presents an end-to-end solution to the articulatory control problem, in which synthetic motor trajectories of Monte-Carlo-generated artificial speech are linked to input modalities (such as natural speech recordings or phoneme sequence input) via speaker-independent latent representations of a vector-quantized variational autoencoder. The proposed method is self-supervised and thus, in principle, independent of the synthesizer and speaker model.
TensorTract Functionality
TensorTract is a compound model of multiple deep neural networks. It can perform a number of tasks, including acoustic-to-articulatory inversion (AAI), phoneme-to-articulatory conversion (P2A), and articulatory-to-acoustic neural speech synthesis. For AAI and P2A, TensorTract produces articulatory trajectories that are compatible with the state-of-the-art articulatory speech synthesizer VocalTractLab (VTL).
The centerpiece of TensorTract is a vector-quantized variational autoencoder (VQ-VAE), which maps log-mel spectrograms via a temporal convolutional network (TCN) encoder to a speaker-independent quantized latent representation, and then back to log-mel features via a speaker- and pitch-conditioned TCN decoder. The VQ-VAE is trained on both natural and synthetic speech, where the latter is generated randomly with VTL. In subsequent training stages, the synthetic articulatory trajectories are mapped to the quantized latent via a multi-head attention (MHA)-BiGRU forward model M2L (motor-to-latent) and a corresponding inverse model L2M (latent-to-motor). This creates a link between natural speech data and synthetic articulatory trajectories. Further, phoneme annotations of the natural speech are mapped to the latent via an MHA+TCN-based forward model P2L (phoneme-to-latent). This effectively enables control of VTL via natural speech and/or phoneme sequence input.
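As a rough, framework-agnostic illustration of the quantization step at the heart of a VQ-VAE (the function name, shapes, and toy data below are ours, not part of TensorTract), each continuous encoder frame is snapped to its nearest codebook vector, yielding the discrete latent that the M2L, L2M, and P2L models read and write:

```python
import numpy as np

def vq_quantize(z_e, codebook):
    """Map continuous encoder outputs to their nearest codebook vectors.

    z_e:      (T, D) continuous latent frames from the encoder
    codebook: (K, D) learned embedding vectors
    Returns the quantized frames (T, D) and the chosen code indices (T,).
    """
    # Squared Euclidean distance between every frame and every code vector.
    d = ((z_e[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)  # (T, K)
    idx = d.argmin(axis=1)          # nearest code per frame
    return codebook[idx], idx       # discrete latent + index sequence

# Toy example: 4 frames, 2-dim latent, 3 codes.
rng = np.random.default_rng(0)
codebook = rng.normal(size=(3, 2))
z_e = codebook[[0, 2, 2, 1]] + 0.01 * rng.normal(size=(4, 2))
z_q, idx = vq_quantize(z_e, codebook)
print(idx.tolist())  # → [0, 2, 2, 1] (small noise keeps frames near their codes)
```

In training, the codebook and encoder are learned jointly (with a straight-through gradient estimator); the sketch only shows the inference-time lookup.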
The following abbreviations have been used in the tables below:
- P2L+V: The phoneme sequence was mapped to the latent using the P2L model, then the latent was decoded using the L2M model. The resulting articulatory trajectories were synthesized with VTL.
- VTL (Rule-based): Speech was synthesized using the rule-based phoneme-to-speech functionality of VTL, which is based on vocal tract state presets derived from magnetic resonance imaging data.
- M2L+H: The motor trajectories obtained from the VTL (Rule-based) configuration were synthesized using the M2L model and a pretrained HiFi-GAN. The speaker identity of the original natural speaker was used in the decoder.
- M2L+H (V-ID): Similar to M2L+H, but the speaker identity of the VTL speaker was used in the decoder.
- L2M+V: Audio input features were mapped to the latent using the VQ-VAE encoder, then the latent was decoded using the L2M model. The resulting articulatory trajectories were synthesized with VTL.
- L2M+H: Audio input features were mapped to the latent using the VQ-VAE encoder, then the latent was decoded using the L2M model. The resulting articulatory trajectories were synthesized using the M2L model and a pretrained HiFi-GAN. The speaker identity of the original natural speaker was used in the decoder.
- L2M+H (V-ID): Similar to L2M+H, but the speaker identity of the VTL speaker was used in the decoder.
- VQV+H: Audio input features were encoded and decoded using the VQ-VAE model. The decoded features were synthesized using HiFi-GAN.
Note that HiFi-GAN was not fine-tuned on the reconstructed features; the HiFi-GAN syntheses therefore do not sound as natural as they could.
Phoneme-to-Articulatory Conversion
Tested on natural phoneme sequences (KIEL data set, Berlin sentences).
Please note: The audio files play correctly in Firefox and Google Chrome, but playback problems were observed in Safari.
[Audio sample table: models Reference, P2L+V, VTL (Rule-based), M2L+H, and M2L+H (V-ID); utterances k02be001–k02be010 and k61be001–k61be010.]
Zero-shot phoneme-to-speech (unseen speakers, unseen utterances from KIEL data set, Siemens sentences).
[Audio sample table: models Reference, P2L+V, and VTL (Rule-based); utterances dlmsi038, dlmsi063, dlmsi064, dlmsi072, and dlmsi092.]
Acoustic-to-Articulatory Inversion
Tested on natural audio samples (KIEL data set, Berlin sentences).
[Audio sample table: models Reference, L2M+V, L2M+H, L2M+H (V-ID), and VQV+H; utterances k61be011, k61be018, k61be023, k61be030, k61be037, k61be061, k62be005, k62be024, k62be086, k62be095, k65be002, k65be013, k65be017, k65be075, k65be077, k66be008, k66be041, k66be060, k66be062, and k66be063.]
Tested on natural utterances (MLS German data set).
Here we demonstrate how the intelligibility of the produced samples increases when the acoustic noise (which is induced by articulatory noise) is suppressed. Future work should address how to regularize the neural network to output smooth, non-noisy articulatory trajectories for natural speech input, where no motor loss is available. In the tables below, NR indicates that noise reduction was applied.
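The exact noise-reduction procedure is not described on this page; one plausible post-hoc approach is to low-pass filter each predicted articulatory parameter track, for example with a centred moving average. The sketch below uses an assumed window length and is only an illustration of the idea, not the method used in the paper:

```python
import numpy as np

def smooth_trajectories(motor, window=9):
    """Moving-average smoothing of articulatory trajectories.

    motor:  (T, P) matrix, one column per articulatory parameter
    window: odd number of frames to average (assumed hyperparameter)
    """
    assert window % 2 == 1, "use an odd window so the filter is centred"
    kernel = np.ones(window) / window
    pad = window // 2
    # Edge-pad each parameter track, then convolve column by column.
    padded = np.pad(motor, ((pad, pad), (0, 0)), mode="edge")
    return np.stack(
        [np.convolve(padded[:, p], kernel, mode="valid")
         for p in range(motor.shape[1])],
        axis=1,
    )

# Toy check: a noisy constant trajectory moves closer to its true value.
rng = np.random.default_rng(1)
noisy = 0.5 + 0.1 * rng.normal(size=(200, 3))
smooth = smooth_trajectories(noisy)
print(np.abs(noisy - 0.5).mean() > np.abs(smooth - 0.5).mean())  # → True
```

A learned alternative, as the text suggests, would be a smoothness regularizer (e.g. penalizing large frame-to-frame deltas) on the L2M output during training, removing the need for this post-processing step.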
[Audio sample table: models Reference, L2M+V (NR), and L2M+V; utterances 252_1614_000008, 2497_1614_000000, and 10148_10119_000697.]
Tested on unseen languages (LJSpeech, LibriSpeech, aishell-3).
Here we demonstrate the acoustic-to-articulatory inversion on other languages. This is challenging, as the audio encoder has never seen the speakers, utterances, or languages during training.
English:
[Audio sample table: models Reference, L2M+V (NR), and L2M+V; utterances LJ001_0001, 1320_122612_0000, LJ001_0024, and 1320_122612_0012.]
Mandarin:
[Audio sample table: models Reference, L2M+V (NR), and L2M+V; utterances A, B, and C.]