JUCS - Journal of Universal Computer Science 26(4): 434-453, doi: 10.3897/jucs.2020.023

Speaker/Style-Dependent Neural Network Speech Synthesis Based on Speaker/Style Embedding

Milan Sečujski^‡, Darko Pekar^§, Siniša Suzić^‡, Anton Smirnov^§, Tijana Nosek^‡

‡ University of Novi Sad, Novi Sad, Serbia
§ AlfaNum Speech Technologies Ltd., Novi Sad, Serbia

Corresponding author: Milan Sečujski (secujski@uns.ac.rs)

This is an open access article distributed under the terms of the Creative Commons Attribution License (CC BY-ND 4.0). This license allows reusers to copy and distribute the material in any medium or format in unadapted form only, and only so long as attribution is given to the creator. The license allows for commercial use.

Citation: Sečujski M, Pekar D, Suzić S, Smirnov A, Nosek T (2020) Speaker/Style-Dependent Neural Network Speech Synthesis Based on Speaker/Style Embedding. JUCS - Journal of Universal Computer Science 26(4): 434-453. https://doi.org/10.3897/jucs.2020.023

Abstract

The paper presents a novel architecture and method for training neural networks to produce synthesized speech in a particular voice and speaking style, based on a small quantity of target speaker/style training data. The method is based on neural network embedding, i.e. the mapping of discrete variables into continuous vectors in a low-dimensional space, which has been shown to be a very successful universal deep learning technique. In this particular case, different speaker/style combinations are mapped into different points in a low-dimensional space, which enables the network to capture the similarities and differences between speakers and speaking styles more efficiently. The initial model from which speaker/style adaptation was carried out was a multi-speaker/multi-style model based on 8.5 hours of American English speech data, corresponding to 16 different speaker/style combinations. The results of the experiments show that both versions of the obtained system, one using 10 minutes and the other as little as 30 seconds of target data, outperform the state of the art in parametric speaker/style-dependent speech synthesis. This opens up a wide range of applications for speaker/style-dependent speech synthesis based on small quantities of training data, in domains ranging from customer interaction in call centers to robot-assisted medical therapy.
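
To make the notion of embedding concrete, the following is a minimal sketch of how a discrete speaker/style identifier could be mapped to a continuous low-dimensional vector and fed to a network together with other inputs. It is an illustration only, not the architecture described in the paper: the embedding dimension, the random initialization, the concatenation with linguistic features, and all function names are assumptions made for this example. Only the count of 16 speaker/style combinations comes from the abstract.

```python
import random

NUM_COMBINATIONS = 16  # speaker/style combinations, as stated in the abstract
EMBEDDING_DIM = 4      # assumed value; the abstract only says "low-dimensional"


def make_embedding_table(num_ids, dim, seed=0):
    """Create one small random vector per discrete ID.

    In a real model these vectors would be trainable parameters;
    here they are fixed random values for illustration.
    """
    rng = random.Random(seed)
    return [[rng.uniform(-0.1, 0.1) for _ in range(dim)] for _ in range(num_ids)]


def embed(table, combo_id):
    """Look up the continuous vector for a discrete speaker/style ID."""
    return table[combo_id]


def network_input(linguistic_features, table, combo_id):
    """Concatenate linguistic features with the speaker/style vector.

    Concatenation is an assumption of this sketch, not a detail
    taken from the paper.
    """
    return list(linguistic_features) + embed(table, combo_id)


table = make_embedding_table(NUM_COMBINATIONS, EMBEDDING_DIM)
frame = [0.0, 1.0, 0.5]  # hypothetical linguistic features for one frame
x = network_input(frame, table, combo_id=3)
```

Under these assumptions, each speaker/style combination occupies one point in the embedding space, so adapting to a new target could in principle amount to learning one additional row of the table while reusing the rest of the network.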

Keywords

deep neural networks, embedding, speaker adaptation, text-to-speech synthesis