Time | Time | Time | August 26, Thursday | August 27, Friday | August 28, Saturday | Time
2.30 | 15.30 | 8.30 | Opening | | |
3.00 | 16.00 | 9.00 | Special Synthesis Problems | Articulation and Naturalness | Modeling and Evaluation | 9.00
5.00 | 18.00 | 11.00 | break | break | break | 11.00
5.10 | 18.10 | 11.10 | Lior Wolf - Keynote 1 | István Winkler - Keynote 2 | Thomas Drugman - Keynote 3 | 11.10
6.10 | 19.10 | 12.10 | Morning Session Discussion | Morning Session Discussion 2 | Morning Session Discussion 3 | 12.10
7.10 | 20.10 | 13.10 | Articulation and Speech Styles | Emotion, Singing and Voice Conversion | Synthesis and Context | 13.10
9.10 | 22.10 | 15.10 | break | break | Closing & SynSIG announcement | 15.10
9.20 | 22.20 | 15.20 | Expressive Synthesis | Multilingual and Evaluation | | 15.20
11.20 | 0.20 | 17.20 | Afternoon Session Discussion | Afternoon Session Discussion 2 | | 17.20

Welcome reception in the venue

Social event    


Papers at ISCA website

Each paper has its DOI number.



August 26, Thursday



Opening

   8.30 - 9.00



Géza Németh, Chairman, BME, Hungary


Special Synthesis Problems

   9.00 - 11.00


Session Chair: Lior Wolf


Sai Sirisha Rallabandi, Babak Naderi and Sebastian Möller:
Identifying the vocal cues of likeability, friendliness and skilfulness in synthetic speech


Tamás Gábor Csapó:
Extending Text-to-Speech Synthesis with Articulatory Movement Prediction using Ultrasound Tongue Imaging


Martin Lenglet, Olivier Perrotin and Gérard Bailly:
Impact of Segmentation and Annotation in French end-to-end Synthesis


Marc Illa, Bence Mark Halpern, Rob van Son, Laureano Moro-Velazquez and Odette Scharenborg:
Pathological voice adaptation with autoencoder-based voice conversion

Elijah Gutierrez, Pilar Oplustil-Gallegos and Catherine Lai:
Location, Location: Enhancing the Evaluation of Text-to-Speech synthesis using the Rapid Prosody Transcription Paradigm

Keynote 1

   11.10 - 12.10


Session Chair: Erica Cooper

Lior Wolf, Facebook AI Research and Tel Aviv University, Israel
Deep Audio Conversion Technologies and Their Applications in Speech, Singing, and Music


Lior Wolf is a research scientist at Facebook AI Research and a full professor in the School of Computer Science at Tel Aviv University, Israel. He conducted postdoctoral research at Prof. Poggio's lab at the Massachusetts Institute of Technology and received his PhD degree from the Hebrew University, under the supervision of Prof. Shashua. He is an ERC grantee and has won honorable mentions at ICCV 2001 and ICCV 2019, and best paper awards at ECCV 2000 and ICANN 2016. His research focuses on computer vision, audio synthesis, and deep learning.


Morning Session Discussion

   12.10 - 13.10


Articulation and Speech Styles

   13.10 - 15.10


Session Chair: Esther Klabbers


Tamás Gábor Csapó, László Tóth, Gábor Gosztolya and Alexandra Markó:
Speech Synthesis from Text and Ultrasound Tongue Image-based Articulatory Input


Javier Latorre, Charlotte Bailleul, Tuuli Morrill, Alistair Conkie and Yannis Stylianou:
Combining speakers of multiple languages to improve quality of neural voices


Christina Tånnander and Jens Edlund:
Methods of slowing down speech


Joakim Gustafson, Jonas Beskow and Eva Szekely:
Personality in the mix - investigating the contribution of fillers and speaking style to the perception of spontaneous speech synthesis

Csaba Zainkó, László Tóth, Amin Honarmandi Shandiz, Gábor Gosztolya, Alexandra Markó, Géza Németh and Tamás Gábor Csapó:
Adaptation of Tacotron2-based Text-To-Speech for Articulatory-to-Acoustic Mapping using Ultrasound Tongue Imaging

Expressive Synthesis

   15.20 - 17.20


Session Chair: Gábor Olaszy


Bastian Schnell and Philip N. Garner:
Improving Emotional TTS with an Emotion Intensity Input from Unsupervised Extraction


Slava Shechtman and Avrech Ben-David:
Acquiring conversational speaking style from multi-speaker spontaneous dialog corpus for prosody-controllable sequence-to-sequence speech synthesis


Bastian Schnell, Goeric Huybrechts, Bartek Perz, Thomas Drugman and Jaime Lorenzo-Trueba:
EmoCat: Language-agnostic Emotional Voice Conversion


Abdelhamid Ezzerg, Adam Gabrys, Bartosz Putrycz, Daniel Korzekwa, Daniel Saez-Trigueros, David McHardy, Kamil Pokora, Jakub Lachowicz, Jaime Lorenzo-Trueba and Viacheslav Klimkov:
Enhancing audio quality for expressive Neural Text-to-Speech


Lucas H. Ueda, Paula D. P. Costa, Flavio O. Simoes and Mário U. Neto:
Are we truly modeling expressiveness? A study on expressive TTS in Brazilian Portuguese for real-life application styles


Afternoon Session Discussion

   17.20 - 18.20


August 27, Friday


Articulation and Naturalness

   9.00 - 11.00


Session Chair: Tamás Gábor Csapó


Debasish Ray Mohapatra, Pramit Saha, Yadong Liu, Bryan Gick and Sidney Fels:
Vocal tract area function extraction using ultrasound for articulatory speech synthesis


Raahil Shah, Kamil Pokora, Abdelhamid Ezzerg, Viacheslav Klimkov, Goeric Huybrechts, Bartosz Putrycz, Daniel Korzekwa and Thomas Merritt:
Non-Autoregressive TTS with Explicit Duration Modelling for Low-Resource Highly Expressive Speech


Paul Konstantin Krug, Simon Stone and Peter Birkholz:
Intelligibility and naturalness of articulatory synthesis with VocalTractLab compared to established speech synthesis technologies


Ambika Kirkland, Marcin Włodarczak, Joakim Gustafson and Eva Szekely:
Perception of smiling voice in spontaneous speech synthesis

Alejandro Mottini, Jaime Lorenzo-Trueba, Sri Vishnu Kumar Karlapati and Thomas Drugman:
Voicy: Zero-Shot Non-Parallel Voice Conversion in Noisy Reverberant Environments

Keynote 2

   11.10 - 12.10


Session Chair: Cassia Valentini Botinhao

István Winkler, Research Centre for Natural Sciences, Hungary
Early Development of Infantile Communication by Sound


István Winkler, PhD, DSc, is an electrical engineer and psychologist. He received his PhD in 1993 at the University of Helsinki, studying auditory sensory memory by electroencephalographic measures. He defended his Doctor of Science thesis in 2005 at the Hungarian Academy of Sciences on auditory deviance detection. His current fields of interest are predictive processing in auditory deviance detection, auditory scene analysis, communication by sound, and the development of these functions in infancy. During his career, he has authored or coauthored over 250 publications, which have received over 11,000 citations. He is currently the director of the Institute of Cognitive Neuroscience and Psychology, Research Centre for Natural Sciences, Budapest, Hungary, and the head of the Sound and Speech Perception research group.


Morning Session Discussion

   12.10 - 13.10


Emotion, Singing and Voice Conversion

   13.10 - 15.10


Session Chair: Simon King 


Konstantinos Markopoulos, Nikolaos Ellinas, Alexandra Vioni, Myrsini Christidou, Panos Kakoulidis, Georgios Vamvoukakis, June Sig Sung, Hyoungmin Park, Pirros Tsiakoulis, Aimilios Chalamandaris and Georgia Maniati:
Rapping-Singing Voice Synthesis based on Phoneme-level Prosody Control


Jennifer Williams, Jason Fong, Erica Cooper and Junichi Yamagishi:
Exploring Disentanglement with Multilingual and Monolingual VQ-VAE


Erica Cooper, Xin Wang and Junichi Yamagishi:
Text-to-Speech Synthesis Techniques for MIDI-to-Audio Synthesis


Hieu-Thi Luong and Junichi Yamagishi:
Preliminary study on using vector quantization latent spaces for TTS/VC systems with consistent performance

Patrick Lumban Tobing and Tomoki Toda:
Low-latency real-time non-parallel voice conversion based on cyclic variational autoencoder and multiband WaveRNN with data-driven linear prediction

Multilingual and Evaluation

   15.20 - 17.20


Session Chair: Junichi Yamagishi


Johannah O'Mahony, Pilar Oplustil-Gallegos, Catherine Lai and Simon King:
Factors Affecting the Evaluation of Synthetic Speech in Context


Arun Baby, Pranav Jawale, Saranya Vinnaitherthan, Sumukh Badam, Nagaraj Adiga and Sharath Adavane:
Non-native English lexicon creation for bilingual speech synthesis


Dan Wells and Korin Richmond:
Cross-lingual Transfer of Phonological Features for Low-resource Speech Synthesis


Ayushi Pandey, Sebastien Le Maguer, Julie Berndsen and Naomi Harte:
Mind your p’s and k’s -- Comparing obstruents across TTS voices of the Blizzard Challenge 2013

Jason Fong, Jilong Wu, Prabhav Agrawal, Andrew Gibiansky, Thilo Koehler and Qing He:
Improving Polyglot Speech Synthesis through Multi-task and Adversarial Learning

Afternoon Session Discussion

   17.20 - 18.20


August 28, Saturday


Modeling and Evaluation

   9.00 - 11.00


Session Chair: Gérard Bailly

Ammar Abbas, Bajibabu Bollepalli, Alexis Moinet, Arnaud Joly, Penny Karanasou, Peter Makarov, Simon Slangens, Sri Karlapati and Thomas Drugman:
Multi-Scale Spectrogram Modelling for Neural Text-to-Speech


Erica Cooper and Junichi Yamagishi:
How do Voices from Past Speech Synthesis Challenges Compare Today?


Kazuya Yufune, Tomoki Koriyama, Shinnosuke Takamichi and Hiroshi Saruwatari:
Accent Modeling of Low-Resourced Dialect in Pitch Accent Language Using Variational Autoencoder


Jason Taylor, Sébastien Le Maguer and Korin Richmond:
Liaison and Pronunciation Learning in End-to-End Text-to-Speech in French


Qiao Tian, Chao Liu, Zewang Zhang, Heng Lu, Linghui Chen, Bin Wei, Pujiang He and Shan Liu:
FeatherTTS: Robust and Efficient attention based Neural TTS


Keynote 3

   11.10 - 12.10


Session Chair: Gustav Eje Henter

Thomas Drugman, Amazon, Germany
Expressive Neural TTS


Thomas Drugman is a Science Manager in the Amazon TTS Research team. He received his PhD in 2011 from the University of Mons, winning the IBM Belgium award for “Best Thesis in Computer Science”. His PhD thesis studied the use of glottal source analysis in Speech Processing. He then completed a 3-year postdoc on speech/audio analysis for two biomedical applications: tracheo-esophageal speech reconstruction and cough detection in chronic respiratory diseases. In 2014, he joined Amazon as a Scientist in the Alexa ASR team. He transferred to the TTS team in 2016, where he has been Science Manager since 2017. He has contributed to making Amazon’s Neural TTS more natural and expressive, notably by enriching Alexa’s experience with different speaking styles: emotions, newscaster, whispering, etc. His current research interests lie in improving the naturalness and flow of longer synthetic speech interactions. He has about 125 publications in the field of Speech Processing. He received the Interspeech Best Student Paper awards in 2009 and 2014 (as supervisor). He has also been a member of the IEEE Speech and Language Technical Committee since 2019.


Morning Session Discussion

   12.10 - 13.10


Synthesis and Context

   13.10 - 15.10


Session Chair: Thomas Drugman


Pilar Oplustil-Gallegos, Johannah O'Mahony and Simon King:
Comparing acoustic and textual representations of previous linguistic context for improving Text-to-Speech


Wataru Nakata, Tomoki Koriyama, Shinnosuke Takamichi, Naoko Tanji, Yusuke Ijima, Ryo Masumura and Hiroshi Saruwatari:
Audiobook Speech Synthesis Conditioned by Cross-Sentence Context-Aware Word Embeddings


Mano Ranjith Kumar M, Jom Kuriakose, Karthik Pandia D S and Hema A Murthy:
Lipsyncing efforts for transcreating lecture videos in Indian languages


Marco Nicolis and Viacheslav Klimkov:
Homograph disambiguation with contextual word embeddings for TTS systems

Jason Fong, Jennifer Williams and Simon King:
Analysing Temporal Sensitivity of VQ-VAE Sub-Phone Codebooks

Closing & SynSIG announcement

   15.10 - 15.20


