Program

 

EDT | JST | CEST | Thursday | Friday | Saturday
2.30 | 15.30 | 8.30 | Opening | |
3.00 | 16.00 | 9.00 | Special Synthesis Problems | Articulation and Naturalness | Modeling and Evaluation
5.00 | 18.00 | 11.00 | break | break | break
5.10 | 18.10 | 11.10 | Lior Wolf - Keynote 1 | István Winkler - Keynote 2 | Thomas Drugman - Keynote 3
6.10 | 19.10 | 12.10 | Morning Session Discussion | Morning Session Discussion 2 | Morning Session Discussion 3
7.10 | 20.10 | 13.10 | Articulation and Speech Styles | Emotion, Singing and Voice Conversion | Synthesis and Context
9.10 | 22.10 | 15.10 | break | break | Closing & SynSIG announcement
9.20 | 22.20 | 15.20 | Expressive Synthesis | Multilingual and Evaluation |
11.20 | 0.20 | 17.20 | Afternoon Session Discussion | Afternoon Session Discussion 2 |
 
     


Welcome reception in the venue

Social event    

 

Papers at ISCA website

Each paper has its own DOI number: https://www.isca-speech.org/archive/ssw_2021/index.html

  


August 26, Thursday

     

Opening

   8.30 - 9.00


 

 
   
 

Géza Németh, Chairman, BME, Hungary

     

Special Synthesis Problems

   9.00 - 11.00

 
 
   
 

Session Chair: Lior Wolf

     
 

Sai Sirisha Rallabandi, Babak Naderi and Sebastian Möller:
Identifying the vocal cues of likeability, friendliness and skilfulness in synthetic speech

   
 

Tamás Gábor Csapó:
Extending Text-to-Speech Synthesis with Articulatory Movement Prediction using Ultrasound Tongue Imaging

   
 

Martin Lenglet, Olivier Perrotin and Gérard Bailly:
Impact of Segmentation and Annotation in French end-to-end Synthesis

     
 

Marc Illa, Bence Mark Halpern, Rob van Son, Laureano Moro-Velazquez and Odette Scharenborg:
Pathological voice adaptation with autoencoder-based voice conversion

   
Elijah Gutierrez, Pilar Oplustil-Gallegos and Catherine Lai:
Location, Location: Enhancing the Evaluation of Text-to-Speech synthesis using the Rapid Prosody Transcription Paradigm
     

Keynote 1

   11.10 - 12.10

   
 

Session Chair: Erica Cooper

Lior Wolf, Facebook AI Research and Tel Aviv University, Israel
Deep Audio Conversion Technologies and Their Applications in Speech, Singing, and Music

 
 

Lior Wolf is a research scientist at Facebook AI Research and a full professor in the School of Computer Science at Tel Aviv University, Israel. He conducted postdoctoral research at Prof. Poggio's lab at the Massachusetts Institute of Technology and received his PhD degree from the Hebrew University, under the supervision of Prof. Shashua. He is an ERC grantee and has won honorable mentions at ICCV 2001 and ICCV 2019, and best paper awards at ECCV 2000 and ICANN 2016. His research focuses on computer vision, audio synthesis, and deep learning.
 

   
           

Morning Session Discussion

   12.10 - 13.10

   

Articulation and Speech Styles

   13.10 - 15.10

   
 

Session Chair: Esther Klabbers

 
 

Tamás Gábor Csapó, Laszlo Toth, Gábor Gosztolya and Alexandra Markó:
Speech Synthesis from Text and Ultrasound Tongue Image-based Articulatory Input

   
 

Javier Latorre, Charlotte Bailleul, Tuuli Morrill, Alistair Conkie and Yannis Stylianou:
Combining speakers of multiple languages to improve quality of neural voices

   
 

Christina Tånnander and Jens Edlund:
Methods of slowing down speech

   
 

Joakim Gustafson, Jonas Beskow and Eva Szekely:
Personality in the mix - investigating the contribution of fillers and speaking style to the perception of spontaneous speech synthesis

   
Csaba Zainkó, László Tóth, Amin Honarmandi Shandiz, Gábor Gosztolya, Alexandra Markó, Géza Németh and Tamás Gábor Csapó:
Adaptation of Tacotron2-based Text-To-Speech for Articulatory-to-Acoustic Mapping using Ultrasound Tongue Imaging
   

Expressive Synthesis

   15.20 - 17.20

   
 

Session Chair: Gábor Olaszy

 
 

Bastian Schnell and Philip N. Garner:
Improving Emotional TTS with an Emotion Intensity Input from Unsupervised Extraction

   
 

Slava Shechtman and Avrech Ben-David:
Acquiring conversational speaking style from multi-speaker spontaneous dialog corpus for prosody-controllable sequence-to-sequence speech synthesis

   
 

Bastian Schnell, Goeric Huybrechts, Bartek Perz, Thomas Drugman and Jaime Lorenzo-Trueba:
EmoCat: Language-agnostic Emotional Voice Conversion

     
 

Abdelhamid Ezzerg, Adam Gabrys, Bartosz Putrycz, Daniel Korzekwa, Daniel Saez-Trigueros, David McHardy, Kamil Pokora, Jakub Lachowicz, Jaime Lorenzo-Trueba and Viacheslav Klimkov:
Enhancing audio quality for expressive Neural Text-to-Speech

   
 

Lucas H. Ueda, Paula D. P. Costa, Flavio O. Simoes and Mário U. Neto:
Are we truly modeling expressiveness? A study on expressive TTS in Brazilian Portuguese for real-life application styles

   

Afternoon Session Discussion

   17.20 - 18.20

       
           
           
           

August 27, Friday

     

Articulation and Naturalness

   9.00 - 11.00

 
 
   
 

Session Chair: Tamás Gábor Csapó

     
 

Debasish Ray Mohapatra, Pramit Saha, Yadong Liu, Bryan Gick and Sidney Fels:
Vocal tract area function extraction using ultrasound for articulatory speech synthesis

   
 

Raahil Shah, Kamil Pokora, Abdelhamid Ezzerg, Viacheslav Klimkov, Goeric Huybrechts, Bartosz Putrycz, Daniel Korzekwa and Thomas Merritt:
Non-Autoregressive TTS with Explicit Duration Modelling for Low-Resource Highly Expressive Speech

   
 

Paul Konstantin Krug, Simon Stone and Peter Birkholz:
Intelligibility and naturalness of articulatory synthesis with VocalTractLab compared to established speech synthesis technologies

   
 

Ambika Kirkland, Marcin Włodarczak, Joakim Gustafson and Eva Szekely:
Perception of smiling voice in spontaneous speech synthesis

   
Alejandro Mottini, Jaime Lorenzo-Trueba, Sri Vishnu Kumar Karlapati and Thomas Drugman:
Voicy: Zero-Shot Non-Parallel Voice Conversion in Noisy Reverberant Environments
   

Keynote 2

   11.10 - 12.10

   
 

Session Chair: Cassia Valentini Botinhao

István Winkler, Research Centre for Natural Sciences, Hungary
Early Development of Infantile Communication by Sound

 
 

István Winkler, PhD, DSc, is an electrical engineer and psychologist. He received his PhD in 1993 at the University of Helsinki, studying auditory sensory memory by electroencephalographic measures. He defended his Doctor of Science thesis in 2005 at the Hungarian Academy of Sciences on auditory deviance detection. His current fields of interest are predictive processing in auditory deviance detection, auditory scene analysis, communication by sound, and the development of these functions in infancy. During his career, he has authored or coauthored over 250 publications, which have received over 11,000 citations. He is currently the director of the Institute of Cognitive Neuroscience and Psychology, Research Centre for Natural Sciences, Budapest, Hungary, and the head of the Sound and Speech Perception research group (http://www.ttk.hu/kpi/en/sound-and-speech-perception/).

   
           

Morning Session Discussion

   12.10 - 13.10

   

Emotion, Singing and Voice Conversion

   13.10 - 15.10

   
 

Session Chair: Simon King 

 
 

Konstantinos Markopoulos, Nikolaos Ellinas, Alexandra Vioni, Myrsini Christidou, Panos Kakoulidis, Georgios Vamvoukakis, June Sig Sung, Hyoungmin Park, Pirros Tsiakoulis, Aimilios Chalamandaris and Georgia Maniati:
Rapping-Singing Voice Synthesis based on Phoneme-level Prosody Control

   
 

Jennifer Williams, Jason Fong, Erica Cooper and Junichi Yamagishi:
Exploring Disentanglement with Multilingual and Monolingual VQ-VAE

   
 

Erica Cooper, Xin Wang and Junichi Yamagishi:
Text-to-Speech Synthesis Techniques for MIDI-to-Audio Synthesis

 
 

Hieu-Thi Luong and Junichi Yamagishi:
Preliminary study on using vector quantization latent spaces for TTS/VC systems with consistent performance

Patrick Lumban Tobing and Tomoki Toda:
Low-latency real-time non-parallel voice conversion based on cyclic variational autoencoder and multiband WaveRNN with data-driven linear prediction
   

Multilingual and Evaluation

   15.20 - 17.20

   
 

Session Chair: Junichi Yamagishi

 
 

Johannah O'Mahony, Pilar Oplustil-Gallegos, Catherine Lai and Simon King:
Factors Affecting the Evaluation of Synthetic Speech in Context

   
 

Arun Baby, Pranav Jawale, Saranya Vinnaitherthan, Sumukh Badam, Nagaraj Adiga and Sharath Adavane:
Non-native English lexicon creation for bilingual speech synthesis

   
 

Dan Wells and Korin Richmond:
Cross-lingual Transfer of Phonological Features for Low-resource Speech Synthesis

   
 

Ayushi Pandey, Sebastien Le Maguer, Julie Berndsen and Naomi Harte:
Mind your p’s and k’s -- Comparing obstruents across TTS voices of the Blizzard Challenge 2013

   
Jason Fong, Jilong Wu, Prabhav Agrawal, Andrew Gibiansky, Thilo Koehler and Qing He:
Improving Polyglot Speech Synthesis through Multi-task and Adversarial Learning
   

Afternoon Session Discussion

   17.20 - 18.20

               
           
           
           

August 28, Saturday

     

Modeling and Evaluation

   9.00 - 11.00

 
 
   
 

Session Chair: Gérard Bailly

Ammar Abbas, Bajibabu Bollepalli, Alexis Moinet, Arnaud Joly, Penny Karanasou, Peter Makarov, Simon Slangens, Sri Karlapati and Thomas Drugman:
Multi-Scale Spectrogram Modelling for Neural Text-to-Speech

   
 

Erica Cooper and Junichi Yamagishi:
How do Voices from Past Speech Synthesis Challenges Compare Today?

   
 

Kazuya Yufune, Tomoki Koriyama, Shinnosuke Takamichi and Hiroshi Saruwatari:
Accent Modeling of Low-Resourced Dialect in Pitch Accent Language Using Variational Autoencoder

   
 

Jason Taylor, Sébastien Le Maguer and Korin Richmond:
Liaison and Pronunciation Learning in End-to-End Text-to-Speech in French

   
 

Qiao Tian, Chao Liu, Zewang Zhang, Heng Lu, Linghui Chen, Bin Wei, Pujiang He and Shan Liu:
FeatherTTS: Robust and Efficient attention based Neural TTS

   

Keynote 3

   11.10 - 12.10

   
 

Session Chair: Gustav Eje Henter

Thomas Drugman, Amazon, Germany
Expressive Neural TTS

 
 

Thomas Drugman is a Science Manager in the Amazon TTS Research team. He received his PhD in 2011 from the University of Mons, winning the IBM Belgium award for “Best Thesis in Computer Science”. His PhD thesis studied the use of glottal source analysis in Speech Processing. He then completed a three-year postdoc on speech/audio analysis for two biomedical applications: tracheo-esophageal speech reconstruction and cough detection in chronic respiratory diseases. In 2014, he joined Amazon as a Scientist in the Alexa ASR team. He then transferred to the TTS team in 2016, where he has been a Science Manager since 2017. He has contributed to making Amazon’s Neural TTS more natural and expressive, notably by enriching Alexa’s experience with different speaking styles: emotions, newscaster, whispering, etc. His current research interests lie in improving the naturalness and flow of longer synthetic speech interactions. He has about 125 publications in the field of Speech Processing. He received the Interspeech Best Student Paper awards in 2009 and 2014 (as supervisor). He has also been a member of the IEEE Speech and Language Technical Committee since 2019.

   
           

Morning Session Discussion

   12.10 - 13.10

   

Synthesis and Context

   13.10 - 15.10

   
 

Session Chair: Thomas Drugman

 
 

Pilar Oplustil-Gallegos, Johannah O'Mahony and Simon King:
Comparing acoustic and textual representations of previous linguistic context for improving Text-to-Speech

   
 

Wataru Nakata, Tomoki Koriyama, Shinnosuke Takamichi, Naoko Tanji, Yusuke Ijima, Ryo Masumura and Hiroshi Saruwatari:
Audiobook Speech Synthesis Conditioned by Cross-Sentence Context-Aware Word Embeddings

   
 

Mano Ranjith Kumar M, Jom Kuriakose, Karthik Pandia D S and Hema A Murthy:
Lipsyncing efforts for transcreating lecture videos in Indian languages

 
 

Marco Nicolis and Viacheslav Klimkov:
Homograph disambiguation with contextual word embeddings for TTS systems

Jason Fong, Jennifer Williams and Simon King:
Analysing Temporal Sensitivity of VQ-VAE Sub-Phone Codebooks
   

Closing & SynSIG announcement

   15.10 - 15.20

   
 

 
           
           
           
