Program

(to be updated)

  EDT    JST   CEST  Thursday                        Friday                          Saturday
 2.30  15.30   8.30  Opening
 3.00  16.00   9.00  Session 1                       Session 4                       Session 7
 5.00  18.00  11.00  break                           break                           break
 5.10  18.10  11.10  Keynote 1                       Keynote 2                       Keynote 3
 6.10  19.10  12.10  Morning Session Discussion      Morning Session Discussion 2    Morning Session Discussion 3
 7.10  20.10  13.10  Session 2                       Session 5                       Session 8
 9.10  22.10  15.10  break                           break                           Closing & SynSIG announcement
 9.20  22.20  15.20  Session 3                       Session 6
11.20   0.20  17.20  Afternoon Session Discussion    Afternoon Session Discussion 2


Welcome reception

Social event

August 26, Thursday

Opening

   8.30 - 9.00

Géza Németh, Chairman, BME, Hungary

Session 1

   9.00 - 11.10

Session Chair:

Sai Sirisha Rallabandi, Babak Naderi and Sebastian Möller:
Identifying the vocal cues of likeability, friendliness and skilfulness in synthetic speech

Tamás Gábor Csapó:
Extending Text-to-Speech Synthesis with Articulatory Movement Prediction using Ultrasound Tongue Imaging

Martin Lenglet, Olivier Perrotin and Gérard Bailly:
Impact of Segmentation and Annotation in French end-to-end Synthesis

Bence Mark Halpern, Marc Illa, Rob van Son, Laureano Moro-Velazquez and Odette Scharenborg:
Pathological voice adaptation with autoencoder-based voice conversion

Elijah Gutierrez, Pilar Oplustil-Gallegos and Catherine Lai:
Location, Location: Enhancing the Evaluation of Text-to-Speech synthesis using the Rapid Prosody Transcription Paradigm

Keynote 1

   11.10 - 12.10

Lior Wolf, Facebook AI Research and Tel Aviv University, Israel
Deep Audio Conversion Technologies and Their Applications in Speech, Singing, and Music

Lior Wolf is a research scientist at Facebook AI Research and a full professor in the School of Computer Science at Tel Aviv University, Israel. He conducted postdoctoral research at Prof. Poggio's lab at the Massachusetts Institute of Technology and received his PhD from the Hebrew University under the supervision of Prof. Shashua. He is an ERC grantee and has won honorable mentions at ICCV 2001 and ICCV 2019, as well as best paper awards at ECCV 2000 and ICANN 2016. His research focuses on computer vision, audio synthesis, and deep learning.

Morning Session Discussion

   12.10 - 13.10

Session 2

   13.10 - 15.10

Session Chair:

Tamás Gábor Csapó, László Tóth, Gábor Gosztolya and Alexandra Markó:
Speech Synthesis from Text and Ultrasound Tongue Image-based Articulatory Input

Javier Latorre, Charlotte Bailleul, Tuuli Morrill, Alistair Conkie and Yannis Stylianou:
Combining speakers of multiple languages to improve quality of neural voices

Christina Tånnander and Jens Edlund:
Methods of slowing down speech

Joakim Gustafson, Jonas Beskow and Eva Szekely:
Personality in the mix - investigating the contribution of fillers and speaking style to the perception of spontaneous speech synthesis

Csaba Zainkó, László Tóth, Amin Honarmandi Shandiz, Gábor Gosztolya, Alexandra Markó, Géza Németh and Tamás Gábor Csapó:
Adaptation of Tacotron2-based Text-To-Speech for Articulatory-to-Acoustic Mapping using Ultrasound Tongue Imaging

Session 3

   15.20 - 17.20

Session Chair:

Bastian Schnell and Philip N. Garner:
Improving Emotional TTS with an Emotion Intensity Input from Unsupervised Extraction

Slava Shechtman and Avrech Ben-David:
Acquiring conversational speaking style from multi-speaker spontaneous dialog corpus for prosody-controllable sequence-to-sequence speech synthesis

Bastian Schnell, Goeric Huybrechts, Bartek Perz, Thomas Drugman and Jaime Lorenzo-Trueba:
EmoCat: Language-agnostic Emotional Voice Conversion

Abdelhamid Ezzerg, Adam Gabrys, Bartosz Putrycz, Daniel Korzekwa, Daniel Saez-Trigueros, David McHardy, Kamil Pokora, Jakub Lachowicz, Jaime Lorenzo-Trueba and Viacheslav Klimkov:
Enhancing audio quality for expressive Neural Text-to-Speech

Lucas H. Ueda, Paula D. P. Costa, Flavio O. Simoes and Mário U. Neto:
Are we truly modeling expressiveness? A study on expressive TTS in Brazilian Portuguese for real-life application styles

Afternoon Session Discussion

   17.20 - 18.20

August 27, Friday

Session 4

   9.00 - 11.10

Session Chair:

Debasish Ray Mohapatra, Pramit Saha, Yadong Liu, Bryan Gick and Sidney Fels:
Vocal tract area function extraction using ultrasound for articulatory speech synthesis

Raahil Shah, Kamil Pokora, Abdelhamid Ezzerg, Viacheslav Klimkov, Goeric Huybrechts, Bartosz Putrycz, Daniel Korzekwa and Thomas Merritt:
Non-Autoregressive TTS with Explicit Duration Modelling for Low-Resource Highly Expressive Speech

Paul Konstantin Krug, Simon Stone and Peter Birkholz:
Intelligibility and naturalness of articulatory synthesis with VocalTractLab compared to established speech synthesis technologies

Ambika Kirkland, Marcin Włodarczak, Joakim Gustafson and Eva Szekely:
Perception of smiling voice in spontaneous speech synthesis

Alejandro Mottini, Jaime Lorenzo-Trueba, Sri Vishnu Kumar Karlapati and Thomas Drugman:
Voicy: Zero-Shot Non-Parallel Voice Conversion in Noisy Reverberant Environments

Keynote 2

   11.10 - 12.10

István Winkler, Research Centre for Natural Sciences, Hungary
Early Development of Infantile Communication by Sound

István Winkler, PhD, DSc, is an electrical engineer and psychologist. He received his PhD in 1993 from the University of Helsinki, studying auditory sensory memory with electroencephalographic measures, and defended his Doctor of Science thesis on auditory deviance detection at the Hungarian Academy of Sciences in 2005. His current fields of interest are predictive processing in auditory deviance detection, auditory scene analysis, communication by sound, and the development of these functions in infancy. He has authored or coauthored over 250 publications, which have received over 11,000 citations. He is currently the director of the Institute of Cognitive Neuroscience and Psychology, Research Centre for Natural Sciences, Budapest, Hungary, and the head of the Sound and Speech Perception research group (http://www.ttk.hu/kpi/en/sound-and-speech-perception/).

Morning Session Discussion

   12.10 - 13.10

Session 5

   13.10 - 15.10

Session Chair:

Konstantinos Markopoulos, Nikolaos Ellinas, Alexandra Vioni, Myrsini Christidou, Panos Kakoulidis, Georgios Vamvoukakis, June Sig Sung, Hyoungmin Park, Pirros Tsiakoulis, Aimilios Chalamandaris and Georgia Maniati:
Rapping-Singing Voice Synthesis based on Phoneme-level Prosody Control

Jennifer Williams, Jason Fong, Erica Cooper and Junichi Yamagishi:
Exploring Disentanglement with Multilingual and Monolingual VQ-VAE

Erica Cooper, Xin Wang and Junichi Yamagishi:
Text-to-Speech Synthesis Techniques for MIDI-to-Audio Synthesis

Hieu-Thi Luong and Junichi Yamagishi:
Preliminary study on using vector quantization latent spaces for TTS/VC systems with consistent performance

Patrick Lumban Tobing and Tomoki Toda:
Low-latency real-time non-parallel voice conversion based on cyclic variational autoencoder and multiband WaveRNN with data-driven linear prediction

Session 6

   15.20 - 17.20

Session Chair:

Johannah O'Mahony, Pilar Oplustil-Gallegos, Catherine Lai and Simon King:
Factors Affecting the Evaluation of Synthetic Speech in Context

Arun Baby, Pranav Jawale, Saranya Vinnaitherthan, Sumukh Badam, Nagaraj Adiga and Sharath Adavane:
Non-native English lexicon creation for bilingual speech synthesis

Dan Wells and Korin Richmond:
Cross-lingual Transfer of Phonological Features for Low-resource Speech Synthesis

Ayushi Pandey, Sébastien Le Maguer, Julie Berndsen and Naomi Harte:
Mind your p’s and k’s -- Comparing obstruents across TTS voices of the Blizzard Challenge 2013

Jason Fong, Jilong Wu, Prabhav Agrawal, Andrew Gibiansky, Thilo Koehler and Qing He:
Improving Polyglot Speech Synthesis through Multi-task and Adversarial Learning

Afternoon Session Discussion

   17.20 - 18.20

August 28, Saturday

Session 7

   9.00 - 11.10

Ammar Abbas, Bajibabu Bollepalli, Alexis Moinet, Arnaud Joly, Penny Karanasou, Peter Makarov, Simon Slangens, Sri Karlapati and Thomas Drugman:
Multi-Scale Spectrogram Modelling for Neural Text-to-Speech

Erica Cooper and Junichi Yamagishi:
How do Voices from Past Speech Synthesis Challenges Compare Today?

Kazuya Yufune, Tomoki Koriyama, Shinnosuke Takamichi and Hiroshi Saruwatari:
Accent Modeling of Low-Resourced Dialect in Pitch Accent Language Using Variational Autoencoder

Jason Taylor, Sébastien Le Maguer and Korin Richmond:
Liaison and Pronunciation Learning in End-to-End Text-to-Speech in French

Qiao Tian, Chao Liu, Zewang Zhang, Heng Lu, Linghui Chen, Bin Wei, Pujiang He and Shan Liu:
FeatherTTS: Robust and Efficient attention based Neural TTS

Keynote 3

   11.10 - 12.10

Thomas Drugman, Amazon, Germany
Expressive Neural TTS

Thomas Drugman is a Science Manager in the Amazon TTS Research team. He received his PhD in 2011 from the University of Mons, winning the IBM Belgium award for “Best Thesis in Computer Science”; his thesis studied the use of glottal source analysis in speech processing. He then spent three years as a post-doc working on speech and audio analysis for two biomedical applications: tracheo-esophageal speech reconstruction and cough detection in chronic respiratory diseases. In 2014 he joined Amazon as a Scientist in the Alexa ASR team, moved to the TTS team in 2016, and has been a Science Manager there since 2017. He has contributed to making Amazon’s Neural TTS more natural and expressive, notably by enriching Alexa’s experience with different speaking styles: emotions, newscaster, whispering, etc. His current research interests lie in improving the naturalness and flow of longer synthetic speech interactions. He has about 125 publications in the field of speech processing and received the Interspeech Best Student Paper award in 2009 and, as supervisor, in 2014. He has been a member of the IEEE Speech and Language Technical Committee since 2019.

Morning Session Discussion

   12.10 - 13.10

Session 8

   13.10 - 15.10

Session Chair:

Pilar Oplustil-Gallegos, Johannah O'Mahony and Simon King:
Comparing acoustic and textual representations of previous linguistic context for improving Text-to-Speech

Wataru Nakata, Tomoki Koriyama, Shinnosuke Takamichi, Naoko Tanji, Yusuke Ijima, Ryo Masumura and Hiroshi Saruwatari:
Audiobook Speech Synthesis Conditioned by Cross-Sentence Context-Aware Word Embeddings

Mano Ranjith Kumar M, Jom Kuriakose, Karthik Pandia D S and Hema A Murthy:
Lipsyncing efforts for transcreating lecture videos in Indian languages

Marco Nicolis and Viacheslav Klimkov:
Homograph disambiguation with contextual word embeddings for TTS systems

Jason Fong, Jennifer Williams and Simon King:
Analysing Temporal Sensitivity of VQ-VAE Sub-Phone Codebooks

Closing & SynSIG announcement

   15.10 - 15.20

Session Chair:

Accepted papers

Tamás Gábor Csapó. Extending Text-to-Speech Synthesis with Articulatory Movement Prediction using Ultrasound Tongue Imaging

Tamás Gábor Csapó, László Tóth, Gábor Gosztolya and Alexandra Markó. Speech Synthesis from Text and Ultrasound Tongue Image-based Articulatory Input

Jennifer Williams, Jason Fong, Erica Cooper and Junichi Yamagishi. Exploring Disentanglement with Multilingual and Monolingual VQ-VAE

Bastian Schnell and Philip N. Garner. Improving Emotional TTS with an Emotion Intensity Input from Unsupervised Extraction

Bastian Schnell, Goeric Huybrechts, Bartek Perz, Thomas Drugman and Jaime Lorenzo-Trueba. EmoCat: Language-agnostic Emotional Voice Conversion

Slava Shechtman and Avrech Ben-David. Acquiring conversational speaking style from multi-speaker spontaneous dialog corpus for prosody-controllable sequence-to-sequence speech synthesis

Pilar Oplustil-Gallegos, Johannah O'Mahony and Simon King. Comparing acoustic and textual representations of previous linguistic context for improving Text-to-Speech

Ammar Abbas, Bajibabu Bollepalli, Alexis Moinet, Arnaud Joly, Penny Karanasou, Peter Makarov, Simon Slangens, Sri Karlapati and Thomas Drugman. Multi-Scale Spectrogram Modelling for Neural Text-to-Speech

Raahil Shah, Kamil Pokora, Abdelhamid Ezzerg, Viacheslav Klimkov, Goeric Huybrechts, Bartosz Putrycz, Daniel Korzekwa and Thomas Merritt. Non-Autoregressive TTS with Explicit Duration Modelling for Low-Resource Highly Expressive Speech

Abdelhamid Ezzerg, Adam Gabrys, Bartosz Putrycz, Daniel Korzekwa, Daniel Saez-Trigueros, David McHardy, Kamil Pokora, Jakub Lachowicz, Jaime Lorenzo-Trueba and Viacheslav Klimkov. Enhancing audio quality for expressive Neural Text-to-Speech

Javier Latorre, Charlotte Bailleul, Tuuli Morrill, Alistair Conkie and Yannis Stylianou. Combining speakers of multiple languages to improve quality of neural voices

Paul Konstantin Krug, Simon Stone and Peter Birkholz. Intelligibility and naturalness of articulatory synthesis with VocalTractLab compared to established speech synthesis technologies

Wataru Nakata, Tomoki Koriyama, Shinnosuke Takamichi, Naoko Tanji, Yusuke Ijima, Ryo Masumura and Hiroshi Saruwatari. Audiobook Speech Synthesis Conditioned by Cross-Sentence Context-Aware Word Embeddings

Mano Ranjith Kumar M, Jom Kuriakose, Karthik Pandia D S and Hema A Murthy. Lipsyncing efforts for transcreating lecture videos in Indian languages

Sai Sirisha Rallabandi, Babak Naderi and Sebastian Möller. Identifying the vocal cues of likeability, friendliness and skilfulness in synthetic speech

Erica Cooper and Junichi Yamagishi. How do Voices from Past Speech Synthesis Challenges Compare Today?

Erica Cooper, Xin Wang and Junichi Yamagishi. Text-to-Speech Synthesis Techniques for MIDI-to-Audio Synthesis

Kazuya Yufune, Tomoki Koriyama, Shinnosuke Takamichi and Hiroshi Saruwatari. Accent Modeling of Low-Resourced Dialect in Pitch Accent Language Using Variational Autoencoder

Johannah O'Mahony, Pilar Oplustil-Gallegos, Catherine Lai and Simon King. Factors Affecting the Evaluation of Synthetic Speech in Context

Arun Baby, Pranav Jawale, Saranya Vinnaitherthan, Sumukh Badam, Nagaraj Adiga and Sharath Adavane. Non-native English lexicon creation for bilingual speech synthesis

Alejandro Mottini, Jaime Lorenzo-Trueba, Sri Vishnu Kumar Karlapati and Thomas Drugman. Voicy: Zero-Shot Non-Parallel Voice Conversion in Noisy Reverberant Environments

Dan Wells and Korin Richmond. Cross-lingual Transfer of Phonological Features for Low-resource Speech Synthesis

Jason Taylor, Sébastien Le Maguer and Korin Richmond. Liaison and Pronunciation Learning in End-to-End Text-to-Speech in French

Hieu-Thi Luong and Junichi Yamagishi. Preliminary study on using vector quantization latent spaces for TTS/VC systems with consistent performance

Patrick Lumban Tobing and Tomoki Toda. Low-latency real-time non-parallel voice conversion based on cyclic variational autoencoder and multiband WaveRNN with data-driven linear prediction

Konstantinos Markopoulos, Nikolaos Ellinas, Alexandra Vioni, Myrsini Christidou, Panos Kakoulidis, Georgios Vamvoukakis, June Sig Sung, Hyoungmin Park, Pirros Tsiakoulis, Aimilios Chalamandaris and Georgia Maniati. Rapping-Singing Voice Synthesis based on Phoneme-level Prosody Control

Martin Lenglet, Olivier Perrotin and Gérard Bailly. Impact of Segmentation and Annotation in French end-to-end Synthesis

Christina Tånnander and Jens Edlund. Methods of slowing down speech

Marco Nicolis and Viacheslav Klimkov. Homograph disambiguation with contextual word embeddings for TTS systems

Joakim Gustafson, Jonas Beskow and Eva Szekely. Personality in the mix - investigating the contribution of fillers and speaking style to the perception of spontaneous speech synthesis

Ayushi Pandey, Sébastien Le Maguer, Julie Berndsen and Naomi Harte. Mind your p’s and k’s -- Comparing obstruents across TTS voices of the Blizzard Challenge 2013

Bence Mark Halpern, Marc Illa, Rob van Son, Laureano Moro-Velazquez and Odette Scharenborg. Pathological voice adaptation with autoencoder-based voice conversion

Lucas H. Ueda, Paula D. P. Costa, Flavio O. Simoes and Mário U. Neto. Are we truly modeling expressiveness? A study on expressive TTS in Brazilian Portuguese for real-life application styles

Debasish Ray Mohapatra, Pramit Saha, Yadong Liu, Bryan Gick and Sidney Fels. Vocal tract area function extraction using ultrasound for articulatory speech synthesis

Ambika Kirkland, Marcin Włodarczak, Joakim Gustafson and Eva Szekely. Perception of smiling voice in spontaneous speech synthesis

Elijah Gutierrez, Pilar Oplustil-Gallegos and Catherine Lai. Location, Location: Enhancing the Evaluation of Text-to-Speech synthesis using the Rapid Prosody Transcription Paradigm

Jason Fong, Jilong Wu, Prabhav Agrawal, Andrew Gibiansky, Thilo Koehler and Qing He. Improving Polyglot Speech Synthesis through Multi-task and Adversarial Learning

Jason Fong, Jennifer Williams and Simon King. Analysing Temporal Sensitivity of VQ-VAE Sub-Phone Codebooks

Qiao Tian, Chao Liu, Zewang Zhang, Heng Lu, Linghui Chen, Bin Wei, Pujiang He and Shan Liu. FeatherTTS: Robust and Efficient attention based Neural TTS

Csaba Zainkó, László Tóth, Amin Honarmandi Shandiz, Gábor Gosztolya, Alexandra Markó, Géza Németh and Tamás Gábor Csapó. Adaptation of Tacotron2-based Text-To-Speech for Articulatory-to-Acoustic Mapping using Ultrasound Tongue Imaging

Platinum Sponsor

Gold Sponsors

Partners