Program

EDT	JST	CEST	Thursday	Friday	Saturday	CEST
2.30	15.30	8.30	Opening
3.00	16.00	9.00	Special Synthesis Problems	Articulation and Naturalness	Modeling and Evaluation	9.00
5.00	18.00	11.00	break	break	break	11.00
5.10	18.10	11.10	Lior Wolf - Keynote 1	István Winkler - Keynote 2	Thomas Drugman - Keynote 3	11.10
6.10	19.10	12.10	Morning Session Discussion	Morning Session Discussion 2	Morning Session Discussion 3	12.10
7.10	20.10	13.10	Articulation and Speech Styles	Emotion, Singing and Voice Conversion	Synthesis and Context	13.10
9.10	22.10	15.10	break	break	Closing & SynSIG announcement	15.10
9.20	22.20	15.20	Expressive Synthesis	Multilingual and Evaluation		15.20
11.20	0.20	17.20	Afternoon Session Discussion	Afternoon Session Discussion 2		17.20
			Welcome reception in the venue	Social event

Papers at ISCA website

Each paper has its DOI number. https://www.isca-speech.org/archive/ssw_2021/index.html

August 26, Thursday
Opening 8.30 - 9.00
	Géza Németh, Chairman, BME, Hungary
Special Synthesis Problems 9.00 - 11.00
	Session Chair: Lior Wolf
	Sai Sirisha Rallabandi, Babak Naderi and Sebastian Möller: Identifying the vocal cues of likeability, friendliness and skilfulness in synthetic speech
	Tamás Gábor Csapó: Extending Text-to-Speech Synthesis with Articulatory Movement Prediction using Ultrasound Tongue Imaging
	Martin Lenglet, Olivier Perrotin and Gérard Bailly: Impact of Segmentation and Annotation in French end-to-end Synthesis
	Marc Illa, Bence Mark Halpern, Rob van Son, Laureano Moro-Velazquez and Odette Scharenborg: Pathological voice adaptation with autoencoder-based voice conversion
	Elijah Gutierrez, Pilar Oplustil-Gallegos and Catherine Lai: Location, Location: Enhancing the Evaluation of Text-to-Speech synthesis using the Rapid Prosody Transcription Paradigm
Keynote 1 11.10 - 12.10
	Session Chair: Erica Cooper
	Lior Wolf, Facebook AI Research and Tel Aviv University, Israel
	Deep Audio Conversion Technologies and Their Applications in Speech, Singing, and Music
	Lior Wolf is a research scientist at Facebook AI Research and a full professor in the School of Computer Science at Tel Aviv University, Israel. He conducted postdoctoral research at Prof. Poggio's lab at the Massachusetts Institute of Technology and received his PhD degree from the Hebrew University, under the supervision of Prof. Shashua. He is an ERC grantee and has won honorable mentions at ICCV 2001 and ICCV 2019, and best paper awards at ECCV 2000 and ICANN 2016. His research focuses on computer vision, audio synthesis, and deep learning.

Morning Session Discussion 12.10 - 13.10
Articulation and Speech Styles 13.10 - 15.10
	Session Chair: Esther Klabbers
	Tamás Gábor Csapó, Laszlo Toth, Gábor Gosztolya and Alexandra Markó: Speech Synthesis from Text and Ultrasound Tongue Image-based Articulatory Input
	Javier Latorre, Charlotte Bailleul, Tuuli Morrill, Alistair Conkie and Yannis Stylianou: Combining speakers of multiple languages to improve quality of neural voices
	Christina Tånnander and Jens Edlund: Methods of slowing down speech
	Joakim Gustafson, Jonas Beskow and Eva Szekely: Personality in the mix - investigating the contribution of fillers and speaking style to the perception of spontaneous speech synthesis
	Csaba Zainkó, László Tóth, Amin Honarmandi Shandiz, Gábor Gosztolya, Alexandra Markó, Géza Németh and Tamás Gábor Csapó: Adaptation of Tacotron2-based Text-To-Speech for Articulatory-to-Acoustic Mapping using Ultrasound Tongue Imaging
Expressive Synthesis 15.20 - 17.20
	Session Chair: Gábor Olaszy
	Bastian Schnell and Philip N. Garner: Improving Emotional TTS with an Emotion Intensity Input from Unsupervised Extraction
	Slava Shechtman and Avrech Ben-David: Acquiring conversational speaking style from multi-speaker spontaneous dialog corpus for prosody-controllable sequence-to-sequence speech synthesis
	Bastian Schnell, Goeric Huybrechts, Bartek Perz, Thomas Drugman and Jaime Lorenzo-Trueba: EmoCat: Language-agnostic Emotional Voice Conversion
	Abdelhamid Ezzerg, Adam Gabrys, Bartosz Putrycz, Daniel Korzekwa, Daniel Saez-Trigueros, David McHardy, Kamil Pokora, Jakub Lachowicz, Jaime Lorenzo-Trueba and Viacheslav Klimkov: Enhancing audio quality for expressive Neural Text-to-Speech
	Lucas H. Ueda, Paula D. P. Costa, Flavio O. Simoes and Mário U. Neto: Are we truly modeling expressiveness? A study on expressive TTS in Brazilian Portuguese for real-life application styles
Afternoon Session Discussion 17.20 - 18.20



August 27, Friday
Articulation and Naturalness 9.00 - 11.00
	Session Chair: Tamás Gábor Csapó
	Debasish Ray Mohapatra, Pramit Saha, Yadong Liu, Bryan Gick and Sidney Fels: Vocal tract area function extraction using ultrasound for articulatory speech synthesis
	Raahil Shah, Kamil Pokora, Abdelhamid Ezzerg, Viacheslav Klimkov, Goeric Huybrechts, Bartosz Putrycz, Daniel Korzekwa and Thomas Merritt: Non-Autoregressive TTS with Explicit Duration Modelling for Low-Resource Highly Expressive Speech
	Paul Konstantin Krug, Simon Stone and Peter Birkholz: Intelligibility and naturalness of articulatory synthesis with VocalTractLab compared to established speech synthesis technologies
	Ambika Kirkland, Marcin Włodarczak, Joakim Gustafson and Eva Szekely: Perception of smiling voice in spontaneous speech synthesis
	Alejandro Mottini, Jaime Lorenzo-Trueba, Sri Vishnu Kumar Karlapati and Thomas Drugman: Voicy: Zero-Shot Non-Parallel Voice Conversion in Noisy Reverberant Environments
Keynote 2 11.10 - 12.10
	Session Chair: Cassia Valentini Botinhao
	István Winkler, Research Centre for Natural Sciences, Hungary
	Early Development of Infantile Communication by Sound
	István Winkler, PhD, DSc, is an electrical engineer and psychologist. He received his PhD in 1993 at the University of Helsinki, studying auditory sensory memory by electroencephalographic measures. He defended his Doctor of Science thesis in 2005 at the Hungarian Academy of Sciences on auditory deviance detection. His current fields of interest are predictive processing in auditory deviance detection, auditory scene analysis, communication by sound, and the development of these functions in infancy. During his career, he has authored or coauthored over 250 publications, which have received over 11,000 citations. Currently he is the director of the Institute of Cognitive Neuroscience and Psychology, Research Centre for Natural Sciences, Budapest, Hungary, and the head of the Sound and Speech Perception research group (http://www.ttk.hu/kpi/en/sound-and-speech-perception/).

Morning Session Discussion 12.10 - 13.10
Emotion, Singing and Voice Conversion 13.10 - 15.10
	Session Chair: Simon King
	Konstantinos Markopoulos, Nikolaos Ellinas, Alexandra Vioni, Myrsini Christidou, Panos Kakoulidis, Georgios Vamvoukakis, June Sig Sung, Hyoungmin Park, Pirros Tsiakoulis, Aimilios Chalamandaris and Georgia Maniati: Rapping-Singing Voice Synthesis based on Phoneme-level Prosody Control
	Jennifer Williams, Jason Fong, Erica Cooper and Junichi Yamagishi: Exploring Disentanglement with Multilingual and Monolingual VQ-VAE
	Erica Cooper, Xin Wang and Junichi Yamagishi: Text-to-Speech Synthesis Techniques for MIDI-to-Audio Synthesis
	Hieu-Thi Luong and Junichi Yamagishi: Preliminary study on using vector quantization latent spaces for TTS/VC systems with consistent performance
	Patrick Lumban Tobing and Tomoki Toda: Low-latency real-time non-parallel voice conversion based on cyclic variational autoencoder and multiband WaveRNN with data-driven linear prediction
Multilingual and Evaluation 15.20 - 17.20
	Session Chair: Junichi Yamagishi
	Johannah O'Mahony, Pilar Oplustil-Gallegos, Catherine Lai and Simon King: Factors Affecting the Evaluation of Synthetic Speech in Context
	Arun Baby, Pranav Jawale, Saranya Vinnaitherthan, Sumukh Badam, Nagaraj Adiga and Sharath Adavane: Non-native English lexicon creation for bilingual speech synthesis
	Dan Wells and Korin Richmond: Cross-lingual Transfer of Phonological Features for Low-resource Speech Synthesis
	Ayushi Pandey, Sebastien Le Maguer, Julie Berndsen and Naomi Harte: Mind your p’s and k’s -- Comparing obstruents across TTS voices of the Blizzard Challenge 2013
	Jason Fong, Jilong Wu, Prabhav Agrawal, Andrew Gibiansky, Thilo Koehler and Qing He: Improving Polyglot Speech Synthesis through Multi-task and Adversarial Learning
Afternoon Session Discussion 17.20 - 18.20



August 28, Saturday
Modeling and Evaluation 9.00 - 11.00
	Session Chair: Gérard Bailly
	Ammar Abbas, Bajibabu Bollepalli, Alexis Moinet, Arnaud Joly, Penny Karanasou, Peter Makarov, Simon Slangens, Sri Karlapati and Thomas Drugman: Multi-Scale Spectrogram Modelling for Neural Text-to-Speech
	Erica Cooper and Junichi Yamagishi: How do Voices from Past Speech Synthesis Challenges Compare Today?
	Kazuya Yufune, Tomoki Koriyama, Shinnosuke Takamichi and Hiroshi Saruwatari: Accent Modeling of Low-Resourced Dialect in Pitch Accent Language Using Variational Autoencoder
	Jason Taylor, Sébastien Le Maguer and Korin Richmond: Liaison and Pronunciation Learning in End-to-End Text-to-Speech in French
	Qiao Tian, Chao Liu, Zewang Zhang, Heng Lu, Linghui Chen, Bin Wei, Pujiang He and Shan Liu: FeatherTTS: Robust and Efficient attention based Neural TTS
Keynote 3 11.10 - 12.10
	Session Chair: Gustav Eje Henter
	Thomas Drugman, Amazon, Germany
	Expressive Neural TTS
	Thomas Drugman is a Science Manager in the Amazon TTS Research team. He received his PhD in 2011 from the University of Mons, winning the IBM Belgium award for “Best Thesis in Computer Science”. His PhD thesis studied the use of glottal source analysis in Speech Processing. He then completed a 3-year post-doc on speech/audio analysis for two biomedical applications: trachea-esophageal speech reconstruction and cough detection in chronic respiratory diseases. In 2014, he joined Amazon as a Scientist in the Alexa ASR team. He then transferred to the TTS team in 2016, where he has been Science Manager since 2017. He has contributed to making Amazon’s Neural TTS more natural and expressive, notably by enriching Alexa’s experience with different speaking styles: emotions, newscaster, whispering, etc. His current research interests lie in improving the naturalness and flow of longer synthetic speech interactions. He has about 125 publications in the field of Speech Processing. He received the Interspeech Best Student Paper awards in 2009 and 2014 (as supervisor). He has also been a member of the IEEE Speech and Language Technical Committee since 2019.

Morning Session Discussion 12.10 - 13.10
Synthesis and Context 13.10 - 15.10
	Session Chair: Thomas Drugman
	Pilar Oplustil-Gallegos, Johannah O'Mahony and Simon King: Comparing acoustic and textual representations of previous linguistic context for improving Text-to-Speech
	Wataru Nakata, Tomoki Koriyama, Shinnosuke Takamichi, Naoko Tanji, Yusuke Ijima, Ryo Masumura and Hiroshi Saruwatari: Audiobook Speech Synthesis Conditioned by Cross-Sentence Context-Aware Word Embeddings
	Mano Ranjith Kumar M, Jom Kuriakose, Karthik Pandia D S and Hema A Murthy: Lipsyncing efforts for transcreating lecture videos in Indian languages
	Marco Nicolis and Viacheslav Klimkov: Homograph disambiguation with contextual word embeddings for TTS systems
	Jason Fong, Jennifer Williams and Simon King: Analysing Temporal Sensitivity of VQ-VAE Sub-Phone Codebooks
Closing & SynSIG announcement 15.10 - 15.20