The 11th Speech Synthesis Workshop (SSW11) will be held physically in Hungary as scheduled from August 26 to 28, 2021. The organizing team is closely monitoring the development of the COVID-19 situation.

Call for Papers

Speech Synthesis Workshop (SSW)

At an international conference on speech processing, a speech scientist once held up a tube of toothpaste (whose brand was "Signal") and, squeezing it in front of the audience, coined the phrase "This is speech synthesis; speech recognition is the art of pushing the toothpaste back into the tube."

One could turn this very simplistic view the other way round: users are generally far more tolerant of speech recognition errors than of unnatural synthetic speech. There is magic in a speech recognizer that transcribes continuous radio speech into text with a word accuracy as low as 50%; in contrast, even a perfectly intelligible speech synthesizer is only moderately tolerated by users if it delivers nothing more than "robot voices". Delivering both intelligibility and naturalness has been the holy grail of speech synthesis research for the past 30 years. More recently, expressivity has been added as a major objective of speech synthesis.

Add to this the engineering costs (computational cost, memory cost, design cost for making another synthetic voice or another language) which have to be taken into account, and you'll start to have an idea of the challenges underlying text-to-speech synthesis.

Major challenges call for major meetings: the Speech Synthesis Workshops (SSWs), held under the auspices of ISCA's SynSIG. SSWs were traditionally held every three years; in 2019 it was decided to hold one every two years, since the technology is advancing faster these days. SSWs provide a unique occasion for people in the speech synthesis area to meet each other. They contribute to establishing a feeling that we are all participating in a joint effort towards intelligible, natural, and expressive synthetic speech.

Workshop Topics

Submissions are encouraged in all areas of speech synthesis technology, including but not limited to:

  • Grapheme-to-phoneme conversion for synthesis
  • Text processing for speech synthesis (text normalization, syntactic and semantic analysis, intent detection)
  • Segmental-level and/or concatenative synthesis
  • Signal processing/statistical models for synthesis
  • Speech synthesis paradigms and methods; articulatory synthesis, articulation-to-speech synthesis, parametric synthesis etc.
  • Prosody modeling, transfer and generation
  • Expression, emotion and personality generation
  • Voice conversion and modification, morphing (parallel and non-parallel)
  • Concept-to-speech conversion, speech synthesis in dialog systems
  • Avatars and talking faces
  • Cross-lingual and multilingual aspects for synthesis (e.g. automatic language switching)
  • Applications of synthesis technologies to communication disorders
  • TTS for embedded devices and computational issues
  • Tools and data for speech synthesis
  • Quality assessment/evaluation metrics in synthesis
  • End-to-end text-to-speech synthesis
  • Direct speech waveform modelling and generation
  • Neural vocoding for speech synthesis
  • Speech synthesis using non-ideal data ('found', user-contributed, etc.)
  • Natural language generation for speech synthesis
  • Special topic: Speech uniqueness and deep learning (generating diverse and natural speech)

Call for Demos

We are planning to have a demo session to showcase new developments in speech synthesis. If you have demonstrations of your work that do not really fit in a regular oral or poster presentation, please let us know.


Thomas Drugman, Amazon, Germany
Expressive Neural TTS

Thomas Drugman is a Science Manager in the Amazon TTS Research team. He received his PhD in 2011 from the University of Mons, winning the IBM Belgium award for “Best Thesis in Computer Science”. His PhD thesis studied the use of glottal source analysis in Speech Processing. He then completed a 3-year postdoc on speech/audio analysis for two biomedical applications: trachea-esophageal speech reconstruction and cough detection in chronic respiratory diseases. In 2014, he joined Amazon as a Scientist in the Alexa ASR team. He then transferred to the TTS team in 2016, where he has been Science Manager since 2017. He has contributed to making Amazon’s Neural TTS more natural and expressive, notably by enriching Alexa’s experience with different speaking styles: emotions, newscaster, whispering, etc. His current research interests lie in improving the naturalness and flow of longer synthetic speech interactions. He has about 125 publications in the field of Speech Processing. He received the Interspeech Best Student Paper awards in 2009 and 2014 (as supervisor). He has also been a member of the IEEE Speech and Language Technical Committee since 2019.


István Winkler, Research Centre for Natural Sciences, Hungary
Early Development of Infantile Communication by Sound

István Winkler, PhD, DSc, is an electrical engineer and psychologist. He received his PhD in 1993 at the University of Helsinki, studying auditory sensory memory by electroencephalographic measures. He defended his Doctor of Science thesis in 2005 at the Hungarian Academy of Sciences on auditory deviance detection. His current fields of interest are predictive processing in auditory deviance detection, auditory scene analysis, communication by sound, and the development of these functions in infancy. During his career, he has authored or coauthored over 250 publications, which have received over 11000 citations. He is currently the director of the Institute of Cognitive Neuroscience and Psychology, Research Centre for Natural Sciences, Budapest, Hungary, and the head of the Sound and Speech Perception research group.

Lior Wolf, Facebook AI Research and Tel Aviv University, Israel
Deep Audio Conversion Technologies and Their Applications in Speech, Singing, and Music

Lior Wolf is a research scientist at Facebook AI Research and a full professor in the School of Computer Science at Tel Aviv University, Israel. He conducted postdoctoral research at Prof. Poggio's lab at the Massachusetts Institute of Technology and received his PhD degree from the Hebrew University, under the supervision of Prof. Shashua. He is an ERC grantee and has won honorable mentions at ICCV 2001 and ICCV 2019, and best paper awards at ECCV 2000 and ICANN 2016. His research focuses on computer vision, audio synthesis, and deep learning.


List of ISCA ITRW Speech Synthesis Workshops (SSW)
(Full papers are available on-line at the ISCA Archive, links provided)

SSW11   August 26-28, 2021, Budapest, Hungary
SSW10   September 20-22, 2019, Vienna, Austria
SSW9   September 13-15, 2016, Sunnyvale, California, USA
SSW8   August 31 - September 2, 2013, Barcelona, Spain
SSW7   September 22-24, 2010, Kyoto, Japan
SSW6   August 22-24, 2007, Bonn, Germany
SSW5   June 14-16, 2004, Pittsburgh, PA, USA
SSW4   August 29 - September 1, 2001, Atholl Palace Hotel, Pitlochry, Perthshire, Scotland
SSW3   November 26-29, 1998, Jenolan Caves House, Blue Mountains, Australia (Dedicated to the memory of Christian Benoît)
SSW2   September 12-15, 1994, Mohonk Mountain House, New Paltz, NY, USA
SSW1   September 25-28, 1990, Autrans, France