2018-2019 Talks



Speaker: Prof. Simon King, University of Edinburgh
Title: Does “end-to-end” speech synthesis mean we don’t need text processing or signal processing any more?
Time & Venue: Friday 18th January 2019 at 11.30am, Thomas Davis Lecture Theatre (Room 2043), Arts Block, TCD

Abstract: Almost every text-to-speech synthesiser contains three components. A front-end text processor normalises the input text and extracts useful features from it. An acoustic model performs regression from these features to an acoustic representation, such as a spectrogram. A waveform generator then creates the corresponding waveform.

In many commercially deployed speech synthesisers, the waveform generator still constructs the output signal by concatenating pre-recorded fragments of natural speech. But very soon we expect that to be replaced by a neural vocoder that directly outputs a waveform. Neural approaches are already the dominant choice for acoustic modelling: they started with simple Deep Neural Networks guiding waveform concatenation and have progressed to sequence-to-sequence models driving a vocoder. Completely replacing the traditional front-end pipeline with an entirely neural approach is trickier, although there are some impressive so-called "end-to-end" systems.

In this rush to use end-to-end neural models to directly generate waveforms given raw text input, much of what we know about text and speech signal processing appears to have been cast aside. Maybe this is a good thing: the new methods are a long-overdue breath of fresh air. Or perhaps there is still some value in the knowledge accumulated from 50+ years of speech processing. If there is, how do we decide what to keep and what to discard? For example, is source-filter modelling still a good idea?

Page last modified on October 11, 2019