2017-2018 Talks


Upcoming Speakers

To Be Provided...


Speaker: Dr. Ravi Kumar, TCD
Title: Control by Example using a quadcopter as a guinea pig
Time & Venue: Stack-B Trinity College - 2:00-3:00pm, 17-Oct-18

Abstract: This project explores the concept of “Control by Example”: using machine learning and control theory to make a drone hover at a specified height. The aim is for this work to lead eventually to an interesting teaching platform for both machine learning and control engineering. In this talk, we will introduce a proof of concept for such a design, in which a quadcopter is controlled using PID controllers and machine learning techniques. Two ML frameworks are explored for replacing the PID controller: neural networks (NNs) and Gaussian Processes (GPs). Results show that generating the control signal is more naturally posed as a regression problem, amenable to solution using GPs.
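As a rough illustration of the control side only, the sketch below runs a textbook PID altitude-hold loop on a toy one-dimensional quadcopter model. The gains, mass and dynamics are invented for this example and are not taken from the talk; the ML replacement of the controller is beyond a short sketch.

```python
# Minimal PID altitude-hold loop on a toy 1D quadcopter model.
# All constants (gains, mass, time step) are illustrative, not from the talk.

def simulate_hover(target=1.0, steps=2000, dt=0.01,
                   kp=8.0, ki=2.0, kd=5.0, mass=1.0, g=9.81):
    z, vz = 0.0, 0.0                      # altitude (m), vertical velocity (m/s)
    integral, prev_err = 0.0, target - z  # PID state
    for _ in range(steps):
        err = target - z
        integral += err * dt
        deriv = (err - prev_err) / dt
        prev_err = err
        # Hover feedforward (mass * g) plus the PID correction.
        thrust = mass * g + kp * err + ki * integral + kd * deriv
        thrust = max(0.0, thrust)         # rotors cannot push downwards
        vz += (thrust / mass - g) * dt    # integrate the dynamics
        z += vz * dt
    return z

final_altitude = simulate_hover()         # converges close to the 1 m target
```

A learned controller (NN or GP) would replace the PID expression above with a regressor mapping the state `(err, integral, deriv)` to a thrust command, trained on examples of good control.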

Speaker: Dr. Celine De Looze, TCD
Title: Changes in Speech Chunking in Reading Aloud is a Marker of Mild Cognitive Impairment and Mild-to-Moderate Alzheimer’s Disease
Time & Venue: Stack-B Trinity College - 2:00-3:00pm, 24-Jul-18

Abstract: Progressive loss of cognitive function in Alzheimer’s disease (AD) includes deficits in episodic memory, executive skills, working memory, attention, speech and language abilities, visuospatial and orientation skills. Many studies have shown that cognitive deficits in AD may be apparent years before diagnosis is made. Detecting these changes at an early stage is therefore critical, particularly for the development and establishment of early interventions and to make effective treatment decisions.

Speech and language impairments in Mild Cognitive Impairment (MCI) and AD are generally attributed to lexico-semantic deficits. In this talk, I will discuss how the temporal organisation of speech (reflective of speech production planning) in reading aloud may be indicative of working memory and attention deficits, and underlying neural mechanisms, in MCI and AD. I will also address the discriminative ability of these temporal measures for the detection of cognitive impairment. The findings and the methodology used (with regard to the designed speech tasks) will be discussed in relation to earlier work on speech planning in Multiple Sclerosis and Parkinson’s disease.

The development and implementation of connected speech-based technologies in clinical and community settings may provide additional information for the early detection of cognitive impairment in neurological diseases.

Speaker: Dr. Finnian Kelly, Oxford Wave Research (OWR)
Title: Challenges for real-world automatic speaker recognition
Time & Venue: Stack-B Trinity College - 2:00-3:00pm, 29-May-18

Abstract: Automatic speaker recognition is increasingly being deployed in real-world scenarios, ranging from forensic investigation to online authentication. This talk will highlight some of the challenges created by unseen and diverse conditions in different applications of speaker recognition, and present some of the ways in which these challenges can be dealt with in practice. This talk will also highlight the use of automatic speaker profiling to estimate speaker characteristics such as age, gender, spoken language, and vocal effort, and how these estimates can be used to inform speaker recognition system behaviour.

Speaker: Antonia Crescenci, Imperial College
Title: Image Editing With Deep Generative Models
Time & Venue: Stack-B Trinity College - 2:00-3:00pm, 13-Feb-18

Abstract: “Deep learning” involves modelling and explaining observations using highly complex functions with many learnable parameters. Recently, there has been significant progress in using deep models to synthesise new data, in particular images; such models are referred to as “generative models”. Typically, they take the form of latent variable models, whereby new images may be synthesised by drawing latent samples and passing them through a learned generative model. During training, generative models learn to organise the space of latent variables in an interesting (near-linear) way, allowing image operations to be performed directly in latent space. For example, traversing a line in latent space may correspond to the addition or removal of an attribute in image space. This enables interesting object design and image manipulation applications.

When editing attributes of an image (in this talk, human faces), we often want to edit a single attribute (e.g. smiling) of that image. By training a “generative model” with both images (e.g. faces) and additional attribute information (e.g. smiling/not smiling), we are able to weakly control how the latent space is organised, allowing us to then edit attributes of an image by changing a single (latent) variable in the model. Unfortunately, naive implementations may result in undesirable changes to other attributes in the image. We introduce a novel approach to “factor” the latent space such that, by changing a single latent variable, only the desired attribute is changed and the identity of the person is preserved.
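The generic latent-arithmetic idea behind such editing can be sketched with toy data: estimate an attribute direction as the difference of class means in latent space, then move a code along it. Everything below (data, dimensionality, the mean-difference estimator) is made up for illustration and is not the speaker's factorisation method.

```python
import numpy as np

# Toy sketch of latent-space attribute editing. Fake latent codes:
# "smiling" codes are shifted along axis 0 by +3; all else is noise.
rng = np.random.default_rng(0)
dim = 16
z_not_smiling = rng.normal(size=(500, dim))
z_smiling_set = rng.normal(size=(500, dim)) + 3.0 * np.eye(dim)[0]

# Attribute direction = difference of class means, normalised.
direction = z_smiling_set.mean(axis=0) - z_not_smiling.mean(axis=0)
direction /= np.linalg.norm(direction)

def edit(z, alpha):
    """Add `alpha` units of the attribute, leaving other axes (mostly) alone."""
    return z + alpha * direction

z = rng.normal(size=dim)
z_edited = edit(z, 3.0)   # mostly changes the "smiling" coordinate
```

The naive-implementation problem the abstract mentions corresponds to `direction` picking up correlated attributes; a factored latent space keeps the edit confined to the intended variable.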

Speaker: Dr. Joao Cabral, TCD
Title: Acoustic analysis of the voice source and its applications to speech synthesis
Time & Venue: Stack-B Trinity College - 2:00-3:00pm, 16-Jan-18

Abstract: Accurate estimation of the voice source and vocal tract from the speech signal is a complex and difficult task. Instead, simple speech analysis methods that do not perform good source-tract separation are very popular, because they are more robust and provide good performance in several applications, including automatic speech recognition (ASR), Text-To-Speech synthesis (TTS), speaker identification, affective computing and speech coding. Although glottal source modelling has been successfully applied in these domains, its adoption in real technology systems remains limited. A better method for separating the voice source from the vocal tract is needed to open the road to future improvements.

In this talk, I will give an overview of previous work on acoustic glottal source estimation, focusing on applications to speech synthesis and voice transformation. I will also present my current research on its evaluation. The experimental outcomes are expected to give insights into how to improve voice source analysis.

Speaker: Dr. Ali Karaali, TCD
Title: Edge-Based Defocus Blur Estimation with Adaptive Scale Selection
Time & Venue: Stack-B Trinity College - 2:00-3:00pm, 22-Nov-17

Abstract: Camera systems have limited depth of field (DOF), which may cause defocus blur in regions of the captured image. In this talk, a new edge-based method for spatially varying defocus blur estimation will be presented. The proposed approach is based on ratios of locally reblurred gradient magnitudes. The core of the approach is to estimate a scale-consistent edge map along with a local scale parameter that indicates how isolated each detected edge is. The local scale is used to adaptively select a reblurring parameter accounting for noise, edge mis-localization and interfering edges. After the initial defocus blur estimation at the detected scale-consistent edge points, a novel Connected Edge Filter (CEF) is introduced to smooth the initial blur estimates based on pixel connectivity within the detected edge contours. Finally, a fast guided filter is used to propagate the sparse blur map through the whole image. Experimental results show that the proposed approach presents a very good compromise between estimation error and running time when compared to state-of-the-art methods. The proposed blur estimation method is also explored in the context of image deblurring, and it is shown that metrics typically used to evaluate blur estimation may not correlate as expected with the visual quality of the deblurred image.
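The reblurred gradient-ratio idea can be shown in one dimension: for an ideal step edge with Gaussian blur σ, reblurring with a known σ₁ changes the edge gradient by a ratio from which σ can be recovered in closed form, since the gradient at the edge centre scales as 1/σ and Gaussian blurs compose as σ' = √(σ² + σ₁²). The sketch below is a toy 1D illustration with invented values, not the 2D adaptive method of the talk.

```python
import math

# Toy 1D illustration of gradient-ratio defocus estimation.

def blurred_step(x, sigma):
    """Ideal step edge blurred by a Gaussian of scale `sigma`."""
    return 0.5 * (1.0 + math.erf(x / (sigma * math.sqrt(2.0))))

def estimate_blur(sigma_true, sigma_reblur=1.0, h=1e-3):
    # Gradient magnitude at the edge centre, before and after reblurring
    # (central finite differences stand in for an image gradient operator).
    g0 = (blurred_step(h, sigma_true) - blurred_step(-h, sigma_true)) / (2 * h)
    s1 = math.sqrt(sigma_true**2 + sigma_reblur**2)   # Gaussian blurs compose
    g1 = (blurred_step(h, s1) - blurred_step(-h, s1)) / (2 * h)
    r = g0 / g1                                        # gradient-magnitude ratio
    return sigma_reblur / math.sqrt(r * r - 1.0)       # invert the ratio for sigma

sigma_hat = estimate_blur(2.0)   # recovers the true blur scale of 2.0
```

The adaptive scale selection in the talk addresses exactly where this ideal picture breaks: noise, mis-localized edges and nearby interfering edges corrupt the ratio.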

Speaker: Dr. Ilaria Torre, TCD
Title: Trust in artificial voices: A “congruency effect” of first impressions and behavioural experience
Time & Venue: Stack-B Trinity College - 2:00-3:00pm, 8-Nov-17

Abstract: Societies rely on trustworthy communication in order to function, and the need for trust clearly extends to human-machine communication. Therefore, it is essential to design machines to elicit trust, so as to make interactions with them acceptable and successful. However, while there is a substantial literature on first impressions of trustworthiness based on various characteristics, including voice, not much is known about the trust development process. Are first impressions maintained over time? Or are they influenced by the experience of an agent’s behaviour? We addressed these questions in three experiments using the “iterated investment game”, a methodology derived from game theory that allows implicit measures of trust to be collected over time. Participants played the game with various agents having different voices: in the first experiment, participants played with a computer agent that had either a Standard Southern British English (SSBE) accent or a Liverpool accent; in the second experiment, they played with a computer agent that had either an SSBE or a Birmingham accent; in the third experiment, they played with a robot that had either a natural or a synthetic voice. All these agents behaved either trustworthily or untrustworthily. In all three experiments, participants trusted the agent with one voice more when it was trustworthy, and the agent with the other voice more when it was untrustworthy. This suggests that participants might change their trusting behaviour based on the congruency of the agent’s behaviour with the participant’s first impression. Implications for human-machine interaction design are discussed.
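For readers unfamiliar with the protocol, one round of a generic iterated investment game can be sketched as follows. The endowment and multiplier are the textbook defaults, not necessarily the values used in these experiments; the amount invested serves as the implicit measure of trust.

```python
# One round of a (generic) iterated investment game.
# Illustrative parameters: endowment 10, multiplier 3.

def play_round(invest, return_fraction, endowment=10, multiplier=3):
    """Return (participant_payoff, trustee_payoff) for one round."""
    assert 0 <= invest <= endowment
    pot = invest * multiplier              # the investment is multiplied
    returned = pot * return_fraction       # the trustee sends some back
    participant = endowment - invest + returned
    trustee = pot - returned
    return participant, trustee

# A trustworthy trustee returning half the pot rewards full investment;
# an untrustworthy one keeps everything, punishing the trusting move.
fair = play_round(invest=10, return_fraction=0.5)
unfair = play_round(invest=10, return_fraction=0.0)
```

Tracking how `invest` evolves over repeated rounds against each voice is what yields the implicit trust measure described in the abstract.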

Speaker: Dr. Andrew Hines, UCD
Title: Predicting People’s Opinions Sounds Easy
Time & Venue: Stack-B Trinity College - 2:00-3:00pm, 18-Oct-17

Abstract: Quality of Experience (QoE), the degree of delight or annoyance experienced by the user of an application or service, involves a synthesis of content, context, human and service factors. Predicting how a user perceives speech and audio quality has become more important with the transition from traditional fixed telephony to Voice over Internet Protocol (VoIP)-based systems. Similarly, media streaming is now an established method for listening to music and watching movie and TV content. Network bandwidth constraints vary across the diverse range of devices on which content is consumed (e.g., mobile, desktop, home theatre). As a result, content distributors such as YouTube or SoundCloud must support a range of bit rates and codecs, beyond the once-ubiquitous MP3, to optimise consumers’ QoE. This talk will look at models developed to predict QoE for users of systems ranging from hearing aids to VoIP, Netflix and VR 3D spatial sound.

Speaker: Dr. Sebastien Le Maguer, Multimodal Speech Processing group (MMCI), Saarland University
Title: About the future directions of MaryTTS
Time & Venue: Stack-B Trinity College - 2:00-3:00pm, 04-Oct-17

Abstract: MaryTTS is a modular Text-To-Speech (TTS) synthesis system whose development started around 2003. The system is open-source and has grown significantly thanks to contributions from the community. Lately, we have completely redesigned the core of the system in order to make it more flexible. During this talk, I will present the new architecture, the motivations behind it, and the problems we addressed in building it. Along the way, we will also cover some of the challenges currently being addressed in the TTS field, and those that remain to be addressed.

Speaker: Dr. Wissam Jassim, TCD
Title: Speech enhancement using neural responses
Time & Venue: Stack-B Trinity College - 10:30-11:30am, 27-Jan-17

Abstract: In statistical speech enhancement algorithms, the challenge is to find a non-linear estimator that uses the DFT coefficients of noisy speech to estimate the corresponding unknown clean speech coefficients. Most of these algorithms require the a priori SNR to be estimated. This talk discusses a new approach to estimating the a priori SNR using neural responses simulated by a computational model of the auditory-nerve (AN) system. The outputs of the AN model over a range of characteristic frequencies form a neurogram, a 2D time-frequency representation. Features are extracted from the resulting neurogram using the Radon transform, and Support Vector Regression is used to predict the a priori SNR of unseen frames. A range of speech quality and intelligibility measures were used for performance evaluation. The results demonstrate that estimating the a priori SNR from neural features can improve the output of a range of speech enhancement algorithms, including MMSE, Wiener filtering and perceptually-motivated approaches.
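To see why the a priori SNR matters, recall the classic per-bin Wiener suppression rule, G = ξ/(1+ξ), where ξ is the a priori SNR: the estimate of ξ directly controls how much each noisy DFT coefficient is kept or suppressed. The snippet below shows only this standard gain rule with made-up values; the talk's contribution is the neural estimation of ξ itself.

```python
# Classic per-bin Wiener gain driven by the a priori SNR (xi, linear scale).
# Values are illustrative; the talk estimates xi from simulated neural
# responses rather than assuming it is known.

def wiener_gain(xi):
    """Suppression gain for one DFT bin given its a priori SNR."""
    return xi / (1.0 + xi)

def enhance_bin(noisy_coeff, xi):
    """Scale one noisy DFT coefficient by the Wiener gain."""
    return wiener_gain(xi) * noisy_coeff

kept = enhance_bin(1.0, xi=100.0)   # high-SNR bin: almost unchanged
cut = enhance_bin(1.0, xi=0.01)     # low-SNR bin: strongly suppressed
```

An accurate ξ estimate (here, predicted by SVR from neurogram features) therefore feeds directly into Wiener, MMSE and related suppression rules.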
Page last modified on October 26, 2018