Sigmedia Talks Page

Next Scheduled Talk


Speaker Dr. Jean-Yves Guillemaut
Title Joint Multi-Layer Segmentation and Reconstruction for 3D-TV Content Production
Time & Venue Printing House Hall - 14:30 31st March 2011


Abstract Current state-of-the-art image-based scene reconstruction techniques are capable of generating high-fidelity 3D models when used under controlled capture conditions. However, they are often inadequate when used in more challenging environments such as outdoor scenes with moving cameras. Algorithms must be able to cope with relatively large calibration and segmentation errors as well as input images separated by a wide baseline and possibly captured at different resolutions.

In this talk, I will present a technique which, under these challenging conditions, is able to efficiently compute a high-quality scene representation via graph-cut optimisation of an energy function combining multiple image cues. Robustness is achieved by jointly optimising scene segmentation and multiple view reconstruction in a view-dependent manner with respect to each input camera. Joint optimisation prevents propagation of errors from segmentation to reconstruction as is often the case with sequential approaches. View-dependent processing increases tolerance to errors in through-the-lens calibration compared to global approaches.

Experimental results will be presented with a variety of challenging outdoor scenes captured with manually operated broadcast cameras as well as several indoor scenes with natural background. These datasets will be used to evaluate the accuracy of the technique for high quality segmentation and reconstruction and demonstrate its application for 3D-TV content production. Particularly, two main applications will be considered: free-viewpoint video, which gives a user the ability to freely control the viewpoint from which a video is rendered, and 3D video, which augments a conventional 2D video with depth information.

Bio Jean-Yves Guillemaut is a Research Fellow in the Centre for Vision, Speech and Signal Processing, University of Surrey, U.K. His research interests include free-viewpoint video and 3D TV, image/video-based scene reconstruction and rendering, image/video segmentation and matting, camera calibration, and active appearance models for face recognition. Currently, he is working on the i3DLive project, in collaboration with The Foundry and BBC R&D, addressing the use of multiple camera systems for stereo production in film and broadcast. Previously, he worked on the iview project developing computer vision algorithms for 3D reconstruction and free-viewpoint video rendering in sports.



Upcoming Talks, Last Year's Talks

Previous Talks


23rd March 2011

Speaker Viliam Rapcan
Title Can changes in speech predict cognitive decline?


Abstract The biggest limiting factor to independence in older people is impaired cognitive function. As the population of the world grows older, the burden on health care providers is increasing. Less expensive and less labour-intensive methods of cognitive function assessment are an active area of research. In this presentation, the use of speech as a biomarker for cognitive function will be presented together with the results of a clinical study of 189 elderly participants, and the results of a pilot study of an Interactive Voice Response (IVR) system for remote, fully automated delivery of cognitive function assessment tests.





16th March 2011

Speaker Kangyu Pan
Title CELLSNAKE: A new active contour technique for cell/fibre segmentation


Abstract Active contours are well known for object segmentation and widely adopted in various forms for biological image analysis. Most of these techniques are based on object geometry, but overlapping regions cause severe problems for contour propagation. In this paper, we propose a novel active contour technique (“cellsnake”) for solving this problem with an application to cell and fibre segmentation. Given that the transparency of overlapped objects is unavailable, we present a new set of contour forces derived from a priori knowledge of cell geometry that allows the contour to deform correctly in those regions. We have combined these terms with other existing forces and we show that cellsnake gives appropriate shape estimation of the objects, especially in the overlapped areas of the observed images.





2nd March 2011

Speaker Finian Kelly
Title Effects of Ageing on Long-Term Speaker Verification


Abstract The changes that occur in the human voice due to ageing have been well documented. The impact these changes have on speaker verification is, however, unclear. Given the increasing prevalence of biometric technology, it is important to quantify this impact. This presentation will describe a preliminary investigation into the effect of long-term vocal ageing on a speaker verification system.

On a cohort of 13 adult speakers, using a conventional verification system, longitudinal testing of each speaker is carried out across a 30-40 year range. A progressive degradation in verification score is observed as the time span between the training and test material increases. Above a time span of 5 years, this degradation exceeds the range of normal inter-session variability. The age of the speaker at the time of training is shown to influence the rate at which the verification scores degrade. Our results suggest that the verification score drop-off accelerates for speakers over the age of 60. The implications of these findings for speaker verification will be discussed along with directions of future work.



9th February 2011

Speaker Claire Masterson
Title Binaural Impulse Response Rendering for Immersive Audio


Abstract This talk will cover the main tenets of my PhD work in spatial audio reproduction. This includes a method for the factorisation of datasets of head related impulse responses (HRIRs) using a least squares approach, as well as a number of regularisation strategies that enable more psychoacoustically meaningful, initial-condition independent results to be obtained for various types of HRIR data. A technique for the spatial interpolation of room impulse responses using dynamic time warping and tail synthesis will also be covered. The incorporation of both techniques into an overall spatial audio system using the virtual loudspeaker approach will be described.
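
For readers unfamiliar with dynamic time warping, the following is a minimal sketch of the classical algorithm only, not the speaker's interpolation method. It aligns two short 1-D sequences and returns the total alignment cost and the warping path. The function name `dtw_path`, the absolute-difference cost and the toy inputs are assumptions made for illustration; tail synthesis is not covered.

```python
from typing import List, Sequence, Tuple


def dtw_path(a: Sequence[float], b: Sequence[float]) -> Tuple[float, List[Tuple[int, int]]]:
    """Classical dynamic time warping between two non-empty 1-D sequences.

    Returns the minimum total alignment cost and the warping path as a list
    of (index_in_a, index_in_b) pairs. The local cost is |a[i] - b[j]|
    (an assumption chosen for simplicity).
    """
    n, m = len(a), len(b)
    if n == 0 or m == 0:
        raise ValueError("both sequences must be non-empty")

    inf = float("inf")
    # cost[i][j] = best cost of aligning a[:i] with b[:j]
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(a[i - 1] - b[j - 1])
            cost[i][j] = d + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])

    # Backtrack from the end of both sequences to recover the warping path.
    i, j = n, m
    path = [(i - 1, j - 1)]
    while (i, j) != (1, 1):
        _, i, j = min(
            (cost[i - 1][j - 1], i - 1, j - 1),
            (cost[i - 1][j], i - 1, j),
            (cost[i][j - 1], i, j - 1),
        )
        path.append((i - 1, j - 1))
    path.reverse()
    return cost[n][m], path


# Toy example: the same pulse, delayed by one sample in the first sequence.
total, path = dtw_path([0.0, 0.0, 1.0, 0.0], [0.0, 1.0, 0.0])
print(total, path)
```

The warping path pairs each sample of one sequence with its time-aligned counterpart in the other, which is the kind of correspondence an interpolation scheme could then build on.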



2nd February 2011

Speaker Damien Kelly
Title Voxel-based Viterbi Active Speaker Tracking (V-VAST) with Best View Selection for Video Lecture Post-production


Abstract An automated system is presented for reducing a multi-view lecture recording into a single view video containing a best view summary of active speakers. The system uses skin color detection and voxel-based analysis in locating likely speaker locations. Using time-delay estimates from multiple microphones, speech activity is analyzed for each speaker position. The Viterbi algorithm is then used to estimate a track of the active speaker. This track is determined as that which maximizes the observed speech activity. This novel approach is termed Voxel-based Viterbi Active Speaker Tracking (V-VAST) and is shown to track speakers with an accuracy of 0.23m. Using this tracking information, the system is applied as a post-production step to segment the most frontal face view of active speakers from the available camera views.
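
The Viterbi step described above can be illustrated with a small sketch. This is not the V-VAST implementation: it assumes a handful of candidate speaker positions on a 1-D line, a per-frame speech-activity score for each candidate, and a hypothetical `move_penalty` that discourages jumps between distant positions. Only the idea of choosing the track that maximises accumulated speech activity comes from the abstract.

```python
from typing import List, Sequence


def viterbi_track(activity: Sequence[Sequence[float]],
                  positions: Sequence[float],
                  move_penalty: float = 1.0) -> List[int]:
    """Pick one candidate position per frame so that the track score is maximal.

    activity[t][k] is the speech-activity score of candidate k at frame t.
    Track score = sum of activity scores along the track
                  - move_penalty * total distance moved (assumed smoothness term).
    Returns the index of the chosen candidate for every frame.
    """
    n_frames = len(activity)
    n_cand = len(positions)
    if n_frames == 0:
        return []

    best = list(activity[0])      # best score of any track ending at each candidate
    back: List[List[int]] = []    # back-pointers, one list per frame after the first

    for t in range(1, n_frames):
        new_best, pointers = [], []
        for k in range(n_cand):
            # Best predecessor for candidate k, accounting for the move penalty.
            prev = max(
                range(n_cand),
                key=lambda j: best[j] - move_penalty * abs(positions[k] - positions[j]),
            )
            step = move_penalty * abs(positions[k] - positions[prev])
            new_best.append(best[prev] - step + activity[t][k])
            pointers.append(prev)
        best = new_best
        back.append(pointers)

    # Trace back from the best final candidate.
    k = max(range(n_cand), key=lambda j: best[j])
    track = [k]
    for pointers in reversed(back):
        k = pointers[k]
        track.append(k)
    track.reverse()
    return track


# Toy data: a brief spurious activity peak at the far position in frame 1.
positions = [0.0, 1.0, 2.0]
activity = [
    [0.9, 0.1, 0.0],
    [0.2, 0.1, 0.6],
    [0.8, 0.1, 0.0],
]
print(viterbi_track(activity, positions, move_penalty=1.0))
print(viterbi_track(activity, positions, move_penalty=0.0))
```

With the penalty enabled, the track stays at the first position and ignores the one-frame spike; with the penalty set to zero, the result reduces to picking the most active candidate in each frame independently.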





26th January 2011

Speaker Luca Cappelletta
Title Improved Visual Features for Audio-visual Speech Recognition


Abstract Automatic Speech Recognition (ASR) is a technology that allows a computer to identify the words that a person speaks into an input device (microphone, telephone, etc.) by analyzing the audio signal. In recent years the technology has achieved remarkable results, even if state-of-the-art ASR systems lag human speech perception by up to one order of magnitude. A major factor affecting ASR is the signal-to-noise ratio: in a noisy environment, automatic speech recognition suffers a huge loss in performance. However, it has been proved that human speech production is bimodal by nature. Moreover, hearing-impaired people use lipreading to improve their speech perception. Thus, it is possible to include visual cues in order to improve ASR. The combination of audio and visual cues forms so-called Audio-Visual Speech Recognition, or AVSR. The main topic of this research is the video branch of an AVSR system, in particular 'Region of Interest' definition and detection, visual feature extraction and, finally, visual-only ASR.





19th January 2011

Speaker Felix Raimbault
Title Stereo Video Inpainting


Abstract As the production of stereoscopic content increases, so does the need for post-production tools for that content. Video inpainting has become an important tool for rig removal but there has been little consideration of the problem in stereo. This paper presents an algorithm for stereo video inpainting that builds on existing exemplar-based video completion and also considers the issues of view consistency. Given user selected regions in the sequence which may be in the same location in several frames and in both views, the objective is to fill in this area using all the available picture information. Existing algorithms lack temporal consistency, causing flickering and other artefacts. This paper explores the use of long-term picture information across many frames in order to achieve temporal consistency at the same time as exploiting inter-view dependencies within the same framework.





14th December 2010

Speaker Andrew Hines
Title Speech Intelligibility Prediction using a Simulated Performance Intensity Function


Abstract Discharge patterns produced by fibres from normal and impaired auditory nerves in response to speech and other complex sounds can be discriminated subjectively through visual inspection. Similarly, responses from auditory nerves where speech is presented at diminishing sound levels progressively deteriorate from those at normal listening levels. The Performance Intensity Function is a standard listener test that evaluates a test subject’s phoneme discrimination performance over a range of sound intensities. A computational model of the auditory periphery was used to replace the human subject and develop a methodology that simulates a real listener test. This work represents an important step in validating the use of auditory nerve models to predict speech intelligibility.





7th December 2010

Speaker Mohamed Ahmed
Title Reflection Detection in Image Sequences


Abstract/Details

Reflections in image sequences consist of several layers superimposed over each other. This phenomenon causes many image processing techniques, e.g. motion estimation and object recognition, to fail, as they assume the presence of only one layer at each examined site. Reflections can arise by mixing any two images, and hence detecting them automatically remains a hard problem that has not been addressed before. This work presents an automated technique for detecting reflections in image sequences by analyzing motion trajectories of feature points. We generate sparse and dense detection maps, and our results show a high detection rate with rejection of pathological motion, occlusion, and motion blur.



12th October 2010

Speaker Bruno Nicoletti
Title Developing VFX for Film and Video on GPUs


Abstract/Details

In the visual effects world, London-based award-winning firm The Foundry is renowned for its software. Bruno Nicoletti, founder and CTO of The Foundry, speed-talked through a tour of the company’s tools and software, demonstrating to an audience with a healthy population of VFX artists and developers how GPUs are changing the industry in “Developing GPU-Enabled Visual Effects for Film and Video.”

Foundry technology has been used in a host of blockbusters, such as Avatar, Harry Potter, The Dark Knight and many, many others, and its Nuke compositing software has been used for everything from the fantastic (CGI castles) to the mundane (complexion correction).

As a leader in the industry, Nicoletti has an invaluable perspective on the changes that GPUs are making in VFX. GPUs are reducing rendering times and allowing VFX to be involved more pervasively in all stages of production, in effect blurring the line between post production and production.

The popularity of utilizing the power of GPUs in the visual effects (VFX) industry continues to gain momentum. Major film production studios that historically have been CPU-based for VFX are not only utilizing GPUs, they are starting to replace their CPU-based rendering systems with GPU-based ones.

This transition to GPU in VFX, however, requires some legwork, particularly when it comes to the complex image processing algorithms in VFX software. This (along with The Foundry’s solution) was the subject of the second half of Nicoletti’s talk.

With hundreds of effects and millions of lines of code in its software, The Foundry was faced with having to rewrite everything to exploit GPUs while maintaining separate algorithms for CPUs. Faced with the prospect of writing and debugging two sets of complex algorithms, The Foundry created something they’re calling Blink (although Nicoletti used its internal code name of RIP, or “Righteous Image Processing”).

Blink wraps image processing up into a high level C++ API. It lets programmers run kernels on the CPU for debugging, and then those kernels can be translated to spit out GPU CUDA. Nicoletti showed several coding examples and wrapped up by showing examples of a motion estimation function run on an Intel Xeon 5504 versus an NVIDIA Quadro 5000. The speed difference was extraordinary (from 5fps to more than 200fps), which augurs well for increased demand for VFX on GPU – and Blink.

Bio

Bruno Nicoletti has worked in visual effects since graduating with a degree in Computer Science and Mathematics from Sydney University in 1987. He has worked at production companies, creating visual effects for broadcast and film, as well as at commercial software companies, developing software to sell into visual effects companies. In his career he has developed 2D image processing software, 3D animation, rendering and modelling tools, often before any equivalent tools were commercially available. In 1996 he started The Foundry to develop visual effects plug-ins and oversaw its initial growth. The Foundry now develops and sells a range of applications and plugins for VFX which are used in many feature films and TV programmes. Now CTO, he acts as senior engineer at the company and is overseeing the effort to move The Foundry's software to a new image processing framework that can exploit CPUs and GPUs to yield dramatic speed improvements.

Upcoming Speakers


Date          Speaker(s)
7th February  Claire Masterson
Page last modified on March 28, 2011