Probabilistic time-frequency masking for convolutive speech source separation
Journal of Communication Disorders, Deaf Studies & Hearing Aids
Open Access

ISSN: 2375-4427


International Conference and Expo on Audiology and Hearing Devices

August 17-18, 2015 Birmingham, UK

Wenwu Wang

Posters-Accepted Abstracts: Commun Disord Deaf Stud Hearing Aids

Abstract:

Blind source separation (BSS) has received extensive attention over the past two decades, thanks to its wide applicability in a
number of areas such as biomedical engineering, audio signal processing, and telecommunications. The aim of BSS is to infer
the most probable estimates of the unknown sources, given the available observations (acquired by sensors or sensor arrays) and very
limited information about the characteristics of the transmission channels through which the sources propagate to the sensors. In
many practical applications, such as an acoustic room environment, the sensors (i.e., microphones) pick up not only the direct
sound from the sources, but also the multipath components due to sound reflections from the room surfaces. The microphone
signals are therefore convolutive mixtures of the original unknown sources, and estimating the sources from such mixtures leads to the
so-called convolutive BSS problem. In this talk, following a brief overview of the methods in the literature, we will focus on stereo
source separation based on probabilistic time-frequency masking. In particular, we show that the mixing vector (MV) cue used in the
statistical mixing model is complementary to the binaural cues represented by interaural level and phase differences (ILD and IPD).
The MV distributions remain quite distinct when the sources are close to each other, whereas the binaural models overlap; on the other
hand, the binaural cues are more robust to high reverberation than the MV models. To exploit this complementarity, a new robust algorithm has
been developed to model the MV and binaural cues in parallel. The contribution of each cue to the final decision is also adjusted by
weighting the log-likelihoods of the cues empirically. The model parameters are updated iteratively with an expectation maximization
algorithm, leading to probabilistic time-frequency masks, which are then used to separate the sources in the time-frequency domain.
Experiments are performed systematically on determined and underdetermined speech mixtures in five rooms with various acoustic
properties including anechoic, highly reverberant, and spatially-diffuse noise conditions. The results in terms of signal-to-distortion
ratio (SDR) confirm the benefits of integrating the MV and binaural cues, as compared with two state-of-the-art baseline algorithms.
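To illustrate the final masking step described above, the following is a minimal NumPy sketch, not the authors' implementation: it shows how per-source log-likelihoods from the binaural (ILD/IPD) and mixing-vector cues might be weighted, combined, and normalised into probabilistic time-frequency masks. The cue models themselves, the array shapes, and the weight `alpha` are hypothetical placeholders standing in for the EM-estimated models in the talk.

```python
import numpy as np

def binaural_cues(X_left, X_right, eps=1e-12):
    """Extract simple binaural cues from the stereo STFT of a mixture:
    interaural level difference (ILD, in dB) and interaural phase
    difference (IPD, in radians) at each time-frequency point."""
    ild = 20.0 * np.log10((np.abs(X_left) + eps) / (np.abs(X_right) + eps))
    ipd = np.angle(X_left * np.conj(X_right))
    return ild, ipd

def combined_soft_masks(logL_binaural, logL_mv, alpha=0.5):
    """Weight and combine per-source log-likelihoods of the binaural and
    mixing-vector (MV) cues, then normalise into probabilistic
    time-frequency masks.

    logL_binaural, logL_mv : arrays of shape (n_sources, n_freq, n_frames)
    alpha : empirical weight on the binaural cue (1 - alpha on the MV cue)

    Returns masks of the same shape that are non-negative and sum to 1
    over the source axis at every time-frequency point.
    """
    logL = alpha * logL_binaural + (1.0 - alpha) * logL_mv
    # Softmax over the source axis; subtract the max for numerical stability.
    logL = logL - logL.max(axis=0, keepdims=True)
    post = np.exp(logL)
    return post / post.sum(axis=0, keepdims=True)
```

Each source estimate is then obtained by multiplying its soft mask with the mixture STFT and inverting the transform; in the EM scheme above, the masks (posteriors) from the E-step would also drive the parameter updates of both cue models in the M-step.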