Speech separation by humans and machines

Speech heard against a background of other speech sources, such as crowd noise or even a single talker in a reverberant environment, has been recognized as the acoustic setting perhaps most detrimental to verbal communication. Auditory data collected over the last 25 years have helped define the processes a human listener needs in order to perform this difficult task. The same data have also motivated the development of models that predict and explain human performance in a "cocktail-party" setting with increasing accuracy. As the data revealed the limits of performance under these difficult listening conditions, it also became clear that significant improvement of speech understanding in speech noise will likely come only from some yet-to-be-developed device that automatically separates the speech mixture, enhances the target source, and filters out the unwanted sources. The last decade has witnessed an unprecedented rush toward the development of computational schemes aimed at achieving this goal.