Keynotes

Next Generation 3GPP Speech Coding: Enhanced Voice Services (EVS)

Tuesday, September 9, 9.15 - 10.00

Dr. Imre Varga,

Qualcomm Inc., San Diego, CA, USA

Slides now available in PDF 

The EVS (Enhanced Voice Services) project in 3GPP aims at improving user experience by enhancing the key element of telephony: speech quality, as delivered by speech coding. EVS is the next-generation speech coding standard in 3GPP, following the successful coders used in mobile telephony. The EVS coder is designed especially for packet-switched networks, with voice-over-LTE (VoLTE) as a key target application. Besides ensuring enhanced quality for VoLTE, EVS addresses all networks, including mobile VoIP with QoS, best-effort VoIP, and circuit-switched (CS) networks. EVS provides an improved user experience through super-wideband audio at low bit rates (at 32 kHz sampling), improved robustness through significantly better error resilience, improved music performance, a wide bit-rate range and all audio bandwidths (narrowband, wideband, super-wideband, fullband) for maximum flexibility, and improved capacity through the introduction of a variable-bit-rate mode. The standardization process includes a qualification phase to identify the most promising technologies, a selection phase, which includes a candidate developed jointly by all proponents, and characterization testing to obtain more detailed information about the performance of the coder. The presentation will address the goals of the project, the standardization process, the speech quality achieved, and the algorithmic elements of the coder.
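For readers who want the bandwidth terminology above in one place, here is a small illustrative sketch in Python (not an excerpt from the EVS specification or the talk) of the four audio bandwidth classes and the sampling rates conventionally associated with them in 3GPP speech coding:

# Illustrative only: bandwidth classes named in the abstract, mapped to the
# sampling rates conventionally associated with them in 3GPP speech coding.
SAMPLING_RATE_HZ = {
    "narrowband": 8000,
    "wideband": 16000,
    "super-wideband": 32000,   # the "32 kHz sampling" case highlighted above
    "fullband": 48000,
}

def bandwidth_for(sampling_rate_hz):
    """Map a sampling rate to its bandwidth class, assuming the table above."""
    for name, fs in SAMPLING_RATE_HZ.items():
        if fs == sampling_rate_hz:
            return name
    raise ValueError("no EVS bandwidth class is associated with %d Hz sampling"
                     % sampling_rate_hz)

print(bandwidth_for(32000))   # -> super-wideband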

 

Dr. Imre Varga received his M.Sc. and Ph.D. (summa cum laude) degrees and worked in academia on network and filter theory, sigma-delta modulators, and adaptive filtering with applications in telecommunications. He was responsible for the development of signal processing algorithms for professional audio at Barco-EMT in Germany. He worked for Thomson Multimedia Corporate Research as project leader and technical advisor for multimedia communication, speech coding, audio processing, and command-and-control and NLP-based dialogue systems. He joined Siemens AG in 1997 as R&D lab manager for signal processing activities in mobile telephony, later becoming head of the Multimedia R&D Department, with a focus on speech and multimedia applications. In early 2009 he joined Qualcomm Inc., where he works on speech coding and related standardization.

Dr. Varga participates in standardization groups for speech coding and multimedia applications in ITU-T SG16 and SG12 and in 3GPP. He serves as a Director on the Board of the IMTC, is the author of a number of technical papers, and is a Senior Member of the IEEE.

 

Design and Implementation of Small Microphone Arrays for Acoustic and Speech Signal Processing

Tuesday, September 9, 10.00 - 10.45

Prof. Jingdong Chen, 

Northwestern Polytechnical University, Xi’an, China

Slides now available in PDF 

Microphone array is a generic expression used to refer to a sound system that has multiple microphones. These microphones can be distributed into an arbitrary network (often called a microphone sensor network) or arranged in a particular geometry (then called an organized array). In most cases, however, when we say microphone arrays, we mean organized arrays in which the sensors' positions relative to a reference point are known to the subsequent processors. Such arrays can be used to solve many important problems, including source localization/tracking, noise reduction/speech enhancement, source separation, dereverberation, and spatial sound recording. Consequently, the design of such microphone arrays and the associated processing algorithms has attracted a significant amount of research and engineering interest over the last three decades. In this talk, we will address the problems and challenges of microphone array processing. Based on how they respond to the sound field, microphone arrays can be categorized into two classes: additive arrays and differential arrays. The former refers to arrays with large inter-element spacing (from a couple of centimeters to a couple of decimeters) and optimal beamforming in broadside directions, while the latter refers to arrays that are responsive to the spatial derivatives of the acoustic pressure field. We will present a brief overview of the basic principles underlying the two classes of arrays. We will then focus on the design of differential microphone arrays (DMAs), which have many advantages over additive arrays in practical applications, particularly with small devices. We will elaborate, using examples, on how to design DMA beamformers that are well suited to processing broadband signals such as speech. We will also discuss how to deal with white noise amplification, which long prevented DMAs from being used in practice.
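To give a concrete feel for the distinction drawn above, the following minimal Python sketch models a two-element first-order DMA: subtracting a slightly delayed rear microphone from the front one yields a cardioid pattern, and the low-frequency roll-off of the raw output is what gives rise to the white noise amplification problem. The spacing and frequencies used are illustrative assumptions, not values from the talk.

"""Minimal sketch of a two-element, first-order differential microphone array.

The rear omnidirectional microphone is delayed and subtracted from the front
one; choosing the delay equal to the acoustic travel time across the array
gives a cardioid beampattern. Spacing and frequencies are assumptions.
"""
import numpy as np

c = 343.0      # speed of sound [m/s]
d = 0.01       # inter-element spacing [m], small relative to the wavelength
tau = d / c    # delay applied to the rear microphone -> cardioid

def response(f_hz, theta_rad):
    """Far-field magnitude response of the subtractive two-element array."""
    omega = 2 * np.pi * f_hz
    return np.abs(1.0 - np.exp(-1j * omega * (tau + d * np.cos(theta_rad) / c)))

f = 1000.0
front, side, rear = (response(f, t) for t in (0.0, np.pi / 2, np.pi))
print("side vs. front: %.1f dB (about -6 dB for a cardioid)"
      % (20 * np.log10(side / front)))
print("rear response: %.2e (ideal null at 180 degrees)" % rear)

# White noise amplification: the raw output rolls off toward low frequencies
# (about 6 dB per octave), so equalizing it back to a flat response boosts
# uncorrelated sensor/electronic noise more and more as frequency decreases.
for f in (125.0, 500.0, 2000.0):
    eq_gain_db = -20 * np.log10(response(f, 0.0))
    print("equalization gain needed at %4.0f Hz: %5.1f dB" % (f, eq_gain_db))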

 

Prof. Jingdong Chen is currently a professor at the Northwestern Polytechnical University (NWPU) in Xi'an, China. Before joining NWPU in January 2011, he served as the Chief Scientist of WeVoice Inc. in New Jersey for one year. Prior to this position, he was with Bell Labs in New Jersey for nine years. Before joining Bell Labs, he held positions at Griffith University in Brisbane, Australia, and the Advanced Telecommunications Research Institute International (ATR) in Kyoto, Japan. His research interests include acoustic signal processing, adaptive signal processing, speech enhancement, adaptive noise/echo control, microphone array signal processing, signal separation, and speech communication. Dr. Chen is currently an Associate Editor of the IEEE Transactions on Audio, Speech, and Language Processing. He co-authored the books Study and Design of Differential Microphone Arrays (Springer-Verlag, 2013), Speech Enhancement in the STFT Domain (Springer-Verlag, 2011), Optimal Time-Domain Noise Reduction Filters: A Theoretical Study (Springer-Verlag, 2011), Speech Enhancement in the Karhunen-Loeve Expansion Domain (Morgan & Claypool, 2011), Noise Reduction in Speech Processing (Springer-Verlag, 2009), Microphone Array Signal Processing (Springer-Verlag, 2008), and Acoustic MIMO Signal Processing (Springer-Verlag, 2006). He is also a co-editor/co-author of the book Speech Enhancement (Springer-Verlag, 2005) and a section co-editor of the Springer Handbook of Speech Processing (Springer-Verlag, 2007). Dr. Chen received the IEEE Signal Processing Society 2008 Best Paper Award, the Best Paper Award from the IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA) in 2011, the Bell Labs Role Model Teamwork Award twice (in 2009 and 2007), the NASA Tech Brief Award twice (in 2010 and 2009), and the Young Author Best Paper Award from the 5th National Conference on Man-Machine Speech Communications in 1998.

 

 

 

Exploiting structure: new models and algorithms for source separation, optimization, and deep learning

Tuesday, September 9, 14.30 - 15.15

Dr. Steven J. Rennie

IBM T.J. Watson Research Center

Slides now available in PDF 

A cornerstone of machine learning, signal processing, and numerical optimization models and algorithms is the identification and exploitation of structure. Structural assumptions can make ill-posed problems like under-determined source separation well defined, structural realities such as complex source models and explaining-away effects can make such problems computationally intractable, and subtle hidden structure, once uncovered, can make the same problems tractable once again. In this talk, I’ll begin by describing approximate inference algorithms for source separation that exploit the inherent structure of the problem to efficiently search over trillions of states and realize super-human multi-talker speech separation and recognition. I’ll then talk about the integral role of structural regularization when defining and learning state-of-the-art neural networks for automatic speech recognition, and discuss neural network architectures that make it feasible to train networks with millions of neurons. Finally, I’ll spend some time characterizing the surprisingly broad class of functions for which Hessian-vector products can be efficiently computed, and describe how such methods have recently been used to efficiently solve high-dimensional structure-learning and compression problems.
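As a concrete illustration of the last point, the sketch below approximates a Hessian-vector product with two extra gradient evaluations (a finite-difference stand-in for Pearlmutter's trick), so the cost scales with a gradient call rather than with forming the Hessian. The tiny regularized logistic-regression loss and all constants are assumptions made for the example, not material from the talk.

"""Sketch: Hessian-vector products without forming the Hessian.

For any function whose gradient we can evaluate, H(w) v can be approximated
with two extra gradient calls, so the cost scales with a gradient evaluation
rather than with the size of the Hessian.
"""
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((200, 10))
y = np.sign(X @ rng.standard_normal(10) + 0.1 * rng.standard_normal(200))
lam = 1e-2

def grad(w):
    """Gradient of the L2-regularized logistic loss."""
    s = 1.0 / (1.0 + np.exp(y * (X @ w)))      # sigmoid(-y * Xw)
    return -X.T @ (y * s) + lam * w

def hvp(w, v, eps=1e-5):
    """Hessian-vector product via a central difference of the gradient."""
    return (grad(w + eps * v) - grad(w - eps * v)) / (2 * eps)

w = rng.standard_normal(10)
v = rng.standard_normal(10)

# Check against the explicitly assembled Hessian (feasible only in tiny problems).
s = 1.0 / (1.0 + np.exp(y * (X @ w)))
H = X.T @ (X * (s * (1 - s))[:, None]) + lam * np.eye(10)
print(np.allclose(hvp(w, v), H @ v, atol=1e-4))    # True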

 

Dr. Steven J. Rennie (sjrennie [at] us [dot] ibm [dot] com) is a research staff member at the IBM T.J. Watson Research Center in New York. He received his Ph.D. degree in computer engineering from the University of Toronto, Canada. In 2006, he interned at IBM and helped to create the first speech separation algorithm that could outperform human listeners. Since joining IBM in 2007, he has worked on developing new graphical models and inference algorithms for efficient source separation and robust speech recognition. His current research interests include machine learning using graphical models, structure learning, combinatorial optimization, and their application to problems in machine perception.

 

 

Dereverberation for Professional Audio and Consumer Electronics

Thursday, September 11, 9.00 - 9.45

Alexis Favrot

Research and development, Illusonic GmbH, Switzerland

Slides now available in PDF 

The concept of Illusonic's dereverberation algorithms involves a number of microphone elements capturing samples of the sound field at various points in the recording space. A reverberation estimate is derived and used to suppress the reverberation in frequency subbands. Professional audio requires only slight dereverberation but high signal quality. In this case, a second microphone element is added to a conventional shotgun (interference tube) microphone design, and dereverberation combined with beamforming is applied. In consumer electronics, by contrast, one, two, or three consumer-grade omnidirectional microphone elements are used to suppress reverberation and noise more aggressively. In this case, the algorithms have to cope with inter-microphone mismatch, high microphone/electronic noise levels, and limited frequency responses.
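To make the "estimate and suppress in frequency subbands" idea concrete, here is a generic STFT-domain sketch in Python. It is not Illusonic's algorithm, only a common textbook-style gain rule driven by an exponential-decay model of late reverberation; every constant (T60, hop size, gain floor) is chosen purely for illustration.

"""Generic STFT-domain suppression sketch (not Illusonic's algorithm).

An exponential-decay model of the late reverberation, driven by an assumed
T60, provides a per-bin reverberation power estimate; a spectral-subtraction
style gain then attenuates bins dominated by reverberation.
"""
import numpy as np

def suppress_reverb(stft, t60=0.6, fs=16000, hop=256, floor_db=-15.0):
    """Apply a per-bin gain to an STFT of shape (frames, bins)."""
    delta = 3.0 * np.log(10) * hop / (t60 * fs)   # per-frame decay constant
    decay = np.exp(-2.0 * delta)                  # energy decay per frame
    gain_floor = 10.0 ** (floor_db / 10.0)        # power-domain gain floor

    power = np.abs(stft) ** 2
    reverb_est = np.zeros_like(power)
    out = np.empty_like(stft)
    for n in range(stft.shape[0]):
        if n > 0:
            # Recursive late-reverb estimate: previous total power, decayed.
            reverb_est[n] = decay * (reverb_est[n - 1] + power[n - 1])
        gain = np.maximum(1.0 - reverb_est[n] / (power[n] + 1e-12), gain_floor)
        out[n] = np.sqrt(gain) * stft[n]
    return out

# Call-signature demo with a random stand-in for a real microphone STFT.
demo = np.random.randn(100, 257) + 1j * np.random.randn(100, 257)
print(suppress_reverb(demo).shape)   # (100, 257)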

 

Alexis Favrot received his M.S. (Ing.) degree in electrical engineering from Supélec, Paris, France, and EPFL, Lausanne, Switzerland in 2005. From 2005 to 2006 he was with Scopein Research, where he worked on cocktail-party processing and the design of public address system algorithms on DSPs. After working for Merging Technologies, where he developed numerous audio processing algorithms, he joined Illusonic in 2007 as an audio research engineer, working on noise suppression methods, acoustic echo suppression, spatial hearing, and spatial sound capture, processing, and reproduction.

He is also a co-author of the book chapter Acoustic Echo Control in "Academic Press Library in Signal Processing: Volume 4, Image, Video Processing and Analysis, Hardware, Audio, Acoustic and Speech Processing" (Elsevier, 2014).

 

 

Melody Extraction from Polyphonic Music Signals

Thursday, September 11, 14.30 - 15.15

Prof. Gaël Richard

Telecom ParisTech, France

Slides now available in PDF 

Melody extraction algorithms aim at extracting a sequence of frequency values corresponding to the pitch of the dominant melody from polyphonic music signals. Melody extraction is now an active research topic, with a large variety of solutions spanning a wide range of techniques, including approaches suitable for enhancing or suppressing the predominant melody. Based on a recently published paper [1], this talk will provide an overview of these techniques, the applications for which melody extraction is useful, and some of the challenges that remain. The talk will first discuss the concept of ‘melody’ from both musical and signal processing perspectives, and then provide a comprehensive comparative analysis of melody extraction algorithms. The discussion will encompass issues related to algorithm design, evaluation, and potential applications. Finally, some of the remaining challenges in melody extraction research will be outlined and discussed.

[1] J. Salamon, E. Gomez, D. Ellis, and G. Richard, "Melody Extraction from Polyphonic Music Signals: Approaches, Applications and Challenges," IEEE Signal Processing Magazine, vol. 31, no. 2, pp. 118-134, March 2014.
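For readers who want a feel for the signal-level starting point of such systems, the following deliberately naive Python sketch computes a per-frame harmonic-summation salience and picks the most salient fundamental, with a crude energy-based voicing decision. It is an illustrative toy, not one of the algorithms surveyed in [1], which add source enhancement, pitch-contour tracking, and voicing models on top of this kind of salience function; all parameter values are assumptions.

"""Toy salience-based melody tracker (illustration only, not from [1])."""
import numpy as np

def melody_track(x, fs, frame=2048, hop=256, f0_min=55.0, f0_max=1760.0,
                 n_harmonics=5, voicing_db=-40.0):
    """Return one f0 estimate per frame (0.0 where the frame is judged unvoiced)."""
    window = np.hanning(frame)
    # Candidate fundamentals on a 10-cent logarithmic grid.
    n_cand = int(np.floor(1200 * np.log2(f0_max / f0_min) / 10)) + 1
    f0_grid = f0_min * 2.0 ** (np.arange(n_cand) * 10.0 / 1200.0)

    ref = np.max(np.abs(x)) + 1e-12
    track = []
    for start in range(0, len(x) - frame + 1, hop):
        seg = x[start:start + frame] * window
        if 20 * np.log10(np.sqrt(np.mean(seg ** 2)) / ref + 1e-12) < voicing_db:
            track.append(0.0)                     # crude voicing decision
            continue
        mag = np.abs(np.fft.rfft(seg))
        # Harmonic summation: salience(f0) = weighted sum of |X| at its harmonics.
        salience = np.zeros(n_cand)
        for h in range(1, n_harmonics + 1):
            bins = np.round(f0_grid * h * frame / fs).astype(int)
            valid = bins < len(mag)
            salience[valid] += mag[bins[valid]] / h
        track.append(f0_grid[np.argmax(salience)])
    return np.array(track)

# Example: a 440 Hz tone in noise yields estimates close to 440 Hz
# (within the 10-cent grid and the FFT bin resolution).
fs = 16000
t = np.arange(fs) / fs
x = np.sin(2 * np.pi * 440.0 * t) + 0.05 * np.random.randn(fs)
print(np.median(melody_track(x, fs)))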

 

Prof. Gaël Richard received the State Engineering degree from Telecom ParisTech, France (formerly ENST) in 1990, the Ph.D. degree from LIMSI-CNRS, University of Paris-XI, in 1994 in speech synthesis, and the Habilitation à Diriger des Recherches degree from the University of Paris-XI in September 2001. After the Ph.D. degree, he spent two years at the CAIP Center, Rutgers University, Piscataway, NJ, in the Speech Processing Group of Prof. J. Flanagan, where he explored innovative approaches to speech production. From 1997 to 2001, he successively worked for Matra, Bois d’Arcy, France, and for Philips, Montrouge, France. In particular, he was the Project Manager of several large-scale European projects in the field of audio and multimodal signal processing. In September 2001, he joined the Department of Signal and Image Processing, Telecom ParisTech, where he is now a Full Professor in audio signal processing and Head of the Audio, Acoustics, and Waves research group. He is a co-author of over 150 papers and an inventor on a number of patents, and is also one of the experts of the European Commission in the field of speech and audio signal processing. He was an Associate Editor of the IEEE Transactions on Audio, Speech and Language Processing between 1997 and 2011 and one of the guest editors of the special issue on “Music Signal Processing” of the IEEE Journal of Selected Topics in Signal Processing (2011). He is currently a member of the IEEE Audio and Acoustic Signal Processing Technical Committee, a member of EURASIP and the AES, and a Senior Member of the IEEE.

 

IWAENC 2014 sponsors

We gratefully acknowledge the kind support of the following sponsors.

Gold

Silver

Bronze

Technical Co-Sponsor