Track 1 – 14:00-17:30
Duration: 3 hours, including a 30-minute coffee break
Location: Conference 6
Presenter: Li Deng 1)
1) Deep Learning Technology Center, Microsoft Research, Redmond, WA, USA
Target Audience: Specialists, researchers, practitioners, and students of human language technologies and of their underlying sciences, especially those interested in applying deep learning and general machine learning methodologies to advancing speech and language processing.
Deep learning techniques have enjoyed tremendous successes in speech and language processing in recent years, establishing new state-of-the-art performance in speech recognition, language modeling, and some natural language processing tasks. The focus of this tutorial is on deep learning approaches to problems in both speech and language/text processing, with a particular emphasis on a range of artificial-intelligence applications including speech feature extraction, speech recognition, speech translation, information retrieval, spoken and written language understanding, knowledge representation, question answering, machine translation, and semantic modeling. Another emphasis is to bridge advances in speech and cognitive science and applied linguistic theory with the technology development enabled by deep learning.
I. The General Approach
In this tutorial, the latest deep learning technology will be surveyed from both theoretical and practical perspectives most relevant to the topic. Common deep neural network (DNN) methods and more advanced recurrent, recursive, stacking, and convolutional architectures will be covered. General problems and tasks in speech and text/language processing will then be reviewed, highlighting the key properties that differentiate language processing from speech (and image) recognition. I will take an approach aimed at connecting modern machine learning, which emphasizes the generalization properties of the learner, with traditional pattern recognition and signal processing, which centers on defining precise objective functions (e.g. minimum error rate on training data) for parameter estimation as the main goal of the learner or estimator. These different emphases underlie the distinct approaches taken by two largely separate communities, a separation that persisted until deep learning made its dramatic inroads into speech and language processing beginning in 2009.
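As a minimal illustration of this objective-function view of learning (a toy sketch, not part of the tutorial materials; the model, data, and all values here are illustrative assumptions), consider estimating the parameters of a linear model by gradient descent on an explicit mean-squared-error objective:

```python
import numpy as np

# Toy example: parameter estimation by minimizing an explicit objective
# function (mean squared error) for a linear model y = w*x + b.
rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, size=100)
y = 2.0 * x + 0.5 + rng.normal(scale=0.05, size=100)  # true w=2.0, b=0.5

w, b, lr = 0.0, 0.0, 0.1
for _ in range(500):
    err = (w * x + b) - y            # residuals on the training data
    w -= lr * 2 * np.mean(err * x)   # gradient of MSE w.r.t. w
    b -= lr * 2 * np.mean(err)       # gradient of MSE w.r.t. b

print(w, b)  # estimates close to the true parameters
```

Deep learning keeps this same recipe (differentiable objective, gradient-based optimization) but stacks many nonlinear layers between input and output, which is where the generalization questions emphasized by modern machine learning come to the fore.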
II. Deep Learning for Speech Recognition
For the speech processing part of the tutorial, I will first reflect on the path to the transformative success of deep learning in speech recognition in recent years. The role of well-timed academic-industrial collaboration will be highlighted, as will the advances in big data and big compute, and the seamless integration of application-domain knowledge of speech with the general principles of deep learning. I will then give an overview of the sweeping achievements of deep learning in speech recognition since its initial success in 2010 (as well as in image recognition and computer vision since 2012, with even greater impact than in speech to date). These achievements have led to across-the-board, industry-wide deployment of deep learning in speech recognition. State-of-the-art speech recognition systems and their underlying deep learning methods will be reviewed, and a series of milestones toward the current state of the art and future directions will be analyzed, organized into the following sub-topics:
o Scaling up/out and speeding up DNN training and decoding;
o Sequence discriminative training of DNNs;
o Feature processing by deep models with solid understanding of the underlying mechanisms;
o Adaptation of DNNs and of related deep models;
o Multi-task and transfer learning by DNNs and related deep models;
o Convolutional neural networks and how to design them to best exploit domain knowledge of speech;
o Recurrent neural networks and their rich LSTM variants;
o Other types of deep models including tensor-based models and integrated deep generative/discriminative models.
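To make the basic building block behind these sub-topics concrete, the following is a minimal sketch of a feed-forward DNN of the kind used as an acoustic model: a stack of affine-plus-nonlinearity layers mapping an acoustic feature vector to a softmax distribution over classes. All layer sizes, the input dimension, and the random weights here are arbitrary toy assumptions, not the configuration of any system discussed in the tutorial.

```python
import numpy as np

rng = np.random.default_rng(1)

def relu(z):
    return np.maximum(0.0, z)

def softmax(z):
    e = np.exp(z - z.max())  # shift for numerical stability
    return e / e.sum()

# Toy sizes: 40-dim input frame, three hidden layers of 16 units,
# 8 output classes (a real acoustic model uses thousands of senones).
sizes = [40, 16, 16, 16, 8]
weights = [rng.normal(scale=0.1, size=(m, n)) for n, m in zip(sizes, sizes[1:])]
biases = [np.zeros(m) for m in sizes[1:]]

def forward(x):
    h = x
    for W, b in zip(weights[:-1], biases[:-1]):
        h = relu(W @ h + b)                       # hidden layers
    return softmax(weights[-1] @ h + biases[-1])  # output posteriors

frame = rng.normal(size=40)  # stand-in for one acoustic feature frame
posteriors = forward(frame)
print(posteriors.sum())      # a proper distribution: sums to 1
```

The sub-topics above then concern how such networks are trained at scale, adapted, made sequence-discriminative, and replaced or augmented by convolutional, recurrent, and other deep architectures.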
III. Deep Learning for Language/Text Processing
In this next part of the tutorial, I will highlight general issues in natural language processing pertaining to symbolic/localist versus distributed representations, and elaborate on how new deep learning technologies have been developed to fundamentally address them. I will attempt to bridge this significant technology development of just the past few years with long-standing cognitive and linguistic science based on connectionism. I will then place particular emphasis on a number of important applications, including: 1) search and information retrieval from text, 2) language understanding and semantic parsing, 3) question answering, 4) machine translation, and 5) multimodal processing involving language and images (i.e. automatic image captioning). For each of these application areas, I will discuss which deep learning architectures are most suitable given the nature of the task, and how learning can be performed efficiently and effectively using end-to-end optimization, the hallmark of deep learning. I will also share best practices developed within my Deep Learning Technology Center at MSR, with concrete examples drawn from our first-hand experience in major research benchmarks and some industrial-scale applications. Among the many popular deep learning techniques, I will devote the most effort to deep neural embedding methods in continuous space, which form the basis for many successful language processing applications.
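The core operation behind such continuous-space embedding methods can be sketched in a few lines: map text to a dense vector and score candidates by cosine similarity, in the spirit of the deep structured semantic models cited below. The embedding function here is a deliberately crude stand-in (a random projection of character trigrams rather than a trained network); everything in it is an illustrative assumption.

```python
import numpy as np

rng = np.random.default_rng(2)
DIM = 64
projection = {}  # trigram -> random vector, assigned on first use

def embed(text):
    # Stand-in embedder: average of random vectors for character trigrams.
    t = "#" + text.lower() + "#"
    trigrams = [t[i:i + 3] for i in range(len(t) - 2)]
    vecs = [projection.setdefault(g, rng.normal(size=DIM)) for g in trigrams]
    return np.mean(vecs, axis=0)

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Rank candidate texts against a query by similarity in the embedded space.
query = embed("deep learning")
for cand in ["deep learning", "machine translation"]:
    print(cand, cosine(query, embed(cand)))
```

In a real system the embedding function is a deep network trained end-to-end so that semantically related texts (e.g. a query and a clicked document) land close together in the vector space, which is what makes the cosine score meaningful.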
IV. Key Issues for Near-Future Development of Deep Learning in Speech/Language and Related Applications
In this final, concluding part of the tutorial, I will address a set of key issues relevant to near-future advances in the deep learning approach to speech and language processing and to related fields. Some of these issues are drawn from my research group's experience, and some from current debates within the deep learning research community and industry at large. For perceptual tasks (e.g. speech, audio, image, video, gestures), the issues include: 1) With supervised learning, which shines in almost all current successes of deep learning, what is the limit on accuracy growth with respect to increasing amounts of labeled data? 2) Beyond this limit, or when labeled data become exhausted or uneconomical to collect, will novel and effective unsupervised deep learning emerge, and how? 3) What is the best paradigm for embedding domain knowledge (e.g. deep-structured human speech perception and production, the nature of speech distortions) into deep learning models? For cognitive tasks (e.g. natural language, reasoning, knowledge), the key issues are: 1) Will supervised deep learning (e.g. for machine translation) eventually beat the current state of the art to the same extent as for speech/image recognition? 2) How can we distill and exploit “distant” supervision signals so that well-established supervised deep learning can continue to excel? 3) Will dense-vector embeddings with distributed representations (or “thought” vectors) be sufficient for language? 4) Do we need to directly encode and recover the syntactic/semantic structure of language? Finally, for big-data analytics tasks involving language and other entities, the key issues center on: 1) Is vector embedding of business activities, people, and events the right approach? 2) Should the embedding methods differ from those used for language? 3) What are the most effective “distant” supervision signals to exploit for big-data analytics? 4) How should data privacy be handled, including the need for encryption before, and decryption after, deep learning?
Reading list
Li Deng and Dong Yu, Deep Learning: Methods and Applications, NOW Publishers, June 2014.
Yoshua Bengio, “Learning deep architectures for AI,” NOW Publishers, 2009.
Geoffrey Hinton, Li Deng, Dong Yu, Abdel-rahman Mohamed, Navdeep Jaitly, Andrew Senior, Vincent Vanhoucke, Patrick Nguyen, Tara Sainath, George Dahl, and Brian Kingsbury, “Deep Neural Networks for Acoustic Modeling in Speech Recognition,” IEEE Signal Processing Magazine, vol. 29, no. 6, pp. 82-97, November 2012.
Li Deng and Xiao Li, Machine Learning Paradigms for Speech Recognition: An Overview, in IEEE Transactions on Audio, Speech, and Language Processing, vol. 21, no. 5, pp. 1060-1089, May 2013.
George Dahl, Dong Yu, Li Deng, and Alex Acero, Context-Dependent Pre-trained Deep Neural Networks for Large Vocabulary Speech Recognition, in IEEE Transactions on Audio, Speech, and Language Processing (2013 IEEE SPS Best Paper Award) , vol. 20, no. 1, pp. 30-42, January 2012.
O. Abdel-Hamid, A. Mohamed, H. Jiang, L. Deng, G. Penn, and D. Yu, “Convolutional Neural Networks for Speech Recognition,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 22, no. 10, pp. 1533-1545, October 2014.
Li Deng and Roberto Togneri, Chapter 6: Deep Dynamic Models for Learning Hidden Representations of Speech Features, pp. 153-196, Springer, December 2014.
Dong Yu and Li Deng, Automatic Speech Recognition – A Deep Learning Approach, Springer, October 2014. (325 pages).
Huang, P., He, X., Gao, J., Deng, L., Acero, A., and Heck, L. “Learning deep structured semantic models for Web search using clickthrough data,” Proc. ACM CIKM, 2013.
Shen, Y., He, X., Gao, J., Deng, L., and Mesnil, G. “A Latent Semantic Model with Convolutional-Pooling Structure for Information Retrieval,” Proc. ACM CIKM, 2014.
Sutskever, I., Vinyals, O., and Le, Q. “Sequence to sequence learning with neural networks,” Proc. NIPS, 2014.
Mesnil, G., Yann Dauphin, Kaisheng Yao, Yoshua Bengio, Li Deng, Dilek Hakkani-Tur, Xiaodong He, Larry Heck, Gokhan Tur, Yu, D. and Zweig, G. “Using Recurrent Neural Networks for Slot Filling in Spoken Language Understanding,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, March 2015.
Mikolov, T., Sutskever, I., Chen, K., Corrado, G., and Dean, J. “Distributed representations of words and phrases and their compositionality,” Proc. NIPS, 2013.
Fang, H., Gupta, S., Iandola, F., Srivastava, R., Deng, L., Dollar, P., Gao, J., He, X., Mitchell, M., Platt, J., Zitnick, C. L., and Zweig, G. “From Captions to Visual Concepts and Back,” Proc. CVPR, 2015.
Gao, J., He, X., Yih, W. and Deng, L. “Learning Continuous Phrase Representations for Translation Modeling,” Proc. ACL, 2014.
Gao, J., Patel, P., Gamon, M., He, X., Deng, L. “Modeling interestingness with deep neural networks,” Proc. EMNLP, 2014.
Relevant web links
Accuracy, Apps Advance Speech Recognition, IEEE Signal Processing Magazine, Jan 2015