Features and Model Adaptation Techniques for Robust Speech Recognition: A Review

Kapang Legoh, Utpal Bhattacharjee, T. Tuithung Published in Signal Processing

Communications on Applied Electronics
Year of Publication: 2015
© 2015 by CAE Journal
10.5120/cae-1507

Kapang Legoh, Utpal Bhattacharjee and T Tuithung. Article: Features and Model Adaptation Techniques for Robust Speech Recognition: A Review. Communications on Applied Electronics 1(2):18-31, January 2015. Published by Foundation of Computer Science, New York, USA.

BibTeX

@article{key:article,
	author = {Kapang Legoh and Utpal Bhattacharjee and T. Tuithung},
	title = {Article: Features and Model Adaptation Techniques for Robust Speech Recognition: A Review},
	journal = {Communications on Applied Electronics},
	year = {2015},
	volume = {1},
	number = {2},
	pages = {18-31},
	month = {January},
	note = {Published by Foundation of Computer Science, New York, USA}
}

Abstract

This paper reviews the major speech features used in state-of-the-art speech recognition research. It also gives a brief review of the major technological advances of the last few decades and of the trend towards robust speech recognition systems in terms of feature and model adaptation techniques. Researchers have long dreamed of a machine that recognizes speech and understands natural language as humans do, but in reality the performance of speech recognition systems degrades drastically under adverse conditions such as noise, variability in speaker, channel, and device, and mismatches between training and testing conditions. This paper may be useful as a tutorial and review of state-of-the-art techniques for feature selection, feature normalization, and model adaptation for the development of robust speech recognition systems.

Keywords

Spectral Features, Cepstral Features, Feature Enhancement, Compensation, Model Adaptation, Hidden Markov Model.