Bimodal Speech Recognition
Overview
The goal of bimodal speech recognition is to
combine audio and visual information to enhance speech recognition rate under
poor audio conditions (noise or acoustically confusing words). A lipreading
system recognizes a spoken word based on the input lip motion. To handle this
problem, we proposed a space-time delay neural network that can automatically
discover the features embedded in spatiotemporal domain in the training process
and use these features to classify different lip motions. Our experimental
results indicated that, using only lip motion video, the lipreading system can
achieve a 77.8%~90% recognition rate for Chinese digits, and 44.7%~48.9%
recognition rate for nineteen Chinese confusing words.
We also implemented an on-line bimodal speech
recognition system to test how lipreading can improve the audio speech
recognition. The recognition system consisting of three DSP processors and one
Pentium processor concurrently processes lip motion video and speech signals.
The whole recognition process, including mouth region centering, 2D-FFT, speech
feature extraction, neural network computation, HMM computation, and decision
fusion, can be executed in real time.
Publication
·
Chin-Teng Lin, Hsi-Wen Nein and Wen-Chieh Lin, “A
Space-Time Delay Neural Network for Motion Recognition and Its Application to
Lipreading, ” International Journal of Neural Systems, Vol. 9, No. 4, Aug
1999, pp. 311-334.
·
Wen-Chieh Lin, A Space-Time Delay Neural
Network for Motion Recognition and Its Application to Lipreading in Bimodal
Speech Recognition, Master thesis, National Chiao-Tung University, Taiwan,
1996.
·
Wen-Chieh Lin, Hsi-Wen Nein, and Shin-Hui Liang, A DSP-based On-line Bimodal Speech Recognition System,
First Prize of the Graduate Student Team in the Texas Instrument DSP Design
Challenge, Taiwan, 1996.