Monday, September 10, 2007

:-- Speech Recognition --:

Name:: BanglaSR

Summary::

This project aims to develop a Speech Recognizer that can understand spoken units (characters, words, and sentences) uttered in the context of the Bangla language. We use the Hidden Markov Model (HMM) technique for pattern classification and incorporate a stochastic language model into the recognizer. The Hidden Markov Model Toolkit (HTK) is used to develop and implement the Speech Recognizer.

Details::

Speech recognition is the process of converting an acoustic signal, captured by a microphone or a telephone, into a set of words. The recognized words can be the final result, as in applications such as command & control, data entry, and document preparation. Research in this area has attracted a great deal of attention over the past five decades; several technologies have been applied, and efforts have been made to raise performance to a marketplace standard so that users can benefit in a variety of ways. Over this long research period, the combination of the hidden Markov Model (HMM) and a stochastic language model has produced the best performance. To date, most research on recognizing Bangla speech has used ANN-based classifiers. No work has yet been reported that uses the DTW technique or an HMM-based classifier, and no existing work includes a language model.

The area of Automatic Speech Recognition (ASR) is divided into Isolated speech recognition (ISR) and Continuous speech recognition (CSR). An ISR system requires that the speaker pause briefly between words, whereas a CSR system does not. For isolated words, the assumption is that the speech to be recognized comprises a single word or phrase and is to be recognized as a complete entity, with no explicit knowledge of or regard for the phonetic content of the word or phrase. The notion of ISR can be extended to connected speech recognition if we consider a small vocabulary and solve the co-articulation problem that arises between words. In continuous speech recognition, continuously uttered sentences are recognized, and it is very important to use sophisticated linguistic knowledge. The most appropriate units for recognition depend on the type of recognition and on the size of the vocabulary. Various units of reference templates/models, from phonemes to words, have been studied. When words are used as units, word recognition can be expected to be highly accurate; however, it requires more memory and computation. Using phonemes as units does not greatly increase memory or computation requirements. In our research project we use the word as the unit for ISR and the phoneme as the unit for CSR.

The goal of this Speech Recognition system for the Bangla language is to enhance the interaction between users and computers. It will help to overcome the literacy barrier, and hence encourage users to adopt the technology through interactive voice response. The outcome of this project will be usable for command & control and various data entry applications.

Team::

  • Md. Abul Hasnat [email] [website]
  • Jabir Mowla (Research Intern for Summer 2007) [email]
  • Mumit Khan

Status::

· An implemented version of the Isolated Speech Recognizer is ready for release.

· A prototype version of the Continuous Speech Recognizer has been implemented; experiments on the training procedure and language models are continuing.

Research Scope::

· Experiment with training issues for Continuous Speech Recognition.

· Experiment with language models for Continuous Speech Recognition.

· Experiment with other available techniques and tools.

· Move towards audio-visual speech recognition.

Development Scope::

· Port the existing implementations to different programming languages.

· Implement speech recognizers for specific domain applications.

Timeline:: Not Defined.

What is Speech Recognition?



Speech recognition is the process of converting an acoustic signal, captured by a microphone or a telephone, into a set of words. The recognized words can be the final result, as in applications such as command & control, data entry, and document preparation. Research in this area has attracted a great deal of attention over the past five decades; several technologies have been applied, and efforts have been made to raise performance to a marketplace standard so that users can benefit in a variety of ways. Over this long research period, the combination of the hidden Markov Model (HMM) and a stochastic language model has produced the best performance.
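The HMM side of that combination can be sketched with a minimal Viterbi decoder over a toy two-state model. The states, transition/emission probabilities, and discrete observations below are invented for illustration; they are not BanglaSR's actual models, which operate on continuous spectral features.

```python
# Minimal Viterbi decoding over a toy discrete HMM (illustrative values only).
def viterbi(obs, states, start_p, trans_p, emit_p):
    """Return (probability, path) of the most likely state sequence."""
    # Initialize with the first observation.
    V = [{s: (start_p[s] * emit_p[s][obs[0]], [s]) for s in states}]
    for o in obs[1:]:
        V.append({})
        for s in states:
            # Extend the best-scoring predecessor path into state s.
            prob, path = max(
                (V[-2][prev][0] * trans_p[prev][s] * emit_p[s][o],
                 V[-2][prev][1] + [s])
                for prev in states)
            V[-1][s] = (prob, path)
    return max(V[-1].values())

states = ("sil", "speech")
start_p = {"sil": 0.8, "speech": 0.2}
trans_p = {"sil": {"sil": 0.7, "speech": 0.3},
           "speech": {"sil": 0.2, "speech": 0.8}}
emit_p = {"sil": {"low": 0.9, "high": 0.1},
          "speech": {"low": 0.3, "high": 0.7}}

prob, path = viterbi(("low", "high", "high"), states, start_p, trans_p, emit_p)
print(path)  # ['sil', 'speech', 'speech']
```

In a real recognizer the same dynamic-programming idea runs over HMM states representing phonemes or words, with the language model supplying the transition probabilities between words.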

To convert speech to on-screen text or a computer command, a computer must go through several complex steps. When we speak, we create vibrations in the air. An analog-to-digital converter (ADC) translates this analog wave into digital data that the computer can understand. To do this, it samples, or digitizes, the sound by taking precise measurements of the wave at frequent intervals. The system filters the digitized sound to remove unwanted noise, and sometimes separates it into different frequency bands (frequency is the rate at which the sound wave oscillates, heard by humans as differences in pitch). It also normalizes the sound, adjusting it to a constant volume level, and may temporally align it. In addition to these tasks, speech endpoint detection is necessary in order to extract valid speech data from the spoken signal. Together these tasks are called preprocessing of the speech signal. The next tasks are Feature Extraction and Recognition. A significant amount of research has already been done, and continues, in these areas using a variety of approaches.
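Two of the preprocessing steps above, normalization and endpoint detection, can be sketched in a few lines. The frame length and energy threshold here are illustrative choices, not BanglaSR's actual parameters, and real systems typically add smoothing and zero-crossing checks on top of the raw energy test.

```python
# Sketch of peak normalization and a short-time-energy endpoint detector.

def normalize(samples):
    """Scale samples to a constant peak level of 1.0."""
    peak = max(abs(s) for s in samples) or 1.0
    return [s / peak for s in samples]

def find_endpoints(samples, frame_len=80, threshold=0.01):
    """Return (start, end) sample indices of the region whose per-frame
    average energy exceeds a fixed threshold, or None if all silence."""
    frames = [samples[i:i + frame_len]
              for i in range(0, len(samples), frame_len)]
    energies = [sum(s * s for s in f) / len(f) for f in frames]
    voiced = [i for i, e in enumerate(energies) if e > threshold]
    if not voiced:
        return None
    return voiced[0] * frame_len, (voiced[-1] + 1) * frame_len

# Toy signal: silence, a burst of "speech", then silence again.
signal = [0.0] * 160 + [0.5, -0.5] * 80 + [0.0] * 160
start, end = find_endpoints(normalize(signal))
print(start, end)  # 160 320 — the bounds of the high-energy burst
```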

The area of Automatic Speech Recognition (ASR) is divided into Isolated speech recognition (ISR) and Continuous speech recognition (CSR). An isolated-word speech recognition system requires that the speaker pause briefly between words, whereas a continuous speech recognition system does not. For isolated words, the assumption is that the speech to be recognized comprises a single word or phrase and is to be recognized as a complete entity, with no explicit knowledge of or regard for the phonetic content of the word or phrase. Hence, for a vocabulary of V words (or phrases), the recognition algorithm consists of matching the measured sequence of spectral vectors of the unknown spoken input against each of the stored spectral patterns for the V words and selecting the pattern whose accumulated time-aligned spectral distance is smallest as the recognized word. The notion of isolated speech recognition can be extended to connected speech recognition if we consider a small vocabulary and solve the co-articulation problem that arises between words. In continuous speech recognition, continuously uttered sentences are recognized. The standard approach to continuous speech recognition is to assume a simple probabilistic model of speech production whereby a specified word sequence, W, produces an acoustic observation sequence, so that the decoded string is the one with the maximum a posteriori probability. In continuous speech recognition it is very important to use sophisticated linguistic knowledge. The most appropriate units for recognition depend on the type of recognition and on the size of the vocabulary. Various units of reference templates/models, from phonemes to words, have been studied. When words are used as units, word recognition can be expected to be highly accurate; however, it requires more memory and computation. Using phonemes as units does not greatly increase memory or computation requirements.
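The "smallest accumulated time-aligned distance" idea for isolated words is exactly what dynamic time warping (DTW) computes. Here is a minimal sketch: each stored template is compared against the unknown utterance and the lowest-cost template wins. The features are plain floats for brevity, and the template values are invented; a real system compares sequences of spectral vectors (e.g. MFCCs) with a vector distance.

```python
# Minimal DTW distance for isolated-word template matching (illustrative).

def dtw(a, b):
    """Accumulated time-aligned distance between sequences a and b."""
    INF = float("inf")
    cost = [[INF] * (len(b) + 1) for _ in range(len(a) + 1)]
    cost[0][0] = 0.0
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            d = abs(a[i - 1] - b[j - 1])          # local frame distance
            cost[i][j] = d + min(cost[i - 1][j],      # stretch a
                                 cost[i][j - 1],      # stretch b
                                 cost[i - 1][j - 1])  # step both
    return cost[len(a)][len(b)]

# Hypothetical one-dimensional "feature tracks" for two words.
templates = {"ek": [1.0, 3.0, 3.0, 1.0], "dui": [2.0, 2.0, 5.0, 5.0]}
unknown = [1.0, 3.0, 1.0]  # a time-warped rendition of the "ek" track

best = min(templates, key=lambda w: dtw(unknown, templates[w]))
print(best)  # 'ek'
```

Because DTW allows frames to be stretched or compressed, the shortened utterance still aligns with its template at zero cost here, which is the property that makes template matching tolerant of speaking-rate variation.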

Some speech recognition systems require speaker enrollment; a user must provide samples of his or her speech before using the system, whereas other systems are said to be speaker-independent, in that no enrollment is necessary. Other parameters depend on the specific task. Recognition is generally more difficult when vocabularies are large or contain many similar-sounding words. When speech is produced as a sequence of words, language models or artificial grammars are used to restrict the possible combinations of words.
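The restriction a language model imposes can be illustrated with a bigram model estimated by simple counting. The tiny corpus below is invented for the example; BanglaSR's actual stochastic language model is trained on real Bangla text.

```python
# Toy bigram language model: estimate P(next | prev) from counts, so that
# likely word continuations outrank unlikely ones during decoding.
from collections import Counter, defaultdict

corpus = [["open", "file"], ["open", "file"],
          ["close", "file"], ["open", "window"]]

bigram = defaultdict(Counter)
for sent in corpus:
    for prev, nxt in zip(sent, sent[1:]):
        bigram[prev][nxt] += 1

def prob(prev, nxt):
    """Maximum-likelihood bigram probability P(nxt | prev)."""
    total = sum(bigram[prev].values())
    return bigram[prev][nxt] / total if total else 0.0

print(prob("open", "file"))    # P(file | open)   = 2/3
print(prob("open", "window"))  # P(window | open) = 1/3
```

A recognizer multiplies these probabilities into its acoustic scores, so an acoustically ambiguous word is resolved in favor of the continuation the model has seen more often; unsmoothed zero counts rule a sequence out entirely, which is why practical models add smoothing.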

Research state of Speech Recognition at CRBLP



The Center for Research on Bangla Language Processing (CRBLP) has reached a significant position in its research on and development of Automatic Speech Recognition. We are now ready to release the first version of our Automatic Speech Recognizer, named BanglaSR. BanglaSR is a speech recognizer that can recognize isolated Bangla words. The words to be recognized must be trained by the user; the training procedure is very simple. BanglaSR gives the user the opportunity to interact with the computer through voice.

Speech Recognition research started at CRBLP in February 2006 with A K M Mahmudul Hoque, as part of his undergraduate thesis work. He successfully completed his research on recognizing isolated words using the HTK toolkit; however, he did not implement his work. Soon after, in May 2006, we participated in the PAN Localization Summer School of Asian Language Processing in Pakistan, where we learned about Continuous Speech Recognition (CSR) as part of the Speech Processing course. The instructor for the Speech Recognition part was Dr. Chai Wutiwiwatchai of the National Electronics and Computer Technology Center (NECTEC), Thailand. Although the course lasted only four days, Dr. Chai taught us the complete methodology for creating a very small speech recognizer that recognizes Bangla digits as part of our lab tasks. This training was very effective in teaching us the basics of CSR. After returning from the summer school, I implemented a prototype version of the CSR following the methodology we learned from Dr. Chai. Work on BanglaSR then paused for a period. The research resumed when Iftheker Mohammad (a student of CSE, BRAC University) chose Bangla Speech Recognition as his NLP course project and submitted a report as part of the project output. During the Summer '06 semester, Jabir Mowla (a student of ECE, BRAC University) joined CRBLP as a summer intern to work on speech recognition. He experimented with preprocessing of the speech signal and implemented several algorithms. The successful outcome of Jabir's work encouraged us to implement the Isolated Speech Recognizer. I have now finished that implementation, and we are ready to release the first version of BanglaSR. Along with its flexible features, BanglaSR also has some limitations. However, we consider this an encouraging start for research on Speech Recognition for the Bangla language at the Center for Research on Bangla Language Processing (CRBLP).