Lucent Automatic Speech Recognition engine (LASR)
Lucent Speech Solutions
Technical White Paper
Introduction
This white paper gives a high-level description of the features of the Lucent Automatic Speech Recognition (LASR) engine, which executes on Lucent’s Compact PCI (cPCI) Speech Processing Board. The cPCI Speech Processing Board is part of the Lucent Speech Server (LSS) product, which is described in a separate white paper.
The LASR engine, featuring Bell Laboratories’ state-of-the-art speech technology, is specifically designed to deliver unsurpassed ASR performance for telecommunication applications. It is optimized to achieve high accuracy at a level of computational complexity that allows for a very high channel density implementation on the cPCI Speech Processing Board. Given the vast computational power of the Speech Processing Board (described in Appendix A), the LASR engine delivers industry-leading accuracy at a fraction of the per-channel cost of other ASR engines on the market.
The LASR engine supports both speaker-independent and speaker-trained ASR. It has been designed and tuned specifically to achieve optimal performance in telecom settings, including both landline and wireless environments. In the following sections, we highlight various LASR engine features that underscore the telecom focus of the engine.
Flexibility
The LASR engine executing on the cPCI Speech Processing Board gives application designers flexibility in developing a wide range of speech-based applications. The engine is capable of performing the full range of speech recognition tasks, from small-vocabulary and connected-digit recognition to very large vocabulary and natural language ASR. The engine allows for a seamless tradeoff between board channel density and vocabulary size. Figure 1 shows a plot of the cPCI Speech Processing Board channel density versus vocabulary size. If an existing application must be updated to use a more complex ASR task (e.g., a larger vocabulary), the same cPCI Speech Processing Boards can still support the updated task without any hardware modification. Since the LASR engine is “pumped” onto the cPCI board at boot time, any updates to the engine are performed by repumping the board.
High Performance Recognition Engine
The LASR engine is a subword (phoneme) based speech recognition engine. The acoustic models employed by the LASR engine consist of phoneme-unit Hidden Markov Models (HMMs). These phoneme models cover all the sounds in a given language and are used to form models for the words in the recognizer vocabulary set. Since all words in a given language consist of a string of phonemes, word models are formed by concatenating the appropriate phoneme models in the recognizer. In this fashion, vocabulary sets can be updated and changed very quickly by specifying new phoneme strings. The words in the vocabulary set can be connected through a grammar (more about grammars later) to form a set of phrases and sentences that LASR can recognize.
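As a minimal illustration of this lexicon-driven design (a hypothetical sketch: the phoneme symbols, words, and helper below are illustrative, not LASR’s actual inventory or API), adding a word to the vocabulary amounts to supplying a new phoneme string:

```python
# Hypothetical sketch: a vocabulary as a mapping from words to phoneme strings.
# Each phoneme symbol names a pre-trained phoneme HMM; a word model is simply
# the concatenation of those HMMs, so no acoustic retraining is needed.

lexicon = {
    "call":   ["k", "ao", "l"],
    "cancel": ["k", "ae", "n", "s", "ah", "l"],
}

# Updating the vocabulary is just a matter of supplying a new phoneme string.
lexicon["redial"] = ["r", "iy", "d", "ay", "ah", "l"]

def word_model(word):
    """Return the sequence of phoneme HMM identifiers forming the word model."""
    return [f"HMM[{p}]" for p in lexicon[word]]

print(word_model("redial"))  # ['HMM[r]', 'HMM[iy]', 'HMM[d]', ...]
```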
The recognition process in LASR consists of two stages. The first stage is a classification stage where LASR determines the most likely candidate, given the input speech. The second stage is the utterance verification stage where the most likely candidate determined in the first stage is verified and either accepted as the recognition answer or rejected as out of vocabulary (or out of grammar).
The HMMs employed in the classification stage are discriminatively trained using the Generalized Probabilistic Descent (GPD) training framework that was invented and patented by Bell Labs. HMM training is a procedure performed offline, using a large speech database to estimate the values of the HMM parameters. Compared to the traditional HMM training method, Maximum Likelihood (ML) training, GPD discriminative training is designed to minimize the recognition error rate by emphasizing the features that differentiate competing candidates (i.e., HMMs). In contrast, ML training seeks to maximize the likelihood for each candidate without explicitly trying to minimize the recognition error rate. Since the recognition error rate is the most important measure of speech recognition performance, discriminative training offers higher ASR performance with no increase in run-time recognition computational complexity. Run-time computational complexity depends to a large extent on the size of the HMMs, and GPD training does not increase that size; it is simply a training procedure that estimates the values of the HMM parameters in such a way that the recognition error rate is minimized. It is true that the GPD training procedure is computationally more expensive than ML training, but this does not affect run-time complexity, since training is performed offline.
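For readers who want the flavor of the mathematics, the following is a sketch of the standard GPD/minimum-classification-error formulation from the published literature; the exact loss and update used in LASR are not specified in this paper. For an utterance x with correct candidate c and discriminant functions g_j(x; Λ) (e.g., HMM log-likelihoods), a misclassification measure compares the correct candidate against its competitors, a smooth loss approximates the error count, and the HMM parameters Λ are updated by gradient descent:

```latex
% Sketch of GPD/MCE training (standard published formulation; illustrative,
% not necessarily the exact variant used in LASR).
d(x;\Lambda) = -g_c(x;\Lambda)
  + \frac{1}{\eta}\log\Big[\frac{1}{N-1}\sum_{j \neq c} e^{\eta\, g_j(x;\Lambda)}\Big],
\qquad
\ell(d) = \frac{1}{1 + e^{-\gamma d}},
\qquad
\Lambda_{t+1} = \Lambda_t - \epsilon_t\, \nabla_{\Lambda}\, \ell\big(d(x_t;\Lambda_t)\big)
```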
The LASR acoustic models (HMMs) are trained to give very robust, high ASR performance across accents and calling environments. For example, the acoustic models for North American English have been trained with an extensive database covering all dialect regions in North America, as well as various calling environments in telecom networks, including wireless and hands free. The application does not need to tell the engine whether the incoming call is from a wireless phone; the acoustic models are trained to give high ASR performance regardless of the type of call being handled. Table 1 shows the ASR performance on telephone numbers spoken by users in various wireless and landline environments.
Standard    Microphone               Environment                Accuracy
LandLine    Handset                  Various                    95%
AMPS        Handset                  Moving car local/hwy       90%
TDMA        Handset                  Various                    95%
CDMA        Handset                  Moving car at 55 mph hwy   96%
CDMA        Full Duplex Hands Free   Moving car at 55 mph hwy   84%
Table 1. Continuous digit ASR performance in various telecom environments.
It should be noted that the databases corresponding to the rows in Table 1 were collected during various trials and data collection efforts. The table indicates that the LASR engine achieves very high accuracy regardless of whether the call is over a wireless or landline network, and that it is robust across the various wireless standards. Considering the extremely noisy conditions of a hands free call from a car moving at 55 mph on a highway, the engine’s accuracy of 84% is outstanding.
The second stage in the LASR engine, utterance verification (UV), employs a discriminative utterance verification method that was also invented and patented by Bell Labs. This method uses a second set of HMMs that are trained using Minimum Verification Error (MVE) training, a discriminative training method whose goal is to minimize the utterance verification error rate. Typically, utterance verification performance is measured along two dimensions: a) the false alarm rate (i.e., the acceptance of out-of-grammar utterances), and b) the false rejection rate of in-grammar utterances. There is always a tradeoff between false alarms and false rejections. Since the goal of MVE is to minimize the UV error rate, it has been shown that the MVE-trained method employed in LASR yields a significantly lower false alarm rate at any given false rejection rate when compared to the UV methods used by other engines.
Lucent adds a third dimension in measuring LASR’s UV performance: the post-rejection substitution error rate, which is the in-vocabulary substitution error rate on the speech that was not rejected by the utterance verifier. Although, ideally, we would like 100% in-vocabulary recognition accuracy, substitution errors do happen in practice. In these cases, it is far more attractive for a telecom application if the engine can detect and reject in-vocabulary substitution errors rather than pass such errors to the application, causing confusion to the caller. LASR’s UV method is trained not only to minimize both false alarms and false rejections, but also to detect and convert substitution errors into rejections.
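The application-level effect of utterance verification can be pictured as a thresholded confidence decision. The sketch below is hypothetical (the function, threshold value, and return convention are illustrative, not the LASR API), but it shows how the false alarm / false rejection tradeoff surfaces to the developer:

```python
# Hypothetical sketch of the accept/reject decision driven by utterance
# verification. Raising the threshold lowers false alarms (fewer out-of-grammar
# utterances accepted) at the cost of more false rejections, and vice versa.

REJECTION_THRESHOLD = 0.5  # illustrative value; applications tune this

def handle_result(candidate, confidence):
    if confidence >= REJECTION_THRESHOLD:
        return ("ACCEPT", candidate)  # treated as the recognition answer
    # Out-of-grammar input, or a likely substitution error converted
    # into a rejection rather than passed on to confuse the caller.
    return ("REJECT", None)

print(handle_result("five five five one two one two", 0.82))
print(handle_result("five five five one two one two", 0.31))
```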
Engine Development and Testing
Lucent uses extensive speech databases from various trials and data collection efforts in its continuing development and enhancement of the LASR engine. These databases are divided into two sets: a training set and a testing set. The training set is used to train the ASR acoustic and utterance verification models. The testing set is used to independently test ASR performance. ASR performance is tested along three dimensions:
1. Accuracy (both in-vocabulary accuracy and the out-of-vocabulary rejection rate)
2. Per channel computational load that the ASR engine requires when executing on the cPCI Speech Processing Board
3. Per channel memory that the ASR engine requires when executing on the cPCI Speech Processing Board
Our continuing goal is to jointly optimize with respect to all three dimensions. The LASR engine is now industry leading in both channel density and accuracy; the high channel density of the Lucent Speech Server is a manifestation of this. In testing the LASR engine, we use various tasks in a number of different environments in order to get a complete picture of its performance across applications, measured along the three dimensions above.
Barge-in
Barge-in is a feature that allows the caller to speak his/her request while the announcement prompt is still playing. This feature is similar to the cut-through feature in a touch-tone service. To support barge-in, echo cancellation is required in order to cancel any prompt echo reaching the recognizer while it is listening for caller input. The LASR engine supports the barge-in feature. The cPCI Speech Processing Board includes, as part of its hardware architecture (see Appendix A below), a VLSI chip capable of performing echo cancellation on 64 channels with up to 64 msec of echo delay per channel. This VLSI echo cancellation chip is designed and manufactured by Lucent and is also used for echo cancellation in the AT&T long distance network.
The Lucent ASR engine supports barge-in in three different modes:
1) Energy-based barge-in. This type of barge-in relies on the detection of energy above a certain level to declare a barge-in. Although energy-based barge-in has been employed by most speech engine vendors, it is susceptible to extraneous and non-speech noises since it relies purely on the detection of energy.
2) End-of-recognition barge-in. Here, barge-in is reported once a valid word/phrase/sentence is recognized by the recognizer. This type of barge-in is useful if the recognition task consists of short utterances. Since barge-in is reported at the end of a valid utterance, this type of barge-in is not suitable for tasks where the speaker is expected to say long phrases and sentences; the speaker is likely to get confused if he/she is trying to say a long phrase while the prompt is still playing.
3) Recognition-based barge-in. Compared to conventional, energy-based barge-in, recognition-based barge-in is much less susceptible to noise and extraneous sounds. If the recognition task consists of recognizing long sentences and/or digit strings (e.g., telephone numbers), recognition-based barge-in is very effective since it does not require that the utterance be completely recognized before generating a barge-in signal. Rather, this method detects the start of a valid (i.e., in-grammar) sentence or digit string and generates a barge-in signal within a very short time after the speaker starts speaking, typically significantly before the end of a valid utterance. The barge-in algorithm monitors the evolution of the recognition-decoding network and continuously tests whether there is a high likelihood that the speaker has started to speak a valid utterance, as sketched below. If there are extraneous noises, or if the speaker coughs or produces other spurious sounds, the algorithm does not typically generate a barge-in signal; rather, the prompt continues to play until the speaker starts speaking a valid utterance. Employing such recognition-based barge-in is critical in developing applications that result in high user satisfaction.
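In outline, the recognition-based mode can be pictured as a per-frame check on the decoder’s partial hypotheses. This is a schematic sketch under assumed names (the decoder object and its methods are hypothetical), not the LASR algorithm itself:

```python
# Schematic sketch of recognition-based barge-in (hypothetical decoder API).
# Instead of firing on raw energy, barge-in fires only when the evolving
# decoding network makes it likely that a valid, in-grammar utterance has begun.

BARGE_IN_LIKELIHOOD = 0.9  # illustrative confidence level

def monitor_barge_in(decoder, audio_frames, stop_prompt):
    for frame in audio_frames:
        decoder.advance(frame)  # extend the partial hypotheses by one frame
        # Coughs and extraneous noises score poorly against the grammar, so
        # they do not trigger barge-in and the prompt keeps playing.
        if decoder.in_grammar_start_likelihood() > BARGE_IN_LIKELIHOOD:
            stop_prompt()  # cut the announcement well before the utterance ends
            return True
    return False
```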
Early Decision
Early decision is a feature that enables the engine to report the recognition result very quickly after the speaker has finished speaking. Most ASR engine vendors employ an energy-based endpoint detector to detect the endpoint of input speech. Typically, in order to ensure that the speaker has finished speaking, energy-based endpoint detection requires some period of silence to elapse at the end of the utterance before declaring an endpoint. This adds considerable delay to the interaction that the caller experiences. In our early decision approach, the engine monitors the recognition process on a frame-by-frame basis. If a grammar terminal node is reached, and all the paths leading to the other nodes have been pruned away (i.e., the recognition result has been narrowed to only one answer), the speaker must have finished speaking. When this happens, the engine reports its recognition result immediately, regardless of endpoint detection. In this way, the engine can report the recognition answer as quickly as possible, adding to the efficiency and user friendliness of the system.
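The early-decision condition itself is simple to state; the following is a schematic sketch with a hypothetical decoder API, not the engine’s internal code:

```python
# Schematic sketch of the early-decision test (hypothetical decoder API).
# If exactly one hypothesis survives pruning and it sits on a grammar terminal
# node, the answer can no longer change, so it is reported immediately instead
# of waiting for an energy-based endpoint (i.e., waiting out trailing silence).

def early_decision(decoder):
    active = decoder.active_paths()
    if len(active) == 1 and active[0].at_terminal_node():
        return active[0].hypothesis()  # report the answer right away
    return None  # more than one path still alive: keep decoding
```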
Natural Language
The LASR engine supports large vocabulary natural language ASR. Both finite state grammars and statistical language models (N-grams) are supported. For finite state grammars, the application developer can easily construct a grammar using the Java Speech Grammar Format (JSGF). Semantic tags are supported to make application development easier. Dynamic (or “drop-in”) grammars are also supported.
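For illustration, a small JSGF grammar with semantic tags (in curly braces) might look like the following. The grammar text follows the JSGF specification, but the grammar content itself and the commented load call are hypothetical examples, not part of the LASR API:

```python
# A small JSGF grammar with semantic tags in curly braces. The grammar text is
# standard JSGF; the rule names, tags, and load call below are hypothetical.
BANKING_GRAMMAR = """
#JSGF V1.0;
grammar banking;
public <request> = [please] (<balance> | <transfer>);
<balance>  = check my balance {action=balance};
<transfer> = transfer funds to (checking {dest=chk} | savings {dest=sav})
             {action=transfer};
"""

# engine.load_grammar("banking", BANKING_GRAMMAR)  # hypothetical load call
```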
While finite state grammars require the application developer to predict all possible responses that a speaker may say, statistical language models relax this constraint. Through proper post processing, N-grams allow the engine and application to correctly recognize a wide range of responses, even those that were not predicted at application development time. For example, the caller can be asked a general question like “Welcome to XYZ Corporation service center, how may I help you?” The application using the LASR engine with N-grams is capable of correctly handling a wide range of responses to such an open question.
Speaker Trained ASR
The LASR engine relies entirely on speaker-independent, subword (phoneme) based speech recognition models for both speaker-dependent and speaker-independent ASR. In the training mode of a speaker-dependent ASR application (e.g., name dialing), the LASR engine activates a subword recognition task with a free grammar in which any phoneme can follow any other phoneme. The subscriber is then instructed to say the name that he/she wants to train. The engine determines the most likely phoneme string matching what the subscriber spoke, and this string becomes the voiceprint for that name for that particular subscriber. The voiceprint captures the speaker-dependent information, since it describes how the subscriber pronounced the particular name. The names in each subscriber’s name list represent a grammar that the engine can interpret and use to perform speech recognition; these speaker-dependent grammars are dynamically loaded (or “dropped in”) to the grammar structure defined in the call flow.

Since the voiceprint for a given name is basically a text string describing the subscriber’s pronunciation of the name, the storage requirement for the subscriber name list, including the voiceprints, is very small. For example, for a speaker-dependent list of 100 names the storage requirement is about 16 KBytes. This figure assumes that the subscriber is asked to train each name twice and that two voiceprints are stored per name. In the case of speaker-independent names, only one transcription is needed, so the storage requirement for 100 speaker-independent names is about 8 KBytes. In the training mode, the engine is also capable of detecting collisions between the new name being trained and names already on the subscriber’s name list. This feature allows the application to tell the subscriber that the chosen name sounds too close to a name already on his/her list and to suggest ways to remedy the situation.
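The storage figures quoted above imply roughly 80 bytes per voiceprint. The following back-of-the-envelope check is ours, derived only from the numbers in this paper; the actual on-board encoding is not specified here:

```python
# Back-of-the-envelope check of the storage figures quoted above.
names = 100
voiceprints_per_name = 2               # each name is trained twice
total_bytes = 16 * 1024                # ~16 KBytes for the speaker-trained list

per_voiceprint = total_bytes / (names * voiceprints_per_name)
print(round(per_voiceprint))           # ~82 bytes: a short phoneme text string

# Speaker-independent names keep only one transcription per name:
print(names * per_voiceprint / 1024)   # 8.0 -> ~8 KBytes, as stated above
```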
Extensive Engine Control
The application developer has a high level of control over the engine through the API. For example, rejection (confidence) thresholds can be specified through the API. In addition, there is a set of seven timers that can be set, allowing for fine tuning of the interface presented to the caller. The engine returns the following to the application:
1. The text strings of the top N recognized candidates (N can be specified through the API)
2. The confidence score for each of the top N candidates
3. The corresponding semantic tags (if specified) for each of the top N candidates
The confidence scores reported are the scores computed by the utterance verification process. They give a very reliable indication of whether a given candidate is correctly recognized or not.
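Putting the three returned items together, an application might consume the N-best list along the following lines. The result structure and field names here are hypothetical stand-ins, not the actual API types:

```python
# Hypothetical sketch of consuming an N-best result list. Each candidate
# carries its text, its utterance-verification confidence score, and the
# semantic tags (if any) attached in the grammar.
nbest = [
    {"text": "transfer funds to savings", "confidence": 0.91,
     "tags": {"action": "transfer", "dest": "sav"}},
    {"text": "transfer funds to checking", "confidence": 0.44,
     "tags": {"action": "transfer", "dest": "chk"}},
]

REJECTION_THRESHOLD = 0.5              # illustrative; settable through the API

best = nbest[0]
if best["confidence"] >= REJECTION_THRESHOLD:
    action = best["tags"]              # drive the call flow from the tags
else:
    action = None                      # low confidence: reprompt the caller
```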
International Languages
The Lucent ASR engine supports speech recognition in a number of languages, including North American English, North American Spanish, European Spanish, Canadian French, German, and Italian. There are plans to support additional languages.
Run Time Control
The Lucent cPCI Speech Processing Board hosts a number of speech and signal processing modules in addition to the ASR engine. For example, Text-to-Speech Synthesis (TTS), play, record, and DTMF can run simultaneously with ASR on the same board. An efficient, multi-processing Run Time Environment (RTE) executes on the board, acting as a real-time operating system. An extensive real-time control mechanism has been developed and implemented as part of the RTE. This on-board real-time control allows various functions (e.g., playing prompts, DTMF detection, and ASR) to interact to trigger or stop one another, and it enables very rapid response to input events because the run-time messaging can be contained to a single board. Take, for example, ASR with barge-in. Without real-time control, the ASR engine would need to send a message to the application indicating the detection of a barge-in event; the application would then send a message to the play function to stop playing the announcement. This can add significant delay between the time barge-in is detected and the time the announcement stops, resulting in possible caller confusion and dissatisfaction. With real-time control, the application can instruct the board, from the start, to stop play if a barge-in is detected by the ASR, and the messaging is all done within the board. This close coupling of various functions allows for very robust and user-friendly telecom applications.
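Conceptually, the application pre-arms a trigger on the board rather than relaying messages itself. The sketch below is a hypothetical illustration of that idea (the Board class and its methods are invented for this example; they are not the RTE’s actual interface):

```python
# Hypothetical sketch of on-board real-time control. The rule is installed
# once, before the prompt plays; when ASR detects barge-in, the RTE stops the
# prompt on the board itself, with no round trip through the host application.

class Board:
    """Stand-in for a session on the cPCI Speech Processing Board."""
    def __init__(self):
        self.triggers = []

    def add_trigger(self, source, event, target, action):
        # e.g., ("asr", "barge_in") -> ("play", "stop"), resolved on-board
        self.triggers.append((source, event, target, action))

board = Board()
board.add_trigger(source="asr", event="barge_in", target="play", action="stop")
# board.play("welcome.ulaw"); board.recognize("banking")  # hypothetical calls
```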
Conclusion
The LASR engine delivers industry leading performance in the telecom environment. Whether the call is a mobile call, a landline call, or a hands free call, the LASR engine delivers unparalleled accuracy at the highest channel density in the industry. Various features that make the interaction between the caller and the ASR more user-friendly have been developed and incorporated into the LASR engine. As an integral part of the Lucent Speech Server, the LASR engine executing on the cPCI Speech Processing Board allows the LSS to be a high channel density, scalable solution for service providers.
APPENDIX A
The Lucent Compact PCI Speech Processing Board
Figure A1. A functional diagram of the Lucent Compact PCI Speech Processing Board.
The Lucent cPCI Speech Processing Board uses advanced hardware technology that offers unparalleled reliability and flexibility. It features three PowerPC 750 processors--the computing horsepower of three workstations on one board--and offers high density support of many simultaneous channels of speech while consuming half as much power as the leading competitor’s board. Echo cancellation is integrated within each board to support Bell Labs’ patented barge-in feature.
The variety of speech and signal processing capabilities delivered by the Lucent Compact PCI Speech Processing Board includes
Automatic Speech Recognition (ASR),
Text-to-Speech Synthesis (TTS),
Speaker Verification (SV),
Speech Compression/Coding,
Speech Play,
Conferencing,
DTMF Detection,
Tone Generation,
Call Progress Tone Detection,
Telecommunications Devices for the Deaf (TDD) Tone Detection,
Speech Recording, and
Echo Cancellation.
The ASR and TTS capabilities support a number of different languages. Advanced ASR features include barge-in to interrupt prompts, rejection of erroneous sounds, N-best recognition, and N-gram statistical language modeling.
A single board can perform both TTS and ASR. In general, supporting functions can also be combined with ASR and/or TTS. For example, each ASR task can be supplied with echo cancellation, DTMF detection, and the playing of TTS prompts embedded in mu-law prompts, all on the same board.
In addition to the algorithmic engines for the speech and signal processing functions listed above, the software suite that accompanies the Lucent cPCI Speech Processing Board includes an efficient, on-board Run Time Environment (RTE) and a Software Development Kit (SDK) featuring powerful Application Programming Interfaces (APIs). These tools feature utilities to configure on-board run-time control so that various functions (e.g., playing prompts and DTMF detection) can interact to trigger or stop one another. This facility enables very rapid response to input events, because the run-time messaging can be contained to a single board.
http://www.lucent.com/livelink/153151_Whitepaper.doc