
Re: None

Monday, 06/11/2001 1:29:51 PM
Post# of 93820
The Power of Speech - part 1 ( http://www.conversay.com/LitLibrary/Embedded/Conversay_Embedded_White_Paper.pdf )
The Conversay Advanced Symbolic Speech Interface (CASSI) is Conversay’s unique speech engine for embedded devices. Based on over 5 years of research and development, this software solution provides today’s handheld devices and other Internet appliances with unique capabilities and interface choices.

Handheld devices such as cellular phones and PDAs share characteristics, such as small displays and limited user input, imposed by size constraints. Some Internet appliances, such as set-top boxes (STBs), may allow input via a keyboard or mouse, but only as an absolute last resort. These limitations require manufacturers to invent new methods of providing efficient and friendly user interaction with devices.

With Conversay technology in Internet appliances like wireless devices, users can access information using the most natural interface there is: the human voice. Conversay's technology provides continuous, speaker-independent speech recognition in a small kernel. No user training is required to enjoy the benefits of speech applications. With a virtually unlimited vocabulary, the CASSI engine gives developers the ability to completely customize the user interface and experience. Everything from simple voice dialing to more sophisticated voice-interactive Web "conversations" is available to the system designer.

This white paper discusses the applications, system requirements, and capabilities of the CASSI speech engine, and also describes basic implementation details for the developer of embedded client solutions.

INTRODUCTION
CASSI is Conversay’s speech recognizer and text synthesis engine. The engine can be used for a variety of embedded (or client) systems. CASSI can run on single or dual-processor hardware designs. Conversay application and solution developers write application code that uses the CASSI API to integrate speech recognition and Text-To-Speech (TTS) capability into embedded products and into applications that run on those products.

CASSI provides continuous, speaker-independent speech recognition. TTS capabilities are included in the core engine, allowing the TTS component to remain independent, yet able to share a number of common resources with speech recognition. This independence allows the system developer to implement both speech recognition and TTS with lower overall system requirements than would be otherwise possible.

The CASSI engine consists of a small, ANSI-C kernel that uniquely positions CASSI as a highly portable code library, suitable for a variety of devices, processors, and OSes. The TTS component is optional and may be included or excluded from the system design depending on the application designer's requirements.

CLIENT APPLICATIONS
Speech recognition system
CASSI provides functions that:
• Enable and disable word lists
• Initialize the engine
• Retrieve recognition results
• Set various recognition and TTS parameters

For example, this functionality could support a voice-operated digit dialer, a voice information kiosk, a voice-activated Internet appliance, or an embedded Web browser.

A typical client application is responsible for managing the audio sub-system and submitting audio packets to the CASSI speech engine. After recognition is performed, the recognized speech is placed in a CASSI system queue from which the client application can retrieve it. Conversely, for TTS, text is placed into the CASSI system queue for processing; the client application then retrieves the resulting audio buffer data and sends it to the device's audio sub-system.

Text-To-Speech (TTS)
CASSI contains two modules for performing TTS: the text-to-phonetics unit and the TTS synthesis module.

The text-to-phonetics unit accepts arbitrary written text as input, and outputs a string of phonemes for CASSI to synthesize. The text-to-phonetics unit performs text-processing including: text normalization, homograph disambiguation, spelling-to-pronunciation (STP), parsing, phrasing, morphological analysis, and dictionary lookup.

The TTS synthesis module converts the phoneme string to an audio buffer, using frequency-domain technology. This continuous audio stream, output as 8 kHz, 16-bit linear samples, is then sent directly to the device's codec. A dictionary and STP file are required for TTS operation; the recognition engine shares these two elements.

Application development
Application developers typically perform several steps when incorporating speech technology into their devices. The steps below illustrate this process.

1. Definition of capabilities
What features and benefits does the developer wish to provide to their customers? Will the system features include text-to-speech, speech-activated control, or some combination of these features?

2. Analysis of hardware resources
Do the existing hardware resources provide the capabilities necessary for speech applications? If not, what additional memory or processor requirements are necessary? What does the audio I/O look like on this device? Will the microphone be built in? Will there also be a wired microphone attachment? Will it be noise canceling?

3. User interface design
The application developer prototypes the designs of the speech interface using existing simulation tools. Usability testing and refinement continues until the final speech interface is defined.

4. Development
Actual code development and integration work commences. Conversay provides a number of tools and technologies to assist at each step of the application development process, including PC simulation tools that enable rapid prototyping and user interface development in a Windows-based desktop environment. Additional development tools allow developers to create further dictionaries and vocabulary definitions for custom speech recognition solutions.

HARDWARE ENVIRONMENT
Due to its modular nature, CASSI is suitable for a variety of systems. CASSI may be used with single-processor designs where one processor handles all component execution. Alternatively, the recognition feature extraction and TTS synthesis may be separated onto a DSP (or other signal processor), with the back-end recognition search function and TTS text processing executed on a separate, general-purpose processor.

Front-End block
The front-end block is used for recognition and TTS functions. The codec supplies voice data to the DSP for processing. The front-end processor performs a first-pass analysis and supplies the back-end search module with extracted features of the audio data. In a reverse manner, the back-end search module provides the front-end processor with TTS output as a phoneme string. The front-end processor uses this phoneme string as input and performs TTS synthesis to provide the codec with the audio stream for playback. The DSP sub-system must be designed with sufficiently fast ROM and RAM access times to ensure real-time performance. In a single processor implementation, the front-end block is incorporated into the same processor as the back-end processor block.



