
Re: cksla post# 15283

Monday, 09/09/2002 2:26:53 PM


Post# of 93819
Voice recognition is a must-have item on next-generation portables

Portable Design, July 2001

By Richard Nass

Voice recognition can be implemented in hardware, in software, or with a combination of the two. Choosing the right method depends on the application.

Voice recognition offers the ideal input solution for a small form-factor device--if it works properly. Unfortunately, that's not always the case. For lots of reasons, voice recognition hasn't panned out to be an end-all for input. In some cases, it adds to the cost of the system. In other cases, it changes the form factor to be something that's not as user-friendly as it could be. And in other cases, it just flat out doesn't work.

There are different types of voice-recognition solutions available. Some are software only (running on the host CPU), while others contain their own specialized hardware. Recent developments make the software-only versions more attractive, for two reasons. First, CPUs in general have more horsepower to handle the application software. Second, some microprocessors now include special hooks to handle the recognition.

One example of a processor designed with voice recognition in mind comes from Analog Devices. The company's Frio DSP was co-developed with Intel.

"Historically, we've seen 16-bit CPUs used in speech-recognition systems," says Ken Weurin, a DSP product manager at Analog Devices. "Our offering is better for a number of reasons, including a significant increase in performance, and the hardware hooks for OS support."

The Frio DSP core could be used in either speaker-dependent or -independent voice-recognition systems, with continuous- or isolated-word engines. This provides the maximum flexibility for designers (Fig. 1).

"On the performance side, there's twice the amount of computational resources on the Frio core as was included in our previous architecture," continues Weurin. "So we have the ability to more efficiently compute fast FIR, FFT, and convolutional calculations, which are at the heart of speech-recognition algorithms."

Having a high-end CPU allows designers to move to more phonetic recognition models, based on phonemes or syllables. This should boost recognition accuracy.

Equally important on the hardware side is the ability to remove the microcontroller that resides alongside the DSP in most systems. Often, an 8-bit microcontroller is used to handle some of the general housekeeping and I/O functions for which the DSP isn't well suited. Higher-end DSPs, like the Frio or the 55X family from Texas Instruments, have enough computational power to eliminate that component. Features that make this possible include memory protection, MMUs, and support for user and supervisor modes.

The benefit of using a single processor (and just one programming model) is that development doesn't require two sets of development tools. It also doesn't require the designer to know two different instruction sets.

IBM, one of the leaders in voice-recognition technology, developed a product that runs on a host processor. The software-only solution, called ViaVoice, requires just 5 MIPS from the processor, although it can take advantage of more compute performance if it's available.

"One of our big markets is the telematics (automobile) area," says Ken Houy, a marketing manager for client systems at IBM. "From a voice perspective, it's a hot market because of government regulations and for ease of use. Voice will be a key interface for getting to any of the devices that are running in your car, whether it's a phone; the automobile monitoring and calling back to a service vendor; or being able to interface with your PDA sitting in your briefcase, maybe through a Bluetooth connection."

The folks at IBM claim that their software can be ported to any available mainstream microprocessor or operating system. This lets them "voice-enable" just about any type of portable system, which includes a long list of future Internet-enabled products, such as smart phones and cell phones with browsing capabilities.

One of the features of the ViaVoice solution is that it offers distributed technology, meaning that the processing requirements can be split between the client (portable) device and the server end. For example, some of the recognition can occur directly in the phone, like address or number look-up, or simple dial functions. At the server end, more sophisticated features can be implemented, like dictation or database features.

Another key feature of distributed technology is that if a phone connection is lost, the recognition can continue to occur within the client. When the connection is reestablished, the process can continue almost seamlessly. If it were a server-only solution, the user would have to start the process over from the beginning.
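The client/server split described above can be illustrated with a short Python sketch. It is a toy model under stated assumptions, not IBM's ViaVoice interface: the `DistributedRecognizer` class, the `LOCAL_COMMANDS` set, and the `server` callable are hypothetical names, and plain text stands in for audio that a real engine would decode.

```python
# Hypothetical client vocabulary: simple commands the handset resolves itself.
LOCAL_COMMANDS = {"dial", "look-up", "hang up"}


class DistributedRecognizer:
    """Toy client that handles simple commands locally and defers the rest."""

    def __init__(self, server=None):
        self.server = server    # callable taking text, or None when offline
        self.pending = []       # requests waiting for a connection

    def handle(self, utterance):
        """Route one utterance to the client, the server, or the queue."""
        if utterance in LOCAL_COMMANDS:
            return ("client", utterance)
        if self.server is not None:
            return ("server", self.server(utterance))
        # Connection lost: keep the request instead of discarding it.
        self.pending.append(utterance)
        return ("queued", utterance)

    def reconnect(self, server):
        """Restore the link and flush queued work instead of starting over."""
        self.server = server
        results = [("server", server(u)) for u in self.pending]
        self.pending.clear()
        return results
```

In this sketch, local commands keep working while the link is down, and anything that needs the server is held until `reconnect` is called, which mirrors the "continue almost seamlessly" behavior the article describes.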

In the telematics area, Motorola is expected to release its iRadio Internet radio solution by the end of the year. The company claims that this is a complete system, including such features as phone, Internet access, directory dialer, and address book (Fig. 2). It handles the voice recognition using IBM's ViaVoice product, which adds the ability to send e-mail and to receive it by having it read aloud to the user. Sony and JVC will follow shortly with similar products.



Accuracy counts

Accuracy has always been the sticking point for voice recognition, at least from the user's perspective. If the device can't accurately understand the message the user is trying to convey, the application becomes useless. In most cases, the portable system won't be used for dictation, simply because the processing power isn't available. The voice-recognition features offered on a portable device are more likely to be along the lines of command recognition, where a finite list of commands is used. Recognizing on the order of 20 words isn't difficult for the system to handle.

Systems with limited performance often use a tree model, in which trigger words access other vocabularies. For example, a phone can offer a fixed set of 10 commands, such as dial, look-up, and hang up. If the look-up command is entered, it triggers a secondary vocabulary that contains all the numbers stored in the directory. Or if the command is "manual dial," the ten digits on the keypad become the active words. This process allows the use of relatively large vocabularies with minimal use of processor power.
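The tree model described above can be sketched in Python as follows. This is a minimal illustration, not any vendor's implementation: the `VOCAB_TREE` table, the directory names, and the helper functions are all hypothetical, and the recognizer itself is assumed to exist elsewhere and simply report the word it heard.

```python
# Hypothetical vocabulary tree: each key is a state, each value is the short
# word list the recognizer listens for while in that state.
DIGITS = ["zero", "one", "two", "three", "four",
          "five", "six", "seven", "eight", "nine"]

VOCAB_TREE = {
    "root": ["dial", "look-up", "manual dial", "hang up"],
    "look-up": ["alice", "bob", "carol"],   # entries stored in the directory
    "manual dial": DIGITS,                  # the ten keypad digits
}


def active_words(state):
    """Return the small word list the recognizer must match in this state."""
    return VOCAB_TREE[state]


def next_state(state, heard):
    """Move to a secondary vocabulary if the heard word is a trigger."""
    if heard not in active_words(state):
        return state            # not in the active list: stay put
    if heard in VOCAB_TREE:
        return heard            # trigger word: switch vocabularies
    return "root"               # leaf word (a name, a digit, "dial"): done


def total_vocabulary():
    """Count every distinct word reachable through the tree."""
    return len({w for words in VOCAB_TREE.values() for w in words})
```

With this table the system knows 17 distinct words in total, but the recognizer never has to compare an utterance against more than 10 of them at once, which is the source of the processing savings.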

There are some vendors that offer hardware solutions as well, such as Sensory. The company can embed a low-power processor into a portable device that removes the recognition burden from the host (Fig. 3). If the processing power is available, Sensory can bundle a software-only platform.

"Our software-only solution subscribes to the theory that as MIPS and memory get cheaper, software-only makes more sense in embedded systems," says Todd Mozer, president and CEO of Sensory.

This is particularly true when you can maintain a small footprint for the software.



Hardware vs. software

How to partition the work between hardware and software depends heavily on the application. For example, today's cell phones contain relatively powerful DSPs, as well as a microcontroller, a codec, and a relatively large amount of memory. This application is one that makes sense for a software-only solution, for two reasons: adding extra silicon would increase both cost and size.

Using dedicated hardware could probably reduce the overall power consumption in the system, because it eliminates having to crank up the powerful DSP every time a word needs to be recognized. But the current crop of DSPs does a fairly good job of employing only the cycles that are needed. And the added cost versus the incremental savings in battery life probably wouldn't merit going with the hardware solution.

"We're excited about some of the new processors that are coming out, from Analog Devices, Intel, and TI, with their OMAP (Open Multimedia Applications Platform) architecture. All the major players are getting to lower power levels and giving us plenty of MIPS to work with," says Mozer.

The current generation of database products works in a speaker-dependent environment. This means that the user would repeat a word, such as a name to be entered into an address book, once or twice. This scenario works well in a small database, say with up to 30 entries.

With large systems, into the hundreds or thousands of listings, you wouldn't want to have to repeat each entry. In those situations, a phonemic-based recognizer is used, where individual sounds are recognized, then put together to form words. That's obviously a much more compute-intensive application, usually reserved for a desktop- or server-based architecture. Eventually, such an architecture will find its way into the portable domain.
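The idea of recognizing individual sounds and then assembling them into words can be illustrated with a small Python sketch. It assumes the phonemes have already been recognized perfectly, which a real engine cannot count on; production recognizers score many competing hypotheses instead. The `LEXICON` entries and the phoneme symbols are illustrative assumptions, not taken from any product.

```python
# Hypothetical pronunciation lexicon: phoneme sequences -> words.
LEXICON = {
    ("k", "ao", "l"): "call",
    ("hh", "ow", "m"): "home",
    ("jh", "aa", "n"): "john",
}
MAX_LEN = max(len(p) for p in LEXICON)


def phonemes_to_words(phonemes):
    """Split a phoneme sequence into lexicon words, or return None."""
    if not phonemes:
        return []
    # Try the longest candidate first, backtracking if the rest fails.
    for n in range(min(MAX_LEN, len(phonemes)), 0, -1):
        word = LEXICON.get(tuple(phonemes[:n]))
        if word is not None:
            rest = phonemes_to_words(phonemes[n:])
            if rest is not None:
                return [word] + rest
    return None
```

Because new words only require a new lexicon entry rather than a recorded sample from the user, this style of recognizer scales to large listings, at the cost of the extra computation the article mentions.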



Low-power controller

On the hardware side, Sensory offers a 2-MIPS processor that today resides in a voice-activated television remote control. For such a simple application, the designers were able to eliminate the microcontroller that had been present on previous-generation products, instead choosing to employ the Sensory part to handle the RF programming and other functionality in the remote.

"In general, our strategy is that we don't want to sell DSPs because we think there's a lot of good DSPs already on the market," offers Mozer. "So we partner with those vendors. When we do provide hardware, it contains some special-purpose features. For example, our current generation has a small digital filter that does the feature extraction for our neural-network algorithms."

The company's next-generation part will add special-purpose hardware to perform single-cycle multiply-accumulates.
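To show why a single-cycle multiply-accumulate matters, here is a minimal Python sketch of a direct-form FIR filter, whose inner loop is nothing but repeated multiply-accumulates. It is a generic textbook formulation, not Sensory's filter or feature-extraction algorithm; the function name and the sample values are hypothetical.

```python
def fir_filter(samples, coeffs):
    """Direct-form FIR filter: one multiply-accumulate per tap per sample."""
    out = []
    for n in range(len(samples)):
        acc = 0
        for k, c in enumerate(coeffs):
            if n - k >= 0:
                acc += c * samples[n - k]   # the multiply-accumulate (MAC)
        out.append(acc)
    return out


# Two-tap example: each output is the sum of the current and previous sample.
print(fir_filter([1, 2, 3, 4], [1, 1]))   # prints [1, 3, 5, 7]
```

A filter with N taps costs roughly N multiply-accumulates per input sample, so hardware that completes each one in a single cycle directly reduces the cycles, and therefore the power, spent on the front end.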

As for which CPU is the most appropriate to run the voice-recognition algorithms, that depends on the intended application. In some cases, a DSP makes the most sense, such as where signal processing needs to be performed at the front end of the speech-recognition algorithm. In others, such as where searching routines need to be performed, a RISC-based processor, such as an ARM device, makes more sense.

"One of the keys to reducing power on the portable system is to limit the bus activity," says Jordan Chen, the chief technical officer at Voice Signal Technologies. "DSPs tend to be very power efficient when crunching, particularly if the data fits into the DSP's on-chip memory. But the larger algorithms require you to bring the data in and out of the chip, using more power."

In a platform that contains both a DSP and a RISC processor, it's important to ensure that the signal processing can run independently on the DSP, so there's not a lot of bus activity consuming power.

Note that the most power-hungry application on a cell phone is the radio. So any speech recognition that can occur independently of the radio will substantially reduce power. That's why the partitioning discussed earlier becomes very important.

It's important that system developers receive from the speech-recognition vendor the tools needed to build an intelligent user interface (UI). In most cases, it's the system vendor that provides that UI.

But if the speech technology is packaged in such a way that it provides little or no flexibility, it reduces the amount of creativity that can go into the UI. Hence, the speech-recognition engine must provide and make accessible to the application developer all the available information.

Another vendor of software-only solutions is Advanced Recognition Technologies (ART), which recently unveiled its smARTspeak NG product, voice-recognition software that combines dialing and control functions for speaker-independent or -dependent systems in cellular handsets. The software can run on an ARM7 CPU. Features include name dialing, continuous digit dialing, menu navigation, and device control.
