OT-2001: A Speech Odyssey

June 9, 2001
By: Alfred Poor

"Open the pod bay doors, Hal."
"I'm sorry, Dave, I'm afraid I can't do that."

That simple exchange in Stanley Kubrick's masterpiece movie, 2001: A Space Odyssey, based on the novel by Arthur C. Clarke, sparked the imaginations of an entire generation a dozen years before personal computers arrived on the scene. Just as Dick Tracy's two-way wrist radio set our expectations for communications, the computer HAL in 2001 made it seem possible that we would talk to machines.

Now that we're actually in the year 2001, those expectations are tantalizingly close to realization. Some of the latest cell phones are indeed small enough to wear on your wrist, and while computers don't understand the spoken word perfectly, enormous gains have been made in recent years.

This overview of PC speech recognition gives an introduction to how computers are able to transform spoken words into text and commands. We'll look into some of speech recognition's limitations and how they are being addressed. We'll also look into applications for speech recognition, and why so much effort is expended to improve the technology in spite of its limited success to date.

The Fundamentals of Speech Recognition
For most people, understanding spoken words is an easy task. The human brain's ability to identify and match vocal patterns is astounding; we can recognize speech even when spoken with a heavy or unfamiliar accent. As a result, it is easy for us to take this skill for granted.

If you try to teach a machine to understand speech, however, the task becomes dauntingly difficult. The spoken sounds must be broken down into data that the computer can process and use to identify the words as you say them (See our related story, "Knock-Knock, Who's There?", for a quick description of how voice recognition is used for security and authentication purposes).

Most speech recognition systems use similar strategies to turn sounds into text. The process can be broken down into five discrete steps:

• Speech input
• Prefiltering
• Feature extraction
• Comparison and matching
• Text output


Speech Input
The first and most important step in speech recognition is to record the spoken input. This task requires a microphone, and a device to convert the analog signal from the microphone into digital data. The role of the microphone is especially important, as it must provide sufficient frequency response to accurately capture the spoken sounds, and must also keep background noise to a minimum. (See the related article on microphone technology, "Testing, One, Two, Three").

The analog signal is converted to digital data, often by the computer's sound card or circuitry. The conversion requires that the levels be recorded--or sampled--at specific intervals. The sampling rate determines the number of samples per second to be recorded, and the bit depth determines how many different levels will be recorded, or their resolution.

Audio CDs--such as the ones you would play on a stereo system--are recorded at a sampling rate of 44.1 kHz, or 44,100 data points every second. This is necessary to provide audio fidelity for high-frequency sounds up to about 22 kHz. Human voice, however, can be well described using sounds between 100 and 8,000 Hz. (Note: This is why telephone systems have a relatively narrow frequency response range; visit this story on HowStuffWorks.com for an interesting demonstration of how phones cut off higher frequencies.) As a result, speech can be sampled at a rate as low as 8 kHz (8,000 samples per second), though 16 kHz may provide better results if sufficient processing power and storage are available.

Similarly, the dynamic range--the difference in volume between the quietest and loudest sounds--for voice is much less than for high-quality music. Just 8 bits per sample--one byte--is sufficient in many instances for voice, though results may be improved by using 16 bits--two bytes--per sample, which is the same as the audio CD bit depth.

The choice of sampling rate and bit depth can have a major impact on the amount of data that the computer must process. One second of sound digitized at 8 kHz and 8 bits per sample creates just 8,000 bytes of data. That same second of sound digitized at 16 kHz and 16 bits creates four times as much data: 32,000 bytes. And the 44.1 kHz, 16-bit standard for audio CDs requires 88,200 bytes for just one second of mono sound.
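
To make the arithmetic concrete, here is a quick back-of-the-envelope sketch in Python (our own illustration; the sample-rate and bit-depth pairings are the ones discussed above):

# Rough data volume for one second of mono sound at various settings.
def bytes_per_second(sample_rate_hz, bits_per_sample, channels=1):
    return sample_rate_hz * (bits_per_sample // 8) * channels

for rate, bits, label in [(8_000, 8, "telephone-quality speech"),
                          (16_000, 16, "higher-quality speech"),
                          (44_100, 16, "audio CD quality (mono)")]:
    print(f"{label}: {bytes_per_second(rate, bits):,} bytes per second")

Multiply by 60 and the storage cost of even a short dictation session becomes obvious, which is why speech systems stick to the lowest rate that still captures the voice accurately.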


Prefiltering
Once the sound has been digitized and stored, the next step is to filter it. The data can be analyzed, and background noise levels can be identified and reduced, if not eliminated entirely. This can be achieved by a variety of methods, including Linear Predictive Coding (LPC) and spectral analysis. The result is cleaner sound data that can be processed more reliably.
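
The exact filtering varies from product to product. As a much simpler stand-in for techniques such as LPC or spectral analysis (this is our own illustrative sketch, not any vendor's algorithm), the cleanup step might normalize the signal, give the higher frequencies a slight boost, and gate out samples below an assumed background-noise threshold:

import numpy as np

def prefilter(samples, noise_floor=0.02, preemphasis=0.95):
    """Illustrative cleanup: normalize, apply pre-emphasis, and zero out
    samples quieter than an assumed background-noise threshold."""
    x = samples.astype(float)
    x = x / (np.max(np.abs(x)) or 1.0)                   # scale to -1..1
    x = np.append(x[0], x[1:] - preemphasis * x[:-1])    # boost highs slightly
    x[np.abs(x) < noise_floor] = 0.0                     # crude noise gate
    return x
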
Feature Extraction
Once the sound data has been prepared for analysis, it must be processed so that it can be compared with the stored samples. Typically, the sound data is divided into overlapping frames, each about 5 to 32 ms long.



The data is then analyzed for its component frequencies. The sound of human speech is composed of many frequencies of different volumes, all occurring at once. The individual data points must be analyzed to determine what sound waves combined to create them. A common technique used for this process is the Fast Fourier Transform (FFT). The result is a map of the frame containing the frequency and amplitude--the volume--of its component sounds.
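
A small NumPy sketch shows the shape of this step (the 25 ms frame length and 10 ms step are typical illustrative values, not taken from any particular product):

import numpy as np

def spectral_frames(samples, sample_rate=16_000, frame_ms=25, step_ms=10):
    """Split the signal into overlapping frames and return the
    magnitude spectrum (frequency vs. amplitude) of each frame."""
    frame_len = int(sample_rate * frame_ms / 1000)
    step = int(sample_rate * step_ms / 1000)
    window = np.hamming(frame_len)
    frames = []
    for start in range(0, len(samples) - frame_len + 1, step):
        frame = samples[start:start + frame_len] * window
        spectrum = np.abs(np.fft.rfft(frame))   # Fast Fourier Transform
        frames.append(spectrum)
    return np.array(frames)

Each row of the result is one frame's frequency-versus-amplitude map, ready to be compared against stored reference sounds.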

Comparison and Matching
The data is now ready for comparison with the stored sound samples, so that the words can be identified. This objective is far easier to state than it is to achieve.

For most speech recognition systems, the reference sounds are broken down into pieces called "phonemes." A phoneme is the smallest part of a spoken word; change one phoneme for another, and you get a different word. The frames created in the previous feature extraction are matched against the stored database of phonemes to determine which phonemes were recorded.

Now comes the hard part. The speech recognition program has to take the sequence of phonemes and match it against the stored words in the program's vocabulary, trying to determine what words were spoken. Most programs rely on a statistical process that tries to predict the most likely next phoneme, based on the words that best fit the phonemes already matched.

The program creates a probability tree for the different possible matches it predicts. As the program becomes more and more certain that it has correctly identified a word, the predictions about subsequent words can change. If the program can find a match as predicted for the next sequence of sounds, it has more confidence in that branch. If the predictions prove wrong, however, it must go back up the logic tree of choices that it made, and try a new chain of solutions. This is why some speech programs hesitate for a noticeable time, and then dump a whole phrase of text on the screen at one time.

Different models are used for this process, but one of the most common is the Hidden Markov Model (HMM). It can use large libraries of words; some programs have vocabularies of 70,000 or more words. It also uses libraries of grammar rules that apply contextual clues to decide what word is most likely to come next.
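
The decoding step at the heart of an HMM recognizer is usually some form of Viterbi search: keep the most probable path of hidden states (phonemes) that explains the sounds heard so far. The sketch below is a toy version; the three phoneme states and all of the probabilities are invented purely for illustration, and real engines work with thousands of context-dependent states:

def viterbi(obs, states, start_p, trans_p, emit_p):
    """Return the most probable hidden-state (phoneme) path for obs."""
    prob = {s: start_p[s] * emit_p[s][obs[0]] for s in states}
    path = {s: [s] for s in states}
    for o in obs[1:]:
        new_prob, new_path = {}, {}
        for s in states:
            best_prev = max(states, key=lambda p: prob[p] * trans_p[p][s])
            new_prob[s] = prob[best_prev] * trans_p[best_prev][s] * emit_p[s][o]
            new_path[s] = path[best_prev] + [s]
        prob, path = new_prob, new_path
    best = max(states, key=lambda s: prob[s])
    return path[best], prob[best]

# Invented toy model: three phoneme states and slightly noisy acoustic labels.
states = ["k", "ae", "t"]
start_p = {"k": 0.8, "ae": 0.1, "t": 0.1}
trans_p = {"k": {"k": 0.1, "ae": 0.8, "t": 0.1},
           "ae": {"k": 0.1, "ae": 0.1, "t": 0.8},
           "t": {"k": 0.3, "ae": 0.3, "t": 0.4}}
emit_p = {"k": {"k~": 0.9, "ae~": 0.05, "t~": 0.05},
          "ae": {"k~": 0.05, "ae~": 0.9, "t~": 0.05},
          "t": {"k~": 0.05, "ae~": 0.05, "t~": 0.9}}

print(viterbi(["k~", "ae~", "t~"], states, start_p, trans_p, emit_p))
# -> (['k', 'ae', 't'], ...) -- the phoneme string for "cat"

When a later observation makes an earlier choice improbable, the winning path switches to a different branch, which is exactly the backtracking behavior described above.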

All of this activity is processor intensive. It also requires a great deal of system resources, including physical memory. With all the comparisons that must be made in order to keep track of multiple predictions, the computation would be too slow if the reference data had to be retrieved from a hard disk every time. Sufficient physical memory is a key factor in speech recognition performance. Some speech programs also take advantage of SIMD (Single Instruction Multiple Data) instruction-set extensions such as MMX, SSE, and SSE2 to accelerate speech processing, in addition to the streaming memory enhancements of those instruction sets. Further, the use of faster DRDRAM and DDR memories with their higher bandwidth will improve speech algorithm performance.

Text Output
Once the spoken text has been identified, the program is ready to pass it along for the next step. This could be to display the recognized text on a screen, or to trigger a command to another program.

Speech Recognition Distinctions

Speech recognition program designers have to make a number of decisions that result in balancing trade-offs. Some choices may make the program's vocabulary more limited but more accurate. Other choices may result in faster processing but lower accuracy, or in a larger vocabulary at the cost of greater storage and memory requirements.
The choices designers make have a direct effect on the suitability of a program for a given application. Here are some of the factors to consider when evaluating a speech recognition package.




Command vs. Dictation
One of the most fundamental distinctions in speech recognition is whether the spoken words are to be interpreted as commands--also known as "context-sensitive speech"--or as dictated text--also known as "context-insensitive speech."

Commands are easier to implement and typically require fewer resources. This is because the number of available choices is limited.

For example, consider giving commands within Microsoft Word for Windows 2000. If you limit the choices to the menu options, there are just nine words that the program needs to recognize: File, Edit, View, Insert, Format, Tools, Table, Window, and Help. Once the word "Edit" has been spoken and recognized, then there are just 14 possible choices for the next word. With such a limited vocabulary, it's much easier to identify the next word.
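
One way to picture why commands are easier is to think of the active vocabulary as a small grammar that changes with context. The abridged sketch below (the menu items are illustrative, not a complete list) shows how few words the engine actually has to tell apart at any moment:

# Illustrative (abridged) command grammar: once a menu word is recognized,
# the recognizer only needs to consider that menu's items as candidates.
MENU_GRAMMAR = {
    "File": ["New", "Open", "Close", "Save", "Print", "Exit"],
    "Edit": ["Undo", "Cut", "Copy", "Paste", "Find", "Replace"],
    "View": ["Normal", "Print Layout", "Toolbars", "Zoom"],
}

def candidate_words(previous_word=None):
    """Return the only words the engine must distinguish at this point."""
    if previous_word is None:
        return list(MENU_GRAMMAR)          # just the top-level menu names
    return MENU_GRAMMAR.get(previous_word, [])

print(candidate_words())        # ['File', 'Edit', 'View']
print(candidate_words("Edit"))  # a handful of choices, not an open vocabulary

With only a handful of candidates, even acoustically similar words are easy to separate, which is why command accuracy runs so much higher than dictation accuracy.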

In contrast, dictation can go anywhere, and often does. For example, the program may recognize the phrase "It was a…" This gives little indication of what comes next. Even when the phrase is expanded to "It was a dark and stormy…" there still are many different words that could come next in addition to the obvious "night"--"evening," "afternoon," "marriage," "novel," etc.

Clearly, managing all the possibilities of dictation is a large task. One way to contain the problem is to limit the vocabulary. This works particularly well in dictation systems for professions that use a specific and relatively limited vocabulary, such as in some medical or legal fields.

Continuous vs. Discrete Speech
Another important difference is how the speech is interpreted. Most early speech recognition programs for personal computers required discrete speech, which means that the words must be uttered separately, with a break between each one. This requires a stilted delivery that is unnatural and difficult for some users to master.

In recent years, speech recognition programs have been able to work with continuous speech, in which the spoken words run together just as they do in normal conversation. When the computer hears a rapidly spoken phrase like "pleasendmeasample" (a string we've run together to show what a computer might have to interpret), separating the individual words is difficult, and achieving even moderate accuracy requires a lot of processing power and storage.
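
The segmentation problem can be pictured as a dictionary-driven search. This toy sketch (our own illustration, working on written letters rather than sounds, and using the fully spelled-out phrase) recovers word boundaries from a run-together string:

# Toy word segmentation: recover word boundaries in a run-together phrase by
# recursively trying dictionary words. Real engines work on sounds, not
# letters, and must also weigh competing splits (such as "a sample" vs.
# "as ample") statistically -- this sketch just finds one valid split.
DICTIONARY = {"please", "send", "me", "a", "sample"}

def segment(text):
    if not text:
        return []
    for end in range(len(text), 0, -1):          # try longer words first
        if text[:end] in DICTIONARY:
            rest = segment(text[end:])
            if rest is not None:
                return [text[:end]] + rest
    return None                                   # no valid split found

print(segment("pleasesendmeasample"))
# -> ['please', 'send', 'me', 'a', 'sample']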

Speaker Dependent vs. Independent
Another big distinction is between speaker-dependent and speaker-independent systems. A speaker-dependent system relies on a database of phonemes from that specific user. This usually requires a long training period, in which text is displayed on the screen and the user has to read it aloud. Many programs offer short or long training options, but in general, more time spent training translates into higher accuracy.

Some training routines recognize the text as it is spoken and ask the user to repeat unrecognized words or phrases. Ostensibly, this is to train the program to recognize the user's speech, but like the handwriting recognition used by PDAs, it may be that this process also helps train the user to some degree.

Some companies have tried to eliminate the training requirements by using a database of phoneme samples created by people with a range of accents and vocal pitches. You may be asked to identify yourself as male or female, and possibly by geographic location, and the system then uses the closest-matching samples to recognize your voice. Given the wide range of accents found in this country--there are probably some folks from Maine and Louisiana who might barely understand each other if they met--this approach clearly has limitations.

Many packages use a combination of an initial database and some limited training for the specific user's voice, and then continue to learn and refine the database of sample sounds as the system is used. As a result, many programs don't reach their maximum accuracy until after you've worked with them regularly for a week or more.

The Market Outlook

Speech recognition has a wide range of practical applications, both with personal computing devices and on larger platforms. There are a number of speaker-independent systems operating on large-scale server platforms that provide telephone access to data, but these are beyond the scope of this overview. (See the SpeechWorks site for some Flash demos of their products).
Within the personal computing device markets, there is already a wide range of products.

For desktop computing, there are dictation and command packages available. Some are designed for specific professions, such as legal or medical work, while others are general purpose. Still others are built for a single task; Conversay Web Browser is an add-on for Microsoft Internet Explorer 4.0 or later that lets you speak to activate links on a Web page rather than use the mouse or keyboard ($19.95, with a free 45-day evaluation version available here).

There are also programs for specific professions, such as medical and legal applications mentioned earlier. Some programs are developed specifically for other professions, such as law enforcement.

Speech recognition also plays an important role in adaptive technology applications. Users with limited vision or motor control can speak to the computer rather than rely on a keyboard.



Mobile applications are starting to make use of speech recognition, though their limited processing power and storage capacity results in limited features at this point. Digital pocket recorders have been used to store speech for later conversion to text after being downloaded to a desktop or laptop computer.

Speech recognition on PDAs has been demonstrated, but practical products remain in the not-too-distant future. One of the earliest automotive uses was hands-free dialing for cell phones--which can be an important safety feature while driving. Our experience with speaking the names of people to be called has been mixed, and background road noise can really interfere with the recognition process.

Today, you can use speech recognition fairly successfully with a limited vocabulary in certain GPS/street mapping systems for auto usage, like Travroute's CoPilot 2001. You might ask the system the discrete phrase "next turn," and it will generate both a graphics display on the computer and a voice-synthesized response telling you the street name and how far away the turn is. This is very useful, particularly in high-speed or crowded traffic conditions.

Many other automotive applications will be possible down the road, so to speak. We expect background noise filtering to improve as microphones become more accurate and better tuned to the environment (see "Testing, One, Two, Three"), algorithms to improve, continuous speech recognition to become standard, and far more processing power to be delivered as embedded processors improve. A range of mobile e-commerce, mobile corporate business, and entertainment applications will be speech-enabled to permit hands-free operation.

Factors That Slow Adoption

PC-based speech recognition has been pretty good for a number of years, and the storage and processing power required to make it work has been available in affordable computer configurations. There is something appealing about being able to work with your computer, hands off. Just sit back and tell it what you want it to do, as if it were the electronic assistant that many of us wish our computer could be.
Aside from some niche markets in specific professions, however, speech recognition simply has not caught on. Lernout & Hauspie is one of the industry leaders, having bought a number of competitors including Kurzweil and Dragon Systems in recent years, yet it was forced to file for bankruptcy protection in 2000. Why isn't speech recognition much more successful?



There are many possible factors. First and foremost is accuracy. According to a report from the Center for Language and Speech Processing at Johns Hopkins University, speech recognition can be about 99.5% accurate in replacing touch-tone menu systems over telephones. That means that you can expect just one command in 200 to be interpreted incorrectly.

However, for dictation--even with training on a speaker-dependent system--accuracy is only estimated to be about 95%. That may seem to be a high rate, but if you figure that there are 200 to 300 words on a typical typed page, then this means that the program will get 10 to 15 words wrong on every page. Whether you make the corrections by voice or by keyboard, it can take a significant amount of time to correct these errors. And since the program will be using correctly spelled words when it makes these errors, you can't rely on a spell checker to catch the mistakes--you have to read over the entire dictation carefully.
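
The arithmetic is worth spelling out, since it drives the proofreading burden (a few lines of Python, using the error rate and page lengths cited above):

# Expected recognition errors per page at a given accuracy rate.
def errors_per_page(accuracy, words_per_page):
    return (1.0 - accuracy) * words_per_page

for wpp in (200, 300):
    print(f"95% accuracy, {wpp}-word page: ~{errors_per_page(0.95, wpp):.0f} errors")
# -> about 10 and 15 errors, which must be found by careful proofreading,
#    since the spell checker sees only correctly spelled words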

Another factor is that most people are not accustomed to dictation. Even if the software is capable of screening out "disfluency" errors--the "ahhs" and "umms" that are often a part of conversational speech--it still takes considerable practice to be able to speak the text you want the first time, without backing up to make corrections and changes. If you've ever been deposed by an attorney in a legal case and have read your deposition, you may have been surprised at how fragmented and broken up your speech patterns appear compared with how you would have written the responses.

You must also consider the microphone. Background noise can bring down accuracy, so you need a good quality mike that screens out extraneous noises, as we briefly discussed in the automotive application section above. Where will it be located? For best results, you should use a boom mike that either hangs on your ear or from a headset. This can be uncomfortable to wear all day long, and can interfere with other activities such as walking away from your computer or answering the telephone. (Andrea Electronics has a handy PC/Telephone interface that lets you use the same headset for your computer and your telephone. You can listen to music on your PC, dictate using the microphone, or hold telephone conversations, all with your hands free for the mouse and keyboard.)

All of this adds up to one fact: It takes a commitment--an investment of time and peripherals--in order to get a reliable speech recognition system that you can use with maximum accuracy. Not all users are willing to make this commitment just so they can speak to their computer.

The Changing Market
There are reasons to expect that speech recognition may become more popular in the coming years.

One key factor is Microsoft's commitment to the field. As part of its .NET strategy, the company is working to add rich speech processing services to many of its products. The ultimate goal is to provide a scalable and consistent user interface across the entire range of computing devices, from cell phones and handhelds to desktop computers and beyond. The idea is that users will be able to obtain information anywhere, anytime, from any device.

One part of this development effort is the Speech Application Programming Interface (SAPI) 5.0; a free software development kit (SDK) is available for download from the Microsoft site. This is a programming interface designed to form a common connection between speech recognition engines and application programs, so that it is easier for developers to make their programs speech-enabled. For example, a programmer could need as little as three lines of code to translate a sound source to text; Microsoft provides sample code in the SDK.

The SAPI can take the output from any compatible speech recognition engine; in addition to Microsoft's own engine, IBM and Lernout & Hauspie Speech Products have SAPI 5.0-compatible speech engines.

A more visible outcome of Microsoft's efforts is to be found in the new Microsoft Office XP. The new application suite will be voice-enabled, so that you can use its built-in speech recognition features for both command and dictation. The fact that speech recognition will be included at no extra cost may encourage more users to give it a try--with the result that this feature may become more widely used.

Future Developments

Assuming that there is sufficient demand for the products, technological advances promise to make speech recognition easier to use and more widely available.
For example, one of the key technology features of the original Star Trek television series was the fascinating Universal Translator that automatically translated any alien speech into English. We're not there yet, but ViA, Inc. is working on something close. The company is building a speech recognition system based on a wearable computer using the Transmeta Crusoe processor. The output will be run through a language translation program, and the results will be sent to a text-to-speech (TTS) program to produce a spoken translation.



The research for this device is funded by the Office of Naval Research and is designed primarily for military use. The device could also be helpful to other people who need to communicate in unfamiliar languages, such as police, health workers, and tourists. ViA intends to provide nearly-simultaneous translation for major European languages, as well as Korean, Serbian, Arabic, Thai, and Mandarin Chinese. The company has demonstrated prototypes and hopes to have a production model shipping by the end of 2001.

There are also other approaches that could change the way speech recognition is accomplished. Integrated Wave Technologies (IWT) has developed new systems based on technology created in the former Soviet Union. Faced with computing equipment of limited capabilities, the Soviet researchers developed highly efficient algorithms to perform sound analysis.

Instead of using phonemes, IWT has a system that analyzes the frequencies and volume characteristics of a voice sample, which it can then compare directly to templates of specific commands and phrases. The result of this approach is comparatively small programs requiring modest memory and storage resources that can act much faster than phoneme-based systems.

IWT has used this approach to create prototype belt-mounted voice command translation devices--under a National Institute of Justice Science and Technology grant--which have been tested in the field by the Oakland, California Police Department. (Details of the prototype program and the IWT technology can be found here).

Advanced speech recognition could also make other applications more useful and appealing. By adding speech recognition and TTS features to many programs, it will eventually be possible to accurately converse with your computer--something we've all been hearing about for years and waiting for patiently. Beyond the limited recognition of command sets required to control office applications and the like (which is particularly compelling for some handicapped individuals), the challenges of taking it one step further into handling contextually rich speech interactions go far beyond speech recognition hurdles. These interactions require natural language processing with contextual analysis, and the sequencing of multiple separate events.

For example, if you ask your computer to set up reservations for a business trip and book a room for you, it may fully understand the words accurately. But now it must take action and coordinate with many external and internal software systems and services: handling your personal preferences, delivering notifications for problematic or successful transactions, and so on. This chain of events often has dependencies requiring one stage to complete before the next starts. Actions that today are manually enabled with keyboard and mouse, and often sequenced manually, will have to be done automatically, triggered from rich spoken command sequences to your computer. The idea of web sites talking to other web sites in initiatives like .NET ties into the equation.

We expect eventually that all personal information management (PIM) programs will accept spoken instructions to add appointments to a calendar, or to autodial a phone number from the contact list, or to read your current To Do list aloud. PC games already have used voice recognition to a limited degree--but in the future, the player may be able to have spoken dialogs with the program's characters, adding to the sense of immersion in the game environment.

We are still a long way from being able to converse freely with a computer like HAL from 2001, but we're getting closer. In settings where the vocabulary requirements can be limited to some degree, we now have the technology to make many of our computing applications voice-enabled. It remains to be seen whether users at large will find this a valuable addition.

Links


Microsoft's Speech Resources page: Numerous links to speech-related journals and companies.
Speech Technology Magazine



Testing, One, Two, Three

June 9, 2001
By: Alfred Poor

When you speak to your computer, how well does it hear you?
This may seem like a silly question, but it's entirely serious. You need to reproduce the sound of your voice with as much fidelity as possible, so that the computer has the most accurate data to work with as it attempts to interpret those sounds. And it stands to reason that some microphones can do the job better than others. Also, the way you use your microphone can make a big difference in how it performs.

Microphone Types
Over the years, a number of different microphone technologies have been created and refined. Some are based on something as simple as a layer of carbon particles--a design that was the mainstay of telephone handset microphones for many years.

Most microphones use one of two basic designs: moving coil--also known as dynamic--and condenser.

In a loudspeaker, an electromagnet is used to move a cone back and forth in response to changes in an electrical current. The vibrations of the cone create pressure waves in the air, which we perceive as sound. A moving coil microphone uses the same principle, only in reverse. Sound waves press on a diaphragm, causing it to move back and forth. A coil of wire--called the voice coil--is attached to this diaphragm and surrounds a fixed magnet. As the coil moves back and forth with the diaphragm, the magnetic field induces an electrical current in the coil wire. These currents can be amplified and used as the sound signal for recording.



[Figure: Moving coil microphone]
Condenser microphones rely on capacitance--the ability of two parallel plates to hold an electrical charge. Generally, one plate is fixed, while the other moves in response to the changes in air pressure caused by sound waves. The movement brings the plates closer together or farther apart, and as the distance changes, so does the capacitance of the device. These changes can also be amplified and used to create the sound signal for recording.
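
For a parallel-plate capacitor, capacitance is proportional to plate area and inversely proportional to the gap between the plates, so a diaphragm pushed inward by sound pressure raises the capacitance. A quick sketch of the relationship (the plate dimensions below are illustrative, not taken from any real capsule):

# Parallel-plate capacitance: C = epsilon * area / distance, so as sound
# pressure pushes the plates closer together, capacitance rises.
EPSILON_0 = 8.854e-12          # permittivity of free space, F/m

def capacitance(area_m2, gap_m, relative_permittivity=1.0):
    return EPSILON_0 * relative_permittivity * area_m2 / gap_m

quiet = capacitance(area_m2=1e-4, gap_m=25e-6)        # illustrative dimensions
loud = capacitance(area_m2=1e-4, gap_m=20e-6)         # diaphragm pushed closer
print(f"{quiet*1e12:.1f} pF -> {loud*1e12:.1f} pF")    # capacitance increases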



Some condenser microphones require an electrical current in order to maintain the different charges on the two plates. In many cases, a small battery mounted in the microphone housing supplies this power. Other condenser microphones are designed to use "phantom power," which is drawn from another device.

Newer designs use electret materials. These are special plastics--such as Teflon--that can be permanently charged during manufacture. No external current is required, making it possible to create very small and lightweight microphones. These microphones may have a more limited lifecycle than other condenser or moving coil designs.

Noise Cancellation

Another important factor in sound recording quality is noise cancellation. In some circumstances, you may want a microphone to pick up sounds from all directions. In other cases--such as speech recognition on a computer--you may want to screen out sounds from all directions except that of the computer operator. The microphone's design affects its noise cancellation characteristics.


A microphone can block unwanted noises in two basic ways: passive or active cancellation. Passive methods are the most common, because they can be implemented in the way the microphone is physically constructed; as a result, they add almost nothing in cost or weight.

The most commonly used format is the cardioid microphone. The name comes from the heart-shaped cross-section of its sensitivity pattern. The microphone is most sensitive to sounds that occur directly in front of it, and sensitivity falls off sharply as the sound source moves behind the front end of the microphone. Sounds from directly behind the microphone are almost totally blocked.



This cancellation is achieved by creating two paths for the sound waves to reach the microphone's diaphragm, such as putting the diaphragm at the end of a tube. Sound waves originating from the open end of the tube must travel nearly the same distance to reach both sides of the diaphragm, and so are registered. Sounds originating from the other end, however, reach the open side of the diaphragm first, and then travel to the end of the tube before returning up the tube to hit the other side of the diaphragm. If this tube is tuned properly, the delayed sound waves cancel out the ones taking the shorter path. In practice, a number of delay paths may be used--along with acoustic foam--to block a wider range of frequencies.
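
The principle is that the diaphragm responds to the pressure difference between its two faces, and the internal path's delay is tuned to match the time sound needs to travel around the microphone body. The toy NumPy sketch below (the frequency, delay, and dimensions are all illustrative) shows a rear-arriving tone canceling while a front-arriving tone does not:

import numpy as np

# The diaphragm responds to the pressure DIFFERENCE between its front face
# and its rear port; the rear port adds an internal delay tuned to match the
# time sound needs to travel around the microphone body.
rate = 48_000
t = np.arange(0, 0.01, 1 / rate)
tone = lambda delay: np.sin(2 * np.pi * 1000 * (t - delay))
path_delay = 0.25e-3            # time to travel around the mic body, ~8.5 cm

# Sound from BEHIND: hits the rear port first, then the front after path_delay.
# The internal delay makes both faces move together, so the difference is ~0.
front, rear = tone(path_delay), tone(0)
behind_output = front - np.roll(rear, int(path_delay * rate))

# Sound from the FRONT: hits the front first; the two delays now add instead.
front, rear = tone(0), tone(path_delay)
front_output = front - np.roll(rear, int(path_delay * rate))

print(f"rear source level:  {np.max(np.abs(behind_output)):.3f}")
print(f"front source level: {np.max(np.abs(front_output)):.3f}")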



Cardioid microphones work well for speech recognition because they reject sounds arriving from behind the microphone--which, for a microphone aimed at the speaker's mouth, means most of the surrounding room. This helps eliminate much of the ambient noise in many environments.



[Figure: A three-dimensional view of the sensitivity pattern of a cardioid microphone]
Active noise cancellation microphones are more complicated. They actually rely on two or more microphones. In a two-microphone configuration, one is used to pick up the speaker's voice, and the other is used to gather the ambient noise in the environment. These ambient noise signals are then subtracted from the speaker's signal. This can do an excellent job of pulling the user's voice out of a noisy background, but the design adds weight and cost compared with passive cancellation microphones.
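
In its idealized form, the two-microphone arrangement is a simple subtraction; real products have to use adaptive filtering because the noise reaching each microphone differs in level and delay. A minimal sketch of the idea, assuming the two microphones hear identical noise:

import numpy as np

# Two-microphone active noise cancellation, reduced to its simplest form:
# subtract what the reference (ambient) microphone hears from what the
# primary (voice) microphone hears.
rng = np.random.default_rng(0)
t = np.linspace(0, 1, 16_000, endpoint=False)

voice = 0.8 * np.sin(2 * np.pi * 220 * t)          # stand-in for the speaker
noise = 0.5 * rng.standard_normal(t.size)          # ambient background noise

primary_mic = voice + noise        # picks up the speaker plus the room
reference_mic = noise              # aimed away, picks up mostly the room

cleaned = primary_mic - reference_mic
print(f"residual noise after subtraction: {np.max(np.abs(cleaned - voice)):.2e}")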

One of the most promising implementations of active noise canceling is the development of array microphones. Most passive microphones require the user to position them very close to the mouth when speaking. Array microphones, such as those available from Andrea Electronics Corporation, allow the device to be placed two to four feet away. The signals from two to eight microphones are digitally processed. Not only does this allow background noise to be reduced, but it is also possible to do "beam steering," which can track a user who is moving within the reception area.

The Digital Advantage

Most microphones use an analog connection to your computer through a jack in the sound card. This approach can be adequate for many applications, but it can also be the cause of lost fidelity and sound quality that can create problems for speech recognition programs.



Some sound cards are built with minimal attention to the microphone channel's circuitry. The market is competitive, consumers are cost-sensitive, and there may be little advantage in building a better microphone circuit in a consumer market sound card. As a result, the quality of analog-to-digital conversion in the sound card may not be as good as it could be. Also, these cards are susceptible to electronic noise generated by emissions from the computer's motherboard, expansion cards, and other components.

One solution is to move the conversion circuitry out of the computer and provide a digital signal from the microphone. This is the concept behind USB microphones, such as those available from Philips and Plantronics.

These microphones have a box that contains the DSP (digital signal processing) circuitry to digitize the analog signal. No external power supply is required, as it can draw power from the USB cable. The digital signal is delivered to the computer through the USB connection, and the potential for interference is greatly reduced. The result is a cleaner signal that can improve speech recognition software performance.

You may recall that USB speakers were heavily promoted a few years ago, but turned out to have many problems--mostly related to drivers, operating system issues, and USB technology itself. Glitches often happened during certain multitasking scenarios, system lockups would occur at boot, hot-plugging worked sporadically, speakers would sometimes stop working altogether, and so on. Windows ME ironed out many of the problems, but consumers are still wary. However, USB microphones tend to be a more stable technology when run under the latest operating systems like Windows 98 SE, 2000, and ME, and our initial experiences have been successful.

It is also possible to get digital performance from existing analog microphones. Companies such as Andrea Electronics make USB converters that let you use a standard microphone or headset as a USB device.

In order to get the best from your speech recognition software, you need to make sure that your computer hears you clearly. Choose an appropriate microphone and make sure that you have it positioned and adjusted correctly in order to get the best results.

Links
DPA Microphones: This site includes detailed information about microphone specifications and testing. The information is aimed at recording studio applications, but is useful for any use of microphones.

"Sound Bits and Bytes: An Introduction to Microphones" by John L. Butler: this is a short but excellent overview of microphone use in recording. It is not as comprehensive as some sites, but is filled with practical information.

"A Primer on Microphones" by Peter Elsea, UCSC Electronic Music Studios: An excellent overview of microphone technology and specifications, with good diagrams. The same site also includes other good papers on music, recording, and sound topics.

"Microphone Techniques for Music: Sound Reinforcement": a booklet in Adobe Acrobat format from the Shure Brothers, Inc. that provides detailed information about microphone technology, the science of sound, and recording tips.



Knock-Knock, Who's There?

June 9, 2001
By: Alfred Poor

People often use the terms "speech recognition" and "voice recognition" interchangeably, but "voice recognition" is an ambiguous term that has been used--and misused--so widely that it may be confusing. Speech recognition--or speech-to-text--takes spoken words and interprets them as text or commands. Voice recognition can refer to either the process of interpreting what is said, or it can refer to the identification of who is speaking.
The better terms for the identification type of voice recognition are "voice verification" and "speaker authentication," which use voice sounds for security purposes. This is possible because each individual has a distinct vocal pattern called a "voice print," which is as unique as a fingerprint.

Voice verification is used to confirm the claimed identity of a user. This is used for personal computer security systems that substitute spoken passwords for typed ones. The user answers a request for a password by speaking his or her name or some other key phrase into the computer's microphone. The computer compares the speaker's voice to samples of the phrase recorded earlier during a training process. The computer has to confirm or deny that the spoken sample matches the stored data.
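
Reduced to a toy example, verification is a distance test between a feature vector extracted from the spoken passphrase and a stored template. In this sketch the feature values and acceptance threshold are invented; real systems compare whole sequences of spectral features and derive the threshold from enrollment statistics:

import numpy as np

# Toy voice verification: compare a feature vector from the spoken passphrase
# against the enrolled template and accept if it is close enough.
def verify(features, enrolled_template, threshold=0.90):
    a = features / np.linalg.norm(features)
    b = enrolled_template / np.linalg.norm(enrolled_template)
    similarity = float(np.dot(a, b))          # cosine similarity, -1..1
    return similarity >= threshold, similarity

enrolled = np.array([0.61, 0.12, 0.44, 0.25, 0.58])   # stored "voice print"
attempt = np.array([0.59, 0.14, 0.47, 0.22, 0.55])    # today's utterance
print(verify(attempt, enrolled))   # -> (True, ~0.99): accepted as a match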

Speaker authentication or identification is more complicated--it attempts to identify the speaker. Unlike verification applications where the system only has to verify that the user is who he or she claims to be, authentication systems must search a database of stored voice patterns to find a match for an unknown speaker.

For more information on voice verification and speaker identification, see the Michigan State University Biometrics Research site.

Copyright (c) 2002 Ziff Davis Media Inc. All Rights Reserved.

http://www.extremetech.com/print_article/0,3428,a=1623,00.asp
