
2—SOFTWARE
ISD-SR3000
Voice Solutions in Silicon
2.2 RECOGNITION ENGINE
The ISD-SR3000 uses a segmented triphone recognition process. The sampled speech utterance
is split into distinct phonetic sounds (phonemes), the smallest units of speech. Because these
phonemes vary in both sound and duration, the processor must be able to determine the boundaries
between them. The ISD-SR3000 uses Hidden Markov Models to hypothesize boundaries between
sounds and to form probabilistic models of each possible combination.
The outputs are then classified by determining matches between the phonetic sounds and the
stored phoneme models. The acoustic models for the phonemes are gathered from a large
sample of speakers, allowing for a wide variation across accent, dialect, and gender. This
allows the recognizer to associate the sound segments with a number of possible phonemes,
enabling recognition when words are pronounced differently.
The phonemes are then matched to vocabulary words or phrases using a search routine. The
set of phonemes is compared to the vocabulary models for the active topics, and the recognized
word is returned. If the phonemes do not match any of the active vocabulary words, nothing is
returned. The ISD-SR3000 does not return a score with the word; it either recognizes a word,
or it does not.
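The search step described above can be sketched in a few lines of Python. This is only an
illustration under stated assumptions, not the ISD-SR3000 firmware: the topic names, the
phoneme labels, and the recognize() function are hypothetical, and the exact-match comparison
stands in for the probabilistic matching the device performs.

```python
from typing import Optional, Sequence

# Hypothetical vocabulary: each topic maps a word to one phoneme sequence.
# Topic names and phoneme labels are illustrative assumptions.
VOCABULARY = {
    "global": {
        "cancel": ("k", "ae", "n", "s", "ah", "l"),
        "help": ("hh", "eh", "l", "p"),
    },
    "confirm": {
        "yes": ("y", "eh", "s"),
        "no": ("n", "ow"),
    },
}


def recognize(phonemes: Sequence[str], active_topics: Sequence[str]) -> Optional[str]:
    """Return the word whose model matches, searching active topics only.

    Mirrors the all-or-nothing behavior in the text: a word is returned
    without any score, or None is returned when nothing matches.
    """
    target = tuple(phonemes)
    for topic in active_topics:
        for word, model in VOCABULARY.get(topic, {}).items():
            if model == target:  # simplification: exact match, not probabilistic
                return word
    return None
```

Note that a word belonging to an inactive topic is never returned, even when its phonemes
match exactly; only the topics passed in as active are searched.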
2.2.1 TYPES OF RECOGNITION
The ISD-SR3000 is capable of both speaker-independent and speaker-defined recognition.
The recognition engine is continuous, allowing for multiple-word commands and connected
digits. However, there must be recognized silence before and after valid utterances. The length of
the silence is programmed into the host controller, and may be as short as 100 ms. The
commands and digits are speaker-independent, with models constructed from a large corpus of
speakers. The speaker-defined voicetags and commands are partially speaker-dependent;
however, they are constructed by creating acoustic models “on-the-fly” from the phoneme
base. This means only one training pass is required to enter a voicetag, and recognition
is possible with some variation in the way the name is spoken. The first pass is used to create
the phoneme model, and a second pass is used only to confirm recognition.
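The silence requirement can be illustrated with a simple energy-based check. This is a
minimal sketch under assumptions, not the device's endpointing method: the 10 ms frame
length, the energy threshold, and the find_utterance() function are hypothetical; only the
100 ms minimum silence comes from the text above.

```python
from typing import Optional, Sequence, Tuple

FRAME_MS = 10            # assumed analysis frame length
MIN_SILENCE_MS = 100     # shortest silence length mentioned in the text
ENERGY_THRESHOLD = 0.02  # assumed level separating speech from silence


def find_utterance(
    energies: Sequence[float],
    min_silence_ms: int = MIN_SILENCE_MS,
    frame_ms: int = FRAME_MS,
    threshold: float = ENERGY_THRESHOLD,
) -> Optional[Tuple[int, int]]:
    """Return (start, end) frame indices of the speech region, end exclusive.

    The utterance is accepted only if it is preceded and followed by at
    least min_silence_ms of frames below the threshold; otherwise None.
    Pauses inside the utterance are treated as part of it.
    """
    needed = -(-min_silence_ms // frame_ms)  # ceiling division
    speech = [i for i, e in enumerate(energies) if e >= threshold]
    if not speech:
        return None
    start, end = speech[0], speech[-1] + 1
    leading = start
    trailing = len(energies) - end
    if leading >= needed and trailing >= needed:
        return (start, end)
    return None
```

For example, ten silent frames, twenty speech frames, and ten more silent frames yield the
speech region, while the same speech preceded by only five silent frames is rejected.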
2.2.2 GRAMMAR
A grammar is used to define the structure of the commands. The ISD-SR3000 is designed to
work with multiple topics or a finite-state grammar. This type of grammar is designed to limit
perplexity (the number of possible branches during recognition) by pre-defining the number of
allowable words at a given state. For example, a prompt that requires a “yes” or “no” response
has a perplexity of two. Greater perplexities increase the chance of substitution errors.
Topics are groups of words that are active at a given time. During recognition, only a limited
number of topics are active. For example, in a voice dialing application, digit topics are active
after the user issues the “dial” command. No other topics are open (except global topics such as
“cancel” or “help”), so the recognizer is only trying to recognize digits. Combining this type of
grammar with active topics inherently increases recognition accuracy.
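The voice dialing example can be modeled as a small finite-state grammar. The sketch below
is illustrative only: the state names, the "redial" command, and the helper functions are
assumptions rather than part of the ISD-SR3000 interface. It also counts the global words
toward perplexity, which is a choice made for this sketch.

```python
# Hypothetical topics: groups of words that can be activated together.
TOPICS = {
    "global": {"cancel", "help"},
    "commands": {"dial", "redial"},  # "redial" is an illustrative addition
    "digits": {"zero", "one", "two", "three", "four",
               "five", "six", "seven", "eight", "nine"},
}

# Topics opened in each state (the global topic is always added).
STATE_TOPICS = {"idle": ["commands"], "dialing": ["digits"]}

# Transitions triggered by a recognized word.
TRANSITIONS = {("idle", "dial"): "dialing"}


def active_words(state):
    """Return every word the recognizer may accept in this state."""
    words = set(TOPICS["global"])
    for topic in STATE_TOPICS[state]:
        words |= TOPICS[topic]
    return words


def perplexity(state):
    """Number of possible branches (allowable words) in this state."""
    return len(active_words(state))


def step(state, word):
    """Advance the grammar; words outside the active topics are ignored."""
    if word not in active_words(state):
        return state
    if word == "cancel":
        return "idle"
    return TRANSITIONS.get((state, word), state)
```

In this sketch the "dialing" state accepts ten digits plus two global words, so a digit
spoken in the "idle" state is simply not a candidate and cannot cause a substitution error.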