Experience in applying additional sources of information to the problem of identifying and recognizing a speech answer in noisy conditions
D.N. Babin, I.L. Mazurenko, A.V. Urantsev
Faculty of Mechanics and Mathematics,
Department of Mathematical Theory of Intelligent Systems (MaTIS).
Traditional methods, which use a single microphone to input the speech
signal into a computer, do not provide high recognition reliability.
If the signal-to-noise ratio (SNR) is 0-6 dB, the probability of an error
even in detecting speech activity with a single microphone is typically
no less than 10% (as experiments and literature data show).
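For orientation, the SNR range quoted above can be converted to a ratio of powers. The sketch below uses the standard decibel definition; the function names are illustrative and do not come from the authors' work.

```python
import math

def snr_db(signal_power, noise_power):
    """Signal-to-noise ratio in decibels for two mean powers."""
    if signal_power <= 0 or noise_power <= 0:
        raise ValueError("powers must be positive")
    return 10.0 * math.log10(signal_power / noise_power)

def power_ratio(snr_in_db):
    """Inverse mapping: signal-to-noise power ratio for an SNR in dB."""
    return 10.0 ** (snr_in_db / 10.0)

# 0 dB: speech and noise have equal power.
# 6 dB: speech power is about four times the noise power.
for level in (0.0, 6.0):
    print(level, "dB ->", round(power_ratio(level), 2))
```

In other words, in the 0-6 dB range the noise carries between roughly a quarter of the speech power and all of it.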
We considered the problem of identifying a speech answer in noisy
conditions. To solve it, we tested several additional sources of
information: additional microphones, a photo-sensor, wind-sensors, and a
laryngophone.
At an SNR of 0-6 dB, a speech answer can be reliably separated from
typical situations that cause recognition errors, such as strong impulse
noise and even external speech. This is done with an additional closely
positioned microphone and two parameters: the energy of the difference
between the signals of the first and second microphones, and the
difference of the energies in the two microphones. A far-positioned
(third) microphone can be used to estimate the average level of external
noise. With a more complex microphone array it is possible to model a
directional microphone with a variable direction and, as a result, to
increase the SNR.
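The two microphone-pair parameters named above can be sketched for a single frame as follows. This is a minimal illustration, not the authors' implementation. The frame values are hypothetical, and it is an assumption of this sketch that distant noise reaches both microphones almost equally while the speaker's voice is much stronger in the main one.

```python
def energy(frame):
    """Sum of squared samples in one frame."""
    return sum(x * x for x in frame)

def two_mic_features(main, aux):
    """Return (energy of the difference signal, difference of energies)."""
    if len(main) != len(aux):
        raise ValueError("frames must have equal length")
    diff_signal = [a - b for a, b in zip(main, aux)]
    return energy(diff_signal), energy(main) - energy(aux)

# Hypothetical frames (assumed values, for illustration only).
noise_main = [0.5, -0.5, 0.5, -0.5]
noise_aux = [0.5, -0.5, 0.5, -0.5]
speech_main = [1.0, -1.0, 1.0, -1.0]
speech_aux = [0.25, -0.25, 0.25, -0.25]

print(two_mic_features(noise_main, noise_aux))    # (0.0, 0.0)
print(two_mic_features(speech_main, speech_aux))  # (2.25, 3.75)
```

Under these assumptions both parameters stay near zero for external noise and grow when the speaker talks. The thresholds that would separate the two cases are not given in the text.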
We also used headsets with a microphone, on which an infrared
light-emitting diode and a photodiode were mounted in order to measure the
coefficient of reflection of light from the speaker's lips. The signal
from the photo-sensor (corresponding to the degree of mouth opening) and
the derivative of this signal (corresponding to the speed of lip
movement) were used as the key characteristics for the identification
process. It also proved possible to distinguish lip movements typical of
speech production from occasional lip movements. The photo-sensor proved
to be an important source of information that is practically independent
of the level of external acoustic and light noise.
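The pair of photo-sensor characteristics described above (the signal level and its derivative) can be sketched as below. The derivative is approximated here by a first difference, which is an assumption of this sketch. The sample values and the 100 Hz sampling rate are hypothetical.

```python
def lip_features(samples, sample_rate_hz):
    """Return (level, rate) pairs: each sensor reading together with its
    first difference scaled to units per second."""
    if not samples:
        return []
    features = []
    previous = samples[0]
    for value in samples:
        rate = (value - previous) * sample_rate_hz
        features.append((value, rate))
        previous = value
    return features

# Hypothetical readings: the mouth opens and then starts to close.
readings = [0.0, 0.5, 1.0, 0.5]
for level, rate in lip_features(readings, 100):
    print(level, rate)
```

The first sample has no predecessor, so its rate is reported as zero. How the authors separated speech-related lip movement from occasional movement is not specified in the text.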
A low-frequency microphone and a temperature wind-sensor were used as
wind-sensors. Sensors of this type detect the fact of breathing. Moreover,
ordinary breathing can be separated from the breathing produced during
speech. The typical noise for this sensor is infrasound noise caused by
external air streams. The wind-sensor also allows us to identify the break
syllables in a spoken phrase, which can increase the reliability of speech
recognition.
The signal from the laryngophone is practically independent of external
acoustic noise and represents the speech signal to which a low-pass
filter with a cutoff frequency of 1-2 kHz has been applied. As a result,
essentially only the vowels are retained in the speech signal.
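To illustrate what filtering with an upper frequency of 1-2 kHz does to a signal, the sketch below applies a simple first-order low-pass filter to two test tones. The filter type, the 16 kHz sampling rate, and the 1.5 kHz cutoff are assumptions chosen for illustration. The text does not describe the actual frequency response of the laryngophone.

```python
import math

def one_pole_low_pass(samples, cutoff_hz, sample_rate_hz):
    """First-order low-pass filter: y[n] = y[n-1] + a * (x[n] - y[n-1])."""
    dt = 1.0 / sample_rate_hz
    rc = 1.0 / (2.0 * math.pi * cutoff_hz)
    a = dt / (rc + dt)
    out, y = [], 0.0
    for x in samples:
        y += a * (x - y)
        out.append(y)
    return out

def tone(freq_hz, sample_rate_hz, count):
    """Unit-amplitude sine wave of the given frequency."""
    return [math.sin(2.0 * math.pi * freq_hz * n / sample_rate_hz)
            for n in range(count)]

def energy(samples):
    """Sum of squared samples."""
    return sum(x * x for x in samples)

RATE = 16000     # assumed sampling rate, Hz
CUTOFF = 1500.0  # assumed cutoff inside the 1-2 kHz range, Hz

low = one_pole_low_pass(tone(200.0, RATE, 1600), CUTOFF, RATE)
high = one_pole_low_pass(tone(6000.0, RATE, 1600), CUTOFF, RATE)
print(energy(low) > energy(high))  # the 200 Hz tone keeps far more energy
```

Low-frequency components, which dominate voiced sounds such as vowels, pass almost unchanged, while high-frequency components are strongly attenuated.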
Each of these sources of information can be used independently to solve
the problem of identifying a speech answer, each with its own reliability.
However, the reliability of identification increases sharply when several
sensors of different kinds are used together. The reason is that the
typical noise for sensors of different types has a different nature, so
the probability of such noise appearing simultaneously is very low.
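The argument above can be made concrete under an idealized assumption: if the sensors are disturbed independently, the probability that all of them are fooled at once is the product of the individual probabilities. The per-sensor values below are hypothetical and are not measurements from the experiments.

```python
def joint_false_alarm(probabilities):
    """Probability that every sensor is disturbed at the same moment,
    assuming the disturbances are statistically independent."""
    result = 1.0
    for p in probabilities:
        if not 0.0 <= p <= 1.0:
            raise ValueError("each probability must lie in [0, 1]")
        result *= p
    return result

# Hypothetical per-sensor false-alarm probabilities (assumed values).
single = [0.1]
pair = [0.1, 0.1]
triple = [0.1, 0.1, 0.1]

for sensors in (single, pair, triple):
    print(len(sensors), "sensor(s):", joint_false_alarm(sensors))
```

Real sensors are rarely fully independent, so the product should be read as an optimistic bound. Requiring all sensors to agree also tends to increase the chance of missing a genuine speech answer, a trade-off this sketch does not model.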
In the authors' experiments the probability of an identification error
was reduced to 10^-3; that is, the reliability of speech answer
identification in noisy conditions was improved by a factor of 10-100.
The work was carried out at the Department of Mathematical Theory of
Intelligent Systems, Faculty of Mechanics and Mathematics, Moscow State
University.