Chapter 5: Speech

From: Harnad, Stevan (harnad@cogsci.soton.ac.uk)
Date: Thu Mar 13 1997 - 14:12:56 GMT


(How) Is Speech Special?

Speech has both a production side and a perception side: we can hear
speech sounds and we can also produce them [but the same is true of
facial expressions and hand movements].

Human language (which is special; no other species has it) is spoken
[but there are also nonspoken sign languages].

A PHONEME is the smallest unit of spoken sound that marks a difference
in meaning: bit vs. pit, ball vs. bull, etc. Phonemes are the building
blocks of speech production and perception.

Phonemes vs. allophones: ton vs. turn:
Both vowels are allophones (positional variants) of the same phoneme.

Phonemes have features (a sketch in code follows this list):

(a) Voicing: "back" [voiced b] vs. "pack" [unvoiced p]

(b) Place of articulation: (1) lips (labial), (2) behind the teeth
(dental or alveolar), (3) soft palate at the back of the mouth (velar)

(c) Manner of articulation: plosives (ba, ta, ga), continuants (ma, la,
ra), fricatives (fa, tha), sibilants (sa, sha), vowels (a, e, o), etc.

Context effects:

Where is the "d" in bad?

What is the difference between "badboy" and "batboy"?

Fourier analysis: applies to any continuous curve whose shape repeats
itself over and over. No matter what the shape of the repeating
("periodic") part, it can be described, and reproduced, as a sum of
weighted sine waves of different frequencies (from low to high, which
is the same as from slow to fast).
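
Here is a minimal sketch of that idea in Python (the 100 Hz fundamental
and the square-wave target are arbitrary illustrative choices): a
periodic shape is rebuilt from a weighted sum of sine waves of
increasing frequency.

    # Fourier synthesis sketch: approximate a square wave by summing
    # its first few odd harmonics (the standard square-wave series).
    import numpy as np

    fs = 8000                       # sampling rate (Hz), arbitrary
    t = np.arange(0, 0.02, 1 / fs)  # 20 ms of signal
    f0 = 100                        # fundamental frequency (Hz)

    approx = np.zeros_like(t)
    for k in (1, 3, 5, 7, 9):       # odd harmonics, low to high
        approx += (4 / (np.pi * k)) * np.sin(2 * np.pi * k * f0 * t)

    target = np.sign(np.sin(2 * np.pi * f0 * t))  # the "true" square wave
    print("mean abs error:", np.mean(np.abs(approx - target)))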

The larynx produces sounds, and these are modified by the place and
manner in which they are articulated. Depending on the shape of the
articulators, some of the frequencies that make up the speech sound
"resonate" more than others.

A "spectrogram" is a visual version of a sound. Along the X axis is
time, and along the Y axis is frequency.
Loudness is represented as darkness in a spectrogram.
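
A minimal sketch of how such a spectrogram is computed (using scipy;
the test signal is synthetic, standing in for recorded speech): the
sound is cut into short windows, each window is Fourier-analysed, and
the energy is laid out with time on the X axis and frequency on the Y
axis.

    import numpy as np
    from scipy.signal import spectrogram

    fs = 8000
    t = np.arange(0, 1.0, 1 / fs)
    # toy "speech-like" signal: a 120 Hz buzz plus a sweeping resonance
    signal = (np.sin(2 * np.pi * 120 * t)
              + 0.5 * np.sin(2 * np.pi * (500 + 800 * t) * t))

    freqs, times, power = spectrogram(signal, fs=fs, nperseg=256)
    print(power.shape)  # rows = frequency bins (Y), columns = time frames (X)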

What is found is that there are two main frequency bands (dark
regions). The lower one is called the first "formant" and the higher
one is called the second "formant."

Most of the interesting activity occurs in the second formant.

Voicing (voice onset) is easy to see in a sound spectrogram as the
shape of the second formant transition.

Place (ba, da, ga) is also visible in the 2nd formant.

But this visualisation of the "shape" is not constant: it varies with
how fast we talk, how high or low our voices are, and most
complicatedly of all, it varies depending on the "context" of our
utterance: the phoneme that occurred just before, and the phoneme that
is about to follow the phoneme we are visualising. Context effects
are also called "coarticulation" effects, to emphasise that they are
based on what is pronounced (articulated) together.

There have been many theories of how we perceive phonemes.

The locus theory explains context/coarticulation effects purely
in terms of the locations of articulatory movements when
coarticulating: The muscular "commands" issued by the brain
for each phoneme are invariant, according to this theory, and
local context effects are just the result of combining these
unique commands in different ways.

"Look-ahead" theory: We can perceive the preparatory movements
as much as 7 phonemes in advance.

Neural networks (recurrent ones, feeding back some of their
output to their input) capture some of the features of
context and coarticulation effects.
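
A minimal sketch of such a recurrent (Elman-style) network in Python
(sizes and random weights are purely illustrative): part of the hidden
state is fed back on the next time step, so the response to the same
input differs depending on what came just before.

    import numpy as np

    rng = np.random.default_rng(0)
    n_in, n_hidden, n_out = 5, 8, 3               # arbitrary sizes
    W_in = rng.normal(size=(n_hidden, n_in))
    W_rec = rng.normal(size=(n_hidden, n_hidden)) # feedback ("context") weights
    W_out = rng.normal(size=(n_out, n_hidden))

    def run(sequence):
        h = np.zeros(n_hidden)            # context starts empty
        outputs = []
        for x in sequence:                # one feature vector per time step
            h = np.tanh(W_in @ x + W_rec @ h)
            outputs.append(W_out @ h)
        return outputs

    # the same input vector produces different outputs in different contexts
    seq = [rng.normal(size=n_in) for _ in range(4)]
    print(run(seq)[-1])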

The Motor Theory of Speech Perception:

We can tell which phonemes we are hearing no matter how much they are
distorted by context, because the sound possibilities are constrained by
what our articulatory organs can do: We perceive the sound in terms
of what we would have had to do to produce it.

Besides explaining context effects, the motor theory explained
the sharp boundaries between certain sounds: Spectrograms
of synthesised speech show that in between /ba/ and /pa/ there
is a continuum, and that subjects can tell apart pairs of sounds
along that continuum much more accurately when the sounds are
on either side of the /ba-pa/ boundary than when they are both
on the same side, even though the physical differences are the same.

This effect is called "categorical perception" (CP) and
is measured by first testing identification and then discrimination.
Discrimination is tested using the "ABX" paradigm: A and B
are always different, and then X is either A or B. (Substitute
different ba's and pa's for A and B to see what the subjects had to do.)
ABX discrimination was better across the phoneme category
boundary than within.
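
A minimal simulation of that ABX logic (not real data; the 25 ms
voice-onset-time boundary is an assumed illustrative value): a listener
who labels each sound and compares labels discriminates across-boundary
pairs far better than equally spaced within-category pairs.

    import random

    BOUNDARY = 25          # assumed /ba/-/pa/ voice-onset-time boundary, ms

    def label(vot_ms):
        return "ba" if vot_ms < BOUNDARY else "pa"

    def abx_trial(a, b):
        x = random.choice((a, b))
        # listener answers via category labels; guesses when labels match
        if label(a) != label(b):
            guess = a if label(x) == label(a) else b
        else:
            guess = random.choice((a, b))
        return guess == x

    def accuracy(a, b, trials=10000):
        return sum(abx_trial(a, b) for _ in range(trials)) / trials

    print("across boundary (20 vs 30 ms):", accuracy(20, 30))  # near 1.0
    print("within category (30 vs 40 ms):", accuracy(30, 40))  # near 0.5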

The same is true for the synthesised sound continuum underlying
changes in the place of articulation (ba/da/ga or pa/ta/ka).

Motor theory accounted for a lot of the context effects and for CP, but
it ran into trouble when it was found that preverbal infants perceive
sounds the same way we do. Motor theorists then suggested that this may
be because the motor mechanisms for speech perception were so important
for our species that we evolved a pre-wiring for it: it did not have to
be learned anew by every generation.

But then it was found that chinchillas perceived voicing changes
(ba-pa) categorically, the same way we do. So the motor theory is not
likely to be the explanation for that: Another theory suggests
that speech simply took advantage of an auditory sensitivity boundary
that was already there in our ancestors before speech.

There is also evidence that at least for some context effects,
neither the motor theory nor the inborn sensitivity theory
is needed, because recurrent neural nets can learn to recognise
speech context-independently.

So it is likely that a combination of these theories --
locus, motor theory, inborn sensitivities, and learned
categories -- will be part of the full explanation of
speech perception and production.


