Speech perception is complex, technical, and esoteric. While this may appeal to some folks (Linux users, mostly), this glossary is intended for the rest of us. Speech perception is foundational in Cognitive Science, and ideas like embodiment, connectionism, and dynamical systems often appear in some of their earliest forms in speech perception. This glossary, then, is intended to help you ease into many of the foundational papers in speech perception. If there are further terms that trouble you, please email Bob McMurray with suggestions.
The cochlea is arrayed as a tonotopic map such that the hair cells deep in the cochlea respond to low-frequency sounds and those on the outside respond more to high-frequency sounds. Unfortunately, it is not scaled linearly. That is, there is not an X-to-1 mapping between frequencies and hair cells. Instead, there are a lot more hair cells devoted to low frequencies, and a lot fewer at high frequencies. For example, at low frequencies, there may be hair cells devoted to each frequency (100 Hz, 101 Hz, 102 Hz...) while at high frequencies there may be a bigger span between cells (e.g. 10000 Hz, 10050 Hz, 10100 Hz...). (This is obviously an oversimplification.)
Bark scaling is a way to rescale frequency such that the distances between low-frequency sounds are larger, and those between high-frequency sounds are smaller, to better reflect the natural scaling of the auditory system. That is, it reflects the fact that to the auditory system a 10 Hz difference between two low-frequency sounds is more important than a 10 Hz difference between two high-frequency sounds. It is roughly logarithmic and is based on critical bands of hearing identified by psychophysics. It's not the only scaling function that has been proposed; the Mel and ERB scales do similar things.
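To make the rescaling concrete, here is a minimal sketch of a Hz-to-Bark conversion. It assumes Traunmüller's (1990) closed-form approximation, which is only one of several published formulas for the Bark scale; the function name `hz_to_bark` is made up for this example.

```python
def hz_to_bark(f_hz: float) -> float:
    """Approximate Bark value for a frequency in Hz.

    Assumption: Traunmueller's (1990) formula; other approximations exist.
    """
    return 26.81 * f_hz / (1960.0 + f_hz) - 0.53


# The same 10 Hz step, taken low and high in the frequency range.
low_step = hz_to_bark(110.0) - hz_to_bark(100.0)
high_step = hz_to_bark(10010.0) - hz_to_bark(10000.0)

print(f"10 Hz step near 100 Hz:   {low_step:.4f} Bark")
print(f"10 Hz step near 10000 Hz: {high_step:.4f} Bark")
```

Under this approximation the low-frequency step covers far more of the Bark scale than the high-frequency step, which is the point of the rescaling.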
In speech perception, discrimination tasks are often used to determine if a listener thinks two sounds are the same or different. In work on categorical perception (CP), in particular, the question is whether they are same or different at any level, not same or different with respect to category membership. So, we're really asking about the ability of the sensory code to distinguish stimuli.
There is a rich history of such tasks developed in psychophysics, and they are usually named with some incomprehensible combination of letters that will give you a clue to the order of the stimuli and the type of judgement that must be made. Below is a summary of several of the most popular:
1) The AX task is your classic same-different task. The subject hears two sounds (A and X) and has to decide if they match or not. This is a terrible task for a number of reasons. First, it uses a totally subjective criterion -- some people might decide that any difference at all is enough to make them say "different"; other people might decide that only a difference in category membership counts. Second, you have to make sure that half the trials are same trials, and this means you throw away a lot of data (since you typically care about the differences, and hence the different trials).
2) The ABX task was most famously used in Liberman et al.'s (1957) seminal paper on categorical perception. In this task, participants hear three stimuli in a row. The first two (A and B) are always different, and the third one (X) matches one of them. The subject's job is to answer #1 or #2 (depending on which one matches). Of course, if they cannot distinguish A and B, they should be at chance.
There are a number of nice advantages to this task. First, since A and B always differ, this eliminates any throw-away trials, so you get to use all the data. Second, the criterion has to be an exact match. One of the trials, for example, could be ba0 ba10 ba0 (where the number indicates the VOT). All three are ba's, but the right answer is A (since the third stimulus matches the one in the A position). So this task implicitly sets the criterion for the subject, and it clearly can be done on the basis of auditory/sensory codes alone.
However, there are two big disadvantages. First, it's a weird task -- children and impaired listeners may have a hard time doing it. Second, it has a high memory load. In order to do the task (at least under some strategies), you have to keep both the A and B stimuli in memory in order to compare them to X. This can make it more a measure of what sensory differences people remember, rather than what they perceive.
3) The 4IAX task was pioneered by Pisoni and colleagues in the 70s. It consists of two pairs of stimuli: one pair is the same and one pair is different (on some level). You answer #1 or #2 to indicate which pair was different. For example: for bp bb, the correct answer would be #1; for pp pb, the correct answer is #2. This, like the ABX task, forces subjects to pay attention to whatever differences are there (the criterion is not subjective). It also has memory demands, though these are not as heavy as in the ABX task, since it can be done as simply two AX tasks (which only require you to retain one item in memory).
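As an illustration of how the ABX design uses every trial, here is a rough sketch of building and scoring ABX trials from a VOT continuum. The continuum values, the one-step spacing between A and B, and the function names are all hypothetical choices for the example, not a description of any particular study.

```python
import random


def make_abx_trials(vots, step, rng):
    """Build ABX trials in which A and B are always `step` items apart."""
    trials = []
    for i in range(len(vots) - step):
        pair = (vots[i], vots[i + step])
        for a, b in (pair, pair[::-1]):      # both presentation orders
            for answer in (1, 2):            # X matches A (#1) or B (#2)
                x = a if answer == 1 else b
                trials.append({"A": a, "B": b, "X": x, "answer": answer})
    rng.shuffle(trials)
    return trials


def proportion_correct(trials, responses):
    """Score responses (1 or 2) against the correct answer for each trial."""
    hits = sum(t["answer"] == r for t, r in zip(trials, responses))
    return hits / len(trials)


vots = [0, 10, 20, 30, 40]                   # hypothetical VOT continuum (ms)
trials = make_abx_trials(vots, step=1, rng=random.Random(0))

# A listener who cannot tell A from B can only guess, so should be near 0.5.
guesser = random.Random(1)
guesses = [guesser.choice((1, 2)) for _ in trials]
print(len(trials), proportion_correct(trials, guesses))
```

Because A and B differ on every trial, every trial contributes to the discrimination score; chance performance is 50%.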
The amplitude envelope of a sound describes how the loudness of that sound changes over time. For example, a sound like /wa/ starts fairly quiet and gradually grows in amplitude over about 50-100 ms, while a /ba/ starts out loud and stays that way over that same time period. You can see the amplitude envelope when you look at the raw waveform of a sound (and squint a bit).
Here, for example, in this waveform of the word beach, you can see that the amplitude envelope features a sudden onset of energy, followed by a sustained period and then a gradual fall-off into the closure. There is a short period of silence, and then a period of lower-intensity energy.
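The "squinting" can be approximated in code. The sketch below assumes one simple way to estimate an envelope (rectify the waveform, then smooth it with a moving average); other methods exist, and the synthetic signals are stand-ins for a gradual /wa/-like onset and an abrupt /ba/-like onset, not real recordings.

```python
import numpy as np


def amplitude_envelope(signal, sample_rate, window_ms=10.0):
    """Rectify the waveform and smooth it with a moving average."""
    win = max(1, int(sample_rate * window_ms / 1000.0))
    kernel = np.ones(win) / win
    return np.convolve(np.abs(signal), kernel, mode="same")


sample_rate = 16000
t = np.arange(int(0.2 * sample_rate)) / sample_rate
carrier = np.sin(2 * np.pi * 150 * t)          # hypothetical 150 Hz "voice"

gradual = carrier * np.minimum(t / 0.1, 1.0)   # ramps up over 100 ms
abrupt = carrier.copy()                        # full amplitude from the start

env_gradual = amplitude_envelope(gradual, sample_rate)
env_abrupt = amplitude_envelope(abrupt, sample_rate)

i20 = int(0.020 * sample_rate)                 # sample index at 20 ms
print(env_gradual[i20], env_abrupt[i20])
```

Early in the sound the two envelopes differ sharply; after the ramp has finished they are the same.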
When reading a spectrogram, a formant refers to a band of energy visible during the voiced portions of a word. Formants are usually numbered from the bottom, such that the formant with the lowest frequency is F1, the next lowest is F2, and so forth. Pitch (or the fundamental frequency), while not visible on the spectrogram, is often termed F0, though it is technically not a formant.
Formants are the result of the filtering of the laryngeal sound source by the articulators. For most vowels, only F1 and F2 are needed for decent classification of height and backness (F3 participates in roundedness, and length is also involved). Place of articulation is carried by F2 and F3, and F1 is weakly correlated with voicing and manner. F4 and F5 are not often examined phonetically, and though there are higher formants, they are rarely discussed.
When we take the spectrum of a piece of a sound, we first have to cut out just the time window we want. For example, in the recording of "shell" below, we might want to take just a slice of the /sh/.
When we do this, it results in an artificially steep amplitude envelope at the cut points. That is, the sound level jumps suddenly from 0 to something quite a bit more. This can cause distortions in the spectrum, as these quick jumps would require lots of different frequencies to approximate (many of which aren't really in the signal).
To cope with this, the waveform is multiplied by a Hamming window. A Hamming window simply makes the edges of the sound quieter and keeps the center at full volume. It looks something like the filter displayed below.
When this is multiplied by the waveform we get the same basic waveform but the amplitude envelope is smoother now, as seen below.
The spectra of this slice are shown below both without the Hamming window (top) and with it (bottom). There are pretty minimal differences for this particular sound file, but there is somewhat more harmonicity in the windowed slice, while the unwindowed sound has energy spread out over more frequencies (i.e., noise).
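The effect of windowing is easier to see with a pure tone than with a fricative. The following is a small sketch using NumPy's `np.hamming` and `np.fft.rfft`; the 1010 Hz tone, the 512-sample slice, and the 500 Hz "far from the tone" cutoff are arbitrary assumptions chosen for illustration.

```python
import numpy as np

sample_rate = 16000
n = 512
t = np.arange(n) / sample_rate

# Hypothetical slice: a 1010 Hz tone that does not complete a whole number
# of cycles in the slice, so the cut points introduce abrupt jumps.
segment = np.sin(2 * np.pi * 1010 * t)

window = np.hamming(n)          # quiet at the edges, near 1.0 in the center
windowed = segment * window

freqs = np.fft.rfftfreq(n, d=1 / sample_rate)
raw_power = np.abs(np.fft.rfft(segment)) ** 2
win_power = np.abs(np.fft.rfft(windowed)) ** 2

# Share of spectral energy more than 500 Hz away from the tone itself.
far = np.abs(freqs - 1010) > 500
raw_leak = raw_power[far].sum() / raw_power.sum()
win_leak = win_power[far].sum() / win_power.sum()

print(f"energy far from the tone, no window:      {raw_leak:.5f}")
print(f"energy far from the tone, Hamming window: {win_leak:.5f}")
```

The unwindowed slice smears more of its energy onto frequencies that are not really in the signal; the windowed slice keeps it closer to the tone, at the cost of a slightly wider peak.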
Hyper-speech, also known as clear speech, is a mode of speaking in which the speaker carefully enunciates, resulting in more extreme articulations (e.g. the tongue tip for a /t/ is even more fronted, the tongue blade for an /a/ is farther back and lower), a slower rate of speech, and generally better perception on the part of perceivers. Hyper-speech is often studied in the "clear speech" paradigm, but is also relevant to work on speaking rate (slower speech is an example of hyper-speech), and infant-directed speech is often seen as an example of hyper-speech.
Hypo-speech is the opposite: speech that is fast and not clearly articulated. In hypo-speech, most of the vowels will be reduced to schwas, and those that are not fully reduced will have less extreme gestures/formant frequencies. In addition, word-medial stops may become flaps or taps, fricatives are shortened, and so forth.
Loci and Locus Equations are used to describe the formant changes that cue place of articulation. Some of the earliest work with spectrograms revealed that formant frequencies at syllable onset were important for distinguishing place of articulation, but there was no one-to-one mapping between a given formant frequency and the place of articulation. This is seen clearly in this diagram from Delattre showing the range of formants that can give rise to a /d/ in various vowel contexts.
This led to the idea that it was the frequency at which the formant appears to have originated that was the underlying cue to place of articulation. This is shown clearly in the figure below (also from Delattre): /b/ has a relatively low origin for F2 (the transitions are rising even when the steady state is quite low, as in /u/); /g/ has a high origin (it is always falling even when the steady state is very high, as in /i/); and /d/ is in between.
This intuitively made a lot of sense, but it was fairly useless as a perceptual theory for one simple reason: there isn't any energy at the locus frequency. That is, there is nothing to hear! Loci are sort of a cruel trick of the eyes -- we look at these spectrograms and see a pattern, but in fact there is not an actual cue present.
Locus equations are a method developed to estimate the locus from the existing perceptual information. Basically, a regression line is estimated connecting the frequency of the formant at syllable onset and the frequency at the centroid of the vowel. This line will presumably pass through the locus and can then be used to estimate it. Since it is based on observable events, this seemed like a good way to get the same information. There have been numerous studies on locus equations as descriptors for place of articulation (particularly in stop consonants), and Harvey Sussman has a number of important papers fleshing this out.
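A locus equation can be sketched as an ordinary linear regression. The example below makes several assumptions: the F2 values are invented rather than measured, the regression predicts F2 at onset from F2 in the vowel across several vowel contexts, and the locus is estimated as the frequency at which onset and vowel F2 would be equal (intercept / (1 - slope)), which is one common way of reading a locus off the fitted line.

```python
import numpy as np

# Hypothetical F2 measurements (Hz) for one stop consonant before six vowels.
f2_vowel = np.array([2300.0, 2000.0, 1700.0, 1300.0, 1000.0, 900.0])
f2_onset = np.array([2000.0, 1900.0, 1750.0, 1600.0, 1500.0, 1450.0])

# Fit F2_onset = slope * F2_vowel + intercept.
slope, intercept = np.polyfit(f2_vowel, f2_onset, deg=1)

# Assumed locus estimate: the point where the fitted line crosses
# F2_onset == F2_vowel, i.e. a transition that would be perfectly flat.
locus = intercept / (1.0 - slope)

print(f"slope = {slope:.3f}, intercept = {intercept:.0f} Hz")
print(f"estimated locus = {locus:.0f} Hz")
```

The slope itself is informative: a slope near 0 means the onset frequency barely changes with the vowel, while a slope near 1 means the onset simply follows the vowel.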
See "Regression techniques for categorical data"
P(erceptual) Center. When a listener has to align (in time) a series of rhyming words, it turns out that simply getting the onsets of those words in perfect rhythm sounds bad -- it doesn't sound rhythmic. Rather, the correct alignment point is somewhere in the middle of the first syllable (depending on the phonological structure of that syllable). This "p-center" is sort of the place where the perception is tagged in time, and it isn't usually the same as the most salient acoustic events (e.g. the onset). It's been used to argue for gestural or motor accounts since the P-center might line up with the peak of the gesture. More importantly, the P-center is a nice demonstration of a dissociation between what's actually in the acoustics and what people seem to hear.
Speech perception research often uses independent variables that are continuous (like voice onset time, or VOT), but dependent variables that are categorical or binary (e.g. whether the subject said that the sound was /b/ or /p/). This is typically visualized using identification functions (below) in which the probability of choosing one response is plotted as a function of the continuous independent variable.
In statistics at large, linear regression is often used when the independent variables are continuous. In this case, we might develop a regression equation to predict p(/p/) from VOT:
p( /p/ ) = M * VOT + B.
However, this function is unbounded -- if the VOT were too large or too small, it could give us a value that was greater than 1 or less than 0, an illegal probability.
A number of non-linear regression techniques have been developed to cope with this. Rather than assuming that the underlying function is linear, these assume a sigmoid.
Sigmoids are bounded at two values. They start at a fixed lower asymptote, and make a single transition to an upper asymptote at a crossover point along the X axis.
Logistic regression (models of this kind are also commonly referred to as logit models) uses one such sigmoidal function. It uses the formula
p( /p/ ) = 1 / (1 + exp( -(M*VOT + B)))
(note the linear term in the denominator). There is also a version called multinomial logistic regression that can handle cases in which the dependent variable can have more than two outcomes (e.g. if participants have to choose between three or four categories).
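The contrast between the linear and logistic formulas above can be checked numerically. In this sketch the values of M and B are hypothetical, chosen only so that the numbers are easy to read; the crossover point follows from setting the linear term M*VOT + B to zero.

```python
import math


def linear_p(vot, m, b):
    """Linear prediction: unbounded, so it can leave the 0-1 range."""
    return m * vot + b


def logistic_p(vot, m, b):
    """Logistic prediction: the same linear term, squashed into 0-1."""
    return 1.0 / (1.0 + math.exp(-(m * vot + b)))


M, B = 0.4, -8.0            # hypothetical slope and intercept
boundary = -B / M           # VOT at which p(/p/) = 0.5

for vot in (-50, 0, 20, 40, 100):
    print(vot, round(linear_p(vot, 0.02, 0.1), 3), round(logistic_p(vot, M, B), 3))
```

With these made-up parameters the linear function produces "probabilities" below 0 and above 1 at the extreme VOTs, while the logistic stays between 0 and 1 and crosses 0.5 at the boundary (20 ms here).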
Probit models are similar and use the probit function which has a similar form. Its mathematics is different (it's based on the integral of a Gaussian) but functionally it can be used the same way.
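To see how similar the two functions are in practice, here is a brief sketch that computes the probit curve as the cumulative Gaussian (via `math.erf`) and compares it with a logistic curve. The rescaling constant 1.702 is a conventional value for lining the two curves up, and the grid of z values is arbitrary.

```python
import math


def probit_p(z):
    """Cumulative standard Gaussian: the integral of the bell curve up to z."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def logistic_p(z):
    """Standard logistic function of the same linear term z."""
    return 1.0 / (1.0 + math.exp(-z))


# z stands for the linear term (M*VOT + B); compare the curves on a grid.
zs = [i / 100.0 for i in range(-500, 501)]
max_gap = max(abs(probit_p(z) - logistic_p(1.702 * z)) for z in zs)

print(f"largest difference after rescaling: {max_gap:.4f}")
```

After rescaling the slope, the two sigmoids differ by less than one percentage point anywhere along the continuum, which is why they can be used interchangeably for most purposes.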
Occasionally sigmoidal functions are referred to as ogives, apparently named for a similarly shaped part of a cathedral. In this case, the probit function is called the "normal ogive" (as the normal distribution is another name for the Gaussian on which probit analysis is based).
This is a phonetic feature describing the two major classes of fricatives. S, sh, z, and zh (as in Jacques) are sibilants and are characterized by the presence of high-frequency energy. Th (both the voiced and voiceless versions), f, and v are non-sibilants.
If we plot the spectrum of a sound (particularly fricatives) we often see clusters of energy around certain frequencies. See the below example of the spectrum of an /s/. The spectral moments are a way to describe that. If we visualize the frequency spectrum as sort of a Gaussian distribution or bell curve, the first spectral moment is the mean, that is, the frequency at which the bulk of the energy is found. The difference between /s/ and /sh/ is largely in spectral mean, as the energy of an /s/ is concentrated at higher frequencies than that of an /sh/. The second spectral moment is the variance, that is, how widely the energy is dispersed. The difference between an /sh/ and an /h/ is perhaps in the second moment, as an /h/ has energy everywhere, while for an /sh/ it is more tightly clustered. The third moment is the skew, indicating whether the distribution is symmetric or skewed toward higher or lower frequencies, and the fourth moment is the kurtosis, or how peaky it is.
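The four moments can be computed by treating the normalized spectrum as a probability distribution over frequency. The sketch below assumes that approach; the two "spectra" are synthetic bell curves standing in for an /s/-like and an /sh/-like fricative (their centers and widths are invented), and the kurtosis reported is the plain, non-excess version, which is about 3 for a Gaussian shape.

```python
import numpy as np


def spectral_moments(freqs, power):
    """Mean, variance, skew, and kurtosis of a power spectrum."""
    p = power / np.sum(power)                  # normalize so weights sum to 1
    mean = np.sum(freqs * p)
    var = np.sum((freqs - mean) ** 2 * p)
    sd = np.sqrt(var)
    skew = np.sum((freqs - mean) ** 3 * p) / sd ** 3
    kurt = np.sum((freqs - mean) ** 4 * p) / sd ** 4
    return mean, var, skew, kurt


freqs = np.linspace(0.0, 11000.0, 1101)        # 10 Hz steps

# Hypothetical spectra: energy centered high (/s/-like) vs lower (/sh/-like).
s_like = np.exp(-0.5 * ((freqs - 7000.0) / 800.0) ** 2)
sh_like = np.exp(-0.5 * ((freqs - 3500.0) / 800.0) ** 2)

s_mean, s_var, s_skew, s_kurt = spectral_moments(freqs, s_like)
sh_mean, sh_var, sh_skew, sh_kurt = spectral_moments(freqs, sh_like)

print(f"/s/-like:  mean {s_mean:.0f} Hz, sd {np.sqrt(s_var):.0f} Hz")
print(f"/sh/-like: mean {sh_mean:.0f} Hz, sd {np.sqrt(sh_var):.0f} Hz")
```

For real fricatives the spectrum would come from a windowed slice of the recording, and the skew and kurtosis would generally not be as tidy as they are for these symmetric toy shapes.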
A tonotopic map refers to an arrangement of neurons (either actual neurons or in a connectionist model) such that the spatial position corresponds to frequency. For example, the left-most neurons may respond to low frequencies and the right-most to high frequencies. As such, which neurons are firing can code which frequency is being heard.
Inside the cochlea, hair cells (the receptor cells that respond when they vibrate, coding sound) are arranged tonotopically such that the hair cells in the interior respond to low frequencies and those on the exterior respond to high frequencies. This tonotopicity is preserved through the auditory pathway and into auditory cortex.
This refers to a unit of sound proposed by Ohala that describes the transition between two phonemes. Given the lack of invariance due to coarticulation, one possibility is that rather than looking for fixed features for phonemes, characterizing the fundamental units in terms of transitions might be more useful. This is quite similar to using diphones (pairs of phonemes) as the fundamental units. For whatever reason, transemes are not often discussed anymore, and diphones are more common when people need to talk about this type of representation.