Speech Perception

One of the primary research interests in our lab is speech perception. How do human listeners process the sound they hear and use it to understand spoken language? How do listeners deal with variability in speech, such as differences between speakers or between contexts? What are the neural correlates of speech processing? How should we design computer models of speech perception that reflect the behavior of human listeners?

A central theme in our lab's approach to studying these questions is to examine how listeners make use of fine-grained acoustic detail in the speech signal. The differences between many speech sounds occur on small temporal scales, on the order of milliseconds. For example, the difference between the English sounds "b" and "p" is primarily signaled by a timing difference in vocal tract activity. For "b" sounds, the mouth opens and the vocal folds begin vibrating at the same time, while for "p" sounds there is about a 40 ms delay between the two. This delay is called voice-onset time, or VOT. Sounds can also fall anywhere along the continuum between the two. Below is a series of spectrograms of the words "bet" and "pet" that gradually change from one sound to the other. Play the movie to hear the sounds played sequentially.

Generally, people hear a series of "bet"s followed by a series of "pet"s, though the ones in the middle may sound ambiguous. Are listeners sensitive to these small differences in acoustic information? We have been conducting experiments (with Joe Toscano) to answer this question, using eye-tracking and electrophysiological components of brain activity to measure whether listeners are sensitive to small differences in VOT.

Since listeners do seem to be sensitive to this fine-grained detail, a natural question is what they do with it. Work by Cheyenne Munson and David Gow is examining how people use fine-grained cues left in the signal by place assimilation to anticipate upcoming words, and work with Jennifer Cole of the University of Illinois linguistics department is examining the kind of information that can be carried in vowels.

Another issue we are addressing in the MACLab is how listeners deal with the variability inherent in the speech signal. Certain characteristics of a particular speaker's voice can be used to identify him or her, but is this information also used for speech recognition? How do listeners deal with other types of variability, such as differences in speaking rate? One crucial strategy the system might take is to use the lexicon to help resolve perceptual ambiguity. For example, if a listener heard a word like "lab" but the initial sound was ambiguous between /l/ and /r/, they could use the fact that there is no word "rab" to deduce that they heard an /l/. All of these issues pose difficult problems for the speech system, and answering them allows us to form a more complete picture of how speech is processed.

Another type of variability that listeners must cope with arises when particular acoustic cues in speech are ambiguous. Using the same example of "bet" and "pet" described above, we can see that "bet" generally has a VOT close to 0 ms, while "pet" has a delay around 40 ms. What happens when listeners hear a 20 ms delay? Do they commit to a particular interpretation or can they revise their estimate after receiving additional information? We are conducting eye-tracking experiments using a garden path procedure to answer this question.

Finally, many cues in speech do not arrive at the same time. In the bet/pet example, the VOT occurs at the onset of the word, but the length of the following vowel is also an important cue. How does the listener deal with the fact that these cues do not arrive at the same time? Do they wait until they have both? Or do they make a provisional commitment and revise it? Work on cue integration with colleagues from the University of Rochester is trying to answer this question.

Speech Perception Projects


Continuous acoustic detail

Important distinctions in speech (e.g. /b/ vs. /p/, /d/ vs. /g/) are cued by a large number of very fine-grained sources of information in the speech signal. These cues are gradient in the sense that you can vary them in tiny steps. You can even "morph" a /b/ into a /p/ in small steps (as the figure above demonstrates with bet/pet). An important question in psycholinguistics is the degree to which listeners are sensitive to this fine-grained detail. That is, does it matter what type of /b/ it is, or once listeners know that it is a /b/ do they throw out this detail? For a long time we thought the answer was the latter. Pioneering work by Alvin Liberman (Liberman, Harris, Hoffman & Griffith, 1957) demonstrated that listeners were poor at discriminating small differences between members of the same category (e.g. different versions of /b/) and good at discriminating those same small differences when they crossed a category boundary. The same pattern has been found in infants (Eimas, Siqueland, Jusczyk & Vigorito, 1971).

Work in the MACLab has been reassessing this. We started from the premise that such fine-grained detail may be useful (and at this point we've already demonstrated that it is). If so, it would be foolish for listeners to discard it; rather, they should take advantage of it. This suggests a way to measure listeners' sensitivity: put them in a naturalistic task that is most likely to engage their natural processes.

The Visual World Paradigm, developed by Mike Tanenhaus and colleagues (Tanenhaus, Spivey-Knowlton, Eberhard & Sedivy, 1995; Allopenna, Magnuson & Tanenhaus, 1998), is exactly this sort of task. Subjects see four objects on a normal computer screen representing possible interpretations of the stimulus. In our case, subjects saw a bear, a pear, and two unrelated objects, a lamp and a ship. Subjects hear an auditory stimulus (e.g. "bear") and click on the picture that matches. While they do this, we track where they are looking with a head-mounted eye-tracker. Since people tend to fixate a location before they start to move their mouse, where they look can tell us what they are thinking very early in the process. Moreover, making an eye movement requires considerably less effort than moving the mouse (or pushing a button), so subjects are more likely to make eye movements in response to sub-threshold perceptual processes.

In our experiments, subjects heard gradations between words like bear and pear (we manipulated VOT as described above), and we monitored where they looked while they categorized these words. While their ultimate response (the mouse click) was very categorical (something was either bear or pear with no gradations), their eye movements reflected more gradiency. As the VOT of the target shifted from /b/ to /p/, subjects made increasingly more looks to the /p/ item, even if they still called the word /b/ in the end. This was described in McMurray, Tanenhaus & Aslin (2002; see also McMurray, Tanenhaus, Aslin & Spivey, 2003). Later work (McMurray, Aslin, Tanenhaus, Spivey & Subik, submitted) demonstrated that the lack of gradiency seen in other tasks may have been due to their "meta-linguistic" nature: when subjects were asked to think about the sound and make an explicit decision (e.g. does it start with a /b/ or a /p/?), the effects were reduced. Ongoing work is also extending this finding to different speech sounds including liquids (l/r), approximants (b/w) and vowels.
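
The contrast between a categorical response and a gradient one can be illustrated with a small sketch. This is not the lab's analysis, only an illustration under stated assumptions: the function name p_response, the logistic form, and both slope values are hypothetical, and the 20 ms boundary is simply the midpoint of the 0 ms and 40 ms endpoints used in the examples on this page.

```python
import math

def p_response(vot_ms, boundary_ms=20.0, slope=1.0):
    """Logistic preference for /p/ at a given VOT (hypothetical model)."""
    return 1.0 / (1.0 + math.exp(-slope * (vot_ms - boundary_ms)))

# A 9-step VOT continuum from /b/ (0 ms) to /p/ (40 ms); step size is assumed.
continuum = [0, 5, 10, 15, 20, 25, 30, 35, 40]

# Steep slope: responses flip abruptly at the boundary (categorical pattern).
steep = [p_response(v, slope=1.0) for v in continuum]

# Shallow slope: preference changes a little with every step (gradient pattern).
shallow = [p_response(v, slope=0.1) for v in continuum]

for vot, s, g in zip(continuum, steep, shallow):
    print(f"VOT {vot:2d} ms  steep={s:.3f}  shallow={g:.3f}")
```

With the steep function, every token on the /b/ side of the boundary is treated almost identically; with the shallow function, tokens closer to the boundary show a measurably stronger pull toward /p/, which is the kind of within-category difference a gradient measure can pick up.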


Collaborators
Michael Spivey (Cornell University)
Meghan Clayards (University of Rochester)
Michael Tanenhaus (University of Rochester)
Richard N. Aslin (University of Rochester)

Papers



ERPs in speech perception

Earlier work done by our lab has used eye-tracking techniques to show that listeners are sensitive to fine-grained differences in acoustic information, indicating that a great deal of information may be available for speech perception. How does the use of this acoustic information map onto neurophysiological processes? One way we can examine this question is to use event-related potential (ERP) techniques. ERPs are a measure of brain activity that can be obtained in human subjects using non-invasive electrodes attached to the head. Electrical activity produced by neurons in the brain can be detected at the scalp and recorded by these electrodes in real time. This allows us to obtain a measure of neural activity with high temporal precision. This technique has been used to study a variety of cognitive and perceptual processes, and researchers have identified a number of components in ERP waveforms that can be used to understand these processes (REF).

We are using ERP techniques to look at how listeners use fine-grained acoustic differences in speech. Subjects are presented with a series of sounds that vary from one phonetic category to another. For example, the sound sample below varies in voice-onset time (VOT) which signals the difference between the words "dart" and "tart". Click the icon to listen.

In this experiment, subjects were presented with either "dart" or "tart" and asked to identify which word they heard. The stimuli in the range between the two endpoints were presented occasionally. This type of presentation, where one stimulus occurs frequently and others occur infrequently, produces a P3 component in the ERP waveform. We found that the size of the P3 decreased as the distance from the frequent stimulus (either "dart" or "tart") increased. This suggests that listeners are able to detect small changes in acoustic information and that this information persists to the point at which they are identifying words.

Future work on this project will be looking at similar effects in other ERP components, as well as examining other aspects of speech perception, such as allophonic variation and lexical activation.


Students
Joe Toscano
Collaborators
Steve Luck (UC Davis)
Toby Mordkoff (University of Iowa)

Presentations
Dennhardt, J., McMurray, B., Luck, S. J., and Toscano, J. C. (2006, June). Gradient effects of continuous acoustic detail revealed in event-related potentials. Poster presented at the 151st Meeting of the Acoustical Society of America, Providence, RI.



Place assimilation


Certain sounds (those made with the tip of the tongue) undergo substantial modification in running speech. For example, the /n/ in green can be produced as (and sound like) an /m/ when it is adjacent to other sounds that are produced with the lips (e.g. /b/). Thus, "green boat" can often be produced as something like "greem boat." However, this transformation is typically not complete: the /n/ does not completely change into an /m/ as it would if deliberately mispronounced. Rather, it has the articulatory and acoustic properties of both /n/ and /m/. This project seeks to understand how the perceptual system compensates for this variation and may even make use of it to predict upcoming words. In the project's experiments, adult listeners take part in an eye-tracking task. Results indicate that these listeners are able to use acoustic detail to help them predict the word they will hear next. Different experiments in this project are designed to test how listeners' lexical knowledge interacts with the process of word recognition and whether lexical competition slows the facilitative effect provided by assimilation. Additionally, we are interested in the precise timing of these facilitative effects and in the acoustic cues that listeners are picking up on in assimilated speech. We would like to identify these cues so that we can create artificially synthesized assimilated speech.


Students
Cheyenne Munson
Collaborators
David Gow (Massachusetts General Hospital)
Papers
Gow, D. W., and McMurray, B. (in press) Word recognition and phonology: The case of English coronal place assimilation. Papers in Laboratory Phonology 9.

Presentations
Munson, C., McMurray, B., and Gow, D.W. (2006, June) Lexical influences on the progressive facilitation during perception of assimilated speech. Paper presented at the 151st meeting of the Acoustical Society of America. Providence, RI.



Vowel-to-vowel coarticulation

Speakers vary their vowel sounds when speaking, but rather than being random, these variations happen systematically depending on many contextual factors. The vowels that are pronounced in proximity to any given target vowel are one such factor. Vowel coarticulation shifts the formant values of a vowel away from those it has in a neutral context and towards those of the context vowels. In this study we are collaborating with Jennifer Cole, a linguist at the University of Illinois, to demonstrate the systematic production of vowel-to-vowel coarticulation across word boundaries. For instance, the vowel in "wet" will sound slightly different when pronounced in the phrase "wet eagle" than in the phrase "wet octopus" because of the different vowels at the beginning of "eagle" and "octopus." As in our research on place assimilation, we also hope to show that adult listeners are sensitive to this vowel-to-vowel coarticulation and that they can use this information to help them predict upcoming words in an eye-tracking task.


Students
Cheyenne Munson
Gary Linebaugh (University of Illinois)
Collaborators
Jennifer Cole (University of Illinois)

Papers
Cole, J., Linebaugh, G., Munson, C., and McMurray, B. (submitted) Vowel-to-vowel coarticulation across words in English: Acoustic evidence. Email for more details.




Speaker voice (indexical) information in word recognition

A large body of research by David Pisoni, Steve Goldinger and Ann Bradlow has shown that when listeners process spoken words they are sensitive to the speaker's voice. That is, if you've heard a speaker say a specific word once, you are faster to recognize it in the same voice later than if it is said in a different voice. This effect could arise from two different sources. First, it is possible that the way we represent words is sensitive to the indexical properties of a speaker's voice. For example, if we literally kept little recordings of each person saying a word in our head, we'd be faster to recognize words that had recordings that matched. However, an alternative is that words are represented more abstractly (e.g. as a string of phonemes or letters) and that the word recognition system learns to recognize each voice. Indexical effects would then arise from the fact that you are well-attuned to speakers you are familiar with. Using the visual world paradigm, we are currently undertaking a series of experiments to try to tease apart these two hypotheses.

Students
Molly Robinson



Lexical Feedback and Compensation for Coarticulation

Classic research has demonstrated that listeners use their knowledge of what is and is not a word to help decode speech. For example, Ganong (1980) showed that when people were given a sound that was ambiguous between /d/ and /t/, they heard a /d/ when it began "duke" vs. "tuke", while that same sound appended to "oot" ("doot" vs. "toot") was heard as a /t/. A critical question is at what level this happens. Are subjects really hearing a /t/ or are they simply making an educated guess after the fact? To test this, Jim Magnuson and I have been using the compensation for coarticulation (CfC) paradigm. This paradigm allows us to measure the Ganong effect (the ability of lexical knowledge to influence phoneme decisions) indirectly, in a way that subjects are not aware of. Thus, if a Ganong effect is found, it cannot be due to post-hoc guessing (spoiler: it wasn't). This can be modeled using the TRACE model of word recognition, which suggests that as lexical units are building evidence they are simultaneously feeding back to affect phoneme decisions.

Collaborators
Jim Magnuson (University of Connecticut)

Papers



Speaking rate normalization

Listeners must also cope with variations in speaking rate. If someone is speaking quickly or slowly, this will affect the properties of temporal cues used to perceive speech. For example, at fast speaking rates, VOTs tend to be shorter, and at slow speaking rates, they tend to be longer. This, in turn, affects how listeners categorize speech sounds spoken at two different rates. The figure at the left shows two examples of the word "beach" spoken at fast (top) and slow (bottom) speaking rates.

How do listeners deal with these variations in speaking rate? One possibility is that they might adjust their use of other acoustic cues depending on the speaking rate. Using an eye-tracking procedure, we can examine how listeners use rate information as they hear sentences. By varying the speaking rate of the sentence, as well as other temporal cues in the signal, such as VOT and vowel length, we are able to look at how speaking rate affects the way listeners perceive speech sounds.


Students
Joe Toscano



Commitment and Ambiguity in Garden-Path Stimuli

Our work on gradiency has demonstrated that listeners are systematically sensitive to fine-grained detail in the speech signal, and that this can be used to anticipate upcoming material. However, are there other ways that this is beneficial for word recognition?

One example of such a benefit may derive from misperception or mispronunciation. Consider pairs of words such as barricade and parakeet. If barricade were mispronounced with a VOT that was ambiguous between /b/ and /p/, then the word would be ambiguous for quite some time (until the “ade” or “eet” was heard). In this case, a categorical system that was not sensitive to within-category detail would be forced to make an immediate decision that may, in fact, be wrong. Should this be the case (e.g. the system decided it was a /b/, and then ended up hearing “keet”), it would be difficult to deactivate the incorrect competitor and reactivate the correct target. On the other hand, if the system kept both alternatives active (to the degree they matched the input) until the resolving information arrived, the system would be in a good position to further activate the correct word. In fact, since the likelihood of misperceiving a token (or hearing a misproduced token) is related to distance from the category boundary, a system that maintained activation for both competitors in proportion to the VOT (as was found in McMurray et al., 2002) would be ideally suited to deal with this problem.

Ongoing work in the MACLab is testing this hypothesis by using pairs of words like barricade/parakeet and bassinet/passenger in a visual world task. Subjects hear tokens from a barricade/parricade continuum (or one of 9 other similar continua) and select the correct picture while eye movements to the target (“barricade”) and competitor (“parakeet”) are monitored. In this case, a categorical system ought to treat a token of “parricade” that was at the category boundary no differently than a token of “parricade” that was far from the boundary: it would make a discrete decision in favor of a /p/ and then make a difficult recovery at the onset of the final vowel. A continuously sensitive system, on the other hand, would preserve activation for both barricade and parakeet and recover faster when the VOT was near the boundary than when it was far from it. Results favor the continuously sensitive model: recovery from the mispronunciation was faster when the VOT was near the boundary than when it was far from it.

These experiments are establishing that not only is lexical activation sensitive to acoustic detail, but also that this detail must be retained for a significant amount of time (across all of the stimuli, such information would need to be retained on average for 220 ms in order to improve word recognition). While this is a substantial amount of time (in the context of online spoken word recognition), one might still ask whether such detail can be retained indefinitely or whether there is a natural decay rate. Thus, follow-up work is examining sentential garden-paths to determine how long the system can retain this information.


Alumni Collaborators
Stephanie Huette
Collaborators
Michael Tanenhaus (University of Rochester)
Richard Aslin (University of Rochester)

Papers
McMurray, B., Tanenhaus, M.K., and Aslin, R.N. (submitted) Gradient sensitivity to sub-phonemic detail facilitates lexical ambiguity resolution.

Presentations
McMurray, B., Tanenhaus, M., and Aslin, R.N. (2006, June) Garden-path phenomena in spoken word recognition: Gradient sensitivity to continuous acoustic detail facilitates ambiguity resolution. Paper presented at the 151st meeting of the Acoustical Society of America. Providence, RI.



Acoustic cue integration

In speech, individual sounds can often be ambiguous, which can make word recognition difficult. One way to cope with ambiguity is to take advantage of multiple sources of information in the sound signal, a process called cue integration. Multiple sources of acoustic information often indicate which phonetic category a sound belongs to. For example, there are at least 16 acoustic cues to the distinction between the words "rabid" and "rapid" (Liberman, 1978).

One of the questions our lab has been studying is how listeners combine these cues during perception. One approach we have taken is to use eye-tracking techniques to examine how individual acoustic cues are used. For example, the difference between the words "beach" and "peach" is signaled both by the voice-onset time (VOT) of the initial sound and by the vowel length (VL) in the word. By varying both of these cues in stimuli presented to subjects in an eye-tracking task, we can examine how they use the two cues.

In a typical experiment, a subject is presented with a display on a computer similar to the one shown here, with a picture of a beach, a peach, a lamp, and a ship. The subject is instructed to click on one of the pictures while their eye movements are recorded with the eye-tracker. By averaging across the patterns of eye movements from multiple trials, we can obtain the likelihood that the subject will look at a given object at each point in time. This can be used to tell when each acoustic cue has an effect on the subject's perception of the word. We have found that listeners tend to use each acoustic cue as it becomes available rather than waiting until they have heard the entire word to make a judgment about what the word is. This helps us to understand when in language processing cue integration occurs.
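
The averaging step described above can be sketched in a few lines. This is a minimal illustration, not the lab's analysis pipeline: the function name fixation_proportions, the representation of a trial as a list of fixated-object labels per time bin, and the four example trials are all hypothetical.

```python
from typing import List, Sequence

def fixation_proportions(trials: Sequence[Sequence[str]], target: str) -> List[float]:
    """Proportion of trials in which `target` is fixated, for each time bin.

    Each trial is a sequence of labels, one per time bin, naming the object
    being looked at in that bin (assumed data format).
    """
    if not trials:
        return []
    # Only use time bins that every trial has data for.
    n_bins = min(len(trial) for trial in trials)
    return [
        sum(1 for trial in trials if trial[b] == target) / len(trials)
        for b in range(n_bins)
    ]

# Made-up example: four trials, four time bins each.
trials = [
    ["none", "beach", "beach", "peach"],
    ["none", "peach", "peach", "peach"],
    ["none", "none",  "peach", "peach"],
    ["none", "beach", "peach", "peach"],
]

peach_curve = fixation_proportions(trials, "peach")
beach_curve = fixation_proportions(trials, "beach")
print(peach_curve)
print(beach_curve)
```

Plotting curves like these separately for each step of a VOT or vowel-length continuum is one way to see the point in time at which a given cue starts to change where subjects look.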

Another approach we use to understand these problems is computational modeling. By creating a computer model of speech perception, we are able to better understand the mechanisms that underlie cue integration and other speech processes. Our model of cue integration works by weighting acoustic cues by their reliability and combining them to estimate which phonetic category was heard, such as the initial sound that distinguishes "beach" from "peach". By simulating these processes using a computer model, we can better understand how they are organized in human listeners.
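
As a rough illustration of what weighting cues by their reliability can mean, the sketch below combines two cue-based estimates using inverse-variance weights. Treating reliability as inverse variance is an assumption made here for illustration, not a description of the lab's model; the function name combine_cues and all of the numbers are hypothetical.

```python
from typing import Sequence

def combine_cues(estimates: Sequence[float], variances: Sequence[float]) -> float:
    """Reliability-weighted average of per-cue estimates.

    Each estimate is one cue's evidence for /p/ on a 0-to-1 scale, and each
    variance says how noisy that cue is. Less noisy cues get more weight
    (assumed here: weight = 1 / variance).
    """
    if len(estimates) != len(variances) or not estimates:
        raise ValueError("need one variance per estimate")
    weights = [1.0 / v for v in variances]
    total = sum(weights)
    return sum(w * e for w, e in zip(weights, estimates)) / total

# Made-up example: VOT suggests /p/ fairly strongly and is the more reliable
# cue; vowel length leans weakly towards /b/ and is noisier.
vot_estimate, vot_variance = 0.8, 1.0
vl_estimate, vl_variance = 0.4, 4.0

combined = combine_cues([vot_estimate, vl_estimate], [vot_variance, vl_variance])
print(round(combined, 2))
```

Because VOT is given the smaller variance in this example, the combined estimate stays closer to what VOT alone suggests than to what vowel length alone suggests.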


Students
Joe Toscano
Marcus Galle
Collaborators
Richard Aslin (University of Rochester)
Michael Tanenhaus (University of Rochester)
Meghan Clayards (University of Rochester)

Papers
Toscano, J. C. and McMurray, B. (in preparation). Acoustic cue integration in natural speech.

Presentations
