One of the primary research interests in our lab is speech perception. How do human listeners process the sound they hear and use it to understand spoken language? How do listeners deal with variability in speech, such as differences between speakers or across contexts? What are the neural correlates of speech processing in the brain? How should we design computer models of speech perception that reflect the behavior of human listeners?
A central theme in our lab's approach to studying these questions is to examine how listeners make use of fine-grained acoustic detail in the speech signal. The differences between many speech sounds occur on small temporal scales, on the order of milliseconds. For example, the difference between the English sounds "b" and "p" is primarily signaled by a timing difference in vocal tract activity. For "b" sounds, the mouth opens and the vocal folds begin vibrating at the same time, and for "p" sounds there is about a 40 ms delay between the two. This delay is called voice-onset time, or VOT. Sounds can also fall anywhere along a continuum between the two. Below is a series of spectrograms of the words "bet" and "pet" that gradually change from one sound to the other. Play the movie to hear the sounds played sequentially.
Generally, people hear a series of "bet"s followed by a series of "pet"s, though the ones in the middle may sound ambiguous. Are listeners sensitive to these small differences in acoustic information? We have been conducting experiments (with Joe Toscano) to answer this question, using eye-tracking and electrophysiological components in brain activity that correspond to these small differences in VOT.
Since listeners do seem to be sensitive to this fine-grained detail, a natural question is what they do with it. Work by Cheyenne Munson and David Gow is examining how people use fine-grained cues left in the signal by place assimilation to anticipate future words, and work with Jennifer Cole of the University of Illinois linguistics department is examining the kind of information that can be carried in vowels.
Another issue we are addressing in the MACLab is how listeners deal with the variability inherent in the speech signal. Certain characteristics of a particular speaker's voice can be used to identify him or her, but is this information also used for speech recognition? How do listeners deal with other types of variability, such as differences in speaking rate? One crucial strategy the system might take is to use the lexicon to help resolve perceptual ambiguity. For example, if a listener heard a word like "lab", but the initial sound was ambiguous between /l/ and /r/, they could use the fact that there is no word "rab" to deduce that they heard an /l/. All of these issues pose difficult problems for the speech system, and answering them allows us to form a more complete picture of how speech is processed.
Another type of variability that listeners must cope with arises when particular acoustic cues in speech are ambiguous. Using the same example of "bet" and "pet" described above, we can see that "bet" generally has a VOT close to 0 ms, while "pet" has a delay around 40 ms. What happens when listeners hear a 20 ms delay? Do they commit to a particular interpretation or can they revise their estimate after receiving additional information? We are conducting eye-tracking experiments using a garden path procedure to answer this question.
Finally, many cues in speech do not arrive at the same time. In the bet/pet example, the VOT occurs at the onset of the word, but the length of the following vowel is also an important cue. How does the listener deal with the fact that these do not appear at the same time? Does the listener wait until both are available, or make a provisional commitment and revise it later? Work on cue integration with colleagues from the University of Rochester is trying to answer this question.
Important distinctions in speech (e.g. /b/ vs. /p/, /d/ vs. /g/) are cued by a large number of very fine-grained sources of information in the speech signal. These cues are gradient in the sense that you can vary them in tiny steps. You can even "morph" a /b/ into a /p/ in small steps (as the figure above demonstrates with bet/pet). An important question in psycholinguistics is the degree to which listeners are sensitive to this fine-grained detail. That is, does it matter what type of /b/ it is, or do listeners throw out this detail once they know that it is a /b/? For a long time we thought the answer was the latter. Pioneering work by Alvin Liberman (Liberman, Harris, Hoffman & Griffith, 1957) demonstrated that listeners were poor at discriminating small differences between members of the same category (e.g. different versions of /b/) and good at discriminating those same small differences when they crossed a category boundary. The same pattern has been found in infants (Eimas, Siqueland, Jusczyk & Vigorito, 1971).
Work in the MACLab has been reassessing this. We started from the premise that such fine-grained detail may be useful (and at this point we've already demonstrated that it is). If so, it would be foolish for listeners to discard it; rather, they should take advantage of it. This suggests a way to measure listeners' sensitivity: put them in a naturalistic task that is most likely to engage their natural processes.
The Visual World Paradigm, developed by Mike Tanenhaus and colleagues (Tanenhaus, Spivey-Knowlton, Eberhard & Sedivy, 1995; Allopenna, Magnuson & Tanenhaus, 1998), is exactly this sort of task. Subjects see four objects on a normal computer screen representing possible interpretations of the stimulus. In our case, subjects saw a bear, a pear, and two unrelated objects, a lamp and a ship. Subjects hear an auditory stimulus (e.g. "bear") and click on the picture that matches. While they do this, we track where they are looking with a head-mounted eye-tracker. Since people tend to fixate a location before they start to move their mouse, where they look can tell us what they are thinking very early in the process. Moreover, making an eye-movement requires considerably less effort than moving the mouse (or pushing a button), so subjects are more likely to make eye-movements in response to sub-threshold perceptual processes.
In our experiments, subjects heard gradations between words like bear and pear (we manipulated VOT as described above), and we monitored where they looked while they categorized these words. While their ultimate response (the mouse click) was very categorical (something was either bear or pear with no gradations), their eye-movements reflected more gradiency. As the VOT of the target shifted from /b/ to /p/, subjects made increasingly more looks to the /p/ item, even if they still called the word /b/ in the end. This was described in McMurray, Tanenhaus & Aslin (2002; see also McMurray, Tanenhaus, Aslin & Spivey, 2003). Later work (McMurray, Aslin, Tanenhaus, Spivey & Subik, submitted) demonstrated that the lack of gradiency seen in other tasks may have been due to their "meta-linguistic" nature: when subjects were asked to think about the sound and make an explicit decision (e.g. does it start with a /b/ or a /p/?), the effects were reduced. Ongoing work is also extending this finding to different speech sounds including liquids (l/r), approximants (b/w) and vowels.
Collaborators
Michael Spivey (Cornell University)
Meghan Clayards (University of Rochester)
Michael Tanenhaus (University of Rochester)
Richard N. Aslin (University of Rochester)
Papers
Earlier work done by our lab has used eye-tracking techniques to show that listeners are sensitive to fine-grained differences in acoustic information, indicating that a great deal of information may be available for speech perception. How does the use of this acoustic information map on to neurophysiological processes? One way we can examine this question is to use event-related potential (ERP) techniques. ERPs are a measure of brain activity that can be obtained in human subjects using non-invasive electrodes attached to the head. Electrical activity produced by neurons in the brain can be detected at the scalp and recorded by these electrodes in real-time. This allows us to obtain a measure of neural activity that has a high temporal precision. This technique has been used to study a variety of cognitive and perceptual processes, and researchers have identified a number of components in ERP waveforms that can be used to understand these processes (REF).
We are using ERP techniques to look at how listeners use fine-grained acoustic differences in speech. Subjects are presented with a series of sounds that vary from one phonetic category to another. For example, the sound sample below varies in voice-onset time (VOT) which signals the difference between the words "dart" and "tart". Click the icon to listen.
In this experiment, subjects were presented with either "dart" or "tart" and asked to identify which word they heard. The stimuli in the range between the two endpoints were presented occasionally. This type of presentation, where one stimulus occurs frequently and others occur infrequently, produces a P3 component in the ERP waveform. We found that the size of the P3 decreased as the distance from the frequent stimulus (either "dart" or "tart") increased. This suggests that listeners are able to detect small changes in acoustic information and that this information persists to the point at which they are identifying words.
Future work on this project will be looking at similar effects in other ERP components, as well as examining other aspects of speech perception, such as allophonic variation and lexical activation.
Students
Joe Toscano
Collaborators
Steve Luck (UC Davis)
Toby Mordkoff (University of Iowa)
Presentations
Dennhardt, J., McMurray, B., Luck, S. J., and Toscano, J. C. (2006, June). Gradient effects of continuous acoustic detail revealed in event-related potentials. Poster presented at the 151st Meeting of the Acoustical Society of America, Providence, RI.
Certain sounds (those made with the tip of the tongue) undergo substantial modification in running speech. For example, the /n/ in green can be
produced as (and sound like) an /m/ when it is adjacent to other sounds that are produced with the lips (e.g. /b/). Thus, "green boat" can often be
produced as something like "greem boat." However, this transformation is typically not complete--the /n/ does not completely change into an /m/ as it
would if deliberately mispronounced. Rather, it has the articulatory and acoustic properties of both /n/ and /m/. This project seeks to understand how
the perceptual system compensates for this variation and may even make use of it to predict upcoming words. Project experiments involve adult listeners'
participation in an eye-tracking task. Results indicate that these listeners are able to use acoustic detail to help them predict the word they will
hear next. Different experiments in this project are designed to test how listeners' lexical knowledge interacts with the process of word recognition
and whether lexical competition slows the facilitative effect provided by assimilation. Additionally, we are interested in the precise timing of these
facilitative effects and in the acoustic cues that listeners pick up on in assimilated speech. We would like to identify these cues so that we can
create artificially synthesized assimilated speech.
Students
Cheyenne Munson
Collaborators
David Gow (Massachusetts General Hospital)
Vowel-to-vowel coarticulation
Speakers vary their vowel sounds when speaking, but rather than being random, these variations happen systematically depending on many contextual factors. The vowels that are pronounced in proximity to any given target vowel are one such factor. Vowel coarticulation shifts the formant values of a vowel away from those it has in a neutral context and towards those of the context vowels. In this study we are collaborating with Jennifer Cole, a linguist at the University of Illinois, to demonstrate the systematic production of vowel-to-vowel coarticulation across word boundaries. For instance, the vowel in "wet" will sound slightly different when pronounced in the phrase "wet eagle" than in the phrase "wet octopus" because of the different vowels at the beginning of "eagle" and "octopus." As in our research on place assimilation, we also hope to show that adult listeners are sensitive to this vowel-to-vowel coarticulation and that they can use this information to help them predict upcoming words in an eye-tracking task.
Students
Cheyenne Munson
Gary Linebaugh (University of Illinois)
Collaborators
Jennifer Cole (University of Illinois)
Papers
Cole, J., Linebaugh, G., Munson, C., and McMurray, B. (submitted) Vowel-to-vowel coarticulation across words in English: Acoustic evidence. Email for more details.
Students
Molly Robinson
Collaborators
Jim Magnuson (University of Connecticut)
Papers
Listeners must also cope with variations in speaking rate. If someone is speaking quickly or slowly, this will affect the properties of temporal cues used to perceive speech. For example, at fast speaking rates, VOTs tend to be shorter, and at slow speaking rates, they tend to be longer. This, in turn, affects how listeners categorize speech sounds spoken at different rates. The figure at the left shows two examples of the word "beach" spoken at fast (top) and slow (bottom) speaking rates.
How do listeners deal with these variations in speaking rate? One possibility is that they might adjust their use of other acoustic cues depending on the speaking rate. Using an eye-tracking procedure, we can examine how listeners use rate information as they hear sentences. By varying the speaking rate of the sentence, as well as other temporal cues in the signal, such as VOT and vowel length, we are able to look at how speaking rate affects the way listeners perceive speech sounds.
Students
Joe Toscano
Our work on gradiency has demonstrated that listeners are systematically sensitive to fine-grained detail in the speech signal, and that this can be used to anticipate upcoming material. However, are there other ways that this is beneficial for word recognition?
One example of such a benefit may derive from misperception or mispronunciation. Consider pairs of words such as barricade and parakeet. If barricade were mispronounced with a VOT that was ambiguous between /b/ and /p/, then the word would be ambiguous for quite some time (until the “ade” or “eet” was heard). In this case, a categorical system that was not sensitive to within-category detail would be forced to make an immediate decision that may, in fact, be wrong. Should this be the case (e.g. the system decided it was a /b/, and then ended up hearing “keet”), it would be difficult to deactivate the incorrect competitor and reactivate the correct target. On the other hand, if the system kept both alternatives active (to the degree they matched the input) until the resolving information arrived, the system would be in a good position to further activate the correct word. In fact, since the likelihood of misperceiving a token (or hearing a misproduced token) is related to distance from the category boundary, a system that maintained activation for both competitors in proportion to the VOT (as was found in McMurray et al., 2002) would be ideally suited to deal with this problem.
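The contrast between the two kinds of system can be sketched in a few lines of Python. This is only an illustration: the logistic mapping, the 20 ms boundary, and the slope are assumptions chosen for the example, not parameters from the experiments or from any published model.

```python
import math

# Hypothetical parameters, not taken from the experiments described here.
BOUNDARY_MS = 20.0   # assumed /b/-/p/ category boundary
SLOPE = 0.25         # assumed steepness of the graded mapping

def graded_p_activation(vot_ms):
    """Activation of the /p/-initial word as a logistic function of VOT."""
    return 1.0 / (1.0 + math.exp(-SLOPE * (vot_ms - BOUNDARY_MS)))

def categorical_p_activation(vot_ms):
    """All-or-none decision: /p/ at or above the boundary, otherwise /b/."""
    return 1.0 if vot_ms >= BOUNDARY_MS else 0.0

for vot in (25, 40):
    p = graded_p_activation(vot)
    # 1 - p is the activation left for the /b/-initial competitor.
    print(f"VOT {vot} ms: graded /p/ = {p:.2f}, graded /b/ = {1 - p:.2f}, "
          f"categorical /p/ = {categorical_p_activation(vot):.0f}")
```

Under these assumptions the categorical system treats 25 ms and 40 ms identically, while the graded system leaves more activation for the /b/-initial word at 25 ms, which is the situation in which later information should be easier to use.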
Ongoing work in the MACLab is testing this hypothesis by using pairs of words like barricade/parakeet and bassinet/passenger in a visual world task. Subjects hear tokens from a barricade/parricade continuum (or one of 9 other similar continua) and select the correct picture while eye-movements to the target (“barricade”) and competitor (“parakeet”) are monitored. In this case, a categorical system ought to treat a token of “parricade” that was at the category boundary no differently than a token of “parricade” that was far from the boundary: it would make a discrete decision in favor of a /p/ and then make a difficult recovery at the onset of the final vowel. A continuously sensitive system, on the other hand, would preserve activation for both barricade and parakeet and recover faster when the VOT was near the boundary than when it was far from it. Results favor the continuously sensitive model: recovery from the mispronunciation was faster when the VOT was near the boundary than when it was far from it.
These experiments are establishing that not only is lexical activation sensitive to acoustic detail, but also that this detail must be retained for a significant amount of time (across all of the stimuli, such information would need to be retained on average for 220 ms in order to improve word recognition). While this is a substantial amount of time (in the context of online spoken word recognition), one might still ask whether such detail can be retained indefinitely or whether there is a natural decay rate. Thus, follow-up work is examining sentential garden-paths to determine how long the system can retain this information.
Alumni Collaborators
Stephanie Huette
Collaborators
Michael Tanenhaus (University of Rochester)
Richard Aslin (University of Rochester)
Acoustic cue integration
In speech, individual sounds can often be ambiguous, which can make word recognition difficult. One way to cope with ambiguity is to take advantage of multiple sources of information in the sound signal, a process called cue integration. Multiple sources of acoustic information often indicate what phonetic category a sound belongs to. For example, there are at least 16 acoustic cues to the distinction between the words "rabid" and "rapid" (Lisker, 1978).
One of the questions our lab has been studying is how listeners combine these cues during perception. One approach we have taken is to use eye-tracking techniques to examine how individual acoustic cues are used. For example, the difference between the words "beach" and "peach" is signaled both by the voice-onset time (VOT) of the initial sound and by the vowel length (VL) in the word. By varying both of these cues in stimuli presented to subjects in an eye-tracking task, we can examine how they use the two cues.
In a typical experiment, a subject is presented with a display on a computer similar to the one shown here, with a picture of a beach, a peach, a lamp,
and a ship. The subject is instructed to click on one of the pictures while their eye movements are recorded with the eye-tracker. By averaging across
the patterns of eye movements from multiple trials, we can obtain the likelihood that the subject will look at a given object at each point in time. This
can be used to tell when each acoustic cue has an effect on the subject's perception of the word. We have found that listeners tend to use each acoustic
cue as it becomes available rather than waiting until they have heard the entire word to make a judgment about what the word is. This helps us to
understand when in language processing cue integration occurs.
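The averaging step described above can be sketched as follows. This is a minimal illustration with made-up data: the function name, the time-bin representation, and the four example trials are all assumptions for the sketch, not the lab's analysis code.

```python
from collections import Counter

def fixation_proportions(trials, objects):
    """Proportion of trials fixating each object in each time bin.

    trials:  list of equal-length sequences; each entry names the object
             fixated in that time bin, or None if no object was fixated.
    objects: names of the objects on the display.
    """
    n_bins = len(trials[0])
    curves = {obj: [] for obj in objects}
    for t in range(n_bins):
        # Count how many trials were on each object in this time bin.
        counts = Counter(trial[t] for trial in trials)
        for obj in objects:
            curves[obj].append(counts[obj] / len(trials))
    return curves

# Hypothetical data: 4 trials, 5 time bins each, target word "beach".
trials = [
    [None, "peach", "beach", "beach", "beach"],
    [None, None,    "beach", "beach", "beach"],
    [None, "beach", "peach", "beach", "beach"],
    [None, "lamp",  "beach", "peach", "beach"],
]
curves = fixation_proportions(trials, ["beach", "peach", "lamp", "ship"])
print(curves["beach"])  # [0.0, 0.25, 0.75, 0.75, 1.0]
print(curves["peach"])  # [0.0, 0.25, 0.25, 0.25, 0.0]
```

Plotting one such curve per object against time gives the kind of fixation-proportion plot from which the timing of each cue's effect can be read.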
Another approach we use to understand these problems is computational modeling. By creating a computer model of speech perception, we are able to better understand the mechanisms that underlie cue integration and other speech processes. Our model of cue integration works by weighting acoustic cues by their reliability and combining them to estimate which phonetic category was heard, such as the initial sound that distinguishes "beach" from "peach". By simulating these processes using a computer model, we can better understand how they are organized in human listeners.
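One common way to weight cues by reliability is an inverse-variance weighted average, sketched below. This is not necessarily how our model is implemented: the function name, the 0-to-1 evidence scale, and all of the numbers are assumptions chosen for illustration.

```python
def combine_cues(estimates, variances):
    """Inverse-variance weighted average of per-cue estimates.

    Each cue's reliability is taken to be 1 / variance, and the weights
    are the reliabilities normalized to sum to 1.
    """
    reliabilities = [1.0 / v for v in variances]
    total = sum(reliabilities)
    weights = [r / total for r in reliabilities]
    combined = sum(w * e for w, e in zip(weights, estimates))
    return combined, weights

# Hypothetical per-cue evidence on a 0-1 scale
# (0 = clearly "beach", 1 = clearly "peach").
vot_estimate, vot_variance = 0.8, 0.01   # VOT: assumed to be the more reliable cue
vl_estimate, vl_variance = 0.4, 0.04     # vowel length: assumed to be noisier

combined, weights = combine_cues(
    [vot_estimate, vl_estimate],
    [vot_variance, vl_variance],
)
print(f"weights = {weights}, combined estimate = {combined:.2f}")
```

With these made-up values the VOT cue receives four times the weight of vowel length, so the combined estimate sits much closer to what VOT alone suggests.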
Students
Joe Toscano
Marcus Galle
Collaborators
Richard Aslin (University of Rochester)
Michael Tanenhaus (University of Rochester)
Meghan Clayards (University of Rochester)
Papers
Toscano, J. C. and McMurray, B. (in preparation). Acoustic cue integration in natural speech.
Presentations