
Computational models of speech perception and learning
One of the key challenges in describing the perceptual and cognitive systems that underlie spoken language is understanding how a particular stimulus produces a particular behavior. Computational models offer a tool for doing this. By building a model of the speech system, or of one of its components, we can better understand how sounds are classified into meaningful categories.
We use several different types of models to study speech, including neural network and statistical models. Two of the specific problems I am working on are unsupervised learning and cue weighting.
Unsupervised learning
The languages of the world use acoustic information in many different ways, and, as a result, different languages have different categories of speech sounds. For example, Japanese has a single category on the R/L dimension, while English has two. Similarly, English has only two categories of voicing (e.g. B and P), while Thai has three. In the process of learning language, one of the first tasks encountered by infants is to determine which sounds form different categories in their language and which do not. Careful measurement of the acoustic cues used in speech has revealed that there are statistical properties in the sound signal that reflect the categories of a given language. A number of researchers have suggested that infants may keep track of this information and use it to learn speech categories.
We have modeled this process using a type of computational model called a mixture of Gaussians (MOG). The movie below shows one of these models learning the voicing categories of English. The model starts out with a number of categories placed at random along the dimension, since, like infants, it does not know how many categories its native language has along this dimension. The model is presented with voice onset time (VOT) values and adjusts the parameters of its categories using unsupervised learning and competition between the categories. With enough exposure, the model determines the correct number of categories in the language and their properties. This model provides us with a solution to the problem of acquiring categories through unsupervised learning. In addition, the model demonstrates that statistical learning and competition are sufficiently powerful mechanisms for acquiring speech categories, and it allows us to examine the process of speech development over time.
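To make the idea concrete, here is a rough sketch of unsupervised category learning with a one-dimensional mixture of Gaussians. It is not the learning rule used in our model: as a stand-in, it uses batch expectation-maximization and approximates competition by dropping any category whose mixing proportion falls below a threshold. The function names, the threshold, and the simulated VOT distributions (means of 0 and 50 ms) are illustrative assumptions, not measured values.

```python
import math
import random


def gaussian_pdf(x, mu, sigma):
    """Density of a normal distribution with mean mu and SD sigma at x."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))


def fit_mog(data, k=10, iters=100, prune_below=0.02, min_sigma=1.0, seed=0):
    """Fit a 1-D mixture of Gaussians, starting from k randomly placed categories.

    Returns a list of (phi, mu, sigma) tuples sorted by mean.
    Hypothetical helper: batch EM plus pruning, used here in place of the
    incremental, competitive learning described in the text.
    """
    rng = random.Random(seed)
    lo, hi = min(data), max(data)
    cats = [(1.0 / k, rng.uniform(lo, hi), max((hi - lo) / k, min_sigma))
            for _ in range(k)]
    for _ in range(iters):
        n = len(cats)
        r_sum, x_sum, xx_sum = [0.0] * n, [0.0] * n, [0.0] * n
        # E-step: share each token among categories by posterior probability.
        for x in data:
            likes = [phi * gaussian_pdf(x, mu, sigma) for phi, mu, sigma in cats]
            total = sum(likes)
            if total == 0.0:  # token too far from every category; skip it
                continue
            for j, like in enumerate(likes):
                r = like / total
                r_sum[j] += r
                x_sum[j] += r * x
                xx_sum[j] += r * x * x
        mass = sum(r_sum)
        if mass == 0.0:
            break
        # Crude competition: categories that claim too little data are removed.
        # The threshold is capped so the strongest category always survives.
        threshold = min(prune_below, max(r_sum) / mass)
        kept = [j for j in range(n) if r_sum[j] / mass >= threshold]
        kept_mass = sum(r_sum[j] for j in kept)
        # M-step: re-estimate the surviving categories.
        new_cats = []
        for j in kept:
            mu = x_sum[j] / r_sum[j]
            var = max(xx_sum[j] / r_sum[j] - mu * mu, 0.0)
            sigma = max(math.sqrt(var), min_sigma)
            new_cats.append((r_sum[j] / kept_mass, mu, sigma))
        cats = new_cats
    return sorted(cats, key=lambda c: c[1])


def simulate_vot(n_per_category=500, seed=1):
    """Draw made-up VOT values (ms) from two illustrative voicing categories."""
    rng = random.Random(seed)
    voiced = [rng.gauss(0.0, 5.0) for _ in range(n_per_category)]
    voiceless = [rng.gauss(50.0, 10.0) for _ in range(n_per_category)]
    return voiced + voiceless


if __name__ == "__main__":
    for phi, mu, sigma in fit_mog(simulate_vot()):
        print(f"phi={phi:.2f}  mean={mu:6.1f} ms  sd={sigma:5.1f} ms")
```

How many categories survive in this sketch depends on the random starting positions and the pruning threshold, so it should be read as an illustration of the general approach rather than a reproduction of the model's behavior.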
More information: McMurray, Aslin, & Toscano (2009). Developmental Science
Cue weighting
In addition to learning individual acoustic cues, we would also like to know how children and adults learn to weight and combine multiple acoustic cues in order to perceive speech. One possibility is that the weight assigned to each cue is determined by its reliability: more reliable cues would be weighted higher, and less reliable ones would be weighted lower. We have developed a MOG model using this approach to determine cue weights. The model is able to learn the approximate weights for a variety of acoustic cues occurring in different contexts and different languages.
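As a simple illustration of reliability-based weighting, the sketch below scores each cue by how well its two categories are separated (the distance between category means divided by their pooled standard deviation) and normalizes the scores into weights. This particular reliability measure is an assumption chosen for illustration, not necessarily the metric used in our model, and the category parameters are made-up numbers.

```python
import math


def reliability(mu_a, sd_a, mu_b, sd_b):
    """Separation of two categories along one cue (a d'-like score)."""
    pooled_sd = math.sqrt((sd_a ** 2 + sd_b ** 2) / 2.0)
    return abs(mu_a - mu_b) / pooled_sd


def cue_weights(cues):
    """Turn per-cue reliability scores into weights that sum to 1."""
    scores = {name: reliability(*params) for name, params in cues.items()}
    total = sum(scores.values())
    if total == 0.0:  # no cue separates the categories; weight them equally
        return {name: 1.0 / len(scores) for name in scores}
    return {name: score / total for name, score in scores.items()}


# Illustrative (mean, SD, mean, SD) values in ms for a voiced and a
# voiceless category on each cue; these are assumptions, not measurements.
CUES = {
    "vot": (0.0, 5.0, 50.0, 10.0),
    "vowel_length": (180.0, 30.0, 170.0, 30.0),
}

if __name__ == "__main__":
    for name, weight in cue_weights(CUES).items():
        print(f"{name}: {weight:.2f}")
```

With parameters like these, a cue whose categories overlap heavily receives a much smaller weight than a cue whose categories are well separated.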
The model has also provided us with clues about why certain acoustic cues may be weighted differently than we would expect based on their reliability. For example, listeners use vowel length information for word-initial voicing judgments, but the statistical reliability of this cue is very low. The MOG model weights this cue similarly to human listeners. Since learning in the model is unsupervised, it does its best to approximate the distribution of the inputs. However, through learning, the model arrives at a set of categories along the vowel length dimension that is more robust than the categories in the input. Human listeners may overweight certain cues in speech for the same reason. This suggests that listeners may not be perfectly optimal in their use of perceptual cues and that learning may play a crucial role in determining how they weight cues.
More information: