Auditory-Visual Integration of Sine-Wave Speech
Abstract
It has long been known that observers use visual information from a talker’s face to supplement auditory input when understanding speech in situations where the auditory signal is compromised, such as in a noisy environment. However, researchers have demonstrated that even when the auditory signal is intact, a paired visual stimulus can give rise to a different percept than the auditory stimulus alone. McGurk and MacDonald (1976) demonstrated this when they discovered that when a person is presented with an auditory CV syllable (e.g., /ba/) paired with a visual speech stimulus (e.g., /ga/), the resulting percept is often a fusion of the two (e.g., /da/). This phenomenon can be observed with both degraded and non-degraded speech stimuli, suggesting that the integration is not simply a consequence of a poor auditory stimulus. However, other studies have shown that the normal acoustic speech signal is highly redundant, in the sense that it contains more information than is necessary for sound identification, and this redundancy may play an important role in auditory-visual integration. Shannon et al. (1995) reduced the spectral information in speech to one, two, three, or four bands of modulated noise, using the envelope of each spectral band of the original speech to modulate noise in the same band. Intelligibility remained very high with only three or four bands, suggesting that the normal speech signal carries a tremendous amount of redundancy. Similarly, Remez et al. (1981) reduced the speech signal to three time-varying sinusoids that matched the center frequencies and amplitudes of the first three formants of the natural speech signal. Again, intelligibility was high (when subjects were told that the sounds were, in fact, reduced human speech). A remaining question is whether reducing the redundancy in the auditory signal changes the auditory-visual integration process in quantitative or qualitative ways.
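The sine-wave reduction described above replaces the speech signal with a small number of sinusoids whose frequencies and amplitudes follow the formant tracks of the original utterance. A minimal sketch of that synthesis step is given below; the function name, the frame rate, and the formant tracks in the usage example are illustrative assumptions, not the actual stimuli or procedure used in the study.

```python
import numpy as np

def synthesize_sine_wave_speech(formant_tracks, frame_rate=100, sample_rate=16000):
    """Sum time-varying sinusoids that follow formant tracks.

    formant_tracks: list of (freqs, amps) pairs, one per formant, where
    freqs (Hz) and amps (linear) are per-frame arrays sampled at
    frame_rate frames per second. All tracks must have equal length.
    Returns the synthesized waveform at sample_rate.
    """
    n_frames = len(formant_tracks[0][0])
    n_samples = int(n_frames * sample_rate / frame_rate)
    t_frames = np.arange(n_frames) / frame_rate
    t_samples = np.arange(n_samples) / sample_rate

    signal = np.zeros(n_samples)
    for freqs, amps in formant_tracks:
        # Interpolate the frame-rate tracks up to the audio sample rate.
        f = np.interp(t_samples, t_frames, freqs)
        a = np.interp(t_samples, t_frames, amps)
        # Integrate instantaneous frequency to obtain a continuous phase,
        # so frequency changes do not introduce phase discontinuities.
        phase = 2 * np.pi * np.cumsum(f) / sample_rate
        signal += a * np.sin(phase)
    return signal

# Hypothetical half-second token: steady F1 and F3, falling F2.
n_frames = 50
f1 = (np.full(n_frames, 500.0), np.full(n_frames, 1.0))
f2 = (np.linspace(1500.0, 1200.0, n_frames), np.full(n_frames, 0.5))
f3 = (np.full(n_frames, 2500.0), np.full(n_frames, 0.25))
waveform = synthesize_sine_wave_speech([f1, f2, f3])
```

In practice the formant frequencies and amplitudes would be measured from natural speech (e.g., by formant tracking) rather than specified by hand as above.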
The present study addressed this issue by using sine-wave reductions of the auditory stimuli, as in Remez et al. (1981), with the addition of visual stimuli. Ten normal-hearing adult listeners were asked to identify speech syllables produced by five talkers, in which the auditory portions of the signals were degraded using sine-wave reduction. Participants were tested with four different sine-wave reductions: F0, F1, F2, and F0+F1+F2. Stimuli were presented under auditory-only, visual-only, and auditory-plus-visual conditions. Preliminary analysis showed very low performance in the auditory-only conditions for all of the sine-wave reductions, even F0+F1+F2. Visual-only performance was approximately 30%, consistent with previous studies. Little improvement was observed in the auditory-plus-visual condition, suggesting that this level of reduction removes so much auditory information that listeners cannot use the stimulus to achieve any meaningful audiovisual speech integration. These results have implications for the design of processors for assistive devices such as cochlear implants.