A Stimulus-Computable Model for Audiovisual Perception and Spatial Orienting in Mammals
Abstract
Animals excel at seamlessly integrating information from different senses, a capability critical for navigating complex environments. Despite recent progress in multisensory research, the absence of stimulus-computable perceptual models fundamentally limits our understanding of how the brain extracts and combines task-relevant cues from the continuous flow of natural multisensory stimuli. Here, we introduce an image- and sound-computable population model for audiovisual perception, based on biologically plausible units that detect spatiotemporal correlations across auditory and visual streams. In a large-scale simulation spanning 69 psychophysical, eye-tracking, and pharmacological experiments, our model replicates human, monkey, and rat behaviour in response to diverse audiovisual stimuli with an average correlation exceeding 0.97. Despite relying on only zero to four free parameters, our model provides an end-to-end account of audiovisual integration in mammals—from individual pixels and audio samples to behavioural responses. Remarkably, the population response to natural audiovisual scenes generates saliency maps that predict spontaneous gaze direction, Bayesian causal inference, and a variety of previously reported multisensory illusions. This study demonstrates that the integration of audiovisual stimuli, regardless of their spatiotemporal complexity, can be accounted for in terms of elementary joint analyses of luminance and sound level. Beyond advancing our understanding of the computational principles underlying multisensory integration in mammals, this model provides a bio-inspired, general-purpose solution for multimodal machine perception.
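The abstract does not give implementation details, but a minimal sketch may help convey the kind of correlation-detecting unit it describes. The Python snippet below is purely illustrative and is not the authors' model: the function names (`lowpass`, `mcd_unit`), the two-stage filtering scheme, and the time constants (`tau_fast`, `tau_slow`) are assumptions. It band-pass filters a visual luminance trace and an auditory sound-level envelope at a single spatial location and multiplies them, in the spirit of Hassenstein-Reichardt-style correlation detection.

```python
import numpy as np

def lowpass(x, tau, fs):
    """First-order low-pass (exponential) filter with time constant tau (s),
    applied to a 1-D signal x sampled at fs (Hz)."""
    alpha = 1.0 / (1.0 + tau * fs)   # EMA coefficient, dt / (tau + dt)
    y = np.empty(len(x), dtype=float)
    acc = 0.0
    for i, v in enumerate(x):
        acc += alpha * (v - acc)
        y[i] = acc
    return y

def mcd_unit(luminance, sound_level, fs, tau_fast=0.05, tau_slow=0.15):
    """Toy audiovisual correlation unit for one spatial location.

    luminance   : 1-D visual drive (e.g. local pixel luminance over time)
    sound_level : 1-D auditory envelope, sampled at the same rate fs
    Returns a signal that grows when the two streams co-modulate in time.
    """
    # Transient extraction: difference of a fast and a slow low-pass filter
    # (a crude band-pass), applied separately to each modality.
    v = lowpass(luminance, tau_fast, fs) - lowpass(luminance, tau_slow, fs)
    a = lowpass(sound_level, tau_fast, fs) - lowpass(sound_level, tau_slow, fs)
    # Multiplicative (Reichardt-style) correlation, then temporal smoothing.
    return lowpass(v * a, tau_slow, fs)

# Example: a flash and a beep that co-occur yield a strong response.
fs = 100                                   # Hz
t = np.arange(0, 2.0, 1.0 / fs)
flash = (np.abs(t - 1.0) < 0.05).astype(float)
beep = (np.abs(t - 1.0) < 0.05).astype(float)
print(mcd_unit(flash, beep, fs).max())
```

Tiling such units across pixel locations would yield a spatial map of audiovisual correlation; under this reading, the saliency maps and gaze predictions mentioned in the abstract correspond to reading out the peaks of that population response.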