Finding the Human Voice in AI: Insights on the Perception of AI-Voice Clones from Naturalness and Similarity Ratings

Abstract

AI-generated voice clones are important tools in language learning, audiobooks, and assistive technology, but they often struggle to replicate key prosodic features such as dynamic F0 variation. The impact of these differences on speech perception remains underexplored. To address this, we conducted two behavioural tasks evaluating listeners’ ratings of naturalness and similarity for human speech, three AI voice clones (ElevenLabs, StyleTTS-2, XTTS-v2), and a 30% F0 variation condition. ElevenLabs was rated comparably to human speech, while StyleTTS-2 and XTTS-v2 received lower ratings. Reduced F0 variation also led to lower ratings, suggesting that prosody is key to perceived naturalness and similarity. Listener ratings were further influenced by speaker accent and sex, but not by experience with AI tools. These findings suggest that prosodic features and speaker-specific characteristics could drive the varying performance of AI voice clones. This manuscript has been accepted at Interspeech 2025.
