Skilful voice impersonators are able to fool state-of-the-art speaker recognition systems, as these systems are generally not yet effective at recognising voice modifications, according to new research from the University of Eastern Finland. The vulnerability of speaker recognition systems poses significant security concerns.
Nowadays, mobile devices are increasingly equipped with applications that are operated by voice commands. The user is able to dictate messages, translate phrases and run search queries by voice alone. The widespread use of electronic services has increased the demand for applications that use voice to recognise the speaker, either for authentication purposes or for public safety. However, as voice applications grow in popularity, their misuse may also increase.
Voice attacks against speaker recognition can be carried out by technical means, such as voice conversion, speech synthesis and replay attacks. The scientific community is systematically developing techniques and countermeasures against technically generated attacks. However, voice modifications produced by a human, such as impersonation and voice disguise, cannot be easily detected with the countermeasures developed so far.
Voice impersonation is common in the entertainment industry, where professionals and amateurs are able to copy the voice characteristics and speech behaviour of other speakers, usually public figures. A simpler form of voice modification is voice disguise, in which speakers modify their voices to avoid being recognised as themselves. The latter type of modification is common in situations that do not require face-to-face communication, and may range from innocent prank calls to crimes such as blackmail or threatening calls. Consequently, there is growing interest in improving the robustness of speaker recognition against human-induced voice modifications.
The study analysed speech from two professional impersonators who mimicked eight Finnish public figures. In addition, the voice disguise part of the study included acted speech from 60 Finnish speakers who participated in two recording sessions. The speakers were asked to modify their voices to fake their age, attempting to sound like an old person and like a child. The study found that the impersonators were able to fool both automatic systems and listeners when mimicking some of the speakers. In the case of acted speech, a successful strategy for voice modification was to sound like a child, as the performance of both automatic systems and listeners degraded with this type of disguise.
R. González Hautamäki, M. Sahidullah, T. Kinnunen and V. Hautamäki, “Acoustical and perceptual study of voice disguise by age modification in speaker verification”, Speech Communication, 95, 1–15 (2017). https://doi.org/10.1016/j.
The doctoral dissertation of Rosa González Hautamäki, entitled “Human-induced voice modifications and speaker recognition: Automatic, perceptual and acoustic perspectives”, is available for download at http://epublications.uef.fi/