SAGE: Speaker Recognition across Age-Groups for Cantonal Law Enforcement Agencies

The SAGE project responds to the increasing demands placed on Zurich law enforcement agencies (LEAs), such as the Zurich Forensic Science Institute (FOR) and the Office of the Public Prosecutor, Zurich, for the processing of legally acquired speech audio data. It will provide cutting-edge methods and software for speaker number estimation and diarization.

Description

The volume and complexity of audio data are growing steadily, as criminal activities are frequently coordinated via telecommunication channels, leading to the accumulation of large digital audio corpora. Among prominent examples are the “false policeman” (falsche Polizisten) frauds, which have been widely covered in all major Swiss media over the past years and repeatedly addressed in official public LEA warnings in Zurich and other cantons (e.g. ZH Sicherheitsdirektion). In such cases, perpetrators systematically contact predominantly elderly citizens, impersonate police officers, and manipulate victims into handing over valuables—often worth tens of thousands of francs—to accomplices posing as police representatives.

LEAs are subsequently confronted with dozens of hours of perpetrator recordings containing complex conversational interactions. Traditional manual analysis of such data is not feasible: the sheer volume of recordings overwhelms human experts, especially in ongoing investigations under time constraints. As a result, crucial investigative questions remain unresolved or require immense time and effort: (a) how many distinct speakers occur in the data (speaker number estimation), (b) who is speaking at which point in time (speaker diarization), and (c) what is being said in the recordings (speech recognition). The processing problems are further compounded when the data involve speech that traditional algorithms were not trained on: audio with children's or female voices, poor-quality audio, and audio with multiple speakers and rapid turn-taking. For example, callers in the “false policeman” scenarios were predominantly female, presumably because these voices were expected to be perceived as more trustworthy, and children's voices appear frequently as audio evidence in child sexual abuse cases.

Existing open-source audio-processing technologies are not yet sufficiently robust or adapted to forensic requirements, and their poor performance eliminates any time-saving benefits. Commercial alternatives are prohibitively expensive while offering only the same level of performance, lack transparency in the underlying methods, lag behind the rapid technological advances of today's open-source alternatives, and do not fully address the forensic needs of the practice partners. Given the rapid rise of digital and communication-based crime, it is essential that cantonal LEAs develop and deploy specialized methods for large-scale audio analysis.

The SAGE project will address this need by creating a tailor-made speech-processing system specifically designed for forensic contexts in the Canton of Zurich. SAGE will integrate cutting-edge methods for the two most pressing problems identified by Zurich LEAs in large-scale data processing, namely speaker number estimation and diarization, with a particular focus on robustness to gender, age group, audio quality, and complex speaker dynamics. To this end, SAGE will develop robust speaker embeddings (machine-learned computational “footprints” of a voice) that generalize across children as well as male and female adult voices and that are robust to challenging audio conditions such as noisy recordings and short speaker segments. The speaker embeddings will be optimized to improve performance on these two problems. A distinctive feature of SAGE is that users can provide feedback to the system to further improve its decisions. SAGE will deliver deployable software solutions, comprising both a library and an application, tailored for use in law-enforcement scenarios.
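To illustrate how speaker embeddings can support speaker number estimation, the following is a minimal sketch and not the SAGE method. It assumes that each audio segment has already been mapped to an embedding vector, and it groups segments by average-linkage agglomerative clustering on cosine distance. The toy three-dimensional embeddings, the function names, and the distance threshold of 0.5 are all hypothetical choices made for this illustration; real embeddings have hundreds of dimensions and are far less cleanly separated.

```python
import math
import random


def cosine_distance(a, b):
    """Return 1 - cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def cluster_segments(embeddings, threshold=0.5):
    """Average-linkage agglomerative clustering (hypothetical helper).

    Merging stops once the closest pair of clusters is farther apart
    than `threshold`. Returns one cluster label per segment.
    """
    clusters = [[i] for i in range(len(embeddings))]

    def linkage(c1, c2):
        # Mean pairwise cosine distance between two clusters.
        total = sum(
            cosine_distance(embeddings[i], embeddings[j])
            for i in c1
            for j in c2
        )
        return total / (len(c1) * len(c2))

    while len(clusters) > 1:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = linkage(clusters[a], clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        if best[0] > threshold:
            break  # remaining clusters are treated as distinct speakers
        _, a, b = best
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]

    labels = [0] * len(embeddings)
    for label, members in enumerate(clusters):
        for i in members:
            labels[i] = label
    return labels


def make_toy_embeddings(seed=0):
    """Synthetic stand-in for real embeddings: 3 speakers, 4 segments each."""
    rng = random.Random(seed)
    centroids = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
    segments, truth = [], []
    for speaker, centroid in enumerate(centroids):
        for _ in range(4):
            segments.append([v + rng.gauss(0.0, 0.05) for v in centroid])
            truth.append(speaker)
    return segments, truth


embeddings, truth = make_toy_embeddings()
labels = cluster_segments(embeddings, threshold=0.5)
estimated_speakers = len(set(labels))  # speaker number estimation
print("Estimated number of speakers:", estimated_speakers)
print("Segment labels:", labels)       # crude per-segment diarization
```

In this sketch the number of clusters that survive the threshold serves as the speaker count, and the per-segment labels give a rough diarization. The quality of both outputs depends directly on how well the embeddings separate voices, which is why embeddings that remain robust across age groups, genders, and audio conditions matter for both tasks.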