Pablo Bernabeu, Researcher, Department of Education
The landscape of speech-to-text transcription has undergone a remarkable transformation in recent years, driven by the proliferation of GenAI tools; here, I’ll share how I’ve harnessed this transformation in my research. From basic dictation software to sophisticated neural networks, transcription technology has evolved to handle diverse audio conditions, multiple languages, and complex speech patterns with unprecedented accuracy.
At the forefront of this revolution stands OpenAI’s Whisper model family, released as open-source tools that have democratised access to state-of-the-art automatic speech recognition (ASR) capabilities. True to the ‘Open’ in OpenAI’s original mission, these models have become the gold standard for transcription tasks, offering researchers and developers robust, multilingual speech recognition that rivals proprietary commercial solutions. The Whisper architecture, trained on 680,000 hours of multilingual audio data, represents a paradigm shift towards generalisable, production-ready ASR systems that can handle real-world audio conditions without extensive fine-tuning.
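To give a flavour of what these models offer, below is a minimal sketch using the open-source openai-whisper Python package, which runs entirely on the local machine. The model size (‘medium’) and the file name are illustrative placeholders, not values from my pipeline.

```python
# pip install openai-whisper  (ffmpeg must also be installed)
import whisper

# Load a Whisper checkpoint locally; "medium" trades speed for accuracy.
model = whisper.load_model("medium")

# Transcription runs on the local CPU/GPU, so the audio never leaves the machine.
result = model.transcribe("interview.wav", language="en")

print(result["text"])  # full transcript

# Each segment carries start/end timestamps, useful for aligning quotes.
for seg in result["segments"]:
    print(f"[{seg['start']:.1f}s-{seg['end']:.1f}s] {seg['text']}")
```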
To reap the benefits of Whisper while preserving the privacy of speech recordings, I developed a secure, scalable speech transcription pipeline, which is available on my GitHub.
I built this pipeline with the help of GitHub Copilot, making it compatible with both local machines and High-Performance Computing (HPC) environments. By combining my programming expertise with Copilot’s rapid code generation in Visual Studio Code, I developed a system that harnesses state-of-the-art Whisper models while strictly maintaining data privacy. The development followed a rigorous, human-in-the-loop workflow: I iteratively reviewed Copilot’s outputs to correct errors, integrated complex functionality, and produced thorough technical documentation, resulting in a robust tool for secure research data processing.
Implementing GenAI has revolutionised my workflow by replacing the prohibitive bottleneck of manual transcription with a secure, automated pipeline. Previously, transcribing a single hour of interview data required hours of manual typing. In contrast, Whisper-based transcription now processes that same hour in minutes, a 95% reduction in workload, with throughput that scales roughly linearly when the transcription is parallelised across several computers on an HPC cluster.
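To give a sense of how that parallelisation can work, here is a minimal sketch, not the actual pipeline code, in which each task of a hypothetical SLURM job array (submitted with, say, `--array=0-9`) transcribes a disjoint slice of the audio files; the directory paths are placeholders:

```python
import os
from pathlib import Path

import whisper

# Placeholder paths; adapt to your own storage layout.
AUDIO_DIR = Path("data/audio")
OUT_DIR = Path("data/transcripts")
OUT_DIR.mkdir(parents=True, exist_ok=True)

# Sort so every array task sees the files in the same order.
files = sorted(AUDIO_DIR.glob("*.wav"))

# SLURM sets these for job arrays; the fallbacks allow a plain local run.
# Assumes the array was submitted as --array=0-(N-1).
task_id = int(os.environ.get("SLURM_ARRAY_TASK_ID", 0))
n_tasks = int(os.environ.get("SLURM_ARRAY_TASK_COUNT", 1))

# Each task takes every n_tasks-th file, starting at its own ID.
my_files = files[task_id::n_tasks]

# The model runs on the local node; no audio is sent to any external service.
model = whisper.load_model("medium")

for audio_file in my_files:
    result = model.transcribe(str(audio_file))
    out_path = OUT_DIR / audio_file.with_suffix(".txt").name
    out_path.write_text(result["text"], encoding="utf-8")
    print(f"Transcribed {audio_file.name} -> {out_path}")
```

Because the array tasks share no state, ten tasks finish roughly ten times faster than one, which is where the near-linear scaling comes from.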
Unlike off-the-shelf AI chatbots, which impose file-size limits and carry data privacy risks, this self-contained workflow allows for the secure, batch processing of hundreds of hours of sensitive audio without ever exposing data to the cloud. This has unlocked an entirely new way of working: we can now conduct large-scale qualitative analysis on GDPR-protected datasets that were previously impossible to process automatically. Furthermore, using GitHub Copilot to build this system allowed me to ‘punch above my weight’ technically, compressing months of complex software engineering into weeks of iterative, AI-assisted development.
I hope we can continue enjoying the benefits of this technology while controlling the risks. This will require responsibility at all levels. I would highlight two principles:
- Copilot, not autopilot: Review and revise the output conscientiously and relentlessly.
- Reproducibility: Ensure that the output can be reproduced, modified, and improved by anyone in the future, as easily as possible.
GitHub Copilot and Whisper AI models can be valuable for certain uses, but they are not covered by University enterprise agreements, so they should only be used after careful consideration of data privacy risks. Please refer to Information Security guidelines when using third-party GenAI tools.