I’m implementing a ‘Speech Parts’ tool for https://osr4rightstools.org/. A collaborator has put together Python code to do it using SpeechBrain: https://pypi.org/project/speechbrain/

VM Build Script shows how to install dependencies.

Python code and shell scripts

Essentially we

  • Take audio files (.mp3, .ogg, .flac, .m4a) or video files (.webm, .mp4)
  • Use ffmpeg to convert these files to .wav audio
  • Use Python to run SpeechBrain to find only the parts that contain speech
  • Save clips of only those parts
  • Print out the corresponding frames where they occur.
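
The steps above could be sketched roughly as follows. This is not the collaborator's actual code: the function names are hypothetical, the 16 kHz mono target and the 25 fps frame rate are assumptions, and the pretrained model (speechbrain/vad-crdnn-libriparty) is simply one publicly available SpeechBrain voice activity detector that may differ from what the tool really uses.

```python
import math
import subprocess
from pathlib import Path


def build_ffmpeg_cmd(src, dst, sample_rate=16000):
    """Build an ffmpeg command that drops any video and writes mono WAV."""
    return [
        "ffmpeg", "-y",            # overwrite the output if it already exists
        "-i", str(src),            # input audio or video file
        "-vn",                     # ignore the video stream
        "-ac", "1",                # mix down to one channel
        "-ar", str(sample_rate),   # resample (16 kHz is an assumption)
        str(dst),
    ]


def convert_to_wav(src, sample_rate=16000):
    """Run ffmpeg and return the path of the new .wav file."""
    dst = Path(src).with_suffix(".wav")
    subprocess.run(build_ffmpeg_cmd(src, dst, sample_rate), check=True)
    return dst


def detect_speech(wav_path):
    """Return (start, end) pairs in seconds from a pretrained SpeechBrain VAD.

    Assumption: the speechbrain/vad-crdnn-libriparty model, which expects
    16 kHz mono input. The import location differs between releases.
    """
    try:
        from speechbrain.inference.VAD import VAD  # SpeechBrain 1.x
    except ImportError:
        from speechbrain.pretrained import VAD     # older releases
    vad = VAD.from_hparams(
        source="speechbrain/vad-crdnn-libriparty",
        savedir="pretrained_models/vad-crdnn-libriparty",
    )
    boundaries = vad.get_speech_segments(str(wav_path))
    return [(start, end) for start, end in boundaries.tolist()]


def segments_to_frames(segments, fps=25.0):
    """Map (start, end) seconds to video frame numbers (fps is assumed)."""
    return [(math.floor(s * fps), math.ceil(e * fps)) for s, e in segments]
```

For example, `segments_to_frames(detect_speech(convert_to_wav("input.mp4")))` would chain the three steps for a hypothetical `input.mp4`. Rounding the start down and the end up keeps the whole speech segment inside the reported frame range.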


Use cases for this tool include:

  • CCTV camera analysis, to see where there is speech audio which may be interesting (i.e. if there is lots of silence and only a bit of audio, we want to know where it is)

  • Bodycam footage

YouTube can provide us with samples to test the system:

Bodycam footage of police during Minneapolis protests - BBC News. 3 minutes 30 seconds.

cctv 1 hour version - 1 hour

Longest Video Ever On YouTube - 9 hours 15 minutes, 393MB

100 Hour Timer Countdown - 100 hours, approx 3GB

yt-dlp

To get these raw files I used yt-dlp


# worked on Ubuntu 20
sudo apt install yt-dlp

# on WSL2 Ubuntu 18 I used another method.. pip perhaps?

# this gets an mp4 or whatever is 'best'
# (URL quoted so the shell does not try to interpret the ? character)
yt-dlp -S "ext" "https://www.youtube.com/watch?v=cbXOhnudzxk"


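Converting raw video to .wav can create huge files.

4GB is a problem for WAV files, because the format stores its sizes in 32-bit fields. The RF64 extension of the format removes that limit and can help.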
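
To get a feel for when the limit bites, here is a rough size estimate for uncompressed PCM audio. It is only a sketch: the 44.1 kHz, mono, 16-bit settings are assumptions rather than measurements of my files, and the 44-byte header is the usual minimal WAV header.

```python
WAV_HEADER_BYTES = 44        # typical minimal PCM WAV header
RIFF_MAX_BYTES = 2**32 - 1   # largest value a 32-bit size field can hold


def wav_size_bytes(duration_s, sample_rate, channels=1, bytes_per_sample=2):
    """Estimate the size of an uncompressed PCM WAV file in bytes."""
    audio_bytes = int(duration_s * sample_rate * channels * bytes_per_sample)
    return WAV_HEADER_BYTES + audio_bytes


def needs_rf64(size_bytes):
    """True when the file is too big for a plain (32-bit) WAV file."""
    return size_bytes > RIFF_MAX_BYTES


nine_hours = wav_size_bytes(9 * 3600, 44100)
hundred_hours = wav_size_bytes(100 * 3600, 44100)
print(nine_hours, needs_rf64(nine_hours))        # 2857680044 False
print(hundred_hours, needs_rf64(hundred_hours))  # 31752000044 True
```

Under those assumptions a 9 hour file stays below the limit at about 2.9GB, while a 100 hour file would be about 32GB. ffmpeg's WAV muxer has an `-rf64 auto` option that switches to RF64 when the output grows past the limit.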

Essentially I found that with a VM of size:

  • 32GB RAM
  • 32GB Disk Space
  • 2 core AMD CPU (the ffmpeg and SpeechBrain implementations are both single threaded). I used AMD as it is slightly cheaper

I could successfully process up to

  • 30 min, 800MB mp4 video file (expands to 60MB WAV)
  • 1 hour, 174MB video file (expands to 680MB WAV)
  • 9 hour, 393MB video file (expands to 2.9GB WAV)


My test inputs including the zip of everything together.