Extracting Speech Parts of audio or video using ffmpeg and SpeechBrain

I’m implementing a ‘Speech Parts’ tool for https://osr4rightstools.org/. A collaborator has put together Python code to do it using SpeechBrain: https://pypi.org/project/speechbrain/

VM Build Script shows how to install dependencies.

Python code and shell scripts

Essentially we:

Take audio files in .mp3, .ogg, .flac or .m4a format, or video in .webm or .mp4
Use ffmpeg to convert these files to .wav audio
Use Python to run SpeechBrain to extract only the parts that contain speech
Save only the clips that contain speech
Print out the corresponding frames where they occur.
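The steps above can be sketched roughly as follows. This is not the collaborator's code. It assumes SpeechBrain's pretrained voice activity detection model (speechbrain/vad-crdnn-libriparty), which expects 16 kHz mono input, and the function names build_ffmpeg_command, detect_speech and to_frames are made up for illustration.

```python
from __future__ import annotations

import math
import subprocess


def build_ffmpeg_command(src: str, dst: str, sample_rate: int = 16000) -> list[str]:
    """Build an ffmpeg command that extracts mono 16-bit PCM WAV audio from src."""
    return [
        "ffmpeg",
        "-y",                      # overwrite dst if it already exists
        "-i", src,                 # input audio or video file
        "-vn",                     # ignore any video stream
        "-ac", "1",                # downmix to mono
        "-ar", str(sample_rate),   # resample (assumption: 16 kHz for the VAD model)
        "-c:a", "pcm_s16le",       # 16-bit little-endian PCM
        dst,
    ]


def convert_to_wav(src: str, dst: str) -> None:
    """Run ffmpeg and raise CalledProcessError if the conversion fails."""
    subprocess.run(build_ffmpeg_command(src, dst), check=True)


def detect_speech(wav_path: str) -> list[tuple[float, float]]:
    """Return (start, end) times in seconds for each detected speech segment.

    Assumption: SpeechBrain's pretrained VAD is used; the real tool may differ.
    """
    try:
        from speechbrain.inference.VAD import VAD   # SpeechBrain 1.x
    except ImportError:
        from speechbrain.pretrained import VAD      # older releases
    vad = VAD.from_hparams(
        source="speechbrain/vad-crdnn-libriparty",
        savedir="pretrained_models/vad-crdnn-libriparty",
    )
    boundaries = vad.get_speech_segments(wav_path)  # tensor of [start, end] rows
    return [(float(start), float(end)) for start, end in boundaries]


def to_frames(segments: list[tuple[float, float]], fps: float) -> list[tuple[int, int]]:
    """Map speech times in seconds to (first_frame, last_frame) of the source video."""
    return [(math.floor(start * fps), math.ceil(end * fps)) for start, end in segments]
```

For a 25 fps video, to_frames(detect_speech("out.wav"), fps=25.0) would give the frame ranges to print. The frame rate has to come from the original video, since it is lost once the audio is extracted.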

Footage

Use cases for this tool include:

CCTV camera analysis to see where there is speech audio which may be interesting (i.e. if there is lots of silence and only a bit of audio, we want to know where)
Bodycam footage

YouTube can provide us with samples to test the system:

Bodycam footage of police during Minneapolis protests - BBC News, 3 minutes 30 seconds

CCTV 1 hour version - 1 hour

Longest Video Ever On YouTube - 9 hours 15 minutes, 393MB

100 Hour Timer Countdown - 100 hours, approx. 3GB

yt-dlp

To get these raw files I used yt-dlp

Installation

# worked on Ubuntu 20
sudo apt install yt-dlp

# on WSL2 Ubuntu 18 I used another method, possibly pip

# this gets an mp4 or whatever is 'best' 
yt-dlp https://www.youtube.com/watch?v=cbXOhnudzxk -S "ext"
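When several samples need fetching, the same command can be driven from Python. A minimal sketch, assuming yt-dlp is already on the PATH. The function names are hypothetical, and the -P option (yt-dlp's download directory) is my addition to the command above.

```python
from __future__ import annotations

import shutil
import subprocess


def build_ytdlp_command(url: str, out_dir: str = ".") -> list[str]:
    """Build the yt-dlp command used above, plus -P to choose the download directory."""
    return ["yt-dlp", url, "-S", "ext", "-P", out_dir]


def download(url: str, out_dir: str = ".") -> None:
    """Download one video, failing early if yt-dlp is not installed."""
    if shutil.which("yt-dlp") is None:
        raise RuntimeError("yt-dlp not found on PATH")
    subprocess.run(build_ytdlp_command(url, out_dir), check=True)
```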

Performance

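Converting raw video to .wav can create huge files.

4GB is a problem for WAV files, because the standard header stores sizes in 32-bit fields. RF64, a 64-bit extension of the format, can help; ffmpeg's WAV muxer can write it with the -rf64 auto option.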
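The size of an uncompressed PCM WAV is simple arithmetic: sample rate × channels × bytes per sample × duration, plus a small header. A quick sketch for checking whether a conversion will cross the 4GB limit. The function names and the default format (16 kHz, mono, 16-bit) are assumptions for illustration, not the settings used for the figures in this post.

```python
WAV_HEADER_BYTES = 44        # size of the canonical PCM WAV header
RIFF_MAX_BYTES = 2**32 - 1   # largest value a 32-bit size field can hold


def estimate_wav_bytes(duration_s: float, sample_rate: int = 16000,
                       channels: int = 1, bits_per_sample: int = 16) -> int:
    """Estimate the size in bytes of an uncompressed PCM WAV file."""
    bytes_per_second = sample_rate * channels * bits_per_sample // 8
    return WAV_HEADER_BYTES + int(duration_s * bytes_per_second)


def needs_rf64(duration_s: float, **fmt: int) -> bool:
    """True if the estimated file would not fit in a standard WAV."""
    return estimate_wav_bytes(duration_s, **fmt) > RIFF_MAX_BYTES


# One hour of CD-quality stereo is about 635MB.
print(estimate_wav_bytes(3600, sample_rate=44100, channels=2))
# The 100 hour video exceeds 4GB even at 16 kHz mono.
print(needs_rf64(100 * 3600))
```

At 44.1 kHz stereo the limit is reached after roughly 6 hours 45 minutes, so anything longer needs RF64 or a lower sample rate.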

Essentially I found that with a VM of size:

32GB RAM
32GB Disk Space
2 core AMD CPU (the ffmpeg and SpeechBrain implementations are both single threaded; I use AMD as it is slightly cheaper)

I could successfully process up to:

30 min, 800MB .mp4 video file (expands to 60MB WAV)
1 hour, 174MB video file (expands to 680MB WAV)
9 hour, 393MB video file (expands to 2.9GB WAV)
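Since the WAV can be many times larger than the input, it may be worth checking free disk space before starting a conversion. A small sketch using only the standard library. The function name fits_on_disk and the 20% headroom are assumptions, not part of the tool.

```python
import shutil


def fits_on_disk(needed_bytes: int, directory: str = ".", headroom: float = 1.2) -> bool:
    """Return True if directory has room for needed_bytes plus some headroom."""
    free_bytes = shutil.disk_usage(directory).free
    return needed_bytes * headroom <= free_bytes


# Example: would a 2.9GB WAV fit in the current directory?
print(fits_on_disk(2_900_000_000))
```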

My test inputs, including a zip of everything together.