Extracting Speech Parts of audio or video using ffmpeg and SpeechBrain
I’m implementing a ‘Speech Parts’ tool for https://osr4rightstools.org/. A collaborator has put together Python code to do it using https://pypi.org/project/speechbrain/
The VM Build Script shows how to install the dependencies.
Essentially we:
- Take audio files in the form of .mp3, .ogg, .flac, .m4a or video in .webm, .mp4
- Use ffmpeg to convert these files to .wav audio
- Use Python to run SpeechBrain to extract only the parts with speech
- Save only the clips that contain speech
- Print out the corresponding frames where those clips occur.
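The steps above can be sketched in Python roughly as follows. This is a minimal sketch, not the collaborator's actual code: it assumes SpeechBrain's pretrained voice activity detection model (`speechbrain/vad-crdnn-libriparty`, which expects 16 kHz mono audio), and it assumes "frames" means video frame numbers at a known frame rate. The function names, the `savedir` path and the `fps` parameter are all hypothetical.

```python
import subprocess
from pathlib import Path


def build_ffmpeg_command(src, dst, sample_rate=16000):
    """Return an ffmpeg argument list that converts src to mono 16-bit PCM WAV."""
    return [
        "ffmpeg",
        "-y",                     # overwrite dst if it already exists
        "-i", str(src),           # input audio or video file
        "-vn",                    # drop any video stream
        "-ac", "1",               # downmix to mono
        "-ar", str(sample_rate),  # resample (16 kHz is an assumption)
        "-c:a", "pcm_s16le",      # 16-bit little-endian PCM
        str(dst),
    ]


def convert_to_wav(src, sample_rate=16000):
    """Run ffmpeg and return the path of the .wav written next to src."""
    dst = Path(src).with_suffix(".wav")
    subprocess.run(build_ffmpeg_command(src, dst, sample_rate), check=True)
    return dst


def find_speech_segments(wav_path):
    """Return a list of (start_seconds, end_seconds) tuples for speech regions.

    Assumes SpeechBrain's pretrained VAD model; the real tool may differ.
    """
    try:
        from speechbrain.inference.VAD import VAD  # SpeechBrain 1.0 and later
    except ImportError:
        from speechbrain.pretrained import VAD     # older releases
    vad = VAD.from_hparams(
        source="speechbrain/vad-crdnn-libriparty",
        savedir="pretrained_models/vad-crdnn-libriparty",  # hypothetical path
    )
    boundaries = vad.get_speech_segments(str(wav_path))  # tensor, shape [N, 2]
    return [(float(start), float(end)) for start, end in boundaries]


def seconds_to_frames(segments, fps):
    """Map (start, end) times in seconds to zero-based (first, last) frames."""
    return [(int(start * fps), int(end * fps)) for start, end in segments]
```

A typical run would call `convert_to_wav("bodycam.mp4")`, pass the result to `find_speech_segments`, and then print `seconds_to_frames(segments, fps)` using the frame rate of the source video. The SpeechBrain import sits inside the function so the ffmpeg and frame helpers can be used without it installed.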
Footage
Use cases for this tool include:
- CCTV camera analysis, to see where there is speech audio which may be interesting (i.e. if there is lots of silence and only a bit of audio, we want to know where)
- Bodycam footage
YouTube can provide us with samples to test the system:
Bodycam footage of police during Minneapolis protests - BBC News - 3 minutes 30 seconds
cctv 1 hour version - 1 hour
Longest Video Ever On YouTube - 9hours 15minutes, 393MB
100 Hour Timer Countdown - 100 hours. Approx 3GB
yt-dlp
To get these raw files I used yt-dlp
# worked on Ubuntu 20
sudo apt install yt-dlp
# WSL2 Ubuntu 18 I used another method.. pip perhaps?
# this gets an mp4 or whatever is 'best'
yt-dlp "https://www.youtube.com/watch?v=cbXOhnudzxk" -S "ext"
Performance
Converting raw video to .WAV can create huge files.
Going over 4GB is a problem for WAV files, because the format stores its sizes in 32-bit fields. The RF64 extension can help.
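To judge in advance whether a conversion will hit the 4GB limit, the size of uncompressed PCM audio can be estimated from its duration. The sketch below assumes 16 kHz, mono, 16-bit output by default; those defaults and the function names are illustrative, not the settings the tool necessarily uses.

```python
WAV_SIZE_LIMIT = 2**32  # classic WAV stores sizes in 32-bit fields (4 GiB)


def estimate_wav_bytes(duration_s, sample_rate=16000, channels=1, bits=16):
    """Approximate size of uncompressed PCM audio, ignoring the small header."""
    return int(duration_s * sample_rate * channels * bits // 8)


def needs_rf64(duration_s, **fmt):
    """True if the estimated size would not fit in a classic WAV file."""
    return estimate_wav_bytes(duration_s, **fmt) >= WAV_SIZE_LIMIT


if __name__ == "__main__":
    for hours in (9, 100):
        size = estimate_wav_bytes(hours * 3600)
        print(f"{hours} hours: {size / 1e9:.2f} GB, RF64 needed: "
              f"{needs_rf64(hours * 3600)}")
```

Under these assumed settings the 100 hour video comes to roughly 11.5 GB, well past the limit. ffmpeg's WAV muxer accepts `-rf64 auto` as an output option, which writes an RF64 header only when the file grows beyond 4GB.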
Essentially I found that with a VM of size:
- 32GB RAM
- 32GB Disk Space
- 2-core AMD CPU (the ffmpeg and SpeechBrain implementations are both single-threaded; I used AMD as it is slightly cheaper)
I could successfully process up to
- 30 min, 800MB mp4 video file (expands to 60MB WAV)
- 1 hour, 174MB video file (expands to 680MB WAV)
- 9 hour, 393MB video file (expands to 2.9GB WAV)
My test inputs are available, including a zip of everything together.