Automatically Caption Your Videos with Whisper and ffmpeg

Aug 12, 2023 · 5 min read

Big video handlers like YouTube and even Slack use AI to automatically caption videos that you upload. This is a great win for accessibility, and the AI is extremely accurate. It sometimes stumbles on uncommon words and bad audio signals, but these can be easily fixed by a human. I think most would agree that the benefit and cost savings of generating 95% accurate captions greatly outweighs the cost of having none at all!

So at work we were talking about how we could build this feature ourselves using open source tools. I came up with this suggestion that uses OpenAI’s Whisper and the venerable ffmpeg.

There are two steps to this (1) generate captions from the video as an SRT file, and (2) bundle the captions with the video.

Whisper in my experience is incredibly good at audio transcription. And it can directly transcribe mp4 files to SRT format. You don’t have to extract the audio from the video or anything like that. Once you have an SRT file, ffmpeg can easily bundle an mp4 and SRT file. I’ve tried this on some personal videos, and the results have been near perfect.

Installation

Here is my project directory on GitHub. The README includes installation instructions, reproduced here.

https://github.com/whusterj/whisper-transcribe

The only two hard dependencies are the ffmpeg system package and the openai-whisper Python package.

This installation has been tested with Python 3.10.12 on Ubuntu 20.02. It should work on other Platforms as well, and OpenAI says Whisper should work with all Python versions 3.9 to 3.11.

First, you will need ffmpeg on your system, if you don’t have it already:

# on Ubuntu or Debian
sudo apt update && sudo apt install ffmpeg

# on MacOS using Homebrew (https://brew.sh/)
brew install ffmpeg

For other platforms, see the [Whisper GitHub repo][1].

Now you can install the python requirements. Create a virtual environment, activate it, and pip install from requirements.txt:

python -m venv .venv/
source .venv/bin/activate
pip install -r requirements.txt

If you encounter errors during installation, you may need to install rust on your platform as well. If my requirements.txt fails to install for some reason, try just installing the openai-whisper package - this should install its Python sub-dependencies:

pip install openai-whisper

How to Do It

Step 1: Use whisper to generate an SRT transcription of the video:

whisper infile.mp4 \
    --model small.en \
    --language English \
    -f 'srt'

I tested this on a two minute video and it took less than 30 seconds to produce a complete transcription. This output is a subtitle file for your video - an SRT file. SRT is a plaintext file format, so you can edit it with your word processor of choice. Open the SRT, and you will see that each line of the file contains start and end timecodes and the caption text. This can be interpreted by video players to display the caption at the right time in sync with the video.

I used the small.en language model, which is about 400MB to download. It may be less accurate than the larger models, but in my experience it does a really good job. The option -f 'srt' specifies that you want an SRT file, but you have your choice of text output format.

You may want to quickly review the SRT file for any misinterpreted words or names

Step 2: Use ffmpeg to add the SRT as a subtitle track:

ffmpeg -i infile.mp4 \
       -i infile.srt \
       -c copy -c:s mov_text \
       outfile.mp4

This took less than 2 seconds to run. It doesn’t have to re-render the video, it just bundles the SRT alongside the video track inside of the mp4 container.

You will need a player like VLC that can show you the subtitle tracks. It is also possible to use ffmpeg to “burn in” the subtitles from your SRT file, but this requires re-rendering the whole video.

Automate It All!

The above would be the “by hand” procedure, but perhaps you can see how easily this process might be automated in bulk.

Automatic Video Captioning Pipeline Sketch The idea would be that you can drop the mp4 files you want subtitled into an S3 bucket. Then you’d have a scheduler script detect them and queue up jobs for one or more workers. This isn’t strictly necessary, but would allow you to fan out the process to as many workers as you like to process more videos faster. The jobs would run the above commands and produce captioned video files automatically, which are then saved back to S3 somewhere.

Show Captions in HTML without Embedding

If you’re captioning videos for viewing in web browsers, there’s an even easier way.

After my initial write-up, I learned that it’s also possible to attach the track to the video using HTML5’s built-in <video> and <track> tags. This cuts out step two above, meaning you won’t have to use ffmpeg to embed the track in the video file.

In this case, you’ll want to generate a .vtt file instead of the .srt above. Fortunately, Whisper supports this format, too!

whisper infile.mp4 \
    --model small.en \
    --language English \
    -f 'vtt'

Once you’ve generated your captions, you can use them with the HTML5 video player like this:

<video src="myvideo.mp4">
  <source src="myvideo.mp4" type="video/mp4" />
  <track
    label="English"
    kind="captions"
    srclang="en"
    src="/path/to/captions.vtt"
    default
  />
</video>

William Huster

A Natural General Intelligence

Automatically Caption Your Videos with Whisper and ffmpeg

Installation

How to Do It

Automate It All!

Show Captions in HTML without Embedding