Big video platforms like YouTube, and even Slack, use AI to automatically caption the videos you upload. This is a great win for accessibility, and the AI is generally very accurate. It sometimes stumbles on uncommon words and poor audio, but a human can easily fix those mistakes. I think most would agree that the benefits and cost savings of generating roughly 95% accurate captions greatly outweigh having no captions at all!
There are two steps to this: (1) generate captions from the video as an SRT file, and (2) bundle the captions with the video.
In my experience, Whisper is incredibly good at audio transcription, and it can transcribe mp4 files directly to SRT format. You don’t have to extract the audio from the video first. Once you have an SRT file, ffmpeg can easily bundle it with the mp4. I’ve tried this on some personal videos, and the results have been near perfect.
How to Do It
Step 1: Use whisper to generate an SRT transcription of the video:
whisper infile.mp4 \
  --model small.en \
  --language English \
  -f 'srt'
I tested this on a two-minute video, and it took less than 30 seconds to produce a complete transcription. The output is a subtitle file for your video: an SRT file. SRT is a plaintext format, so you can edit it with your text editor of choice. Open the SRT and you will see a series of entries separated by blank lines. Each entry has a sequence number, a line with the start and end timecodes, and the caption text. Video players use the timecodes to display each caption in sync with the video.
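To make the entry structure concrete, here is a minimal Python sketch that splits SRT text into its entries. The `Cue` class, the `parse_srt` function, and the sample captions are all hypothetical names and data made up for illustration; this is not Whisper's own code, and it only handles the simple layout described above.

```python
import re
from dataclasses import dataclass

# Timecode line, e.g. "00:00:01,000 --> 00:00:04,200"
TIMECODE = re.compile(r"(\d{2}:\d{2}:\d{2},\d{3}) --> (\d{2}:\d{2}:\d{2},\d{3})")


@dataclass
class Cue:
    index: int   # sequence number of the entry
    start: str   # start timecode, kept as text
    end: str     # end timecode, kept as text
    text: str    # caption text (may span several lines)


def parse_srt(srt_text: str) -> list[Cue]:
    """Split SRT text into cues; entries are separated by blank lines."""
    cues = []
    for block in re.split(r"\n\s*\n", srt_text.strip()):
        lines = block.strip().splitlines()
        if len(lines) < 3 or not lines[0].strip().isdigit():
            continue  # skip anything that is not a well-formed entry
        match = TIMECODE.match(lines[1].strip())
        if not match:
            continue
        cues.append(
            Cue(int(lines[0]), match.group(1), match.group(2), "\n".join(lines[2:]))
        )
    return cues


# Made-up sample data in the SRT layout
SAMPLE = """1
00:00:00,000 --> 00:00:02,500
Hello and welcome.

2
00:00:02,500 --> 00:00:05,000
Today we're captioning a video.
"""
```

Calling `parse_srt(SAMPLE)` gives you one `Cue` per entry, which is handy if you want to inspect or filter captions in a script rather than by eye.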
I used the small.en language model, which is about 400MB to download. It may be less accurate than the larger models, but in my experience it does a really good job. The option -f 'srt' specifies that you want an SRT file, but you have your choice of text output format.
You may want to quickly review the SRT file for any misinterpreted words or names.
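If the same name gets misheard throughout a video, a small script can handle the repetitive part of the review. The sketch below is only an illustration: `fix_captions` and the example names are hypothetical, and it assumes a plain SRT layout where sequence numbers are digit-only lines and timecode lines contain `-->`.

```python
import re


def fix_captions(srt_text: str, corrections: dict[str, str]) -> str:
    """Apply whole-word replacements to caption text lines only."""
    fixed = []
    for line in srt_text.splitlines():
        is_index = line.strip().isdigit()  # sequence number line
        is_timecode = "-->" in line        # start/end timecode line
        if not (is_index or is_timecode):
            for wrong, right in corrections.items():
                # \b keeps us from replacing inside longer words;
                # the lambda inserts the replacement literally
                line = re.sub(
                    rf"\b{re.escape(wrong)}\b", lambda m, r=right: r, line
                )
        fixed.append(line)
    return "\n".join(fixed) + "\n"
```

For example, `fix_captions(srt_text, {"Jon Smyth": "John Smith"})` rewrites the name wherever it appears in caption text while leaving the numbering and timecodes alone. It is still worth reading the result yourself, since a script can't catch mistakes it wasn't told about.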
Step 2: Use ffmpeg to add the SRT as a subtitle track:
ffmpeg -i infile.mp4 \
  -i infile.srt \
  -c copy -c:s mov_text \
  outfile.mp4
This took less than 2 seconds to run. It doesn’t have to re-encode the video; it just bundles the SRT alongside the video track inside the mp4 container.
You will need a player like VLC that can show you the subtitle tracks. It is also possible to use ffmpeg to “burn in” the subtitles from your SRT file, but this requires re-rendering the whole video.
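For the burn-in route, one common approach is ffmpeg's `subtitles` video filter. The sketch below builds that command in Python. It rests on a few assumptions: your ffmpeg build includes libass (which the filter needs), the file names are placeholders, and the SRT path contains no characters that need escaping inside a filter argument. The function names are hypothetical.

```python
import subprocess


def burn_in_command(video: str, srt: str, output: str) -> list[str]:
    """Build an ffmpeg command that draws the captions onto the video frames."""
    return [
        "ffmpeg",
        "-i", video,
        "-vf", f"subtitles={srt}",  # re-encodes the video with captions drawn in
        "-c:a", "copy",             # audio is untouched, so copy it as-is
        output,
    ]


def burn_in(video: str, srt: str, output: str) -> None:
    """Run the command; raises CalledProcessError if ffmpeg fails."""
    subprocess.run(burn_in_command(video, srt, output), check=True)
```

Calling `burn_in("infile.mp4", "infile.srt", "burned.mp4")` will take far longer than the bundling command above, because every frame gets re-encoded. The upside is that the captions show up in any player, with no subtitle track support required.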
Automate It All!
The above would be the “by hand” procedure, but perhaps you can see how easily this process might be automated in bulk.
The idea would be that you can drop the mp4 files you want subtitled into an S3 bucket. Then you’d have a scheduler script detect them and queue up jobs for one or more workers. This isn’t strictly necessary, but would allow you to fan out the process to as many workers as you like to process more videos faster. The jobs would run the above commands and produce captioned video files automatically, which are then saved back to S3 somewhere.
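As a rough sketch of what one worker might run, here is a Python version of the two steps applied to every mp4 in a folder. Local folders stand in for the S3 bucket, and the download, upload, and queueing parts are left out entirely. The function names and the `.captioned.mp4` naming are hypothetical, and the code assumes Whisper writes `<video name>.srt` into the directory given by `--output_dir`, which may differ between Whisper versions.

```python
import subprocess
from pathlib import Path


def caption_commands(video: Path, work_dir: Path) -> list[list[str]]:
    """Return the whisper and ffmpeg commands for one video, in order."""
    srt = work_dir / f"{video.stem}.srt"  # assumed Whisper output name
    captioned = work_dir / f"{video.stem}.captioned.mp4"
    whisper_cmd = [
        "whisper", str(video),
        "--model", "small.en",
        "--language", "English",
        "-f", "srt",
        "--output_dir", str(work_dir),
    ]
    ffmpeg_cmd = [
        "ffmpeg", "-y",  # -y overwrites output from an earlier run
        "-i", str(video),
        "-i", str(srt),
        "-c", "copy", "-c:s", "mov_text",
        str(captioned),
    ]
    return [whisper_cmd, ffmpeg_cmd]


def process_inbox(inbox: Path, work_dir: Path) -> None:
    """Caption every mp4 in `inbox`, stopping at the first failed command."""
    work_dir.mkdir(parents=True, exist_ok=True)
    for video in sorted(inbox.glob("*.mp4")):
        for cmd in caption_commands(video, work_dir):
            subprocess.run(cmd, check=True)
```

Keeping the command building separate from the running makes it easy to log or dry-run the jobs before handing them to real workers. In a real pipeline you would probably also want a review step between the two commands, so a human can fix the SRT before it gets bundled.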
Show Captions in HTML without Embedding
If you’re captioning videos for viewing in web browsers, there’s an even easier way.
After my initial write-up, I learned that it’s also possible to attach the track to the video using HTML5’s built-in <track> tags. This cuts out step two above, meaning you won’t have to use ffmpeg to embed the track in the video file.
In this case, you’ll want to generate a .vtt file instead of the .srt above. Fortunately, Whisper supports this format, too!
whisper infile.mp4 \
  --model small.en \
  --language English \
  -f 'vtt'
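If you already have an SRT that you've reviewed and corrected, you may not want to transcribe again. The two formats are close enough that a simple conversion often works: WebVTT needs a `WEBVTT` header line and uses a dot instead of a comma in its timecodes. The sketch below assumes a plain SRT with no styling tags, and `srt_to_vtt` is a hypothetical helper name.

```python
import re

# Matches the comma before the milliseconds in an SRT timecode
SRT_TIME = re.compile(r"(\d{2}:\d{2}:\d{2}),(\d{3})")


def srt_to_vtt(srt_text: str) -> str:
    """Convert plain SRT text to WebVTT text."""
    out = []
    for line in srt_text.strip().splitlines():
        if "-->" in line:
            # Only touch timecode lines, so commas in captions survive
            line = SRT_TIME.sub(r"\1.\2", line)
        out.append(line)
    # The SRT sequence numbers stay in place; WebVTT treats them as cue identifiers
    return "WEBVTT\n\n" + "\n".join(out) + "\n"
```

You could read `infile.srt`, pass its contents through `srt_to_vtt`, and write the result to `captions.vtt`. Running Whisper with -f 'vtt' is still the simpler option when you're starting from scratch.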
Once you’ve generated your captions, you can use them with the HTML5 video player like this:
<video controls>
  <source src="myvideo.mp4" type="video/mp4" />
  <track label="English" kind="captions" srclang="en" src="/path/to/captions.vtt" default />
</video>