You’ve probably been binge-watching a series, and by now the intro/outro song is stuck in your head, the visuals are burned onto your retinas, and you’re ready to do anything to avoid hearing those songs one more time.
Some streaming services like Netflix already have a skip intro button, but how does it work? Do you really think that people at Netflix in charge of their content specify in their contracts that content providers should also provide the intro/outro time range? Don’t be silly, that would be too easy.
I’ll try to automate that using chromaprint, mostly for fun and profit, and maybe I’ll learn something along the way.
I’ll also need to install ffmpeg to do some work on the video files themselves, and the golang wrappers for chromaprint.
Let’s imagine I own some JoJo’s Bizarre Adventure episodes.
First, I need to strip the audio from the video files. Using ffmpeg is fast, and it looks like this in my case.
Strip the audio stream:
ffmpeg -i input.mkv -vn -c:a copy output.m4a
Next, I’ll need to decompress it into a .wav file and also drop as many bytes as possible. Making it mono and trimming it to the first 3 minutes, where the intro song usually plays, should be good enough.
Convert to .wav:
ffmpeg -i output.m4a output.wav
Convert stereo .wav to mono .wav:
ffmpeg -i output.wav -ac 1 m_output.wav
Trim down to 180 seconds:
ffmpeg -i m_output.wav -af atrim=0:180 m_t_output.wav
Yes, it can be automated with a bash script, but I only needed to do it for like ten files… I should’ve written a script.
Finding the common region
By running a piece of raw audio data (a .wav file with no header) through chromaprint, you get the fingerprints of those files, which are essentially just spectrograms (more info on how it works).
Comparing two perfectly aligned audio files results in this image.
The black area at the beginning is where the spectrograms XOR-ed perfectly: identical values XOR to zero, which shows up as a mostly black area.
But in most cases intros aren’t perfectly aligned: before the intro song begins there could be some scenes from previous episodes or some pre-intro scenes from the current episode. Those scenes always vary in length, and if I were to compare the spectrograms as they are, the result would look like a bunch of noise.
One way to find the common areas in two different spectrograms is to slide them past each other like puzzle pieces, with each iteration producing a match score. I start with a 50% offset between the slices (golang’s view of an array) and end at a -50% offset, which is good enough for the intro I’m searching for. As the offset decreases, the overlapping parts of the slices first get a bit bigger and then smaller again (highlighted in blue). Another way could be the “Longest Common Substring” approach with some tolerance between values, but that’s for another time.
After all the slide and compare action, I’ll pick the iteration with the best score and do a comparison, usually resulting in something like the picture above.
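To make the sliding idea concrete, here is a rough sketch in Go. It is not the actual implementation: the function names, the choice of "average number of differing bits over the overlap" as the match score (lower is better), and the toy fingerprints are all assumptions made for illustration. The bit counting is the Hamming distance described further down.

```go
package main

import (
	"fmt"
	"math"
	"math/bits"
)

// meanDistance returns the average number of differing bits per pair over
// the overlapping part of a and b, plus the number of pairs compared.
// offset > 0 lines b[0] up with a[offset]; offset < 0 lines a[0] up with b[-offset].
func meanDistance(a, b []int32, offset int) (float64, int) {
	ai, bi := 0, 0
	if offset > 0 {
		ai = offset
	} else {
		bi = -offset
	}
	n := len(a) - ai
	if len(b)-bi < n {
		n = len(b) - bi
	}
	if n <= 0 {
		return 0, 0
	}
	total := 0
	for i := 0; i < n; i++ {
		total += bits.OnesCount32(uint32(a[ai+i] ^ b[bi+i]))
	}
	return float64(total) / float64(n), n
}

// bestOffset slides b past a from +50% to -50% of the shorter slice and
// returns the offset with the lowest mean distance.
func bestOffset(a, b []int32) int {
	maxShift := len(a) / 2
	if len(b)/2 < maxShift {
		maxShift = len(b) / 2
	}
	best, bestScore := 0, math.Inf(1)
	for off := maxShift; off >= -maxShift; off-- {
		if score, n := meanDistance(a, b, off); n > 0 && score < bestScore {
			best, bestScore = off, score
		}
	}
	return best
}

func main() {
	// Toy fingerprints: bit patterns chosen so that any two different values
	// differ in exactly 16 bits, which makes the result easy to check by hand.
	intro := []int32{0x0000FFFF, 0x00FF00FF, 0x0F0F0F0F, 0x33333333, 0x55555555}
	a := append([]int32{0x00000000, 0x00FFFF00, 0x0F0FF0F0}, intro...) // intro starts at a[3]
	b := append([]int32{0x0FF00FF0}, intro...)                          // intro starts at b[1]
	b = append(b, 0x66666666, 0x3333CCCC)

	off := bestOffset(a, b)
	score, n := meanDistance(a, b, off)
	fmt.Printf("best offset: %d (mean distance %.2f over %d pairs)\n", off, score, n)
}
```

For the toy data the shared intro sits at index 3 in one slice and index 1 in the other, so the best offset comes out as 2. Real fingerprints are noisier, so the best score will be low rather than near zero.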
By the way, the raw fingerprints of those two files are just int32 slices. The images above are just a visual aid, and the int32 values are one-pixel-wide vertical slices of each fingerprint.
The comparison between the values is done using Hamming distance for each pair of int32 values.
Once I’ve compared the best match, I get a similar result but with more numbers:
[15, 20, 9, 13, 12, 10, 6, 7, 3, 2, 2, 1, 0, 3, 2, 1, 9, 13, 12, 14]
Each value is the Hamming distance between a pair of int32 values, and it’s easy to spot that somewhere in the middle there is a subsequence that drops below 10 and stays there for quite a while.
That is the matching area I was looking for. If I were to compare those fingerprints and output an image, it would have a blacker area somewhere in the middle. The next step is to calculate how long it is; then, taking the offset into account, it is trivial to calculate where the intro song starts and ends in both files.
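As a sketch of that last step, the snippet below finds the longest run of distances under a threshold and converts it to timestamps. The threshold of 10 comes from the example above; everything else is an assumption for illustration: the names longestRun and toSpans, the offset convention (a positive offset means the second file's fingerprint was shifted right by that many items), and the duration of one fingerprint item, which I believe is roughly 0.124 seconds with chromaprint's defaults but which you should verify for your setup.

```go
package main

import "fmt"

// longestRun returns the start index and length of the longest stretch of
// consecutive distances strictly below threshold (0, 0 if there is none).
func longestRun(dist []int, threshold int) (start, length int) {
	runStart := -1
	for i, d := range dist {
		if d >= threshold {
			runStart = -1 // the run is broken
			continue
		}
		if runStart < 0 {
			runStart = i
		}
		if n := i - runStart + 1; n > length {
			start, length = runStart, n
		}
	}
	return start, length
}

// span is a time range in seconds.
type span struct{ Start, End float64 }

// toSpans converts a run inside the overlapping region back into time ranges
// for both files. offset > 0 means the overlap starts at item `offset` of the
// first file; offset < 0 means it starts at item `-offset` of the second one.
func toSpans(start, length, offset int, itemSeconds float64) (a, b span) {
	ai, bi := 0, 0
	if offset > 0 {
		ai = offset
	} else {
		bi = -offset
	}
	a = span{float64(ai+start) * itemSeconds, float64(ai+start+length) * itemSeconds}
	b = span{float64(bi+start) * itemSeconds, float64(bi+start+length) * itemSeconds}
	return a, b
}

func main() {
	dist := []int{15, 20, 9, 13, 12, 10, 6, 7, 3, 2, 2, 1, 0, 3, 2, 1, 9, 13, 12, 14}
	start, length := longestRun(dist, 10)
	fmt.Printf("run starts at item %d and is %d items long\n", start, length)

	// Hypothetical offset of 2 items and an assumed item duration.
	a, b := toSpans(start, length, 2, 0.1238)
	fmt.Printf("file A: %.2fs to %.2fs, file B: %.2fs to %.2fs\n", a.Start, a.End, b.Start, b.End)
}
```

On the example list this picks the stretch from index 6 to index 16 (11 items); the lone 9 at index 2 is ignored because it is a shorter run. On real data a little tolerance for single outliers inside the run would probably help, but that is left out here to keep the sketch short.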
The end result is a pretty reliable way of finding common regions/fingerprints in two or more audio files.
The implementation can be seen