How We Made the Candidates Speak 30 Times Slower

By Edward Richardson

Published February 22, 2016

Earlier this month, TIME published a series of slow motion videos of 2016 presidential campaign events in Iowa, overlaid with audio of the candidates speaking against a haunting background noise. As one of the technicians who worked on the series, I’ve had several people ask me about that underlying audio. What is it? Is that music? How did you do that?

While we were discussing the approach to audio in The Candidates, filmmaker Christopher Morris described his vision of how it should sound. He wanted audio of each candidate’s event: the sound of the crowd and of the candidate speaking. But because the camera would slow down the action, he wanted to slow down the audio as well.

The camera would be filming at 720 frames per second (fps), while typical playback speed is 24 fps. This means that 1 second of action captured in real time becomes 30 seconds of footage when played back. There would be some use of standard-speed audio of the candidates speaking, but to complement the camera’s visual aesthetic, we wanted background audio that was slowed to the same degree that the camera was slowing the visual action. So we needed to slow down the audio by 30 times.

There are two main approaches to lengthening audio. First, you can simply slow down the playback, much like playing a record or cassette tape at a slower speed. But this also lowers the pitch of the sound, in our case by about five octaves, far too low to be practical. The other way is to repeat tiny chunks of the audio to lengthen its playback time without altering the pitch. This is the method used in most editing programs like Final Cut Pro. However, this introduces noticeable artifacts after a speed change of around 20 to 50 percent, and we needed 3000 percent. So neither of these traditional methods would work for The Candidates. What would?
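As a quick check on those numbers, here is a small Python sketch of the arithmetic. The variable names are illustrative, and the 440 Hz tone is just an example I picked, not something from our recordings. Halving playback speed drops the pitch by one octave, so the drop in octaves is the base-2 logarithm of the slowdown.

```python
import math

capture_fps = 720    # camera frame rate
playback_fps = 24    # typical playback frame rate

# How many times longer the footage (and audio) must become.
slowdown = capture_fps / playback_fps

# Each halving of speed lowers the pitch by one octave.
octaves_dropped = math.log2(slowdown)

# Example tone (an assumption for illustration): concert A at 440 Hz.
concert_a_hz = 440.0
slowed_a_hz = concert_a_hz / slowdown

print(slowdown)                   # 30.0
print(round(octaves_dropped, 2))  # 4.91
print(round(slowed_a_hz, 1))      # 14.7
```

A 440 Hz tone slowed this way lands below roughly 20 Hz, the commonly cited lower limit of human hearing, which is why simple slowed playback was not an option.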

A few years ago, the WNYC show Radiolab aired an episode on the subject of time. One segment was about a 24-hour version of Beethoven’s 9th Symphony. Stretching the music creates a striking and ethereal sonic landscape. It allows the listener to become fully immersed in the moment-to-moment complexity of the piece. A quick online search will reveal many works of music transformed in this way, including several familiar pop songs. Most are unrecognizable and eerily detached from their original versions, but are endlessly meditative and remarkable.

It turns out this type of audio time-stretching is accomplished with an open-source software application called PaulStretch. It can lengthen any sound, not just music, and lengthened human speech often takes on an intensified musical quality. This seemed ideal for The Candidates, where the slowed-down audio of the candidates’ speeches would provide us with an abstracted sense of place, but also serve as a soundtrack for each film.

For the first few events, we only had enough time between shots to record perhaps 10 to 15 minutes of ambient audio. That is more than enough for a 5 to 10 minute film, but we wanted longer recordings so we would have more variety to best match the visual content. So at later events, we recorded the entire time we were there, with a separate track set aside for the feed from the public address system. The direct feed was much more effective at bringing out the inner musicality of the speaker’s voice.

At this point, you might be wondering how PaulStretch works. Basically, it looks at individual portions of the sound one at a time. For convenience, we’ll refer to this portion of sound as a “window.” For each window, we get an impression of the frequency spectrum of that sound. For example: are there a lot of high-pitched sounds or low-pitched sounds? The longer the window, the more the sound will smear, much like a long camera exposure might cause an image to smear from motion blur. In The Candidates, we used a window that was around one second in length. To create the sense of smearing, we have to detach the frequency information further from where it exists in time. So we take that component of the information and randomize it. Now that we have a smeared sonic impression of the window, we play it back. So far we have one second of input audio and one second of output audio.

But this doesn’t seem like it would stretch the sense of time, does it? Here’s how the sound is lengthened: For the next window, rather than moving forward in the audio file by one second, we move forward by only a fraction of a second. We’re actually looking at much of the same information again, having lost a small amount at the beginning of the window and gained a little at the end. Even though we only moved forward by a fraction of a second in the original input audio, we’ve generated another full second of output audio. This process is repeated over and over as we move incrementally through the file – and consequently the audio coming out is much longer than the audio going in.
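For readers who think better in code, the steps above can be sketched in a few dozen lines of Python. This is my own minimal illustration, not the PaulStretch source: the function names, the Hann-shaped window, the tiny 64-sample window size, and the half-window crossfade between successive outputs are all assumptions made to keep the example short. It uses only the standard library, so the FFT is a simple recursive one, and it skips the loudness correction a real tool would apply.

```python
import cmath
import math
import random


def fft(values):
    """Recursive radix-2 FFT; len(values) must be a power of two."""
    n = len(values)
    if n == 1:
        return [complex(values[0])]
    even = fft(values[0::2])
    odd = fft(values[1::2])
    out = [0j] * n
    for k in range(n // 2):
        twiddle = cmath.exp(-2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + twiddle
        out[k + n // 2] = even[k] - twiddle
    return out


def ifft(spectrum):
    """Inverse FFT, computed with the conjugate trick."""
    n = len(spectrum)
    return [v.conjugate() / n for v in fft([s.conjugate() for s in spectrum])]


def smear(chunk, rng):
    """Keep how strong each frequency is, but randomize where it sits in time."""
    n = len(chunk)
    magnitudes = [abs(v) for v in fft(chunk)]
    spectrum = [0j] * n
    spectrum[0] = complex(magnitudes[0])
    spectrum[n // 2] = complex(magnitudes[n // 2])
    for k in range(1, n // 2):
        value = cmath.rect(magnitudes[k], rng.uniform(0.0, 2.0 * math.pi))
        spectrum[k] = value
        spectrum[n - k] = value.conjugate()  # mirror so the result is real
    return [v.real for v in ifft(spectrum)]


def stretch(samples, factor, window_size=64, seed=0):
    """Lengthen `samples` by roughly `factor` without lowering the pitch."""
    rng = random.Random(seed)
    half = window_size // 2
    window = [0.5 - 0.5 * math.cos(2.0 * math.pi * i / window_size)
              for i in range(window_size)]
    # Each window adds `half` output samples but advances the input
    # by only half / factor samples.
    n_windows = math.ceil(len(samples) * factor / half)
    output = []
    tail = [0.0] * half
    for k in range(n_windows):
        start = int(k * half / factor)
        chunk = list(samples[start:start + window_size])
        chunk += [0.0] * (window_size - len(chunk))  # pad the final windows
        smeared = smear([c * w for c, w in zip(chunk, window)], rng)
        smeared = [s * w for s, w in zip(smeared, window)]
        # Crossfade with the second half of the previous window.
        output.extend(t + s for t, s in zip(tail, smeared[:half]))
        tail = smeared[half:]
    return output


# A short test tone, stretched 30 times.
tone = [math.sin(2.0 * math.pi * 8 * i / 256) for i in range(256)]
stretched = stretch(tone, factor=30)
print(len(tone), len(stretched))  # 256 7680
```

The key line is the one that computes `start`: the input position creeps forward by a fraction of a window while the output grows by half a window each time, which is where the extra length comes from.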

If you’ve made it this far and you’re curious to look at the Python version of the app, it’s astonishingly simple: a testament both to the brevity of the Python programming language and to the clarity of the algorithm itself.

A few tech specs: we used the Zoom H6 recorder with an XY microphone, and recorded everything at 24-bit 96 kHz for maximum flexibility. We ended up converting all the files to 16-bit 48 kHz for consistency with Final Cut Pro. For the PaulStretch conversions, we used the same settings for all the films and stretched them by 30 times, exactly the same as the camera is doing with the images. So for a 15-minute recording, we would have 7.5 hours of stretched audio. Plenty (if not too much) for the filmmaker, Chris, to select from.
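To give a sense of scale, here is a rough back-of-the-envelope sketch in Python. The duration follows directly from the numbers above. The file size is an estimate that assumes uncompressed two-channel audio, which is my assumption for illustration (an XY microphone records in stereo), and the variable names are made up.

```python
recording_minutes = 15
stretch_factor = 30
sample_rate_hz = 48_000
bytes_per_sample = 2   # 16-bit audio
channels = 2           # assumption: stereo from the XY microphone

stretched_seconds = recording_minutes * 60 * stretch_factor
stretched_hours = stretched_seconds / 3600

# Uncompressed size: seconds x samples per second x bytes x channels.
size_bytes = stretched_seconds * sample_rate_hz * bytes_per_sample * channels

print(stretched_hours)    # 7.5
print(size_bytes / 1e9)   # 5.184
```

Under those assumptions, a single 15-minute recording turns into about 5 GB of stretched audio.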

To choose the slowed-down audio we wanted, I left it playing on my laptop while he was editing. Every once in a while, he’d ask me to make a note of where the playback was, or to find another section to try out. Then he put these in Final Cut and we made selections of the real-time audio of the candidate that would layer briefly above the stretched audio during the first few moments of each film.

We were immediately moved by the effect: a slow, reverberant echo of the candidates’ voices. It was both jarring and oddly familiar – a striking complement to the powerful images Chris had filmed – and a bridge between the known and the unseen.

Edward Richardson specializes in high speed cinematography with the Phantom camera. On projects where existing technologies are inadequate, he designs and builds ranges of hardware, software, and workflow solutions.