I can tell you I don't have money, but what I do have are a very particular set of skills (read in my best Liam Neeson voice). Most of my career since 2017 has been about pushing what is possible with machine learning and computer vision at the edge of computing. Perhaps nothing has tested those skills more than when I was approached by Nickelodeon to build something special for the Super Bowl.
Let me start with where I ended up: I built a first-of-its-kind application that could do realtime AI + AR on live TV at 4K/60fps. Oh, and it ran through my laptop. Oh, and it had the lowest latency of any visual pass-through system in CBS’s TV production trailers. I take great pride in what I built, but alas, the powers that be decided to not risk it on the greatest stage of all.
About a year ago, I was asked if anything like what I described above was technically possible. And like every overconfident engineer, I replied, anything is possible. But they didn’t believe me. Perhaps expecting I would be deterred by an extremely tight deadline, I was given 3 weeks to build a full prototype and prove it worked live at an NFL preseason game. Here’s how I built it, and indeed proved, that anything is possible (stop rolling your eyes!).
Building the thing
The critical design factor was achieving an end-to-end latency of under 16 milliseconds (one frame at 60 fps is about 16.7 ms). So starting with the hardware, I decided to build this for Apple Silicon. Perhaps an odd choice in a world dominated by NVIDIA and PCs, but my primary reasoning was having a predictable, unified stack instead of gluing together separate frameworks for video processing, ML inference, rendering, and so on. I also had a lot of experience building on Swift/CoreML for iPhone, so I was hoping to carry that over to the Mac. The only gap was decoding and re-encoding the video signal from the production trailer, so I settled on a Thunderbolt-powered capture box from Blackmagic Design to handle that.
Ultimately, what my software had to do was decode frames from the Blackmagic box out of a Y'UV TV broadcast format, run inference with some kind of ML detection model, track the results, render the augmented scene into an RGB Metal buffer, then convert it back into Y'UV with a perfect color match to the original signal. All in less time than it’d take a mantis shrimp to punch some unwitting sea creature in the face.
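Before diving into each step, here’s the shape of that per-frame pipeline as I’d sketch it in Swift. The protocol, the stage names, and the Detection type are illustrative stand-ins, not the actual code:

import CoreGraphics
import CoreVideo
import Metal

// Illustrative outline of the per-frame pipeline; names are stand-ins, not the real implementation.
protocol FramePipeline {
    func decodeToRGB(_ yuvFrame: CVPixelBuffer) throws -> MTLTexture   // 1. Y'UV -> RGB (Metal shader)
    func detect(in frame: MTLTexture) throws -> [Detection]            // 2. CoreML inference
    func track(_ detections: [Detection])                              // 3. multi-object tracking
    func render(onto frame: MTLTexture) throws -> MTLTexture           // 4. AR overlay (SpriteKit/SceneKit/Metal)
    func encodeToYUV(_ rgbFrame: MTLTexture) throws -> CVPixelBuffer   // 5. RGB -> Y'UV, color matched
}

// Minimal detection record passed between stages.
struct Detection {
    let boundingBox: CGRect
    let label: String
    let confidence: Float
}

The whole protocol, run once per frame, has to fit inside the sub-16 ms budget above.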
1. Color Conversion
There are usually a few ways to skin a cat with Apple frameworks, ranging from the simplest (and generally slowest) to the most complex and low-level (and generally fastest). You could convert an entire video frame to a different colorspace using the AVFoundation or Accelerate frameworks, but I opted to do it with a Metal shader. For one thing, Metal is very fast; the only slow part is copying the frame from CPU to GPU, but I’d need to do that eventually to render the scene anyway. Converting the buffer back to Y'UV was similar, though every other column of pixels encodes a different part of the color. Example Metal shader functions for color conversion:
// Convert one Y'UV sample (chroma centered at 0.5) to RGBA.
float4 yuvToRgb(float3 yuv) {
    float3 rgb;
    float Y = yuv.x;
    float U = yuv.y - 0.5;
    float V = yuv.z - 0.5;
    rgb.r = Y + 1.402 * V;
    rgb.g = Y - 0.344136 * U - 0.714136 * V;
    rgb.b = Y + 1.772 * U;
    return float4(rgb, 1.0);
}
// Pack RGB back into 4:2:2 Y'UV: every pixel keeps its luma, while
// even columns carry Cb (U) and odd columns carry Cr (V).
float2 rgbaToYUV(float2 uv, float w, float4 col) {
    int column = int(w * uv.x);
    bool isOdd = column % 2 == 1;
    // BT.709 (HD) video-range coefficients
    float Y = 0.183 * col.r + 0.614 * col.g + 0.062 * col.b + 0.0625;
    float U = -0.101 * col.r - 0.338 * col.g + 0.439 * col.b + 0.5;
    float V = 0.439 * col.r - 0.399 * col.g - 0.040 * col.b + 0.5;
    if (isOdd) return float2(V, Y);
    return float2(U, Y);
}
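These are helper functions rather than entry points; they get called from a kernel that the Swift side dispatches once per frame. Roughly, that host code looks like this (the kernel name yuvToRgbaKernel and the threadgroup sizing are illustrative):

import Metal

// One-time setup: compile the compute pipeline for the conversion kernel.
// "yuvToRgbaKernel" is a hypothetical kernel that calls yuvToRgb() per pixel.
let device = MTLCreateSystemDefaultDevice()!
let queue = device.makeCommandQueue()!
let library = device.makeDefaultLibrary()!
let pipeline = try! device.makeComputePipelineState(
    function: library.makeFunction(name: "yuvToRgbaKernel")!)

// Per frame: encode a single dispatch covering every pixel of the output texture.
func convert(yuv src: MTLTexture, into dst: MTLTexture) {
    let commandBuffer = queue.makeCommandBuffer()!
    let encoder = commandBuffer.makeComputeCommandEncoder()!
    encoder.setComputePipelineState(pipeline)
    encoder.setTexture(src, index: 0)
    encoder.setTexture(dst, index: 1)

    let threadsPerGroup = MTLSize(width: 16, height: 16, depth: 1)
    let groups = MTLSize(width: (dst.width + 15) / 16,
                         height: (dst.height + 15) / 16,
                         depth: 1)
    encoder.dispatchThreadgroups(groups, threadsPerThreadgroup: threadsPerGroup)
    encoder.endEncoding()
    commandBuffer.commit()
}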
2. Running ML Inference
At the time, the best model suited for this task was YOLOv8. It could handle detection, key points, and instance segmentation. It ported over to CoreML nicely, for the most part. It was low latency and came in a variety of sizes to experiment with. It could handle detecting small objects in a chaotic environment. All I needed to do was label the data I wanted to detect using Roboflow. Even with relatively small initial datasets, I could see that what I was starting to build would work. I built a variety of different models, for example segmenting players’ heads or tracking the football, depending on what the filter in question needed. I had built Snapchat for live sports.
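For the curious, the per-frame inference call is mostly Vision boilerplate once the model is exported. A minimal sketch, assuming the YOLOv8 model was exported to CoreML with NMS included so Vision returns VNRecognizedObjectObservations (PlayerDetector is a placeholder for the generated model class):

import CoreML
import CoreVideo
import Vision

// "PlayerDetector" is a placeholder for the CoreML class generated from the exported YOLOv8 model.
let mlModel = try! PlayerDetector(configuration: MLModelConfiguration()).model
let visionModel = try! VNCoreMLModel(for: mlModel)

func detections(in pixelBuffer: CVPixelBuffer) throws -> [VNRecognizedObjectObservation] {
    let request = VNCoreMLRequest(model: visionModel)
    request.imageCropAndScaleOption = .scaleFill  // set to match the model's preprocessing

    let handler = VNImageRequestHandler(cvPixelBuffer: pixelBuffer, options: [:])
    try handler.perform([request])

    // With NMS baked into the exported model, Vision hands back bounding boxes
    // plus per-class confidences, ready for the tracker.
    return request.results as? [VNRecognizedObjectObservation] ?? []
}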
3. Tracking The Results
An NFL game might be one of the most difficult environments for any kind of multi-object tracking. You have 22 players, half of whom look the same due to helmets and uniforms, moving at full speed and crossing paths, seen from a single camera that’s trying to follow the ball. YOLO could identify the team for me, which cut the matching problem in half (i.e., I only need to match tracking results within the same team). Initially, I tried porting ByteTrack to my application, but found it too slow for realtime. So I created my own tracking algorithm, which was not quite as accurate, but the speed made up for it. I match new predictions against the previous ones based on characteristics like their size and position deltas. Then I take the matched and unmatched results and filter them, removing stale unmatched results (i.e., if I can’t match a result after N frames, I discard it). This lets me lose tracking of a player for a few frames and still recover them under the same ID. My end goal was generally to lock onto a specific player, like the quarterback, and apply an effect to them throughout the play. An operator would choose the player with their mouse and activate the filter with a click.
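That isn’t the exact algorithm, but the core idea is easy to sketch: greedily match each existing track to the nearest new detection from the same team, then age out anything that stays unmatched too long. A simplified version (the types and thresholds here are illustrative):

import CoreGraphics

struct Track {
    let id: Int
    var box: CGRect
    let team: String
    var misses: Int = 0          // consecutive frames without a match
}

struct TrackedDetection {
    let box: CGRect
    let team: String
}

final class GreedyTracker {
    private(set) var tracks: [Track] = []
    private var nextID = 0
    private let maxDistance: CGFloat = 80    // max center movement between frames (px)
    private let maxMisses = 15               // drop a track after this many unmatched frames

    func update(with detections: [TrackedDetection]) {
        var unmatched = detections

        // Match each existing track to the closest same-team detection.
        for i in tracks.indices {
            tracks[i].misses += 1
            guard let j = unmatched.indices
                    .filter({ unmatched[$0].team == tracks[i].team })
                    .min(by: { distance(unmatched[$0].box, tracks[i].box) <
                               distance(unmatched[$1].box, tracks[i].box) }),
                  distance(unmatched[j].box, tracks[i].box) < maxDistance
            else { continue }

            tracks[i].box = unmatched[j].box
            tracks[i].misses = 0
            unmatched.remove(at: j)
        }

        // Leftover detections become new tracks; stale tracks are dropped.
        for det in unmatched {
            tracks.append(Track(id: nextID, box: det.box, team: det.team))
            nextID += 1
        }
        tracks.removeAll { $0.misses > maxMisses }
    }

    private func distance(_ a: CGRect, _ b: CGRect) -> CGFloat {
        let dx = a.midX - b.midX
        let dy = a.midY - b.midY
        return (dx * dx + dy * dy).squareRoot()
    }
}

The real version matches on more signals than center distance (size deltas, for one), but the structure is the same: match, update, spawn, expire.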
4. Rendering The Scene
Fortunately, Apple’s frameworks provide decent, if limited, 2D/3D graphics engines. SpriteKit and SceneKit allowed me to render whatever I wanted directly into my Metal pipeline. And anything they couldn’t handle, I could build custom Metal shaders for (like the Big Head example at the top of this post). With all of the technical stuff out of the way, I got to create a lot of fun filters in the spirit you’d expect from Nickelodeon. Unfortunately I can’t share all of them, but suffice it to say that I believe their Super Bowl broadcast would have been even more fun with them.
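If you’re wondering how SceneKit content gets into a Metal pipeline like this, SCNRenderer can draw directly into a command buffer you already have in flight. A rough sketch, assuming the command buffer and output texture come from the existing per-frame pipeline:

import Metal
import SceneKit

// Create once: a SceneKit renderer that shares the app's Metal device.
let device = MTLCreateSystemDefaultDevice()!
let overlayRenderer = SCNRenderer(device: device, options: nil)
overlayRenderer.scene = SCNScene()   // populated with whatever the filter needs

// Per frame: draw the overlay into the frame's texture via the existing command buffer.
func renderOverlay(into target: MTLTexture,
                   commandBuffer: MTLCommandBuffer,
                   atTime time: TimeInterval) {
    let passDescriptor = MTLRenderPassDescriptor()
    passDescriptor.colorAttachments[0].texture = target
    passDescriptor.colorAttachments[0].loadAction = .load     // keep the video frame underneath
    passDescriptor.colorAttachments[0].storeAction = .store

    overlayRenderer.render(atTime: time,
                           viewport: CGRect(x: 0, y: 0,
                                            width: CGFloat(target.width),
                                            height: CGFloat(target.height)),
                           commandBuffer: commandBuffer,
                           passDescriptor: passDescriptor)
}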
Where to next?
Since coming off of this project I’ve been working on bringing this software, PopFX, to other live events. Who knows where it will end up, but it has certainly been one of the highlights of my career. Feel free to reach out to me with any questions, comments, or opportunities!