Imagine you are shooting a movie in which some scenes are filmed against a green background, and during editing the green background is replaced with other scenery. Google took this idea and implemented it for YouTube Stories using a neural network.
This “video segmentation” tool is rolling out to YouTube Stories on mobile in a limited fashion starting now for beta testers only. Yes, if you see it on your mobile device, you are a beta tester.
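The background swap described above comes down to a per-pixel blend once a segmentation mask is available. The following is a minimal sketch, not Google's implementation: it assumes grayscale frames stored as nested Python lists and a mask with values between 0 and 1, and the name `composite` is made up for illustration.

```python
def composite(foreground, background, mask):
    """Blend two same-sized grayscale frames using a per-pixel mask in [0, 1].

    A mask value of 1.0 keeps the original pixel (the person);
    a mask value of 0.0 shows the replacement background instead.
    """
    return [
        [m * f + (1.0 - m) * b for f, b, m in zip(f_row, b_row, m_row)]
        for f_row, b_row, m_row in zip(foreground, background, mask)
    ]


# Tiny 2x2 example: the left column is "person", the right column is "background".
frame = [[200, 50], [180, 60]]
new_background = [[0, 255], [0, 255]]
mask = [[1.0, 0.0], [1.0, 0.0]]

result = composite(frame, new_background, mask)
print(result)  # [[200.0, 255.0], [180.0, 255.0]]
```

Fractional mask values give soft edges, which matters around hair and shoulders where a hard cutout would look jagged.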
It was not easy to implement. Google developers designed a neural network to achieve this using machine learning, but there were many constraints:
- A mobile solution should be lightweight and run at least 10-30 times faster than existing state-of-the-art photo segmentation models. For real-time inference, such a model needs to provide results at 30 frames per second.
- A video model should leverage temporal redundancy (neighbouring frames look similar) and exhibit temporal consistency (neighbouring results should be similar).
- High-quality segmentation results require high-quality annotations.
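One way to make the "temporal consistency" constraint concrete is to measure how much the masks of two consecutive frames differ. The metric below is an illustrative assumption rather than anything the source describes: `mask_change` is a hypothetical helper that returns the mean absolute per-pixel difference, so 0.0 means the two masks are identical and larger values mean more flicker.

```python
def mask_change(prev_mask, curr_mask):
    """Mean absolute per-pixel difference between two same-sized masks."""
    total = 0.0
    count = 0
    for p_row, c_row in zip(prev_mask, curr_mask):
        for p, c in zip(p_row, c_row):
            total += abs(p - c)
            count += 1
    return total / count


steady = mask_change([[1.0, 0.0], [1.0, 0.0]], [[1.0, 0.0], [1.0, 0.0]])
flicker = mask_change([[1.0, 0.0], [1.0, 0.0]], [[1.0, 1.0], [1.0, 0.0]])
print(steady, flicker)  # 0.0 0.25
```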
The network learned to pick out the common features of a head and shoulders, and a series of optimizations lowered the amount of data it needed to crunch in order to do so. And — although it’s cheating a bit — the result of the previous calculation (so, a sort of cutout of your head) gets used as raw material for the next one, further reducing load.
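The reuse of the previous result can be sketched as a simple loop. This sketch assumes the previous mask is supplied to the network as an extra input channel next to the RGB frame; the text above only says the earlier result is reused as input. `toy_model` is a hypothetical stand-in for the real network, and all function names are illustrative.

```python
def empty_mask(height, width):
    """All-zero mask, used before any frame has been segmented."""
    return [[0.0] * width for _ in range(height)]


def stack_input(frame_rgb, prev_mask):
    """Append the previous mask as a fourth value: each pixel becomes (R, G, B, M)."""
    return [
        [(r, g, b, m) for (r, g, b), m in zip(f_row, m_row)]
        for f_row, m_row in zip(frame_rgb, prev_mask)
    ]


def segment_video(frames, model):
    """Run `model` frame by frame, feeding each output mask into the next step."""
    height, width = len(frames[0]), len(frames[0][0])
    prev_mask = empty_mask(height, width)  # nothing is known before the first frame
    masks = []
    for frame in frames:
        prev_mask = model(stack_input(frame, prev_mask))
        masks.append(prev_mask)
    return masks


def toy_model(stacked):
    """Hypothetical stand-in for the network: bright red pixels are foreground,
    other pixels keep half of their previous mask value."""
    return [[1.0 if r > 128 else 0.5 * m for (r, g, b, m) in row] for row in stacked]


# Two 1x2 frames: the subject is visible in the first frame only.
frames = [
    [[(200, 0, 0), (0, 0, 0)]],
    [[(0, 0, 0), (0, 0, 0)]],
]
masks = segment_video(frames, toy_model)
print(masks)  # [[[1.0, 0.0]], [[0.5, 0.0]]]
```

The second mask still carries a trace of the first, which is the point: each step starts from the previous answer instead of from scratch.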
To do this, Google needed to train the network on tens of thousands of pictures so that it could learn to recognize foreground poses and patterns, including glasses and other elements such as lips, head, and hair.
Training the network was not simple either: it had to achieve frame-to-frame temporal continuity while also accounting for temporal discontinuities, such as people suddenly appearing in the camera's field of view. To train the model to handle those cases robustly, the annotated ground truth of each photo was transformed in several ways and used as a previous-frame mask:
- Empty previous mask – Trains the network to work correctly for the first frame and new objects in the scene. This emulates the case of someone appearing in the camera’s frame.
- Affine transformed ground truth mask – Minor transformations train the network to propagate and adjust to the previous frame mask. Major transformations train the network to understand inadequate masks and discard them.
- Transformed image – Thin plate spline smoothing of the original image emulates fast camera movements and rotations.
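The first two mask transformations in the list above can be sketched as a small sampling function. This is a simplified illustration under several assumptions: a plain translation stands in for a general affine transform, the probabilities and shift sizes are invented, and the thin plate spline step applied to the image is left out entirely. All names are hypothetical.

```python
import random


def empty_like(mask):
    """All-zero mask with the same shape as `mask`."""
    return [[0.0 for _ in row] for row in mask]


def translate(mask, dx, dy):
    """Shift a mask by (dx, dy) pixels, filling uncovered pixels with 0.0.

    A translation is the simplest affine transform.
    """
    height, width = len(mask), len(mask[0])
    out = [[0.0] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            src_x, src_y = x - dx, y - dy
            if 0 <= src_x < width and 0 <= src_y < height:
                out[y][x] = mask[src_y][src_x]
    return out


def make_previous_mask(ground_truth, rng, p_empty=0.25, p_major=0.25):
    """Pick one simulated previous-frame mask for a training photo.

    p_empty and p_major are assumed values, not figures from the source.
    """
    roll = rng.random()
    if roll < p_empty:
        # First frame, or a person who has just entered the scene.
        return "empty", empty_like(ground_truth)
    if roll < p_empty + p_major:
        # Large shift: a mask that no longer fits and should be discarded.
        shift = rng.choice([-4, -3, 3, 4])
        return "major", translate(ground_truth, shift, shift)
    # Small shift: a mask the network should propagate and adjust.
    shift = rng.choice([-1, 0, 1])
    return "minor", translate(ground_truth, shift, 0)


rng = random.Random(0)
ground_truth = [[0.0, 1.0, 1.0, 0.0], [0.0, 1.0, 1.0, 0.0]]
kind, previous_mask = make_previous_mask(ground_truth, rng)
```

Each training example would then pair the photo with `previous_mask` as input and keep `ground_truth` as the target.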
The result is a fast, relatively accurate segmentation engine that runs more than fast enough to be used in video — 40 frames per second on the Pixel 2 and over 100 on the iPhone 7.
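To put those frame rates in context, they can be converted to time per frame and compared with the 30 frames per second needed for real-time inference. The short calculation below uses only the figures quoted in this article; `frame_time_ms` is an illustrative helper name.

```python
def frame_time_ms(fps):
    """Time available (or spent) per frame, in milliseconds, at a given frame rate."""
    return 1000.0 / fps


REAL_TIME_FPS = 30
reported_fps = {"Pixel 2": 40, "iPhone 7": 100}

budget = frame_time_ms(REAL_TIME_FPS)
for device, fps in reported_fps.items():
    print(f"{device}: {frame_time_ms(fps):.1f} ms per frame (budget {budget:.1f} ms)")
```

At 40 frames per second each frame takes 25 ms, and at 100 frames per second it takes 10 ms, both inside the roughly 33 ms available at 30 frames per second.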