February 22, 2022
Utilizing key point and pose estimation for the task of autonomous driving
Imagine: you’re driving a car and you see a person standing on the corner of a street. How do you know if they're going to cross? Interpreting another road user's intent and actions can be challenging and complex, even for human drivers. It is a driver's job to gauge whether another road user wants you to wait and let them cross, whether they are waiting to cross after you pass, or whether they are just standing there for a different reason. Even then, what an individual signals might differ from the action they actually take. To help navigate these nuanced situations, one of the important signals the Waymo Driver uses is key points.
Making more accurate and efficient models with less compute
Key points are a simplified way to represent the complex human form using a limited number of points, which typically correspond to body joints.
Key points are a compact and structured way to convey the human pose information otherwise encoded in pixels and lidar scans. These points help the Waymo Driver gain a deeper understanding of an individual's actions and intentions, like whether they’re planning to cross the street. For example, a person's head direction often indicates where they plan to go, whereas a person's body orientation tells you which direction they are already heading. While the Waymo Driver can recognize a human's behavior directly from camera and lidar data without using key points, pose estimation also teaches it to understand different patterns, like a person propelling a wheelchair, and to correlate them with a predictable future action rather than with a specific object, such as the wheelchair itself.
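To make the idea of body orientation concrete, here is a minimal Python sketch that estimates which way a torso is facing from two shoulder key points. It is an illustration only, not the Waymo Driver's method: the `KeyPoint` class, the joint names, the `body_heading_deg` function, and the coordinate frame (x forward, y left, z up, in metres) are all assumptions made for this example.

```python
import math
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass(frozen=True)
class KeyPoint:
    """One body joint in a hypothetical frame: x forward, y left, z up (metres)."""
    x: float
    y: float
    z: float
    visible: bool = True  # False when the joint is occluded


# A pose is a mapping from an (assumed) joint name to its key point.
Pose = Dict[str, KeyPoint]


def body_heading_deg(pose: Pose) -> Optional[float]:
    """Return the torso's facing direction in the ground plane, in degrees.

    0 means facing +x, 90 means facing +y. Returns None when either
    shoulder is missing or occluded, or when the shoulders coincide.
    """
    left = pose.get("left_shoulder")
    right = pose.get("right_shoulder")
    if left is None or right is None or not (left.visible and right.visible):
        return None

    # Vector along the shoulder line, from right shoulder to left shoulder.
    sx, sy = left.x - right.x, left.y - right.y
    if sx == 0 and sy == 0:
        return None

    # The facing direction is the shoulder line rotated by -90 degrees:
    # (sx, sy) -> (sy, -sx).
    return math.degrees(math.atan2(-sx, sy))


# A person facing +x has their left shoulder on the +y side.
facing_forward: Pose = {
    "left_shoulder": KeyPoint(0.0, 0.2, 1.4),
    "right_shoulder": KeyPoint(0.0, -0.2, 1.4),
}
print(body_heading_deg(facing_forward))
```

The sketch returns `None` instead of guessing when a shoulder is hidden, which reflects the fact that real scenes often show only part of a person.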
Applying state of the art technology to the autonomous driving domain
Until now, key points have mostly been used in relatively controlled environments, where they are easier to apply, such as placing an augmented dinosaur next to a single person or filming a set number of actors to control a video game. The Waymo Driver generates key points in the "wild" for all nearby road users, which is orders of magnitude harder: our Driver often encounters up to hundreds of pedestrians at a single intersection, many of whom can be occluded by other objects.
The Waymo Driver feeds real-time data from our sensor suite, including our lidars, into neural-network models that localize key points in three-dimensional space. Waymo created its own methodologies for generating high-quality labels that identify joints in 3D space, which made it possible to train human pose models that further improve the safety of the Waymo Driver. Waymo's key point technology doesn’t identify an individual person. Instead, it aggregates data points to give us a better capability to recognize that a person exists and where they may be going, which is especially beneficial for partially visible pedestrians who might be stepping out of a vehicle or sitting near the road. Additionally, we’ve optimized our system to run onboard the vehicle in real time, with high precision and low latency, to enhance its behavior-prediction models and allow the Waymo Driver to quickly and safely handle any situation.
Crowds
Gestures
Occlusions
Cities are dense environments, and that density creates its own challenges. Narrow city streets are lined with cars and large crowds of people, and people and objects are often blocked or hidden, with pedestrians walking out of buildings or popping out from behind vehicles. With the addition of key points, the Waymo Driver can better recognize and understand partially occluded people, such as when only a leg or arm is visible as someone steps out of a vehicle or when a person is hidden between two vehicles, and reason about their next move.
Indecisive pedestrians
Key points are an enabler of Waymo’s autonomous driving stack, from perception through behavior prediction, allowing our Driver to safely get people and things where they are going. If you’re interested in learning more about Waymo’s work on key points, check out this paper.