Apollo Synthetic Dataset

Apollo Synthetic is a photo-realistic synthetic dataset for autonomous driving. It contains 273k distinct images (not continuous frames from a video) from various virtual scenes of high visual fidelity, including highway, urban, residential, downtown, and indoor parking garage environments. These virtual worlds were created using the Unity 3D engine. The biggest advantage of a synthetic dataset is the precise ground truth it provides. Another benefit is greater environmental variation (which is relatively harder and costlier to achieve in the real world), such as different times of day, weather conditions, traffic / obstacles, and road surface qualities. Our dataset provides an extensive set of ground truth data: 2D/3D object data, semantic/instance-level segmentation, depth, and 3D lane line data.


Dynamic variations

Time of Day
Road Degradation


| Dataset | Year | Synthetic? | # labeled frames | Resolution | Diversity | Ground truth | Supported sensors |
| --- | --- | --- | --- | --- | --- | --- | --- |
| VKITTI | 2016 | Yes | 21k | 1242x375 | 5 urban scenes under different imaging and weather conditions | 2D/3D box, semantic/instance-level segmentation, optical flow, depth | Camera |
| Synthia | 2016 | Yes | 213k | 1280x760 | Urban / highway / green area scenes under different times of day / weather conditions / seasons | Semantic segmentation, depth | Camera |
| FCAV | 2017 | Yes | 200k | 1914x1052 | Diverse scenes from GTA 5 under different times of day / weather conditions | 2D box, segmentation | Camera |
| Playing for Benchmarks | 2017 | Yes | 250k | 1920x1080 | Diverse scenes from GTA 5 under different times of day / weather conditions | 2D/3D box, semantic/instance-level segmentation, optical flow | Camera |
| ApolloScape | 2018 | No | 144k | 3384x2710 | 4 regions in China under different times of day / weather conditions | Semantic/instance-level segmentation, depth, 3D semantic point cloud | Camera, Lidar |
| BDD100k | 2018 | No | 100k | 1280x720 | 4 regions in US under different times of day / weather conditions | 2D box, semantic/instance-level segmentation | Camera |
| nuScenes | 2019 | No | 40k | 1600x900 | Boston / Singapore under different times of day / weather conditions | 3D box with semantics | Camera, Lidar, Radar |
| Apollo Synthetic | 2019 | Yes | 273k | 1920x1080 | Highway / urban / residential scenes under different times of day / weather conditions / road qualities and an indoor parking garage scene | 2D/3D box, semantic/instance-level segmentation, depth, 3D lane line | Camera |

Data format

RGB image

JPG files in HD resolution (1920x1080)

Segmentation image

HD-resolution PNGs with a color-encoding text file

The PNG files contain per-pixel semantic and instance-level segmentation.

An encoding text file (one per variation) contains one line per category, formatted as ‘<category>[:<tid>] <R> <G> <B>’, where ‘<category>’ is the name of the semantic category, ‘<tid>’ is an optional integer track identifier that differentiates instances of the same category (only vehicles and pedestrians have this instance distinction), and ‘<R> <G> <B>’ is the color encoding of that label in the corresponding ground truth images.
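A minimal sketch of parsing one line of this encoding file (the function name is ours, not part of the dataset):

```python
def parse_encoding_line(line: str):
    """Parse one '<category>[:<tid>] <R> <G> <B>' line of the color-encoding file."""
    label, r, g, b = line.split()
    if ":" in label:
        category, tid = label.split(":")
        tid = int(tid)  # track identifier, only present for instance categories
    else:
        category, tid = label, None
    return category, tid, (int(r), int(g), int(b))
```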

Supported semantic categories (vehicle and pedestrian categories have instance distinction)

  • Coupe
  • SUV
  • Hatchback
  • Van
  • PickupTruck
  • Truck
  • Bus
  • Cyclist
  • Motorcyclist
  • Pedestrian
  • TrafficCone
  • Barricade
  • Road
  • LaneMarking
  • TrafficSign
  • TrafficLight
  • Sidewalk
  • GuardRail
  • Sky
  • Terrain
  • Pole
  • StreetLight
  • Building
  • Vegetation

Depth image

An HD-resolution PNG whose R / G channels contain 16-bit depth information.

Depth values are distances to the camera plane obtained from the z-buffer (https://en.wikipedia.org/wiki/Z-buffering). They correspond to the z coordinate of each pixel in camera coordinate space (not the distance to the camera's optical center). We assume a fixed far plane of 655.35 meters, i.e. points at infinity such as sky pixels are clipped to a depth of 655.35 m. This allows us to truncate and normalize the z values to the [0; 2^16 − 1] integer range with 1 cm precision. This 16-bit value is encoded into the red / green channels of a PNG file. You can decode the depth as (R + G / 255.0) * 65536.0, where R / G are the normalized float values ([0.0; 1.0]) of the pixel's red / green channels.

(Example depth images: red channel only, green channel only.)
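The decoding formula above can be sketched in Python; we assume the depth PNG has been read into an HxWx3 uint8 array:

```python
import numpy as np

def decode_depth_cm(rgb):
    """Decode depth (in centimeters) from an HxWx3 uint8 array read from a depth PNG."""
    r = rgb[..., 0].astype(np.float64) / 255.0  # normalized red channel
    g = rgb[..., 1].astype(np.float64) / 255.0  # normalized green channel
    return (r + g / 255.0) * 65536.0

def decode_depth_m(rgb):
    """Same, in meters (far plane at 655.35 m)."""
    return decode_depth_cm(rgb) / 100.0
```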

Remarks about 3D information

The images are rendered using the following camera intrinsics:

    •	Resolution: 1920x1080
    •	Vertical FOV: 30°
    •	K = [[2015.0,      0, 960.0],
             [     0, 2015.0, 540.0],
             [     0,      0,     1]]

A right-handed coordinate system is used. In our 3D camera coordinates, x points right, y points down, and z points forward (the origin is the optical center of the camera). In world space, Y is up.
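As a sketch, a 3D point in these camera coordinates can be projected to pixel coordinates with the intrinsics K above (function name is ours):

```python
import numpy as np

# Camera intrinsics from the section above (1920x1080, 30° vertical FOV)
K = np.array([[2015.0,    0.0, 960.0],
              [   0.0, 2015.0, 540.0],
              [   0.0,    0.0,   1.0]])

def project_to_pixel(point_cam):
    """Project (x, y, z) in camera coordinates (x right, y down, z forward, z > 0)
    to (u, v) pixel coordinates with the origin at the top-left."""
    p = K @ np.asarray(point_cam, dtype=float)
    return p[0] / p[2], p[1] / p[2]
```

A point straight ahead of the camera, e.g. (0, 0, 10), projects to the image center (960, 540).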

Object ground truth

Our object ground truth is similar to KITTI's, but there are some differences you need to be aware of. The space-separated fields in each row have the following meanings:

    1 frame id
    1 object id
    1 object category (sedan, pedestrian, cyclist, ...)                       ====== KITTI compatible part start
    1 KITTI-like truncation flag (0.0 ~ 1.0) => from non-truncated to truncated
    1 KITTI-like occlusion flag (0 = fully visible, 1 = partly occluded, 2 = largely occluded, 3 = unknown)
    1 KITTI-like observation angle in radian
    4 2D bounding box [left, top, right, bottom]
    3 dimension [height, width, length]
    3 object center in camera coordinates [X, Y, Z]
    1 Y rotation in camera space (in radian)                                  ====== KITTI compatible part end
    1 occlusion percentage (0.0 ~ 1.0) => fully visible to fully blocked
    2 horizontal/vertical truncation [truncation_x, truncation_y]
    3 object's rotation in camera space in Euler angles [X, Y, Z] (in radian)
    3 velocity in world coordinates [X, Y, Z]
    3 object center in world coordinates [X, Y, Z]
    3 object's rotation in world space in Euler angles [X, Y, Z] (in radian)

(The Euler angle order is Z-X-Y, using extrinsic rotations.)
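The 32 fields listed above can be parsed with a sketch like the following (the dictionary keys are our own names, not part of the format):

```python
def parse_object_row(row: str) -> dict:
    """Parse one space-separated row of the object ground truth into named fields."""
    f = row.split()
    return {
        "frame_id": int(f[0]),
        "object_id": int(f[1]),
        "category": f[2],
        "truncation": float(f[3]),                      # 0.0 ~ 1.0
        "occlusion_flag": int(f[4]),                    # 0..3, KITTI-like
        "alpha": float(f[5]),                           # observation angle (rad)
        "bbox_2d": [float(x) for x in f[6:10]],         # left, top, right, bottom
        "dimensions": [float(x) for x in f[10:13]],     # height, width, length
        "center_cam": [float(x) for x in f[13:16]],
        "rotation_y": float(f[16]),                     # end of KITTI-compatible part
        "occlusion_pct": float(f[17]),                  # 0.0 ~ 1.0
        "truncation_xy": [float(x) for x in f[18:20]],
        "rotation_cam": [float(x) for x in f[20:23]],   # Euler XYZ (rad)
        "velocity_world": [float(x) for x in f[23:26]],
        "center_world": [float(x) for x in f[26:29]],
        "rotation_world": [float(x) for x in f[29:32]],
    }
```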

3D lane line ground truth

The visible portion of each lane line is sampled regularly (every 1 m) along its 3D length and output as a sequence of points. The lines represent the inner boundaries of the lane markings from the ego vehicle's perspective. For each line, its samples are listed one per row in the direction of the ego's travel. Each row, representing a single point, has the following space-separated fields:

    1 global id
    1 lane marker type (SingleSolid, SingleDash, DoubleSolid, DoubleDash, LeftDashRightSolid, LeftSolidRightDash, Curb, Imaginary, Other)
    2 normalized pixel position of this lane point sample (the origin at top-left)
    1 ego-centric lane index (-4 ~ 4, e.g. -1 means the left boundary of the ego lane and 1 means the right boundary of the ego lane.)
    1 lane topology type (ForkLaneLeft, ForkLaneRight, MergeLaneLeft, MergeLaneRight, ParkingLane, CenterLane)
    1 lane marker color (White, Yellow)
    3 3D position of this sample in camera coordinates
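A sketch of parsing one lane-point row per the field listing above (key names are ours):

```python
def parse_lane_point_row(row: str) -> dict:
    """Parse one space-separated row of the 3D lane line ground truth."""
    f = row.split()
    return {
        "global_id": int(f[0]),
        "marker_type": f[1],                              # e.g. SingleSolid, Curb
        "pixel_uv": (float(f[2]), float(f[3])),           # normalized, origin top-left
        "ego_lane_index": int(f[4]),                      # -4 ~ 4
        "topology": f[5],                                 # e.g. CenterLane
        "color": f[6],                                    # White or Yellow
        "position_cam": (float(f[7]), float(f[8]), float(f[9])),
    }
```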

Camera pose

Each row consists of the frame index followed by the row-wise flattened 4×4 camera extrinsic matrix for that frame (again space-separated):

        r1,1 r1,2 r1,3 t1
    M = r2,1 r2,2 r2,3 t2
        r3,1 r3,2 r3,3 t3
           0    0    0  1

where ri,j are the coefficients of the camera rotation matrix R and ti are the coefficients of the camera translation vector T.
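A sketch of reading one camera-pose row back into a 4×4 matrix (function name is ours):

```python
import numpy as np

def parse_pose_row(row: str):
    """Parse 'frame_index m11 m12 ... m44' into (frame, 4x4 extrinsic matrix).

    The 16 matrix entries are row-wise flattened, so reshape(4, 4) restores them.
    """
    f = row.split()
    frame = int(f[0])
    M = np.array([float(x) for x in f[1:17]]).reshape(4, 4)
    return frame, M
```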

Download Data

  • Sample (4.87GB)

Terms of Use