Notes on Dense Trajectories

  • Densely sample feature points in each frame.
  • Track the points through the video using optical flow.
  • Compute multiple descriptors along the trajectories of the feature points to capture shape, appearance, and motion information.
  • Dense Sampling

    • Sampling step size \( W=5 \) pixels
    • Number of spatial scales: at most 8
    • Spatial scale increase: a factor of \( 1 / \sqrt{2} \) between consecutive scales
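    • Removing points in homogeneous areas: a sampled point is discarded when the smaller eigenvalue of its auto-correlation matrix falls below the threshold \( T = 0.001 \times \max_{i \in I} \min(\lambda_{i}^{1}, \lambda_{i}^{2}) \), where \( (\lambda_{i}^{1}, \lambda_{i}^{2}) \) are the eigenvalues of the auto-correlation matrix at point \(i\) in image \(I\).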
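
The sampling and thresholding steps above can be sketched in a few lines. This is a minimal illustration, not the reference implementation: it assumes the per-pixel value \( \min(\lambda^{1}, \lambda^{2}) \) has already been computed (for example with OpenCV's `cv2.cornerMinEigenVal`) and is passed in as a plain 2D list. The function name `dense_sample`, its parameter names, and the choice to sample at grid-cell centres are hypothetical.

```python
def dense_sample(min_eig, step=5, quality=0.001):
    """Return (x, y) grid points that survive the homogeneity threshold.

    min_eig: 2D list of floats, min(lambda1, lambda2) of the auto-correlation
             matrix at every pixel (assumed to be precomputed elsewhere).
    step:    sampling step size W in pixels.
    quality: the 0.001 factor in T = 0.001 * max_i min(lambda1_i, lambda2_i).
    """
    if not min_eig or not min_eig[0]:
        return []
    height, width = len(min_eig), len(min_eig[0])

    # Threshold T is relative to the strongest response in the whole image.
    threshold = quality * max(max(row) for row in min_eig)

    points = []
    # Visit the centre of each step x step grid cell (an assumption of this sketch).
    for y in range(step // 2, height, step):
        for x in range(step // 2, width, step):
            if min_eig[y][x] > threshold:
                points.append((x, y))
    return points
```

In a multi-scale setting the same function would be called once per spatial scale, on the eigenvalue map computed for that scale.
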
  • Descriptors

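    • Trajectory shape descriptor (TR):

$$ \mathrm{TR} = \frac{(\Delta P_{t}, \ldots, \Delta P_{t+L-1})}{\sum_{j=t}^{t+L-1} \lVert \Delta P_{j} \rVert} $$

where \(L\) is the length of the trajectory and \( \Delta P_{t} = P_{t+1} - P_{t} = (x_{t+1} - x_{t},\, y_{t+1} - y_{t}) \) are the displacement vectors between consecutive points.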

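    • HOG (histograms of oriented gradients) – static appearance information
    • HOF (histograms of optical flow) – local motion information
    • MBH (motion boundary histograms) – motion information computed from the spatial derivatives of the optical flow, split into horizontal (MBHx) and vertical (MBHy) components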
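
As a small worked example of the trajectory shape descriptor, the sketch below follows the definition in Wang et al.: the sequence of displacement vectors between consecutive tracked points, divided by the sum of their magnitudes. The function name `trajectory_shape` is hypothetical, and returning zeros for a completely static trajectory is a choice made only for this sketch.

```python
import math


def trajectory_shape(points):
    """Compute the normalized trajectory shape descriptor.

    points: list of L + 1 tracked (x, y) positions.
    Returns a flat list of 2 * L values (dx, dy for each step).
    """
    # Displacement vectors dP_t = P_{t+1} - P_t between consecutive points.
    displacements = [
        (x1 - x0, y1 - y0)
        for (x0, y0), (x1, y1) in zip(points, points[1:])
    ]

    # Normalize by the total path length (sum of displacement magnitudes).
    total = sum(math.hypot(dx, dy) for dx, dy in displacements)
    if total == 0:
        # Static trajectory: nothing to normalize (sketch-only convention).
        return [0.0] * (2 * len(displacements))

    return [component / total for d in displacements for component in d]
```

With 16 tracked points (L = 15) this yields 30 values, which matches the default dimension listed for the Trajectory descriptor in the feature format.
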
  • Format of DTF features

Features are written one per line, in the following format:

frameNum mean_x mean_y var_x var_y length scale x_pos y_pos t_pos Trajectory HOG HOF MBHx MBHy

The first 10 elements are information about the trajectory:

  • frameNum:     The frame on which the trajectory ends
  • mean_x:       The mean value of the x coordinates of the trajectory
  • mean_y:       The mean value of the y coordinates of the trajectory
  • var_x:        The variance of the x coordinates of the trajectory
  • var_y:        The variance of the y coordinates of the trajectory
  • length:       The length of the trajectory
  • scale:        The scale on which the trajectory is computed
  • x_pos:        The normalized x position w.r.t. the video (0 to 0.999), used for the spatio-temporal pyramid
  • y_pos:        The normalized y position w.r.t. the video (0 to 0.999), used for the spatio-temporal pyramid
  • t_pos:        The normalized t position w.r.t. the video (0 to 0.999), used for the spatio-temporal pyramid

The remaining elements are the five descriptors, concatenated in this order:

  • Trajectory:    2x[trajectory length] (default 30 dimensions)
  • HOG:           8x[spatial cells]x[spatial cells]x[temporal cells] (default 96 dimensions)
  • HOF:           9x[spatial cells]x[spatial cells]x[temporal cells] (default 108 dimensions)
  • MBHx:          8x[spatial cells]x[spatial cells]x[temporal cells] (default 96 dimensions)
  • MBHy:          8x[spatial cells]x[spatial cells]x[temporal cells] (default 96 dimensions)
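
The layout above can be unpacked with a short parser. The following is a sketch under stated assumptions: the function name `parse_feature_line` is hypothetical, and the defaults (trajectory length 15, 2 spatial cells, 3 temporal cells) are inferred from the default dimensions listed above (2x15 = 30, 8x2x2x3 = 96, 9x2x2x3 = 108) rather than taken from the tool's documentation.

```python
INFO_FIELDS = [
    "frameNum", "mean_x", "mean_y", "var_x", "var_y",
    "length", "scale", "x_pos", "y_pos", "t_pos",
]


def parse_feature_line(line, traj_len=15, n_xy=2, n_t=3):
    """Split one feature line into trajectory info and the five descriptors.

    traj_len, n_xy, n_t are assumed defaults; change them if the features
    were extracted with different settings.
    """
    values = [float(v) for v in line.split()]

    cells = n_xy * n_xy * n_t
    layout = [
        ("Trajectory", 2 * traj_len),
        ("HOG", 8 * cells),
        ("HOF", 9 * cells),
        ("MBHx", 8 * cells),
        ("MBHy", 8 * cells),
    ]

    expected = len(INFO_FIELDS) + sum(size for _, size in layout)
    if len(values) != expected:
        raise ValueError(f"expected {expected} values, got {len(values)}")

    # First 10 values: information about the trajectory.
    feature = dict(zip(INFO_FIELDS, values))

    # Remaining values: the descriptors, concatenated in order.
    offset = len(INFO_FIELDS)
    for name, size in layout:
        feature[name] = values[offset:offset + size]
        offset += size
    return feature
```

With the default settings each line is expected to hold 10 + 30 + 96 + 108 + 96 + 96 = 436 values; the length check makes a mismatch in extraction settings fail loudly instead of silently misaligning the descriptors.
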
  • Improved Dense Trajectories

  • Explicit camera motion estimation
  • Assumption: two consecutive frames are related by a homography.
  • Match feature points between frames using SURF descriptors and dense optical flow
  • Remove inconsistent matches caused by human motion: use a human detector to discard matches that fall inside human regions (computationally expensive)
  • Estimate the homography from the remaining matches using RANSAC
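
To make the homography assumption concrete, the sketch below shows what it means for a point in one frame to be mapped into the next frame by a 3x3 matrix H, and how the motion left over after that mapping can be read as motion not explained by the camera. The RANSAC estimation itself is not shown; in practice a routine such as OpenCV's `cv2.findHomography` with the `cv2.RANSAC` method could supply H. The function names are hypothetical, and the per-point residual is a simplified illustration rather than the exact procedure of the paper.

```python
def apply_homography(H, x, y):
    """Map the point (x, y) through the 3x3 homography H (list of 3 rows)."""
    xh = H[0][0] * x + H[0][1] * y + H[0][2]
    yh = H[1][0] * x + H[1][1] * y + H[1][2]
    w = H[2][0] * x + H[2][1] * y + H[2][2]
    # Divide by the homogeneous coordinate to return to pixel coordinates.
    return xh / w, yh / w


def residual_motion(H, p, q):
    """Displacement of the match p -> q that H does not account for.

    p: (x, y) in the first frame; q: matched (x, y) in the next frame.
    A residual near zero suggests the point moved only because of the camera.
    """
    predicted_x, predicted_y = apply_homography(H, p[0], p[1])
    return q[0] - predicted_x, q[1] - predicted_y
```

For example, with a pure camera translation of (2, -1) pixels, a background point at (3, 4) is predicted at (5, 3); a match observed at (6, 3) then has a residual of (1, 0), which would be attributed to the scene rather than the camera.
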

References:

  1. H Wang, C Schmid, Action recognition with improved trajectories, ICCV 2013
  2. H Wang, A Kläser, C Schmid, CL Liu, Dense trajectories and motion boundary descriptors for action recognition, International Journal of Computer Vision, May 2013, Volume 103, Issue 1, pp 60-79