Notes on Dense Trajectories

• Densely sample feature points in each frame.
• Track the points through the video using optical flow.
• Compute multiple descriptors along the trajectories of the feature points to capture shape, appearance and motion information.
• Dense Sampling

• Sampling step size $$W=5$$ pixels
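• Number of spatial scales: at most 8
• Spatial scale factor between consecutive scales: $$1 / \sqrt{2}$$
• Removing points in homogeneous areas: a sampled point is discarded if the smaller eigenvalue of its auto-correlation matrix falls below the threshold $$T=0.001 \times \max_{i \in I}\min(\lambda_{i}^{1},\lambda_{i}^{2})$$, where $$(\lambda_{i}^{1},\lambda_{i}^{2})$$ are the eigenvalues of the auto-correlation matrix at point $$i$$ in image $$I$$.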
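The sampling rule above can be sketched in a few lines of NumPy. This is a single-scale illustration only, not the authors' implementation: the 3x3 box window used to build the auto-correlation matrix, and the helper names `box_filter`, `min_eigenvalue_map` and `dense_sample`, are assumptions made for the example.

```python
import numpy as np

def box_filter(a, r=1):
    """Mean over a (2r+1)x(2r+1) window with edge padding (assumed window)."""
    k = 2 * r + 1
    p = np.pad(a, r, mode="edge")
    out = np.zeros_like(a, dtype=np.float64)
    for dy in range(k):
        for dx in range(k):
            out += p[dy:dy + a.shape[0], dx:dx + a.shape[1]]
    return out / (k * k)

def min_eigenvalue_map(img):
    """Smaller eigenvalue of the 2x2 auto-correlation matrix at every pixel."""
    gy, gx = np.gradient(img.astype(np.float64))
    a = box_filter(gx * gx)
    b = box_filter(gx * gy)
    c = box_filter(gy * gy)
    # Eigenvalues of [[a, b], [b, c]] are (a+c)/2 +/- sqrt(((a-c)/2)^2 + b^2).
    return (a + c) / 2 - np.sqrt(((a - c) / 2) ** 2 + b ** 2)

def dense_sample(img, step=5, quality=0.001):
    """Grid points (x, y), `step` pixels apart, whose min eigenvalue exceeds T."""
    lam = min_eigenvalue_map(img)
    T = quality * lam.max()  # T = 0.001 * max_i min(lambda_i^1, lambda_i^2)
    ys, xs = np.mgrid[step // 2:img.shape[0]:step, step // 2:img.shape[1]:step]
    keep = lam[ys, xs] > T
    return np.stack([xs[keep], ys[keep]], axis=1)
```

On an image whose left half is flat and whose right half is textured, `dense_sample` returns grid points only from the textured half. For the multi-scale case, the same function would be applied to each rescaled copy of the frame.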
• Descriptors

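• Trajectory shape descriptor (TR):

$$S' = \frac{(\Delta P_{t}, \ldots, \Delta P_{t+L-1})}{\sum_{j=t}^{t+L-1} \|\Delta P_{j}\|}$$

where $$L$$ is the length of the trajectory and the displacement vectors are $$\Delta P_{t} = P_{t+1} - P_{t} = (x_{t+1}-x_{t},\, y_{t+1}-y_{t})$$.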

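• HOG (histograms of oriented gradients) – static appearance information
• HOF (histograms of optical flow) – local motion information
• MBH (motion boundary histograms) – gradients of the optical flow, computed separately for its x and y components (MBHx, MBHy)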
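As a rough illustration of the trajectory shape descriptor, the sketch below takes the tracked positions of one point, forms the frame-to-frame displacement vectors, and divides them by the sum of their magnitudes. The function name `trajectory_shape` and the zero-vector fallback for a stationary point are assumptions made for the example.

```python
import numpy as np

def trajectory_shape(points):
    """points: (L+1, 2) array of (x, y) positions; returns a vector of length 2L."""
    pts = np.asarray(points, dtype=np.float64)
    disp = np.diff(pts, axis=0)                 # displacement vectors, shape (L, 2)
    total = np.linalg.norm(disp, axis=1).sum()  # sum of displacement magnitudes
    if total == 0:
        # Assumed fallback: a point that never moves gets an all-zero descriptor.
        return np.zeros(disp.size)
    return (disp / total).ravel()
```

A trajectory tracked over 15 frames (16 positions) yields a 30-dimensional vector, which matches the default Trajectory dimension listed in the feature format.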
• Format of DTF features

The features are written one per line, in the following format:

frameNum mean_x mean_y var_x var_y length scale x_pos y_pos t_pos Trajectory HOG HOF MBHx MBHy

The first 10 elements describe the trajectory:

• frameNum:     The frame on which the trajectory ends
• mean_x:       The mean value of the x coordinates of the trajectory
• mean_y:       The mean value of the y coordinates of the trajectory
• var_x:        The variance of the x coordinates of the trajectory
• var_y:        The variance of the y coordinates of the trajectory
• length:       The length of the trajectory
• scale:        The scale on which the trajectory is computed
• x_pos:        The normalized x position w.r.t. the video (0 to 0.999), for the spatio-temporal pyramid
• y_pos:        The normalized y position w.r.t. the video (0 to 0.999), for the spatio-temporal pyramid
• t_pos:        The normalized t position w.r.t. the video (0 to 0.999), for the spatio-temporal pyramid

The remaining elements are the five descriptors, concatenated in this order:

• Trajectory:    2x[trajectory length] (30 dimensions by default)
• HOG:           8x[spatial cells]x[spatial cells]x[temporal cells] (96 dimensions by default)
• HOF:           9x[spatial cells]x[spatial cells]x[temporal cells] (108 dimensions by default)
• MBHx:          8x[spatial cells]x[spatial cells]x[temporal cells] (96 dimensions by default)
• MBHy:          8x[spatial cells]x[spatial cells]x[temporal cells] (96 dimensions by default)
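A line in this format could be split into its parts along the following lines. This is a minimal sketch, assuming whitespace-separated values and the default settings implied by the dimensions above (trajectory length 15, 2x2 spatial cells, 3 temporal cells); `descriptor_dims` and `parse_feature_line` are hypothetical helper names, not part of the DTF tool.

```python
import numpy as np

INFO_FIELDS = ["frameNum", "mean_x", "mean_y", "var_x", "var_y",
               "length", "scale", "x_pos", "y_pos", "t_pos"]

def descriptor_dims(traj_len=15, n_xy=2, n_t=3):
    """Descriptor sizes, in file order, for the given (assumed) settings."""
    cells = n_xy * n_xy * n_t
    return {"Trajectory": 2 * traj_len, "HOG": 8 * cells, "HOF": 9 * cells,
            "MBHx": 8 * cells, "MBHy": 8 * cells}

def parse_feature_line(line, dims=None):
    """Split one feature line into (trajectory info, descriptors)."""
    dims = dims or descriptor_dims()
    values = np.array(line.split(), dtype=np.float64)
    expected = len(INFO_FIELDS) + sum(dims.values())
    if values.size != expected:
        raise ValueError(f"expected {expected} values, got {values.size}")
    info = dict(zip(INFO_FIELDS, values[:len(INFO_FIELDS)]))
    descriptors = {}
    pos = len(INFO_FIELDS)
    for name, d in dims.items():
        descriptors[name] = values[pos:pos + d]
        pos += d
    return info, descriptors
```

With the default settings each line carries 10 + 30 + 96 + 108 + 96 + 96 = 436 values, so the length check catches files produced with a different trajectory length or cell layout.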
1. Improved Dense Trajectories

• Explicit camera motion estimation
• Assumption: two consecutive frames are related by a homography.
• Match feature points between frames using SURF descriptors and dense optical flow
• Remove inconsistent matches caused by humans: use a human detector to discard matches that fall in human regions (computationally expensive)
• Estimate the homography from the remaining matches using RANSAC
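The last two steps can be sketched with NumPy alone. The code below is an illustrative stand-in, not the authors' implementation: it assumes matches are already available as two arrays of corresponding points and that human detections arrive as `(x0, y0, x1, y1)` boxes. It uses a plain direct linear transform inside a basic RANSAC loop, and the 1-pixel inlier threshold and 500 iterations are arbitrary choices for the example.

```python
import numpy as np

def outside_boxes(pts, boxes):
    """True for points lying outside every (x0, y0, x1, y1) box."""
    pts = np.asarray(pts, dtype=np.float64)
    keep = np.ones(len(pts), dtype=bool)
    for x0, y0, x1, y1 in boxes:
        inside = ((pts[:, 0] >= x0) & (pts[:, 0] <= x1) &
                  (pts[:, 1] >= y0) & (pts[:, 1] <= y1))
        keep &= ~inside
    return keep

def fit_homography(src, dst):
    """Direct linear transform: H with dst ~ H @ src, from at least 4 matches."""
    rows = []
    for (x, y), (u, v) in zip(src, dst):
        rows.append([-x, -y, -1, 0, 0, 0, u * x, u * y, u])
        rows.append([0, 0, 0, -x, -y, -1, v * x, v * y, v])
    _, _, vt = np.linalg.svd(np.asarray(rows, dtype=np.float64))
    return vt[-1].reshape(3, 3)  # null-space vector, defined up to scale

def project(H, pts):
    """Apply H to (N, 2) points in homogeneous coordinates."""
    p = np.c_[pts, np.ones(len(pts))] @ H.T
    with np.errstate(divide="ignore", invalid="ignore"):
        return p[:, :2] / p[:, 2:3]

def ransac_homography(src, dst, thresh=1.0, iters=500, seed=0):
    """Return (H, inlier mask) estimated from point matches src -> dst."""
    src = np.asarray(src, dtype=np.float64)
    dst = np.asarray(dst, dtype=np.float64)
    rng = np.random.default_rng(seed)
    best = np.zeros(len(src), dtype=bool)
    for _ in range(iters):
        idx = rng.choice(len(src), size=4, replace=False)
        H = fit_homography(src[idx], dst[idx])
        with np.errstate(invalid="ignore"):
            err = np.linalg.norm(project(H, src) - dst, axis=1)
        inliers = err < thresh  # degenerate samples give NaN/inf and never count
        if inliers.sum() > best.sum():
            best = inliers
    if best.sum() < 4:
        raise ValueError("not enough inliers to fit a homography")
    H = fit_homography(src[best], dst[best])  # refit on all inliers
    return H / H[2, 2], best
```

In use, `outside_boxes` would be applied to the matched points first, and only the surviving matches passed to `ransac_homography`. Matches on independently moving objects that the detector misses are then expected to end up as RANSAC outliers.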

References:

1. H. Wang, C. Schmid. Action recognition with improved trajectories. ICCV 2013.
2. H. Wang, A. Kläser, C. Schmid, C.-L. Liu. Dense trajectories and motion boundary descriptors for action recognition. International Journal of Computer Vision, 103(1), pp. 60-79, May 2013.