Paper:

Zia, M. Zeeshan, et al. “Detailed 3d representations for object recognition and modeling.” *IEEE transactions on pattern analysis and machine intelligence* 35.11 (2013): 2608-2623.

Paper:

Li, Jun, Reinhard Klein, and Angela Yao. “Learning fine-scaled depth maps from single RGB images.” *arXiv preprint arXiv:1607.00730* (2016).

Paper:

Chen, Xianjie, and Alan L. Yuille. “Articulated pose estimation by a graphical model with image dependent pairwise relations.” *Advances in neural information processing systems*. 2014.

Paper:

Fragkiadaki, Katerina, et al. “Recurrent network models for human dynamics.” *Proceedings of the IEEE International Conference on Computer Vision*. 2015.

Paper:

Sun, Lin, et al. “Lattice long short-term memory for human action recognition.” *arXiv preprint arXiv:1708.03958* (2017).

- Basics
- CNN methods for spatial appearance
- RNN methods (LSTM) for temporal dynamics. Naively applying an RNN is only suitable for short-term motions.

- Main methods
- Lattice-LSTM: extends LSTM by learning independent hidden-state transitions of memory cells for individual spatial locations.
- Control gates are shared between the RGB and optical-flow streams.
- Greatly enhances the capacity of the memory cell to learn motion dynamics.
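The per-location recurrent transition can be sketched as follows. This is a minimal NumPy toy, not the paper's implementation: the flattened spatial grid, the shapes, and the weight names are illustrative assumptions.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lattice_lstm_step(x, h, c, Wx, Wh_loc, b):
    """One step of a simplified Lattice-LSTM-style cell.

    Unlike a ConvLSTM, which shares one recurrent transform over all
    spatial positions, the hidden-to-gate weights Wh_loc here are learned
    independently for every spatial location (the "lattice").

    x, h, c : (P, C)     input, hidden and memory at P spatial locations
    Wx      : (C, 4C)    input-to-gate weights, shared across locations
    Wh_loc  : (P, C, 4C) per-location hidden-to-gate weights
    b       : (4C,)     gate biases
    """
    # per-location recurrent contribution: h[p] @ Wh_loc[p]
    rec = np.einsum('pc,pcd->pd', h, Wh_loc)          # (P, 4C)
    gates = x @ Wx + rec + b                          # (P, 4C)
    i, f, o, g = np.split(gates, 4, axis=1)
    i, f, o = sigmoid(i), sigmoid(f), sigmoid(o)
    c_new = f * c + i * np.tanh(g)                    # memory update
    h_new = o * np.tanh(c_new)                        # hidden update
    return h_new, c_new

# toy usage: 6 spatial locations, 4 channels
rng = np.random.default_rng(0)
P, C = 6, 4
x = rng.standard_normal((P, C))
h = np.zeros((P, C)); c = np.zeros((P, C))
Wx = rng.standard_normal((C, 4 * C)) * 0.1
Wh = rng.standard_normal((P, C, 4 * C)) * 0.1
h, c = lattice_lstm_step(x, h, c, Wx, Wh, np.zeros(4 * C))
print(h.shape)  # (6, 4)
```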

- Multi-modal training procedure: train both the input gates and the forget gates in the network (other two-stream networks train these separately).


- Take home message
- Other methods mentioned
- Extension of CNN: C3D learns space and time jointly, but only covers a short range of the sequence.
- Training another neural network on optical flow.
- Methods for obtaining a better combination of appearance and motion: spatio-temporal features using a sequential procedure, combining 2D spatial (short-term) and 1D temporal (long-term) information.
- ResNets
- RNN, LSTM — encoder and decoder

Paper:

Wang, Hongsong, and Liang Wang. “Modeling temporal dynamics and spatial configurations of actions using two-stream recurrent neural networks.” *Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR)*. 2017.

- Basics
- Skeleton based action recognition
- Two-stream RNN
- Two architectures for temporal streams
- Stacked RNN
- Hierarchical RNN

- Model spatial structure by converting spatial graph into a sequence of joints.
- Obtain 3D skeletons from depth images.

- Main method
- End-to-end two-stream RNN
- Fusion is performed by combining the softmax class posteriors from the two nets.
- Temporal channel: concatenate the 3D coordinates of the different joints at each time step, then model the generated sequence with an RNN.
- Stacked RNN
- Feed the concatenated coordinates of all joints into an RNN and stack two layers. Adding more layers does not improve performance.
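The temporal stream above can be sketched as follows; a minimal NumPy toy in which vanilla tanh RNN cells stand in for the paper's recurrent units, with illustrative joint counts and hidden sizes.

```python
import numpy as np

def rnn_layer(seq, Wx, Wh, b):
    """Vanilla tanh RNN over a sequence; seq: (T, D_in) -> (T, H)."""
    T, H = seq.shape[0], Wh.shape[0]
    h = np.zeros(H)
    outs = np.zeros((T, H))
    for t in range(T):
        h = np.tanh(seq[t] @ Wx + h @ Wh + b)
        outs[t] = h
    return outs

# temporal stream: concatenate the 3D coordinates of all joints at each
# frame, then stack two recurrent layers (no gain beyond two is reported)
rng = np.random.default_rng(0)
T, J = 10, 20                       # frames, joints
frames = rng.standard_normal((T, J, 3))
seq = frames.reshape(T, J * 3)      # (T, 60): joints concatenated per step

H = 32
layer1 = rnn_layer(seq,
                   rng.standard_normal((J * 3, H)) * 0.1,
                   rng.standard_normal((H, H)) * 0.1, np.zeros(H))
layer2 = rnn_layer(layer1,
                   rng.standard_normal((H, H)) * 0.1,
                   rng.standard_normal((H, H)) * 0.1, np.zeros(H))
print(layer2.shape)  # (10, 32)
```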

- Hierarchical RNN
- Divide human skeleton into 5 parts.
- Use hierarchical RNN to model the motions of different parts of the body (first layer) and the whole body (second layer).

- Spatial RNN
- Nodes denote the joints and edges denote the physical connections.
- Action: the undirected graph displays varied patterns of spatial structure.

- Select a temporal window centered at each time step and feed the coordinates of one joint inside the window, to model the spatial relationships of the joints.
- Three graph representations
- Undirected graph
- Chain sequence
- Traversal sequence

- The spatial RNN can recognize actions based on just one graph representation.
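Converting the skeleton graph into a sequence of joints (chain vs. traversal) can be sketched like this; the six-joint skeleton and the joint names are hypothetical, not the paper's.

```python
# Hypothetical skeleton tree: joint -> list of child joints
skeleton = {
    'neck': ['head', 'l_hand', 'r_hand', 'hip'],
    'head': [], 'l_hand': [], 'r_hand': [],
    'hip': ['l_foot', 'r_foot'],
    'l_foot': [], 'r_foot': [],
}

def chain_sequence(adj, root):
    """Chain: a simple depth-first ordering visiting each joint once."""
    seq, stack = [], [root]
    while stack:
        j = stack.pop()
        seq.append(j)
        stack.extend(reversed(adj.get(j, [])))
    return seq

def traversal_sequence(adj, root):
    """Traversal: a depth-first walk that revisits a joint each time the
    walk returns through it, so physically connected joints stay adjacent."""
    seq = []
    def visit(j):
        seq.append(j)
        for child in adj.get(j, []):
            visit(child)
            seq.append(j)   # come back through the parent
    visit(root)
    return seq

print(chain_sequence(skeleton, 'neck'))
# ['neck', 'head', 'l_hand', 'r_hand', 'hip', 'l_foot', 'r_foot']
print(traversal_sequence(skeleton, 'neck'))
```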

- Data augmentation
- 3D transformation of skeletons
- Rotation
- Scaling
- Shear
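A minimal NumPy sketch of these 3D skeleton transforms, assuming rotation about the vertical y-axis and a single x-y shear; the axis conventions and parameter values are illustrative.

```python
import numpy as np

def augment_skeleton(joints, angle=0.0, scale=1.0, shear=0.0):
    """Apply rotation (about the vertical y-axis), uniform scaling and an
    x-by-y shear to a (J, 3) array of joint coordinates; angle in radians."""
    c, s = np.cos(angle), np.sin(angle)
    R = np.array([[ c, 0, s],
                  [ 0, 1, 0],
                  [-s, 0, c]])          # rotation about y
    S = np.eye(3) * scale               # uniform scaling
    Sh = np.eye(3); Sh[0, 1] = shear    # shear: x += shear * y
    return joints @ (R @ S @ Sh).T

# toy 2-joint skeleton
joints = np.array([[0.0, 1.0, 0.0],
                   [0.5, 0.5, 0.0]])
aug = augment_skeleton(joints, angle=np.pi / 2, scale=2.0, shear=0.1)
print(aug.round(3))
```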


- Take home messages
- Other methods mentioned
- Body part based action recognition and Joint based action recognition.
- Based on hand-crafted low level features, use Markov Random Fields.
- Fully connected deep LSTM network with regularization terms to learn co-occurrence features of joints.

- RGB based action recognition
- Hierarchical RNN, RNN with regularizations, differential RNN, and part-aware Long Short-Term Memory


Paper:

Wang, Yunbo, et al. “Spatiotemporal Pyramid Network for Video Action Recognition.” *2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR)*. IEEE, 2017.

- Basics
- Main methods
- Spatiotemporal pyramid network: the two streams reinforce each other.
- Hierarchical fusion strategies with a unified spatiotemporal loss, so the streams maximally complement each other.
- Tackles the key failure mode of two-stream methods: in most misclassification cases one stream fails while the other is correct. Presents an end-to-end pyramid architecture that lets the two streams facilitate each other.
- Temporal part:
- To learn more global video features (two actions may be indistinguishable over a short term), use multi-path temporal sub-networks to sample optical flow over long sequences, and use different fusion methods to combine the temporal information.
- Enlarge the video chunks by using multiple CNNs with shared network parameters.

- Spatial part:
- If two videos have a similar background, the spatial stream cannot tell them apart, because the background is the strongest feature.
- Use the temporal part as guidance: inform the spatial network where the motion happens (help it extract the significant locations on the spatial feature maps).

- Joint optimization of the temporal and spatial parts
- Compact bilinear fusion strategy.

- Details for compact fusion
- Preserve maximal information from both parts while maximizing their interactions.
- Bilinear fusion leads to high-dimensional representations; use spatiotemporal compact bilinear (STCB) fusion to map them to a low dimension.
- STCB can preserve the temporal cues to supervise the spatiotemporal attention module.
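Compact bilinear fusion is commonly implemented with Count Sketch projections combined by FFT convolution; a minimal NumPy sketch of that idea (the feature and output dimensions are illustrative, and in practice the hash and sign vectors are sampled once and kept fixed, not resampled per call).

```python
import numpy as np

def count_sketch(x, h, s, d):
    """Project x (C,) to d dims: index by hash h, multiply by signs s."""
    y = np.zeros(d)
    np.add.at(y, h, s * x)   # scatter-add handles hash collisions
    return y

def compact_bilinear(x, y, d, rng):
    """Approximate the bilinear (outer-product) feature of x and y in d
    dims via Count Sketch + FFT convolution, the idea behind STCB fusion."""
    h1 = rng.integers(0, d, x.shape[0]); s1 = rng.choice([-1, 1], x.shape[0])
    h2 = rng.integers(0, d, y.shape[0]); s2 = rng.choice([-1, 1], y.shape[0])
    fx = np.fft.fft(count_sketch(x, h1, s1, d))
    fy = np.fft.fft(count_sketch(y, h2, s2, d))
    # elementwise product in frequency domain == circular convolution
    return np.real(np.fft.ifft(fx * fy))

rng = np.random.default_rng(0)
spatial = rng.standard_normal(512)    # appearance feature (illustrative)
temporal = rng.standard_normal(512)   # motion feature (illustrative)
fused = compact_bilinear(spatial, temporal, d=1024, rng=rng)
print(fused.shape)  # (1024,)
```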

- Spatiotemporal Attention
- Taking advantage of the motion information to locate salient regions on the feature maps.
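A minimal sketch of this motion-guided attention, assuming a single learned projection produces one attention score per location from the motion feature maps; this illustrates the idea, not the paper's exact module.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def motion_guided_attention(spatial_maps, motion_maps, w):
    """Weight spatial feature maps by an attention map predicted from the
    motion features, so appearance responses at moving regions dominate.

    spatial_maps, motion_maps : (H, W, C)
    w : (C,) projection giving one attention score per spatial location
    """
    H, W, C = motion_maps.shape
    scores = motion_maps.reshape(-1, C) @ w            # (H*W,)
    attn = softmax(scores).reshape(H, W, 1)            # sums to 1 over locations
    attended = (spatial_maps * attn).sum(axis=(0, 1))  # (C,) pooled descriptor
    return attended, attn

rng = np.random.default_rng(0)
sp = rng.standard_normal((7, 7, 16))   # spatial-stream feature maps
mo = rng.standard_normal((7, 7, 16))   # temporal-stream feature maps
vec, attn = motion_guided_attention(sp, mo, rng.standard_normal(16))
print(vec.shape)  # (16,)
```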

- Integrate all the techniques mentioned above in the pyramid architecture.
- Use STCB three times
- Bottom of the pyramid, combine multiple optical flow representations from longer videos. — More global temporal features.
- Spatiotemporal attention subnet — Fuse the spatial feature maps with the motion representations.
- Top, fuse all.


- Take home messages
- Other methods mentioned
- C3D: 3D convolution filters and 3D pooling layers operating over space and time simultaneously.
- Two stream networks.
- Note: the Related works part is very good!