
Modelling Human Body Pose for Action Recognition Using Deep Neural Networks

  • Research Article - Computer Engineering and Computer Science
  • Published in: Arabian Journal for Science and Engineering

Abstract

Body pose is an important indicator of human actions. Existing pose-based action recognition approaches are typically designed for individual human bodies and require a fixed-size (e.g., \(13\times 2\)) input vector. This requirement is questionable and may degrade recognition accuracy, particularly for real-world videos, in which scenes with multiple people or partially visible bodies are common. Inspired by the recent success of convolutional neural networks (CNNs) in various computer vision tasks, in this work we propose a deep neural network architecture for 2D pose-based action recognition. To this end, a human pose encoding scheme is designed to eliminate the above requirement and to provide a general representation of 2D human body joints that can be used as the input to CNNs. We also propose a weighted fusion scheme that integrates RGB and optical flow with human pose features to perform action classification. We evaluate our approach on two real-world datasets and achieve better performance than state-of-the-art approaches.
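The two ideas in the abstract — encoding a variable number of 2D joints into a fixed-size CNN input, and weighted late fusion of pose, RGB, and optical-flow streams — can be sketched as follows. This is a minimal illustration under assumed conventions, not the paper's actual method: the grid size, joint format (x, y normalised to [0, 1]), and fusion weights are hypothetical placeholders.

```python
def encode_pose(joints, grid=(56, 56)):
    """Rasterise any number of 2D joints (x, y normalised to [0, 1]) onto a
    fixed-size occupancy grid, so scenes with multiple people or partially
    visible bodies still yield a constant-shape, CNN-ready input."""
    h, w = grid
    canvas = [[0.0] * w for _ in range(h)]
    for x, y in joints:
        row = min(int(y * h), h - 1)  # clamp joints lying on the border
        col = min(int(x * w), w - 1)
        canvas[row][col] = 1.0
    return canvas


def fuse_scores(pose, rgb, flow, weights=(0.4, 0.3, 0.3)):
    """Weighted late fusion of per-class scores from the pose, RGB, and
    optical-flow streams; returns the index of the winning action class.
    The weights here are illustrative, not the paper's learned values."""
    w_p, w_r, w_f = weights
    fused = [w_p * p + w_r * r + w_f * f for p, r, f in zip(pose, rgb, flow)]
    return max(range(len(fused)), key=fused.__getitem__)


# Three visible joints map to three occupied cells of a 56x56 grid.
pose_map = encode_pose([(0.5, 0.5), (0.25, 0.75), (0.9, 0.1)])
print(sum(map(sum, pose_map)))  # 3.0

# Fuse two-class scores from the three streams.
print(fuse_scores([0.1, 0.9], [0.8, 0.2], [0.5, 0.5], (0.5, 0.25, 0.25)))  # 1
```

Because the grid shape is fixed regardless of how many joints are detected, the same network input size works for one body, several bodies, or a partially occluded body, which is the requirement the abstract's fixed-length joint vector cannot satisfy.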



Acknowledgements

This research is supported in part by NSFC (61572424) and the People Programme (Marie Curie Actions) of the European Union's Seventh Framework Programme FP7 (2007–2013) under REA Grant Agreement No. 612627 "AniNex". Min Tang is supported in part by NSFC (61572423) and Zhejiang Provincial NSFC (LZ16F020003).

Author information


Corresponding author

Correspondence to Ruofeng Tong.


About this article


Cite this article

Li, C., Tong, R. & Tang, M. Modelling Human Body Pose for Action Recognition Using Deep Neural Networks. Arab J Sci Eng 43, 7777–7788 (2018). https://doi.org/10.1007/s13369-018-3189-z

