
Modelling Human Body Pose for Action Recognition Using Deep Neural Networks

  • Research Article - Computer Engineering and Computer Science
  • Published in: Arabian Journal for Science and Engineering

Abstract

Body pose is an important indicator of human actions. Existing pose-based action recognition approaches are typically designed for individual human bodies and require a fixed-size (e.g., \(13\times 2\)) input vector. This requirement is questionable and may degrade recognition accuracy, particularly for real-world videos, in which scenes with multiple people or partially visible bodies are common. Inspired by the recent success of convolutional neural networks (CNNs) in various computer vision tasks, in this work we propose a deep neural network architecture for 2D pose-based action recognition. To this end, a human pose encoding scheme is designed to eliminate the above requirement and to provide a general representation of 2D human body joints that can be used as the input to CNNs. We also propose a weighted fusion scheme that integrates RGB and optical flow with human pose features to perform action classification. We evaluate our approach on two real-world datasets and achieve better performance than state-of-the-art approaches.
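The two ideas in the abstract — encoding a variable number of 2D joints into a fixed-size CNN input, and weighted late fusion of pose, RGB, and optical-flow streams — can be sketched as follows. This is a minimal illustration under assumed conventions, not the paper's actual method: the grid size, joint format (x, y normalised to [0, 1]), and fusion weights are hypothetical placeholders.

```python
def encode_pose(joints, grid=(56, 56)):
    """Rasterise any number of 2D joints (x, y normalised to [0, 1]) onto a
    fixed-size occupancy grid, so scenes with multiple people or partially
    visible bodies still yield a constant-shape, CNN-ready input."""
    h, w = grid
    canvas = [[0.0] * w for _ in range(h)]
    for x, y in joints:
        row = min(int(y * h), h - 1)  # clamp joints lying on the border
        col = min(int(x * w), w - 1)
        canvas[row][col] = 1.0
    return canvas


def fuse_scores(pose, rgb, flow, weights=(0.4, 0.3, 0.3)):
    """Weighted late fusion of per-class scores from the pose, RGB, and
    optical-flow streams; returns the index of the winning action class.
    The weights here are illustrative, not the paper's learned values."""
    w_p, w_r, w_f = weights
    fused = [w_p * p + w_r * r + w_f * f for p, r, f in zip(pose, rgb, flow)]
    return max(range(len(fused)), key=fused.__getitem__)


# Three visible joints map to three occupied cells of a 56x56 grid.
pose_map = encode_pose([(0.5, 0.5), (0.25, 0.75), (0.9, 0.1)])
print(sum(map(sum, pose_map)))  # 3.0

# Fuse two-class scores from the three streams.
print(fuse_scores([0.1, 0.9], [0.8, 0.2], [0.5, 0.5], (0.5, 0.25, 0.25)))  # 1
```

Because the grid shape is fixed regardless of how many joints are detected, the same network input size works for one body, several bodies, or a partially occluded body, which is the requirement the abstract's fixed-length joint vector cannot satisfy.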



Acknowledgements

This research is supported in part by NSFC (61572424) and the People Programme (Marie Curie Actions) of the European Union's Seventh Framework Programme FP7 (2007–2013) under REA Grant Agreement No. 612627 "AniNex". Min Tang is supported in part by NSFC (61572423) and Zhejiang Provincial NSFC (LZ16F020003).

Author information


Corresponding author

Correspondence to Ruofeng Tong.


About this article


Cite this article

Li, C., Tong, R. & Tang, M. Modelling Human Body Pose for Action Recognition Using Deep Neural Networks. Arab J Sci Eng 43, 7777–7788 (2018). https://doi.org/10.1007/s13369-018-3189-z

