In this paper, a novel approach for action detection from RGB sequences is proposed. It takes advantage of the recent development of CNNs to estimate 3D human poses from a monocular camera. To show the validity of our method, we propose a 3D skeleton-based two-stage action detection approach. For localizing actions in unsegmented sequences, Relative Joint Positions (RJP) and Histograms of Displacements (HOD) are used as inputs to a k-nearest neighbor binary classifier in order to define action segments. Afterwards, to recognize the localized action proposals, a compact Long Short-Term Memory (LSTM) network with a de-noising expansion unit is employed. Compared to previous RGB-based methods, our approach offers robustness to radial motion, view-invariance, and low computational complexity. Results on the Online Action Detection dataset show that our method outperforms earlier RGB-based approaches.
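The first stage described above (frame-wise binary classification over skeleton features, then grouping positive frames into action proposals) can be sketched roughly as follows. This is a minimal illustrative sketch, not the authors' implementation: the choice of joint 0 as the reference joint, the plain Euclidean distance, and the toy k-NN are all assumptions made for the example.

```python
# Illustrative sketch of the two-stage paper's FIRST stage:
# frame-wise binary classification (action vs. background) over
# Relative Joint Position features, then grouping positive frames
# into action proposals. Joint indexing and distance are assumptions.

def rjp(frame, ref=0):
    """Relative Joint Position: express every 3D joint of one frame
    relative to a reference joint (here joint 0, assumed to be the hip)."""
    rx, ry, rz = frame[ref]
    return [(x - rx, y - ry, z - rz) for (x, y, z) in frame]

def knn_predict(train, labels, query, k=3):
    """Toy k-nearest-neighbour binary classifier: majority vote among
    the k training vectors closest (squared Euclidean) to the query."""
    dists = sorted(
        (sum((a - b) ** 2 for a, b in zip(x, query)), y)
        for x, y in zip(train, labels)
    )
    votes = [y for _, y in dists[:k]]
    return max(set(votes), key=votes.count)

def segment(frame_preds):
    """Group consecutive frames labelled 1 into (start, end) proposals,
    which a recognition model (the paper uses an LSTM) would then classify."""
    segments, start = [], None
    for t, p in enumerate(frame_preds):
        if p == 1 and start is None:
            start = t
        elif p == 0 and start is not None:
            segments.append((start, t - 1))
            start = None
    if start is not None:
        segments.append((start, len(frame_preds) - 1))
    return segments
```

In this sketch each frame would be flattened into a feature vector, labelled action/background by the k-NN, and the resulting binary stream converted into temporal proposals by `segment`; the paper additionally concatenates HOD features, which are omitted here for brevity.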
Research center:
Interdisciplinary Centre for Security, Reliability and Trust (SnT) > SIGCOM
Disciplines:
Computer science
Author, co-author:
Papadopoulos, Konstantinos ; University of Luxembourg > Interdisciplinary Centre for Security, Reliability and Trust (SnT)
Ghorbel, Enjie ; University of Luxembourg > Interdisciplinary Centre for Security, Reliability and Trust (SnT)
Baptista, Renato ; University of Luxembourg > Interdisciplinary Centre for Security, Reliability and Trust (SnT)
Aouada, Djamila ; University of Luxembourg > Interdisciplinary Centre for Security, Reliability and Trust (SnT)
Ottersten, Björn ; University of Luxembourg > Interdisciplinary Centre for Security, Reliability and Trust (SnT)
External co-authors:
no
Language:
English
Title:
Two-stage RGB-based Action Detection using Augmented 3D Poses
Publication date:
2019
Event name:
18th International Conference on Computer Analysis of Images and Patterns
Event date:
from 03-09-2019 to 05-09-2019
Main work title:
18th International Conference on Computer Analysis of Images and Patterns, Salerno, 3–5 September 2019
Peer reviewed:
Peer reviewed
FnR Project:
FNR10415355 - 3D Action Recognition Using Refinement And Invariance Strategies For Reliable Surveillance, 2015 (01/06/2016-31/05/2019) - Björn Ottersten