
Author: 葉旭 (村永旭)
XU YE (AKIRA MURANAGA)
Thesis Title: Bidirectional Transformers for Skeleton-Based Motion Prediction
On Human Motion Prediction Using Bidirectional Encoder Representations from Transformers
Advisor: 方文賢
Wen-Hsien Fang
Committee Members: 陳郁堂
Yie-Tarng Chen
賴坤財
Kuen-Tsair Lay
丘建青
Chien-Ching Chiu
鍾聖倫
Sheng-Luen Chung
Degree: Master
Department: Department of Electronic and Computer Engineering, College of Electrical Engineering and Computer Science
Year of Publication: 2019
Academic Year of Graduation: 107
Language: English
Number of Pages: 74
Keywords (Chinese): attention mechanism, skeleton-based motion prediction
Keywords (English): transformer, human motion prediction
  • Pose prediction has found applications in a variety of areas.
    However, current methods based on recurrent neural networks suffer from error accumulation during training. Furthermore, encoder-decoder architectures in general fail to predict continuous poses at the boundary between the end of the encoder input and the beginning of the decoder output.
    Benefiting from the recent success of the attention mechanism, this thesis proposes a novel method that combines the transformer encoder architecture with the universal transformer.
    The new architecture is free of error accumulation because it processes all positions in parallel and updates each position with equal weight. Moreover, the proposed attention map helps the attention mechanism keep the predicted poses free of discontinuities.
    We also apply the adaptive computation time algorithm to optimize the number of iterations of the attention mechanism.
    The mean absolute loss is adopted for training on the human motion prediction task using the Human3.6M dataset.
    Simulations show that the proposed method outperforms the main state-of-the-art approaches.
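    The scaled dot-product attention at the core of the transformer encoder described above can be sketched as follows. This is a minimal NumPy illustration, not the thesis code; the toy input and dimensions are made up for demonstration:

    ```python
    import numpy as np

    def scaled_dot_product_attention(Q, K, V):
        """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V."""
        d_k = Q.shape[-1]
        scores = Q @ K.T / np.sqrt(d_k)                # (seq_q, seq_k) similarity scores
        scores -= scores.max(axis=-1, keepdims=True)   # subtract row max for numerical stability
        weights = np.exp(scores)
        weights /= weights.sum(axis=-1, keepdims=True) # softmax over the key axis
        return weights @ V                             # weighted sum of value vectors

    # Toy self-attention: 4 time steps (poses), model dimension 8
    rng = np.random.default_rng(0)
    x = rng.standard_normal((4, 8))
    out = scaled_dot_product_attention(x, x, x)
    print(out.shape)  # (4, 8)
    ```

    Because every output position attends to all input positions in one pass, no recurrent state is carried across time steps, which is why this style of architecture avoids the error accumulation of RNN-based predictors.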


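    The adaptive computation time (ACT) algorithm mentioned in the abstract stops refining once an accumulated halting probability crosses a threshold. A simplified sketch of that halting rule, assuming the per-step probabilities have already been produced by a learned unit (the probabilities and threshold below are hypothetical):

    ```python
    def act_iterations(halt_probs, threshold=0.99):
        """Return the number of refinement steps taken before the
        cumulative halting probability reaches the threshold
        (ACT-style halting); caps at the maximum step budget."""
        total = 0.0
        for step, p in enumerate(halt_probs, start=1):
            total += p
            if total >= threshold:
                return step
        return len(halt_probs)  # budget exhausted without halting

    print(act_iterations([0.2, 0.5, 0.4, 0.1]))  # 3 (0.2 + 0.5 + 0.4 >= 0.99)
    ```

    In a universal transformer, this lets easy positions stop iterating early while harder positions receive more attention passes.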

    Table of Contents
    Abstract
    Acknowledgment
    Table of Contents
    List of Figures
    List of Tables
    1 Introduction
      1.1 Human Motion Prediction
      1.2 Motivations
      1.3 Contributions
      1.4 Thesis Outline
    2 Related Work
      2.1 Modeling of Human Motion Prediction
      2.2 Loss Function
      2.3 Generative Adversarial Nets
      2.4 Transformers
    3 Proposed Method
      3.1 Overall Methodology
      3.2 Data Pre-processing and Position Encoding
      3.3 Transformer Encoder Stack
        3.3.1 Scaled Dot-Product Attention
        3.3.2 Multi-Head Attention
        3.3.3 Position-wise Feed-Forward Networks
      3.4 Universal Transformers
      3.5 Loss Function
    4 Experimental Result
      4.1 Evaluation Protocol and Experimental Setup
      4.2 Ablation Studies
        4.2.1 Data Pre-processing
        4.2.2 Transformer Configuration
        4.2.3 Loss Function
      4.3 Comparison With State-of-the-Art Methods
    5 Conclusion and Future Works
      5.1 Conclusion
      5.2 Future Works
    Appendix A: Class-wise ablation studies
    Appendix B: Performance comparison of state-of-the-art method
    Appendix C: Visualization of attention distributions
    References
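    The abstract (and Section 3.5) names a mean absolute loss for training. As a generic illustration, not the thesis implementation, such a loss averages the absolute pose error over all frames and joint coordinates (the toy arrays below are made up):

    ```python
    import numpy as np

    def mean_absolute_loss(pred, target):
        """Average of |pred - target| over all frames and joint coordinates."""
        return np.abs(pred - target).mean()

    # Toy example: 2 frames x 2 joint coordinates
    pred = np.array([[0.1, 0.2], [0.3, 0.5]])
    target = np.array([[0.0, 0.2], [0.4, 0.5]])
    print(mean_absolute_loss(pred, target))  # ≈ 0.05
    ```

    Compared with a squared-error loss, the absolute-error form penalizes large pose deviations less aggressively, which is one common motivation for using it on motion data.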


    Full-text release date: 2024/08/20 (campus network)
    Full text not authorized for public release (off-campus network)
    Full text not authorized for public release (National Central Library: Taiwan NDLTD system)