Simple Search / Detailed Record Display

Graduate Student: 徐孟辰 (Meng-Chen Xu)
Thesis Title: Word-level to Sentence-level Realistic Sign Language Video Generation for American Sign Language (字彙層次至句子層次美國手語視訊生成之研究)
Advisor: 楊傳凱 (Chuan-Kai Yang)
Committee Members: 林伯慎 (Bor-Shen Lin), 孫沛立 (Pei-Li Sun)
Degree: Master
Department: Department of Information Management, School of Management
Publication Year: 2021
Graduation Academic Year: 109
Language: English
Number of Pages: 57
Keywords: American Sign Language, Sign Language Video Generation, Pose Transition Estimation, Two-stage Video Generation
Hits: 255 views / 0 downloads
Abstract (Chinese):
There are many methods for generating sign language videos, but nearly all of them are based on 3D character modeling. These methods are time-consuming and labor-intensive, and their realism and naturalness are difficult to match against videos of real human signers. We therefore propose a novel approach that uses the recently popular generative adversarial network to regenerate word-level sign language clips into sentence-level videos.
Our proposed system uses pose transition estimation to estimate the gap between sign language clips and synthesize the corresponding transition skeletons; because it relies on interpolation, it is faster than graphics-based methods and requires no additional dataset.
In addition, we propose a model-stacking approach built on the Vid2Vid model, in which two Vid2Vid models are stacked to generate video in two stages. The first stage generates IUV images (three-channel images composed of a part index I and UV texture coordinates) from skeleton images, and the second stage generates realistic video from the skeleton images and the IUV images. Our experiments use the American Sign Language Lexicon Video Dataset (ASLLVD). We find that when the skeletons are produced by our proposed pose transition estimation method, videos generated with the proposed two-stage method are of higher quality than those generated directly from skeletons alone.
Finally, we also develop a graphical user interface that lets users drag and drop clips onto a video track to generate the final sign language video.


Abstract (English):
There are many ways to generate sign language videos, but almost all of them are based on 3D character modeling. These methods are time-consuming and labor-intensive, and in terms of realism and naturalness they are hard to match against videos of real human signers. Therefore, we propose a novel approach that uses the recently popular generative adversarial network to synthesize sentence-level videos from word-level video clips.
The pose transition estimation in our proposed system estimates the gap between sign language clips and synthesizes the corresponding transition skeletons. Because we use an interpolation approach, it is faster than graphics-based approaches and does not require additional datasets.
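To make the interpolation idea concrete, below is a minimal Python sketch of the two steps described above. It assumes OpenPose-style (J, 2) arrays of 2D keypoints; the px_per_frame constant that converts the estimated gap into a number of transition frames is purely illustrative and is not the thesis's actual gap-estimation formula.

```python
import numpy as np

def estimate_gap(last_pose: np.ndarray, first_pose: np.ndarray,
                 px_per_frame: float = 8.0) -> int:
    """Estimate how many transition frames to insert between two clips.

    last_pose / first_pose: (J, 2) arrays of 2D joint coordinates for the
    last frame of the preceding clip and the first frame of the next clip.
    px_per_frame is an illustrative motion-speed assumption, not the
    thesis's fitted gap-estimation parameter.
    """
    mean_dist = np.linalg.norm(last_pose - first_pose, axis=1).mean()
    return max(1, int(round(mean_dist / px_per_frame)))

def interpolate_transition(last_pose: np.ndarray, first_pose: np.ndarray,
                           n_frames: int) -> np.ndarray:
    """Linearly interpolate n_frames transition skeletons between two poses."""
    ts = np.linspace(0.0, 1.0, n_frames + 2)[1:-1]   # interior time steps only
    return np.stack([(1.0 - t) * last_pose + t * first_pose for t in ts])

# Usage: bridge the boundary between two word-level clips.
# transition = interpolate_transition(clip_a_poses[-1], clip_b_poses[0],
#                                     estimate_gap(clip_a_poses[-1], clip_b_poses[0]))
```

Linear interpolation keeps the transition synthesis fast and dataset-free; in practice the interpolated joints still need cleanup, which is why the system also includes a skeleton-correction stage (Section 3.4).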
Furthermore, we propose a model-stacking approach based on the Vid2Vid model: two Vid2Vid models are stacked to generate videos in two stages. The first stage generates IUV images (three-channel images composed of a part index I and UV texture coordinates) from the skeleton images, and the second stage generates realistic video from the skeleton images and the IUV images. We use the American Sign Language Lexicon Video Dataset (ASLLVD) in our experiments. We found that when the skeletons are generated by our proposed pose transition estimation method, the quality of the videos produced by our two-stage generation method is higher than that of videos generated directly from the skeletons alone.
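The data flow of the two-stage pipeline can be sketched as follows. The stage callables are hypothetical stand-ins for the two separately trained Vid2Vid generators (the actual vid2vid code is driven through its own training and test scripts), and stacking skeleton and IUV images channel-wise is just one straightforward way to condition the second stage on both inputs; the thesis may wire the conditioning differently.

```python
import numpy as np
from typing import Callable, List

# Hypothetical wrapper type: one trained Vid2Vid generator applied to a whole
# frame sequence (a stand-in for illustration, not the vid2vid repository's API).
Stage = Callable[[List[np.ndarray]], List[np.ndarray]]

def generate_sentence_video(skeleton_frames: List[np.ndarray],
                            skel_to_iuv: Stage,
                            skel_iuv_to_rgb: Stage) -> List[np.ndarray]:
    """Two-stage synthesis: skeleton images -> IUV images -> realistic frames."""
    # Stage 1: each skeleton image is translated into a 3-channel IUV image
    # (part index I plus UV texture coordinates, as produced by DensePose).
    iuv_frames = skel_to_iuv(skeleton_frames)
    # Stage 2: condition on skeleton and IUV together (here by stacking the
    # two images channel-wise) and render the photorealistic output frames.
    conditioned = [np.concatenate([skel, iuv], axis=2)
                   for skel, iuv in zip(skeleton_frames, iuv_frames)]
    return skel_iuv_to_rgb(conditioned)
```

Splitting the task this way lets the first model focus on recovering a dense body-surface representation from the sparse skeleton, so the second model only has to add texture and appearance.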
Finally, we also develop a graphical user interface that allows users to drag and drop clips onto the video track and generate the final realistic sign language video.

Table of Contents
Recommendation Letter i
Approval Letter ii
Abstract in Chinese iii
Abstract in English iv
Acknowledgements v
Contents vi
List of Figures x
List of Tables xv
List of Algorithms xvii
1 Introduction 1
1.1 Motivation 1
1.2 Purpose 2
1.3 Organization 3
2 Related Work 5
2.1 Sign Language Video Generation 5
2.1.1 Realistic Approach 5
2.2 Sign Language Datasets 6
2.3 Motion Transition Synthesis 6
3 Proposed System 7
3.1 System Overview 7
3.2 Data Preprocess 8
3.3 Pose Transition Estimation 9
3.3.1 Gap Estimation 10
3.3.2 Interpolation 11
3.4 Skeleton Correction 12
3.4.1 Irregular Hand Pose Detection 12
3.4.2 Spatio-Temporal Hand Pose Correction 13
3.4.3 Leg Pose Detection & Correction 17
3.5 Skeleton Images to IUV Images Synthesis 18
3.6 Skeleton Images & IUV Images to Realistic Images Synthesis 20
3.7 Complete Video Generation 21
3.8 Web Graphical User Interface 22
4 Experiments 24
4.1 System Environment 24
4.2 Dataset 24
4.3 Implementation Details 26
4.3.1 Data Preprocessor 26
4.3.2 Gls2Vid 28
4.4 Training Details 29
4.5 Evaluations 30
4.5.1 Human Preference Score 30
4.5.2 Error of Gap Estimation 32
4.5.3 Error of Pose Transition Estimation 34
4.5.4 Quality Assessment of the Generated Video 34
4.5.5 Distribution Measurement of the Length between Hand Joints 38
4.6 Results 46
4.6.1 Comparison to Our Variants 46
4.6.2 Comparison to Other Methods 49
5 Conclusion 53
5.1 Limitations 54
5.2 Future Work 54
References 55

References
[1] Z. Cao, G. Hidalgo, T. Simon, S. E. Wei, and Y. Sheikh, "OpenPose: Realtime multi-person 2D pose estimation using part affinity fields," IEEE Transactions on Pattern Analysis and Machine Intelligence, 2019.
[2] R. A. Güler, N. Neverova, and I. Kokkinos, "DensePose: Dense human pose estimation in the wild," Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 2018.
[3] T. C. Wang, M. Y. Liu, J. Y. Zhu, G. Liu, A. Tao, J. Kautz, and B. Catanzaro, "Video-to-video synthesis," Advances in Neural Information Processing Systems, vol. 2018-December, 2018.
[4] Yulia, "Transition motion synthesis for video-based text to ASL," Master's thesis, National Taiwan University of Science and Technology, 2019.
[5] Z. Li and A. Aaron, "Toward a practical perceptual video quality metric." https://netflixtechblog.com/toward-a-practical-perceptual-video-quality-metric-653f208b9652, June 2016.
[6] Z. Wang, A. C. Bovik, H. R. Sheikh, and E. P. Simoncelli, "Image quality assessment: from error visibility to structural similarity," IEEE Transactions on Image Processing, vol. 13, no. 4, pp. 600–612, 2004.
[7] Z. Wang, E. P. Simoncelli, and A. C. Bovik, "Multiscale structural similarity for image quality assessment," in The Thirty-Seventh Asilomar Conference on Signals, Systems & Computers, 2003, vol. 2, pp. 1398–1402, 2003.
[8] World Health Organization, "Deafness and hearing loss." https://www.who.int/news-room/fact-sheets/detail/deafness-and-hearing-loss, 2020.
[9] W. Sandler and D. Lillo-Martin, Sign Language and Linguistic Universals. Cambridge University Press, 2006.
[10] World Federation of the Deaf, "Our work." http://wfdeaf.org/our-work/, 2018.
[11] S. Krapež and F. Solina, "Synthesis of the sign language of the deaf from the sign video clips," Elektrotehniski Vestnik/Electrotechnical Review, vol. 66, 1999.
[12] M. Borg and K. P. Camilleri, "Phonologically-meaningful subunits for deep learning-based sign language recognition," in ECCV 2020 Workshop on Sign Language Recognition, Translation and Production, 2020.
[13] E. P. D. Silva, P. Dornhofer, P. Costa, K. Mamhy, O. Kumada, J. M. D. Martino, and G. A. Florentino, "Recognition of affective and grammatical facial expressions: a study for Brazilian Sign Language," in ECCV 2020 Workshop on Sign Language Recognition, Translation and Production, 2020.
[14] M. Parelli, K. Papadimitriou, G. Potamianos, G. Pavlakos, and P. Maragos, "Exploiting 3D hand pose estimation in deep learning-based sign language recognition from RGB videos," in ECCV 2020 Workshop on Sign Language Recognition, Translation and Production, 2020.
[15] X. Liang, A. Angelopoulou, E. Kapetanios, B. Woll, R. Al-Batat, and T. Woolfe, "A multi-modal machine learning approach and toolkit to automate recognition of early stages of dementia among British Sign Language users," in ECCV 2020 Workshop on Sign Language Recognition, Translation and Production, 2020.
[16] C. Gökçe, O. Özdemir, A. A. Kındıroğlu, and L. Akarun, "Score-level multi cue fusion for sign language recognition," in ECCV 2020 Workshop on Sign Language Recognition, Translation and Production, 2020.
[17] S. Stoll, N. C. Camgoz, S. Hadfield, and R. Bowden, "Text2Sign: Towards sign language production using neural machine translation and generative adversarial networks," International Journal of Computer Vision, vol. 128, 2020.
[18] M. Mirza and S. Osindero, "Conditional generative adversarial nets," CoRR, 2014.
[19] R. Elliott, J. R. Glauert, J. R. Kennaway, and I. Marshall, "The development of language processing support for the ViSiCAST project," Annual ACM Conference on Assistive Technologies, Proceedings, 2000.
[20] I. Zwitserlood, M. Verlinden, J. Ros, and S. Schoot, "Synthetic signing for the deaf: eSIGN." http://www.visicast.cmp.uea.ac.uk/Papers/Synthetic01, 2005.
[21] M. Papadogiorgaki, N. Grammalidis, D. Tzovaras, and M. G. Strintzis, "Text-to-sign language synthesis tool," 13th European Signal Processing Conference, EUSIPCO 2005, 2005.
[22] I. J. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio, "Generative adversarial nets," Advances in Neural Information Processing Systems, vol. 3, 2014.
[23] A. Radford, L. Metz, and S. Chintala, "Unsupervised representation learning with deep convolutional generative adversarial networks," International Conference on Learning Representations, 2016.
[24] M. Arjovsky, S. Chintala, and L. Bottou, "Wasserstein generative adversarial networks," 34th International Conference on Machine Learning, ICML 2017, vol. 1, 2017.
[25] P. Isola, J. Y. Zhu, T. Zhou, and A. A. Efros, "Image-to-image translation with conditional adversarial networks," Proceedings - 30th IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2017, vol. 2017-January, 2017.
[26] A. M. Martínez, R. B. Wilbur, R. Shay, and A. C. Kak, "Purdue RVL-SLLL ASL database for automatic recognition of American Sign Language," Proceedings - 4th IEEE International Conference on Multimodal Interfaces, ICMI 2002, 2002.
[27] V. Athitsos, C. Neidle, S. Sclaroff, J. Nash, A. Stefan, Q. Yuan, and A. Thangali, "The American Sign Language Lexicon Video Dataset," 2008 IEEE Computer Society Conference on Computer Vision and Pattern Recognition Workshops, CVPR Workshops, 2008.
[28] P. Lu and M. Huenerfauth, "Collecting and evaluating the CUNY ASL corpus for research on American Sign Language animation," Computer Speech and Language, vol. 28, 2014.
[29] C. Chen, B. Zhang, Z. Hou, J. Jiang, M. Liu, and Y. Yang, "Action recognition from depth sequences using weighted fusion of 2D and 3D auto-correlation of gradients features," Multimedia Tools and Applications, vol. 76, 2017.
[30] J. Forster, C. Schmidt, T. Hoyoux, O. Koller, U. Zelle, J. Piater, and H. Ney, "RWTH-PHOENIX-Weather: A large vocabulary sign language recognition and translation corpus," Proceedings of the 8th International Conference on Language Resources and Evaluation, LREC 2012, 2012.
[31] O. Koller, J. Forster, and H. Ney, "Continuous sign language recognition: Towards large vocabulary statistical recognition systems handling multiple signers," Computer Vision and Image Understanding, vol. 141, 2015.
[32] M. Oszust and M. Wysocki, "Polish sign language words recognition with Kinect," 2013 6th International Conference on Human System Interactions, HSI 2013, 2013.
[33] F. Quiroga, "Sign language recognition datasets." http://facundoq.github.io/guides/sign_language_datasets/slr, 2020.
[34] J. Min and J. Chai, "Motion Graphs++: A compact generative model for semantic motion analysis and synthesis," ACM Transactions on Graphics, vol. 31, 2012.
[35] T. Simon, H. Joo, I. Matthews, and Y. Sheikh, "Hand keypoint detection in single images using multiview bootstrapping," Proceedings - 30th IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2017, vol. 2017-January, 2017.
[36] E. Ilg, N. Mayer, T. Saikia, M. Keuper, A. Dosovitskiy, and T. Brox, "FlowNet 2.0: Evolution of optical flow estimation with deep networks," Proceedings - 30th IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2017, vol. 2017-January, 2017.
[37] NVIDIA, "NVIDIA Container Toolkit." https://github.com/NVIDIA/nvidia-docker, 2015.
[38] D. Merkel, "Docker: lightweight Linux containers for consistent development and deployment," Linux Journal, vol. 2014, no. 239, p. 2, 2014.
[39] S. Tomar, "Converting video formats with FFmpeg," Linux Journal, vol. 2006, no. 146, p. 10, 2006.
[40] M. Tavakoli, R. Batista, and L. Sgrigna, "The UC Softhand: Light weight adaptive bionic hand with a compact twisted string actuation system," Actuators, vol. 5, p. 1, December 2015.

Full-text release date: 2026/01/26 (campus network)
Full-text release date: 2026/01/26 (off-campus network)
Full-text release date: 2026/01/26 (National Central Library: Taiwan NDLTD system)