Word-level to Sentence-level Realistic Sign Language Video Generation for American Sign Language
College of Management - Department of Information Management
Thesis Publication Year: 2021
Graduation Academic Year: 109
Keywords (in Chinese): 美國手語、手語視訊生成、姿勢過渡估計、兩階段影片生成
Keywords (in other languages): American Sign Language, Sign Language Video Generation, Pose Transition Estimation, Two-stage Video Generation
There are many ways to generate sign language videos, but almost all of them are based on 3D character modeling. These methods are time-consuming and labor-intensive, and their realism and naturalness are difficult to match to those of videos of real human signers. We therefore propose a novel approach that uses the recently popular generative adversarial network to synthesize sentence-level videos from word-level video clips.
The pose transition estimation module in our proposed system estimates the distance between sign language clips and synthesizes the corresponding transition skeletons. Because it uses an interpolation approach, it is faster than a graphics-based approach and requires no additional datasets.
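The interpolation idea above can be sketched as follows. This is a minimal illustration, not the thesis's actual implementation: the distance metric, the `frames_per_unit` scaling factor, and the function names are all assumptions made for the example.

```python
import numpy as np

def pose_distance(pose_a, pose_b):
    """Mean Euclidean distance between corresponding keypoints (illustrative metric)."""
    return float(np.linalg.norm(pose_a - pose_b, axis=1).mean())

def interpolate_transition(pose_a, pose_b, frames_per_unit=0.5):
    """Synthesize transition skeletons by linear interpolation.

    pose_a, pose_b: (K, 2) arrays of 2D keypoints for the last frame of one
    clip and the first frame of the next. The number of in-between frames
    grows with the pose distance, so larger jumps get longer transitions.
    """
    n = max(1, int(round(pose_distance(pose_a, pose_b) * frames_per_unit)))
    # t runs strictly between 0 and 1 so the endpoint frames are not duplicated.
    ts = np.linspace(0.0, 1.0, n + 2)[1:-1]
    return [(1.0 - t) * pose_a + t * pose_b for t in ts]
```

Each interpolated skeleton can then be rendered as a skeleton image and fed to the video generator, which is what makes this cheaper than a graphics pipeline: no motion-capture or animation data is needed.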
Furthermore, we also propose a stacking-based approach built on the Vid2Vid model: two Vid2Vid models are stacked to generate videos in two stages. The first stage generates IUV images (three-channel images composed of a part index I and UV texture coordinates) from the skeleton images, and the second stage generates realistic video from the skeleton images together with the IUV images. We use the American Sign Language Lexicon Video Dataset (ASLLVD) in our experiments. We found that when the skeletons are generated by our proposed pose transition estimation method, the videos produced by our two-stage generation method are of higher quality than those generated directly from the skeletons alone.
Finally, we also develop a graphical user interface that allows users to drag and drop clips onto a video track and generate the final realistic sign language video.