
Author: Yohanes Satria Nugroho
Thesis Title: Lightweight American Sign Language Recognition Using a Deep Learning Approach
Advisor: Chuan-Kai Yang (楊傳凱)
Committee: Yuan-Cheng Lai (賴源正), Bor-Shen Lin (林伯慎)
Degree: Master (碩士)
Department: Department of Information Management, College of Management
Thesis Publication Year: 2023
Graduation Academic Year: 111 (ROC calendar)
Language: English
Pages: 54
Keywords (in Chinese): Sign Language Recognition, Lightweight Model, Keypoints Estimation
Keywords (in other languages): Sign Language Recognition, Lightweight Model, Keypoints Estimation
Reference times: Clicks: 432, Downloads: 0
Abstract:

Sign Language Recognition is a variant of Action Recognition that involves more detailed features, such as hand shapes and movements. Researchers have been applying computer-based methods to tackle this task over the years, but the proposed methods are constrained by hardware limitations, which prevents them from being applied in real-life situations.

In this research, we explore the possibility of creating a lightweight Sign Language Recognition model that can be applied in real-life situations. We explore two different approaches. First, we extract keypoints and use a simple LSTM model for recognition, achieving 75% Top-1 validation accuracy. Second, we use the lightweight MoViNet A0 model and achieve 71% Top-1 test accuracy. Although both models perform slightly worse than the state-of-the-art I3D, their complexity in terms of FLOPs is far lower.
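As a rough illustration of the first approach, the sketch below shows a keypoint-sequence classifier: per-frame keypoints (for example, MediaPipe Holistic landmarks) are flattened into one feature vector per frame and fed to an LSTM whose final hidden state is passed to a linear classifier. This is a minimal PyTorch sketch, not the thesis implementation; the feature size (543 landmarks x 3 coordinates = 1,629), hidden size, layer count, class count, and clip length are illustrative assumptions.

```python
import torch
import torch.nn as nn


class KeypointLSTMClassifier(nn.Module):
    """Classifies a clip from a sequence of flattened per-frame keypoint vectors."""

    def __init__(self, num_keypoint_features: int, num_classes: int,
                 hidden_size: int = 128, num_layers: int = 2):
        super().__init__()
        # The LSTM consumes one flattened keypoint vector per video frame.
        self.lstm = nn.LSTM(
            input_size=num_keypoint_features,
            hidden_size=hidden_size,
            num_layers=num_layers,
            batch_first=True,
        )
        self.classifier = nn.Linear(hidden_size, num_classes)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, frames, num_keypoint_features)
        _, (h_n, _) = self.lstm(x)
        # Use the last layer's final hidden state as the clip representation.
        return self.classifier(h_n[-1])


# Hypothetical example: 30-frame clips, 543 landmarks x (x, y, z) = 1629 features, 100 sign classes.
model = KeypointLSTMClassifier(num_keypoint_features=1629, num_classes=100)
dummy_clip = torch.randn(4, 30, 1629)   # batch of 4 clips
logits = model(dummy_clip)              # (4, 100) class scores
```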



Table of Contents:
    Master's Thesis Recommendation Form
    Qualification Form by Master's Degree Examination Committee
    Abstract
    Acknowledgements
    Contents
    List of Figures
    List of Tables
    1 Introduction
      1.1 Background
      1.2 Contribution
      1.3 Research Outline
    2 Related Works
      2.1 Action Recognition
      2.2 Sign Language Recognition
      2.3 Human Pose Estimation
      2.4 MoViNets
    3 Proposed Method
      3.1 Models
        3.1.1 Keypoints + LSTM
        3.1.2 MoViNet
      3.2 Dataset
      3.3 Dataset Processing
        3.3.1 Keypoints Input
        3.3.2 RGB Input
    4 Experiments & Results
      4.1 Training
        4.1.1 LSTM
        4.1.2 MoViNet
      4.2 Experimental Results
      4.3 Limitations
    5 Conclusion and Discussion
      5.1 Conclusion
      5.2 Future Works
    Reference


Full text public date 2026/01/18 (Intranet public)
    Full text public date 2026/01/18 (Internet public)
    Full text public date 2026/01/18 (National library)