TY - GEN
T1 - A Comparative Study of Video-Based Human Representations for American Sign Language Alphabet Generation
AU - Xu, Fei
AU - Chaudhary, Lipisha
AU - Dong, Lu
AU - Setlur, Srirangaraj
AU - Govindaraju, Venu
AU - Nwogu, Ifeoma
N1 - Publisher Copyright: © 2024 IEEE.
PY - 2024
Y1 - 2024
N2 - Sign language is a complex visual language, and automatic interpretations of sign language can facilitate communication involving deaf individuals. As one of the essential components of sign language, fingerspelling connects the natural spoken languages to the sign language and expands the scale of sign language vocabulary. In practice, it is challenging to analyze fingerspelling alphabets due to their signing speed and small motion range. The usage of synthetic data has the potential of further improving fingerspelling alphabets analysis at scale. In this paper, we evaluate how different video-based human representations perform in a framework for Alphabet Generation for American Sign Language (ASL). We tested three mainstream video-based human representations: two-stream inflated 3D ConvNet, 3D landmarks of body joints, and rotation matrices of body joints. We also evaluated the effect of different skeleton graphs and selected body joints. The generation process of ASL fingerspelling used a transformer-based Conditional Variational Autoencoder. To train the model, we collected ASL alphabet signing videos from 17 signers with dynamic alphabet signing. The generated alphabets were evaluated using automatic metrics of quality such as FID, and we also considered supervised metrics by recognizing the generated entries using Spatio-Temporal Graph Convolutional Networks. Our experiments show that using the rotation matrices of the upper body joints and the signing hand give the best results for the generation of ASL alphabet signing. Going forward, our goal is to produce articulated fingerspelling words by combining individual alphabets learned in this work.
AB - Sign language is a complex visual language, and automatic interpretations of sign language can facilitate communication involving deaf individuals. As one of the essential components of sign language, fingerspelling connects the natural spoken languages to the sign language and expands the scale of sign language vocabulary. In practice, it is challenging to analyze fingerspelling alphabets due to their signing speed and small motion range. The usage of synthetic data has the potential of further improving fingerspelling alphabets analysis at scale. In this paper, we evaluate how different video-based human representations perform in a framework for Alphabet Generation for American Sign Language (ASL). We tested three mainstream video-based human representations: two-stream inflated 3D ConvNet, 3D landmarks of body joints, and rotation matrices of body joints. We also evaluated the effect of different skeleton graphs and selected body joints. The generation process of ASL fingerspelling used a transformer-based Conditional Variational Autoencoder. To train the model, we collected ASL alphabet signing videos from 17 signers with dynamic alphabet signing. The generated alphabets were evaluated using automatic metrics of quality such as FID, and we also considered supervised metrics by recognizing the generated entries using Spatio-Temporal Graph Convolutional Networks. Our experiments show that using the rotation matrices of the upper body joints and the signing hand give the best results for the generation of ASL alphabet signing. Going forward, our goal is to produce articulated fingerspelling words by combining individual alphabets learned in this work.
UR - https://www.scopus.com/pages/publications/85199458916
U2 - 10.1109/FG59268.2024.10582020
DO - 10.1109/FG59268.2024.10582020
M3 - Conference contribution
T3 - 2024 IEEE 18th International Conference on Automatic Face and Gesture Recognition, FG 2024
BT - 2024 IEEE 18th International Conference on Automatic Face and Gesture Recognition, FG 2024
PB - Institute of Electrical and Electronics Engineers Inc.
T2 - 18th IEEE International Conference on Automatic Face and Gesture Recognition, FG 2024
Y2 - 27 May 2024 through 31 May 2024
ER -