TY - GEN
T1 - Skeleton-Based Methods for Speaker Action Classification on Lecture Videos
AU - Xu, Fei
AU - Davila, Kenny
AU - Setlur, Srirangaraj
AU - Govindaraju, Venu
N1 - Publisher Copyright: © 2021, Springer Nature Switzerland AG.
PY - 2021
Y1 - 2021
AB - The volume of online lecture videos is growing at a frenetic pace. This has led to an increased focus on methods for automated lecture video analysis to make these resources more accessible. These methods consider multiple information channels, including the actions of the lecture speaker. In this work, we analyze two methods that use spatio-temporal features of the speaker skeleton for action classification in lecture videos. The first method is the AM Pose model, which is based on Random Forests with motion-based features. The second is a state-of-the-art action classifier based on a two-stream adaptive graph convolutional network (2S-AGCN) that uses features of both joints and bones of the speaker skeleton. Each video is divided into fixed-length temporal segments. Then, the speaker skeleton is estimated on every frame in order to build a representation for each segment for further classification. Our experiments used the AccessMath dataset and a novel extension which will be publicly released. We compared four state-of-the-art pose estimators: OpenPose, Deep High Resolution, AlphaPose, and Detectron2. We found that AlphaPose is the most robust to the encoding noise found in online videos. We also observed that 2S-AGCN outperforms the AM Pose model when using the right domain adaptations.
KW - Action classification
KW - Lecture video analysis
KW - Lecture video dataset
KW - Pose estimation
UR - https://www.scopus.com/pages/publications/85103437131
DO - 10.1007/978-3-030-68799-1_18
M3 - Conference contribution
SN - 9783030687984
T3 - Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
SP - 250
EP - 264
BT - Pattern Recognition. ICPR International Workshops and Challenges, 2021, Proceedings
A2 - Del Bimbo, Alberto
A2 - Cucchiara, Rita
A2 - Sclaroff, Stan
A2 - Farinella, Giovanni Maria
A2 - Mei, Tao
A2 - Bertini, Marco
A2 - Escalante, Hugo Jair
A2 - Vezzani, Roberto
PB - Springer Science and Business Media Deutschland GmbH
T2 - 25th International Conference on Pattern Recognition Workshops, ICPR 2020
Y2 - 10 January 2021 through 15 January 2021
ER -