Publication:

Efficient Spatial Temporal Convolutional Features for Audiovisual Continuous Affect Recognition

Date

2019
dc.contributor.authorChen, Haifeng
dc.contributor.authorDeng, Yifan
dc.contributor.authorCheng, Shiwen
dc.contributor.authorWang, Yixuan
dc.contributor.authorJiang, Dongmei
dc.contributor.authorSahli, Hichem
dc.date.accessioned2026-03-16T15:01:31Z
dc.date.available2026-03-16T15:01:31Z
dc.date.createdwos2025-10-31
dc.date.issued2019
dc.description.abstractAffective dimension prediction from multi-modal signals is becoming an increasingly attractive research field in artificial intelligence (AI) and human-computer interaction (HCI). Previous works have shown that discriminative features from multiple modalities are important for accurately recognizing emotional states. Recently, deep representations have proved effective for emotional state recognition. To investigate new deep spatial-temporal features and evaluate their effectiveness for affective dimension recognition, in this paper we propose: (1) combining a pre-trained 2D-CNN and a 1D-CNN to learn deep spatial-temporal features from video images and audio spectrograms; and (2) a Spatial-Temporal Graph Convolutional Network (ST-GCN) adapted to facial landmark graphs. To evaluate the effectiveness of the proposed spatial-temporal features for affective dimension prediction, we propose a Deep Bidirectional Long Short-Term Memory (DBLSTM) model for single-modality prediction, as well as early-fusion and late-fusion predictions. For the liking dimension, we use the text modality for prediction. Experimental results on the AVEC2019 CES dataset show that our proposed spatial-temporal features and recognition model obtain promising results. On the development set, the obtained concordance correlation coefficient (CCC) is up to 0.724 for arousal and 0.705 for valence; on the test set, the CCC is 0.513 for arousal and 0.515 for valence, outperforming the baseline system, whose corresponding CCCs are 0.355 for arousal and 0.468 for valence.
dc.description.wosFundingTextThis work was supported by the Shaanxi Provincial International Science and Technology Collaboration Project (grant 2017KW-ZD-14), and the VUB Interdisciplinary Research Program through the EMO-App project.
dc.identifier.doi10.1145/3347320.3357690
dc.identifier.urihttps://imec-publications.be/handle/20.500.12860/58850
dc.language.isoeng
dc.publisherASSOC COMPUTING MACHINERY
dc.source.beginpage19
dc.source.conference9TH INTERNATIONAL AUDIO/VISUAL EMOTION CHALLENGE AND WORKSHOP, AVEC 2019
dc.source.conferencedate2019-10-21
dc.source.conferencelocationNice
dc.source.endpage26
dc.source.journalPROCEEDINGS OF THE 9TH INTERNATIONAL AUDIO/VISUAL EMOTION CHALLENGE AND WORKSHOP, AVEC 2019
dc.source.numberofpages8
dc.titleEfficient Spatial Temporal Convolutional Features for Audiovisual Continuous Affect Recognition
dc.typeProceedings paper
dspace.entity.typePublication