Publication:

Efficient Spatial Temporal Convolutional Features for Audiovisual Continuous Affect Recognition

Date

2019
dc.contributor.authorChen, Haifeng
dc.contributor.authorDeng, Yifan
dc.contributor.authorCheng, Shiwen
dc.contributor.authorWang, Yixuan
dc.contributor.authorJiang, Dongmei
dc.contributor.authorSahli, Hichem
dc.date.accessioned2026-03-16T15:01:31Z
dc.date.available2026-03-16T15:01:31Z
dc.date.createdwos2025-10-31
dc.date.issued2019
dc.description.abstractAffective dimension prediction from multi-modal signals is becoming an increasingly attractive research field in artificial intelligence (AI) and human-computer interaction (HCI). Previous works have shown that discriminative features from multiple modalities are important for accurately recognizing emotional states. Recently, deep representations have proved effective for emotional state recognition. To investigate new deep spatial-temporal features and evaluate their effectiveness for affective dimension recognition, in this paper we propose: (1) combining a pre-trained 2D-CNN and a 1D-CNN to learn deep spatial-temporal features from video images and audio spectrograms; and (2) a Spatial-Temporal Graph Convolutional Network (ST-GCN) adapted to facial landmark graphs. To evaluate the effectiveness of the proposed spatial-temporal features for affective dimension prediction, we propose a Deep Bidirectional Long Short-Term Memory (DBLSTM) model for single-modality prediction, as well as early-fusion and late-fusion predictions. For the liking dimension, we use the text modality for prediction. Experimental results on the AVEC2019 CES dataset show that our proposed spatial-temporal features and recognition model obtain promising results. On the development set, the obtained concordance correlation coefficient (CCC) is up to 0.724 for arousal and 0.705 for valence; on the test set, the CCC is 0.513 for arousal and 0.515 for valence, outperforming the baseline system, whose corresponding CCCs are 0.355 for arousal and 0.468 for valence.
dc.description.wosFundingTextThis work was supported by the Shaanxi Provincial International Science and Technology Collaboration Project (grant 2017KW-ZD-14), and the VUB Interdisciplinary Research Program through the EMO-App project.
dc.identifier.doi10.1145/3347320.3357690
dc.identifier.urihttps://imec-publications.be/handle/20.500.12860/58850
dc.language.isoeng
dc.publisherASSOC COMPUTING MACHINERY
dc.source.beginpage19
dc.source.conference9TH INTERNATIONAL AUDIO/VISUAL EMOTION CHALLENGE AND WORKSHOP, AVEC 2019
dc.source.conferencedate2019-10-21
dc.source.conferencelocationNice
dc.source.endpage26
dc.source.journalPROCEEDINGS OF THE 9TH INTERNATIONAL AUDIO/VISUAL EMOTION CHALLENGE AND WORKSHOP, AVEC 2019
dc.source.numberofpages8
dc.titleEfficient Spatial Temporal Convolutional Features for Audiovisual Continuous Affect Recognition
dc.typeProceedings paper
dspace.entity.typePublication