Publication:

Embedding-based pair generation for contrastive representation learning in audio-visual surveillance data

 
dc.contributor.author: Wang, Wei Cheng
dc.contributor.author: De Coninck, Sander
dc.contributor.author: Leroux, Sam
dc.contributor.author: Simoens, Pieter
dc.contributor.imecauthor: Wang, Wei-Cheng
dc.contributor.imecauthor: De Coninck, Sander
dc.contributor.imecauthor: Leroux, Sam
dc.contributor.imecauthor: Simoens, Pieter
dc.contributor.orcidimec: De Coninck, Sander::0000-0003-3070-9814
dc.contributor.orcidimec: Leroux, Sam::0000-0003-3792-5026
dc.contributor.orcidimec: Simoens, Pieter::0000-0002-9569-9373
dc.date.accessioned: 2025-02-04T09:58:37Z
dc.date.available: 2025-02-03T18:45:54Z
dc.date.available: 2025-02-04T09:58:37Z
dc.date.issued: 2025
dc.description.abstract: Smart cities deploy various sensors, such as microphones and RGB cameras, to collect data that improves the safety and comfort of citizens. As data annotation is expensive, self-supervised methods such as contrastive learning are used to learn audio-visual representations for downstream tasks. Focusing on surveillance data, we investigate two common limitations of audio-visual contrastive learning: false negatives and the minimal sufficient information bottleneck. Irregular yet frequently recurring events can produce a considerable number of false-negative pairs and disrupt the model's training. To tackle this challenge, we propose a novel method for generating contrastive pairs based on the distance between embeddings of different modalities, rather than relying solely on temporal cues. The semantically synchronized pairs, together with a new loss function for multiple positives, can then be used to alleviate the minimal sufficient information bottleneck. We experimentally validate our approach on real-world data and show how the learnt representations can be used for different downstream tasks, including audio-visual event localization, anomaly detection, and event search. Our approach achieves performance comparable to state-of-the-art modality- and task-specific approaches.
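The abstract's core idea — selecting cross-modal positives by embedding similarity rather than temporal alignment alone, then averaging a contrastive loss over multiple positives — can be sketched as follows. This is a minimal illustration of the general technique, not the paper's exact formulation; the function names, the similarity threshold, and the temperature are assumptions introduced here for clarity.

```python
import numpy as np

def l2_normalize(x, eps=1e-8):
    # Unit-normalize embeddings so dot products equal cosine similarity.
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + eps)

def select_positive_pairs(audio_emb, visual_emb, threshold=0.8):
    """For each audio clip, mark as positives all visual clips whose embedding
    similarity exceeds `threshold`, in addition to the temporally aligned clip.
    Returns a boolean positive mask of shape (N, N)."""
    a = l2_normalize(np.asarray(audio_emb, dtype=np.float64))
    v = l2_normalize(np.asarray(visual_emb, dtype=np.float64))
    sim = a @ v.T                    # cross-modal cosine-similarity matrix
    pos = sim >= threshold           # semantically matching pairs
    np.fill_diagonal(pos, True)      # temporally aligned pairs stay positive
    return pos

def multi_positive_nce(audio_emb, visual_emb, pos_mask, temperature=0.1):
    """InfoNCE-style loss whose log-likelihood is averaged over every
    positive of each anchor, instead of assuming one positive per row."""
    a = l2_normalize(np.asarray(audio_emb, dtype=np.float64))
    v = l2_normalize(np.asarray(visual_emb, dtype=np.float64))
    logits = (a @ v.T) / temperature
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    per_anchor = (log_prob * pos_mask).sum(axis=1) / pos_mask.sum(axis=1)
    return -per_anchor.mean()
```

In this sketch, an irregular recurring event (e.g. the same siren appearing in several clips) yields extra off-diagonal positives, so those clips are no longer pushed apart as false negatives.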
dc.description.wosFundingText: The author(s) declare that financial support was received for the research, authorship, and/or publication of this article. This work was supported by the Flemish Government under the "Onderzoeksprogramma Artificiële Intelligentie (AI) Vlaanderen" programme.
dc.identifier.doi: 10.3389/frobt.2024.1490718
dc.identifier.issn: 2296-9144
dc.identifier.pmid: MEDLINE:39871999
dc.identifier.uri: https://imec-publications.be/handle/20.500.12860/45170
dc.publisher: FRONTIERS MEDIA SA
dc.source.beginpage: 1490718
dc.source.issue: /
dc.source.journal: FRONTIERS IN ROBOTICS AND AI
dc.source.numberofpages: 14
dc.source.volume: 11
dc.title: Embedding-based pair generation for contrastive representation learning in audio-visual surveillance data
dc.type: Journal article
dspace.entity.type: Publication
Files

Original bundle

Name: 8723.pdf
Size: 22.78 MB
Format: Adobe Portable Document Format
Description: Published