Publication:

Embedding-based pair generation for contrastive representation learning in audio-visual surveillance data

 
dc.contributor.author: Wang, Wei Cheng
dc.contributor.author: De Coninck, Sander
dc.contributor.author: Leroux, Sam
dc.contributor.author: Simoens, Pieter
dc.contributor.imecauthor: Wang, Wei-Cheng
dc.contributor.imecauthor: De Coninck, Sander
dc.contributor.imecauthor: Leroux, Sam
dc.contributor.imecauthor: Simoens, Pieter
dc.contributor.orcidimec: De Coninck, Sander::0000-0003-3070-9814
dc.contributor.orcidimec: Leroux, Sam::0000-0003-3792-5026
dc.contributor.orcidimec: Simoens, Pieter::0000-0002-9569-9373
dc.date.accessioned: 2025-02-04T09:58:37Z
dc.date.available: 2025-02-03T18:45:54Z
dc.date.available: 2025-02-04T09:58:37Z
dc.date.issued: 2025
dc.description.abstract: Smart cities deploy various sensors, such as microphones and RGB cameras, to collect data that improves the safety and comfort of citizens. As data annotation is expensive, self-supervised methods such as contrastive learning are used to learn audio-visual representations for downstream tasks. Focusing on surveillance data, we investigate two common limitations of audio-visual contrastive learning: false negatives and the minimal sufficient information bottleneck. Irregular yet frequently recurring events can produce a considerable number of false-negative pairs and disrupt the model's training. To tackle this challenge, we propose a novel method for generating contrastive pairs based on the distance between embeddings of different modalities, rather than relying solely on temporal cues. The semantically synchronized pairs, together with a new loss function for multiple positives, can then be used to alleviate the minimal sufficient information bottleneck. We experimentally validate our approach on real-world data and show how the learnt representations can be used for different downstream tasks, including audio-visual event localization, anomaly detection, and event search. Our approach achieves performance comparable to state-of-the-art modality- and task-specific approaches.
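The abstract's core idea — selecting cross-modal positives by embedding similarity rather than temporal alignment alone, then averaging a contrastive loss over multiple positives — can be sketched as follows. This is a minimal illustration of the general technique, not the paper's exact formulation; the function names, the similarity threshold, and the temperature are assumptions introduced here for clarity.

```python
import numpy as np

def l2_normalize(x, eps=1e-8):
    # Unit-normalize embeddings so dot products equal cosine similarity.
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + eps)

def select_positive_pairs(audio_emb, visual_emb, threshold=0.8):
    """For each audio clip, mark as positives all visual clips whose embedding
    similarity exceeds `threshold`, in addition to the temporally aligned clip.
    Returns a boolean positive mask of shape (N, N)."""
    a = l2_normalize(np.asarray(audio_emb, dtype=np.float64))
    v = l2_normalize(np.asarray(visual_emb, dtype=np.float64))
    sim = a @ v.T                    # cross-modal cosine-similarity matrix
    pos = sim >= threshold           # semantically matching pairs
    np.fill_diagonal(pos, True)      # temporally aligned pairs stay positive
    return pos

def multi_positive_nce(audio_emb, visual_emb, pos_mask, temperature=0.1):
    """InfoNCE-style loss whose log-likelihood is averaged over every
    positive of each anchor, instead of assuming one positive per row."""
    a = l2_normalize(np.asarray(audio_emb, dtype=np.float64))
    v = l2_normalize(np.asarray(visual_emb, dtype=np.float64))
    logits = (a @ v.T) / temperature
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    per_anchor = (log_prob * pos_mask).sum(axis=1) / pos_mask.sum(axis=1)
    return -per_anchor.mean()
```

In this sketch, an irregular recurring event (e.g. the same siren appearing in several clips) yields extra off-diagonal positives, so those clips are no longer pushed apart as false negatives.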
dc.description.wosFundingText: The author(s) declare that financial support was received for the research, authorship, and/or publication of this article. This work was supported by the Flemish Government under the "Onderzoeksprogramma Artificiële Intelligentie (AI) Vlaanderen" programme.
dc.identifier.doi: 10.3389/frobt.2024.1490718
dc.identifier.issn: 2296-9144
dc.identifier.pmid: MEDLINE:39871999
dc.identifier.uri: https://imec-publications.be/handle/20.500.12860/45170
dc.publisher: FRONTIERS MEDIA SA
dc.source.beginpage: 1490718
dc.source.issue: /
dc.source.journal: FRONTIERS IN ROBOTICS AND AI
dc.source.numberofpages: 14
dc.source.volume: 11
dc.title: Embedding-based pair generation for contrastive representation learning in audio-visual surveillance data
dc.type: Journal article
dspace.entity.type: Publication
Files

Original bundle

Name: 8723.pdf
Size: 22.78 MB
Format: Adobe Portable Document Format
Description: Published