Publication:

Exploiting speaker embeddings for improved microphone clustering and speech separation in ad-hoc microphone arrays

dc.contributor.authorKindt, Stijn
dc.contributor.authorThienpondt, Jenthe
dc.contributor.authorMadhu, Nilesh
dc.date.accessioned2026-03-16T12:07:43Z
dc.date.available2026-03-16T12:07:43Z
dc.date.createdwos2026-02-21
dc.date.issued2023-06-04
dc.description.abstractFor separating sources captured by ad-hoc distributed microphones, a key first step is assigning the microphones to the appropriate source-dominated clusters. The features used for such (blind) clustering are based on a fixed-length embedding of the audio signals in a high-dimensional latent space. In previous work, the embedding was hand-engineered from the Mel frequency cepstral coefficients and their modulation spectra. This paper argues that embedding frameworks designed explicitly to reliably discriminate between speakers would produce more appropriate features. We propose features generated by the state-of-the-art ECAPA-TDNN speaker verification model for the clustering. We benchmark these features in terms of the subsequent signal enhancement as well as the quality of the clustering, for which we further introduce three intuitive metrics. Results indicate that, in contrast to the hand-engineered features, the ECAPA-TDNN-based features lead to more logical clusters and better performance in the subsequent enhancement stages, thus validating our hypothesis.
dc.description.wosFundingTextThis work is supported by the Research Foundation - Flanders (FWO) under grant number G081420N and imec.ICON: BLE2AV (support from VLAIO). Partners: Imec, Televic, Cochlear, and Qorvo.
dc.identifier.doi10.1109/icassp49357.2023.10094862
dc.identifier.issn1520-6149
dc.identifier.urihttps://imec-publications.be/handle/20.500.12860/58834
dc.language.isoeng
dc.publisherIEEE
dc.source.beginpage1
dc.source.conferenceIEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING, ICASSP
dc.source.conferencedate2023-06-04
dc.source.conferencelocationRhodes, Greece
dc.source.endpage5
dc.source.journalIEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING, ICASSP 2023
dc.source.numberofpages5
dc.titleExploiting speaker embeddings for improved microphone clustering and speech separation in ad-hoc microphone arrays
dc.typeProceedings paper
dspace.entity.typePublication
Files

Original bundle

Name: DS631_acc.pdf
Size: 511.9 KB
Format: Adobe Portable Document Format
Description: Accepted