Publication:

Integrating Visual Context Into Language Models for Situated Social Conversation Starters

cris.virtual.orcid: 0000-0002-1790-9531
cris.virtual.orcid: 0000-0001-5207-7745
cris.virtual.orcid: 0000-0002-7420-7181
cris.virtual.orcid: 0000-0002-9901-5768
dc.contributor.author: Janssens, Ruben
dc.contributor.author: Wolfert, Pieter
dc.contributor.author: Demeester, Thomas
dc.contributor.author: Belpaeme, Tony
dc.contributor.imecauthor: Janssens, Ruben
dc.contributor.imecauthor: Demeester, Thomas
dc.contributor.imecauthor: Belpaeme, Tony
dc.contributor.orcidimec: Janssens, Ruben::0000-0002-1790-9531
dc.contributor.orcidimec: Demeester, Thomas::0000-0002-9901-5768
dc.contributor.orcidimec: Belpaeme, Tony::0000-0001-5207-7745
dc.date.accessioned: 2025-05-05T10:11:13Z
dc.date.available: 2025-05-03T05:31:09Z
dc.date.available: 2025-05-05T10:11:13Z
dc.date.issued: 2025
dc.description.abstract: Embodied conversational agents that interact socially with people in the physical world require multi-modal capabilities, such as appropriately responding to visual features of users. While existing vision-and-language models can generate language based on visual input, this language is not situated in a social interaction in the physical world. We present a novel task called Visual Conversation Starters, in which an agent generates a conversation-starting question referring to features visible in an image of the user. We collect a dataset of 4000 images of people with 12000 crowdsourced conversation starters and compare various model architectures: fine-tuning smaller seq2seq or image-to-text models versus zero-shot prompting of GPT-3.5, using image captions versus end-to-end image input, and training on human data versus synthetic questions generated by GPT-3.5. The models were used to generate friendly conversation starters, which were evaluated on criteria including language fluency, visual grounding, interestingness, and politeness. Results show that GPT-3.5 generates more interesting and polite questions than smaller models fine-tuned on crowdsourced data, but vision-to-language models are better at referencing visual features and can mimic GPT-3.5's performance. This demonstrates the feasibility of deep visiolinguistic models for situated social agents, forming an important first stage in creating situated multimodal social interaction.
dc.description.wosFundingText: This work was supported in part by the Flemish Government (AI Research Program), in part by the Horizon Europe VALAWAI project under Grant 101070930, and in part by the European ROBotics and AI Network (euROBIN) under Grant 101070596.
dc.identifier.doi: 10.1109/TAFFC.2024.3428704
dc.identifier.issn: 1949-3045
dc.identifier.uri: https://imec-publications.be/handle/20.500.12860/45597
dc.publisher: IEEE-INST ELECTRICAL ELECTRONICS ENGINEERS INC
dc.source.beginpage: 223
dc.source.endpage: 236
dc.source.issue: 1
dc.source.journal: IEEE TRANSACTIONS ON AFFECTIVE COMPUTING
dc.source.numberofpages: 14
dc.source.volume: 16
dc.title: Integrating Visual Context Into Language Models for Situated Social Conversation Starters
dc.type: Journal article
dspace.entity.type: Publication
Files

Original bundle

DS798.pdf (2.29 MB, Adobe Portable Document Format): Published
DS798_acc.pdf (3.43 MB, Adobe Portable Document Format): Accepted