Publication:

Integrating Visual Context Into Language Models for Situated Social Conversation Starters

cris.virtual.orcid: 0000-0002-1790-9531
cris.virtual.orcid: 0000-0001-5207-7745
cris.virtual.orcid: 0000-0002-7420-7181
cris.virtual.orcid: 0000-0002-9901-5768
dc.contributor.author: Janssens, Ruben
dc.contributor.author: Wolfert, Pieter
dc.contributor.author: Demeester, Thomas
dc.contributor.author: Belpaeme, Tony
dc.contributor.imecauthor: Janssens, Ruben
dc.contributor.imecauthor: Demeester, Thomas
dc.contributor.imecauthor: Belpaeme, Tony
dc.contributor.orcidimec: Janssens, Ruben::0000-0002-1790-9531
dc.contributor.orcidimec: Demeester, Thomas::0000-0002-9901-5768
dc.contributor.orcidimec: Belpaeme, Tony::0000-0001-5207-7745
dc.date.accessioned: 2025-05-05T10:11:13Z
dc.date.available: 2025-05-03T05:31:09Z
dc.date.available: 2025-05-05T10:11:13Z
dc.date.issued: 2025
dc.description.abstract: Embodied conversational agents that interact socially with people in the physical world require multi-modal capabilities, such as appropriately responding to visual features of users. While existing vision-and-language models can generate language based on visual input, this language is not situated in a social interaction in the physical world. We present a novel task called Visual Conversation Starters, in which an agent generates a conversation-starting question referring to features visible in an image of the user. We collect a dataset of 4000 images of people with 12000 crowdsourced conversation starters and compare various model architectures: fine-tuning smaller seq2seq or image-to-text models versus zero-shot prompting of GPT-3.5, using image captions versus end-to-end image input, and training on human data versus synthetic questions generated by GPT-3.5. The models were used to generate friendly conversation starters, which were evaluated on criteria including language fluency, visual grounding, interestingness, and politeness. Results show that GPT-3.5 generates more interesting and polite questions than smaller models fine-tuned on crowdsourced data, but vision-to-language models are better at referencing visual features and can mimic GPT-3.5's performance. This demonstrates the feasibility of deep visiolinguistic models for situated social agents, forming an important first stage in creating situated multimodal social interaction.
dc.description.wosFundingText: This work was supported in part by the Flemish Government (AI Research Program), in part by the Horizon Europe VALAWAI project under Grant 101070930, and in part by the European ROBotics and AI Network (euROBIN) under Grant 101070596.
dc.identifier.doi: 10.1109/TAFFC.2024.3428704
dc.identifier.issn: 1949-3045
dc.identifier.uri: https://imec-publications.be/handle/20.500.12860/45597
dc.publisher: IEEE-INST ELECTRICAL ELECTRONICS ENGINEERS INC
dc.source.beginpage: 223
dc.source.endpage: 236
dc.source.issue: 1
dc.source.journal: IEEE TRANSACTIONS ON AFFECTIVE COMPUTING
dc.source.numberofpages: 14
dc.source.volume: 16
dc.title: Integrating Visual Context Into Language Models for Situated Social Conversation Starters
dc.type: Journal article
dspace.entity.type: Publication
Files

Original bundle

DS798.pdf (2.29 MB, Adobe Portable Document Format): Published
DS798_acc.pdf (3.43 MB, Adobe Portable Document Format): Accepted