Publication:
Integrating Visual Context Into Language Models for Situated Social Conversation Starters
| cris.virtual.department | #PLACEHOLDER_PARENT_METADATA_VALUE# | |
| cris.virtual.department | #PLACEHOLDER_PARENT_METADATA_VALUE# | |
| cris.virtual.department | #PLACEHOLDER_PARENT_METADATA_VALUE# | |
| cris.virtual.department | #PLACEHOLDER_PARENT_METADATA_VALUE# | |
| cris.virtual.orcid | 0000-0002-1790-9531 | |
| cris.virtual.orcid | 0000-0001-5207-7745 | |
| cris.virtual.orcid | 0000-0002-7420-7181 | |
| cris.virtual.orcid | 0000-0002-9901-5768 | |
| cris.virtualsource.department | 60910c8d-eace-48b6-8e4d-3c2fff94428a | |
| cris.virtualsource.department | 6c1aac4b-593e-4f80-9ecc-911fd20f3c31 | |
| cris.virtualsource.department | 8b8f09dc-aa33-45b8-9790-5f5b456aeda5 | |
| cris.virtualsource.department | df6c83d3-392b-4c86-82f0-1f3fadc2f1fd | |
| cris.virtualsource.orcid | 60910c8d-eace-48b6-8e4d-3c2fff94428a | |
| cris.virtualsource.orcid | 6c1aac4b-593e-4f80-9ecc-911fd20f3c31 | |
| cris.virtualsource.orcid | 8b8f09dc-aa33-45b8-9790-5f5b456aeda5 | |
| cris.virtualsource.orcid | df6c83d3-392b-4c86-82f0-1f3fadc2f1fd | |
| dc.contributor.author | Janssens, Ruben | |
| dc.contributor.author | Wolfert, Pieter | |
| dc.contributor.author | Demeester, Thomas | |
| dc.contributor.author | Belpaeme, Tony | |
| dc.contributor.imecauthor | Janssens, Ruben | |
| dc.contributor.imecauthor | Demeester, Thomas | |
| dc.contributor.imecauthor | Belpaeme, Tony | |
| dc.contributor.orcidimec | Janssens, Ruben::0000-0002-1790-9531 | |
| dc.contributor.orcidimec | Demeester, Thomas::0000-0002-9901-5768 | |
| dc.contributor.orcidimec | Belpaeme, Tony::0000-0001-5207-7745 | |
| dc.date.accessioned | 2025-05-05T10:11:13Z | |
| dc.date.available | 2025-05-03T05:31:09Z | |
| dc.date.available | 2025-05-05T10:11:13Z | |
| dc.date.issued | 2025 | |
| dc.description.abstract | Embodied conversational agents that interact socially with people in the physical world require multi-modal capabilities, such as responding appropriately to visual features of users. While existing vision-and-language models can generate language based on visual input, this language is not situated in a social interaction in the physical world. We present a novel task called Visual Conversation Starters, in which an agent generates a conversation-starting question referring to features visible in an image of the user. We collect a dataset of 4000 images of people with 12000 crowdsourced conversation starters and compare various model architectures: fine-tuning smaller seq2seq or image-to-text models versus zero-shot prompting of GPT-3.5, using image captions versus end-to-end image input, and training on human data versus synthetic questions generated by GPT-3.5. The models were used to generate friendly conversation starters, which were evaluated on criteria including language fluency, visual grounding, interestingness, and politeness. Results show that GPT-3.5 generates more interesting and polite questions than smaller models fine-tuned on crowdsourced data, but vision-to-language models are better at referencing visual features and can approach GPT-3.5's performance. This demonstrates the feasibility of deep visiolinguistic models for situated social agents, forming an important first step toward situated multimodal social interaction. | |
| dc.description.wosFundingText | This work was supported in part by Flemish Government (AI Research Program), in part by Horizon Europe VALAWAI project under Grant 101070930, and in part by European ROBotics and AI Network (euROBIN) under Grant 101070596. | |
| dc.identifier.doi | 10.1109/TAFFC.2024.3428704 | |
| dc.identifier.issn | 1949-3045 | |
| dc.identifier.uri | https://imec-publications.be/handle/20.500.12860/45597 | |
| dc.publisher | IEEE-INST ELECTRICAL ELECTRONICS ENGINEERS INC | |
| dc.source.beginpage | 223 | |
| dc.source.endpage | 236 | |
| dc.source.issue | 1 | |
| dc.source.journal | IEEE TRANSACTIONS ON AFFECTIVE COMPUTING | |
| dc.source.numberofpages | 14 | |
| dc.source.volume | 16 | |
| dc.title | Integrating Visual Context Into Language Models for Situated Social Conversation Starters | |
| dc.type | Journal article | |
| dspace.entity.type | Publication | |
| Files | ||
| Publication available in collections: |