Title: Integrating Visual Context Into Language Models for Situated Social Conversation Starters
Authors: Janssens, Ruben; Wolfert, Pieter; Demeester, Thomas; Belpaeme, Tony
Dates: 2025-05-05; 2025-05-03; 2025-05-05
Year: 2025
ISSN: 1949-3045
Type: Journal article
DOI: 10.1109/TAFFC.2024.3428704
Web of Science ID: WOS:001470259700022
Handle: https://imec-publications.be/handle/20.500.12860/45597

Abstract: Embodied conversational agents that interact socially with people in the physical world require multi-modal capabilities, such as responding appropriately to visual features of users. While existing vision-and-language models can generate language based on visual input, this language is not situated in a social interaction in the physical world. We present a novel task called Visual Conversation Starters, where an agent generates a conversation-starting question referring to features visible in an image of the user. We collect a dataset of 4000 images of people with 12000 crowdsourced conversation starters and compare various model architectures: fine-tuning smaller seq2seq or image-to-text models versus zero-shot prompting of GPT-3.5, using image captions versus end-to-end image input, and training on human data versus synthetic questions generated by GPT-3.5. The models were used to generate friendly conversation starters, which were evaluated on criteria including language fluency, visual grounding, interestingness, and politeness. Results show that GPT-3.5 generates more interesting and polite questions than smaller models fine-tuned on crowdsourced data, but vision-to-language models are better at referencing visual features and can mimic GPT-3.5's performance. This demonstrates the feasibility of deep visiolinguistic models for situated social agents, forming an important first stage in creating situated multimodal social interaction.
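
As a rough illustration of the caption-based, zero-shot prompting condition compared in the abstract, the minimal sketch below captions a user image with an off-the-shelf image-captioning model and then prompts GPT-3.5 for a conversation-starting question. The model names, prompt wording, and helper functions are illustrative assumptions, not taken from the paper or its dataset.

```python
# Hypothetical sketch: caption an image of the user, then zero-shot prompt GPT-3.5
# to turn the caption into a friendly, visually grounded conversation starter.
from transformers import BlipProcessor, BlipForConditionalGeneration
from PIL import Image
from openai import OpenAI


def caption_image(image_path: str) -> str:
    """Generate a short caption for the user image with an off-the-shelf BLIP model."""
    processor = BlipProcessor.from_pretrained("Salesforce/blip-image-captioning-base")
    model = BlipForConditionalGeneration.from_pretrained("Salesforce/blip-image-captioning-base")
    image = Image.open(image_path).convert("RGB")
    inputs = processor(images=image, return_tensors="pt")
    output_ids = model.generate(**inputs, max_new_tokens=30)
    return processor.decode(output_ids[0], skip_special_tokens=True)


def conversation_starter(caption: str) -> str:
    """Zero-shot prompt GPT-3.5 for one polite, interesting question grounded in the caption."""
    client = OpenAI()  # expects OPENAI_API_KEY in the environment
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[
            {"role": "system",
             "content": "You are a friendly social robot meeting a person for the first time."},
            {"role": "user",
             "content": f"The person in front of you looks like this: '{caption}'. "
                        "Ask one polite, interesting conversation-starting question "
                        "that refers to something visible about them."},
        ],
    )
    return response.choices[0].message.content


if __name__ == "__main__":
    caption = caption_image("user_photo.jpg")  # placeholder path
    print(conversation_starter(caption))
```

The end-to-end condition described in the abstract would instead fine-tune an image-to-text model directly on the image-question pairs, skipping the intermediate caption.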