Vision Language Models as Values Detectors

Abbo, Giulio Antonio; Belpaeme, Tony

doi:10.1007/978-3-031-85463-7_5


cris.virtual.orcid	0000-0001-6301-0028
cris.virtual.orcid	0000-0001-5207-7745
cris.virtualsource.department	ab1b156b-2cca-4ddc-bdb9-155273f95966
cris.virtualsource.department	6c1aac4b-593e-4f80-9ecc-911fd20f3c31
cris.virtualsource.orcid	ab1b156b-2cca-4ddc-bdb9-155273f95966
cris.virtualsource.orcid	6c1aac4b-593e-4f80-9ecc-911fd20f3c31
dc.contributor.author	Abbo, Giulio Antonio
dc.contributor.author	Belpaeme, Tony
dc.date.accessioned	2026-06-08T14:43:30Z
dc.date.available	2026-06-08T14:43:30Z
dc.date.createdwos	2025-09-07
dc.date.issued	2025
dc.description.abstract	Large Language Models integrating textual and visual inputs have introduced new possibilities for interpreting complex data. Despite their remarkable ability to generate coherent and contextually relevant text based on visual stimuli, the alignment of these models with human perception in identifying relevant elements in images requires further exploration. This paper investigates the alignment between state-of-the-art LLMs and human annotators in detecting elements of relevance within home environment scenarios. We created a set of twelve images depicting various domestic scenarios and enlisted fourteen annotators to identify the key element in each image. We then compared these human responses with outputs from five different LLMs, including GPT-4o and four LLaVA variants. Our findings reveal a varied degree of alignment, with LLaVA 34B showing the highest performance but still scoring low. However, an analysis of the results highlights the models’ potential to detect value-laden elements in images, suggesting that with improved training and refined prompts, LLMs could enhance applications in social robotics, assistive technologies, and human-computer interaction by providing deeper insights and more contextually relevant responses.
dc.description.wosFundingText	Funded by the Horizon Europe VALAWAI project (grant agreement number 101070930).
dc.identifier.doi	10.1007/978-3-031-85463-7_5
dc.identifier.isbn	978-3-031-85462-0
dc.identifier.issn	2945-9133
dc.identifier.uri	https://imec-publications.be/handle/20.500.12860/59642
dc.language.iso	eng
dc.provenance.editstepuser	greet.vanhoof@imec.be
dc.publisher	SPRINGER INTERNATIONAL PUBLISHING AG
dc.source.beginpage	76
dc.source.conference	Value Engineering in Artificial Intelligence, VALE
dc.source.conferencedate	2024-10-19
dc.source.conferencelocation	Santiago de Compostela
dc.source.endpage	86
dc.source.journal	VALUE ENGINEERING IN ARTIFICIAL INTELLIGENCE, VALE 2024
dc.source.numberofpages	11
dc.title	Vision Language Models as Values Detectors
dc.type	Proceedings paper
dspace.entity.type	Publication
imec.internal.crawledAt	2025-10-22
imec.internal.source	crawler
Publication available in collections:	Conference contributions
