The image depicts a man scratching his arm in a bedroom, with a mosquito on a nightstand near the bed. Therefore, the man is probably scratching his arm because of a mosquito bite.
This local is most likely Italian, based on the Colosseum in the background. He appears to be eating and pressing his finger into his cheek. In Italy, this gesture while eating usually means “buono”, that you find the food tasty. Therefore, he is most likely saying that the food is delicious.
An outdoor image of a Thai-style house with a big yard. In the yard there are grass and a big pool. In the far background there are Alpine mountains with visible snow on their tops.
Imagine observing someone scratching their arm; to understand why, additional context would be necessary. However, spotting a mosquito nearby would immediately offer a likely explanation for the person's discomfort, removing the need for further information. This example illustrates how subtle visual cues can challenge our cognitive skills and demonstrates the complexity of interpreting visual scenarios. To study these skills, we present Visual Riddles, a benchmark aimed at testing vision and language models on visual riddles that require commonsense and world knowledge. The benchmark comprises 400 visual riddles, each featuring a unique image created by one of a variety of text-to-image models, a question, a ground-truth answer, a textual hint, and an attribution. Human evaluation reveals that existing models lag significantly behind human performance, which stands at 82% accuracy, with Gemini-Pro-1.5 leading the models at 40% accuracy. Our benchmark comes with automatic evaluation tasks to make assessment scalable. These findings underscore the potential of Visual Riddles as a valuable resource for enhancing vision and language models' capabilities in interpreting complex visual scenarios.
The Visual Riddles Challenge tests vision-and-language models using visual riddles that incorporate common-sense reasoning with culturally rich and ambiguous scenarios.
Each riddle features a synthetic image created by experienced designers using advanced text-to-image models such as DALLE-3, Gemini-1.5, and Stable-Diffusion.
These images, designed to include subtle visual clues and cultural nuances, challenge models to integrate commonsense and world knowledge in order to solve the riddles.
Designers provide hints to guide the interpretation of visual clues, and attributions for riddles requiring specific knowledge. After rigorous peer review to ensure clarity and solvability,
each riddle is finalized with a detailed answer that explains the solution logically, based on the visual clues. The dataset aims to advance the capabilities of AI models in interpreting complex visual information.
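To make the structure of a riddle concrete, here is a minimal sketch of one item as a Python record. The field names are illustrative assumptions rather than the dataset's actual schema; the dataset card is authoritative.

```python
from dataclasses import dataclass
from typing import Any, Optional

@dataclass
class VisualRiddle:
    """One Visual Riddles item, mirroring the fields described above.
    Field names are illustrative assumptions, not the dataset's schema."""
    image: Any                         # synthetic image from a text-to-image model
    question: str                      # the riddle posed about the image
    answer: str                        # detailed ground-truth answer and its reasoning
    hint: Optional[str] = None         # textual hint pointing at the key visual clue
    attribution: Optional[str] = None  # source for riddles needing specific knowledge
```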
This study introduces three critical tasks within the Visual Riddles benchmark to evaluate vision-and-language models:
solving open-ended visual riddles, selecting the correct answer from multiple options, and assessing open-ended responses both with and without reference answers.
We also incorporate auxiliary information, such as textual hints and attributions, to enhance model accuracy.
For example, hints like 'Look at the colors of the fur' guide models to accurately infer a cat’s gender, leveraging knowledge that calico cats are predominantly female.
These tasks are designed to probe models' ability to integrate visual data with commonsense reasoning and to produce detailed justifications, while supporting scalable, automated evaluation.
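As a rough illustration of the classification-style setting, the sketch below formats a riddle's candidate answers as lettered options, optionally appends the hint, and scores the model's reply. The `answer_fn` wrapper, the dictionary keys, and the prompt wording are assumptions for illustration, not the benchmark's actual protocol.

```python
import string

def score_multiple_choice(riddles, answer_fn):
    """Accuracy on the multiple-choice task.

    `riddles` yields dicts with 'image', 'question', 'candidates' (a list of
    answer strings), 'label' (index of the correct candidate), and optionally
    'hint'. `answer_fn(image, prompt)` is a hypothetical wrapper around
    whichever vision-and-language model is under test; it returns the model's
    text reply.
    """
    correct, total = 0, 0
    for riddle in riddles:
        options = "\n".join(
            f"({letter}) {candidate}"
            for letter, candidate in zip(string.ascii_uppercase, riddle["candidates"])
        )
        prompt = f"{riddle['question']}\n{options}\n"
        if riddle.get("hint"):  # the hint-augmented setting
            prompt += f"Hint: {riddle['hint']}\n"
        prompt += "Answer with the letter of the correct option only."
        reply = answer_fn(riddle["image"], prompt).upper()
        # Naive parse: take the first letter the model emits as its choice.
        predicted = next((ch for ch in reply if ch in string.ascii_uppercase), None)
        correct += predicted == string.ascii_uppercase[riddle["label"]]
        total += 1
    return correct / total
```

The same loop covers both the plain and the hint-augmented settings, depending on whether the 'hint' field is populated.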
Our experiments evaluated several leading vision-and-language models, including LLaVA, Gemini-Pro,
InstructBLIP, and GPT-4, on various tasks within our benchmark.
The best model, Gemini-Pro-1.5, reached 40% accuracy, while humans achieved 82%. Even with auxiliary data such as human-generated captions, model performance improved only marginally.
The open-ended task tests models' ability to generate correct answers from visual cues alone.
The multiple-choice task shifts from generative to classification-based evaluation;
here, GPT-4 and Gemini-Pro-Vision achieved the highest accuracies, with slight improvements over the open-ended task.
Models also perform better when given hints, demonstrating the importance of auxiliary information in enhancing accuracy.
Gemini-Pro-1.5, identified as the best automatic rater, scored higher in the reference-based evaluation setting.
The automatic evaluations also showed that models generally perform better with hints but struggle with attributions, highlighting ongoing challenges in reasoning over auxiliary data.
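A minimal sketch of such an automatic rater follows, assuming a strong judge model is queried with the image plus a textual prompt; the rubric wording is an assumption, not the paper's actual prompt.

```python
def autorater_prompt(question, candidate, reference=None):
    """Build a prompt asking a strong judge model (e.g. Gemini-Pro-1.5)
    whether an open-ended answer to a riddle is correct. The wording is an
    illustrative assumption, not the paper's actual rubric; passing
    reference=None gives the reference-free variant."""
    parts = [
        "You are grading an answer to a visual riddle about the attached image.",
        f"Question: {question}",
        f"Candidate answer: {candidate}",
    ]
    if reference is not None:
        parts.append(f"Reference answer: {reference}")
    parts.append("Reply with exactly one word: correct or incorrect.")
    return "\n".join(parts)
```

Constraining the judge to a one-word verdict keeps its output trivially parseable, so agreement with human judgments can be computed directly; omitting the reference argument yields the reference-free variant.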
@misc{bittonguetta2024visualriddlescommonsenseworld,
      title={Visual Riddles: a Commonsense and World Knowledge Challenge for Large Vision and Language Models},
      author={Nitzan Bitton-Guetta and Aviv Slobodkin and Aviya Maimon and Eliya Habba and Royi Rassin and Yonatan Bitton and Idan Szpektor and Amir Globerson and Yuval Elovici},
      year={2024},
      eprint={2407.19474},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2407.19474},
}