Visual Riddles Benchmark

Visual Riddles: a Commonsense and World Knowledge Challenge for Large Vision and Language Models

Ben Gurion University of the Negev, Bar-Ilan University,
The Hebrew University of Jerusalem, Google Research, Tel Aviv University

Why is he doing this?

Look at the nightstand

The image depicts a man in a bedroom scratching his arm, with a mosquito on the nightstand near the bed. Therefore, the man is probably scratching his arm because of a mosquito bite.

What is this local doing?

Look at his cheek

This local is most likely Italian, based on the Colosseum in the background. He appears to be eating while pressing his finger to his cheek. In Italy, making this gesture while eating usually means “buono”: you find the food tasty. Therefore, he is most likely saying that the food is delicious.

Sara is a resort owner in Krabi, Thailand. Could this be her resort?

Look at the mountains

An outdoor image of a Thai-style house with a big yard. In the yard there is grass and a big pool. In the far background there are Alpine mountains with visible snow on their tops.

Abstract

Imagine observing someone scratching their arm; to understand why, additional context would be necessary. However, spotting a mosquito nearby would immediately offer a likely explanation for the person's discomfort, thereby alleviating the need for further information. This example illustrates how subtle visual cues can challenge our cognitive skills and demonstrates the complexity of interpreting visual scenarios. To study these skills, we present Visual Riddles, a benchmark designed to test vision and language models on visual riddles requiring commonsense and world knowledge. The benchmark comprises 400 visual riddles, each featuring a unique image created by one of a variety of text-to-image models, along with a question, a ground-truth answer, a textual hint, and an attribution. Human evaluation reveals that existing models lag significantly behind human performance, which is at 82% accuracy, with Gemini-Pro-1.5 leading at 40% accuracy. Our benchmark comes with automatic evaluation tasks to make assessment scalable. These findings underscore the potential of Visual Riddles as a valuable resource for enhancing vision and language models' capabilities in interpreting complex visual scenarios.

Data Collection

The Visual Riddles Challenge tests vision-and-language models using visual riddles that combine commonsense reasoning with culturally rich and ambiguous scenarios. Each riddle features a synthetic image created by experienced designers using advanced text-to-image models like DALL-E 3, Gemini-1.5 and Stable-Diffusion. These images are designed to include subtle visual clues and cultural nuances, so solving a riddle requires a model to integrate commonsense and world knowledge. Designers provide hints that guide the interpretation of the visual clues, and attributions for riddles that require specific knowledge. After rigorous peer review to ensure clarity and solvability, each riddle is finalized with a detailed answer that explains the solution logically from the visual clues. The dataset aims to advance the capabilities of AI models in interpreting complex visual information.
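The components of a riddle described above can be pictured as a single record. The following is a minimal sketch only: the class name `VisualRiddle`, its field names, and the image path are illustrative assumptions, not the schema of the released dataset.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class VisualRiddle:
    """One riddle record (hypothetical field names, not the official schema)."""

    image_path: str                     # synthetic image from a text-to-image model
    question: str                       # the riddle posed about the image
    answer: str                         # ground-truth answer with its reasoning
    hint: str                           # textual hint pointing at the visual clue
    attribution: Optional[str] = None   # source, only for riddles needing specific knowledge

    def needs_attribution(self) -> bool:
        """True when the riddle relies on specific (e.g. cultural) knowledge."""
        return self.attribution is not None


# Example built from the mosquito riddle shown on this page.
# The image path is a placeholder, not a real file in the dataset.
example = VisualRiddle(
    image_path="images/mosquito.png",
    question="Why is he doing this?",
    answer="The man is probably scratching his arm because of a mosquito bite.",
    hint="Look at the nightstand",
)
```

Making the attribution optional mirrors the description above, in which attributions are supplied only for riddles that require specific knowledge.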

Visual Riddles Benchmark

This study introduces three critical tasks within the Visual Riddles benchmark to evaluate vision-and-language models: solving open-ended visual riddles, selecting the correct answer from multiple options, and assessing open-ended responses both with and without reference answers. We also incorporate auxiliary information, such as textual hints and attributions, to enhance model accuracy. For example, hints like 'Look at the colors of the fur' guide models to accurately infer a cat’s gender, leveraging knowledge that calico cats are predominantly female. These tasks are designed to enhance model capabilities in integrating visual data with commonsense reasoning and detailed justifications, supporting scalable and automated evaluations.
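To illustrate how auxiliary information might be supplied alongside the question, here is a minimal sketch of a text prompt builder. The function name `build_prompt` and the prompt layout are assumptions made for illustration; the exact prompts used in the benchmark are not given on this page, and the image would be passed to the model separately.

```python
from typing import Optional


def build_prompt(
    question: str,
    hint: Optional[str] = None,
    attribution: Optional[str] = None,
) -> str:
    """Assemble the text part of a query, adding auxiliary fields when present.

    The layout below is a hypothetical template, not the benchmark's own.
    """
    parts = [f"Question: {question}"]
    if hint:
        parts.append(f"Hint: {hint}")
    if attribution:
        parts.append(f"Attribution: {attribution}")
    parts.append("Answer:")
    return "\n".join(parts)


# Same riddle with and without the hint.
plain = build_prompt("Why is he doing this?")
hinted = build_prompt("Why is he doing this?", hint="Look at the nightstand")
print(hinted)
```

Keeping the hint and attribution as optional arguments makes it straightforward to run the same riddle under each condition and compare the results.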

Experiments - Open-ended VQA

Our experiments evaluated several leading vision-and-language models, including LLaVA, Gemini-Pro, InstructBLIP, and GPT-4, on the tasks in our benchmark. The open-ended task tests a model's ability to generate a correct answer from visual cues alone. The best model, Gemini-Pro-1.5, reached 40% accuracy, compared with 82% for humans. Even with auxiliary data such as human-generated captions, model performance improved only marginally.

Experiments - Multiple-choice VQA

This task shifts from generative to classification-based evaluation. GPT-4 and Gemini-Pro-Vision showed the highest accuracies, with slight improvements over open-ended tasks. Models perform better with hints, demonstrating the importance of auxiliary information in enhancing accuracy.
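Because the multiple-choice setting reduces to classification, it can be scored by exact match on the chosen option. The sketch below assumes option letters as labels; the function name `multiple_choice_accuracy` and the toy predictions are invented for illustration and are not results from the benchmark.

```python
from typing import Sequence


def multiple_choice_accuracy(predicted: Sequence[str], gold: Sequence[str]) -> float:
    """Fraction of riddles where the chosen option matches the gold option."""
    if len(predicted) != len(gold):
        raise ValueError("predicted and gold must have the same length")
    if not gold:
        raise ValueError("cannot compute accuracy on an empty set")
    correct = sum(
        p.strip().upper() == g.strip().upper() for p, g in zip(predicted, gold)
    )
    return correct / len(gold)


# Toy data only: four riddles answered without and with hints.
toy_gold = ["A", "C", "B", "D"]
toy_no_hint = ["A", "B", "B", "A"]
toy_with_hint = ["a", "C", "B", "A"]

acc_no_hint = multiple_choice_accuracy(toy_no_hint, toy_gold)
acc_with_hint = multiple_choice_accuracy(toy_with_hint, toy_gold)
print(f"no hint: {acc_no_hint:.2f}, with hint: {acc_with_hint:.2f}")
```

Normalizing case and whitespace before comparison avoids penalizing a model for answering "a" instead of "A".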

Experiments - Automatic Evaluation

Gemini-Pro-1.5, identified as the best auto-rater, scored higher in the reference-based setting, where a reference answer is available, than in the reference-free one. The results also show that models generally perform better with hints but struggle with attributions, highlighting ongoing challenges in model reasoning with auxiliary data.

BibTeX

@misc{bittonguetta2024visualriddlescommonsenseworld,
  title={Visual Riddles: a Commonsense and World Knowledge Challenge for Large Vision and Language Models},
  author={Nitzan Bitton-Guetta and Aviv Slobodkin and Aviya Maimon and Eliya Habba and Royi Rassin and Yonatan Bitton and Idan Szpektor and Amir Globerson and Yuval Elovici},
  year={2024},
  eprint={2407.19474},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
  url={https://arxiv.org/abs/2407.19474},
}