Measuring CLEVRness
Blackbox testing of Visual Reasoning Models
Abstract
How can we measure the reasoning capabilities of intelligent systems? Arguably, visual question answering provides a convenient framework. However, despite scores of visual QA datasets and architectures, some of which even reach super-human performance, whether those architectures can actually reason remains open to debate. To answer this, we extend the visual question answering framework and propose a behavioral test in the form of a two-player game. We consider black-box neural models trained on CLEVR, a diagnostic dataset for benchmarking reasoning. Next, we train an adversarial player that re-configures the scene to fool those models. We show that otherwise human-level performers can easily be fooled. Our results cast doubt on whether data-driven approaches can reason without exploiting the numerous biases that are often present in such datasets.
Authors:
Spyridon Mouselinos (University of Warsaw)
Henryk Michalewski (Google, University of Warsaw)
Mateusz Malinowski (DeepMind)
To appear in ICLR 2022
Motivation
Imagine the following scenario:
A robotic arm identifies and picks objects in a warehouse. Its perceptual system is based on a neural network trained on a set of images. What if a different placement of the objects changes the system's reasoning in an unexpected and harmful way?
In environments where robots cooperate with humans, such incidents may cause malfunctions or even accidents.
Can we test our networks against such cases?
Dataset
For our experiments, we choose CLEVR, a well-studied visual reasoning dataset. This choice is justified for two reasons:
On the surface, today’s models achieve almost perfect performance despite the dataset’s complexity, even surpassing humans. However, can we trust results obtained on static datasets? What if these models perform much worse in more realistic situations?
Its synthetic nature gives us full control of the scene generation process, providing an excellent sandbox for our experiments.
Method
From Static
In the typical (static) visual question answering format, a model is trained on image-question pairs.
Performance is measured by its accuracy on a fixed, held-out set of such pairs, as in the sketch below.
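As a rough illustration of this static protocol, evaluation reduces to scoring a fixed test set; the `vqa_model` and `test_set` interfaces below are hypothetical placeholders, not the benchmark's actual code.

```python
# Minimal sketch of static VQA evaluation (illustrative placeholders only).
def static_accuracy(vqa_model, test_set):
    """Fraction of fixed (image, question, answer) triples answered correctly."""
    correct = 0
    for image, question, answer in test_set:
        prediction = vqa_model.answer(image, question)
        correct += int(prediction == answer)
    return correct / len(test_set)
```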
To Dynamic
In our proposed (dynamic) setup, a pre-trained model (the VQA Model) is given image-question pairs produced by an Adversarial Agent. The agent manipulates the scenes to fool the VQA model and thereby pushes its visual system and reasoning capabilities to their limits. Owing to this dynamic setting, we can measure the reasoning gaps of various models.
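A minimal sketch of the dynamic protocol is given below; the `agent`, `renderer`, and `oracle` interfaces are assumptions made for illustration and do not reflect the actual implementation.

```python
# Illustrative sketch of the dynamic (adversarial) evaluation loop.
# `vqa_model`, `agent`, `renderer`, and `oracle` are assumed interfaces.
def dynamic_accuracy(vqa_model, agent, renderer, oracle, test_set):
    """Accuracy after the agent re-configures each scene to fool the model."""

    def query(scene, question):
        # Black-box access: the agent may only ask questions and read answers.
        return vqa_model.answer(renderer.render(scene), question)

    correct = 0
    for scene, question, _ in test_set:
        perturbed = agent.perturb(scene, question, query)  # propose 3D edits
        truth = oracle(perturbed, question)                # recompute ground truth
        prediction = query(perturbed, question)
        correct += int(prediction == truth)
    return correct / len(test_set)
```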
The Agent
Our agent receives an image and a question, and then suggests 3D manipulations that are likely to fool the VQA Model. The suggestions are validated so that they do not cause visual ambiguities such as heavy occlusions.
The agent has limited access to the model under test. It communicates with the model only through questions, answers, and the scene. In particular, it has no access to the model's gradients, nor even to its visual system.
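This narrow channel can be pictured as the interface sketched below; the class and function names are hypothetical and only indicate what information the agent may, and may not, use.

```python
# Sketch of the black-box channel between the agent and the VQA model.
# All names are illustrative assumptions, not the paper's actual code.
class BlackBoxVQA:
    """Exposes a VQA model through question answering only."""

    def __init__(self, vqa_model, renderer):
        self._model = vqa_model      # hidden: weights, gradients, visual features
        self._renderer = renderer

    def query(self, scene, question):
        """The only way the agent can probe the model."""
        image = self._renderer.render(scene)
        return self._model.answer(image, question)


def propose_scene_edit(agent, black_box, scene, question, is_valid):
    """The agent suggests a 3D manipulation; configurations that would be
    visually ambiguous (e.g. heavily occluded) are rejected."""
    candidate = agent.suggest_edit(scene, question, black_box)
    return candidate if is_valid(candidate) else scene
```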
Results
Even though VQA models can achieve almost perfect scores on the CLEVR dataset, their performance quickly degrades under our dynamic setting. These results support our hypothesis that the models still have poor reasoning capabilities.
The plot on the left shows the performance degradation of various state-of-the-art CLEVR models under our dynamic setting. The performance drop is significant, ranging from 5% to 15% on average.
Examples
Original scene
Q: What number of cubes are both behind the purple rubber sphere and to the right of the gray cylinder?
Model: 0 ✔️
After the agent's manipulation
Q: What number of cubes are both behind the purple rubber sphere and to the right of the gray cylinder?
Model: 1 ❌

Original scene
Q: Are there any other things the same size as the ball?
Model: No ✔️
After the agent's manipulation
Q: Are there any other things the same size as the ball?
Model: Yes ❌