PlotQA is a VQA dataset with 28.9 million question-answer pairs grounded over 224,377 plots, with data drawn from real-world sources and questions based on crowd-sourced question templates.
Existing synthetic datasets (FigureQA, DVQA) for reasoning over plots do not contain variability in data labels, real-valued data, or complex reasoning questions. Consequently, models proposed for these datasets do not fully address the challenge of reasoning over plots. In particular, they assume that the answer comes either from a small fixed-size vocabulary or from a bounding box within the image. In practice this is an unrealistic assumption, because many questions require reasoning and thus have real-valued answers that appear neither in a small fixed-size vocabulary nor in the image. In this work, we aim to bridge this gap between existing datasets and real-world plots by introducing PlotQA: 80.76% of the questions in PlotQA have out-of-vocabulary (OOV) answers, i.e., answers that do not come from a fixed vocabulary. For details, see the PlotQA paper (WACV 2020).
Download the dataset of 28.9 million question-answer pairs grounded over 224,377 plots.
Our proposed pipeline (VOES) consists of four subtasks: (i) detect all the elements in the plot (bars, legend names, tick labels, etc.), (ii) read the values of these elements, (iii) establish relationships between the plot elements, and (iv) reason over this structured data.
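The sketch below illustrates these four stages in Python. All function names and the toy data structures are hypothetical placeholders used for illustration under the description above; this is not the released VOES implementation.

```python
# A minimal, illustrative sketch of the four VOES-style stages described above.
# The stage functions are hypothetical placeholders (not the released code):
# a real system would use an object detector for (i) and OCR for (ii).

from typing import Dict, List


def detect_elements(plot_image) -> List[Dict]:
    """(i) Detect plot elements (bars, legend names, tick labels, ...) with bounding boxes."""
    # Placeholder: a real implementation would run an object detector here.
    return [{"role": "bar", "bbox": (10, 40, 30, 120), "series": "Brazil", "x": "2010"}]


def read_values(plot_image, elements: List[Dict]) -> List[Dict]:
    """(ii) Read the textual / numeric value of each detected element,
    e.g. via OCR or by interpolating bar height against the y-axis ticks."""
    for el in elements:
        el["value"] = 4.1  # placeholder value
    return elements


def build_table(elements: List[Dict]) -> Dict[str, Dict[str, float]]:
    """(iii) Relate elements to one another to form a structured semi-table."""
    table: Dict[str, Dict[str, float]] = {}
    for el in elements:
        table.setdefault(el["series"], {})[el["x"]] = el["value"]
    return table


def answer_question(table: Dict[str, Dict[str, float]], question: str) -> float:
    """(iv) Reason over the structured table; answers may be real values that
    appear neither in a fixed vocabulary nor in the plot image."""
    # Placeholder reasoning: average of the first series.
    series = next(iter(table.values()))
    return sum(series.values()) / len(series)


def answer_plot_question(plot_image, question: str) -> float:
    elements = detect_elements(plot_image)
    elements = read_values(plot_image, elements)
    table = build_table(elements)
    return answer_question(table, question)
```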
Ask us questions at nmethani@cse.iitm.ac.in and prithag@cse.iitm.ac.in.
We thank SQuAD for allowing us to use their code to create this website.
To assess the difficulty of the PlotQA dataset, we report human accuracy on a small subset of the Test split of the dataset. We also evaluate three state-of-the-art models on PlotQA and observe that our proposed hybrid model significantly outperforms the existing models, achieving an aggregate accuracy of 22.52% on the PlotQA dataset. This accuracy is nonetheless far below human performance, which establishes that the dataset is challenging and raises open questions on models for visual reasoning.
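As a rough illustration of how aggregate accuracy can be computed on such a split, here is a minimal sketch. The exact-match rule for textual answers and the 5% relative-error tolerance for numeric answers are assumptions made for this sketch; please refer to the PlotQA paper for the exact evaluation protocol.

```python
# Sketch only: assumes exact match for textual answers and a 5% relative-error
# tolerance for numeric answers; not the official evaluation script.


def is_correct(predicted: str, gold: str, tol: float = 0.05) -> bool:
    try:
        p, g = float(predicted), float(gold)
        # Numeric answers: allow a small relative error around the gold value.
        return abs(p - g) <= tol * abs(g) if g != 0 else p == g
    except ValueError:
        # Textual answers (e.g. legend names): require an exact string match.
        return predicted.strip().lower() == gold.strip().lower()


def aggregate_accuracy(predictions, gold_answers) -> float:
    correct = sum(is_correct(p, g) for p, g in zip(predictions, gold_answers))
    return 100.0 * correct / len(gold_answers)
```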
Rank | Model | Accuracy (%) |
---|---|---|
– | Human Baseline, IIT Madras | 80.47 |
1 (March, 2020) | Hybrid Model, IIT Madras (Methani et al., 2020) | 22.52 |
2 (March, 2020) | VOES, IIT Madras (Methani et al., 2020) | 18.46 |
3 (March, 2020) | SAN, Carnegie Mellon University (Yang et al., 2016) | 7.76 |
We evaluate our model on the test set of DVQA. Our proposed hybrid model performs better than the existing models (SAN and SANDY-OCR), establishing a new state-of-the-art result on DVQA.
Rank | Model | Accuracy (%) |
---|---|---|
1 (March, 2020) | Hybrid Model, IIT Madras (Methani et al., 2020) | 57.99 |
2 (March, 2020) | SANDY-OCR, Rochester Institute of Technology (Kafle et al., 2018) | 45.77 |
3 (March, 2020) | SAN, Carnegie Mellon University (Yang et al., 2016) | 32.10 |