DailyDialog++ is an open-domain dialogue evaluation dataset consisting of 19k contexts, each with five relevant responses. For 11k of these contexts, it additionally includes five adversarial irrelevant responses, specifically crafted to have lexical or semantic overlap with the context while still being unacceptable as valid responses. We hope that DailyDialog++ will be a useful resource for better training and more robust evaluation of dialogue evaluation metrics. DailyDialog++ is introduced in our TACL 2020 paper.
Download the dataset of 11k contexts with 5 relevant and 5 adversarial irrelevant responses per context.
We propose Dialogue Evaluation with BERT (DEB), a new BERT-based evaluation metric pretrained on a massive corpus of 727 million Reddit conversations. DEB significantly outperforms existing models, showing better correlation with human judgements and better performance on random negatives. Check out the code for DEB and other dialogue evaluation baselines such as RUBER and ADEM.
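For intuition, here is a minimal sketch of DEB-style scoring with a BERT next-sentence-prediction head, assuming the HuggingFace `transformers` API. The public `bert-base-uncased` checkpoint is used as a stand-in for the Reddit-pretrained DEB weights; this is an illustration of the scoring idea, not the released implementation.

```python
# Sketch: score a (context, response) pair with BERT's next-sentence-prediction head.
# Assumption: bert-base-uncased is used here as a stand-in for the DEB checkpoint.
import torch
from transformers import BertTokenizer, BertForNextSentencePrediction

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForNextSentencePrediction.from_pretrained("bert-base-uncased")
model.eval()

def deb_style_score(context: str, response: str) -> float:
    """Return P(response follows context) from the NSP head."""
    inputs = tokenizer(context, response, return_tensors="pt", truncation=True)
    with torch.no_grad():
        logits = model(**inputs).logits  # shape (1, 2): [is_next, not_next]
    probs = torch.softmax(logits, dim=-1)
    return probs[0, 0].item()  # index 0 = "response is a valid continuation"

print(deb_style_score("Do you want to grab lunch?", "Sure, how about the new cafe?"))
```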
Ask us questions at makashkumar99@gmail.com and ananya@cse.iitm.ac.in.
DailyDialog++ evaluates the ability of dialogue evaluation metrics to distinguish between relevant and irrelevant responses to a given context. Here, we report the test performance of various metrics in differentiating between relevant and adversarial irrelevant responses when trained with random negative responses (random utterances sampled from other contexts). We use accuracy and the point-biserial correlation (PBC) coefficient as performance measures. We observe that the performance of all metrics, including DEB, is poor when evaluated on the adversarial examples from our dataset.
Rank | Date | Model | Accuracy (%) | PBC |
---|---|---|---|---|
1 | Sep 13, 2020 | RUBER-Large, Peking University (Tao et al., 2017) | 68.92 | 0.42 |
2 | Sep 13, 2020 | DEB, IIT Madras (Sai et al., 2020) | 66.78 | 0.39 |
3 | Sep 13, 2020 | RUBER, Peking University (Tao et al., 2017) | 65 | 0.35 |
4 | Sep 13, 2020 | ADEM, McGill University (Lowe et al., 2017) | 64.43 | 0.37 |
5 | Sep 13, 2020 | BERT+DNN, University of Southern California (Ghazarian et al., 2019) | 60.14 | 0.29 |
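For clarity, the sketch below shows how accuracy and PBC can be computed from a metric's scores and the binary relevant/irrelevant labels, assuming `scipy` and a 0.5 decision threshold; the scores and labels shown are illustrative, not taken from the leaderboard.

```python
# Sketch: accuracy and point-biserial correlation (PBC) between metric scores
# and binary labels. The data and threshold below are illustrative assumptions.
import numpy as np
from scipy.stats import pointbiserialr

labels = np.array([1, 1, 1, 0, 0, 0])                     # 1 = relevant, 0 = adversarial irrelevant
scores = np.array([0.91, 0.75, 0.62, 0.58, 0.30, 0.12])   # metric scores for each response

threshold = 0.5                                            # assumed decision threshold
accuracy = np.mean((scores >= threshold).astype(int) == labels)
pbc, p_value = pointbiserialr(labels, scores)

print(f"Accuracy: {accuracy:.2f}, PBC: {pbc:.2f} (p={p_value:.3f})")
```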
Here, we report the test performance of various metrics in differentiating between relevant and adversarial irrelevant responses when trained with both random and adversarial irrelevant responses. We observe that DEB significantly outperforms all other metrics.
Rank | Date | Model | Accuracy (%) | PBC |
---|---|---|---|---|
1 | Sep 13, 2020 | DEB, IIT Madras (Sai et al., 2020) | 92.66 | 0.86 |
2 | Sep 13, 2020 | BERT+DNN, University of Southern California (Ghazarian et al., 2019) | 86.61 | 0.79 |
3 | Sep 13, 2020 | RUBER-Large, Peking University (Tao et al., 2017) | 86.52 | 0.78 |
4 | Sep 13, 2020 | RUBER, Peking University (Tao et al., 2017) | 83.81 | 0.74 |
5 | Sep 13, 2020 | ADEM, McGill University (Lowe et al., 2017) | 66.62 | 0.41 |
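As a rough illustration of the training setup above, the sketch below assembles (context, response, label) pairs that mix relevant responses with both adversarial and random negatives; the field names are hypothetical and may differ from the released DailyDialog++ files.

```python
# Sketch: building training pairs with both random and adversarial negatives.
# Assumption: the example dict uses hypothetical field names; adapt to the actual release.
import random

def build_training_pairs(example, other_context_responses):
    context = " ".join(example["context"])
    pairs = []
    for resp in example["positive_responses"]:
        pairs.append((context, resp, 1))                    # relevant response
    for resp in example["adversarial_negative_responses"]:
        pairs.append((context, resp, 0))                    # crafted adversarial irrelevant response
    for resp in random.sample(other_context_responses, k=5):
        pairs.append((context, resp, 0))                    # random negative from other contexts
    return pairs
```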