DailyDialog++ is an open-domain dialogue evaluation dataset consisting of 19k contexts, each with five relevant responses. For 11k of these contexts, it additionally includes five adversarial irrelevant responses, specifically crafted to have lexical or semantic overlap with the context while still being unacceptable as valid responses. We hope that DailyDialog++ will be a useful resource for better training and more robust evaluation of dialogue evaluation metrics. DailyDialog++ is introduced in our TACL 2020 paper.
Download the dataset of 11k contexts with 5 relevant and 5 adversarial irrelevant responses per context.
We propose Dialogue Evaluation with BERT (DEB), a new BERT-based evaluation metric pretrained on a massive corpus of 727 million Reddit conversations. DEB significantly outperforms existing models, showing better correlation with human judgements and better performance on random negatives. Check out the code for DEB and other dialogue evaluation baselines such as RUBER and ADEM.
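For intuition, here is a minimal sketch of DEB-style scoring with a BERT next-sentence-prediction head, assuming the HuggingFace `transformers` API. The public `bert-base-uncased` checkpoint is used as a stand-in for the Reddit-pretrained DEB weights; this is an illustration of the scoring idea, not the released implementation.

```python
# Sketch: score a (context, response) pair with BERT's next-sentence-prediction head.
# Assumption: bert-base-uncased is used here as a stand-in for the DEB checkpoint.
import torch
from transformers import BertTokenizer, BertForNextSentencePrediction

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForNextSentencePrediction.from_pretrained("bert-base-uncased")
model.eval()

def deb_style_score(context: str, response: str) -> float:
    """Return P(response follows context) from the NSP head."""
    inputs = tokenizer(context, response, return_tensors="pt", truncation=True)
    with torch.no_grad():
        logits = model(**inputs).logits  # shape (1, 2): [is_next, not_next]
    probs = torch.softmax(logits, dim=-1)
    return probs[0, 0].item()  # index 0 = "response is a valid continuation"

print(deb_style_score("Do you want to grab lunch?", "Sure, how about the new cafe?"))
```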
Ask us questions at makashkumar99@gmail.com and ananya@cse.iitm.ac.in.
DailyDialog++ evaluates the ability of dialogue evaluation metrics to distinguish between relevant and irrelevant responses to a given context. Here, we report the test performance of various metrics in differentiating between relevant and adversarial irrelevant responses when trained with random negative responses (random utterances sampled from other contexts). We use accuracy and the point-biserial correlation (PBC) coefficient as performance measures. We observe that the performance of all metrics, including DEB, is poor when evaluated on the adversarial examples from our dataset.
Rank | Date | Model | Accuracy (%) | PBC |
---|---|---|---|---|
1 | Sep 13, 2020 | RUBER-Large, Peking University (Tao et al., 2017) | 68.92 | 0.42 |
2 | Sep 13, 2020 | DEB, IIT Madras (Sai et al., 2020) | 66.78 | 0.39 |
3 | Sep 13, 2020 | RUBER, Peking University (Tao et al., 2017) | 65 | 0.35 |
4 | Sep 13, 2020 | ADEM, McGill University (Lowe et al., 2017) | 64.43 | 0.37 |
5 | Sep 13, 2020 | BERT+DNN, University of Southern California (Ghazarian et al., 2019) | 60.14 | 0.29 |
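For clarity, the sketch below shows how accuracy and PBC can be computed from a metric's scores and the binary relevant/irrelevant labels, assuming `scipy` and a 0.5 decision threshold; the scores and labels shown are illustrative, not taken from the leaderboard.

```python
# Sketch: accuracy and point-biserial correlation (PBC) between metric scores
# and binary labels. The data and threshold below are illustrative assumptions.
import numpy as np
from scipy.stats import pointbiserialr

labels = np.array([1, 1, 1, 0, 0, 0])                     # 1 = relevant, 0 = adversarial irrelevant
scores = np.array([0.91, 0.75, 0.62, 0.58, 0.30, 0.12])   # metric scores for each response

threshold = 0.5                                            # assumed decision threshold
accuracy = np.mean((scores >= threshold).astype(int) == labels)
pbc, p_value = pointbiserialr(labels, scores)

print(f"Accuracy: {accuracy:.2f}, PBC: {pbc:.2f} (p={p_value:.3f})")
```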
Here, we report the test performance of various metrics in differentiating between relevant and adversarial irrelevant responses when trained with both random and adversarial irrelevant responses. We observe that DEB significantly outperforms all other metrics.
Rank | Date | Model | Accuracy (%) | PBC |
---|---|---|---|---|
1 | Sep 13, 2020 | DEB, IIT Madras (Sai et al., 2020) | 92.66 | 0.86 |
2 | Sep 13, 2020 | BERT+DNN, University of Southern California (Ghazarian et al., 2019) | 86.61 | 0.79 |
3 | Sep 13, 2020 | RUBER-Large, Peking University (Tao et al., 2017) | 86.52 | 0.78 |
4 | Sep 13, 2020 | RUBER, Peking University (Tao et al., 2017) | 83.81 | 0.74 |
5 | Sep 13, 2020 | ADEM, McGill University (Lowe et al., 2017) | 66.62 | 0.41 |
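As a rough illustration of the training setup above, the sketch below assembles (context, response, label) pairs that mix relevant responses with both adversarial and random negatives; the field names are hypothetical and may differ from the released DailyDialog++ files.

```python
# Sketch: building training pairs with both random and adversarial negatives.
# Assumption: the example dict uses hypothetical field names; adapt to the actual release.
import random

def build_training_pairs(example, other_context_responses):
    context = " ".join(example["context"])
    pairs = []
    for resp in example["positive_responses"]:
        pairs.append((context, resp, 1))                    # relevant response
    for resp in example["adversarial_negative_responses"]:
        pairs.append((context, resp, 0))                    # crafted adversarial irrelevant response
    for resp in random.sample(other_context_responses, k=5):
        pairs.append((context, resp, 0))                    # random negative from other contexts
    return pairs
```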