BECON: BERT With Evidence From CONceptNet for Commonsense Question Answering

Preamble

This blog post gives a brief report on the BECON system submitted to the CommonsenseQA (CQA) leaderboard in June 2019. At submission time, BECON ranked #3.

Motivation

The CommonsenseQA dataset was created by crowd workers based on the ConceptNet knowledge graph. Solving the task requires the model to have commonsense knowledge. Pretrained language models such as BERT achieve SOTA performance on the CQA dataset, which implies that language models trained on very large corpora may learn some commonsense implicitly. Given the availability of a large knowledge graph such as ConceptNet, which contains explicit commonsense knowledge, we would like to know whether this explicit form of commonsense knowledge can complement BERT, which learns commonsense only implicitly.

Dataset

The number of questions in each split is shown in the table below.

Train Dev Test
9741 1221 1140

A sample from the dataset is shown below.

{
    "answerKey": "B",
    "id": "70701f5d1d62e58d5c74e2e303bb4065",
    "question": {
        "choices": [
            {
                "label": "A",
                "text": "bunk"
            },
            {
                "label": "B",
                "text": "reading"
            },
            {
                "label": "C",
                "text": "think"
            },
            {
                "label": "D",
                "text": "fall asleep"
            },
            {
                "label": "E",
                "text": "meditate"
            }
        ],
        "stem": "What is someone doing if he or she is sitting quietly and his or her eyes are moving?"
    }
}

Evidence Finder

Each question has five candidate answers. According to our analysis, candidate answers are usually one or two words long. We can use the ConceptNet API (http://api.conceptnet.io/c/en/) to find all the information related to a word or phrase in the knowledge graph.

Example:

{
    "text": "meeting",
    "evidence": [
        "*Something you find at a meeting is notepad",
        "*Something you find at a meeting is an agenda",
        "*Something you find at a meeting is a group of people",
        "*Something you find at a meeting is discussion",
        "a stranger is for meeting",
        "appointment is related to meeting",
        "interview is related to meeting",
        "group meeting is a synonym of meeting",
        "rendezvous is a type of meeting",
        "*Something you find at a meeting is papers"
    ]
}
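
As a rough sketch of how such evidence can be pulled from ConceptNet, the Python snippet below queries the public /c/en/ endpoint and keeps each edge's surfaceText field (ConceptNet's natural-language rendering of the edge). The filtering and the limit value are illustrative assumptions, not the exact retrieval code used in BECON.

import requests

def fetch_evidence(phrase, limit=20):
    """Return human-readable evidence sentences from ConceptNet for a
    word or short phrase (an illustrative sketch, not BECON's exact code)."""
    term = phrase.strip().lower().replace(" ", "_")
    url = "http://api.conceptnet.io/c/en/{}?limit={}".format(term, limit)
    edges = requests.get(url).json().get("edges", [])
    evidence = []
    for edge in edges:
        # surfaceText looks like "[[a meeting]] is for [[discussing things]]";
        # strip the [[ ]] markers to get a plain sentence.
        text = edge.get("surfaceText")
        if text:
            evidence.append(text.replace("[[", "").replace("]]", ""))
    return evidence

print(fetch_evidence("meeting"))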

We expect that such evidence may help answer the question. The problem is that the evidence is very noisy. How do we extract the useful information? We would like to keep the evidence that is relevant to the question and discard the rest. We assume that at most one evidence sentence is helpful (i.e., zero or one), so we can first rank the evidence sentences and then use the top-ranked one (or none).

Evidence Ranker

The evidence ranker ranks the evidence sentences according to their relevance scores with respect to the question. We consider several very simple rankers: random selection, Jaccard word overlap, word2vec (w2v) similarity, and BERT-base/BERT-large NSP scores between the question and the evidence sentence.

To get a sense of how the rankers work, we use each ranker to rank all the evidence sentences of the five candidate answers, and the candidate answer with the top-ranked evidence sentence is chosen as the prediction.

This gives us a simple model without any training: select the choice with the highest evidence score.
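
As a concrete example of the simplest of these rankers, the sketch below scores each evidence sentence by Jaccard word overlap with the question and picks the candidate whose best evidence sentence scores highest. The BERT-based rankers follow the same scheme with an NSP-style relevance score in place of jaccard; the function names here are ours.

def jaccard(a, b):
    """Jaccard word overlap between two strings."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    return len(sa & sb) / len(sa | sb) if sa | sb else 0.0

def predict(question, choices):
    """choices: list of (label, answer_text, evidence_sentences).
    Select the choice whose top-ranked evidence sentence is most
    relevant to the question (a sketch of the no-training baseline)."""
    best_label, best_score = None, -1.0
    for label, _answer, evidence in choices:
        # Rank this candidate's evidence by relevance to the question.
        top = max((jaccard(question, ev) for ev in evidence), default=0.0)
        if top > best_score:
            best_label, best_score = label, top
    return best_label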

The accuracy (%) on the train and dev splits is shown in the table below (SANITY refers to the easier sanity-check variant of the dataset from the original paper).

Ranker train dev train_SANITY dev_SANITY
random 21.14 19.82 21.07 19.57
jaccard 23.12 22.44 44.43 41.28
w2v 26.05 23.91 48.73 47.01
bert-base 34.95 34.73 82.89 81.90
bert-large 36.50 36.86 84.41 82.88

For comparison, below are the test-split accuracies (%) reported in the original CommonsenseQA paper.

Models test test_SANITY
VECSIM+NUMBERBATCH 29.1 54.0
LM1B-REP 26.1 39.6
LM1B-CONCAT 25.3 37.4
VECSIM+GLOVE 22.3 26.8
BERT-LARGE 55.9 92.3
GPT 45.5 87.2
ESIM+ELMO 34.1 76.9
ESIM+GLOVE 32.8 79.1
QABILINEAR+GLOVE 31.5 74.8
ESIM+NUMBERBATCH 30.1 74.6
QABILINEAR+NUMBERBATCH 28.8 73.3
QACOMPARE+GLOVE 25.7 69.2
QACOMPARE+NUMBERBATCH 20.4 60.6
BIDAF++ 32.0 71.0
HUMAN 88.9 -

There are no dev results in the original paper, but if we assume the dev and test results are close, we can see that the BERT-large NSP ranker, which involves no training at all, is outperformed only by BERT-LARGE and GPT, both of which are fine-tuned on the CQA training set.

This encourages us to try another simple model without training: select the choice with the highest NSP score with respect to the question. Below are the results.

NextSentencePrediction Pretrained BERT Model

Model train dev train-SANITY dev-SANITY
BERT-base NSP 35.36 39.39 71.28 71.99
BERT-large NSP 38.41 40.38 73.54 73.14

These scores are still relatively high compared with the trained models, especially on the SANITY variant of the dataset. This may indicate that the contribution of the BERT model mainly comes from the pretraining phase.
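
For concreteness, a zero-shot NSP scorer along these lines can be sketched with the HuggingFace transformers library. Treating the answer as the candidate "next sentence" follows the description above; the checkpoint name and preprocessing details are assumptions, since the post does not spell them out.

import torch
from transformers import BertTokenizer, BertForNextSentencePrediction

tokenizer = BertTokenizer.from_pretrained("bert-large-uncased")
model = BertForNextSentencePrediction.from_pretrained("bert-large-uncased")
model.eval()

def nsp_score(question, answer):
    """Probability that `answer` is a plausible continuation of `question`
    under BERT's next-sentence-prediction head (no fine-tuning)."""
    inputs = tokenizer(question, answer, return_tensors="pt")
    with torch.no_grad():
        logits = model(**inputs).logits  # shape (1, 2)
    # Index 0 of the NSP head corresponds to "is the next sentence".
    return torch.softmax(logits, dim=-1)[0, 0].item()

def predict(question, choices):
    """choices: list of (label, answer_text); pick the highest NSP score."""
    return max(choices, key=lambda c: nsp_score(question, c[1]))[0]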

Results

Literature & Baseline

Leaderboard

Models test test-SANITY
KagNet 58.9 -
CoS-E 58.2 -
BECON (ours) 57.9 -
SGN-lite 57.1 -
BERT-large (Tel-Aviv U) 56.7 -
BERT-large 55.9 92.3
BERT-base (UCL) 53.0 -
GPT 45.5 87.2
ESIM+ELMo 34.1 76.9
ESIM+GloVe 32.8 79.1

Reproduced baselines

Models dev
BERT-base 57.6
BERT-large 63.4

Our Model: BECON

For each answer candidate, rank its evidence sentences and use the top-ranked one. The input format is:

[CLS] + Question + [SEP] + Answer + [SEP] + Evidence + [SEP]
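
One way to build such a three-segment input is to tokenize the pieces separately and concatenate them by hand, as in the sketch below. The segment-id assignment (0 for the question, 1 for answer and evidence) and the max_len value are our assumptions; the post only specifies the [SEP] layout.

from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-large-uncased")

def build_input(question, answer, evidence, max_len=128):
    """[CLS] question [SEP] answer [SEP] evidence [SEP]
    (segment ids 0/1/1 are an assumption; the post only gives the layout)."""
    q = tokenizer.tokenize(question)
    a = tokenizer.tokenize(answer)
    e = tokenizer.tokenize(evidence)
    tokens = ["[CLS]"] + q + ["[SEP]"] + a + ["[SEP]"] + e + ["[SEP]"]
    token_type_ids = [0] * (len(q) + 2) + [1] * (len(a) + len(e) + 2)
    input_ids = tokenizer.convert_tokens_to_ids(tokens)[:max_len]
    token_type_ids = token_type_ids[:max_len]
    attention_mask = [1] * len(input_ids)
    return input_ids, token_type_ids, attention_mask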

Pretrained model Ranker dev
BERT-base BERT-base 56.2
BERT-base BERT-large 57.6
BERT-large BERT-base 61.9
BERT-large BERT-large 62.2

The comparison between the BERT-base and BERT-large rankers shows that the BERT-large ranker is better. All later experiments use the BERT-large ranker.

Compared with our baseline, the result is a bit lower. This suggests that if we attach evidence to every answer candidate, the noise may still overwhelm the useful information.

Solution: encode both BERT(Question + Answer) and BERT(Question + Answer + Evidence), and then use the max, mean, or concatenation of the two encodings as the representation of the candidate answer.
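
A minimal sketch of such a dual-encoding scorer is shown below (PyTorch; the module and variable names are hypothetical, since the post does not give the exact architecture). Each candidate is scored from the [CLS] vectors of BERT(Question + Answer) and BERT(Question + Answer + Evidence), combined by max, mean, or concatenation; the five candidate logits would then be softmaxed for training.

import torch
import torch.nn as nn
from transformers import BertModel

class DualEvidenceScorer(nn.Module):
    """Score one answer candidate from two encodings: with and without
    evidence. Pooling is 'max', 'mean', or 'concat' (a sketch only)."""
    def __init__(self, model_name="bert-large-uncased", pooling="concat"):
        super().__init__()
        self.bert = BertModel.from_pretrained(model_name)
        self.pooling = pooling
        hidden = self.bert.config.hidden_size
        in_dim = 2 * hidden if pooling == "concat" else hidden
        self.scorer = nn.Linear(in_dim, 1)

    def forward(self, qa_inputs, qae_inputs):
        # [CLS] vectors of Question+Answer and Question+Answer+Evidence.
        h_qa = self.bert(**qa_inputs).last_hidden_state[:, 0]
        h_qae = self.bert(**qae_inputs).last_hidden_state[:, 0]
        if self.pooling == "max":
            h = torch.max(h_qa, h_qae)
        elif self.pooling == "mean":
            h = (h_qa + h_qae) / 2
        else:  # concatenation, no pooling
            h = torch.cat([h_qa, h_qae], dim=-1)
        return self.scorer(h)  # one logit per candidate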

Pretrained model Pooling dev
BERT-large max 63.6
BERT-large mean 64.0
BERT-large concat (no pooling) 64.4

The concatenation without pooling outperforms the BERT-large baseline on dev by 1.0%.

We also try another way to incorporate the evidence: rank the evidence sentences of all candidate answers together and use only the single top-ranked one. We expect the noise to be lower this way, since only one evidence sentence is used for the whole example.

BERT(Question + Evidence + Answer)

Model dev
BERT-base 58.3
BERT-large 62.8

It works on BERT-base (+0.7%), but not on BERT-large (-0.6%).

Summary

We use ConceptNet to search for evidence, use BERT to rank it, and use BERT as the base model trained with the evidence. To alleviate the noise introduced by the evidence, we use BERT to encode the input both with and without evidence and let the model learn which encoding contributes more. This model outperforms the BERT-large baseline by 1.0% on dev and 1.2% on test, which demonstrates the effectiveness of our method. For comparison, CoS-E uses human-generated explanations to enhance the question, yet outperforms our model by only 0.3%.

Another interesting phenomenon is that BERT NSP, without any training on the CQA dataset, performs comparably to the ESIM + ELMo/GloVe models that are trained on it.

If you are interested, please refer to my GitHub repo for the source code.