NAACL 2019 Paper Reading Notes
This post records my notes and personal thoughts from reading NAACL 2019 papers.
Based on my research interests, I selected a subset of the accepted papers (50+ papers and 4 demo papers). The keywords include, but are not limited to:
knowledge, sentence representation, relation extraction, bert, attention, question answering
Accepted Papers
- knowledge graph completion A Capsule Network-based Embedding Model for Knowledge Graph Completion and Search Personalization Dai Quoc Nguyen, Thanh Vu, Tu Dinh Nguyen, Dat Quoc Nguyen and Dinh Phung [paper]
Use Hinton’s capsule network to train an embedding for knowledge graph completion and search personalization.
- A Complex-valued Network for Matching Qiuchi Li, Benyou Wang and Massimo Melucci [paper]
Use mathematical formulations from quantum physics to build a matching network for WikiQA and TrecQA (MCQ).
- A general framework for information extraction using dynamic span graphs Yi Luan, Dave Wadden, Luheng He, Amy Shah, Mari Ostendorf and Hannaneh Hajishirzi
Use multitask learning to identify entities, relations and coreference through shared span representations.
- unsupervised learning A Grounded Unsupervised Universal Part-of-Speech Tagger for Low-Resource Languages Ronald Cardenas, Ying Lin, Heng Ji and Jonathan May [paper]
Solve the unsupervised POS tagging task for low-resource languages (CL) by grounding in a high-resource parent language (PL) and using an EM algorithm. \begin{align} \underset{p}{\operatorname{argmax}}\,P_\theta(p|w) & = \underset{p}{\operatorname{argmax}}\sum_{c \in C^{|w|}}{P_\theta(p,c|w)}\tag{1}\label{eq:1} \\ & = \underset{p}{\operatorname{argmax}}\sum_{c \in C^{|w|}}{P_\theta(p|c,w)P_\theta(c|w)}\tag{2}\label{eq:2} \\ & = \underset{p}{\operatorname{argmax}}P_\theta(p|\hat{c}) \tag{3}\label{eq:3} \\ & = \underset{p}{\operatorname{argmax}}P_\theta(\hat{c}|p)P_\theta(p)\tag{4}\label{eq:4} \end{align} Here $w$ is the word sequence, $p$ is the tag sequence, and $C$ is the cluster vocabulary.
\eqref{eq:1} and \eqref{eq:2}: formulate POS induction as a two-step pipeline, from the word sequence $w$ to the POS tag sequence $p$ via a cluster sequence $c$.
\eqref{eq:3}: assume a deterministic pipelined clustering $\hat{c}$ of words and a tag labeling model that does not depend on words.
\eqref{eq:4}: estimate the tag model $P_\theta(p)$ from the parent language, and use EM to estimate $P_\theta(\hat{c}|p)$ (a decoding sketch follows).
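To make the decoding rule in \eqref{eq:4} concrete, here is a minimal sketch of tagging a cluster sequence. The tables `p_tag` (tag prior, from the parent language) and `p_cluster_given_tag` (emissions, estimated with EM) are hypothetical stand-ins, not the paper's actual parameterization.

```python
# Hypothetical model tables (illustrative names and values, not from the paper):
# p_tag[t]                  ~ P(p = t), estimated from the parent language
# p_cluster_given_tag[t][c] ~ P(c | p = t), estimated with EM on the child language
p_tag = {"NOUN": 0.4, "VERB": 0.35, "ADJ": 0.25}
p_cluster_given_tag = {
    "NOUN": {"c1": 0.7, "c2": 0.2, "c3": 0.1},
    "VERB": {"c1": 0.1, "c2": 0.8, "c3": 0.1},
    "ADJ":  {"c1": 0.2, "c2": 0.2, "c3": 0.6},
}

def tag_cluster(c):
    """Eq. (4): argmax_p P(c_hat | p) * P(p) for a single cluster token."""
    return max(p_tag, key=lambda t: p_cluster_given_tag[t].get(c, 0.0) * p_tag[t])

# A word sequence is first mapped deterministically to clusters (Eq. 3),
# then each cluster is labeled.
clusters = ["c1", "c2", "c3"]              # c_hat for some word sequence w
print([tag_cluster(c) for c in clusters])  # -> ['NOUN', 'VERB', 'ADJ']
```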
- A Multi-Task Approach for Disentangling Syntax and Semantics in Sentence Representations Mingda Chen, Qingming Tang, Sam Wiseman and Kevin Gimpel [paper]
Use two latent variables to generate a sentence: a semantic variable $y$ and a syntactic variable $z$. The authors introduce the vMF-Gaussian Variational Autoencoder (VGVAE) as the model, and find that the model with the best-performing syntactic and semantic representations also gives rise to the most disentangled representations.
- explainability A Structural Probe for Finding Syntax in Word Representations John Hewitt and Christopher D. Manning [link]
Can we recover structural information from contextual embeddings such as ELMo or BERT? Use a probe/diagnostic classifier (a one-layer NN) to predict the syntax tree distances between words in a sentence. The answer is yes: it works surprisingly well (a sketch of the probe follows).
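As a minimal sketch of the probe idea: learn a single matrix $B$ so that $\|B(h_i - h_j)\|_2^2$ approximates the parse-tree distance between words $i$ and $j$. The training loop and variable names below are illustrative, not the paper's exact setup.

```python
import torch

def probe_distances(H, B):
    """Squared L2 distances under the probe: d(i, j) = ||B(h_i - h_j)||^2.

    H: (seq_len, dim) contextual embeddings for one sentence
    B: (rank, dim) learned probe matrix
    Returns a (seq_len, seq_len) matrix of predicted tree distances.
    """
    transformed = H @ B.T                                  # (seq_len, rank)
    diffs = transformed[:, None, :] - transformed[None, :, :]
    return (diffs ** 2).sum(-1)

# Illustrative training step: fit B so predicted distances match gold
# parse-tree distances (number of edges between words).
seq_len, dim, rank = 8, 1024, 64
H = torch.randn(seq_len, dim)                              # e.g., frozen ELMo/BERT states
gold = torch.randint(1, 6, (seq_len, seq_len)).float()     # toy gold tree distances
B = torch.randn(rank, dim, requires_grad=True)
opt = torch.optim.Adam([B], lr=1e-3)

pred = probe_distances(H, B)
loss = (pred - gold).abs().mean()                          # L1 loss on distances
loss.backward()
opt.step()
```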
- word representation A Systematic Study of Leveraging Subword Information for Learning Word Representations Yi Zhu, Ivan Vulić and Anna Korhonen [paper]
Propose a general framework for learning subword-informed word embeddings with two stages: 1) segmentation of words into subwords; 2) composition of subword embeddings into a word embedding (see the sketch below).
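A minimal sketch of the two-stage idea, using character n-gram segmentation and additive composition (one of the simplest instantiations, in the spirit of fastText). The segmentation scheme and table names are illustrative assumptions, not the paper's specific configuration.

```python
import numpy as np

DIM = 50
rng = np.random.default_rng(0)
subword_emb = {}  # hypothetical subword embedding table

def segment(word, n=3):
    """Stage 1: segment a word into character n-grams with boundary markers."""
    padded = f"<{word}>"
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]

def embed(word):
    """Stage 2: compose subword embeddings into a word embedding (here: sum)."""
    vecs = []
    for sub in segment(word):
        if sub not in subword_emb:
            subword_emb[sub] = rng.normal(size=DIM)  # lazily initialized
        vecs.append(subword_emb[sub])
    return np.sum(vecs, axis=0)

print(embed("where").shape)  # (50,)
```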
- Abstractive Summarization of Reddit Posts with Multi-level Memory Networks Byeongchang Kim, Hyunwoo Kim and Gunhee Kim
- Abusive Language Detection with Graph Convolutional Networks Pushkar Mishra, Marco Del Tredici, Helen Yannakoudakis and Ekaterina Shutova
- Adaptive Convolution for Text Classification Byung-Ju Choi, Jun-Hyung Park and SangKeun Lee
- Alignment over Heterogeneous Embeddings for Question Answering Vikas Yadav, Steven Bethard and Mihai Surdeanu
- An Effective Label Noise Model for DNN Text Classification Ishan Jindal, Daniel Pressel, Brian Lester and Matthew Nokleby
- An Embarrassingly Simple Approach for Transfer Learning from Pretrained Language Models Alexandra Chronopoulou, Christos Baziotis and Alexandros Potamianos
- An Encoding Strategy Based Word-Character LSTM for Chinese NER Wei Liu, Tongge Xu, Qinghua Xu, Jiayu Song and Yueran Zu
- keyphrase generation An Integrated Approach for Keyphrase Generation via Exploring the Power of Retrieval and Extraction Wang Chen, Hou Pong Chan, Piji Li, Lidong Bing and Irwin King [paper]
Keyphrase generation task: combine extractive and generative approaches in a multitask learning framework; use the keyphrases of similar documents as external memory; use a neural merging module to combine the candidate keyphrases.
- explainability Attention is not Explanation Sarthak Jain and Byron C. Wallace [paper]
Experiments on sentence classification (SST, IMDB, etc.), QA (CNN/Daily Mail and bAbI) and NLI (SNLI) tasks show that attention weights are only weakly correlated with gradient-based and leave-one-out feature-importance measures, and that adversarially generated attention weights can achieve similar task performance (a sketch of the correlation check follows).
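A minimal sketch of one of the paper's diagnostics: compare attention weights against leave-one-out importance (the change in model output when a token is removed) using Kendall's tau. The model interface (`predict`, `attention`) is a hypothetical stand-in.

```python
import numpy as np
from scipy.stats import kendalltau

def leave_one_out_importance(predict, tokens):
    """Importance of token i = |p(full input) - p(input without token i)|."""
    base = predict(tokens)
    return np.array([
        abs(base - predict(tokens[:i] + tokens[i + 1:]))
        for i in range(len(tokens))
    ])

def attention_vs_loo(predict, attention, tokens):
    """Kendall tau between attention weights and leave-one-out importance."""
    attn = attention(tokens)   # hypothetical: per-token attention weights
    loo = leave_one_out_importance(predict, tokens)
    tau, _ = kendalltau(attn, loo)
    return tau                 # near 0 => attention disagrees with this measure
```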
- Attentive Mimicking: Better Word Embeddings by Attending to Informative Contexts Timo Schick and Hinrich Schütze
- BAG: Bi-directional Attention Entity Graph Convolutional Network for Multi-hop Reasoning Question Answering Yu Cao, Meng Fang and Dacheng Tao
- Better Modeling of Incomplete Annotations for Named Entity Recognition Zhanming Jie, Pengjun Xie, Wei Lu, Ruixue Ding and Linlin Li
- Bidirectional Attentive Memory Networks for Question Answering over Knowledge Bases Yu Chen, Lingfei Wu and Mohammed Zaki
- Biomedical Event Extraction based on Knowledge-driven Tree-LSTM Diya Li, Lifu Huang, Heng Ji and Jiawei Han
- Chinese Named Entity Recognition using Featured Embeddings and Attention Mechanism Yuying Zhu and Guoxin Wang
- Combining Distant and Direct Supervision for Neural Relation Extraction Iz Beltagy, Kyle Lo and Waleed Ammar
- QA CommonsenseQA: A Question Answering Challenge Targeting Commonsense Knowledge Alon Talmor, Jonathan Herzig, Nicholas Lourie and Jonathan Berant [link][paper]
| Model | Test Accuracy |
| --- | --- |
| BERT-Large | 55.9 |
| Human | 88.8 |

- relation extraction Connecting Language and Knowledge with Heterogeneous Representations for Neural Relation Extraction Peng Xu and Denilson Barbosa [paper]
Relation extraction on the NYT corpus: integrate an RE model and a KBE model.
RE: embedding layer + BiLSTM encoder + multilevel attention.
KBE: following Trouillon et al. (2016) (ComplEx), obtain complex-valued knowledge representations of entities and relations.
Dissimilarities: dissimilarity measures connect the RE predictions with the KBE predictions.
Combine: the training objective combines the RE loss, the KBE loss and the dissimilarity terms (a sketch of the ComplEx scorer follows).
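As a minimal sketch of the KBE side, the ComplEx scoring function of Trouillon et al. (2016) scores a triple $(s, r, o)$ as $\mathrm{Re}(\langle e_s, w_r, \bar{e}_o \rangle)$; the embedding dimensionality below is illustrative.

```python
import numpy as np

def complex_score(e_s, w_r, e_o):
    """ComplEx triple score: Re(<e_s, w_r, conj(e_o)>).

    e_s, w_r, e_o: complex-valued embeddings of subject, relation, object,
    each of shape (dim,). Higher scores mean the triple is more plausible.
    """
    return np.sum(e_s * w_r * np.conj(e_o)).real

dim = 100
rng = np.random.default_rng(0)
make = lambda: rng.normal(size=dim) + 1j * rng.normal(size=dim)
e_s, w_r, e_o = make(), make(), make()
print(complex_score(e_s, w_r, e_o))
```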
- Continual Learning for Sentence Representations Using Conceptors Tianlin Liu, Lyle Ungar and Joao Sedoc
- self attention Convolutional Self-Attention Networks Baosong Yang, Longyue Wang, Derek F. Wong, Lidia S. Chao and Zhaopeng Tu [paper]
Motivation: self-attention networks (SANs) may fail to emphasize local context information. The authors therefore limit the self-attention scope to a small context window and attend over multiple heads jointly, forming a structure analogous to a 2D convolution.
Question: how can this model handle sentences with long-distance dependencies? (A sketch of the windowed attention mask follows.)
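A minimal sketch of restricting self-attention to a local window, implemented as an additive mask over attention logits. The window size and tensor shapes are illustrative, not the paper's exact configuration.

```python
import torch

def local_attention(Q, K, V, window=2):
    """Self-attention restricted to a +/- `window` neighborhood per position.

    Q, K, V: (seq_len, dim). Positions outside the window get -inf logits,
    so softmax assigns them zero weight.
    """
    seq_len, dim = Q.shape
    logits = Q @ K.T / dim ** 0.5                        # (seq_len, seq_len)
    idx = torch.arange(seq_len)
    outside = (idx[:, None] - idx[None, :]).abs() > window
    logits = logits.masked_fill(outside, float("-inf"))
    return torch.softmax(logits, dim=-1) @ V

Q = K = V = torch.randn(10, 64)
out = local_attention(Q, K, V, window=2)  # each token attends to <=5 neighbors
print(out.shape)                          # torch.Size([10, 64])
```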
- Disentangling Language and Knowledge in Task Oriented Dialogs Dinesh Raghu, Nikhil Gupta and Mausam
- relation extraction Distant Supervision Relation Extraction with Intra-Bag and Inter-Bag Attentions Zhi-Xiu Ye and Zhen-Hua Ling [paper]
Inter-bag and intra-bag attentions are used to denoise at different levels.
Improved intra-bag attention: consider all relation labels when computing the intra-bag attention weights.
Added inter-bag attention: relax the at-least-one hypothesis (a sketch of the two attention levels follows).
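A minimal sketch of the two attention levels: intra-bag attention pools sentence representations within a bag, and inter-bag attention pools bag representations within a group of bags sharing a label. Shapes and the similarity function are illustrative assumptions.

```python
import torch

def attention_pool(items, query):
    """Weighted average of `items` (n, dim) by similarity to `query` (dim,)."""
    weights = torch.softmax(items @ query, dim=0)        # (n,)
    return weights @ items                               # (dim,)

rel_query = torch.randn(64)                              # relation query vector

# Intra-bag: pool sentence vectors within each bag into one bag vector.
bags = [torch.randn(5, 64), torch.randn(3, 64)]          # 2 bags of sentences
bag_vecs = torch.stack([attention_pool(b, rel_query) for b in bags])

# Inter-bag: pool bag vectors within a group of same-label bags,
# down-weighting entirely noisy bags (relaxing "at least one").
group_vec = attention_pool(bag_vecs, rel_query)
print(group_vec.shape)                                   # torch.Size([64])
```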
- rebuttal Does My Rebuttal Matter? Insights from a Major NLP Conference Yang Gao, Steffen Eger, Ilia Kuznetsov, Iryna Gurevych and Yusuke Miyao [paper]
Key findings:
- “Peer pressure” is the most important factor in score changes
- To improve the score for a borderline paper, a more convincing, specific and explicit response may be helpful
- An impolite author response may harm the final score
- Machine Reading DROP: A Reading Comprehension Benchmark Requiring Discrete Reasoning Over Paragraphs Dheeru Dua, Yizhong Wang, Pradeep Dasigi, Gabriel Stanovsky, Sameer Singh and Matt Gardner [link][paper]
A really interesting reading comprehension dataset requiring many types of reasoning: Subtraction (28.8%), Comparison (18.2%), Selection (19.4%), Addition (11.7%), Count (16.5%), Sort (11.7%), Coreference Resolution (3.7%), Other Arithmetic (3.2%), Set of Spans (6.0%) and Other (6.8%).
| Model | Test F1 |
| --- | --- |
| BERT | 32.7% |
| NAQANet | 47.01% |
| Human | 96.42% |
- Enhancing Key-Value Memory Neural Networks for Knowledge Based Question Answering Kun Xu, Yuxuan Lai, Yansong Feng and Zhiguo Wang
- Entailment-based Question Answering over Multiple Sentences Harsh Trivedi, Heeyoung Kwon, Tushar Khot, Ashish Sabharwal and Niranjan Balasubramanian
- Exploiting Noisy Data in Distant Supervision Relation Classification Kaijia Yang, Liang He, Xin-Yu Dai, Shujian Huang and Jiajun Chen
- FreebaseQA: A New Factoid QA Dataset Matching Trivia-Style Question-Answer Pairs with Freebase Kelvin Jiang, Dekun Wu and Hui Jiang
- GAN Driven Semi-distant Supervision for Relation Extraction Pengshuai Li, Xinsong Zhang, Weijia Jia and Hai Zhao
- Generating Knowledge Graph Paths from Textual Definitions using Sequence-to-Sequence Models Victor Prokhorov, Mohammad Taher Pilehvar and Nigel Collier
- GraphIE: A Graph-Based Framework for Information Extraction Yujie Qian, Enrico Santus, Zhijing Jin, Jiang Guo and Regina Barzilay
- Improving Machine Reading Comprehension with General Reading Strategies Kai Sun, Dian Yu, Dong Yu and Claire Cardie
- Information Aggregation for Multi-Head Attention with Routing-by-Agreement Jian Li, Baosong Yang, Zi-Yi Dou, Xing Wang, Michael R. Lyu and Zhaopeng Tu
- Integrating Semantic Knowledge to Tackle Zero-shot Text Classification Jingqing Zhang, Piyawat Lertvittayakumjorn and Yike Guo
- Joint Detection and Location of English Puns Yanyan Zou and Wei Lu
- Knowledge-Augmented Language Model and Its Application to Unsupervised Named-Entity Recognition Angli Liu, Jingfei Du and Veselin Stoyanov
- Learning to Attend On Essential Terms: An Enhanced Retriever-Reader Model for Open-domain Question Answering Jianmo Ni, Chenguang Zhu, Weizhu Chen and Julian McAuley
- Neural Chinese Address Parsing Hao Li, Wei Lu, Pengjun Xie and Linlin Li
- No Permanent Friends or Enemies: Tracking Dynamic Relationships Between Nations from News Xiaochuang Han, Eunsol Choi and Chenhao Tan
- On Knowledge distillation from complex networks for response prediction Siddhartha Arora, Mitesh M. Khapra and Harish G. Ramaswamy
- On Measuring Social Biases in Sentence Encoders Chandler May, Alex Wang, Shikha Bordia, Samuel R. Bowman and Rachel Rudinger
- pair2vec: Compositional Word-Pair Embeddings for Cross-Sentence Inference Mandar Joshi, Eunsol Choi, Omer Levy, Daniel Weld and Luke Zettlemoyer
- PAWS: A Dataset for Measuring Structure Sensitivity in Paraphrase Identification Yuan Zhang, Jason Baldridge and Luheng He
- Relation Classification Using Segment-Level Attention-based CNN and Dependency-based RNN Van-Hien Tran, Van-Thuy Phi, Hiroyuki Shindo and Yuji Matsumoto
- Text Generation from Knowledge Graphs Rik Koncel-Kedziorski, Dhanush Bekal, Yi Luan, Mirella Lapata and Hannaneh Hajishirzi
- UHop: An Unrestricted-Hop Relation Extraction Framework for Knowledge-Based Question Answering Zi-Yuan Chen, Chih-Hung Chang, Yi-Pei Chen, Jijnasa Nayak and Lun-Wei Ku
- common sense Unsupervised Deep Structured Semantic Models for Commonsense Reasoning Shuohang Wang, Sheng Zhang, Yelong Shen, Xiaodong Liu, Jingjing Liu, Jianfeng Gao and Jing Jiang [paper]
Related work worth reading: GPT-2 (Radford et al., 2019); ConceptNet/WordNet (Liu et al., 2017); the language-model approach of Trinh and Le (2018).
- word embedding What just happened? Evaluating retrofitted distributional word vectors Dmetri Hayes [paper]
Given a pretrained word embedding, introduce external knowledge such as synonyms and antonyms to fine-tune the word embeddings, so that distances between synonyms are minimized and distances between antonyms are maximized.
Question: can more external knowledge, e.g. knowledge graphs, be integrated into the word embeddings? (A sketch of the attract/repel refinement follows.)
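A minimal sketch of this attract/repel-style fine-tuning with a few gradient steps: pull synonym pairs together, push antonym pairs apart. The margin, learning rate, and table names are illustrative assumptions, not the paper's exact objective.

```python
import numpy as np

def refine(emb, synonyms, antonyms, lr=0.1, margin=1.0, steps=50):
    """Fine-tune word vectors in place: shrink synonym distances,
    grow antonym distances (up to `margin`). `emb` maps word -> np.ndarray."""
    for _ in range(steps):
        for a, b in synonyms:              # attract: minimize ||emb[a] - emb[b]||
            diff = emb[a] - emb[b]
            emb[a] -= lr * diff
            emb[b] += lr * diff
        for a, b in antonyms:              # repel: push apart while inside margin
            diff = emb[a] - emb[b]
            if np.linalg.norm(diff) < margin:
                emb[a] += lr * diff
                emb[b] -= lr * diff
    return emb

rng = np.random.default_rng(0)
emb = {w: rng.normal(size=10) for w in ["hot", "warm", "cold"]}
refine(emb, synonyms=[("hot", "warm")], antonyms=[("hot", "cold")])
print(np.linalg.norm(emb["hot"] - emb["warm"]))  # small after refinement
```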
System Demonstrations
- End-to-End Open-Domain Question Answering with BERTserini Wei Yang, Yuqing Xie, Aileen Lin, Xingyu Li, Luchen Tan, Kun Xiong, Ming Li and Jimmy Lin
- fairseq: A Fast, Extensible Toolkit for Sequence Modeling Myle Ott, Sergey Edunov, Alexei Baevski, Angela Fan, Sam Gross, Nathan Ng, David Grangier and Michael Auli
- FLAIR: An Easy-to-Use Framework for State-of-the-Art NLP Alan Akbik, Tanja Bergmann, Duncan Blythe, Kashif Rasul, Stefan Schweter and Roland Vollgraf
- iComposer: An Automatic Songwriting System for Chinese Popular Music Hsin-Pei Lee, Jhih-Sheng Fang and Wei-Yun Ma