1. Long Papers (Main Conference)
2. Short Papers (Main Conference)
3. System Demonstration Papers
4. Student Research Workshop Papers

Based on their titles, papers have been automatically tagged with the following topics wherever applicable:
QA, summarization, dialogue/conversation, MT, NLG, parsing, transfer, corpus, NER, bias, IE, NLI, RepL

Long Papers (Main Conference)

SphereRE: Distinguishing Lexical Relations with Hyperspherical Relation Embeddings link
Chengyu Wang, XIAOFENG HE, Aoying Zhou

Learning from Dialogue after Deployment: Feed Yourself, Chatbot! link dialogue/conversation
Braden Hancock, Antoine Bordes, Pierre-Emmanuel Mazare, Jason Weston

MOROCO: The Moldavian and Romanian Dialectal Corpus link corpus
Andrei Butnaru, Radu Tudor Ionescu

In this work, we introduce the MOldavian and ROmanian Dialectal COrpus (MOROCO), which is freely available for download at https://github.com/butnaruandrei/MOROCO. The corpus contains 33564 samples of text (with over 10 million tokens) collected from the news domain. The samples belong to one of the following six topics: culture, finance, politics, science, sports and tech. The data set is divided into 21719 samples for training, 5921 samples for validation and another 5924 samples for testing. For each sample, we provide corresponding dialectal and category labels. This allows us to perform empirical studies on several classification tasks such as (i) binary discrimination of Moldavian versus Romanian text samples, (ii) intra-dialect multi-class categorization by topic and (iii) cross-dialect multi-class categorization by topic. We perform experiments using a shallow approach based on string kernels, as well as a novel deep approach based on character-level convolutional neural networks containing Squeeze-and-Excitation blocks. We also present and analyze the most discriminative features of our best performing model, before and after named entity removal.

Improved Language Modeling by Decoding the Past link
Siddhartha Brahma

Highly regularized LSTMs achieve impressive results on several benchmark datasets in language modeling. We propose a new regularization method based on decoding the last token in the context using the predicted distribution of the next token. This biases the model towards retaining more contextual information, in turn improving its ability to predict the next token. With negligible overhead in the number of parameters and training time, our Past Decode Regularization (PDR) method achieves a word level perplexity of 55.6 on the Penn Treebank and 63.5 on the WikiText-2 datasets using a single softmax. We also show gains by using PDR in combination with a mixture-of-softmaxes, achieving a word level perplexity of 53.8 and 60.5 on these datasets. In addition, our method achieves 1.169 bits-per-character on the Penn Treebank Character dataset for character level language modeling. These results constitute a new state-of-the-art in their respective settings.
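
The regularizer described above lends itself to a compact sketch. Below is a minimal PyTorch reading of the idea, assuming the language model exposes its next-token logits and input embedding matrix; the module name, the extra linear decoder, and the loss weight are illustrative assumptions, not the authors' implementation.

```python
# A minimal sketch of the idea behind Past Decode Regularization (PDR), as read
# from the abstract: the predicted next-token distribution is used to "decode"
# the last context token, and the resulting cross-entropy is added to the LM loss.
import torch
import torch.nn as nn
import torch.nn.functional as F

class PastDecodeRegularizer(nn.Module):
    def __init__(self, vocab_size, emb_dim, pdr_weight=0.1):
        super().__init__()
        self.past_decoder = nn.Linear(emb_dim, vocab_size)  # decodes the previous token
        self.pdr_weight = pdr_weight

    def forward(self, next_token_logits, embedding_weight, input_tokens):
        # next_token_logits: (batch, seq, vocab) -- the LM's prediction for token t+1
        # embedding_weight:  (vocab, emb_dim)    -- the LM's input embedding matrix
        # input_tokens:      (batch, seq)        -- token t, i.e. the last context token
        probs = F.softmax(next_token_logits, dim=-1)
        soft_emb = probs @ embedding_weight          # expected embedding of the next token
        past_logits = self.past_decoder(soft_emb)    # try to recover the current token
        pdr_loss = F.cross_entropy(
            past_logits.reshape(-1, past_logits.size(-1)),
            input_tokens.reshape(-1),
        )
        return self.pdr_weight * pdr_loss            # added to the usual LM loss
```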

Generating Responses with a Specific Emotion in Dialog dialogue/conversation
Zhenqiao Song, Xiaoqing Zheng, Xuanjing Huang, Mu Xu, Lu Liu

Semantically Conditioned Dialog Response Generation via Hierarchical Disentangled Self-Attention link dialogue/conversation NLG
Wenhu Chen, Jianshu Chen, Pengda Qin, Xifeng Yan, William Yang Wang

Semantically controlled neural response generation on limited domains has achieved strong performance. However, moving towards multi-domain, large-scale scenarios is difficult because the possible combinations of semantic inputs grow exponentially with the number of domains. To alleviate this scalability issue, we exploit the structure of dialog acts to build a multi-layer hierarchical graph, where each act is represented as a root-to-leaf route on the graph. Then, we incorporate this graph structure as an inductive bias to build a hierarchical disentangled self-attention network, where we disentangle attention heads to model designated nodes on the dialog act graph. By activating different (disentangled) heads at each layer, combinatorially many dialog act semantics can be modeled to control neural response generation. On the large-scale Multi-Domain-WOZ dataset, our model yields a significant improvement over the baselines on various automatic and human evaluation metrics.

Inferring Concept Hierarchies from Text Corpora via Hyperbolic Embeddings link
Matthew Le, Stephen Roller, Laetitia Papaxanthos, Douwe Kiela, Maximilian Nickel

We consider the task of inferring is-a relationships from large text corpora. For this purpose, we propose a new method combining hyperbolic embeddings and Hearst patterns. This approach allows us to set appropriate constraints for inferring concept hierarchies from distributional contexts while also being able to predict missing is-a relationships and to correct wrong extractions. Moreover -- and in contrast with other methods -- the hierarchical nature of hyperbolic space allows us to learn highly efficient representations and to improve the taxonomic consistency of the inferred hierarchies. Experimentally, we show that our approach achieves state-of-the-art performance on several commonly-used benchmarks.
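
As a quick illustration of the hyperbolic-geometry component, the sketch below computes the Poincaré-ball distance and a simple norm-based is-a score in the spirit of earlier Poincaré embedding work; the exact objective the authors combine with Hearst patterns is not reproduced here, and the scoring rule is an assumption.

```python
# A small sketch of the Poincaré (hyperbolic) distance that underlies
# hyperbolic concept embeddings.
import numpy as np

def poincare_distance(u, v, eps=1e-9):
    """Distance between two points inside the unit ball."""
    sq_norm_u = np.sum(u * u)
    sq_norm_v = np.sum(v * v)
    sq_dist = np.sum((u - v) ** 2)
    x = 1.0 + 2.0 * sq_dist / ((1.0 - sq_norm_u) * (1.0 - sq_norm_v) + eps)
    return np.arccosh(x)

# In hyperbolic hierarchies, more general concepts tend to sit closer to the
# origin, so a crude is-a score can combine distance with the norm difference.
def is_a_score(child, parent, alpha=1.0):
    return -(1.0 + alpha * (np.linalg.norm(parent) - np.linalg.norm(child))) \
        * poincare_distance(child, parent)
```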

Open Vocabulary Learning for Neural Chinese Pinyin IME link
Zhuosheng Zhang, Yafang Huang, Hai Zhao

Pinyin-to-character (P2C) conversion is the core component of a pinyin-based Chinese input method engine (IME). However, the conversion is seriously compromised by the ambiguity of Chinese characters corresponding to a given pinyin, as well as by predefined fixed vocabularies. To alleviate these inconveniences, we propose a neural P2C conversion model augmented by an online-updated vocabulary with a sampling mechanism to support open vocabulary learning while the IME is in use. Our experiments show that the proposed method outperforms commercial IMEs and state-of-the-art traditional models on a standard corpus and a real input-history dataset in terms of multiple metrics, and thus that the online-updated vocabulary indeed helps our IME follow user input behavior effectively.

Head-Driven Phrase Structure Grammar Parsing on Penn Treebank parsing
Junru Zhou, Hai Zhao

Weakly-Supervised Spatio-Temporally Grounding Natural Sentence in Video link
Zhenfang Chen, Lin Ma, Wenhan Luo, Kwan-Yee Kenneth Wong

In this paper, we address a novel task, namely weakly-supervised spatio-temporally grounding natural sentence in video. Specifically, given a natural sentence and a video, we localize a spatio-temporal tube in the video that semantically corresponds to the given sentence, with no reliance on any spatio-temporal annotations during training. First, a set of spatio-temporal tubes, referred to as instances, are extracted from the video. We then encode these instances and the sentence using our proposed attentive interactor which can exploit their fine-grained relationships to characterize their matching behaviors. Besides a ranking loss, a novel diversity loss is introduced to train the proposed attentive interactor to strengthen the matching behaviors of reliable instance-sentence pairs and penalize the unreliable ones. Moreover, we also contribute a dataset, called VID-sentence, based on the ImageNet video object detection dataset, to serve as a benchmark for our task. Extensive experimental results demonstrate the superiority of our model over the baseline approaches.

Reinforced Training Data Selection for Domain Adaptation
Miaofeng Liu, Yan Song, Hongbin Zou, Tong Zhang

Generating Long and Informative Reviews with Aspect-Aware Coarse-to-Fine Decoding link
Junyi Li, Wayne Xin Zhao, Ji-Rong Wen, Yang Song

Generating long and informative review text is a challenging natural language generation task. Previous work focuses on word-level generation, neglecting the importance of topical and syntactic characteristics of natural languages. In this paper, we propose a novel review generation model by characterizing an elaborately designed aspect-aware coarse-to-fine generation process. First, we model the aspect transitions to capture the overall content flow. Then, to generate a sentence, an aspect-aware sketch is predicted using an aspect-aware decoder. Finally, another decoder fills in the semantic slots by generating the corresponding words. Our approach is able to jointly utilize aspect semantics, the syntactic sketch, and context information. Extensive experimental results demonstrate the effectiveness of the proposed model.

Neural News Recommendation with Long- and Short-term User Representations RepL
Mingxiao An, Fangzhao Wu, Chuhan Wu, Kun Zhang, Zheng Liu, Xing Xie

Inferential Machine Comprehension: Answering Questions by Recursively Deducing the Evidence Chain from Text
Jianxing YU, Zhengjun ZHA, Jian YIN

Coherent Comments Generation for Chinese Articles with a Graph-to-Sequence Model link NLG
Wei Li, Jingjing Xu, Yancheng He, ShengLi Yan, Yunfang Wu, Xu SUN

Adversarial Attention Modeling for Multi-dimensional Emotion Regression
Suyang Zhu, Shoushan Li, Guodong Zhou

Divide, Conquer and Combine: Hierarchical Feature Fusion Network with Local and Global Perspectives for Multimodal Affective Computing
Sijie Mai, Haifeng Hu, Songlong Xing

OpenDialKG: Explainable Conversational Reasoning with Attention-based Walks over Knowledge Graphs dialogue/conversation
Seungwhan Moon, Pararth Shah, Anuj Kumar, Rajen Subba

Joint Slot Filling and Intent Detection via Capsule Neural Networks link
Chenwei Zhang, Yaliang Li, Nan Du, Wei Fan, Philip Yu

Being able to recognize words as slots and detect the intent of an utterance has been a keen issue in natural language understanding. The existing works either treat slot filling and intent detection separately in a pipeline manner, or adopt joint models which sequentially label slots while summarizing the utterance-level intent without explicitly preserving the hierarchical relationship among words, slots, and intents. To exploit the semantic hierarchy for effective modeling, we propose a capsule-based neural network model which accomplishes slot filling and intent detection via a dynamic routing-by-agreement schema. A re-routing schema is proposed to further synergize the slot filling performance using the inferred intent representation. Experiments on two real-world datasets show the effectiveness of our model when compared with other alternative model architectures, as well as existing natural language understanding services.

Semi-supervised Domain Adaptation for Dependency Parsing parsing
Zhenghua Li, Xue Peng, Min Zhang, Rui Wang, Luo Si

This Email Could Save Your Life: Introducing the Task of Email Subject Line Generation link NLG
Rui Zhang, Joel Tetreault

Given the overwhelming number of emails, an effective subject line becomes essential to better inform the recipient of the email's content. In this paper, we propose and study the task of email subject line generation: automatically generating an email subject line from the email body. We create the first dataset for this task and find that email subject line generation favors extremely abstractive summaries, which differentiates it from news headline generation or news single-document summarization. We then develop a novel deep learning method and compare it to several baselines as well as recent state-of-the-art text summarization systems. We also investigate the efficacy of several automatic metrics based on correlations with human judgments and propose a new automatic evaluation metric. Our system outperforms competitive baselines under both automatic and human evaluation. To our knowledge, this is the first work to tackle the problem of effective email subject line generation.

Dense Procedure Captioning in Narrated Instructional Videos
Botian Shi, Lei Ji, Yaobo Liang, Zhendong NIU, Nan Duan, Ming Zhou

Incremental Learning from Scratch for Task-Oriented Dialogue Systems link dialogue/conversation
Weikang Wang, Jiajun Zhang, Qian Li, Mei-Yuh Hwang, Chengqing Zong, Zhifei Li

Clarifying user needs is essential for existing task-oriented dialogue systems. However, in real-world applications, developers can never guarantee that all possible user demands are taken into account in the design phase. Consequently, existing systems will break down when encountering unconsidered user needs. To address this problem, we propose a novel incremental learning framework for designing task-oriented dialogue systems, called the Incremental Dialogue System (IDS) for short, without pre-defining an exhaustive list of user needs. Specifically, we introduce an uncertainty estimation module to evaluate the confidence of giving correct responses. If confidence is high, IDS provides responses to users. Otherwise, humans are involved in the dialogue process, and IDS can learn from human intervention through an online learning module. To evaluate our method, we propose a new dataset which simulates unanticipated user needs in the deployment stage. Experiments show that IDS is robust to unconsidered user actions, and can update itself online by smartly selecting only the most effective training data, and hence attains better performance with less annotation cost.

Multi-Modal Sarcasm Detection in Twitter with Hierarchical Fusion Model
Yitao Cai, Huiyu Cai, Xiaojun Wan

Task Refinement Learning for Improved Accuracy and Stability of Unsupervised Domain Adaptation
Yftah Ziser, Roi Reichart

Deep Dominance - How to Properly Compare Deep Neural Models link
Rotem Dror, Segev Shlomov, Roi Reichart

Zero-Shot Semantic Parsing for Instructions parsing
Ofer Givoli, Roi Reichart

Automatic Domain Adaptation Outperforms Manual Domain Adaptation for Predicting Financial Outcomes
Marina Sedinkina, Nikolas Breitkopf, Hinrich Schütze

Latent Variable Model for Multi-modal Translation link MT
Iacer Calixto, Miguel Rios, Wilker Aziz

Token-level Dynamic Self-Attention Network for Multi-Passage Reading Comprehension
Yimeng Zhuang, Huadong Wang

PaperRobot: Incremental Draft Generation of Scientific Ideas link NLG
Qingyun Wang, Lifu Huang, Zhiying Jiang, Kevin Knight, Heng Ji, Mohit Bansal, Yi Luan

We present PaperRobot, which acts as an automatic research assistant by (1) conducting deep understanding of a large collection of human-written papers in a target domain and constructing comprehensive background knowledge graphs (KGs); (2) creating new ideas by predicting links from the background KGs, combining graph attention and contextual text attention; (3) incrementally writing some key elements of a new paper based on memory-attention networks: from the input title along with predicted related entities to generate a paper abstract, from the abstract to generate conclusion and future work, and finally from future work to generate a title for a follow-on paper. Turing Tests, where a biomedical domain expert is asked to compare a system output and a human-authored string, show that PaperRobot-generated abstracts, conclusion and future work sections, and new titles are chosen over human-written ones up to 30%, 24% and 12% of the time, respectively.

Reliability-aware Dynamic Feature Composition for Name Tagging
Ying Lin, Liyuan Liu, Heng Ji, Dong Yu, Jiawei Han

Topic Tensor Network for Implicit Discourse Relation Recognition in Chinese
Sheng Xu, Peifeng Li, Fang Kong, Qiaoming Zhu, Guodong Zhou

Selection Bias Explorations and Debias Methods for Natural Language Sentence Matching Datasets link bias
Guanhua Zhang, Bing Bai, Jian Liang, Kun Bai, Shiyu Chang, Mo Yu, Conghui Zhu, Tiejun Zhao

Natural Language Sentence Matching (NLSM) has gained substantial attention from both academia and industry, and rich public datasets have contributed a lot to this progress. However, biased datasets can also hurt the generalization performance of trained models and give untrustworthy evaluation results. For many NLSM datasets, the providers select some pairs of sentences into the datasets, and this sampling procedure can easily introduce unintended patterns, i.e., selection bias. One example is the QuoraQP dataset, where some content-independent naive features are unreasonably predictive. Such features reflect the selection bias and are termed leakage features. In this paper, we investigate the problem of selection bias on six NLSM datasets and find that four of them are significantly biased. We further propose a training and evaluation framework to alleviate the bias. Experimental results on QuoraQP suggest that the proposed framework can improve the generalization ability of trained models and give more trustworthy evaluation results for real-world adoption.
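
To make the notion of a leakage feature concrete, the sketch below builds content-independent features (how often each sentence occurs anywhere in the dataset) and checks whether a plain logistic-regression classifier trained only on them beats chance. The particular features and classifier are assumptions for illustration, not the paper's protocol.

```python
# If a classifier trained on content-free features beats chance, the dataset
# leaks label information through its sampling procedure (selection bias).
from collections import Counter
from sklearn.linear_model import LogisticRegression

def sentence_frequencies(all_pairs):
    """Count how often each sentence occurs anywhere in the dataset."""
    return Counter(s for pair in all_pairs for s in pair)

def leakage_features(pairs, freq):
    # Features that ignore sentence content entirely.
    return [[freq[s1], freq[s2], abs(freq[s1] - freq[s2])] for s1, s2 in pairs]

def leakage_baseline(train_pairs, train_labels, test_pairs, test_labels, freq):
    clf = LogisticRegression(max_iter=1000)
    clf.fit(leakage_features(train_pairs, freq), train_labels)
    return clf.score(leakage_features(test_pairs, freq), test_labels)
```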

ReCoSa: Detecting the Relevant Contexts with Self-Attention for Multi-turn Dialogue Generation dialogue/conversation NLG
Hainan Zhang, Yanyan Lan, Liang Pang, Jiafeng Guo, Xueqi Cheng

AutoML strategy based on grammatical evolution: A case study about knowledge discovery from text
Suilan Estevez-Velarde, Yoan Gutiérrez, Andrés Montoyo, Yudivián Almeida-Cruz

Manipulating the Difficulty of C-Tests link
Ji-Ung Lee, Erik Schwan, Christian M. Meyer

We propose two novel manipulation strategies for increasing and decreasing the difficulty of C-tests automatically. This is a crucial step towards generating learner-adaptive exercises for self-directed language learning and preparing language assessment tests. To reach the desired difficulty level, we manipulate the size and the distribution of gaps based on absolute and relative gap difficulty predictions. We evaluate our approach in corpus-based experiments and in a user study with 60 participants. We find that both strategies are able to generate C-tests with the desired difficulty level.

Towards Unsupervised Text Classification Leveraging Experts and Word Embeddings RepL
Zied Haj-Yahia, Adrien Sieg, Léa A. Deleris

A Cross-Sentence Latent Variable Model for Semi-Supervised Text Sequence Matching link
Jihun Choi, Taeuk Kim, Sang-goo Lee

We present a latent variable model for predicting the relationship between a pair of text sequences. Unlike previous auto-encoding-based approaches that consider each sequence separately, our proposed framework utilizes both sequences within a single model by generating a sequence that has a given relationship with a source sequence. We further extend the cross-sentence generating framework to facilitate semi-supervised training. We also define novel semantic constraints that lead the decoder network to generate semantically plausible and diverse sequences. We demonstrate the effectiveness of the proposed model through quantitative and qualitative experiments, while achieving state-of-the-art results on semi-supervised natural language inference and paraphrase identification.

Heuristic Authorship Obfuscation link
Janek Bevendorff, Martin Potthast, Matthias Hagen, Benno Stein

Learning Deep Transformer Models for Machine Translation link MT
Qiang Wang, Bei Li, Tong Xiao, Jingbo Zhu

Transformer is the state-of-the-art model in recent machine translation evaluations. Two strands of research are promising to improve models of this kind: the first uses wide networks (a.k.a. Transformer-Big) and has been the de facto standard for the development of the Transformer system, and the other uses deeper language representation but faces the difficulty arising from learning deep networks. Here, we continue the line of research on the latter. We claim that a truly deep Transformer model can surpass the Transformer-Big counterpart by 1) proper use of layer normalization and 2) a novel way of passing the combination of previous layers to the next. On WMT'16 English-German, NIST OpenMT'12 Chinese-English and larger WMT'18 Chinese-English tasks, our deep system (30/25-layer encoder) outperforms the shallow Transformer-Big/Base baseline (6-layer encoder) by 0.4-2.4 BLEU points. As another bonus, the deep model is 1.6X smaller in size and 3X faster in training than Transformer-Big.
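
A rough sketch of the two ingredients named in the abstract follows: a pre-norm residual sub-layer and a learned linear combination of all previous layers' outputs fed to the next layer. This is an illustrative PyTorch reading of the abstract under those assumptions, not the authors' code.

```python
import torch
import torch.nn as nn

class PreNormSublayer(nn.Module):
    """Apply layer normalization before the sub-layer, then add the residual."""
    def __init__(self, d_model, sublayer):
        super().__init__()
        self.norm = nn.LayerNorm(d_model)
        self.sublayer = sublayer          # any callable on a single tensor,
                                          # e.g. a feed-forward block

    def forward(self, x):
        return x + self.sublayer(self.norm(x))

class LayerCombiner(nn.Module):
    """Combine the outputs of all previous layers with learned weights."""
    def __init__(self, num_layers):
        super().__init__()
        self.weights = nn.Parameter(torch.zeros(num_layers))

    def forward(self, layer_outputs):
        # layer_outputs: list of (batch, seq, d_model) tensors, one per layer so far
        stacked = torch.stack(layer_outputs, dim=0)
        w = torch.softmax(self.weights[: len(layer_outputs)], dim=0)
        return (w.view(-1, 1, 1, 1) * stacked).sum(dim=0)
```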

Multi-News: a Large-Scale Multi-Document Summarization Dataset and Abstractive Hierarchical Model summarization
Alexander Fabbri, Irene Li, Tianwei She, Suyi Li, Dragomir Radev

Rhetorically Controlled Encoder-Decoder for Modern Chinese Poetry Generation NLG
Zhiqiang Liu, Zuohui Fu, Jie Cao, Gerard de Melo, Yik-Cheung Tam, Cheng Niu, Jie Zhou

Automatic Generation of High Quality CCGbanks for Parser Domain Adaptation link NLG parsing
Masashi Yoshikawa, Hiroshi Noji, Koji Mineshima, Daisuke Bekki

We propose a new domain adaptation method for Combinatory Categorial Grammar (CCG) parsing, based on the idea of automatically generating CCG corpora by exploiting cheaper resources of dependency trees. Our solution is conceptually simple and does not rely on a specific parser architecture, making it applicable to the current best-performing parsers. We conduct extensive parsing experiments with detailed discussion; on top of existing benchmark datasets on (1) biomedical texts and (2) question sentences, we create experimental datasets of (3) speech conversation and (4) math problems. When the proposed method is applied, an off-the-shelf CCG parser shows significant performance gains, improving from 90.7% to 96.6% on speech conversation, and from 88.5% to 96.8% on math problems.

Multi-Source Cross-Lingual Model Transfer: Learning What to Share link transfer
Xilun Chen, Ahmed Hassan Awadallah, Hany Hassan, Wei Wang, Claire Cardie

Identifying Visible Actions in Lifestyle Vlogs link
Oana Ignat, Laura Burdick, Jia Deng, Rada Mihalcea

We consider the task of identifying human actions visible in online videos. We focus on the widely spread genre of lifestyle vlogs, which consist of videos of people performing actions while verbally describing them. Our goal is to identify if actions mentioned in the speech description of a video are visually present. We construct a dataset with crowdsourced manual annotations of visible actions, and introduce a multimodal algorithm that leverages information derived from visual and linguistic clues to automatically infer which actions are visible in a video. We demonstrate that our multimodal algorithm outperforms algorithms based only on one modality at a time.

Augmenting Neural Networks with First-order Logic link
Tao Li, Vivek Srikumar

Today, the dominant paradigm for training neural networks involves minimizing task loss on a large dataset. Using world knowledge to inform a model, and yet retain the ability to perform end-to-end training remains an open question. In this paper, we present a novel framework for introducing declarative knowledge to neural network architectures in order to guide training and prediction. Our framework systematically compiles logical statements into computation graphs that augment a neural network without extra learnable parameters or manual redesign. We evaluate our modeling strategy on three tasks: machine comprehension, natural language inference, and text chunking. Our experiments show that knowledge-augmented networks can strongly improve over baselines, especially in low-data regimes.

Bayes Test of Precision, Recall, and F1 Measure for Comparison of Two Natural Language Processing Models
Ruibo WANG, Jihong Li

Explicit Utilization of General Knowledge in Machine Reading Comprehension link
Chao Wang, Hui Jiang

To bridge the gap between Machine Reading Comprehension (MRC) models and human beings, which is mainly reflected in the hunger for data and the robustness to noise, in this paper, we explore how to integrate the neural networks of MRC models with the general knowledge of human beings. On the one hand, we propose a data enrichment method, which uses WordNet to extract inter-word semantic connections as general knowledge from each given passage-question pair. On the other hand, we propose an end-to-end MRC model named Knowledge Aided Reader (KAR), which explicitly uses the above extracted general knowledge to assist its attention mechanisms. Based on the data enrichment method, KAR is comparable in performance with the state-of-the-art MRC models, and significantly more robust to noise. When only a subset (20%-80%) of the training examples is available, KAR outperforms the state-of-the-art MRC models by a large margin, and is still reasonably robust to noise.

DOER: Dual Cross-Shared RNN for Aspect Term-Polarity Co-Extraction link IE
Huaishao Luo, Tianrui Li, Bing Liu, Junbo Zhang

Self-Attentional Models for Lattice Inputs link
Matthias Sperber, Graham Neubig, Ngoc-Quan Pham, Alex Waibel

Lattices are an efficient and effective method to encode ambiguity of upstream systems in natural language processing tasks, for example to compactly capture multiple speech recognition hypotheses, or to represent multiple linguistic analyses. Previous work has extended recurrent neural networks to model lattice inputs and achieved improvements in various tasks, but these models suffer from very slow computation speeds. This paper extends the recently proposed paradigm of self-attention to handle lattice inputs. Self-attention is a sequence modeling technique that relates inputs to one another by computing pairwise similarities and has gained popularity for both its strong results and its computational efficiency. To extend such models to handle lattices, we introduce probabilistic reachability masks that incorporate lattice structure into the model and support lattice scores if available. We also propose a method for adapting positional embeddings to lattice structures. We apply the proposed model to a speech translation task and find that it outperforms all examined baselines while being much faster to compute than previous neural lattice models during both training and inference.
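
The reachability masks can be illustrated with a few lines of code: compute, for a lattice given as a DAG, which node pairs lie on a common path, and block attention between the rest. The binary mask below is a simplification of the probabilistic masks described in the abstract, and the forward-only direction is an assumption.

```python
import numpy as np

def reachability_mask(num_nodes, edges):
    """edges: list of (src, dst) arcs of the lattice (a DAG).
    Returns a (num_nodes, num_nodes) 0/1 matrix with mask[i, j] = 1
    iff j is reachable from i (or i == j)."""
    adj = [[] for _ in range(num_nodes)]
    for src, dst in edges:
        adj[src].append(dst)
    mask = np.eye(num_nodes, dtype=np.float32)
    for start in range(num_nodes):
        stack = list(adj[start])
        while stack:
            node = stack.pop()
            if mask[start, node] == 0:
                mask[start, node] = 1.0
                stack.extend(adj[node])
    return mask

def masked_attention(scores, mask):
    """scores: raw attention logits of shape (num_nodes, num_nodes)."""
    scores = np.where(mask > 0, scores, -1e9)   # block unreachable pairs
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    return weights / weights.sum(axis=-1, keepdims=True)
```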

Multi-style Generative Reading Comprehension link
Kyosuke Nishida, Itsumi Saito, Kosuke Nishida, Kazutoshi Shinoda, Atsushi Otsuka, Hisako Asano, Junji Tomita

Joint Effects of Context and User History for Predicting Online Conversation Re-entries link dialogue/conversation
Xingshan Zeng, Jing Li, Lu Wang, Kam-Fai Wong

As the online world continues its exponential growth, interpersonal communication has come to play an increasingly central role in opinion formation and change. In order to help users better engage with each other online, we study the challenging problem of re-entry prediction: foreseeing whether a user will come back to a conversation they once participated in. We hypothesize that both the context of the ongoing conversations and the users' previous chatting history affect their continued interest in future engagement. Specifically, we propose a neural framework with three main layers, each modeling context, user history, and interactions between them, to explore how the conversation context and user chatting history jointly result in their re-entry behavior. We experiment with two large-scale datasets collected from Twitter and Reddit. Results show that our proposed framework with bi-attention achieves an F1 score of 61.1 on Twitter conversations, outperforming the state-of-the-art methods from previous work.

A Resource-Free Evaluation Metric for Cross-Lingual Word Embeddings Based on Graph Modularity link RepL
Yoshinari Fujinuma, Jordan Boyd-Graber, Michael J. Paul

Cross-lingual word embeddings encode the meaning of words from different languages into a shared low-dimensional space. An important requirement for many downstream tasks is that word similarity should be independent of language - i.e., word vectors within one language should not be more similar to each other than to words in another language. We measure this characteristic using modularity, a graph measure that quantifies the strength of clusters in a graph. Modularity has a moderate to strong correlation with three downstream tasks, even though modularity is based only on the structure of the embeddings and does not require any external resources. We show through experiments that modularity can serve as an intrinsic validation metric to improve unsupervised cross-lingual word embeddings, particularly on distant language pairs in low-resource settings.
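
For concreteness, the following sketch computes the measurement described above: build a k-nearest-neighbour graph over the joint embedding space and evaluate standard Newman modularity with languages as the cluster labels. The choice of k and of cosine similarity are assumptions; high modularity indicates that words cluster by language, which is the undesirable case.

```python
import numpy as np

def knn_graph(vectors, k=3):
    """Symmetric 0/1 adjacency matrix of a cosine k-NN graph."""
    normed = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
    sims = normed @ normed.T
    np.fill_diagonal(sims, -np.inf)
    adj = np.zeros_like(sims)
    for i, row in enumerate(sims):
        for j in np.argsort(row)[-k:]:
            adj[i, j] = adj[j, i] = 1.0
    return adj

def modularity(adj, labels):
    """Q = (1/2m) * sum_ij [A_ij - d_i d_j / 2m] * [label_i == label_j]."""
    degrees = adj.sum(axis=1)
    two_m = adj.sum()
    same = np.asarray(labels)[:, None] == np.asarray(labels)[None, :]
    expected = np.outer(degrees, degrees) / two_m
    return ((adj - expected) * same).sum() / two_m
```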

Textbook Question Answering with Multi-modal Context Graph Understanding and Self-supervised Open-set Comprehension link QA
Daesik Kim, Seonhoon Kim, Nojun Kwak

In this work, we introduce a novel algorithm for solving the textbook question answering (TQA) task, which describes more realistic QA problems compared to other recent tasks. We mainly focus on two related issues identified through analysis of the TQA dataset. First, solving the TQA problems requires comprehending multi-modal contexts in complicated input data. To tackle this issue of extracting knowledge features from long text lessons and merging them with visual features, we establish a context graph from texts and images, and propose a new module, f-GCN, based on graph convolutional networks (GCN). Second, scientific terms are not spread over the chapters, and subjects are split in the TQA dataset. To overcome this so-called "out-of-domain" issue, before learning QA problems, we introduce a novel self-supervised open-set learning process without any annotations. The experimental results show that our model significantly outperforms prior state-of-the-art methods. Moreover, ablation studies validate that both methods of incorporating f-GCN for extracting knowledge from multi-modal contexts and our newly proposed self-supervised learning process are effective for TQA problems.

Multi-Level Matching and Aggregation Network for Few-Shot Relation Classification link
Zhi-Xiu Ye, Zhen-Hua Ling

This paper presents a multi-level matching and aggregation network (MLMAN) for few-shot relation classification. Previous studies on this topic adopt prototypical networks, which calculate the embedding vector of a query instance and the prototype vector of each support set independently. In contrast, our proposed MLMAN model encodes the query instance and each support set in an interactive way by considering their matching information at both local and instance levels. The final class prototype for each support set is obtained by attentive aggregation over the representations of its support instances, where the weights are calculated using the query instance. Experimental results demonstrate the effectiveness of our proposed methods, which achieve a new state-of-the-art performance on the FewRel dataset.

Retrieve, Read, Rerank: Towards End-to-End Multi-Document Reading Comprehension link
Minghao Hu, Yuxing Peng, Zhen Huang, Dongsheng Li

Neural Aspect and Opinion Term Extraction with Mined Rules as Weak Supervision IE
Hongliang Dai, Yangqiu Song

Coupling Retrieval and Meta-Learning for Context-Dependent Semantic Parsing link parsing
Daya Guo, Duyu Tang, Nan Duan, Ming Zhou, Jian Yin

In this paper, we present an approach to incorporate retrieved datapoints as supporting evidence for context-dependent semantic parsing, such as generating source code conditioned on the class environment. Our approach naturally combines a retrieval model and a meta-learner, where the former learns to find similar datapoints from the training data, and the latter considers retrieved datapoints as a pseudo task for fast adaptation. Specifically, our retriever is a context-aware encoder-decoder model with a latent variable which takes the context environment into consideration, and our meta-learner learns to utilize retrieved datapoints in a model-agnostic meta-learning paradigm for fast adaptation. We conduct experiments on the CONCODE and CSQA datasets, where the context refers to the class environment in Java code and to conversational history, respectively. We use a sequence-to-action model as the base semantic parser, which achieves state-of-the-art accuracy on both datasets. Results show that both the context-aware retriever and the meta-learning strategy improve accuracy, and our approach performs better than retrieve-and-edit baselines.

Improving Abstractive Document Summarization with Salient Information Modeling summarization
Yongjian You, Weijia Jia, Tianyi Liu, Wenmian Yang

Unsupervised Neural Single-Document Summarization of Reviews via Learning Latent Discourse Structure and its Ranking link summarization
Masaru Isonuma, Junichiro Mori, Ichiro Sakata

This paper focuses on the end-to-end abstractive summarization of a single product review without supervision. We assume that a review can be described as a discourse tree, in which the summary is the root, and the child sentences explain their parent in detail. By recursively estimating a parent from its children, our model learns the latent discourse tree without an external parser and generates a concise summary. We also introduce an architecture that ranks the importance of each sentence on the tree to support summary generation focusing on the main review point. The experimental results demonstrate that our model is competitive with or outperforms other unsupervised approaches. In particular, for relatively long reviews, it achieves a competitive or better performance than supervised models. The induced tree shows that the child sentences provide additional information about their parent, and the generated summary abstracts the entire review.

Just "OneSeC" for Producing Multilingual Sense-Annotated Data link
Bianca Scarlini, Tommaso Pasini, Roberto Navigli

How to (Properly) Evaluate Cross-Lingual Word Embeddings: On Strong Baselines, Comparative Analyses, and Some Misconceptions link RepL
Goran Glavaš, Robert Litschko, Sebastian Ruder, Ivan Vulić

Cross-lingual word embeddings (CLEs) enable multilingual modeling of meaning and facilitate cross-lingual transfer of NLP models. Despite their ubiquitous usage in downstream tasks, recent increasingly popular projection-based CLE models are almost exclusively evaluated on a single task only: bilingual lexicon induction (BLI). Even BLI evaluations vary greatly, hindering our ability to correctly interpret performance and properties of different CLE models. In this work, we make the first step towards a comprehensive evaluation of cross-lingual word embeddings. We thoroughly evaluate both supervised and unsupervised CLE models on a large number of language pairs in the BLI task and three downstream tasks, providing new insights concerning the ability of cutting-edge CLE models to support cross-lingual NLP. We empirically demonstrate that the performance of CLE models largely depends on the task at hand and that optimizing CLE models for BLI can result in deteriorated downstream performance. We indicate the most robust supervised and unsupervised CLE models and emphasize the need to reassess existing baselines, which still display competitive performance across the board. We hope that our work will catalyze further work on CLE evaluation and model analysis.

Learning from omission
Bill McDowell, Noah Goodman

Topic Modeling with Wasserstein Autoencoders
Feng Nan, Ran Ding, Ramesh Nallapati, Bing Xiang

Towards Language Agnostic Universal Representations link RepL
Armen Aghajanyan, Xia Song, Saurabh Tiwary

When a bilingual student learns to solve word problems in math, we expect the student to be able to solve these problems in both languages the student is fluent in, even if the math lessons were taught in only one language. However, current representations in machine learning are language dependent. In this work, we present a method to decouple the language from the problem by learning language-agnostic representations, therefore allowing a model to be trained in one language and applied to a different one in a zero-shot fashion. We learn these representations by taking inspiration from linguistics and formalizing Universal Grammar as an optimization process (Chomsky, 2014; Montague, 1970). We demonstrate the capabilities of these representations by showing that models trained on a single language using language-agnostic representations achieve very similar accuracies in other languages.

Meaning to Form: Measuring Systematicity as Information link
Tiago Pimentel, Arya D. McCarthy, Damian Blasi, Brian Roark, Ryan Cotterell

A longstanding debate in semiotics centers on the relationship between linguistic signs and their corresponding semantics: is there an arbitrary relationship between a word form and its meaning, or does some systematic phenomenon pervade? For instance, does the character bigram "gl" have any systematic relationship to the meaning of words like "glisten", "gleam" and "glow"? In this work, we offer a holistic quantification of the systematicity of the sign using mutual information and recurrent neural networks. We employ these in a data-driven and massively multilingual approach to the question, examining 106 languages. We find a statistically significant reduction in entropy when modeling a word form conditioned on its semantic representation. Encouragingly, we also recover well-attested English examples of systematic affixes. We conclude with the meta-point: our approximate effect size (measured in bits) is quite small -- despite some amount of systematicity between form and meaning, an arbitrary relationship and its resulting benefits dominate human language.

Counterfactual Data Augmentation for Mitigating Gender Stereotypes in Languages with Rich Morphology link
Ran Zmigrod, Sebastian J. Mielke, Hanna Wallach, Ryan Cotterell

Gender stereotypes are manifest in most of the world's languages and are consequently propagated or amplified by NLP systems. Although research has focused on mitigating gender stereotypes in English, the approaches that are commonly employed produce ungrammatical sentences in morphologically rich languages. We present a novel approach for converting between masculine-inflected and feminine-inflected sentences in such languages. For Spanish and Hebrew, our approach achieves F1 scores of 82% and 73% at the level of tags and accuracies of 90% and 87% at the level of forms. By evaluating our approach using four different languages, we show that, on average, it reduces gender stereotyping by a factor of 2.5 without any sacrifice to grammaticality.

Neural Text Simplification of Clinical Letters with a Domain Specific Phrase Table
Matthew Shardlow, Raheel Nawaz

TIGS: An Inference Algorithm for Text Infilling with Gradient Search link
Dayiheng Liu, Jie Fu, Pengfei Liu, Jiancheng Lv

Text infilling is defined as a task for filling in the missing part of a sentence or paragraph, which is suitable for many real-world natural language generation scenarios. However, given a well-trained sequential generative model, generating missing symbols conditioned on the context is challenging for existing greedy approximate inference algorithms. In this paper, we propose an iterative inference algorithm based on gradient search, which is the first inference algorithm that can be broadly applied to any neural sequence generative models for text infilling tasks. We compare the proposed method with strong baselines on three text infilling tasks with various mask ratios and different mask strategies. The results show that our proposed method is effective and efficient for fill-in-the-blank tasks, consistently outperforming all baselines.

Stochastic Tokenization with a Language Model for Neural Text Classification
Tatsuya Hiraoka, Hiroyuki Shindo, Yuji Matsumoto

Generating Natural Language Adversarial Examples through Probability Weighted Word Saliency
Shuhuai Ren, Yihe Deng, Kun He, Wanxiang Che

Modeling financial analysts’ decision making via the pragmatics and semantics of earnings calls link
Katherine Keith, Amanda Stent

Dialogue Natural Language Inference link dialogue/conversation NLI
Sean Welleck, Jason Weston, Arthur Szlam, Kyunghyun Cho

Consistency is a long-standing issue faced by dialogue models. In this paper, we frame the consistency of dialogue agents as natural language inference (NLI) and create a new natural language inference dataset called Dialogue NLI. We propose a method which demonstrates that a model trained on Dialogue NLI can be used to improve the consistency of a dialogue model, and evaluate the method with human evaluation and with automatic metrics on a suite of evaluation sets designed to measure a dialogue model's consistency.

Budgeted Policy Learning for Task-Oriented Dialogue Systems link dialogue/conversation
Zhirui Zhang, Xiujun Li, Jianfeng Gao, Enhong Chen

This paper presents a new approach that extends Deep Dyna-Q (DDQ) by incorporating a Budget-Conscious Scheduling (BCS) to best utilize a fixed, small amount of user interactions (budget) for learning task-oriented dialogue agents. BCS consists of (1) a Poisson-based global scheduler to allocate budget over different stages of training; (2) a controller to decide at each training step whether the agent is trained using real or simulated experiences; (3) a user goal sampling module to generate the experiences that are most effective for policy learning. Experiments on a movie-ticket booking task with simulated and real users show that our approach leads to significant improvements in success rate over the state-of-the-art baselines given the fixed budget.
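
The Poisson-based global scheduler in point (1) can be illustrated by a small helper that spreads a fixed interaction budget over training stages according to a Poisson distribution; the mean parameter and the rounding scheme below are assumptions, not the paper's settings.

```python
import math

def poisson_budget_schedule(total_budget, num_stages, lam=2.0):
    """Allocate a fixed budget of real user interactions across training stages
    in proportion to a (truncated, renormalized) Poisson pmf."""
    pmf = [math.exp(-lam) * lam**k / math.factorial(k) for k in range(num_stages)]
    total = sum(pmf)
    alloc = [round(total_budget * p / total) for p in pmf]
    alloc[-1] += total_budget - sum(alloc)   # fix rounding drift
    return alloc

# e.g. poisson_budget_schedule(100, 5) concentrates most of the budget
# in the early-to-middle stages of training.
```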

An Interactive Multi-Task Learning Network for End-to-End Aspect-Based Sentiment Analysis link
Ruidan He, Wee Sun Lee, Hwee Tou Ng, Daniel Dahlmeier

Aspect-based sentiment analysis produces a list of aspect terms and their corresponding sentiments for a natural language sentence. This task is usually done in a pipeline manner, with aspect term extraction performed first, followed by sentiment predictions toward the extracted aspect terms. While easier to develop, such an approach does not fully exploit joint information from the two subtasks and does not use all available sources of training information that might be helpful, such as document-level labeled sentiment corpus. In this paper, we propose an interactive multi-task learning network (IMN) which is able to jointly learn multiple related tasks simultaneously at both the token level as well as the document level. Unlike conventional multi-task learning methods that rely on learning common features for the different tasks, IMN introduces a message passing architecture where information is iteratively passed to different tasks through a shared set of latent variables. Experimental results demonstrate superior performance of the proposed method against multiple baselines on three benchmark datasets.

Extracting Symptoms and their Status from Clinical Conversations link dialogue/conversation
Nan Du, Kai Chen, Anjuli Kannan, Linh Tran, Yuhui Chen, Izhak Shafran

This paper describes novel models tailored for a new application, that of extracting the symptoms mentioned in clinical conversations along with their status. Lack of any publicly available corpus in this privacy-sensitive domain led us to develop our own corpus, consisting of about 3K conversations annotated by professional medical scribes. We propose two novel deep learning approaches to infer the symptom names and their status: (1) a new hierarchical span-attribute tagging (SAT) model, trained using curriculum learning, and (2) a variant of a sequence-to-sequence model which decodes the symptoms and their status from a few speaker turns within a sliding window over the conversation. This task stems from a realistic application of assisting medical providers in capturing symptoms mentioned by patients from their clinical conversations. To reflect this application, we define multiple metrics. From inter-rater agreement, we find that the task is inherently difficult. We conduct comprehensive evaluations under several contrasting conditions and observe that the performance of the models ranges from an F-score of 0.5 to 0.8 depending on the condition. Our analysis not only reveals the inherent challenges of the task, but also provides useful directions to improve the models.

Comparison of Diverse Decoding Methods from Conditional Language Models link
Daphne Ippolito, Reno Kriz, Joao Sedoc, Maria Kustikova, Chris Callison-Burch

While conditional language models have greatly improved in their ability to output high-quality natural language, many NLP applications benefit from being able to generate a diverse set of candidate sequences. Diverse decoding strategies aim to, within a given-sized candidate list, cover as much of the space of high-quality outputs as possible, leading to improvements for tasks that re-rank and combine candidate outputs. Standard decoding methods, such as beam search, optimize for generating high likelihood sequences rather than diverse ones, though recent work has focused on increasing diversity in these methods. In this work, we perform an extensive survey of decoding-time strategies for generating diverse outputs from conditional language models. We also show how diversity can be improved without sacrificing quality by over-sampling additional candidates, then filtering to the desired number.
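
The over-sample-then-filter recipe mentioned in the last sentence is easy to sketch: draw more candidates than needed, then greedily keep a subset that stays lexically diverse. The greedy criterion and the Jaccard distance below are assumptions chosen for illustration.

```python
def filter_for_diversity(candidates, k, distance):
    """candidates: decoded strings, already sorted by model score (best first).
    Greedily pick k outputs, each time taking the candidate farthest from
    everything already selected."""
    selected = [candidates[0]]                 # always keep the top-scoring candidate
    remaining = list(candidates[1:])
    while remaining and len(selected) < k:
        best = max(remaining,
                   key=lambda c: min(distance(c, s) for s in selected))
        selected.append(best)
        remaining.remove(best)
    return selected

def token_jaccard_distance(a, b):
    """A simple lexical diversity measure over whitespace tokens."""
    ta, tb = set(a.split()), set(b.split())
    return 1.0 - len(ta & tb) / max(len(ta | tb), 1)

# Usage: sample, say, 50 candidates from the model, then
# outputs = filter_for_diversity(samples, k=10, distance=token_jaccard_distance)
```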

Diachronic Sense Modeling with Deep Contextualized Word Embeddings: An Ecological View RepL
Renfen Hu, Shen Li, Shichen Liang

Retrieval-Enhanced Adversarial Training for Neural Response Generation link NLG
Qingfu Zhu, Lei Cui, Wei-Nan Zhang, Furu Wei, Ting Liu

Dialogue systems are usually built on either generation-based or retrieval-based approaches, yet they do not benefit from the advantages of different models. In this paper, we propose a Retrieval-Enhanced Adversarial Training (REAT) method for neural response generation. Distinct from existing approaches, the REAT method leverages an encoder-decoder framework in terms of an adversarial training paradigm, while taking advantage of N-best response candidates from a retrieval-based system to construct the discriminator. An empirical study on a large-scale, publicly available benchmark dataset shows that the REAT method significantly outperforms the vanilla Seq2Seq model as well as the conventional adversarial training approach.

Argument Invention from First Principles
Yonatan Bilu, Ariel Gera, Daniel Hershcovich, Benjamin Sznajder, Dan Lahav, Guy Moshkowich, Anael Malet, Assaf Gavron, Noam Slonim

ERNIE: Enhanced Language Representation with Informative Entities link RepL
Zhengyan Zhang, Xu Han, Zhiyuan Liu, Xin Jiang, Maosong Sun, Qun Liu

Revisiting Joint Modeling of Cross-document Entity and Event Coreference Resolution link
Shany Barhom, Vered Shwartz, Alon Eirew, Michael Bugert, Nils Reimers, Ido Dagan

Recognizing coreferring events and entities across multiple texts is crucial for many NLP applications. Despite the task's importance, research has focused mostly on within-document entity coreference, with rather little attention to the other variants. We propose a neural architecture for cross-document coreference resolution. Inspired by Lee et al. (2012), we jointly model entity and event coreference. We represent an event (entity) mention using its lexical span, surrounding context, and relation to entity (event) mentions via predicate-argument structures. Our model outperforms the previous state-of-the-art event coreference model on ECB+, while providing the first entity coreference results on this corpus. Our analysis confirms that all our representation elements, including the mention span itself, its context, and the relation to other mentions, contribute to the model's success.

Cognitive Graph for Multi-Hop Reading Comprehension at Scale link
Ming Ding, Chang Zhou, Qibin Chen, Hongxia Yang, Jie Tang

We propose a new CogQA framework for multi-hop question answering in web-scale documents. Inspired by the dual process theory in cognitive science, the framework gradually builds a cognitive graph in an iterative process by coordinating an implicit extraction module (System 1) and an explicit reasoning module (System 2). While giving accurate answers, our framework further provides explainable reasoning paths. Specifically, our implementation based on BERT and graph neural networks efficiently handles millions of documents for multi-hop reasoning questions in the HotpotQA fullwiki dataset, achieving a winning joint F1 score of 34.9 on the leaderboard, compared to 23.6 of the best competitor.

Knowledge-aware Pronoun Coreference Resolution
Hongming Zhang, Yan Song, Yangqiu Song, Dong Yu

Unsupervised Question Answering by Cloze Translation link QA MT
Patrick Lewis, Ludovic Denoyer, Sebastian Riedel

Obtaining training data for Question Answering (QA) is time-consuming and resource-intensive, and existing QA datasets are only available for limited domains and languages. In this work, we explore to what extent high quality training data is actually required for Extractive QA, and investigate the possibility of unsupervised Extractive QA. We approach this problem by first learning to generate context, question and answer triples in an unsupervised manner, which we then use to synthesize Extractive QA training data automatically. To generate such triples, we first sample random context paragraphs from a large corpus of documents and then random noun phrases or named entity mentions from these paragraphs as answers. Next we convert answers in context to "fill-in-the-blank" cloze questions and finally translate them into natural questions. We propose and compare various unsupervised ways to perform cloze-to-natural question translation, including training an unsupervised NMT model using non-aligned corpora of natural questions and cloze questions as well as a rule-based approach. We find that modern QA models can learn to answer human questions surprisingly well using only synthetic training data. We demonstrate that, without using the SQuAD training data at all, our approach achieves 56.4 F1 on SQuAD v1 (64.5 F1 when the answer is a Named entity mention), outperforming early supervised models.
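
The first stage of the pipeline, turning raw paragraphs into cloze-style training triples, can be sketched as follows. spaCy's NER is used here purely as a stand-in for the answer-extraction step described above, and the cloze-to-natural-question translation step is omitted.

```python
import random
import spacy

nlp = spacy.load("en_core_web_sm")

def make_cloze_example(paragraph, mask_token="[MASK]"):
    """Pick a named-entity (or noun-phrase) span as the answer and blank it out."""
    doc = nlp(paragraph)
    candidates = list(doc.ents) or list(doc.noun_chunks)
    if not candidates:
        return None
    answer = random.choice(candidates)
    cloze = paragraph[:answer.start_char] + mask_token + paragraph[answer.end_char:]
    return {"question": cloze, "answer": answer.text, "context": paragraph}

# Example (illustrative): a sentence like "Marie Curie won the Nobel Prize in 1903."
# might yield {"question": "Marie Curie won the Nobel Prize in [MASK].",
#              "answer": "1903", ...}
```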

SP-10K: A Large-scale Evaluation Set for Selectional Preference Acquisition link
Hongming Zhang, Hantian Ding, Yangqiu Song

Selectional Preference (SP) is a commonly observed language phenomenon that has proved useful in many natural language processing tasks. To provide a better evaluation method for SP models, we introduce SP-10K, a large-scale evaluation set that provides human ratings for the plausibility of 10,000 SP pairs over five SP relations, covering the 2,500 most frequent verbs, nouns, and adjectives in American English. Three representative SP acquisition methods based on pseudo-disambiguation are evaluated with SP-10K. To demonstrate the importance of our dataset, we investigate the relationship between SP-10K and the commonsense knowledge in ConceptNet5 and show the potential of using SP to represent commonsense knowledge. We also use the Winograd Schema Challenge to show that the proposed new SP relations are essential for the hard pronoun coreference resolution problem.

Vocabulary Pyramid Network: Multi-Pass Encoding and Decoding with Multi-Level Vocabularies for Response Generation NLG
Cao Liu, Shizhu He, Kang Liu, Jun Zhao

Complex Question Decomposition for Semantic Parsing parsing
Haoyu Zhang, Jingjing Cai, Jianjun Xu, Ji Wang

A Wind of Change: Detecting and Evaluating Lexical Semantic Change across Times and Domains link
Dominik Schlechtweg, Anna Hätty, Marco Del Tredici, Sabine Schulte im Walde

We perform an interdisciplinary large-scale evaluation for detecting lexical semantic divergences in a diachronic and in a synchronic task: semantic sense changes across time, and semantic sense changes across domains. Our work addresses the superficialness and lack of comparison in assessing models of diachronic lexical change, by bringing together and extending benchmark models on a common state-of-the-art evaluation task. In addition, we demonstrate that the same evaluation task and modelling approaches can successfully be utilised for the synchronic detection of domain-specific sense divergences in the field of term extraction.

Miss Tools and Mr Fruit: Emergent communication in agents learning about object affordances link
Diane Bouchacourt, Marco Baroni

Recent research studies communication emergence in communities of deep network agents assigned a joint task, hoping to gain insights on human language evolution. We propose here a new task capturing crucial aspects of the human environment, such as natural object affordances, and of human conversation, such as full symmetry among the participants. By conducting a thorough pragmatic and semantic analysis of the emergent protocol, we show that the agents solve the shared task through genuine bilateral, referential communication. However, the agents develop multiple idiolects, which makes us conclude that full symmetry is not a sufficient condition for a common language to emerge.

Detecting Subevents using Discourse and Narrative Features
Mohammed Aldawsari, Mark Finlayson

Sentence Centrality Revisited for Unsupervised Summarization link summarization
Hao Zheng, Mirella Lapata

Single-document summarization has enjoyed renewed interest in recent years thanks to the popularity of neural network models and the availability of large-scale datasets. In this paper we develop an unsupervised approach, arguing that it is unrealistic to expect large-scale and high-quality training data to be available or created for different types of summaries, domains, or languages. We revisit a popular graph-based ranking algorithm and modify how node (aka sentence) centrality is computed in two ways: (a) we employ BERT, a state-of-the-art neural representation learning model, to better capture sentential meaning, and (b) we build graphs with directed edges, arguing that the contribution of any two nodes to their respective centrality is influenced by their relative position in a document. Experimental results on three news summarization datasets representative of different languages and writing styles show that our approach outperforms strong baselines by a wide margin.
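
Point (b) above, directed edges whose contribution depends on relative position, can be sketched as a small centrality computation over a sentence-similarity matrix; the specific forward/backward weights and the similarity function are assumptions for illustration.

```python
import numpy as np

def directed_centrality(similarity, forward_weight=1.0, backward_weight=0.0):
    """similarity: (n, n) matrix of pairwise sentence similarities
    (e.g. cosine similarity of sentence vectors from an encoder like BERT)."""
    n = similarity.shape[0]
    centrality = np.zeros(n)
    for i in range(n):
        for j in range(n):
            if j == i:
                continue
            # Weight edges differently depending on whether the other
            # sentence comes after (forward) or before (backward) sentence i.
            w = forward_weight if j > i else backward_weight
            centrality[i] += w * similarity[i, j]
    return centrality

# Summary extraction: keep the top-ranked sentences, e.g.
# summary_ids = np.argsort(-directed_centrality(sim_matrix))[:3]
```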

Transferable Multi-Domain State Generator for Task-Oriented Dialogue Systems link dialogue/conversation transfer
Chien-Sheng Wu, Andrea Madotto, Ehsan Hosseini-Asl, Caiming Xiong, Richard Socher, Pascale Fung

Over-dependence on domain ontology and lack of knowledge sharing across domains are two practical and yet less studied problems of dialogue state tracking. Existing approaches generally fall short in tracking unknown slot values during inference and often have difficulties in adapting to new domains. In this paper, we propose a Transferable Dialogue State Generator (TRADE) that generates dialogue states from utterances using a copy mechanism, facilitating knowledge transfer when predicting (domain, slot, value) triplets not encountered during training. Our model is composed of an utterance encoder, a slot gate, and a state generator, which are shared across domains. Empirical results demonstrate that TRADE achieves state-of-the-art joint goal accuracy of 48.62% for the five domains of MultiWOZ, a human-human dialogue dataset. In addition, we show its transferring ability by simulating zero-shot and few-shot dialogue state tracking for unseen domains. TRADE achieves 60.58% joint goal accuracy in one of the zero-shot domains, and is able to adapt to few-shot cases without forgetting already trained domains.

Errudite: Scalable, Reproducible, and Testable Error Analysis
Tongshuang Wu, Marco Tulio Ribeiro, Jeffrey Heer, Daniel Weld

Multi-Hop Paragraph Retrieval for Open-Domain Question Answering QA
Yair Feldman, Ran El-Yaniv

Towards Understanding Linear Word Analogies link
Kawin Ethayarajh, David Duvenaud, Graeme Hirst

Multilingual and Cross-Lingual Graded Lexical Entailment
Ivan Vulić, Simone Paolo Ponzetto, Goran Glavaš

When a Good Translation is Wrong in Context: Context-Aware Machine Translation Improves on Deixis, Ellipsis, and Lexical Cohesion link MT
Elena Voita, Rico Sennrich, Ivan Titov

Though machine translation errors caused by the lack of context beyond one sentence have long been acknowledged, the development of context-aware NMT systems is hampered by several problems. Firstly, standard metrics are not sensitive to improvements in consistency in document-level translations. Secondly, previous work on context-aware NMT assumed that the sentence-aligned parallel data consisted of complete documents while in most practical scenarios such document-level data constitutes only a fraction of the available parallel data. To address the first issue, we perform a human study on an English-Russian subtitles dataset and identify deixis, ellipsis and lexical cohesion as three main sources of inconsistency. We then create test sets targeting these phenomena. To address the second shortcoming, we consider a set-up in which a much larger amount of sentence-level data is available compared to that aligned at the document level. We introduce a model that is suitable for this scenario and demonstrate major gains over a context-agnostic baseline on our new benchmarks without sacrificing performance as measured with BLEU.

The Language of Legal and Illegal Activity on the Darknet link
Leshem Choshen, Dan Eldad, Daniel Hershcovich, Elior Sulem, Omri Abend

The non-indexed parts of the Internet (the Darknet) have become a haven for both legal and illegal anonymous activity. Given the magnitude of these networks, scalably monitoring their activity necessarily relies on automated tools, and notably on NLP tools. However, little is known about what characteristics texts communicated through the Darknet have, and how well off-the-shelf NLP tools do on this domain. This paper tackles this gap and performs an in-depth investigation of the characteristics of legal and illegal text in the Darknet, comparing it to a clear net website with similar content as a control condition. Taking drug-related websites as a test case, we find that texts for selling legal and illegal drugs have several linguistic characteristics that distinguish them from one another, as well as from the control condition, among them the distribution of POS tags, and the coverage of their named entities in Wikipedia.

What Kind of Language Is Hard to Language-Model? link
Sebastian J. Mielke, Ryan Cotterell, Kyle Gorman, Brian Roark, Jason Eisner

How language-agnostic are current state-of-the-art NLP tools? Are there some types of language that are easier to model with current methods? In prior work (Cotterell et al., 2018) we attempted to address this question for language modeling, and observed that recurrent neural network language models do not perform equally well over all the high-resource European languages found in the Europarl corpus. We speculated that inflectional morphology may be the primary culprit for the discrepancy. In this paper, we extend these earlier experiments to cover 69 languages from 13 language families using a multilingual Bible corpus. Methodologically, we introduce a new paired-sample multiplicative mixed-effects model to obtain language difficulty coefficients from at-least-pairwise parallel corpora. In other words, the model is aware of inter-sentence variation and can handle missing data. Exploiting this model, we show that "translationese" is not any easier to model than natively written language in a fair comparison. Trying to answer the question of what features difficult languages have in common, we try and fail to reproduce our earlier (Cotterell et al., 2018) observation about morphological complexity and instead reveal far simpler statistics of the data that seem to drive complexity in a much larger sample.

E3: Entailment-driven Extracting and Editing for Conversational Machine Reading link dialogue/conversation
Victor Zhong, Luke Zettlemoyer

Conversational machine reading systems help users answer high-level questions (e.g. determine if they qualify for particular government benefits) when they do not know the exact rules by which the determination is made (e.g. whether they need certain income levels or veteran status). The key challenge is that these rules are only provided in the form of a procedural text (e.g. guidelines from a government website) which the system must read to figure out what to ask the user. We present a new conversational machine reading model that jointly extracts a set of decision rules from the procedural text while reasoning about which are entailed by the conversational history and which still need to be edited to create questions for the user. On the recently introduced ShARC conversational machine reading dataset, our Entailment-driven Extract and Edit network (E3) achieves a new state-of-the-art, outperforming existing systems as well as a new BERT-based baseline. In addition, by explicitly highlighting which information still needs to be gathered, E3 provides a more explainable alternative to prior work. We release source code for our models and experiments at https://github.com/vzhong/e3.

Analysis of Automatic Annotation Suggestions for Hard Discourse-Level Tasks in Expert Domains link
Claudia Schulz, Christian M. Meyer, Jan Kiesewetter, Michael Sailer, Elisabeth Bauer, Martin R. Fischer, Frank Fischer, Iryna Gurevych

Many complex discourse-level tasks can aid domain experts in their work but require costly expert annotations for data creation. To speed up and ease annotations, we investigate the viability of automatically generated annotation suggestions for such tasks. As an example, we choose a task that is particularly hard for both humans and machines: the segmentation and classification of epistemic activities in diagnostic reasoning texts. We create and publish a new dataset covering two domains and carefully analyse the suggested annotations. We find that suggestions have positive effects on annotation speed and performance, while not introducing noteworthy biases. Envisioning suggestion models that improve with newly annotated texts, we contrast methods for continuous model adjustment and suggest the most effective setup for suggestions in future expert tasks.

What Makes a Good Counselor? Learning to Distinguish between High-quality and Low-quality Counseling Conversations dialogue/conversation
Verónica Pérez-Rosas, Xinyi Wu, Kenneth Resnicow, Rada Mihalcea

A Multilingual BPE Embedding Space for Universal Sentiment Lexicon Induction
Mengjie Zhao, Hinrich Schütze

LSTMEmbed: Learning Word and Sense Representations from a Large Semantically Annotated Corpus with Long Short-Term Memories RepL
Ignacio Iacobacci, Roberto Navigli

Word-order biases in deep-agent emergent communication
Rahma Chaabouni, Evgeny Kharitonov, Alessandro Lazaric, Emmanuel Dupoux, Marco Baroni

Generating Question-Answer Hierarchies QA
Kalpesh Krishna, Mohit Iyyer

A Compact and Language-Sensitive Multilingual Translation Method MT
Yining Wang, Long Zhou, Jiajun Zhang, Feifei Zhai, Jingfang Xu, Chengqing Zong

Answering while Summarizing: Multi-task Learning for Multi-hop QA with Evidence Extraction link IE
Kosuke Nishida, Kyosuke Nishida, Masaaki Nagata, Itsumi Saito, Atushi Otuka, Hisako Asano, Junji Tomita

Question answering (QA) using textual sources for purposes such as reading comprehension (RC) has attracted much attention. This study focuses on the task of explainable multi-hop QA, which requires the system to return the answer with evidence sentences by reasoning and gathering disjoint pieces of the reference texts. It proposes the Query Focused Extractor (QFE) model for evidence extraction and uses multi-task learning with the QA model. QFE is inspired by extractive summarization models; compared with the existing method, which extracts each evidence sentence independently, it sequentially extracts evidence sentences by using an RNN with an attention mechanism on the question sentence. It enables QFE to consider the dependency among the evidence sentences and cover important information in the question sentence. Experimental results show that QFE with a simple RC baseline model achieves a state-of-the-art evidence extraction score on HotpotQA. Although designed for RC, it also achieves a state-of-the-art evidence extraction score on FEVER, which is a recognizing textual entailment task on a large textual database.

Discourse Representation Parsing for Sentences and Documents parsing
Jiangming Liu, Shay B. Cohen, Mirella Lapata

Enhance Topic-to-Essay Generation with External Commonsense Knowledge NLG
Pengcheng Yang, Lei Li, Fuli Luo, Tianyu Liu, Xu SUN

Entity-Relation Extraction as Multi-Turn Question Answering QA IE
Xiaoya Li, Fan Yin, Zijun Sun, Xiayu Li, Arianna Yuan, Duo Chai, Mingxin Zhou, Jiwei Li

Robust Neural Machine Translation with Doubly Adversarial Inputs link MT
Yong Cheng, Lu Jiang, Wolfgang Macherey

Neural machine translation (NMT) often suffers from the vulnerability to noisy perturbations in the input. We propose an approach to improving the robustness of NMT models, which consists of two parts: (1) attack the translation model with adversarial source examples; (2) defend the translation model with adversarial target inputs to improve its robustness against the adversarial source inputs. For the generation of adversarial inputs, we propose a gradient-based method to craft adversarial examples informed by the translation loss over the clean inputs. Experimental results on Chinese-English and English-German translation tasks demonstrate that our approach achieves significant improvements ($2.8$ and $1.6$ BLEU points) over Transformer on standard clean benchmarks as well as exhibiting higher robustness on noisy data.
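
A minimal sketch of a gradient-guided substitution of the kind described above (the `model(inputs_embeds=...)` signature and the candidate scoring are assumptions for illustration, not the paper's implementation):

```python
import torch

def adversarial_substitute(model, loss_fn, src_ids, tgt_ids, embedding_matrix, position):
    emb = embedding_matrix[src_ids].clone().detach().requires_grad_(True)
    loss = loss_fn(model(inputs_embeds=emb), tgt_ids)  # translation loss on the clean target
    loss.backward()
    grad = emb.grad[position]                          # gradient w.r.t. one source embedding
    # score each vocabulary item by how much moving toward it increases the loss
    diffs = embedding_matrix - embedding_matrix[src_ids[position]]
    scores = diffs @ grad
    return int(torch.argmax(scores))                   # id of the adversarial replacement word
```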

Don't Take the Premise for Granted: Mitigating Artifacts in Natural Language Inference NLI
Yonatan Belinkov, Adam Poliak, Stuart Shieber, Benjamin Van Durme, Alexander Rush

Improving Neural Language Models by Segmenting, Attending, and Predicting the Future link
Hongyin Luo, Lan Jiang, Yonatan Belinkov, James Glass

Common language models typically predict the next word given the context. In this work, we propose a method that improves language modeling by learning to align the given context and the following phrase. The model does not require any linguistic annotation of phrase segmentation. Instead, we define syntactic heights and phrase segmentation rules, enabling the model to automatically induce phrases, recognize their task-specific heads, and generate phrase embeddings in an unsupervised manner. Our method can easily be applied to language models with different network architectures since an independent module is used for phrase induction and context-phrase alignment, and no change is required in the underlying language modeling network. Experiments show that our model outperforms several strong baseline models on different data sets. We achieve a new state-of-the-art performance of 17.4 perplexity on the Wikitext-103 dataset. Additionally, visualizing the outputs of the phrase induction module shows that our model is able to learn approximate phrase-level structural knowledge without any annotation.

A Corpus for Reasoning About Natural Language Grounded in Photographs link corpus
Alane Suhr, Stephanie Zhou, Ally Zhang, Iris Zhang, Huajun Bai, Yoav Artzi

We introduce a new dataset for joint reasoning about natural language and images, with a focus on semantic diversity, compositionality, and visual reasoning challenges. The data contains 107,292 examples of English sentences paired with web photographs. The task is to determine whether a natural language caption is true about a pair of photographs. We crowdsource the data using sets of visually rich images and a compare-and-contrast task to elicit linguistically diverse language. Qualitative analysis shows the data requires compositional joint reasoning, including about quantities, comparisons, and relations. Evaluation using state-of-the-art visual reasoning methods shows the data presents a strong challenge.

Distantly Supervised Named Entity Recognition using Positive-Unlabeled Learning link NER
Minlong Peng, Xiaoyu Xing, Qi Zhang, Jinlan Fu, Xuanjing Huang

In this work, we explore the way to perform named entity recognition (NER) using only unlabeled data and named entity dictionaries. To this end, we formulate the task as a positive-unlabeled (PU) learning problem and accordingly propose a novel PU learning algorithm to perform the task. We prove that the proposed algorithm can unbiasedly and consistently estimate the task loss as if there were fully labeled data. A key feature of the proposed method is that it does not require the dictionaries to label every entity within a sentence, and it does not even require the dictionaries to label all of the words constituting an entity. This greatly reduces the requirement on the quality of the dictionaries and makes our method generalize well even with quite simple dictionaries. Empirical studies on four public NER datasets demonstrate the effectiveness of our proposed method. We have published the source code at https://github.com/v-mipeng/LexiconNER.
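
For intuition, here is a generic non-negative positive-unlabeled risk estimator in the spirit of the formulation above (a sketch, not necessarily the exact loss used in the paper); `prior` is the assumed fraction of true entity tokens among the unlabeled ones:

```python
import torch
import torch.nn.functional as F

def pu_risk(scores_pos, scores_unl, prior):
    """scores_*: logits for dictionary-matched (positive) and unlabeled tokens."""
    risk_pos = F.binary_cross_entropy_with_logits(scores_pos, torch.ones_like(scores_pos))
    pos_as_neg = F.binary_cross_entropy_with_logits(scores_pos, torch.zeros_like(scores_pos))
    unl_as_neg = F.binary_cross_entropy_with_logits(scores_unl, torch.zeros_like(scores_unl))
    risk_neg = torch.clamp(unl_as_neg - prior * pos_as_neg, min=0.0)  # non-negative correction
    return prior * risk_pos + risk_neg
```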

Dual Adversarial Neural Transfer for Low-Resource Named Entity Recognition transfer NER
Joey Tianyi Zhou, Hao Zhang, Di Jin, Hongyuan Zhu, Meng Fang, Rick Siow Mong Goh, Kenneth Kwok

Exploiting Entity BIO Tag Embeddings and Multi-task Learning for Relation Extraction with Imbalanced Data link IE
Wei Ye, Bo Li, Rui Xie, Zhonghao Sheng, Long Chen, Shikun Zhang

In practical scenarios, relation extraction needs to first identify entity pairs that have a relation and then assign a correct relation class. However, the number of non-relation entity pairs in context (negative instances) usually far exceeds the number of the others (positive instances), which negatively affects a model's performance. To mitigate this problem, we propose a multi-task architecture which jointly trains a model to perform relation identification with cross-entropy loss and relation classification with ranking loss. Meanwhile, we observe that a sentence may have multiple entities and relation mentions, and the patterns in which the entities appear in a sentence may contain useful semantic information that can be utilized to distinguish between positive and negative instances. Thus we further incorporate the embeddings of character-wise/word-wise BIO tags from the named entity recognition task into character/word embeddings to enrich the input representation. Experiment results show that our proposed approach can significantly improve the performance of a baseline model with more than 10% absolute increase in F1-score, and outperform the state-of-the-art models on the ACE 2005 Chinese and English corpora. Moreover, BIO tag embeddings are particularly effective and can be used to improve other models as well.

Topic-Aware Neural Keyphrase Generation for Social Media Language link NLG
Yue Wang, Jing Li, Hou Pong Chan, Irwin King, Michael R. Lyu, Shuming Shi

A huge volume of user-generated content is produced daily on social media. To facilitate automatic language understanding, we study keyphrase prediction, distilling salient information from massive posts. While most existing methods extract words from source posts to form keyphrases, we propose a sequence-to-sequence (seq2seq) based neural keyphrase generation framework, enabling absent keyphrases to be created. Moreover, our model, being topic-aware, allows joint modeling of corpus-level latent topic representations, which helps alleviate the data sparsity widely exhibited in social media language. Experiments on three datasets collected from English and Chinese social media platforms show that our model significantly outperforms both extraction and generation models that do not exploit latent topics. Further discussions show that our model learns meaningful topics, which helps explain its superiority in social media keyphrase generation.

Multi-hop reading comprehension across multiple documents by reasoning over heterogeneous graphs
Ming Tu, Guangtao Wang, Jing Huang, Yun Tang, Xiaodong He, Bowen Zhou

Distilling Discrimination and Generalization Knowledge for Event Detection via Delta-Representation Learning RepL
Yaojie Lu, Hongyu Lin, Xianpei Han, Le Sun

Neural Relation Extraction for Knowledge Base Enrichment IE
Bayu Distiawan Trisedya, Gerhard Weikum, Jianzhong Qi, Rui Zhang

Visually Grounded Neural Syntax Acquisition link
Haoyue Shi, Jiayuan Mao, Kevin Gimpel, Karen Livescu

We present the Visually Grounded Neural Syntax Learner (VG-NSL), an approach for learning syntactic representations and structures without any explicit supervision. The model learns by looking at natural images and reading paired captions. VG-NSL generates constituency parse trees of texts, recursively composes representations for constituents, and matches them with images. We define concreteness of constituents by their matching scores with images, and use it to guide the parsing of text. Experiments on the MSCOCO data set show that VG-NSL outperforms various unsupervised parsing approaches that do not use visual grounding, in terms of F1 scores against gold parse trees. We find that VG-NSL is much more stable with respect to the choice of random initialization and the amount of training data. We also find that the concreteness acquired by VG-NSL correlates well with a similar measure defined by linguists. Finally, we also apply VG-NSL to multiple languages in the Multi30K data set, showing that our model consistently outperforms prior unsupervised approaches.

Episodic Memory Reader: Learning What to Remember for Question Answering from Streaming Data QA
Moonsu Han, Minki Kang, Hyunwoo Jung, Sung Ju Hwang

On the Robustness of Self-Attentive Models
Yu-Lun Hsieh, Minhao Cheng, Da-Cheng Juan, wei wei, Wen-Lian Hsu, Cho-Jui Hsieh

Multimodal Transformer for Unaligned Multimodal Language Sequences link
Yao-Hung Hubert Tsai, Shaojie Bai, Paul Pu Liang, J. Zico Kolter, Louis-Philippe Morency, Ruslan Salakhutdinov

Human language is often multimodal, comprising a mixture of natural language, facial gestures, and acoustic behaviors. However, two major challenges in modeling such multimodal human language time-series data exist: 1) inherent data non-alignment due to variable sampling rates for the sequences from each modality; and 2) long-range dependencies between elements across modalities. In this paper, we introduce the Multimodal Transformer (MulT) to generically address the above issues in an end-to-end manner without explicitly aligning the data. At the heart of our model is the directional pairwise crossmodal attention, which attends to interactions between multimodal sequences across distinct time steps and latently adapts streams from one modality to another. Comprehensive experiments on both aligned and non-aligned multimodal time-series show that our model outperforms state-of-the-art methods by a large margin. In addition, empirical analysis suggests that correlated crossmodal signals are able to be captured by the proposed crossmodal attention mechanism in MulT.

Joint Type Inference on Entities and Relations via Graph Convolutional Networks
Changzhi Sun, Yeyun Gong, Nan Duan, Ming Gong, Daxin Jiang, Shiliang Sun, Man Lan, Yuanbin Wu

Decompositional Argument Mining: A General Purpose Approach for Argument Graph Construction
Debela Gemechu, Chris Reed

DIAG-NRE: A Neural Pattern Diagnosis Framework for Distantly Supervised Neural Relation Extraction link IE
Shun Zheng, Xu Han, Yankai Lin, Peilin Yu, Lu Chen, Ling Huang, Zhiyuan Liu, Wei Xu

Pattern-based labeling methods have achieved promising results in alleviating the inevitable labeling noise of distantly supervised neural relation extraction. However, these methods require significant expert labor to write relation-specific patterns, which makes them too sophisticated to generalize quickly. To ease the labor-intensive workload of pattern writing and enable quick generalization to new relation types, we propose a neural pattern diagnosis framework, DIAG-NRE, that can automatically summarize and refine high-quality relational patterns from noisy data with human experts in the loop. To demonstrate the effectiveness of DIAG-NRE, we apply it to two real-world datasets and present both significant and interpretable improvements over state-of-the-art methods.

Enhancing Pre-Trained Language Representations with Rich Knowledge for Machine Reading Comprehension RepL
An Yang, Quan Wang, Jing Liu, KAI LIU, Yajuan Lyu, Hua Wu, Qiaoqiao She, Sujian Li

Learning Transferable Feature Representations Using Neural Networks transfer
Himanshu Sharad Bhatt, Shourya Roy, Arun Rajkumar, Sriranjani Ramakrishnan

Understanding Undesirable Word Embedding Associations RepL
Kawin Ethayarajh, David Duvenaud, Graeme Hirst

Exploring Sequence-to-Sequence Learning in Aspect Term Extraction IE
Dehong Ma, Sujian Li, Fangzhao Wu, Xing Xie, Houfeng WANG

Towards Fine-grained Text Sentiment Transfer transfer
Fuli Luo, Peng Li, Pengcheng Yang, Jie Zhou, Yutong Tan, Baobao Chang, Zhifang Sui, Xu SUN

MELD: A Multimodal Multi-Party Dataset for Emotion Recognition in Conversations link dialogue/conversation
Soujanya Poria, Devamanyu Hazarika, Navonil Majumder, Rada Mihalcea, Gautam Naik, Erik Cambria

Emotion recognition in conversations is a challenging task that has recently gained popularity due to its potential applications. Until now, however, a large-scale multimodal multi-party emotional conversational database containing more than two speakers per dialogue was missing. Thus, we propose the Multimodal EmotionLines Dataset (MELD), an extension and enhancement of EmotionLines. MELD contains about 13,000 utterances from 1,433 dialogues from the TV-series Friends. Each utterance is annotated with emotion and sentiment labels, and encompasses audio, visual and textual modalities. We propose several strong multimodal baselines and show the importance of contextual and multimodal information for emotion recognition in conversations. The full dataset is available for use at http://affective-meld.github.io.

Inducing Document Structure for Aspect-based Summarization summarization
Lea Frermann, Alexandre Klementiev

GEAR: Graph-based Evidence Aggregating and Reasoning for Fact Verification
Jie Zhou, Xu Han, Cheng Yang, Zhiyuan Liu, Lifeng Wang, Changcheng Li, Maosong Sun

Shared-Private Bilingual Word Embeddings for Neural Machine Translation MT RepL
Xuebo Liu, Derek F. Wong, Yang Liu, Lidia S. Chao

Relation Embedding with Dihedral Group in Knowledge Graph link
Canran Xu, Ruijiang Li

Link prediction is critical for the application of incomplete knowledge graphs (KGs) in downstream tasks. As a family of effective approaches for link prediction, embedding methods try to learn low-rank representations for both entities and relations such that the bilinear form defined therein is a well-behaved scoring function. Despite their successful performance, existing bilinear forms overlook the modeling of relation compositions, resulting in a lack of interpretability for reasoning on KGs. To fill this gap, we propose a new model called DihEdral, named after the dihedral symmetry group. This new model learns knowledge graph embeddings that can capture relation compositions by nature. Furthermore, our approach models the relation embeddings parametrized by discrete values, thereby decreasing the solution space drastically. Our experiments show that DihEdral is able to capture all desired properties such as (skew-) symmetry, inversion and (non-) Abelian composition, outperforms existing bilinear-form-based approaches, and is comparable to or better than deep learning models such as ConvE.
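
A minimal sketch of the dihedral parametrization described above (dimension handling and scoring are simplified for illustration): each relation is a block-diagonal stack of 2x2 rotation or reflection matrices, and the score is the resulting bilinear form:

```python
import numpy as np

def dihedral_block(m, K, reflection=False):
    """One 2x2 element of the dihedral group D_K: rotation by 2*pi*m/K, optionally reflected."""
    theta = 2 * np.pi * m / K
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[c, s], [s, -c]]) if reflection else np.array([[c, -s], [s, c]])

def score(head, tail, relation_blocks):
    """head, tail: entity vectors of even dimension; relation_blocks: list of 2x2 blocks."""
    total = 0.0
    for i, R in enumerate(relation_blocks):
        h, t = head[2 * i:2 * i + 2], tail[2 * i:2 * i + 2]
        total += h @ R @ t  # bilinear form evaluated block by block
    return total
```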

Aspect Sentiment Classification Towards Question-Answering with Reinforced Bidirectional Attention Network QA
Jingjing Wang, Changlong Sun, Shoushan Li, Xiaozhong Liu, Luo Si, Min Zhang, Guodong Zhou

On-device Structured and Context Partitioned Projection Networks
Sujith Ravi, Zornitsa Kozareva

Multi-grained Named Entity Recognition NER
Congying Xia, Chenwei Zhang, Tao Yang, Yaliang Li, Nan Du, Xian Wu, Wei Fan, Fenglong Ma, Philip Yu

What You Say and How You Say It Matters: Predicting Stock Volatility Using Verbal and Vocal Cues
Yu Qin, Yi Yang

Unsupervised Parallel Sentence Extraction with Parallel Segment Detection Helps Machine Translation MT IE
Viktor Hangya, Alexander Fraser

Empirical Linguistic Study of Sentence Embeddings
Katarzyna Krasnowska-Kieraś, Alina Wróblewska

Multilingual Factor Analysis link
Francisco Vargas, Kamen Brestnichki, Alex Papadopoulos Korfiatis, Nils Hammerla

In this work we approach the task of learning multilingual word representations in an offline manner by fitting a generative latent variable model to a multilingual dictionary. We model equivalent words in different languages as different views of the same word generated by a common latent variable representing their latent lexical meaning. We explore the task of alignment by querying the fitted model for multilingual embeddings achieving competitive results across a variety of tasks. The proposed model is robust to noise in the embedding space making it a suitable method for distributed representations learned from noisy corpora.

Is Word Segmentation Necessary for Deep Learning of Chinese Representations? link RepL
Xiaoya Li, Yuxian Meng, Xiaofei Sun, Qinghong Han, Arianna Yuan, Jiwei Li

Segmenting a chunk of text into words is usually the first step of processing Chinese text, but its necessity has rarely been explored. In this paper, we ask the fundamental question of whether Chinese word segmentation (CWS) is necessary for deep learning-based Chinese Natural Language Processing. We benchmark neural word-based models which rely on word segmentation against neural char-based models which do not involve word segmentation in four end-to-end NLP benchmark tasks: language modeling, machine translation, sentence matching/paraphrase and text classification. Through direct comparisons between these two types of models, we find that char-based models consistently outperform word-based models. Based on these observations, we conduct comprehensive experiments to study why word-based models underperform char-based models in these deep learning-based NLP tasks. We show that it is because word-based models are more vulnerable to data sparsity and the presence of out-of-vocabulary (OOV) words, and thus more prone to overfitting. We hope this paper could encourage researchers in the community to rethink the necessity of word segmentation in deep learning-based Chinese Natural Language Processing. (Yuxian Meng and Xiaoya Li contributed equally to this paper.)

Interconnected Question Generation with Coreference Alignment and Conversation Flow Modeling link dialogue/conversation NLG
Yifan Gao, Piji Li, Irwin King, Michael R. Lyu

We study the problem of generating interconnected questions in question-answering style conversations. Compared with previous works which generate questions based on a single sentence (or paragraph), this setting is different in two major aspects: (1) Questions are highly conversational. Almost half of them refer back to conversation history using coreferences. (2) In a coherent conversation, questions have smooth transitions between turns. We propose an end-to-end neural model with coreference alignment and conversation flow modeling. The coreference alignment modeling explicitly aligns coreferent mentions in conversation history with corresponding pronominal references in generated questions, which makes generated questions interconnected to conversation history. The conversation flow modeling builds a coherent conversation by starting questioning on the first few sentences in a text passage and smoothly shifting the focus to later parts. Extensive experiments show that our system outperforms several baselines and can generate highly conversational questions. The code implementation is released at https://github.com/Evan-Gao/conversational-QG.

A Joint Named-Entity Recognizer for Heterogeneous Tag-sets Using a Tag Hierarchy link
Genady Beryozkin, Yoel Drori, Oren Gilon, Idan Szpektor, Tzvika Hartman

We study a variant of domain adaptation for named-entity recognition where multiple, heterogeneously tagged training sets are available. Furthermore, the test tag-set is not identical to any individual training tag-set. Yet, the relations between all tags are provided in a tag hierarchy, covering the test tags as a combination of training tags. This setting occurs when various datasets are created using different annotation schemes. This is also the case of extending a tag-set with a new tag by annotating only the new tag in a new dataset. We propose to use the given tag hierarchy to jointly learn a neural network that shares its tagging layer among all tag-sets. We compare this model to combining independent models and to a model based on the multitasking approach. Our experiments show the benefit of the tag-hierarchy model, especially when facing non-trivial consolidation of tag-sets.

Data-to-text Generation with Entity Modeling link NLG
Ratish Puduppully, Li Dong, Mirella Lapata

DocRED: A Large-Scale Document-Level Relation Extraction Dataset IE
Yuan Yao, Deming Ye, Peng Li, Xu Han, Yankai Lin, Zhenghao Liu, Zhiyuan Liu, Lixin Huang, Jie Zhou, Maosong Sun

Text Categorization by Learning Predominant Sense of Words as Auxiliary Task
Kazuya Shimura, Jiyi Li, Fumiyo Fukumoto

Attention Guided Graph Convolutional Networks for Relation Extraction link IE
Zhijiang Guo, Yan Zhang, Wei Lu

Unsupervised Bilingual Word Embedding Agreement for Unsupervised Neural Machine Translation MT RepL
Haipeng Sun, Rui Wang, Kehai Chen, Masao Utiyama, Eiichiro Sumita, Tiejun Zhao

XQA: A Cross-lingual Open-domain Question Answering Dataset QA
Jiahua Liu, Yankai Lin, Zhiyuan Liu, Maosong Sun

MultiQA: An Empirical Investigation of Generalization and Transfer in Reading Comprehension link transfer
Alon Talmor, Jonathan Berant

A large number of reading comprehension (RC) datasets have been created recently, but little analysis has been done on whether they generalize to one another, and the extent to which existing datasets can be leveraged for improving performance on new ones. In this paper, we conduct such an investigation over ten RC datasets, training on one or more source RC datasets, and evaluating generalization, as well as transfer to a target RC dataset. We analyze the factors that contribute to generalization, and show that training on a source RC dataset and transferring to a target dataset substantially improves performance, even in the presence of powerful contextual representations from BERT (Devlin et al., 2019). We also find that training on multiple source RC datasets leads to robust generalization and transfer, and can reduce the cost of example collection for a new RC dataset. Following our analysis, we propose MultiQA, a BERT-based model, trained on multiple RC datasets, which leads to state-of-the-art performance on five RC datasets. We share our infrastructure for the benefit of the research community.

Proactive Human-Machine Conversation with Explicit Conversation Goal dialogue/conversation
Wenquan Wu, Zhen Guo, Xiangyang Zhou, Hua Wu, Xiyuan Zhang, Rongzhong Lian, Haifeng Wang

Neural Machine Translation with Reordering Embeddings MT
Kehai Chen, Rui Wang, Masao Utiyama, Eiichiro Sumita

Cross-Lingual Training for Automatic Question Generation link NLG
vishwajeet kumar, Nitish Joshi, Arijit Mukherjee, Ganesh Ramakrishnan, Preethi Jyothi

Open-Domain Targeted Sentiment Analysis via Span-Based Extraction and Classification IE
Minghao Hu, Yuxing Peng, Zhen Huang, Dongsheng Li, Yiwei Lv

BiSET: Bi-directional Selective Encoding with Template for Abstractive Summarization link summarization
Kai Wang, Xiaojun Quan, Rui Wang

The success of neural summarization models stems from the meticulous encodings of source articles. To overcome the impediments of limited and sometimes noisy training data, one promising direction is to make better use of the available training data by applying filters during summarization. In this paper, we propose a novel Bi-directional Selective Encoding with Template (BiSET) model, which leverages templates discovered from training data to softly select key information from each source article to guide its summarization process. Extensive experiments on a standard summarization dataset were conducted and the results show that the template-equipped BiSET model manages to improve the summarization performance significantly with a new state of the art.

Multi-Task Semantic Dependency Parsing with Policy Gradient for Learning Easy-First Strategies link parsing
Shuhei Kurita, Anders Søgaard

In Semantic Dependency Parsing (SDP), semantic relations form directed acyclic graphs, rather than trees. We propose a new iterative predicate selection (IPS) algorithm for SDP. Our IPS algorithm combines the graph-based and transition-based parsing approaches in order to handle multiple semantic head words. We train the IPS model using a combination of multi-task learning and task-specific policy gradient training. Trained this way, IPS achieves a new state of the art on the SemEval 2015 Task 18 datasets. Furthermore, we observe that policy gradient training learns an easy-first strategy.

GCDT: A Global Context Enhanced Deep Transition Architecture for Sequence Labeling link
Yijin Liu, Fandong Meng, Jinchao Zhang, Jinan Xu, Yufeng Chen, Jie Zhou

Current state-of-the-art systems for sequence labeling are typically based on the family of Recurrent Neural Networks (RNNs). However, the shallow connections between consecutive hidden states of RNNs and insufficient modeling of global information restrict the potential performance of those models. In this paper, we try to address these issues, and thus propose a Global Context enhanced Deep Transition architecture for sequence labeling named GCDT. We deepen the state transition path at each position in a sentence, and further assign every token with a global representation learned from the entire sentence. Experiments on two standard sequence labeling tasks show that, given only training data and the ubiquitous word embeddings (GloVe), our GCDT achieves 91.96 F1 on the CoNLL03 NER task and 95.43 F1 on the CoNLL2000 Chunking task, which outperforms the best reported results under the same settings. Furthermore, by leveraging BERT as an additional resource, we establish new state-of-the-art results with 93.47 F1 on NER and 97.30 F1 on Chunking.

Learning a Matching Model with Co-teaching for Multi-turn Response Selection in Retrieval-based Dialogue Systems link dialogue/conversation
Jiazhan Feng, Chongyang Tao, wei wu, Yansong Feng, Dongyan Zhao, Rui Yan

We study learning of a matching model for response selection in retrieval-based dialogue systems. The problem is as important as designing the architecture of a model, but is less explored in the existing literature. To learn a robust matching model from noisy training data, we propose a general co-teaching framework with three specific teaching strategies that cover both teaching with loss functions and teaching with data curriculum. Under the framework, we simultaneously learn two matching models with independent training sets. In each iteration, one model transfers the knowledge learned from its training set to the other model, and at the same time receives guidance from the other model on how to overcome noise in training. Through being both a teacher and a student, the two models learn from each other and get improved together. Evaluation results on two public data sets indicate that the proposed learning approach can generally and significantly improve the performance of existing matching models.

Hierarchical Transformers for Multi-Document Summarization summarization
Yang Liu, Mirella Lapata

A Hierarchical Reinforced Sequence Operation Method for Unsupervised Text Style Transfer link transfer
Chen Wu, Xuancheng Ren, Fuli Luo, Xu SUN

Unsupervised text style transfer aims to alter text styles while preserving the content, without aligned data for supervision. Existing seq2seq methods face three challenges: 1) the transfer is weakly interpretable, 2) generated outputs struggle in content preservation, and 3) the trade-off between content and style is intractable. To address these challenges, we propose a hierarchical reinforced sequence operation method, named Point-Then-Operate (PTO), which consists of a high-level agent that proposes operation positions and a low-level agent that alters the sentence. We provide comprehensive training objectives to control the fluency, style, and content of the outputs and a mask-based inference algorithm that allows for multi-step revision based on the single-step trained agents. Experimental results on two text style transfer datasets show that our method significantly outperforms recent methods and effectively addresses the aforementioned challenges.

GraphRel: Modeling Text as Relational Graphs for Joint Entity and Relation Extraction link IE
Tsu-Jui Fu, Peng-Hsuan Li, Wei-Yun Ma

Learning to Abstract for Memory-augmented Conversational Response Generation dialogue/conversation NLG
Zhiliang Tian, Wei Bi, Xiaopeng Li, Nevin L. Zhang

FIESTA: Fast IdEntification of State-of-The-Art models using adaptive bandit algorithms
Henry Moss, Andrew Moore, David Leslie, Paul Rayson

Transfer Capsule Network for Aspect Level Sentiment Classification transfer
Zhuang Chen, Tieyun Qian

Are Training Samples Correlated? Learning to Generate Dialogue Responses with Multiple References dialogue/conversation
Lisong Qiu, Juntao Li, Wei Bi, Dongyan Zhao, Rui Yan

Asking the Crowd: Question Analysis, Evaluation and Generation for Open Discussion on Online Forums NLG
Zi Chai, Xinyu Xing, Xiaojun Wan, Bo Huang

Ensuring Readability and Data-fidelity using Head-modifier Templates in Deep Type Description Generation link NLG
Jiangjie Chen, Ao Wang, Haiyun Jiang, Suo Feng, Chenguang Li, Yanghua Xiao

A type description is a succinct noun compound which helps humans and machines quickly grasp the informative and distinctive information of an entity. Entities in most knowledge graphs (KGs) still lack such descriptions, thus calling for automatic methods to supplement such information. However, existing generative methods either overlook the grammatical structure or make factual mistakes in generated texts. To solve these problems, we propose a head-modifier template-based method to ensure the readability and data fidelity of generated type descriptions. We also propose a new dataset and two automatic metrics for this task. Experiments show that our method improves substantially compared with baselines and achieves state-of-the-art performance on both datasets.

Simple and Effective Text Matching with Richer Alignment Features
Runqi Yang, Jianhai Zhang, Xing Gao, Feng Ji, Haiqing Chen

Semantic Parsing with Dual Learning parsing
Ruisheng Cao, Su Zhu, Chen Liu, Jieyu Li, Kai Yu

ChID: A Large-scale Chinese IDiom Dataset for Cloze Test link corpus
Chujie Zheng, Minlie Huang, Aixin Sun

Cloze-style reading comprehension in Chinese is still limited due to the lack of various corpora. In this paper we propose ChID, a large-scale Chinese cloze test dataset which studies the comprehension of idioms, a unique language phenomenon in Chinese. In this corpus, the idioms in a passage are replaced by blank symbols and the correct answer needs to be chosen from well-designed candidate idioms. We carefully study how the design of candidate idioms and the representation of idioms affect the performance of state-of-the-art models. Results show that machine accuracy is substantially worse than that of humans, indicating a large space for further research.

Neural Fuzzy Repair: Integrating Fuzzy Matches into Neural Machine Translation MT
Bram Bulte, Arda Tezcan

Are You Convinced? Choosing the More Convincing Evidence with a Siamese Network
Martin Gleize, Eyal Shnarch, Leshem Choshen, Lena Dankin, GUY MOSHKOWICH, Ranit Aharonov, Noam Slonim

Effective Cross-lingual Transfer of Neural Machine Translation Models without Shared Vocabularies link MT transfer
Yunsu Kim, Yingbo Gao, Hermann Ney

Transfer learning or multilingual modeling is essential for low-resource neural machine translation (NMT), but its applicability is limited to cognate languages that share vocabularies. This paper shows effective techniques to transfer a pre-trained NMT model to a new, unrelated language without shared vocabularies. We relieve the vocabulary mismatch by using cross-lingual word embeddings, train a more language-agnostic encoder by injecting artificial noise, and generate synthetic data easily from the pre-training data without back-translation. Our methods do not require restructuring the vocabulary or retraining the model. We improve plain NMT transfer by up to +5.1% BLEU in five low-resource translation tasks, outperforming multilingual joint training by a large margin. We also provide extensive ablation studies on pre-trained embeddings, synthetic data, vocabulary size, and parameter freezing for a better understanding of NMT transfer.

Towards Explainable NLP: A Generative Explanation Framework for Text Classification link
Hui Liu, Qingyu Yin, William Yang Wang

Building explainable systems is a critical problem in the field of Natural Language Processing (NLP), since most machine learning models provide no explanations for their predictions. Existing approaches to explainable machine learning tend to focus on interpreting the outputs or the connections between inputs and outputs. However, fine-grained information is often ignored, and the systems do not explicitly generate human-readable explanations. To better alleviate this problem, we propose a novel generative explanation framework that learns to make classification decisions and generate fine-grained explanations at the same time. More specifically, we introduce the explainable factor and the minimum risk training approach that learn to generate more reasonable explanations. We construct two new datasets that contain summaries, rating scores, and fine-grained reasons. We conduct experiments on both datasets, comparing with several strong neural network baseline systems. Experimental results show that our method surpasses all baselines on both datasets, and is able to generate concise explanations at the same time.

Automatic Evaluation of Local Topic Quality link
Jeffrey Lund, Piper Armstrong, Wilson Fearn, Stephen Cowley, Courtni Byun, Jordan Boyd-Graber, Kevin Seppi

Topic models are typically evaluated with respect to the global topic distributions that they generate, using metrics such as coherence, but without regard to local (token-level) topic assignments. Token-level assignments are important for downstream tasks such as classification. Even recent models, which aim to improve the quality of these token-level topic assignments, have been evaluated only with respect to global metrics. We propose a task designed to elicit human judgments of token-level topic assignments. We use a variety of topic model types and parameters and discover that global metrics agree poorly with human assignments. Since human evaluation is expensive we propose a variety of automated metrics to evaluate topic models at a local level. Finally, we correlate our proposed metrics with human judgments from the task on several datasets. We show that an evaluation based on the percent of topic switches correlates most strongly with human judgment of local topic quality. We suggest that this new metric, which we call consistency, be adopted alongside global metrics such as topic coherence when evaluating new topic models.
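
A minimal sketch of a switch-based consistency score as described above, assuming `assignments` is the list of token-level topic ids for one document (the paper's exact normalization may differ):

```python
def topic_switch_rate(assignments):
    switches = sum(1 for a, b in zip(assignments, assignments[1:]) if a != b)
    return switches / max(len(assignments) - 1, 1)

def consistency(assignments):
    return 1.0 - topic_switch_rate(assignments)  # fewer switches -> higher consistency
```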

Self-Attentive, Multi-Context One-Class Classification for Unsupervised Anomaly Detection on Text
Lukas Ruff, Yury Zemlyanskiy, Robert Vandermeulen, Thomas Schnake, Marius Kloft

Are we there yet? Encoder-decoder neural networks as cognitive models of English past tense inflection link
Maria Corkery, Yevgen Matusevych, Sharon Goldwater

The cognitive mechanisms needed to account for the English past tense have long been a subject of debate in linguistics and cognitive science. Neural network models were proposed early on, but were shown to have clear flaws. Recently, however, Kirov and Cotterell (2018) showed that modern encoder-decoder (ED) models overcome many of these flaws. They also presented evidence that ED models demonstrate humanlike performance in a nonce-word task. Here, we look more closely at the behaviour of their model in this task. We find that (1) the model exhibits instability across multiple simulations in terms of its correlation with human data, and (2) even when results are aggregated across simulations (treating each simulation as an individual human participant), the fit to the human data is not strong: it is worse than an older rule-based model. These findings hold up through several alternative training regimes and evaluation measures. Although other neural architectures might do better, we conclude that there is still insufficient evidence to claim that neural nets are a good cognitive model for this task.

SherLIiC: A Typed Event-Focused Lexical Inference Benchmark for Evaluating Natural Language Inference link NLI
Martin Schmitt, Hinrich Schütze

We present SherLIiC, a testbed for lexical inference in context (LIiC), consisting of 3985 manually annotated inference rule candidates (InfCands), accompanied by (i) ~960k unlabeled InfCands, and (ii) ~190k typed textual relations between Freebase entities extracted from the large entity-linked corpus ClueWeb09. Each InfCand consists of one of these relations, expressed as a lemmatized dependency path, and two argument placeholders, each linked to one or more Freebase types. Due to our candidate selection process based on strong distributional evidence, SherLIiC is much harder than existing testbeds because distributional evidence is of little utility in the classification of InfCands. We also show that, due to its construction, many of SherLIiC's correct InfCands are novel and missing from existing rule bases. We evaluate a number of strong baselines on SherLIiC, ranging from semantic vector space models to state-of-the-art neural models of natural language inference (NLI). We show that SherLIiC poses a tough challenge to existing NLI systems.

Neural Keyphrase Generation via Reinforcement Learning with Adaptive Rewards link NLG
Hou Pong Chan, Wang Chen, Lu Wang, Irwin King

Generating keyphrases that summarize the main points of a document is a fundamental task in natural language processing. Although existing generative models are capable of predicting multiple keyphrases for an input document as well as determining the number of keyphrases to generate, they still suffer from the problem of generating too few keyphrases. To address this problem, we propose a reinforcement learning (RL) approach for keyphrase generation, with an adaptive reward function that encourages a model to generate both sufficient and accurate keyphrases. Furthermore, we introduce a new evaluation method that incorporates name variations of the ground-truth keyphrases using the Wikipedia knowledge base. Thus, our evaluation method can more robustly evaluate the quality of predicted keyphrases. Extensive experiments on five real-world datasets of different scales demonstrate that our RL approach consistently and significantly improves the performance of the state-of-the-art generative models with both conventional and new evaluation methods.
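
One plausible instantiation of such an adaptive reward (an assumption based on the description above, not a verbatim reproduction of the paper's function): reward recall while too few keyphrases have been generated, and F1 once enough have been produced:

```python
def adaptive_reward(predicted, gold):
    pred, ref = set(predicted), set(gold)
    correct = len(pred & ref)
    recall = correct / len(ref) if ref else 0.0
    precision = correct / len(pred) if pred else 0.0
    if len(pred) < len(ref):          # too few keyphrases: encourage coverage
        return recall
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)  # otherwise reward F1
```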

#YouToo? Detection of Personal Recollections of Sexual Harassment on Social Media
Arijit Ghosh Chowdhury, Ramit Sawhney, Rajiv Ratn Shah, Debanjan Mahata

Generating Question Relevant Captions to Aid Visual Question Answering link QA
Jialin Wu, Zeyuan Hu, Raymond Mooney

Visual question answering (VQA) and image captioning require a shared body of general knowledge connecting language and vision. We present a novel approach to improve VQA performance that exploits this connection by jointly generating captions that are targeted to help answer a specific visual question. The model is trained using an existing caption dataset by automatically determining question-relevant captions using an online gradient-based method. Experimental results on the VQA v2 challenge demonstrate that our approach obtains state-of-the-art VQA performance (e.g. 68.4% on the Test-standard set using a single model) by simultaneously generating question-relevant captions.

Improving the Robustness of Question Answering Systems to Question Paraphrasing QA
Wee Chung Gan, Hwee Tou Ng

Continual and Multi-Task Architecture Search link
Ramakanth Pasunuru, Mohit Bansal

Architecture search is the process of automatically learning the neural model or cell structure that best suits the given task. Recently, this approach has shown promising performance improvements (on language modeling and image classification) with reasonable training speed, using a weight sharing strategy called Efficient Neural Architecture Search (ENAS). In our work, we first introduce a novel continual architecture search (CAS) approach, so as to continually evolve the model parameters during the sequential training of several tasks, without losing performance on previously learned tasks (via block-sparsity and orthogonality constraints), thus enabling life-long learning. Next, we explore a multi-task architecture search (MAS) approach over ENAS for finding a unified, single cell structure that performs well across multiple tasks (via joint controller rewards), and hence allows more generalizable transfer of the cell structure knowledge to an unseen new task. We empirically show the effectiveness of our sequential continual learning and parallel multi-task learning based architecture search approaches on diverse sentence-pair classification tasks (GLUE) and multimodal-generation based video captioning tasks. Further, we present several ablations and analyses on the learned cell structures.
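
As a small illustration of one ingredient mentioned above (the exact constraint used in CAS is an assumption here), an orthogonality-style penalty can discourage the parameters learned for a new task from interfering with the subspace used for a previous one:

```python
import torch

def orthogonality_penalty(w_old, w_new):
    """w_old, w_new: weight matrices of the same shape for consecutive tasks."""
    return torch.norm(w_new.t() @ w_old, p='fro') ** 2
```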

Adversarial Learning of Privacy-Preserving Text Representations for De-Identification of Medical Records link RepL
Max Friedrich, Arne Köhn, Gregor Wiedemann, Chris Biemann

De-identification is the task of detecting protected health information (PHI) in medical text. It is a critical step in sanitizing electronic health records (EHRs) to be shared for research. Automatic de-identification classifiers can significantly speed up the sanitization process. However, obtaining a large and diverse dataset to train such a classifier that works well across many types of medical text poses a challenge, as privacy laws prohibit the sharing of raw medical records. We introduce a method to create privacy-preserving shareable representations of medical text (i.e. they contain no PHI) that does not require expensive manual pseudonymization. These representations can be shared between organizations to create unified datasets for training de-identification models. Our representation allows training a simple LSTM-CRF de-identification model to an F1 score of 97.4%, which is comparable to a strong baseline that exposes private information in its representation. A robust, widely available de-identification classifier based on our representation could potentially enable studies for which de-identification would otherwise be too costly.

Crowdsourcing and Aggregating Nested Markable Annotations
Chris Madge, Silviu Paun, Juntao Yu, Jon Chamberlain, Udo Kruschwitz, Massimo Poesio

ARNOR: Attention Regularization based Noise Reduction for Distant Supervision Relation Classification
Wei Jia, Dai Dai, Xinyan Xiao, Hua Wu

The PhotoBook Dataset: Building Common Ground through Visually-Grounded Dialogue link dialogue/conversation
Janosch Haber, Tim Baumgärtner, Ece Takmaz, Lieke Gelderloos, Elia Bruni, Raquel Fernández

This paper introduces the PhotoBook dataset, a large-scale collection of visually-grounded, task-oriented dialogues in English designed to investigate shared dialogue history accumulating during conversation. Taking inspiration from seminal work on dialogue analysis, we propose a data-collection task formulated as a collaborative game prompting two online participants to refer to images utilising both their visual context as well as previously established referring expressions. We provide a detailed description of the task setup and a thorough analysis of the 2,500 dialogues collected. To further illustrate the novel features of the dataset, we propose a baseline model for reference resolution which uses a simple method to take into account shared information accumulated in a reference chain. Our results show that this information is particularly important to resolve later descriptions and underline the need to develop more sophisticated models of common ground in dialogue interaction.

Handling Divergent Reference Texts when Evaluating Table-to-Text Generation link NLG
Bhuwan Dhingra, Manaal Faruqui, Ankur Parikh, Ming-Wei Chang, Dipanjan Das, William Cohen

Automatically constructed datasets for generating text from semi-structured data (tables), such as WikiBio, often contain reference texts that diverge from the information in the corresponding semi-structured data. We show that metrics which rely solely on the reference texts, such as BLEU and ROUGE, show poor correlation with human judgments when those references diverge. We propose a new metric, PARENT, which aligns n-grams from the reference and generated texts to the semi-structured data before computing their precision and recall. Through a large scale human evaluation study of table-to-text models for WikiBio, we show that PARENT correlates with human judgments better than existing text generation metrics. We also adapt and evaluate the information extraction based evaluation proposed by Wiseman et al (2017), and show that PARENT has comparable correlation to it, while being easier to use. We show that PARENT is also applicable when the reference texts are elicited from humans using the data from the WebNLG challenge.
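
A heavily simplified, word-level sketch of the intuition behind PARENT (the actual metric operates over n-grams and combines table recall with reference recall differently): credit generated tokens supported by either the reference or the table, and measure how much of the table the generation covers:

```python
def simplified_parent(generated, reference, table_values):
    gen, ref = generated.split(), set(reference.split())
    table_tokens = {tok for value in table_values for tok in value.split()}
    supported = sum(1 for w in gen if w in ref or w in table_tokens)
    precision = supported / len(gen) if gen else 0.0
    recall = (sum(1 for t in table_tokens if t in set(gen)) / len(table_tokens)
              if table_tokens else 0.0)
    return (2 * precision * recall / (precision + recall)) if precision + recall else 0.0
```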

Key Fact as Pivot: A Two-Stage Model for Low Resource Table-to-Text Generation NLG
Shuming Ma, Pengcheng Yang, Tianyu Liu, Peng Li, Jie Zhou, Xu SUN

Sparse Sequence-to-Sequence Models link
Ben Peters, Vlad Niculae, André F. T. Martins

Learning Attention-based Embeddings for Relation Prediction in Knowledge Graphs link
Deepak Nathani, Jatin Chauhan, Charu Sharma, Manohar Kaul

Morphological Irregularity Correlates with Frequency link
Shijie Wu, Ryan Cotterell, Timothy O'Donnell

We present a study of morphological irregularity. Following recent work, we define an information-theoretic measure of irregularity based on the predictability of forms in a language. Using a neural transduction model, we estimate this quantity for the forms in 28 languages. We first present several validatory and exploratory analyses of irregularity. We then show that our analyses provide evidence for a correlation between irregularity and frequency: higher frequency items are more likely to be irregular and irregular items are more likely to be highly frequent. To our knowledge, this result is the first of its breadth and confirms longstanding proposals from the linguistics literature. The correlation is more robust when aggregated at the level of whole paradigms, providing support for models of linguistic structure in which inflected forms are unified by abstract underlying stems or lexemes. Code is available at https://github.com/shijie-wu/neural-transducer.

On the Compositionality Prediction of Noun Phrases using Poincaré Embeddings link
Abhik Jana, Dima Puzyrev, Alexander Panchenko, Pawan Goyal, Chris Biemann, Animesh Mukherjee

The compositionality degree of multiword expressions indicates to what extent the meaning of a phrase can be derived from the meaning of its constituents and their grammatical relations. Prediction of (non)-compositionality is a task that has been frequently addressed with distributional semantic models. We introduce a novel technique to blend hierarchical information with distributional information for predicting compositionality. In particular, we use hypernymy information of the multiword and its constituents encoded in the form of the recently introduced Poincaré embeddings in addition to the distributional information to detect compositionality for noun phrases. Using a weighted average of the distributional similarity and a Poincaré similarity function, we obtain consistent and substantial, statistically significant improvement across three gold standard datasets over state-of-the-art models based on distributional information only. Unlike traditional approaches that solely use an unsupervised setting, we have also framed the problem as a supervised task, obtaining comparable improvements. Further, we publicly release our Poincaré embeddings, which are trained on the output of handcrafted lexical-syntactic patterns on a large corpus.
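
A minimal sketch of the blending idea described above (the similarity transform and weighting are assumptions; the paper's exact functions may differ), assuming all hyperbolic embeddings lie inside the unit ball:

```python
import numpy as np

def poincare_distance(u, v):
    sq = np.sum((u - v) ** 2)
    denom = (1 - np.sum(u ** 2)) * (1 - np.sum(v ** 2))
    return np.arccosh(1 + 2 * sq / denom)

def cosine(u, v):
    return float(u @ v) / (np.linalg.norm(u) * np.linalg.norm(v))

def compositionality_score(dist_phrase, dist_parts, hyp_phrase, hyp_parts, alpha=0.5):
    dist_sim = cosine(dist_phrase, dist_parts)                          # distributional similarity
    poin_sim = 1.0 / (1.0 + poincare_distance(hyp_phrase, hyp_parts))   # hierarchy-aware similarity
    return alpha * dist_sim + (1 - alpha) * poin_sim
```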

Robust Representation Learning of Biomedical Names RepL
Minh C. Phan, Aixin Sun, Yi Tay

Semantic expressive capacity with bounded memory link
Antoine Venant, Alexander Koller

We investigate the capacity of mechanisms for compositional semantic parsing to describe relations between sentences and semantic representations. We prove that in order to represent certain relations, mechanisms which are syntactically projective must be able to remember an unbounded number of locations in the semantic representations, where nonprojective mechanisms need not. This is the first result of this kind, and has consequences both for grammar-based and for neural systems.

Abstractive text summarization based on deep learning and semantic content generalization summarization
Panagiotis Kouris, Georgios Alexandridis, Andreas Stafylopatis

Pretraining Methods for Dialog Context Representation Learning link dialogue/conversation
Shikib Mehri, Evgeniia Razumovskaia, Tiancheng Zhao, Maxine Eskenazi

This paper examines various unsupervised pretraining objectives for learning dialog context representations. Two novel methods of pretraining dialog context encoders are proposed, and a total of four methods are examined. Each pretraining objective is fine-tuned and evaluated on a set of downstream dialog tasks using the MultiWoz dataset and strong performance improvement is observed. Further evaluation shows that our pretraining objectives result in not only better performance, but also better convergence, models that are less data hungry and have better domain generalizability.

CONAN - COunter NArratives through Nichesourcing: a Multilingual Dataset of Responses to Fight Online Hate Speech corpus
Yi-Ling Chung, Elizaveta Kuzmenko, Serra Sinem Tekiroglu, Marco Guerini

Correlating neural and symbolic representations of language link RepL
Grzegorz Chrupała, Afra Alishahi

Analysis methods which enable us to better understand the representations and functioning of neural models of language are increasingly needed as deep learning becomes the dominant approach in NLP. Here we present two methods based on Representational Similarity Analysis (RSA) and Tree Kernels (TK) which allow us to directly quantify how strongly the information encoded in neural activation patterns corresponds to information represented by symbolic structures such as syntax trees. We first validate our methods on the case of a simple synthetic language for arithmetic expressions with clearly defined syntax and semantics, and show that they exhibit the expected pattern of results. We then apply our methods to correlate neural representations of English sentences with their constituency parse trees.
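
A minimal RSA sketch, assuming both spaces are given as item-by-dimension matrices: compute pairwise dissimilarities of the same items in each space and correlate them. In the paper one of the two similarity structures would instead come from a Tree Kernel over syntax trees; the function below covers only the vector-space case.

```python
import numpy as np
from scipy.spatial.distance import pdist
from scipy.stats import spearmanr

def rsa(space_a, space_b, metric="cosine"):
    """Representational Similarity Analysis: correlate the pairwise dissimilarity
    structure of the same n items in two representation spaces."""
    sims_a = pdist(space_a, metric=metric)   # condensed vector of pairwise distances
    sims_b = pdist(space_b, metric=metric)
    return spearmanr(sims_a, sims_b).correlation

# Toy usage: neural activations vs. vectorised symbolic features for 50 sentences.
neural = np.random.randn(50, 300)
symbolic = np.random.randn(50, 40)
print(rsa(neural, symbolic))
```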

Analyzing Multi-Head Self-Attention: Specialized Heads Do the Heavy Lifting, the Rest Can Be Pruned link
Elena Voita, David Talbot, Fedor Moiseev, Rico Sennrich, Ivan Titov

Multi-head self-attention is a key component of the Transformer, a state-of-the-art architecture for neural machine translation. In this work we evaluate the contribution made by individual attention heads in the encoder to the overall performance of the model and analyze the roles played by them. We find that the most important and confident heads play consistent and often linguistically-interpretable roles. When pruning heads using a method based on stochastic gates and a differentiable relaxation of the L0 penalty, we observe that specialized heads are last to be pruned. Our novel pruning method removes the vast majority of heads without seriously affecting performance. For example, on the English-Russian WMT dataset, pruning 38 out of 48 encoder heads results in a drop of only 0.15 BLEU.
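
The pruning mechanism rests on per-head stochastic gates trained with a differentiable relaxation of the L0 penalty. The sketch below implements a Hard Concrete gate in the spirit of Louizos et al. (2018) as one plausible realization; the parameter values and module interface are assumptions, not the authors' code.

```python
import math
import torch
import torch.nn as nn

class HardConcreteGate(nn.Module):
    """Per-head stochastic gate with a differentiable relaxation of the L0 penalty."""
    def __init__(self, n_gates, beta=2.0 / 3.0, gamma=-0.1, zeta=1.1):
        super().__init__()
        self.log_alpha = nn.Parameter(torch.zeros(n_gates))
        self.beta, self.gamma, self.zeta = beta, gamma, zeta

    def forward(self):
        # Sample relaxed binary gates in [0, 1], one per attention head.
        u = torch.rand_like(self.log_alpha).clamp(1e-6, 1 - 1e-6)
        s = torch.sigmoid((u.log() - (1 - u).log() + self.log_alpha) / self.beta)
        return torch.clamp(s * (self.zeta - self.gamma) + self.gamma, 0.0, 1.0)

    def l0_penalty(self):
        # Expected number of open gates, i.e. heads that survive pruning.
        return torch.sigmoid(self.log_alpha - self.beta * math.log(-self.gamma / self.zeta)).sum()

# Usage sketch: multiply each head's output by its gate and add lambda * l0_penalty() to the loss;
# heads whose gates converge to zero can be removed after training.
gates = HardConcreteGate(n_gates=48)
print(gates().shape, float(gates.l0_penalty()))
```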

Explore, Propose, and Assemble: An Interpretable Model for Multi-Hop Reading Comprehension link
Yichen Jiang, Nitish Joshi, Yen-Chun Chen, Mohit Bansal

Multi-hop reading comprehension requires the model to explore and connect relevant information from multiple sentences/documents in order to answer the question about the context. To achieve this, we propose an interpretable 3-module system called Explore-Propose-Assemble reader (EPAr). First, the Document Explorer iteratively selects relevant documents and represents divergent reasoning chains in a tree structure so as to allow assimilating information from all chains. The Answer Proposer then proposes an answer from every root-to-leaf path in the reasoning tree. Finally, the Evidence Assembler extracts a key sentence containing the proposed answer from every path and combines them to predict the final answer. Intuitively, EPAr approximates the coarse-to-fine-grained comprehension behavior of human readers when facing multiple long documents. We jointly optimize our 3 modules by minimizing the sum of losses from each stage conditioned on the previous stage's output. On two multi-hop reading comprehension datasets WikiHop and MedHop, our EPAr model achieves significant improvements over the baseline and competitive results compared to the state-of-the-art model. We also present multiple reasoning-chain-recovery tests and ablation studies to demonstrate our system's ability to perform interpretable and accurate reasoning.

Low-resource Deep Entity Resolution with Transfer and Active Learning link transfer
Jungo Kasai, Kun Qian, Sairam Gurajada, Yunyao Li, Lucian Popa

Entity resolution (ER) is the task of identifying different representations of the same real-world entities across databases. It is a key step for knowledge base creation and text mining. Recent adaptation of deep learning methods for ER mitigates the need for dataset-specific feature engineering by constructing distributed representations of entity records. While these methods achieve state-of-the-art performance over benchmark data, they require large amounts of labeled data, which are typically unavailable in realistic ER applications. In this paper, we develop a deep learning-based method that targets low-resource settings for ER through a novel combination of transfer learning and active learning. We design an architecture that allows us to learn a transferable model from a high-resource setting to a low-resource one. To further adapt to the target dataset, we incorporate active learning that carefully selects a few informative examples to fine-tune the transferred model. Empirical evaluation demonstrates that our method achieves comparable, if not better, performance compared to state-of-the-art learning-based methods while using an order of magnitude fewer labels.

A Large-Scale Corpus for Conversation Disentanglement dialogue/conversation
Jonathan K. Kummerfeld, Sai R. Gouravajhala, Joseph J. Peper, Vignesh Athreya, Chulaka Gunasekara, Jatin Ganhotra, Siva Sankalp Patel, Lazaros C Polymenakos, Walter Lasecki

Hubless Nearest Neighbor Search for Bilingual Lexicon Induction
Jiaji Huang, Qiang Qiu, Kenneth Church

A Cross-Domain Transferable Neural Coherence Model link transfer
Peng Xu, Hamidreza Saghir, Jin Sung Kang, Teng Long, Avishek Joey Bose, Yanshuai Cao, Jackie Chi Kit Cheung

Zero-Shot Entity Linking by Reading Entity Descriptions link
Lajanugen Logeswaran, Ming-Wei Chang, Kristina Toutanova, Kenton Lee, Jacob Devlin, Honglak Lee

We present the zero-shot entity linking task, where mentions must be linked to unseen entities without in-domain labeled data. The goal is to enable robust transfer to highly specialized domains, and so no metadata or alias tables are assumed. In this setting, entities are only identified by text descriptions, and models must rely strictly on language understanding to resolve the new entities. First, we show that strong reading comprehension models pre-trained on large unlabeled data can be used to generalize to unseen entities. Second, we propose a simple and effective adaptive pre-training strategy, which we term domain-adaptive pre-training (DAP), to address the domain shift problem associated with linking unseen entities in a new domain. We present experiments on a new dataset that we construct for this task and show that DAP improves over strong pre-training baselines, including BERT. The data and code are available at https://github.com/lajanugen/zeshel.

Detecting Concealed Information in Text and Speech
Shengli Hu

Zero-shot Word Sense Disambiguation using Sense Definition Embeddings
Sawan Kumar, Sharmistha Jat, Karan Saxena, Partha Talukdar

Unsupervised Neural Text Simplification link
Sai Surya, Abhijit Mishra, Anirban Laha, Parag Jain, Karthik Sankaranarayanan

The paper presents a first attempt towards unsupervised neural text simplification that relies only on unlabeled text corpora. The core framework is composed of a shared encoder and a pair of attentional decoders, and gains knowledge of simplification through discrimination-based losses and denoising. The framework is trained using unlabeled text collected from the en-Wikipedia dump. Our analysis (both quantitative and qualitative, involving human evaluators) on public test data shows that the proposed model can perform text simplification at both lexical and syntactic levels, competitive with existing supervised methods. Adding a few labelled pairs improves the performance further.


Self-Supervised Dialogue Learning dialogue/conversation
Jiawei Wu, Xin Wang, William Yang Wang

Chinese Relation Extraction with Multi-Grained Information and External Linguistic Knowledge IE
Ziran Li, Ning Ding, Haitao Zheng, Zhiyuan Liu, Ying Shen

Can You Tell Me How to Get Past Sesame Street? Sentence-Level Pretraining Beyond Language Modeling link
Alex Wang, Jan Hula, Patrick Xia, Raghavendra Pappagari, R. Thomas Mccoy, Roma Patel, Najoung Kim, Ian Tenney, Yinghui Huang, Katherin Yu, Shuning Jin, Berlin Chen, Benjamin Van Durme, Edouard Grave, Ellie Pavlick, Samuel R. Bowman

Natural language understanding has recently seen a surge of progress with the use of sentence encoders like ELMo (Peters et al., 2018a) and BERT (Devlin et al., 2019) which are pretrained on variants of language modeling. We conduct the first large-scale systematic study of candidate pretraining tasks, comparing 19 different tasks both as alternatives and complements to language modeling. Our primary results support the use of language modeling, especially when combined with pretraining on additional labeled-data tasks. However, our results are mixed across pretraining tasks and show some concerning trends: In ELMo's pretrain-then-freeze paradigm, random baselines are worryingly strong and results vary strikingly across target tasks. In addition, fine-tuning BERT on an intermediate task often negatively impacts downstream transfer. In a more positive trend, we see modest gains from multitask training, suggesting the development of more sophisticated multitask and transfer learning techniques as an avenue for further research.

Learning Latent Trees with Stochastic Perturbations and Differentiable Dynamic Programming link
Caio Corro, Ivan Titov

We treat projective dependency trees as latent variables in our probabilistic model and induce them in such a way as to be beneficial for a downstream task, without relying on any direct tree supervision. Our approach relies on Gumbel perturbations and differentiable dynamic programming. Unlike previous approaches to latent tree learning, we stochastically sample global structures and our parser is fully differentiable. We illustrate its effectiveness on sentiment analysis and natural language inference tasks. We also study its properties on a synthetic structure induction task. Ablation studies emphasize the importance of both stochasticity and constraining latent structures to be projective trees.
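
A minimal sketch of the perturb-and-MAP ingredient: add Gumbel noise to arc scores before decoding. The projective dynamic program is not reproduced here; the final argmax over heads is a crude non-projective stand-in used only to keep the example self-contained.

```python
import numpy as np

def gumbel_perturb(arc_scores, temperature=1.0):
    """Perturb dependency arc scores with Gumbel noise (perturb-and-MAP).
    arc_scores: (n, n) matrix with arc_scores[h, m] = score of head h for modifier m."""
    u = np.random.uniform(1e-9, 1 - 1e-9, size=arc_scores.shape)
    gumbel = -np.log(-np.log(u))
    return (arc_scores + gumbel) / temperature

# In the paper the perturbed scores feed a differentiable projective dynamic program;
# as a crude, non-projective stand-in we simply pick the best-scoring head per modifier.
scores = np.random.randn(5, 5)
heads = gumbel_perturb(scores).argmax(axis=0)
print(heads)
```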

Evidence-based Trustworthiness
Yi Zhang, Dan Roth, Zachary Ives

ELI5: Long Form Question Answering QA
Angela Fan, Yacine Jernite, Ethan Perez, David Grangier, Jason Weston, Michael Auli

Fine-Grained Temporal Relation Extraction IE
Siddharth Vashishtha, Benjamin Van Durme, Aaron Steven White

Distant Learning for Entity Linking with Automatic Noise Detection link
Phong Le, Ivan Titov

Accurate entity linkers have been produced for domains and languages where annotated data (i.e., texts linked to a knowledge base) is available. However, little progress has been made for the settings where no or very limited amounts of labeled data are present (e.g., legal or most scientific domains). In this work, we show how we can learn to link mentions without having any labeled examples, only a knowledge base and a collection of unannotated texts from the corresponding domain. In order to achieve this, we frame the task as a multi-instance learning problem and rely on surface matching to create initial noisy labels. As the learning signal is weak and our surrogate labels are noisy, we introduce a noise detection component in our model: it lets the model detect and disregard examples which are likely to be noisy. Our method, jointly learning to detect noise and link entities, greatly outperforms the surface matching baseline. For a subset of entity categories, it even approaches the performance of supervised learning.
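
The following toy function illustrates how surface matching can produce the initial noisy labels; it is a deliberately simplified stand-in (exact match with a containment fallback) for the paper's candidate generation, and the knowledge-base entries are invented for the example.

```python
def surface_match_candidates(mention, kb_entities):
    """Noisy candidate labels from surface matching; kb_entities maps entity id -> name."""
    m = mention.lower()
    exact = [e for e, name in kb_entities.items() if name.lower() == m]
    if exact:
        return exact
    # Containment fallback, which is where much of the label noise comes from.
    return [e for e, name in kb_entities.items()
            if m in name.lower() or name.lower() in m]

kb = {"Q90": "Paris", "Q167646": "Paris Hilton", "Q79": "Egypt"}
print(surface_match_candidates("Paris", kb))   # both Paris entities survive -> noisy supervision
```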

AMR Parsing as Sequence-to-Graph Transduction link parsing
Sheng Zhang, Xutai Ma, Kevin Duh, Benjamin Van Durme

We propose an attention-based model that treats AMR parsing as sequence-to-graph transduction. Unlike most AMR parsers that rely on pre-trained aligners, external semantic resources, or data augmentation, our proposed parser is aligner-free, and it can be effectively trained with limited amounts of labeled AMR data. Our experimental results outperform all previously reported SMATCH scores, on both AMR 2.0 (76.3% F1 on LDC2017T10) and AMR 1.0 (70.2% F1 on LDC2014T12).

Boosting Entity Linking Performance by Leveraging Unlabeled Documents link
Phong Le, Ivan Titov

Modern entity linking systems rely on large collections of documents specifically annotated for the task (e.g., AIDA CoNLL). In contrast, we propose an approach which exploits only naturally occurring information: unlabeled documents and Wikipedia. Our approach consists of two stages. First, we construct a high recall list of candidate entities for each mention in an unlabeled document. Second, we use the candidate lists as weak supervision to constrain our document-level entity linking model. The model treats entities as latent variables and, when estimated on a collection of unlabelled texts, learns to choose entities relying both on local context of each mention and on coherence with other entities in the document. The resulting approach rivals fully-supervised state-of-the-art systems on standard test sets. It also approaches their performance in the very challenging setting: when tested on a test set sampled from the data used to estimate the supervised systems. By comparing to Wikipedia-only training of our model, we demonstrate that modeling unlabeled documents is beneficial.

Gender-preserving Debiasing for Pre-trained Word Embeddings link RepL
Masahiro Kaneko, Danushka Bollegala

Word embeddings learnt from massive text collections have demonstrated significant levels of discriminative biases such as gender, racial or ethnic biases, which in turn bias the down-stream NLP applications that use those word embeddings. Taking gender-bias as a working example, we propose a debiasing method that preserves non-discriminative gender-related information, while removing stereotypical discriminative gender biases from pre-trained word embeddings. Specifically, we consider four types of information: \emph{feminine}, \emph{masculine}, \emph{gender-neutral} and \emph{stereotypical}, which represent the relationship between gender vs. bias, and propose a debiasing method that (a) preserves the gender-related information in feminine and masculine words, (b) preserves the neutrality in gender-neutral words, and (c) removes the biases from stereotypical words. Experimental results on several previously proposed benchmark datasets show that our proposed method can debias pre-trained word embeddings better than existing SoTA methods proposed for debiasing word embeddings while preserving gender-related but non-discriminative information.

Improved Sentiment Detection via Label Transfer from Monolingual to Synthetic Code-Switched Text link transfer
Bidisha Samanta, Niloy Ganguly, Soumen Chakrabarti

Multilingual writers and speakers often alternate between two languages in a single discourse, a practice called "code-switching". Existing sentiment detection methods are usually trained on sentiment-labeled monolingual text. Manually labeled code-switched text, especially involving minority languages, is extremely rare. Consequently, the best monolingual methods perform relatively poorly on code-switched text. We present an effective technique for synthesizing labeled code-switched text from labeled monolingual text, which is more readily available. The idea is to replace carefully selected subtrees of constituency parses of sentences in the resource-rich language with suitable token spans selected from automatic translations to the resource-poor language. By augmenting scarce human-labeled code-switched text with plentiful synthetic code-switched text, we achieve significant improvements in sentiment labeling accuracy (1.5%, 5.11%, 7.20%) for three different language pairs (English-Hindi, English-Spanish and English-Bengali). We also get significant gains for hate speech detection: 4% improvement using only synthetic text and 6% if augmented with real text.
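
A toy sketch of the substitution step, assuming a constituency parse string and some source of span translations; `translate_span` and the toy English-Hindi entry are hypothetical placeholders for the automatic translations the paper selects, and only top-level NP children are replaced here.

```python
from nltk import Tree

def code_switch(parse_str, translate_span):
    """Rebuild the sentence, replacing the token span under each top-level NP child
    with its translation into the resource-poor language; the sentiment label of
    the original monolingual sentence is kept for the synthetic example."""
    tree = Tree.fromstring(parse_str)
    out = []
    for child in tree:
        leaves = child.leaves() if isinstance(child, Tree) else [child]
        if isinstance(child, Tree) and child.label() == "NP":
            out.extend(translate_span(leaves))
        else:
            out.extend(leaves)
    return " ".join(out)

toy = {"the movie": "yeh film"}   # hypothetical induced English-Hindi span translation
parse = "(S (NP (DT the) (NN movie)) (VP (VBD was) (ADJP (JJ great))))"
print(code_switch(parse, lambda span: toy.get(" ".join(span), " ".join(span)).split()))
```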

Progressive Self-Supervised Attention Learning for Aspect-Level Sentiment Analysis
Jialong Tang, Ziyao Lu, Jinsong SU, Yubin Ge, Linfeng Song, Le Sun, Jiebo Luo

End-to-End Sequential Metaphor Identification Inspired by Linguistic Theories
Rui Mao, Chenghua Lin, Frank Guerin

Like a Baby: Visually Situated Neural Language Acquisition link
Alexander Ororbia, Ankur Mali, Matthew Kelly, David Reitter

We examine the benefits of visual context in training neural language models to perform next-word prediction. A multi-modal neural architecture is introduced that outperforms its equivalent trained on language alone with a 2\% decrease in perplexity, even when no visual context is available at test time. Fine-tuning the embeddings of a pre-trained state-of-the-art bidirectional language model (BERT) in the language modeling framework yields a 3.5\% improvement. The advantage for training with visual context when testing without is robust across different languages (English, German and Spanish) and different models (GRU, LSTM, $\Delta$-RNN, as well as those that use BERT embeddings). Thus, language models perform better when they learn like a baby, i.e., in a multi-modal environment. This finding is compatible with the theory of situated cognition: language is inseparable from its physical context.

Learning to Discover, Ground and Use Words with Segmental Neural Language Models link
Kazuya Kawakami, Chris Dyer, Phil Blunsom

We propose a segmental neural language model that combines the generalization power of neural networks with the ability to discover word-like units that are latent in unsegmented character sequences. In contrast to previous segmentation models that treat word segmentation as an isolated task, our model unifies word discovery, learning how words fit together to form sentences, and, by conditioning the model on visual context, how words' meanings ground in representations of non-linguistic modalities. Experiments show that the unconditional model learns predictive distributions better than character LSTM models, discovers words competitively with nonparametric Bayesian word segmentation models, and that modeling language conditional on visual context improves performance on both.

Relational Word Embeddings link RepL
Jose Camacho-Collados, Luis Espinosa Anke, Steven Schockaert

Transformer-XL: Attentive Language Models Beyond a Fixed-Length Context link
Zihang Dai, Zhilin Yang, Yiming Yang, Jaime Carbonell, Quoc Le, Ruslan Salakhutdinov

Optimal Transport-based Alignment of Learned Character Representations for String Similarity RepL
Derek Tam, Nicholas Monath, Ari Kobren, Aaron Traylor, Rajarshi Das, Andrew McCallum

Exploring Pre-trained Language Models for Event Extraction and Generation NLG IE
Sen Yang, Dawei Feng, Linbo Qiao, Zhigang Kan, Dongsheng Li

What should I ask? Using conversationally informative rewards for goal-oriented visual dialog. dialogue/conversation
Pushkar Shukla, Carlos Elmadjian, Richika Sharan, Vivek Kulkarni, Matthew Turk, William Yang Wang

Syntax-Infused Variational Autoencoder for Text Generation link NLG
Xinyuan Zhang, Yi Yang, Siyang Yuan, Dinghan Shen, Lawrence Carin

Generating Logical Forms from Graph Representations of Text and Entities link RepL
Peter Shaw, Philip Massey, Angelica Chen, Francesco Piccinno, Yasemin Altun

Structured information about entities is critical for many semantic parsing tasks. We present an approach that uses a Graph Neural Network (GNN) architecture to incorporate information about relevant entities and their relations during parsing. Combined with a decoder copy mechanism, this approach provides a conceptually simple mechanism to generate logical forms with entities. We demonstrate that this approach is competitive with state-of-the-art across several tasks without pre-training, and outperforms existing approaches when combined with BERT pre-training.

Interpretable Neural Predictions with Differentiable Binary Variables link
Joost Bastings, Wilker Aziz, Ivan Titov

The success of neural networks comes hand in hand with a desire for more interpretability. We focus on text classifiers and make them more interpretable by having them provide a justification, a rationale, for their predictions. We approach this problem by jointly training two neural network models: a latent model that selects a rationale (i.e. a short and informative part of the input text), and a classifier that learns from the words in the rationale alone. Previous work proposed to assign binary latent masks to input positions and to promote short selections via sparsity-inducing penalties such as L0 regularisation. We propose a latent model that mixes discrete and continuous behaviour allowing at the same time for binary selections and gradient-based training without REINFORCE. In our formulation, we can tractably compute the expected value of penalties such as L0, which allows us to directly optimise the model towards a pre-specified text selection rate. We show that our approach is competitive with previous work on rationale extraction, and explore further uses in attention mechanisms.

RankQA: Neural Question Answering with Answer Re-Ranking link QA
Bernhard Kratzwald, Anna Eigenmann, Stefan Feuerriegel

Self-Regulated Interactive Sequence-to-Sequence Learning
Julia Kreutzer, Stefan Riezler

Symbolic inductive bias for visually grounded learning of spoken language link bias
Grzegorz Chrupała

A widespread approach to processing spoken language is to first automatically transcribe it into text. An alternative is to use an end-to-end approach: recent works have proposed to learn semantic embeddings of spoken language from images with spoken captions, without an intermediate transcription step. We propose to use multitask learning to exploit existing transcribed speech within the end-to-end setting. We describe a three-task architecture which combines the objectives of matching spoken captions with corresponding images, speech with text, and text with images. We show that the addition of the speech/text task leads to substantial performance improvements on image retrieval when compared to training the speech/image task in isolation. We conjecture that this is due to a strong inductive bias transcribed speech provides to the model, and offer supporting evidence for this.

Merge and Label: A novel neural network architecture for nested NER link
Joseph Fisher, Andreas Vlachos

Named entity recognition (NER) is one of the best studied tasks in natural language processing. However, most approaches are not capable of handling nested structures which are common in many applications. In this paper we introduce a novel neural network architecture that first merges tokens and/or entities into entities forming nested structures, and then labels each of them independently. Unlike previous work, our merge and label approach predicts real-valued instead of discrete segmentation structures, which allows it to combine word and nested entity embeddings while maintaining differentiability. We evaluate our approach using the ACE 2005 Corpus, where it achieves state-of-the-art F1 of 74.6, further improved with contextual embeddings (BERT) to 82.4, an overall improvement of close to 8 F1 points over previous approaches trained on the same data. Additionally we compare it against BiLSTM-CRFs, the dominant approach for flat NER structures, demonstrating that its ability to predict nested structures does not impact performance in simpler cases.

Classification and Clustering of Arguments with Contextualized Word Embeddings link RepL
Nils Reimers, Benjamin Schiller, Tilman Beck, Johannes Daxenberger, Christian Stab, Iryna Gurevych

We experiment with two recent contextualized word embedding methods (ELMo and BERT) in the context of open-domain argument search. For the first time, we show how to leverage the power of contextualized word embeddings to classify and cluster topic-dependent arguments, achieving impressive results on both tasks and across multiple datasets. For argument classification, we improve the state-of-the-art for the UKP Sentential Argument Mining Corpus by 20.8 percentage points and for the IBM Debater - Evidence Sentences dataset by 7.4 percentage points. For the understudied task of argument clustering, we propose a pre-training step which improves by 7.8 percentage points over strong baselines on a novel dataset, and by 12.3 percentage points for the Argument Facet Similarity (AFS) Corpus.

A Spreading Activation Framework for Tracking Conceptual Complexity of Texts
Ioana Hulpuș, Sanja Štajner, Heiner Stuckenschmidt

Right for the Wrong Reasons: Diagnosing Syntactic Heuristics in Natural Language Inference link NLI
Tom McCoy, Ellie Pavlick, Tal Linzen

A machine learning system can score well on a given test set by relying on heuristics that are effective for frequent example types but break down in more challenging cases. We study this issue within natural language inference (NLI), the task of determining whether one sentence entails another. We hypothesize that statistical NLI models may adopt three fallible syntactic heuristics: the lexical overlap heuristic, the subsequence heuristic, and the constituent heuristic. To determine whether models have adopted these heuristics, we introduce a controlled evaluation set called HANS (Heuristic Analysis for NLI Systems), which contains many examples where the heuristics fail. We find that models trained on MNLI, including the state-of-the-art model BERT, perform very poorly on HANS, suggesting that they have indeed adopted these heuristics. We conclude that there is substantial room for improvement in NLI systems, and that the HANS dataset can motivate and measure progress in this area.
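
For concreteness, the lexical overlap heuristic can be written as a one-line decision rule; HANS is built from examples where such rules give the wrong answer, as in the usage below.

```python
def lexical_overlap_heuristic(premise, hypothesis):
    """Predict 'entailment' whenever every hypothesis word also appears in the
    premise: the fallible shortcut that HANS is designed to expose."""
    p = set(premise.lower().split())
    h = set(hypothesis.lower().split())
    return "entailment" if h <= p else "non-entailment"

# A HANS-style counterexample: full lexical overlap, but no entailment.
print(lexical_overlap_heuristic("the doctor paid the actor", "the actor paid the doctor"))
# -> "entailment" (wrong), illustrating how a model relying on this heuristic fails
```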

Bridging the Gap between Training and Inference for Neural Machine Translation link MT
Wen Zhang, Yang Feng, Fandong Meng, Di You, Qun Liu

Neural Machine Translation (NMT) generates target words sequentially in the way of predicting the next word conditioned on the context words. At training time, it predicts with the ground truth words as context while at inference it has to generate the entire sequence from scratch. This discrepancy in the fed context leads to error accumulation along the way. Furthermore, word-level training requires strict matching between the generated sequence and the ground truth sequence, which leads to overcorrection over different but reasonable translations. In this paper, we address these issues by sampling context words not only from the ground truth sequence but also from the sequence predicted by the model during training, where the predicted sequence is selected with a sentence-level optimum. Experimental results on Chinese->English and WMT'14 English->German translation tasks demonstrate that our approach can achieve significant improvements on multiple datasets.
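
A much-simplified view of the context-sampling idea: word-level mixing with a fixed probability, whereas the paper selects the predicted sequence with a sentence-level optimum and anneals the sampling rate over training.

```python
import random

def mix_context(gold_tokens, predicted_tokens, p_gold=0.7):
    """Build the decoder context by sampling each position either from the ground
    truth or from the model's own prediction; p_gold would typically be decayed
    as training progresses."""
    return [g if random.random() < p_gold else p
            for g, p in zip(gold_tokens, predicted_tokens)]

gold = ["the", "cat", "sat", "on", "the", "mat"]
pred = ["the", "cat", "sits", "on", "a", "mat"]
print(mix_context(gold, pred))
```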

Predicting Humorousness and Metaphor Novelty with Gaussian Process Preference Learning
Edwin Simpson, Erik-Lân Do Dinh, Tristan Miller, Iryna Gurevych

A Simple Theoretical Model of Importance for Summarization summarization
Maxime Peyrard

Relating Simple Sentence Representations in Deep Neural Networks and the Brain link RepL
Sharmistha Jat, Hao Tang, Partha Talukdar, Tom Mitchell

What is the relationship between sentence representations learned by deep recurrent models and those encoded by the brain? Is there any correspondence between hidden layers of these recurrent models and brain regions when processing sentences? Can these deep models be used to synthesize brain data which can then be utilized in other extrinsic tasks? We investigate these questions using sentences with simple syntax and semantics (e.g., The bone was eaten by the dog.). We consider multiple neural network architectures, including recently proposed ELMo and BERT. We use magnetoencephalography (MEG) brain recording data collected from human subjects when they were reading these simple sentences. Overall, we find that BERT's activations correlate the best with MEG brain data. We also find that the deep network representation can be used to generate brain data from new sentences to augment existing brain data. To the best of our knowledge, this is the first work showing that the MEG brain recording when reading a word in a sentence can be used to distinguish earlier words in the sentence. Our exploration is also the first to use deep neural network representations to generate synthetic brain data and to show that it helps in improving subsequent stimuli decoding task accuracy.

Spatial Aggregation Facilitates Discovery of Spatial Topics
Aniruddha Maiti, Slobodan Vucetic

Argument Generation with Retrieval, Planning, and Realization link NLG
Xinyu Hua, Zhe Hu, Lu Wang

Automatic argument generation is an appealing but challenging task. In this paper, we study the specific problem of counter-argument generation, and present a novel framework, CANDELA. It consists of a powerful retrieval system and a novel two-step generation model, where a text planning decoder first decides on the main talking points and a proper language style for each sentence, then a content realization decoder reflects the decisions and constructs an informative paragraph-level argument. Furthermore, our generation model is empowered by a retrieval system indexed with 12 million articles collected from Wikipedia and popular English news media, which provides access to high-quality content with diversity. Automatic evaluation on a large-scale dataset collected from Reddit shows that our model yields significantly higher BLEU, ROUGE, and METEOR scores than the state-of-the-art and non-trivial comparisons. Human evaluation further indicates that our system arguments are more appropriate for refutation and richer in content.

Improved Zero-shot Neural Machine Translation via Ignoring Spurious Correlations link MT
Jiatao Gu, Yong Wang, Kyunghyun Cho, Victor O.K. Li

Zero-shot translation, translating between language pairs on which a Neural Machine Translation (NMT) system has never been trained, is an emergent property when training the system in multilingual settings. However, naive training for zero-shot NMT easily fails, and is sensitive to hyper-parameter setting. The performance typically lags far behind the more conventional pivot-based approach which translates twice using a third language as a pivot. In this work, we address the degeneracy problem due to capturing spurious correlations by quantitatively analyzing the mutual information between language IDs of the source and decoded sentences. Inspired by this analysis, we propose to use two simple but effective approaches: (1) decoder pre-training; (2) back-translation. These methods show significant improvement (4~22 BLEU points) over the vanilla zero-shot translation on three challenging multilingual datasets, and achieve similar or better results than the pivot-based approach.

The (Non-)Utility of Structural Features in BiLSTM-based Dependency Parsers link parsing
Agnieszka Falenska, Jonas Kuhn

Classical non-neural dependency parsers put considerable effort on the design of feature functions. Especially, they benefit from information coming from structural features, such as features drawn from neighboring tokens in the dependency tree. In contrast, their BiLSTM-based successors achieve state-of-the-art performance without explicit information about the structural context. In this paper we aim to answer the question: How much structural context are the BiLSTM representations able to capture implicitly? We show that features drawn from partial subtrees become redundant when the BiLSTMs are used. We provide a deep insight into information flow in transition- and graph-based neural architectures to demonstrate where the implicit information comes from when the parsers make their decisions. Finally, with model ablations we demonstrate that the structural context is not only present in the models, but it significantly influences their performance.

Strategies for Structuring Story Generation link NLG
Angela Fan, Mike Lewis, Yann Dauphin

Writers generally rely on plans or sketches to write long stories, but most current language models generate word by word from left to right. We explore coarse-to-fine models for creating narrative texts of several hundred words, and introduce new models which decompose stories by abstracting over actions and entities. The model first generates the predicate-argument structure of the text, where different mentions of the same entity are marked with placeholder tokens. It then generates a surface realization of the predicate-argument structure, and finally replaces the entity placeholders with context-sensitive names and references. Human judges prefer the stories from our models to a wide range of previous approaches to hierarchical text generation. Extensive analysis shows that our methods can help improve the diversity and coherence of events and entities in generated stories.

Learning Compressed Sentence Representations for On-Device Text Processing link RepL
Dinghan Shen, Pengyu Cheng, Dhanasekar Sundararaman, Xinyuan Zhang, Qian Yang, Meng Tang, Asli Celikyilmaz, Lawrence Carin

Vector representations of sentences, trained on massive text corpora, are widely used as generic sentence embeddings across a variety of NLP problems. The learned representations are generally assumed to be continuous and real-valued, giving rise to a large memory footprint and slow retrieval speed, which hinders their applicability to low-resource (memory and computation) platforms, such as mobile devices. In this paper, we propose four different strategies to transform continuous and generic sentence embeddings into a binarized form, while preserving their rich semantic information. The introduced methods are evaluated across a wide range of downstream tasks, where the binarized sentence embeddings are demonstrated to degrade performance by only about 2% relative to their continuous counterparts, while reducing the storage requirement by over 98%. Moreover, with the learned binary representations, the semantic relatedness of two sentences can be evaluated by simply calculating their Hamming distance, which is more computationally efficient than the inner product operation between continuous embeddings. Detailed analysis and case study further validate the effectiveness of proposed methods.
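
A sketch of the simplest binarization strategy and the resulting Hamming-distance comparison, using bit packing so that a 512-dimensional embedding occupies 64 bytes; thresholding at zero is just one of the strategies the paper considers.

```python
import numpy as np

def binarize(embeddings, threshold=0.0):
    """Hard-threshold real-valued sentence embeddings into bits and pack them into bytes."""
    bits = (embeddings > threshold).astype(np.uint8)
    return np.packbits(bits, axis=-1)

def hamming(a, b):
    # Number of differing bits between two packed binary codes.
    return int(np.unpackbits(np.bitwise_xor(a, b)).sum())

emb = np.random.randn(2, 512)
codes = binarize(emb)            # 512 bits -> 64 bytes per sentence
print(hamming(codes[0], codes[1]))
```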

Multi-task Pairwise Neural Ranking for Hashtag Segmentation link
Mounica Maddela, Wei Xu, Daniel Preoţiuc-Pietro

Hashtags are often employed on social media and beyond to add metadata to a textual utterance with the goal of increasing discoverability, aiding search, or providing additional semantics. However, the semantic content of hashtags is not straightforward to infer as these represent ad-hoc conventions which frequently include multiple words joined together and can include abbreviations and unorthodox spellings. We build a dataset of 12,594 hashtags split into individual segments and propose a set of approaches for hashtag segmentation by framing it as a pairwise ranking problem between candidate segmentations. Our novel neural approaches demonstrate 24.6% error reduction in hashtag segmentation accuracy compared to the current state-of-the-art method. Finally, we demonstrate that a deeper understanding of hashtag semantics obtained through segmentation is useful for downstream applications such as sentiment analysis, for which we achieved a 2.6% increase in average recall on the SemEval 2017 sentiment analysis dataset.
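
The candidate-generation side of the pairwise ranking formulation can be sketched by enumerating boundary insertions; the `max_splits` cap is an illustrative assumption used only to keep the candidate set small.

```python
from itertools import combinations

def candidate_segmentations(hashtag, max_splits=3):
    """Enumerate candidate segmentations of a hashtag by inserting up to max_splits
    boundaries; a pairwise ranker would then score candidate pairs against each other."""
    s = hashtag.lstrip("#")
    candidates = [[s]]
    positions = range(1, len(s))
    for k in range(1, max_splits + 1):
        for cuts in combinations(positions, k):
            bounds = (0,) + cuts + (len(s),)
            candidates.append([s[i:j] for i, j in zip(bounds, bounds[1:])])
    return candidates

print(candidate_segmentations("#nlproc", max_splits=1)[:4])
```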

Towards Generating Long and Coherent Text with Multi-Level Latent Variable Models link
Dinghan Shen, Asli Celikyilmaz, Yizhe Zhang, Liqun Chen, Xin Wang, Jianfeng Gao, Lawrence Carin

Variational autoencoders (VAEs) have received much attention recently as an end-to-end architecture for text generation with latent variables. In this paper, we investigate several multi-level structures to learn a VAE model to generate long and coherent text. In particular, we use a hierarchy of stochastic layers between the encoder and decoder networks to generate more informative latent codes. We also investigate a multi-level decoder structure to learn a coherent long-term structure by generating intermediate sentence representations as high-level plan vectors. Empirical results demonstrate that a multi-level VAE model produces more coherent and less repetitive long text compared to the standard VAE models and can further mitigate the posterior-collapse issue.

Using LSTMs to Assess the Obligatoriness of Phonological Distinctive Features for Phonotactic Learning
Nicole Mirea, Klinton Bicknell

Observing Dialogue in Therapy: Categorizing and Forecasting Behavioral Codes link dialogue/conversation
Jie Cao, Michael Tanana, Zac Imel, Eric Poitras, David Atkins, Vivek Srikumar

Automatically analyzing dialogue can help understand and guide behavior in domains such as counseling, where interactions are largely mediated by conversation. In this paper, we study modeling behavioral codes used to assess a psychotherapy treatment style called Motivational Interviewing (MI), which is effective for addressing substance abuse and related problems. Specifically, we address the problem of providing real-time guidance to therapists with a dialogue observer that (1) categorizes therapist and client MI behavioral codes and (2) forecasts codes for upcoming utterances to help guide the conversation and potentially alert the therapist. For both tasks, we define neural network models that build upon recent successes in dialogue modeling. Our experiments demonstrate that our models can outperform several baselines for both tasks. We also report the results of a careful analysis that reveals the impact of the various network design tradeoffs for modeling therapy dialogue.

Entity-Centric Contextual Affective Analysis link
Anjalie Field, Yulia Tsvetkov

While contextualized word representations have improved state-of-the-art benchmarks in many NLP tasks, their potential usefulness for social-oriented tasks remains largely unexplored. We show how contextualized word embeddings can be used to capture affect dimensions in portrayals of people. We evaluate our methodology quantitatively, on held-out affect lexicons, and qualitatively, through case examples. We find that contextualized word representations do encode meaningful affect information, but they are heavily biased towards their training data, which limits their usefulness to in-domain analyses. We ultimately use our method to examine differences in portrayals of men and women.

Multi-step Reasoning via Recurrent Dual Attention for Visual Dialog link dialogue/conversation
Zhe Gan, Yu Cheng, Ahmed Kholy, Linjie Li, Jingjing Liu, Jianfeng Gao

This paper presents a new model for visual dialog, Recurrent Dual Attention Network (ReDAN), using multi-step reasoning to answer a series of questions about an image. In each question-answering turn of a dialog, ReDAN infers the answer progressively through multiple reasoning steps. In each step of the reasoning process, the semantic representation of the question is updated based on the image and the previous dialog history, and the recurrently-refined representation is used for further reasoning in the subsequent step. On the VisDial v1.0 dataset, the proposed ReDAN model achieves a new state-of-the-art of 64.47% NDCG score. Visualization on the reasoning process further demonstrates that ReDAN can locate context-relevant visual and textual clues via iterative refinement, which can lead to the correct answer step-by-step.

Lattice Transformer for Speech Translation link MT
Pei Zhang, Niyu Ge, Boxing Chen, Kai Fan

Recent advances in sequence modeling have highlighted the strengths of the transformer architecture, especially in achieving state-of-the-art machine translation results. However, depending on the upstream systems, e.g., speech recognition or word segmentation, the input to the translation system can vary greatly. The goal of this work is to extend the attention mechanism of the transformer to naturally consume the lattice in addition to the traditional sequential input. We first propose a general lattice transformer for speech translation where the input is the output of the automatic speech recognition (ASR) which contains multiple paths and posterior scores. To leverage the extra information from the lattice structure, we develop a novel controllable lattice attention mechanism to obtain latent representations. On the LDC Spanish-English speech translation corpus, our experiments show that the lattice transformer generalizes significantly better and outperforms both a transformer baseline and a lattice LSTM. Additionally, we validate our approach on the WMT 2017 Chinese-English translation task with lattice inputs from different BPE segmentations. In this task, we also observe improvements over strong baselines.

Generalized Data Augmentation for Low-Resource Translation link MT
Mengzhou Xia, Xiang Kong, Antonios Anastasopoulos, Graham Neubig

Translation to or from low-resource languages (LRLs) poses challenges for machine translation in terms of both adequacy and fluency. Data augmentation utilizing large amounts of monolingual data is regarded as an effective way to alleviate these problems. In this paper, we propose a general framework for data augmentation in low-resource machine translation that not only uses target-side monolingual data, but also pivots through a related high-resource language (HRL). Specifically, we experiment with a two-step pivoting method to convert high-resource data to the LRL, making use of available resources to better approximate the true data distribution of the LRL. First, we inject LRL words into HRL sentences through an induced bilingual dictionary. Second, we further edit these modified sentences using a modified unsupervised machine translation framework. Extensive experiments on four low-resource datasets show that under extreme low-resource settings, our data augmentation techniques improve translation quality by up to 1.5 to 8 BLEU points compared to supervised back-translation baselines.
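
A toy version of the first, dictionary-injection step; the induced dictionary entries and the replacement probability are invented for illustration, and the paper's second step would further edit the output with an unsupervised MT model.

```python
import random

def inject_lrl_words(hrl_sentence, bilingual_dict, p=0.3):
    """Replace HRL words with LRL translations drawn from an induced bilingual dictionary."""
    return [bilingual_dict.get(w, w) if random.random() < p else w
            for w in hrl_sentence.split()]

toy_dict = {"house": "casa_lrl", "water": "agua_lrl"}   # hypothetical induced entries
print(" ".join(inject_lrl_words("the house is near the water", toy_dict, p=1.0)))
```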

Language Modelling Makes Sense: Propagating Representations through WordNet for Full-Coverage Word Sense Disambiguation link NLG
Daniel Loureiro, Alípio Jorge

Contextual embeddings represent a new generation of semantic representations learned from Neural Language Modelling (NLM) that addresses the issue of meaning conflation hampering traditional word embeddings. In this work, we show that contextual embeddings can be used to achieve unprecedented gains in Word Sense Disambiguation (WSD) tasks. Our approach focuses on creating sense-level embeddings with full-coverage of WordNet, and without recourse to explicit knowledge of sense distributions or task-specific modelling. As a result, a simple Nearest Neighbors (k-NN) method using our representations is able to consistently surpass the performance of previous systems using powerful neural sequencing models. We also analyse the robustness of our approach when ignoring part-of-speech and lemma features, requiring disambiguation against the full sense inventory, and revealing shortcomings to be improved. Finally, we explore applications of our sense embeddings for concept-level analyses of contextual embeddings and their respective NLMs.
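
One way to realize the core recipe is to average contextual embeddings per sense key and disambiguate by nearest neighbor; the sketch below omits the WordNet propagation that gives the approach its full coverage.

```python
import numpy as np
from collections import defaultdict

def build_sense_embeddings(tagged_contexts):
    """Average contextual embeddings of annotated occurrences per sense key.
    tagged_contexts: iterable of (sense_key, context_vector) pairs."""
    sums, counts = defaultdict(lambda: 0.0), defaultdict(int)
    for sense, vec in tagged_contexts:
        sums[sense] = sums[sense] + vec
        counts[sense] += 1
    return {s: sums[s] / counts[s] for s in sums}

def disambiguate(context_vec, sense_embeddings, candidate_senses):
    # 1-NN (cosine) over the candidate senses of the target lemma.
    def cos(a, b):
        return float(a @ b) / (np.linalg.norm(a) * np.linalg.norm(b))
    return max(candidate_senses, key=lambda s: cos(context_vec, sense_embeddings[s]))
```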

Probing for Semantic Classes: Diagnosing the Meaning Content of Word Embeddings link RepL
Yadollah Yaghoobzadeh, Katharina Kann, T. J. Hazen, Eneko Agirre, Hinrich Schütze

Word embeddings typically represent different meanings of a word in a single conflated vector. Empirical analysis of embeddings of ambiguous words is currently limited by the small size of manually annotated resources and by the fact that word senses are treated as unrelated individual concepts. We present a large dataset based on manual Wikipedia annotations and word senses, where word senses from different words are related by semantic classes. This is the basis for novel diagnostic tests for an embedding's content: we probe word embeddings for semantic classes and analyze the embedding space by classifying embeddings into semantic classes. Our main findings are: (i) Information about a sense is generally represented well in a single-vector embedding - if the sense is frequent. (ii) A classifier can accurately predict whether a word is single-sense or multi-sense, based only on its embedding. (iii) Although rare senses are not well represented in single-vector embeddings, this does not have a negative impact on an NLP application whose performance depends on frequent senses.
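
The probing setup amounts to training a simple classifier from single-vector embeddings to semantic-class labels; the sketch below uses random arrays as stand-ins for real embeddings and labels, so the reported accuracy is only the chance baseline.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Probe: can a linear classifier recover a semantic class (e.g. food vs. organisation)
# from a word's single-vector embedding alone?
X = np.random.randn(2000, 300)            # stand-in for word embeddings
y = np.random.randint(0, 2, size=2000)    # stand-in for semantic-class labels
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)
probe = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
print("probing accuracy:", probe.score(X_te, y_te))   # ~0.5 here; real embeddings score far higher for frequent senses
```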

Multi-Task Learning for Coherence Modeling
Youmna Farag, Helen Yannakoudakis

One Time of Interaction May Not Be Enough: Go Deep with an Interaction-over-Interaction Network for Response Selection in Dialogues dialogue/conversation
Chongyang Tao, Wei Wu, Can Xu, Wenpeng Hu, Dongyan Zhao, Rui Yan

Disentangled Representation Learning for Non-Parallel Text Style Transfer link transfer
Vineet John, Lili Mou, Hareesh Bahuleyan, Olga Vechtomova

HighRES: Highlight-based Reference-less Evaluation of Summarization summarization
Hardy Hardy, Shashi Narayan, Andreas Vlachos

Unraveling Antonym's Word Vectors through a Siamese-like Network
Mathias Etcheverry, Dina Wonsever

Latent Retrieval for Weakly Supervised Open Domain Question Answering link QA
Kenton Lee, Ming-Wei Chang, Kristina Toutanova

Recent work on open domain question answering (QA) assumes strong supervision of the supporting evidence and/or assumes a blackbox information retrieval (IR) system to retrieve evidence candidates. We argue that both are suboptimal, since gold evidence is not always available, and QA is fundamentally different from IR. We show for the first time that it is possible to jointly learn the retriever and reader from question-answer string pairs and without any IR system. In this setting, evidence retrieval from all of Wikipedia is treated as a latent variable. Since this is impractical to learn from scratch, we pre-train the retriever with an Inverse Cloze Task. We evaluate on open versions of five QA datasets. On datasets where the questioner already knows the answer, a traditional IR system such as BM25 is sufficient. On datasets where a user is genuinely seeking an answer, we show that learned retrieval is crucial, outperforming BM25 by up to 19 points in exact match.
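
A minimal sketch of how Inverse Cloze Task pairs could be constructed; in the actual pre-training task the removed sentence is occasionally kept in the context, a detail omitted here.

```python
import random

def inverse_cloze_examples(passages):
    """Build (pseudo-query, context) pairs for retriever pre-training: one sentence
    is removed from a passage and must be matched back to the remaining sentences.
    passages: list of passages, each given as a list of sentences."""
    examples = []
    for sents in passages:
        if len(sents) < 2:
            continue
        i = random.randrange(len(sents))
        query = sents[i]
        context = sents[:i] + sents[i + 1:]
        examples.append((query, " ".join(context)))
    return examples

doc = [["Paris is the capital of France.",
        "It is known for the Eiffel Tower.",
        "The Seine flows through the city."]]
print(inverse_cloze_examples(doc)[0])
```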

Unsupervised Multilingual Word Embedding with Limited Resources using Neural Language Models RepL
Takashi Wada, Tomoharu Iwata, Yuji Matsumoto

Improving Textual Network Embedding with Global Attention via Optimal Transport link
Liqun Chen, Guoyin Wang, Chenyang Tao, Dinghan Shen, Pengyu Cheng, Xinyuan Zhang, Wenlin Wang, Yizhe Zhang, Lawrence Carin

Constituting highly informative network embeddings is an important tool for network analysis. It encodes network topology, along with other useful side information, into low-dimensional node-based feature representations that can be exploited by statistical modeling. This work focuses on learning context-aware network embeddings augmented with text data. We reformulate the network-embedding problem, and present two novel strategies to improve over traditional attention mechanisms: ($i$) a content-aware sparse attention module based on optimal transport, and ($ii$) a high-level attention parsing module. Our approach yields naturally sparse and self-normalized relational inference. It can capture long-term interactions between sequences, thus addressing the challenges faced by existing textual network embedding schemes. Extensive experiments are conducted to demonstrate our model can consistently outperform alternative state-of-the-art methods.

Sentiment Tagging with Partial Labels using Modular Architectures link
Xiao Zhang, Dan Goldwasser

Many NLP learning tasks can be decomposed into several distinct sub-tasks, each associated with a partial label. In this paper we focus on a popular class of learning problems, sequence prediction applied to several sentiment analysis tasks, and suggest a modular learning approach in which different sub-tasks are learned using separate functional modules, combined to perform the final task while sharing information. Our experiments show this approach helps constrain the learning process and can alleviate some of the supervision efforts.

Open Domain Event Extraction Using Neural Latent Variable Models link IE
Xiao Liu, Heyan Huang, Yue Zhang

We consider open domain event extraction, the task of extracting unconstraint types of events from news clusters. A novel latent variable neural model is constructed, which is scalable to very large corpus. A dataset is collected and manually annotated, with task-specific evaluation metrics being designed. Results show that the proposed unsupervised model gives better performance compared to the state-of-the-art method for event schema induction.

Choosing Transfer Languages for Cross-Lingual Learning link transfer
Yu-Hsiang Lin, Chian-Yu Chen, Jean Lee, Zirui Li, Yuyan Zhang, Mengzhou Xia, Shruti Rijhwani, Junxian He, Zhisong Zhang, Xuezhe Ma, Antonios Anastasopoulos, Patrick Littell, Graham Neubig

Unsupervised Discovery of Gendered Language through Latent-Variable Modeling link
Alexander Miserlis Hoyle, Lawrence Wolf-Sonkin, Hanna Wallach, Isabelle Augenstein, Ryan Cotterell

Studying the ways in which language is gendered has long been an area of interest in sociolinguistics. Studies have explored, for example, the speech of male and female characters in film and the language used to describe male and female politicians. In this paper, we aim not to merely study this phenomenon qualitatively, but instead to quantify the degree to which the language used to describe men and women is different and, moreover, different in a positive or negative way. To that end, we introduce a generative latent-variable model that jointly represents adjective (or verb) choice, with its sentiment, given the natural gender of a head (or dependent) noun. We find that there are significant differences between descriptions of male and female nouns and that these differences align with common gender stereotypes: Positive adjectives used to describe women are more often related to their bodies than adjectives used to describe men.

Informative Image Captioning with External Sources of Information link
Sanqiang Zhao, Piyush Sharma, Tomer Levinboim, Radu Soricut

An image caption should fluently present the essential information in a given image, including informative, fine-grained entity mentions and the manner in which these entities interact. However, current captioning models are usually trained to generate captions that only contain common object names, thus falling short on an important "informativeness" dimension. We present a mechanism for integrating image information together with fine-grained labels (assumed to be generated by some upstream models) into a caption that describes the image in a fluent and informative manner. We introduce a multimodal, multi-encoder model based on Transformer that ingests both image features and multiple sources of entity labels. We demonstrate that we can learn to control the appearance of these entity labels in the output, resulting in captions that are both fluent and informative.

Sequence-to-Nuggets: Nested Entity Mention Detection via Anchor-Region Networks link
Hongyu Lin, Yaojie Lu, Xianpei Han, Le Sun

Sequential labeling-based NER approaches restrict each word belonging to at most one entity mention, which will face a serious problem when recognizing nested entity mentions. In this paper, we propose to resolve this problem by modeling and leveraging the head-driven phrase structures of entity mentions, i.e., although a mention can nest other mentions, they will not share the same head word. Specifically, we propose Anchor-Region Networks (ARNs), a sequence-to-nuggets architecture for nested mention detection. ARNs first identify anchor words (i.e., possible head words) of all mentions, and then recognize the mention boundaries for each anchor word by exploiting regular phrase structures. Furthermore, we also design Bag Loss, an objective function which can train ARNs in an end-to-end manner without using any anchor word annotation. Experiments show that ARNs achieve the state-of-the-art performance on three standard nested entity mention detection benchmarks.

Simple and Effective Curriculum Pointer-Generator Networks for Reading Comprehension over Long Narratives link
Yi Tay, Shuohang Wang, Anh Tuan Luu, Jie Fu, Minh C. Phan, Xingdi Yuan, Jinfeng Rao, Siu Cheung Hui, Aston Zhang

This paper tackles the problem of reading comprehension over long narratives where documents easily span over thousands of tokens. We propose a curriculum learning (CL) based Pointer-Generator framework for reading/sampling over large documents, enabling diverse training of the neural model based on the notion of alternating contextual difficulty. This can be interpreted as a form of domain randomization and/or generative pretraining during training. To this end, the usage of the Pointer-Generator softens the requirement of having the answer within the context, enabling us to construct diverse training samples for learning. Additionally, we propose a new Introspective Alignment Layer (IAL), which reasons over decomposed alignments using block-based self-attention. We evaluate our proposed method on the NarrativeQA reading comprehension benchmark, achieving state-of-the-art performance, improving existing baselines by $51\%$ relative improvement on BLEU-4 and $17\%$ relative improvement on Rouge-L. Extensive ablations confirm the effectiveness of our proposed IAL and CL components.

Neural Network Alignment for Sentential Paraphrases
Jessica Ouyang, Kathy McKeown

A Unified Linear-Time Framework for Sentence-Level Discourse Parsing link parsing
Xiang Lin, Shafiq Joty, Prathyusha Jwalapuram, M Saiful Bari

We propose an efficient neural framework for sentence-level discourse analysis in accordance with Rhetorical Structure Theory (RST). Our framework comprises a discourse segmenter to identify the elementary discourse units (EDU) in a text, and a discourse parser that constructs a discourse tree in a top-down fashion. Both the segmenter and the parser are based on Pointer Networks and operate in linear time. Our segmenter yields an $F_1$ score of 95.4, and our parser achieves an $F_1$ score of 81.7 on the aggregated labeled (relation) metric, surpassing previous approaches by a good margin and approaching human agreement on both tasks (98.3 and 83.0 $F_1$).

Literary Event Detection
Matthew Sims, Jong Ho Park, David Bamman

CoDraw: Collaborative Drawing as a Testbed for Grounded Goal-driven Communication link
Jin-Hwa Kim, Nikita Kitaev, Xinlei Chen, Marcus Rohrbach, Byoung-Tak Zhang, Yuandong Tian, Dhruv Batra, Devi Parikh

In this work, we propose a goal-driven collaborative task that combines language, perception, and action. Specifically, we develop a Collaborative image-Drawing game between two agents, called CoDraw. Our game is grounded in a virtual world that contains movable clip art objects. The game involves two players: a Teller and a Drawer. The Teller sees an abstract scene containing multiple clip art pieces in a semantically meaningful configuration, while the Drawer tries to reconstruct the scene on an empty canvas using available clip art pieces. The two players communicate with each other using natural language. We collect the CoDraw dataset of ~10K dialogs consisting of ~138K messages exchanged between human players. We define protocols and metrics to evaluate learned agents in this testbed, highlighting the need for a novel "crosstalk" evaluation condition which pairs agents trained independently on disjoint subsets of the training data. We present models for our task and benchmark them using both fully automated evaluation and by having them play the game live with humans.

Syntactically Supervised Transformers for Faster Neural Machine Translation link MT
Nader Akoury, Kalpesh Krishna, Mohit Iyyer

Standard decoders for neural machine translation autoregressively generate a single target token per time step, which slows inference especially for long outputs. While architectural advances such as the Transformer fully parallelize the decoder computations at training time, inference still proceeds sequentially. Recent developments in non- and semi-autoregressive decoding produce multiple tokens per time step independently of the others, which improves inference speed but deteriorates translation quality. In this work, we propose the syntactically supervised Transformer (SynST), which first autoregressively predicts a chunked parse tree before generating all of the target tokens in one shot conditioned on the predicted parse. A series of controlled experiments demonstrates that SynST decodes sentences ~ 5x faster than the baseline autoregressive Transformer while achieving higher BLEU scores than most competing methods on En-De and En-Fr datasets.

Duality of Link Prediction and Entailment Graph Induction
Mohammad Javad Hosseini, Shay B. Cohen, Mark Johnson, Mark Steedman

Lightweight and Efficient Neural Natural Language Processing with Quaternion Networks link
Yi Tay, Aston Zhang, Anh Tuan Luu, Jinfeng Rao, Shuai Zhang, Shuohang Wang, Jie Fu, Siu Cheung Hui

Many state-of-the-art neural models for NLP are heavily parameterized and thus memory inefficient. This paper proposes a series of lightweight and memory-efficient neural architectures for a potpourri of natural language processing (NLP) tasks. To this end, our models exploit computation using Quaternion algebra and hypercomplex spaces, enabling not only expressive inter-component interactions but also a significantly ($75\%$) reduced parameter size owing to fewer degrees of freedom in the Hamilton product. We propose Quaternion variants of existing models, giving rise to new architectures such as the Quaternion Attention Model and the Quaternion Transformer. Extensive experiments on a battery of NLP tasks demonstrate the utility of the proposed Quaternion-inspired models, enabling up to $75\%$ reduction in parameter size without significant loss in performance.
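
The Hamilton product is the key to the parameter savings: one set of four small weight blocks is reused across the four quaternion components. Below is a minimal NumPy sketch of a quaternion-style linear layer (illustrative only, not the paper's code; the layer name and dimensions are assumptions):

```python
import numpy as np

def hamilton_product_layer(x, Wr, Wi, Wj, Wk):
    """Quaternion-style linear map: a minimal sketch, not the paper's code.

    x: input of size d, viewed as 4 components (r, i, j, k) of size d//4.
    W*: four (d//4, d//4) weight blocks shared across components via the
        Hamilton product, giving ~4x fewer parameters than a dense d x d map.
    """
    r, i, j, k = np.split(x, 4)
    out_r = Wr @ r - Wi @ i - Wj @ j - Wk @ k
    out_i = Wr @ i + Wi @ r + Wj @ k - Wk @ j
    out_j = Wr @ j - Wi @ k + Wj @ r + Wk @ i
    out_k = Wr @ k + Wi @ j - Wj @ i + Wk @ r
    return np.concatenate([out_r, out_i, out_j, out_k])

d = 8
rng = np.random.default_rng(0)
Wr, Wi, Wj, Wk = (rng.normal(size=(d // 4, d // 4)) for _ in range(4))
y = hamilton_product_layer(rng.normal(size=d), Wr, Wi, Wj, Wk)
# Parameter count: 4 * (d/4)^2 = d^2 / 4, i.e. a 75% reduction vs. a dense d x d weight.
```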

Sentence-Level Evidence Embedding for Claim Verification with Hierarchical Attention Networks
Jing Ma, Wei Gao, Shafiq Joty, Kam-Fai Wong

Identification of Tasks, Datasets, Evaluation Metrics, and Numeric Scores for Scientific Leaderboards Construction link corpus
Yufang Hou, Charles Jochim, Martin Gleize, Francesca Bonin, Debasis Ganguly

While the fast-paced inception of novel tasks and new datasets helps foster active research in a community towards interesting directions, keeping track of the abundance of research activity in different areas on different datasets is likely to become increasingly difficult. The community could greatly benefit from an automatic system able to summarize scientific results, e.g., in the form of a leaderboard. In this paper we build two datasets and develop a framework (TDMS-IE) aimed at automatically extracting task, dataset, metric and score from NLP papers, towards the automatic construction of leaderboards. Experiments show that our model outperforms several baselines by a large margin. Our model is a first step towards automatic leaderboard construction, e.g., in the NLP domain.

Towards Empathetic Open-domain Conversation Models: a New Benchmark and Dataset link dialogue/conversation
Hannah Rashkin, Eric Michael Smith, Margaret Li, Y-Lan Boureau

One challenge for dialogue agents is recognizing feelings in the conversation partner and replying accordingly, a key communicative skill. While it is straightforward for humans to recognize and acknowledge others' feelings in a conversation, this is a significant challenge for AI systems due to the paucity of suitable publicly-available datasets for training and evaluation. This work proposes a new benchmark for empathetic dialogue generation and EmpatheticDialogues, a novel dataset of 25k conversations grounded in emotional situations. Our experiments indicate that dialogue models that use our dataset are perceived to be more empathetic by human evaluators, compared to models merely trained on large-scale Internet conversation data. We also present empirical comparisons of dialogue model adaptations for empathetic responding, leveraging existing models or datasets without requiring lengthy re-training of the full model.

Predicting Human Activities from User-Generated Content
Steven Wilson, Rada Mihalcea

Scaling Up Open Tagging from Tens to Thousands: Comprehension Empowered Attribute Value Extraction from Product Title IE
Huimin Xu, Wenting Wang, Xin Mao, Xinyu Jiang, Man Lan

Multi-hop Reading Comprehension through Question Decomposition and Rescoring link
Sewon Min, Victor Zhong, Luke Zettlemoyer, Hannaneh Hajishirzi

Multi-hop Reading Comprehension (RC) requires reasoning and aggregation across several paragraphs. We propose a system for multi-hop RC that decomposes a compositional question into simpler sub-questions that can be answered by off-the-shelf single-hop RC models. Since annotations for such decomposition are expensive, we recast sub-question generation as a span prediction problem and show that our method, trained using only 400 labeled examples, generates sub-questions that are as effective as human-authored sub-questions. We also introduce a new global rescoring approach that considers each decomposition (i.e. the sub-questions and their answers) to select the best final answer, greatly improving overall performance. Our experiments on HotpotQA show that this approach achieves the state-of-the-art results, while providing explainable evidence for its decision making in the form of sub-questions.

Incremental Transformer with Deliberation Decoder for Document Grounded Conversations dialogue/conversation
Zekang Li, Cheng Niu, Fandong Meng, Yang Feng, Qian Li, Jie Zhou

Matching the Blanks: Distributional Similarity for Relation Learning link
Livio Baldini Soares, Nicholas FitzGerald, Jeffrey Ling, Tom Kwiatkowski

Stay on the Path: Instruction Fidelity in Vision-and-Language Navigation link
Vihan Jain, Gabriel Magalhaes, Alexander Ku, Ashish Vaswani, Eugene Ie, Jason Baldridge

Advances in learning and representations have reinvigorated work that connects language to other modalities. A particularly exciting direction is Vision-and-Language Navigation(VLN), in which agents interpret natural language instructions and visual scenes to move through environments and reach goals. Despite recent progress, current research leaves unclear how much of a role language understanding plays in this task, especially because dominant evaluation metrics have focused on goal completion rather than the sequence of actions corresponding to the instructions. Here, we highlight shortcomings of current metrics for the Room-to-Room dataset (Anderson et al.,2018b) and propose a new metric, Coverage weighted by Length Score (CLS). We also show that the existing paths in the dataset are not ideal for evaluating instruction following because they are direct-to-goal shortest paths. We join existing short paths to form more challenging extended paths to create a new data set, Room-for-Room (R4R). Using R4R and CLS, we show that agents that receive rewards for instruction fidelity outperform agents that focus on goal completion.

Compound Probabilistic Context-Free Grammars for Grammar Induction link
Yoon Kim, Chris Dyer, Alexander Rush

We study a formalization of the grammar induction problem that models sentences as being generated by a compound probabilistic context-free grammar. In contrast to traditional formulations which learn a single stochastic grammar, our context-free rule probabilities are modulated by a per-sentence continuous latent variable, which induces marginal dependencies beyond the traditional context-free assumptions. Inference in this grammar is performed by collapsed variational inference, in which an amortized variational posterior is placed on the continuous variable, and the latent trees are marginalized with dynamic programming. Experiments on English and Chinese show the effectiveness of our approach compared to recent state-of-the-art methods for grammar induction from words with neural language models.

Jointly Learning Semantic Parser and Natural Language Generator via Dual Information Maximization link NLG parsing
Hai Ye, Wenjie Li, Lu Wang

Semantic parsing aims to transform natural language (NL) utterances into formal meaning representations (MRs), whereas an NL generator achieves the reverse: producing an NL description for some given MRs. Despite this intrinsic connection, the two tasks are often studied separately in prior work. In this paper, we model the duality of these two tasks via a joint learning framework, and demonstrate its effectiveness in boosting performance on both tasks. Concretely, we propose a novel method of dual information maximization (DIM) to regularize the learning process, where DIM empirically maximizes the variational lower bounds of the expected joint distributions of NL and MRs. We further extend DIM to a semi-supervised setup (SemiDIM), which leverages unlabeled data for both tasks. Experiments on three datasets of dialogue management and code generation (and summarization) show that performance on both semantic parsing and NL generation can be consistently improved by DIM, in both supervised and semi-supervised setups.

Accelerating Sparse Matrix Operations in Neural Networks on Graphics Processing Units
Arturo Argueta, David Chiang

Combining Knowledge Hunting and Neural Language Models to Solve the Winograd Schema Challenge
Ashok Prakash, Arpit Sharma, Arindam Mitra, Chitta Baral

HIBERT: Document Level Pre-training of Hierarchical Bidirectional Transformers for Document Summarization link summarization
Xingxing Zhang, Furu Wei, Ming Zhou

Neural extractive summarization models usually employ a hierarchical encoder for document encoding and they are trained using sentence-level labels, which are created heuristically using rule-based methods. Training the hierarchical encoder with these \emph{inaccurate} labels is challenging. Inspired by the recent work on pre-training transformer sentence encoders \cite{devlin:2018:arxiv}, we propose {\sc Hibert} (as shorthand for {\bf HI}erarchical {\bf B}idirectional {\bf E}ncoder {\bf R}epresentations from {\bf T}ransformers) for document encoding and a method to pre-train it using unlabeled data. We apply the pre-trained {\sc Hibert} to our summarization model and it outperforms its randomly initialized counterpart by 1.25 ROUGE on the CNN/Dailymail dataset and by 2.0 ROUGE on a version of the New York Times dataset. We also achieve state-of-the-art performance on these two datasets.

Multi-Task Deep Neural Networks for Natural Language Understanding link
Xiaodong Liu, Pengcheng He, Weizhu Chen, Jianfeng Gao

DisSent: Learning Sentence Representations from Explicit Discourse Relations link RepL
Allen Nie, Erin Bennett, Noah Goodman

Massively Multilingual Transfer for NER link transfer
Afshin Rahimi, Yuan Li, Trevor Cohn

In cross-lingual transfer, NLP models over one or more source languages are applied to a low-resource target language. While most prior work has used a single source model or a few carefully selected models, here we consider a `massive' setting with many such models. This setting raises the problem of poor transfer, particularly from distant languages. We propose two techniques for modulating the transfer, suitable for zero-shot or few-shot learning, respectively. Evaluating on named entity recognition, we show that our techniques are much more effective than strong baselines, including standard ensembling, and our unsupervised method rivals oracle selection of the single best individual model.

COMET: Commonsense Transformers for Automatic Knowledge Graph Construction link
Antoine Bosselut, Hannah Rashkin, Maarten Sap, Chaitanya Malaviya, Asli Celikyilmaz, Yejin Choi

We present the first comprehensive study on automatic knowledge base construction for two prevalent commonsense knowledge graphs: ATOMIC (Sap et al., 2019) and ConceptNet (Speer et al., 2017). Contrary to many conventional KBs that store knowledge with canonical templates, commonsense KBs only store loosely structured open-text descriptions of knowledge. We posit that an important step toward automatic commonsense completion is the development of generative models of commonsense knowledge, and propose COMmonsEnse Transformers (COMET) that learn to generate rich and diverse commonsense descriptions in natural language. Despite the challenges of commonsense modeling, our investigation reveals promising results when implicit knowledge from deep pre-trained language models is transferred to generate explicit knowledge in commonsense knowledge graphs. Empirical results demonstrate that COMET is able to generate novel knowledge that humans rate as high quality, with up to 77.5% (ATOMIC) and 91.7% (ConceptNet) precision at top 1, which approaches human performance for these resources. Our findings suggest that using generative commonsense models for automatic commonsense KB completion could soon be a plausible alternative to extractive methods.

Unsupervised Pivot Translation for Distant Languages link MT
Yichong Leng, Xu Tan, Tao QIN, Xiang-Yang Li, Tie-Yan Liu

Unsupervised neural machine translation (NMT) has attracted a lot of attention recently. While state-of-the-art methods for unsupervised translation usually perform well between similar languages (e.g., English-German translation), they perform poorly between distant languages, because unsupervised alignment does not work well for distant languages. In this work, we introduce unsupervised pivot translation for distant languages, which translates a language to a distant language through multiple hops, where the unsupervised translation on each hop is easier than the original direct translation. We propose a learning to route (LTR) method to choose the translation path between the source and target languages. LTR is trained on language pairs whose best translation path is available and is applied to unseen language pairs for path selection. Experiments on 20 languages and 294 distant language pairs demonstrate the advantages of unsupervised pivot translation for distant languages, as well as the effectiveness of the proposed LTR for path selection. Specifically, in the best case, LTR achieves an improvement of 5.58 BLEU points over the conventional direct unsupervised method.

CogNet: a Large-Scale Cognate Database
Khuyagbaatar Batsuren, Gabor Bella, Fausto Giunchiglia

Semi-supervised Stochastic Multi-Domain Learning using Variational Inference
Yitong Li, Timothy Baldwin, Trevor Cohn

Tree Communication Models for Sentiment Analysis
Yuan Zhang, Yue Zhang

Dynamically Composing Domain-Data Selection with Clean-Data Selection by "Co-Curricular Learning" for Neural Machine Translation link MT
Wei Wang, Isaac Caswell, Ciprian Chelba

Modeling affirmative and negated action processing in the brain with lexical and compositional semantic models
Vesna Djokic, Jean Maillard, Luana Bulat, Ekaterina Shutova

Neural Temporality Adaptation for Document Classification: Diachronic Word Embeddings and Domain Adaptation Models RepL
Xiaolei Huang, Michael J. Paul

Careful Selection of Knowledge to solve Open Book Question Answering QA
Pratyay Banerjee, Kuntal Kumar Pal, Arindam Mitra, Chitta Baral

Tree LSTMs with Convolution Units to Predict Stance and Rumor Veracity in Social Media Conversations dialogue/conversation
Sumeet Kumar, Kathleen Carley

Learning to Select, Track, and Generate for Data-to-Text
Hayate Iso, Yui Uehara, Tatsuya Ishigaki, Hiroshi Noji, Eiji ARAMAKI, Ichiro Kobayashi, Yusuke Miyao, Naoaki Okazaki, Hiroya Takamura

Scoring Sentence Singletons and Pairs for Abstractive Summarization link summarization
Logan Lebanoff, Kaiqiang Song, Franck Dernoncourt, Doo Soon Kim, Seokhwan Kim, Walter Chang, Fei Liu

When writing a summary, humans tend to choose content from one or two sentences and merge them into a single summary sentence. However, the mechanisms behind the selection of one or multiple source sentences remain poorly understood. Sentence fusion assumes multi-sentence input; yet sentence selection methods only work with single sentences and not combinations of them. There is thus a crucial gap between sentence selection and fusion to support summarizing by both compressing single sentences and fusing pairs. This paper attempts to bridge the gap by ranking sentence singletons and pairs together in a unified space. Our proposed framework attempts to model human methodology by selecting either a single sentence or a pair of sentences, then compressing or fusing the sentence(s) to produce a summary sentence. We conduct extensive experiments on both single- and multi-document summarization datasets and report findings on sentence selection and abstraction.

SParC: Cross-Domain Semantic Parsing in Context link parsing
Tao Yu, Rui Zhang, Michihiro Yasunaga, Yi Chern Tan, Xi Victoria Lin, Suyi Li, Heyang Er, Irene Li, Bo Pang, Tao Chen, Emily Ji, Shreya Dixit, David Proctor, Sungrok Shim, Jonathan Kraft, Vincent Zhang, Caiming Xiong, Richard Socher, Dragomir Radev

Scalable Syntax-Aware Language Models Using Knowledge Distillation link
Adhiguna Kuncoro, Chris Dyer, Laura Rimell, Stephen Clark, Phil Blunsom

Prior work has shown that, on small amounts of training data, syntactic neural language models learn structurally sensitive generalisations more successfully than sequential language models. However, their computational complexity renders scaling difficult, and it remains an open question whether structural biases are still necessary when sequential models have access to ever larger amounts of training data. To answer this question, we introduce an efficient knowledge distillation (KD) technique that transfers knowledge from a syntactic language model trained on a small corpus to an LSTM language model, hence enabling the LSTM to develop a more structurally sensitive representation of the larger training data it learns from. On targeted syntactic evaluations, we find that, while sequential LSTMs perform much better than previously reported, our proposed technique substantially improves on this baseline, yielding a new state of the art. Our findings and analysis affirm the importance of structural biases, even in models that learn from large amounts of data.
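
The distillation step can be illustrated with a standard token-level objective: the student LSTM is trained to match the teacher's next-word distribution alongside the usual cross-entropy on the gold word. A hedged sketch follows (the interpolation weight `alpha` and temperature `tau` are illustrative assumptions, not the paper's exact objective):

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, targets, alpha=0.5, tau=1.0):
    """Token-level KD loss: a minimal sketch, not the paper's exact objective.

    student_logits, teacher_logits: (batch, vocab) next-word scores.
    targets: (batch,) gold next-word ids.
    alpha, tau: assumed interpolation weight and temperature.
    """
    # Soft targets from the syntactic teacher LM.
    soft = F.kl_div(
        F.log_softmax(student_logits / tau, dim=-1),
        F.softmax(teacher_logits / tau, dim=-1),
        reduction="batchmean",
    ) * (tau ** 2)
    # Usual hard cross-entropy on the gold next word.
    hard = F.cross_entropy(student_logits, targets)
    return alpha * soft + (1 - alpha) * hard
```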

HellaSwag: Can a Machine Really Finish Your Sentence? link
Rowan Zellers, Ari Holtzman, Yonatan Bisk, Ali Farhadi, Yejin Choi

Recent work by Zellers et al. (2018) introduced a new task of commonsense natural language inference: given an event description such as "A woman sits at a piano," a machine must select the most likely followup: "She sets her fingers on the keys." With the introduction of BERT, near human-level performance was reached. Does this mean that machines can perform human level commonsense inference? In this paper, we show that commonsense inference still proves difficult for even state-of-the-art models, by presenting HellaSwag, a new challenge dataset. Though its questions are trivial for humans (>95% accuracy), state-of-the-art models struggle (<48%). We achieve this via Adversarial Filtering (AF), a data collection paradigm wherein a series of discriminators iteratively select an adversarial set of machine-generated wrong answers. AF proves to be surprisingly robust. The key insight is to scale up the length and complexity of the dataset examples towards a critical 'Goldilocks' zone wherein generated text is ridiculous to humans, yet often misclassified by state-of-the-art models. Our construction of HellaSwag, and its resulting difficulty, sheds light on the inner workings of deep pretrained models. More broadly, it suggests a new path forward for NLP research, in which benchmarks co-evolve with the evolving state-of-the-art in an adversarial way, so as to present ever-harder challenges.

Pre-Learning Environment Representations for Data-Efficient Neural Instruction Following RepL
David Gaddy, Dan Klein

Multi-Relational Script Learning for Discourse Relations
I-Ta Lee, Dan Goldwasser

EditNTS: An Neural Programmer-Interpreter Model for Sentence Simplification through Explicit Editing link
Yue Dong, Zichao Li, Mehdi Rezagholizadeh, Jackie Chi Kit Cheung

We present the first sentence simplification model that learns explicit edit operations (ADD, DELETE, and KEEP) via a neural programmer-interpreter approach. Most current neural sentence simplification systems are variants of sequence-to-sequence models adopted from machine translation. These methods learn to simplify sentences as a byproduct of the fact that they are trained on complex-simple sentence pairs. By contrast, our neural programmer-interpreter is directly trained to predict explicit edit operations on targeted parts of the input sentence, resembling the way that humans might perform simplification and revision. Our model outperforms previous state-of-the-art neural sentence simplification models (without external knowledge) by large margins on three benchmark text simplification corpora in terms of SARI (+0.95 WikiLarge, +1.89 WikiSmall, +1.41 Newsela), and is judged by humans to produce overall better and simpler output sentences.
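
To make the edit-program idea concrete, here is a minimal sketch of applying a predicted sequence of KEEP/DELETE/ADD operations to a source sentence (the operation format and the example are illustrative assumptions, not the paper's implementation):

```python
def apply_edits(source_tokens, edit_ops):
    """Apply an explicit edit program to a source sentence.

    A minimal sketch of the idea: KEEP copies the next source token,
    DELETE skips it, and ("ADD", w) inserts word w.
    """
    out, i = [], 0
    for op in edit_ops:
        if op == "KEEP":
            out.append(source_tokens[i])
            i += 1
        elif op == "DELETE":
            i += 1
        else:                      # ("ADD", word)
            out.append(op[1])
    return out

src = ["the", "convention", "was", "ratified", "by", "the", "assembly"]
ops = ["KEEP", "KEEP", "KEEP", ("ADD", "approved"), "DELETE",
       "KEEP", "KEEP", "KEEP"]
print(" ".join(apply_edits(src, ops)))  # the convention was approved by the assembly
```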

Deep Neural Model Inspection and Comparison via Functional Neuron Pathways
James Fiacco, Samridhi Choudhary, Carolyn Rose

Learning Representation Mapping for Relation Detection in Knowledge Base Question Answering QA
Peng Wu, Shujian Huang, Rongxiang Weng, Zaixiang Zheng, Jianbing Zhang, Xiaohui Yan, Jiajun CHEN

Incorporating Syntactic and Semantic Information in Word Embeddings using Graph Convolutional Networks link RepL
Shikhar Vashishth, Manik Bhandari, Prateek Yadav, Piyush Rai, Chiranjib Bhattacharyya, Partha Talukdar

Word embeddings have been widely adopted across several NLP applications. Most existing word embedding methods utilize sequential context of a word to learn its embedding. While there have been some attempts at utilizing syntactic context of a word, such methods result in an explosion of the vocabulary size. In this paper, we overcome this problem by proposing SynGCN, a flexible Graph Convolution based method for learning word embeddings. SynGCN utilizes the dependency context of a word without increasing the vocabulary size. Word embeddings learned by SynGCN outperform existing methods on various intrinsic and extrinsic tasks and provide an advantage when used with ELMo. We also propose SemGCN, an effective framework for incorporating diverse semantic knowledge for further enhancing learned word representations. We make the source code of both models available to encourage reproducible research.
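
As a rough illustration of how a graph convolution uses syntactic rather than sequential context, the sketch below runs one generic GCN update over a tiny dependency graph (this is a textbook GCN layer, not the SynGCN code; shapes and names are assumptions):

```python
import numpy as np

def gcn_layer(node_embeddings, adjacency, weight):
    """One graph-convolution update over a dependency graph.

    A generic sketch: each word's vector is recomputed from its dependency
    neighbors, so syntactic context is used without enlarging the vocabulary.
    """
    # Add self-loops and row-normalize so each node averages over its neighbors.
    a = adjacency + np.eye(adjacency.shape[0])
    a = a / a.sum(axis=1, keepdims=True)
    return np.maximum(0.0, a @ node_embeddings @ weight)   # ReLU(A X W)

# Toy example: 3 words, dependency edges (0-1) and (1-2), 4-dim embeddings.
adj = np.array([[0, 1, 0], [1, 0, 1], [0, 1, 0]], dtype=float)
rng = np.random.default_rng(0)
out = gcn_layer(rng.normal(size=(3, 4)), adj, rng.normal(size=(4, 4)))
```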

Adversarial Multitask Learning for Joint Multi-Feature and Multi-Dialect Morphological Modeling
Nasser Zalmout, Nizar Habash

Bridging by Word: Image Grounded Vocabulary Construction for Visual Captioning
Zhihao Fan, Zhongyu Wei, Siyuan Wang, Xuanjing Huang

Assessing the Ability of Self-Attention Networks to Learn Word Order link
Baosong Yang, Longyue Wang, Derek F. Wong, Lidia S. Chao, Zhaopeng Tu

Self-attention networks (SAN) have attracted a lot of interest due to their high parallelization and strong performance on a variety of NLP tasks, e.g. machine translation. Because they lack the recurrence structure of recurrent neural networks (RNN), SANs are often assumed to be weak at learning positional information of words for sequence modeling. However, this speculation has neither been empirically confirmed, nor have explanations been offered for their strong performance on machine translation despite this supposed lack of positional information. To this end, we propose a novel word reordering detection task to quantify how well word order information is learned by SAN and RNN. Specifically, we randomly move one word to another position, and examine whether a trained model can detect both the original and inserted positions. Experimental results reveal that: 1) SAN trained on word reordering detection indeed has difficulty learning positional information, even with position embeddings; and 2) SAN trained on machine translation learns better positional information than its RNN counterpart, in which case position embeddings play a critical role. Although the recurrence structure makes a model more universally effective at learning word order, learning objectives matter more in downstream tasks such as machine translation.
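
The probing task itself is easy to reproduce in a few lines: move one randomly chosen word to a random position and keep the two positions as labels. A minimal sketch (not the authors' data pipeline) is shown below:

```python
import random

def make_reordering_example(tokens, rng=random):
    """Create one word-reordering-detection instance: a sketch of the task
    described above, not the authors' data pipeline.

    Returns the perturbed sentence, the original position the word was taken
    from, and the position it was inserted at (the two labels a model must
    detect). Note `src` indexes the original sentence and `dst` the perturbed one.
    """
    src = rng.randrange(len(tokens))
    word = tokens[src]
    rest = tokens[:src] + tokens[src + 1:]
    dst = rng.randrange(len(rest) + 1)
    perturbed = rest[:dst] + [word] + rest[dst:]
    return perturbed, src, dst

random.seed(0)
print(make_reordering_example("the cat sat on the mat".split()))
```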

The KnowRef Coreference Corpus: Removing Gender and Number Cues for Difficult Pronominal Anaphora Resolution corpus
Ali Emami, Paul TRICHELAIR, Adam Trischler, Kaheer Suleman, Hannes Schulz, Jackie Chi Kit Cheung

Reinforced Dynamic Reasoning for Conversational Question Generation dialogue/conversation NLG
Boyuan Pan, Hao Li, Ziyu Yao, Deng Cai, Huan Sun

StRE: Self Attentive Edit Quality Prediction in Wikipedia link
Soumya Sarkar, Bhanu Prakash Reddy, Sandipan Sikdar, Animesh Mukherjee

Wikipedia is a behemoth by any measure, considering the sheer volume of content that is added to or removed from its many projects every minute. This creates immense scope for natural language processing to develop automated tools for content moderation and review. In this paper we propose the Self Attentive Revision Encoder (StRE), which leverages orthographic similarity of lexical units to predict the quality of new edits. In contrast to existing approaches, which primarily employ features such as page reputation, editor activity, or rule-based heuristics, we utilize the textual content of the edits, which we believe carries stronger signals of their quality. More specifically, we deploy deep encoders to generate representations of the edits from their text content, which we then leverage to infer quality. We further contribute a novel dataset containing 21M revisions across 32K Wikipedia pages and demonstrate that StRE outperforms existing methods by a significant margin (at least 17% and at most 103%). Our pretrained model achieves this result after retraining on as little as 20% of the edits in a Wikipedia page. This is also, to the best of our knowledge, the first attempt to apply deep language models to the enormous domain of automated content moderation and review in Wikipedia.

Improving the Similarity Measure of Determinantal Point Processes for Extractive Multi-Document Summarization link summarization
Sangwoo Cho, Logan Lebanoff, Hassan Foroosh, Fei Liu

The most important obstacles facing multi-document summarization include excessive redundancy in source descriptions and the looming shortage of training data. These obstacles prevent encoder-decoder models from being used directly, but optimization-based methods such as determinantal point processes (DPPs) are known to handle them well. In this paper we seek to strengthen a DPP-based method for extractive multi-document summarization by presenting a novel similarity measure inspired by capsule networks. The approach measures redundancy between a pair of sentences based on surface form and semantic information. We show that our DPP system with improved similarity measure performs competitively, outperforming strong summarization baselines on benchmark datasets. Our findings are particularly meaningful for summarizing documents created by multiple authors containing redundant yet lexically diverse expressions.
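
For readers unfamiliar with DPPs, the sketch below shows generic greedy MAP selection from a kernel that combines per-sentence quality with pairwise similarity; it illustrates how determinants trade quality off against redundancy and is not the paper's system (the capsule-network similarity measure that is the paper's contribution is not reproduced here):

```python
import numpy as np

def greedy_dpp_select(quality, similarity, k):
    """Greedy MAP selection from a DPP with kernel L = diag(q) S diag(q).

    A generic sketch of DPP-based extractive selection: `quality` scores each
    sentence, `similarity` is a pairwise matrix, and the determinant of the
    selected submatrix rewards quality while penalizing redundancy.
    """
    L = np.outer(quality, quality) * similarity
    selected = []
    for _ in range(k):
        best_i, best_logdet = None, -np.inf
        for i in range(len(quality)):
            if i in selected:
                continue
            idx = selected + [i]
            sign, logdet = np.linalg.slogdet(L[np.ix_(idx, idx)])
            if sign > 0 and logdet > best_logdet:
                best_i, best_logdet = i, logdet
        selected.append(best_i)
    return selected

q = np.array([0.9, 0.8, 0.7, 0.4])
S = np.array([[1.0, 0.9, 0.1, 0.2],
              [0.9, 1.0, 0.2, 0.1],
              [0.1, 0.2, 1.0, 0.3],
              [0.2, 0.1, 0.3, 1.0]])
print(greedy_dpp_select(q, S, k=2))   # picks high-quality but non-redundant sentences
```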

Dynamically Fused Graph Network for Multi-hop Reasoning link
Lin Qiu, Yunxuan Xiao, Yanru Qu, Hao Zhou, Lei Li, Weinan Zhang, Yong Yu

Text-based question answering (TBQA) has been studied extensively in recent years. Most existing approaches focus on finding the answer to a question within a single paragraph. However, many difficult questions require multiple pieces of supporting evidence scattered across two or more documents. In this paper, we propose the Dynamically Fused Graph Network (DFGN), a novel method for answering questions that require gathering scattered evidence and reasoning over it. Inspired by humans' step-by-step reasoning behavior, DFGN includes a dynamic fusion layer that starts from the entities mentioned in the given query, explores along the entity graph dynamically built from the text, and gradually finds relevant supporting entities from the given documents. We evaluate DFGN on HotpotQA, a public TBQA dataset requiring multi-hop reasoning, where it achieves competitive results on the public leaderboard. Furthermore, our analysis shows DFGN produces interpretable reasoning chains.

On the Word Alignment from Neural Machine Translation MT
Xintong Li, Guanlin Li, Lemao Liu, Max Meng, Shuming Shi

Show, Describe and Conclude: On Exploiting the Structure Information of Chest X-ray Report
Baoyu Jing, Zeya Wang, Eric Xing

The Linguistic Development of Mental Health Counselors
Justine Zhang, Robert Filbin, Christine Morrison, Jaclyn Weiser, Cristian Danescu-Niculescu-Mizil

Towards Multimodal Sarcasm Detection (An _Obviously_ Perfect Paper) link
Santiago Castro, Devamanyu Hazarika, Verónica Pérez-Rosas, Roger Zimmermann, Rada Mihalcea, Soujanya Poria

Sarcasm is often expressed through several verbal and non-verbal cues, e.g., a change of tone, overemphasis in a word, a drawn-out syllable, or a straight face. Most of the recent work in sarcasm detection has been carried out on textual data. In this paper, we argue that incorporating multimodal cues can improve the automatic classification of sarcasm. As a first step towards enabling the development of multimodal approaches for sarcasm detection, we propose a new sarcasm dataset, the Multimodal Sarcasm Detection Dataset (MUStARD), compiled from popular TV shows. MUStARD consists of audiovisual utterances annotated with sarcasm labels. Each utterance is accompanied by its context of historical utterances in the dialogue, which provides additional information on the scenario where the utterance occurs. Our initial results show that the use of multimodal information can reduce the relative error rate of sarcasm detection by up to 12.9% in F-score when compared to the use of individual modalities. The full dataset is publicly available for use at https://github.com/soujanyaporia/MUStARD

Curate and Generate: A Corpus and Method for Joint Control of Semantics and Style in Neural NLG link NLG
Shereen Oraby, Vrindavan Harrison, Abteen Ebrahimi, Marilyn Walker

Neural natural language generation (NNLG) from structured meaning representations has become increasingly popular in recent years. While we have seen progress with generating syntactically correct utterances that preserve semantics, various shortcomings of NNLG systems are clear: new tasks require new training data which is not available or straightforward to acquire, and model outputs are simple and may be dull and repetitive. This paper addresses these two critical challenges in NNLG by: (1) scalably (and at no cost) creating training datasets of parallel meaning representations and reference texts with rich style markup by using data from freely available and naturally descriptive user reviews, and (2) systematically exploring how the style markup enables joint control of semantic and stylistic aspects of neural model output. We present YelpNLG, a corpus of 300,000 rich, parallel meaning representations and highly stylistically varied reference texts spanning different restaurant attributes, and describe a novel methodology that can be scalably reused to generate NLG datasets for other domains. The experiments show that the models control important aspects, including lexical choice of adjectives, output length, and sentiment, allowing the models to successfully hit multiple style targets without sacrificing semantics.

Variational Pretraining for Semi-supervised Text Classification link
Suchin Gururangan, Tam Dang, Dallas Card, Noah A. Smith

We introduce VAMPIRE, a lightweight pretraining framework for effective text classification when data and computing resources are limited. We pretrain a unigram document model as a variational autoencoder on in-domain, unlabeled data and use its internal states as features in a downstream classifier. Empirically, we show the relative strength of VAMPIRE against computationally expensive contextual embeddings and other popular semi-supervised baselines under low resource settings. We also find that fine-tuning to in-domain data is crucial to achieving decent performance from contextual embeddings when working with limited supervision. We accompany this paper with code to pretrain and use VAMPIRE embeddings in downstream tasks.
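
A compact way to picture the pretraining step is a unigram (bag-of-words) VAE whose internal states are later reused as features; the sketch below is a minimal PyTorch version under that reading, not the released VAMPIRE code (layer sizes and names are assumptions):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class UnigramVAE(nn.Module):
    """Unigram (bag-of-words) document VAE: a minimal sketch in the spirit of
    the pretraining step described above, not the released VAMPIRE code."""

    def __init__(self, vocab_size, hidden=64, latent=32):
        super().__init__()
        self.enc = nn.Linear(vocab_size, hidden)
        self.mu = nn.Linear(hidden, latent)
        self.logvar = nn.Linear(hidden, latent)
        self.dec = nn.Linear(latent, vocab_size)

    def forward(self, bow):                      # bow: (batch, vocab) word counts
        h = torch.relu(self.enc(bow))
        mu, logvar = self.mu(h), self.logvar(h)
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)   # reparameterization
        recon = F.log_softmax(self.dec(z), dim=-1)
        nll = -(bow * recon).sum(-1).mean()       # multinomial reconstruction loss
        kl = -0.5 * (1 + logvar - mu.pow(2) - logvar.exp()).sum(-1).mean()
        return nll + kl, h, mu                    # internal states reusable as features
```

After pretraining on unlabeled in-domain text, the hidden state `h` and/or the latent mean `mu` would be concatenated to the inputs of a small downstream classifier.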

Imitation Learning for Non-Autoregressive Neural Machine Translation MT
Bingzhen Wei, Mingxuan Wang, Hao Zhou, Junyang Lin, Xu SUN

Second-Order Semantic Dependency Parsing with End-to-End Neural Networks link parsing
Xinyu Wang, Jingxian Huang, Kewei Tu

Semantic dependency parsing aims to identify semantic relationships between words in a sentence that form a graph. In this paper, we propose a second-order semantic dependency parser, which takes into consideration not only individual dependency edges but also interactions between pairs of edges. We show that second-order parsing can be approximated using mean field (MF) variational inference or loopy belief propagation (LBP). We can unfold both algorithms as recurrent layers of a neural network and therefore can train the parser in an end-to-end manner. Our experiments show that our approach achieves state-of-the-art performance.

Automated Chess Commentator Powered by Neural Chess Engine
Hongyu Zang, Zhiwei Yu, Xiaojun Wan

Domain Adaptive Dialog Generation via Meta Learning link dialogue/conversation NLG
Kun Qian, Zhou Yu

Domain adaptation is an essential task in dialog system building because many new dialog tasks are created for different needs every day. Collecting and annotating training data for these new tasks is costly since it involves real user interactions. We propose a domain-adaptive dialog generation method based on meta-learning (DAML). DAML is an end-to-end trainable dialog system model that learns from multiple rich-resource tasks and then adapts to new domains with minimal training samples. We train a dialog system model using multiple rich-resource single-domain dialog datasets by applying the model-agnostic meta-learning algorithm to the dialog domain. The model is capable of learning a competitive dialog system on a new domain with only a few training examples in an efficient manner. The two-step gradient updates in DAML enable the model to learn general features across multiple tasks. We evaluate our method on a simulated dialog dataset and achieve state-of-the-art performance, which is generalizable to new tasks.
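
The two-step gradient update mentioned above is the standard MAML recipe: adapt a copy of the model on each domain's support data, then update the shared initialization from the post-adaptation losses. A first-order-style sketch follows (the task format, loss function, and learning rates are illustrative assumptions, not the DAML code):

```python
import copy
import torch

def maml_outer_step(model, tasks, loss_fn, inner_lr=0.01, outer_lr=0.001):
    """One meta-update over a batch of dialog domains (tasks).

    A minimal first-order-style sketch of the two-step gradient update:
    adapt a copy on each task's support set (inner step), then update the
    shared initialization from the query-set losses (outer step).
    `tasks` is assumed to be a list of (support_batch, query_batch) pairs and
    `loss_fn(model, batch)` an assumed callable returning a scalar loss.
    """
    meta_opt = torch.optim.SGD(model.parameters(), lr=outer_lr)
    meta_opt.zero_grad()
    for support, query in tasks:
        learner = copy.deepcopy(model)
        # Inner step: one gradient update on the support data.
        inner_loss = loss_fn(learner, support)
        grads = torch.autograd.grad(inner_loss, list(learner.parameters()))
        with torch.no_grad():
            for p, g in zip(learner.parameters(), grads):
                p -= inner_lr * g
        # Outer step: accumulate gradients of the post-adaptation query loss.
        query_loss = loss_fn(learner, query)
        query_grads = torch.autograd.grad(query_loss, list(learner.parameters()))
        for p, g in zip(model.parameters(), query_grads):
            p.grad = g if p.grad is None else p.grad + g
    meta_opt.step()
```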

Cross-Sentence Grammatical Error Correction
Shamil Chollampatt, Weiqi Wang, Hwee Tou Ng

Course Concept Expansion in MOOCs with External Knowledge and Interactive Game
Jifan Yu, Chenyu Wang, Gan Luo, Lei Hou, Juanzi Li, Jie Tang, Zhiyuan Liu

Word and Document Embedding with vMF-Mixture Priors on Context Word Vectors
Shoaib Jameel, Steven Schockaert

Avoiding Reasoning Shortcuts: Adversarial Evaluation, Training, and Model Development for Multi-Hop QA
Yichen Jiang, Mohit Bansal

TWEETQA: A Social Media Focused Question Answering Dataset QA
Wenhan Xiong, Jiawei Wu, Hong Wang, Vivek Kulkarni, Mo Yu, Xiaoxiao Guo, Shiyu Chang, William Yang Wang

Is Attention Interpretable? link
Sofia Serrano, Noah A. Smith

Exploiting Explicit Paths for Multi-hop Reading Comprehension link
Souvik Kundu, Tushar Khot, Ashish Sabharwal, Peter Clark

We focus on the task of multi-hop reading comprehension where a system is required to reason over a chain of multiple facts, distributed across multiple passages, to answer a question. Inspired by graph-based reasoning, we present a path-based reasoning approach for textual reading comprehension. It operates by generating potential paths across multiple passages, extracting implicit relations along this path, and composing them to encode each path. The proposed model achieves a 2.3% gain on the WikiHop Dev set as compared to previous state-of-the-art and, as a side-effect, is also able to explain its reasoning through explicit paths of sentences.

Incorporating Priors with Feature Attribution on Text Classification link
Frederick Liu, Besim Avci

Feature attribution methods, proposed recently, help users interpret the predictions of complex models. Our approach integrates feature attributions into the objective function, allowing machine learning practitioners to incorporate priors during model building. To demonstrate the effectiveness of our technique, we apply it to two tasks: (1) mitigating unintended bias in text classifiers by neutralizing identity terms; (2) improving classifier performance in a scarce-data setting by forcing the model to focus on toxic terms. Our approach adds an L2 distance loss between feature attributions and task-specific prior values to the objective. Our experiments show that i) a classifier trained with our technique reduces undesired model biases without a trade-off on the original task; ii) incorporating priors helps model performance in scarce-data settings.
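
The objective described above can be sketched directly: compute a per-token attribution, then penalize its L2 distance from the prior. The code below assumes gradient-times-input attributions on token embeddings and a hypothetical `model` that maps embeddings to logits; the paper's exact attribution method may differ:

```python
import torch
import torch.nn.functional as F

def loss_with_attribution_prior(model, embeddings, labels, prior, lam=1.0):
    """Task loss plus an L2 penalty tying feature attributions to priors.

    A sketch under assumptions: attributions are approximated as
    gradient-times-input on the token embeddings, and `prior` holds the
    target attribution per token (e.g. ~0 for identity terms to neutralize them).
    """
    embeddings = embeddings.clone().requires_grad_(True)
    logits = model(embeddings)                       # assumed: embeddings -> logits
    task_loss = F.cross_entropy(logits, labels)
    grads = torch.autograd.grad(task_loss, embeddings, create_graph=True)[0]
    attributions = (grads * embeddings).sum(dim=-1)  # one attribution score per token
    prior_loss = ((attributions - prior) ** 2).mean()
    return task_loss + lam * prior_loss
```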

Explain Yourself! Leveraging Language Models for Commonsense Reasoning link
Nazneen Fatema Rajani, Bryan McCann, Caiming Xiong, Richard Socher

Sequence Tagging with Contextual and Non-Contextual Subword Representations: A Multilingual Evaluation link RepL
Benjamin Heinzerling, Michael Strube

Pretrained contextual and non-contextual subword embeddings have become available in over 250 languages, allowing massively multilingual NLP. However, while there is no dearth of pretrained embeddings, the distinct lack of systematic evaluations makes it difficult for practitioners to choose between them. In this work, we conduct an extensive evaluation comparing non-contextual subword embeddings, namely FastText and BPEmb, and a contextual representation method, namely BERT, on multilingual named entity recognition and part-of-speech tagging. We find that overall, a combination of BERT, BPEmb, and character representations works best across languages and tasks. A more detailed analysis reveals different strengths and weaknesses: Multilingual BERT performs well in medium- to high-resource languages, but is outperformed by non-contextual subword embeddings in a low-resource setting.

Multimodal and Multi-view Models for Emotion Recognition
Gustavo Aguilar, Viktor Rozgic, Weiran Wang, Chao Wang

Learning How to Active Learn by Dreaming
Thuy-Trang Vu, Ming Liu, Dinh Phung, Gholamreza Haffari

Cross-Lingual Syntactic Transfer through Unsupervised Adaptation of Invertible Projections link transfer
Junxian He, Zhisong Zhang, Taylor Berg-Kirkpatrick, Graham Neubig

Cross-lingual transfer is an effective way to build syntactic analysis tools for low-resource languages. However, transfer is difficult when transferring to typologically distant languages, especially when neither annotated target data nor parallel corpora are available. In this paper, we focus on methods for cross-lingual transfer to distant languages and propose to learn a generative model with a structured prior that utilizes labeled source data and unlabeled target data jointly. The parameters of the source and target models are softly shared through a regularized log-likelihood objective. An invertible projection is employed to learn a new interlingual latent embedding space that compensates for imperfect cross-lingual word embedding input. We evaluate our method on two syntactic tasks: part-of-speech (POS) tagging and dependency parsing. On the Universal Dependency Treebanks, we use English as the only source corpus and transfer to a wide range of target languages. On the 10 languages in this dataset that are distant from English, our method yields an average of 5.2% absolute improvement on POS tagging and 8.3% absolute improvement on dependency parsing over a direct transfer method using state-of-the-art discriminative models.

Handling Domain Shift in Coreference Evaluation by Using Automatically Extracted Minimum Spans
Nafise Sadat Moosavi, Leo Born, Massimo Poesio, Michael Strube

Interpretable Question Answering on Knowledge Bases and Text QA
Alona Sydorova, Nina Poerner, Benjamin Roth

Sentence Mover's Similarity: Automatic Evaluation for Multi-Sentence Texts
Elizabeth Clark, Asli Celikyilmaz, Noah A. Smith

Know More about Each Other: Evolving Dialogue Strategy via Compound Assessment link dialogue/conversation
Siqi Bao, Huang He, Fan Wang, Rongzhong Lian, Hua Wu

In this paper, a novel Generation-Evaluation framework is developed for multi-turn conversations with the objective of letting both participants know more about each other. For the sake of rational knowledge utilization and coherent conversation flow, a dialogue strategy which controls knowledge selection is instantiated and continuously adapted via reinforcement learning. Under the deployed strategy, knowledge grounded conversations are conducted with two dialogue agents. The generated dialogues are comprehensively evaluated on aspects like informativeness and coherence, which are aligned with our objective and human instinct. These assessments are integrated as a compound reward to guide the evolution of dialogue strategy via policy gradient. Comprehensive experiments have been carried out on the publicly available dataset, demonstrating that the proposed method outperforms the other state-of-the-art approaches significantly.

DeepSentiPeer: Harnessing Sentiment in Review Texts To Recommend Peer Review Decisions
Tirthankar Ghosal, Rajeev Verma, Asif Ekbal, Pushpak Bhattacharyya

Unified Semantic Parsing with Weak Supervision link parsing
Priyanka Agrawal, Ayushi Dalmia, Parag Jain, Abhishek Bansal, Ashish Mittal, Karthik Sankaranarayanan

Semantic parsing over multiple knowledge bases enables a parser to exploit structural similarities of programs across domains. However, the fundamental challenge lies in obtaining high-quality annotations of (utterance, program) pairs across the various domains needed for training such models. To overcome this, we propose a novel framework to build a unified multi-domain semantic parser trained only with weak supervision (denotations). Weakly supervised training is particularly arduous as the program search space grows exponentially in a multi-domain setting. To solve this, we incorporate a multi-policy distillation mechanism in which we first train domain-specific semantic parsers (teachers) using weak supervision in the absence of ground-truth programs, followed by training a single unified parser (student) from the domain-specific policies obtained from these teachers. The resultant semantic parser is not only compact but also generalizes better and generates more accurate programs. It further does not require the user to provide a domain label while querying. On the standard Overnight dataset (containing multiple domains), we demonstrate that the proposed model improves performance by 20% in terms of denotation accuracy in comparison to baseline techniques.

How Large Are Lions? Inducing Distributions over Quantitative Attributes link
Yanai Elazar, Abhijit Mahabal, Deepak Ramachandran, Tania Bedrax-Weiss, Dan Roth

Most current NLP systems have little knowledge about quantitative attributes of objects and events. We propose an unsupervised method for collecting quantitative information from large amounts of web data, and use it to create a new, very large resource consisting of distributions over physical quantities associated with objects, adjectives, and verbs which we call Distributions over Quantitative (DoQ). This contrasts with recent work in this area which has focused on making only relative comparisons such as "Is a lion bigger than a wolf?". Our evaluation shows that DoQ compares favorably with state of the art results on existing datasets for relative comparisons of nouns and adjectives, and on a new dataset we introduce.

Combating Adversarial Misspellings with Robust Word Recognition link
Danish Pruthi, Bhuwan Dhingra, Zachary C. Lipton

To combat adversarial spelling mistakes, we propose placing a word recognition model in front of the downstream classifier. Our word recognition models build upon the RNN semi-character architecture, introducing several new backoff strategies for handling rare and unseen words. Trained to recognize words corrupted by random adds, drops, swaps, and keyboard mistakes, our method achieves 32% relative (and 3.3% absolute) error reduction over the vanilla semi-character model. Notably, our pipeline confers robustness on the downstream classifier, outperforming both adversarial training and off-the-shelf spell checkers. Against a BERT model fine-tuned for sentiment analysis, a single adversarially-chosen character attack lowers accuracy from 90.3% to 45.8%. Our defense restores accuracy to 75%. Surprisingly, better word recognition does not always entail greater robustness. Our analysis reveals that robustness also depends upon a quantity that we denote the sensitivity.
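
The four corruption types are simple to reproduce; the sketch below generates one random add/drop/swap/keyboard perturbation per word (the keyboard-neighbor map is a toy placeholder, and this is not the authors' attack code):

```python
import random
import string

KEYBOARD_NEIGHBORS = {"a": "qwsz", "s": "awedxz", "e": "wsdr"}  # tiny illustrative map

def perturb(word, rng=random):
    """Corrupt a word with one random add/drop/swap/keyboard mistake.

    A sketch of the perturbation types named above, not the authors' code.
    """
    if len(word) < 2:
        return word
    i = rng.randrange(len(word))
    kind = rng.choice(["add", "drop", "swap", "key"])
    if kind == "add":
        return word[:i] + rng.choice(string.ascii_lowercase) + word[i:]
    if kind == "drop":
        return word[:i] + word[i + 1:]
    if kind == "swap" and i < len(word) - 1:
        return word[:i] + word[i + 1] + word[i] + word[i + 2:]
    neighbors = KEYBOARD_NEIGHBORS.get(word[i], string.ascii_lowercase)
    return word[:i] + rng.choice(neighbors) + word[i + 1:]

random.seed(1)
print([perturb("sentiment") for _ in range(4)])
```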

Barack's Wife Hillary: Using Knowledge Graphs for Fact-Aware Language Modeling link
Robert Logan, Nelson F. Liu, Matthew E. Peters, Matt Gardner, Sameer Singh

Unsupervised Learning of PCFGs with Normalizing Flow
Lifeng Jin, Finale Doshi-Velez, Timothy Miller, Lane Schwartz, William Schuler

Towards Complex Text-to-SQL in Cross-Domain Database with Intermediate Representation link RepL
Jiaqi Guo, Zecheng Zhan, Yan Gao, Yan Xiao, Jian-Guang LOU, Ting Liu, Dongmei Zhang

We present a neural approach called IRNet for complex and cross-domain Text-to-SQL. IRNet aims to address two challenges: 1) the mismatch between intents expressed in natural language (NL) and the implementation details in SQL; 2) the challenge in predicting columns caused by the large number of out-of-domain words. Instead of end-to-end synthesizing a SQL query, IRNet decomposes the synthesis process into three phases. In the first phase, IRNet performs a schema linking over a question and a database schema. Then, IRNet adopts a grammar-based neural model to synthesize a SemQL query which is an intermediate representation that we design to bridge NL and SQL. Finally, IRNet deterministically infers a SQL query from the synthesized SemQL query with domain knowledge. On the challenging Text-to-SQL benchmark Spider, IRNet achieves 46.7% accuracy, obtaining 19.5% absolute improvement over previous state-of-the-art approaches. At the time of writing, IRNet achieves the first position on the Spider leaderboard.

Variance of average surprisal: a better predictor for quality of grammar from unsupervised PCFG induction
Lifeng Jin, William Schuler

Training Neural Response Selection for Task-Oriented Dialogue Systems link dialogue/conversation
Matthew Henderson, Ivan Vulić, Daniela Gerz, Iñigo Casanueva, Paweł Budzianowski, Sam Coope, Georgios Spithourakis, Tsung-Hsien Wen, Nikola Mrkšić, Pei-Hao Su

Despite their popularity in the chatbot literature, retrieval-based models have had modest impact on task-oriented dialogue systems, with the main obstacle to their application being the low-data regime of most task-oriented dialogue tasks. Inspired by the recent success of pretraining in language modelling, we propose an effective method for deploying response selection in task-oriented dialogue. To train response selection models for task-oriented dialogue tasks, we propose a novel method which: 1) pretrains the response selection model on large general-domain conversational corpora; and then 2) fine-tunes the pretrained model for the target dialogue domain, relying only on the small in-domain dataset to capture the nuances of the given dialogue domain. Our evaluation on six diverse application domains, ranging from e-commerce to banking, demonstrates the effectiveness of the proposed training method.

Cross-Domain NER using Cross-Domain Language Modeling NER
Chen Jia, Xiaobo Liang, Yue Zhang

Improving Multi-turn Dialogue Modelling with Utterance ReWriter link dialogue/conversation
Hui Su, Xiaoyu Shen, Rongzhi Zhang, Fei Sun, Pengwei Hu, Cheng Niu, Jie Zhou

Recent research has made impressive progress in single-turn dialogue modelling. In the multi-turn setting, however, current models are still far from satisfactory. One major challenge is the coreference and information omission that frequently occur in everyday conversation, making it hard for machines to understand the real intention. In this paper, we propose rewriting the human utterance as a pre-processing step to help multi-turn dialogue modelling. Each utterance is first rewritten to recover all coreferred and omitted information. The next processing steps are then performed based on the rewritten utterance. To properly train the utterance rewriter, we collect a new dataset with human annotations and introduce a Transformer-based utterance rewriting architecture using the pointer network. We show the proposed architecture achieves remarkably good performance on the utterance rewriting task. The trained utterance rewriter can be easily integrated into online chatbots and brings general improvement over different domains.

Fine-tuning Pre-Trained Transformer Language Models to Distantly Supervised Relation Extraction link IE
Christoph Alt, Marc Hübner, Leonhard Hennig

Distantly supervised relation extraction is widely used to extract relational facts from text, but suffers from noisy labels. Current relation extraction methods try to alleviate the noise by multi-instance learning and by providing supporting linguistic and contextual information to more efficiently guide the relation classification. While achieving state-of-the-art results, we observed these models to be biased towards recognizing a limited set of relations with high precision, while ignoring those in the long tail. To address this gap, we utilize a pre-trained language model, the OpenAI Generative Pre-trained Transformer (GPT) [Radford et al., 2018]. The GPT and similar models have been shown to capture semantic and syntactic features, and also a notable amount of "common-sense" knowledge, which we hypothesize are important features for recognizing a more diverse set of relations. By extending the GPT to the distantly supervised setting, and fine-tuning it on the NYT10 dataset, we show that it predicts a larger set of distinct relation types with high confidence. Manual and automated evaluation of our model shows that it achieves a state-of-the-art AUC score of 0.422 on the NYT10 dataset, and performs especially well at higher recall levels.

Controllable Paraphrase Generation with a Syntactic Exemplar link NLG
Mingda Chen, Qingming Tang, Sam Wiseman, Kevin Gimpel

Prior work on controllable text generation usually assumes that the controlled attribute can take on one of a small set of values known a priori. In this work, we propose a novel task, where the syntax of a generated sentence is controlled rather by a sentential exemplar. To evaluate quantitatively with standard metrics, we create a novel dataset with human annotations. We also develop a variational model with a neural module specifically designed for capturing syntactic knowledge and several multitask training objectives to promote disentangled representation learning. Empirically, the proposed model is observed to achieve improvements over baselines and learn to capture desirable characteristics.

Collaborative Dialogue in Minecraft dialogue/conversation
Anjali Narayan-Chen, Prashant Jayannavar, Julia Hockenmaier

Conversing by Reading: Contentful Neural Conversation with On-demand Machine Reading link dialogue/conversation
Lianhui Qin, Michel Galley, Chris Brockett, Xiaodong Liu, Xiang Gao, Bill Dolan, Yejin Choi, Jianfeng Gao

Although neural conversation models are effective in learning how to produce fluent responses, their primary challenge lies in knowing what to say to make the conversation contentful and non-vacuous. We present a new end-to-end approach to contentful neural conversation that jointly models response generation and on-demand machine reading. The key idea is to provide the conversation model with relevant long-form text on the fly as a source of external knowledge. The model performs QA-style reading comprehension on this text in response to each conversational turn, thereby allowing for more focused integration of external knowledge than has been possible in prior approaches. To support further research on knowledge-grounded conversation, we introduce a new large-scale conversation dataset grounded in external web pages (2.8M turns, 7.4M sentences of grounding). Both human evaluation and automated metrics show that our approach results in more contentful responses compared to a variety of previous methods, improving both the informativeness and diversity of generated output.

Towards Scalable and Reliable Capsule Networks for Challenging NLP Applications link
Wei Zhao, Haiyun Peng, Steffen Eger, Erik Cambria, Min Yang

Obstacles hindering the development of capsule networks for challenging NLP applications include poor scalability to large output spaces and less reliable routing processes. In this paper, we introduce: 1) an agreement score to evaluate the performance of routing processes at instance level; 2) an adaptive optimizer to enhance the reliability of routing; 3) capsule compression and partial routing to improve the scalability of capsule networks. We validate our approach on two NLP tasks, namely: multi-label text classification and question answering. Experimental results show that our approach considerably improves over strong competitors on both tasks. In addition, we gain the best results in low-resource settings with few training instances.

Decomposable Neural Paraphrase Generation link NLG
Zichao Li, Xin Jiang, Lifeng Shang, Qun Liu

Paraphrasing exists at different granularity levels, such as lexical level, phrasal level and sentential level. This paper presents Decomposable Neural Paraphrase Generator (DNPG), a Transformer-based model that can learn and generate paraphrases of a sentence at different levels of granularity in a disentangled way. Specifically, the model is composed of multiple encoders and decoders with different structures, each of which corresponds to a specific granularity. The empirical study shows that the decomposition mechanism of DNPG makes paraphrase generation more interpretable and controllable. Based on DNPG, we further develop an unsupervised domain adaptation method for paraphrase generation. Experimental results show that the proposed model achieves competitive in-domain performance compared to the state-of-the-art neural models, and significantly better performance when adapting to a new domain.

Incorporating Linguistic Constraints into Keyphrase Generation NLG
Jing Zhao, Yuxiang Zhang

Few-Shot Representation Learning for Out-Of-Vocabulary Words RepL
Ziniu Hu, Ting Chen, Kai-Wei Chang, Yizhou Sun

An automated framework for fast cognate detection and Bayesian phylogenetic inference in computational historical linguistics
Taraka Rama, Johann-Mattis List

Determining Relative Argument Specificity and Stance for Complex Argumentative Structures link
Esin Durmus, Faisal Ladhak, Claire Cardie

Systems for automatic argument generation and debate require the ability to (1) determine the stance of any claims employed in the argument and (2) assess the specificity of each claim relative to the argument context. Existing work on understanding claim specificity and stance, however, has been limited to the study of argumentative structures that are relatively shallow, most often consisting of a single claim that directly supports or opposes the argument thesis. In this paper, we tackle these tasks in the context of complex arguments on a diverse set of topics. In particular, our dataset consists of manually curated argument trees for 741 controversial topics covering 95,312 unique claims; lines of argument are generally of depth 2 to 6. We find that as the distance between a pair of claims increases along the argument path, determining the relative specificity of a pair of claims becomes easier and determining their relative stance becomes harder.

Learning Morphosyntactic Analyzers from the Bible via Iterative Annotation Projection across 26 Languages
Garrett Nicolai, David Yarowsky

Keeping Notes: Conditional Natural Language Generation with a Scratchpad Encoder link NLG
Ryan Benmalek, Madian Khabsa, Suma Desu, Claire Cardie, Michele Banko

Fine-Grained Sentence Functions for Short-Text Conversation dialogue/conversation
Wei Bi, Jun Gao, Xiaojiang Liu, Shuming Shi

From Surrogacy to Adoption; From Bitcoin to Cryptocurrency: Debate Topic Expansion
Roy Bar-Haim, Dalia Krieger, Orith Toledo-Ronen, Lilach Edelstein, Yonatan Bilu, Alon Halfon, Yoav Katz, Amir Menczel, Ranit Aharonov, Noam Slonim

Monotonic Infinite Lookback Attention for Simultaneous Machine Translation link MT
Naveen Arivazhagan, Colin Cherry, Wolfgang Macherey, Chung-Cheng Chiu, Semih Yavuz, Ruoming Pang, Wei Li, Colin Raffel

Simultaneous machine translation begins translating each source sentence before the source speaker has finished speaking, with applications to live and streaming scenarios. Simultaneous systems must carefully schedule their reading of the source sentence to balance quality against latency. We present the first simultaneous translation system to learn an adaptive schedule jointly with a neural machine translation (NMT) model that attends over all source tokens read thus far. We do so by introducing Monotonic Infinite Lookback (MILk) attention, which maintains both a hard, monotonic attention head to schedule the reading of the source sentence, and a soft attention head that extends from the monotonic head back to the beginning of the source. We show that MILk's adaptive schedule allows it to arrive at latency-quality trade-offs that are favorable compared to those of a recently proposed wait-k strategy for many latency values.

Unsupervised Information Extraction: Regularizing Discriminative Approaches with Relation Distribution Losses IE
Étienne Simon, Vincent Guigue, Benjamin Piwowarski

Soft Representation Learning for Sparse Transfer transfer
Haeju Park, Jinyoung Yeo, Gengyu Wang, Seung-won Hwang

Word2Sense: Sparse Interpretable Word Embeddings RepL
Abhishek Panigrahi, Harsha Vardhan Simhadri, Chiranjib Bhattacharyya

Domain Adaptation of Neural Machine Translation by Lexicon Induction link MT
Junjie Hu, Mengzhou Xia, Graham Neubig, Jaime Carbonell

It has been previously noted that neural machine translation (NMT) is very sensitive to domain shift. In this paper, we argue that this is a dual effect of the highly lexicalized nature of NMT, resulting in failure for sentences with large numbers of unknown words, and lack of supervision for domain-specific words. To remedy this problem, we propose an unsupervised adaptation method which fine-tunes a pre-trained out-of-domain NMT model using a pseudo-in-domain corpus. Specifically, we perform lexicon induction to extract an in-domain lexicon, and construct a pseudo-parallel in-domain corpus by performing word-for-word back-translation of monolingual in-domain target sentences. In five domains over twenty pairwise adaptation settings and two model architectures, our method achieves consistent improvements without using any in-domain parallel sentences, improving up to 14 BLEU over unadapted models, and up to 2 BLEU over strong back-translation baselines.
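
As a rough illustration of the pseudo-parallel corpus construction described above, the sketch below back-translates monolingual in-domain target sentences word for word using an induced lexicon. The lexicon, tokenisation, and fallback behaviour are simplified assumptions for illustration, not the authors' implementation.

```python
# Hypothetical sketch: word-for-word back-translation with an induced lexicon.
# The lexicon and sentences are toy examples, not the paper's data.

def backtranslate_word_for_word(target_sentences, lexicon):
    """Map each in-domain target token to a source token via the induced
    lexicon, keeping the token unchanged when no entry exists (a common
    fallback; the paper's exact handling may differ)."""
    pseudo_parallel = []
    for tgt in target_sentences:
        src_tokens = [lexicon.get(tok, tok) for tok in tgt.split()]
        pseudo_parallel.append((" ".join(src_tokens), tgt))  # (pseudo-source, target)
    return pseudo_parallel

# Toy induced German->English lexicon (illustrative only).
lexicon = {"das": "the", "medikament": "drug", "wirkt": "works"}
corpus = backtranslate_word_for_word(["das medikament wirkt"], lexicon)
print(corpus)  # [('the drug works', 'das medikament wirkt')]
```

The resulting pseudo-parallel pairs would then be used to fine-tune the pre-trained out-of-domain NMT model, as the abstract describes.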

EigenSent: Spectral sentence embeddings using higher-order Dynamic Mode Decomposition
Subhradeep Kayal, George Tsatsaronis

Multi-task Learning with Task, Group, and Universe Feature Learning
Shiva Pentyala, Mengwen Liu, Markus Dreyer

Automatically Identifying Complaints in Social Media link
Daniel Preoţiuc-Pietro, Mihaela Gaman, Nikolaos Aletras

Complaining is a basic speech act regularly used in human and computer-mediated communication to express a negative mismatch between reality and expectations in a particular situation. Automatically identifying complaints in social media is of utmost importance for organizations or brands to improve the customer experience or for developing dialogue systems that handle and respond to complaints. In this paper, we introduce the first systematic analysis of complaints in computational linguistics. We collect a new annotated data set of written complaints expressed in English on Twitter.\footnote{Data and code are available here: \url{https://github.com/danielpreotiuc/complaints-social-media}} We present an extensive linguistic analysis of complaining as a speech act in social media and train strong feature-based and neural models of complaints across nine domains achieving a predictive performance of up to 79 F1 using distant supervision.

Matching Article Pairs with Graphical Decomposition and Convolutions link
Bang Liu, Di Niu, Haojie Wei, Jinghong Lin, Yancheng He, Kunfeng Lai, Yu Xu

Identifying the relationship between two articles, e.g., whether two articles published from different sources describe the same breaking news, is critical to many document understanding tasks. Existing approaches for modeling and matching sentence pairs do not perform well in matching longer documents, which embody more complex interactions between the enclosed entities than a sentence does. To model article pairs, we propose the Concept Interaction Graph to represent an article as a graph of concepts. We then match a pair of articles by comparing the sentences that enclose the same concept vertex through a series of encoding techniques, and aggregate the matching signals through a graph convolutional network. To facilitate the evaluation of long article matching, we have created two datasets, each consisting of about 30K pairs of breaking news articles covering diverse topics in the open domain. Extensive evaluations of the proposed methods on the two datasets demonstrate significant improvements over a wide range of state-of-the-art methods for natural language matching.

Neural Decipherment via Minimum-Cost Flow: from Ugaritic to Linear B link
Jiaming Luo, Regina Barzilay, Yuan Cao

In this paper we propose a novel neural approach for automatic decipherment of lost languages. To compensate for the lack of strong supervision signal, our model design is informed by patterns in language change documented in historical linguistics. The model utilizes an expressive sequence-to-sequence model to capture character-level correspondences between cognates. To effectively train the model in an unsupervised manner, we innovate the training procedure by formalizing it as a minimum-cost flow problem. When applied to the decipherment of Ugaritic, we achieve a 5.5% absolute improvement over state-of-the-art results. We also report the first automatic results in deciphering Linear B, a syllabic language related to ancient Greek, where our model correctly translates 67.3% of cognates.

Exploiting Invertible Decoders for Unsupervised Sentence Representation Learning link RepL
Shuai Tang, Virginia R. de Sa

The encoder-decoder models for unsupervised sentence representation learning tend to discard the decoder after being trained on a large unlabelled corpus, since only the encoder is needed to map the input sentence into a vector representation. However, parameters learnt in the decoder also contain useful information about language. In order to utilise the decoder after learning, we present two types of decoding functions whose inverse can be easily derived without expensive inverse calculation. Therefore, the inverse of the decoding function serves as another encoder that produces sentence representations. We show that, with careful design of the decoding functions, the model learns good sentence representations, and the ensemble of the representations produced from the encoder and the inverse of the decoder demonstrate even better generalisation ability and solid transferability.
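
One way to read "decoding functions whose inverse can be easily derived" is a linear decoder with orthogonal weights, so that its transpose acts as the inverse map. The sketch below illustrates only that idea with toy vectors; it is an assumption-laden simplification, not the paper's exact decoding functions.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8  # embedding dimension (toy size)

# Hypothetical orthogonal "decoding" matrix: its inverse is just its transpose,
# so no expensive matrix inversion is needed to run the decoder backwards.
W, _ = np.linalg.qr(rng.normal(size=(d, d)))

sentence_vec = rng.normal(size=d)   # representation produced by the trained encoder
decoded = W @ sentence_vec          # forward decoding function
recovered = W.T @ decoded           # inverse of the decoder, acting as a second encoder

assert np.allclose(recovered, sentence_vec)
print("max reconstruction error:", np.abs(recovered - sentence_vec).max())
```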

Neural Response Generation with Meta-words NLG
Can Xu, Wei Wu, Chongyang Tao, Huang Hu, Matt Schuerman, Ying Wang

Towards Comprehensive Description Generation from Factual Attribute-value Tables NLG
Tianyu Liu, Fuli Luo, Pengcheng Yang, Wei Wu, Baobao Chang, Zhifang Sui

Categorizing and Inferring the Relationship between the Text and Image of Twitter Posts
Alakananda Vempala, Daniel Preoţiuc-Pietro

Expressing Visual Relationships via Language link
Hao Tan, Franck Dernoncourt, Zhe Lin, Trung Bui, Mohit Bansal

Describing images with text is a fundamental problem in vision-language research. Current studies in this domain mostly focus on single image captioning. However, in various real applications (e.g., image editing, difference interpretation, and retrieval), generating relational captions for two images can also be very useful. This important problem has not been explored, mostly due to a lack of datasets and effective models. To push forward the research in this direction, we first introduce a new language-guided image editing dataset that contains a large number of real image pairs with corresponding editing instructions. We then propose a new relational speaker model based on an encoder-decoder architecture with static relational attention and sequential multi-head attention. We also extend the model with dynamic relational attention, which calculates visual alignment while decoding. Our models are evaluated on our newly collected and two public datasets consisting of image pairs annotated with relationship sentences. Experimental results, based on both automatic and human evaluation, demonstrate that our model outperforms all baselines and existing methods on all the datasets.

Zero-Shot Cross-Lingual Abstractive Sentence Summarization through Teaching Generation and Attention summarization NLG
Mingming Yin, Xiangyu Duan, Min Zhang, Boxing Chen, Weihua Luo

Beyond BLEU: Training Neural Machine Translation with Semantic Similarity MT
John Wieting, Taylor Berg-Kirkpatrick, Kevin Gimpel, Graham Neubig

Persuasion for Good: Towards a Personalized Persuasive Dialogue System for Social Good link dialogue/conversation
Xuewei Wang, Weiyan Shi, Richard Kim, Yoojung Oh, Sijia Yang, Jingwen Zhang, Zhou Yu

Developing intelligent persuasive conversational agents to change people's opinions and actions for social good is the frontier in advancing the ethical development of automated dialogue systems. To do so, the first step is to understand the intricate organization of strategic disclosures and appeals employed in human persuasion conversations. We designed an online persuasion task where one participant was asked to persuade the other to donate to a specific charity. We collected a large dataset with 1,017 dialogues and annotated emerging persuasion strategies from a subset. Based on the annotation, we built a baseline classifier with context information and sentence-level features to predict the 10 persuasion strategies used in the corpus. Furthermore, to develop an understanding of personalized persuasion processes, we analyzed the relationships between individuals' demographic and psychological backgrounds including personality, morality, value systems, and their willingness for donation. Then, we analyzed which types of persuasion strategies led to a greater amount of donation depending on the individuals' personal backgrounds. This work lays the ground for developing a personalized persuasive dialogue system.

Bilingual Lexicon Induction with Semi-supervision in Non-Isometric Embedding Spaces
Barun Patra, Joel Ruben Antony Moniz, Sarthak Garg, Matthew R. Gormley, Graham Neubig

Constrained Decoding for Neural NLG from Compositional Representations in Task-Oriented Dialogue link dialogue/conversation NLG
Anusha Balakrishnan, Jinfeng Rao, Kartikeya Upasani, Michael White, Rajen Subba

Generating fluent natural language responses from structured semantic representations is a critical step in task-oriented conversational systems. Avenues like the E2E NLG Challenge have encouraged the development of neural approaches, particularly sequence-to-sequence (Seq2Seq) models for this problem. The semantic representations used, however, are often underspecified, which places a higher burden on the generation model for sentence planning, and also limits the extent to which generated responses can be controlled in a live system. In this paper, we (1) propose using tree-structured semantic representations, like those used in traditional rule-based NLG systems, for better discourse-level structuring and sentence-level planning; (2) introduce a challenging dataset using this representation for the weather domain; (3) introduce a constrained decoding approach for Seq2Seq models that leverages this representation to improve semantic correctness; and (4) demonstrate promising results on our dataset and the E2E dataset.

Quantifying Similarity between Relations with Fact Distribution
Weize Chen, Hao Zhu, Xu Han, Zhiyuan Liu, Maosong Sun

An Effective Approach to Unsupervised Machine Translation link MT
Mikel Artetxe, Gorka Labaka, Eneko Agirre

NLProlog: Reasoning with Weak Unification for Question Answering in Natural Language link QA
Leon Weber, Pasquale Minervini, Jannes Münchmeyer, Ulf Leser, Tim Rocktäschel

Rule-based models are attractive for various tasks because they inherently lead to interpretable and explainable decisions and can easily incorporate prior knowledge. However, such systems are difficult to apply to problems involving natural language, due to its linguistic variability. In contrast, neural models can cope very well with ambiguity by learning distributed representations of words and their composition from data, but lead to models that are difficult to interpret. In this paper, we describe a model combining neural networks with logic programming in a novel manner for solving multi-hop reasoning tasks over natural language. Specifically, we propose to use a Prolog prover which we extend to utilize a similarity function over pretrained sentence encoders. We fine-tune the representations for the similarity function via backpropagation. This leads to a system that can apply rule-based reasoning to natural language, and induce domain-specific rules from training data. We evaluate the proposed system on two different question answering tasks, showing that it outperforms two baselines -- BIDAF (Seo et al., 2016a) and FAST QA (Weissenborn et al., 2017b) -- on a subset of the WikiHop corpus and achieves competitive results on the MedHop data set (Welbl et al., 2017).

Multimodal Transformer Networks for End-to-End Video-Grounded Dialogue Systems link dialogue/conversation
Hung Le, Doyen Sahoo, Nancy Chen, Steven Hoi

Developing Video-Grounded Dialogue Systems (VGDS), where a dialogue is conducted based on visual and audio aspects of a given video, is significantly more challenging than traditional image or text-grounded dialogue systems because (1) the feature space of videos spans multiple picture frames, making it difficult to obtain semantic information; and (2) a dialogue agent must perceive and process information from different modalities (audio, video, caption, etc.) to obtain a comprehensive understanding. Most existing work is based on RNNs and sequence-to-sequence architectures, which are not very effective for capturing complex long-term dependencies (like in videos). To overcome this, we propose Multimodal Transformer Networks (MTN) to encode videos and incorporate information from different modalities. We also propose query-aware attention through an auto-encoder to extract query-aware features from non-text modalities. We develop a training procedure to simulate token-level decoding to improve the quality of generated responses during inference. We achieve state-of-the-art performance on the Dialogue System Technology Challenge 7 (DSTC7). Our model also generalizes to another multimodal visual-grounded dialogue task, and obtains promising performance. We implemented our models using PyTorch and the code is released at https://github.com/henryhungle/MTN.

Graph-based Dependency Parsing with Graph Neural Networks parsing
Tao Ji, Yuanbin Wu, Man Lan

You Write Like You Eat: Stylistic Variation as a Predictor of Social Stratification
Angelo Basile, Albert Gatt, Malvina Nissim

Transforming Complex Sentences into a Semantic Hierarchy link
Christina Niklaus, Matthias Cetto, André Freitas, Siegfried Handschuh

We present an approach for recursively splitting and rephrasing complex English sentences into a novel semantic hierarchy of simplified sentences, with each of them presenting a more regular structure that may facilitate a wide variety of artificial intelligence tasks, such as machine translation (MT) or information extraction (IE). Using a set of hand-crafted transformation rules, input sentences are recursively transformed into a two-layered hierarchical representation in the form of core sentences and accompanying contexts that are linked via rhetorical relations. In this way, the semantic relationship of the decomposed constituents is preserved in the output, maintaining its interpretability for downstream applications. Both a thorough manual analysis and automatic evaluation across three datasets from two different domains demonstrate that the proposed syntactic simplification approach outperforms the state of the art in structural text simplification. Moreover, an extrinsic evaluation shows that when applying our framework as a preprocessing step the performance of state-of-the-art Open IE systems can be improved by up to 346% in precision and 52% in recall. To enable reproducible research, all code is provided online.

Global Optimization under Length Constraint for Neural Text Summarization summarization
Takuya Makino, Tomoya Iwakura, Hiroya Takamura, Manabu Okumura

Latent Variable Sentiment Grammar link
Liwen Zhang, Kewei Tu, Yue Zhang

Wide-Coverage Neural A* Parsing for Minimalist Grammars parsing
John Torr, Milos Stanojevic, Mark Steedman, Shay B. Cohen

Enhancing Unsupervised Generative Dependency Parser with Contextual Information parsing
Wenjuan Han, Yong Jiang, Kewei Tu

Open-Domain Why-Question Answering with Adversarial Learning to Encode Answer Texts QA
Jong-Hoon Oh, Kazuma Kadowaki, Julien Kloetzer, Ryu Iida, Kentaro Torisawa

Emotion-Cause Pair Extraction: A New Task to Emotion Analysis in Texts link IE
Rui Xia, Zixiang Ding

Emotion cause extraction (ECE), the task aimed at extracting the potential causes behind certain emotions in text, has gained much attention in recent years due to its wide applications. However, it suffers from two shortcomings: 1) the emotion must be annotated before cause extraction in ECE, which greatly limits its applications in real-world scenarios; 2) the way to first annotate emotion and then extract the cause ignores the fact that they are mutually indicative. In this work, we propose a new task: emotion-cause pair extraction (ECPE), which aims to extract the potential pairs of emotions and corresponding causes in a document. We propose a 2-step approach to address this new ECPE task, which first performs individual emotion extraction and cause extraction via multi-task learning, and then conducts emotion-cause pairing and filtering. The experimental results on a benchmark emotion cause corpus prove the feasibility of the ECPE task as well as the effectiveness of our approach.

Eliciting Knowledge from Experts: Automatic Transcript Parsing for Cognitive Task Analysis parsing
Junyi Du, He Jiang, Jiaming Shen, Xiang Ren

Time-Out: Temporal Referencing for Robust Modeling of Lexical Semantic Change link
Haim Dubossarsky, Simon Hengchen, Nina Tahmasebi, Dominik Schlechtweg

State-of-the-art models of lexical semantic change detection suffer from noise stemming from vector space alignment. We have empirically tested the Temporal Referencing method for lexical semantic change and show that, by avoiding alignment, it is less affected by this noise. We show that, trained on a diachronic corpus, the skip-gram with negative sampling architecture with temporal referencing outperforms alignment models on a synthetic task as well as a manual testset. We introduce a principled way to simulate lexical semantic change and systematically control for possible biases.
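
Temporal Referencing, as summarised above, amounts to tagging occurrences of the target words with their time period before training a single skip-gram model, so no post-hoc vector-space alignment is needed. The preprocessing sketch below assumes whitespace tokenisation and an illustrative tag format; both are assumptions, not the paper's exact setup.

```python
# Hypothetical preprocessing for Temporal Referencing: target words are
# replaced by period-tagged tokens, all other words stay shared across periods.

TARGETS = {"gay", "mouse"}  # illustrative target words under study

def temporal_reference(tokens, period):
    return [f"{tok}_{period}" if tok in TARGETS else tok for tok in tokens]

corpus_1900 = [["the", "gay", "party"], ["a", "small", "mouse"]]
corpus_2000 = [["the", "computer", "mouse"]]

training_corpus = (
    [temporal_reference(s, "1900") for s in corpus_1900]
    + [temporal_reference(s, "2000") for s in corpus_2000]
)
print(training_corpus)
# A single skip-gram-with-negative-sampling model trained on this corpus yields
# one vector per (target word, period), e.g. "mouse_1900" vs "mouse_2000",
# which can be compared directly without aligning separate vector spaces.
```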

Language Modeling with Shared Grammar
Yuyu Zhang, Le Song

Ordinal and Attribute Aware Response Generation in a Multimodal Dialogue System dialogue/conversation NLG
Hardik Chauhan, Mauajama Firdaus, Asif Ekbal, Pushpak Bhattacharyya

Improving Neural Conversational Models with Entropy-Based Data Filtering link dialogue/conversation
Richárd Csáky, Patrik Purgai, Gábor Recski

Current neural network-based conversational models lack diversity and generate boring responses to open-ended utterances. Priors such as persona, emotion, or topic provide additional information to dialog models to aid response generation, but annotating a dataset with priors is expensive and such annotations are rarely available. While previous methods for improving the quality of open-domain response generation focused on either the underlying model or the training objective, we present a method of filtering dialog datasets by removing generic utterances from training data using a simple entropy-based approach that does not require human supervision. We conduct extensive experiments with different variations of our method, and compare dialog models across 17 evaluation metrics to show that training on datasets filtered this way results in better conversational quality as chatbots learn to output more diverse responses.
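
The filtering idea can be approximated as follows: score each response by the entropy of the empirical distribution of source utterances it answers, and drop the highest-entropy (most generic) pairs. The threshold and the choice of scoring responses rather than sources are illustrative assumptions here, since the paper studies several variants.

```python
import math
from collections import Counter, defaultdict

def response_entropy(pairs):
    """For each response, compute the entropy of the empirical distribution of
    source utterances it replies to. Generic responses ("i don't know") tend to
    follow many different sources and therefore receive high entropy."""
    sources_per_response = defaultdict(Counter)
    for src, tgt in pairs:
        sources_per_response[tgt][src] += 1
    entropies = {}
    for tgt, counts in sources_per_response.items():
        total = sum(counts.values())
        entropies[tgt] = -sum((c / total) * math.log2(c / total) for c in counts.values())
    return entropies

pairs = [("how are you", "i don't know"), ("what time is it", "i don't know"),
         ("where do you live", "i don't know"), ("what's your name", "my name is anna")]
H = response_entropy(pairs)
threshold = 1.0  # illustrative cut-off, not the paper's setting
filtered = [(s, t) for s, t in pairs if H[t] <= threshold]
print(H)
print(filtered)  # the generic "i don't know" pairs are removed
```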

Style Transformer: Unpaired Text Style Transfer without Disentangled Latent Representation link transfer
Ning Dai, Jianze Liang, Xipeng Qiu, Xuanjing Huang

Disentangling the content and style in the latent space is prevalent in unpaired text style transfer. However, two major issues exist in most of the current neural models. 1) It is difficult to completely strip the style information from the semantics of a sentence. 2) The recurrent neural network (RNN) based encoder and decoder, mediated by the latent representation, cannot deal well with long-term dependencies, resulting in poor preservation of non-stylistic semantic content. In this paper, we propose the Style Transformer, which makes no assumption about the latent representation of the source sentence and leverages the attention mechanism of the Transformer to achieve better style transfer and better content preservation.

Distilling Translations with Visual Awareness link MT
Julia Ive, Pranava Madhyastha, Lucia Specia

Previous work on multimodal machine translation has shown that visual information is only needed in very specific cases, for example in the presence of ambiguous words where the textual context is not sufficient. As a consequence, models tend to learn to ignore this information. We propose a translate-and-refine approach to this problem where images are only used by a second stage decoder. This approach is trained jointly to generate a good first draft translation and to improve over this draft by (i) making better use of the target language textual context (both left and right-side contexts) and (ii) making use of visual context. This approach leads to state-of-the-art results. Additionally, we show that it has the ability to recover from erroneous or missing words in the source language.

Encoding Social Information with Graph Convolutional Networks for Political Perspective Detection in News Media
Chang Li, Dan Goldwasser

Reference Network for Neural Machine Translation MT
Han Fu, Chenghao Liu, Jianling Sun

Target-Guided Open-Domain Conversation link dialogue/conversation
Jianheng Tang, Zhiting Hu, Tiancheng Zhao, Chenyan Xiong, Xiaodan Liang, Eric Xing

Mitigating Gender Bias in Natural Language Processing: Literature Review link bias
Tony Sun, Andrew Gaut, Shirlyn Tang, Yuxin Huang, Mai ElSherief, Jieyu Zhao, Diba Mirza, Elizabeth Belding, Kai-Wei Chang, William Yang Wang

As Natural Language Processing (NLP) and Machine Learning (ML) tools rise in popularity, it becomes increasingly vital to recognize the role they play in shaping societal biases and stereotypes. Although NLP models have shown success in modeling various applications, they propagate and may even amplify gender bias found in text corpora. While the study of bias in artificial intelligence is not new, methods to mitigate gender bias in NLP are relatively nascent. In this paper, we review contemporary studies on recognizing and mitigating gender bias in NLP. We discuss gender bias based on four forms of representation bias and analyze methods recognizing gender bias. Furthermore, we discuss the advantages and drawbacks of existing gender debiasing methods. Finally, we discuss future studies for recognizing and mitigating gender bias in NLP.

VIFIDEL: Evaluating the Visual Fidelity of Image Descriptions
Pranava Madhyastha, Josiah Wang, Lucia Specia

Searching for Effective Neural Extractive Summarization: What Works and What’s Next summarization
Ming Zhong, Pengfei Liu, Danqing Wang, Xipeng Qiu, Xuanjing Huang

Give Me More Feedback II: Annotating Thesis Strength and Related Attributes in Student Essays
Zixuan Ke, Vincent Ng

Retrieving Sequential Information for Non-Autoregressive Neural Machine Translation link MT
Chenze Shao, Yang Feng, Jinchao Zhang, Fandong Meng, Xilin Chen, Jie Zhou

Non-Autoregressive Transformer (NAT) aims to accelerate the Transformer model through discarding the autoregressive mechanism and generating target words independently, which fails to exploit the target sequential information. Over-translation and under-translation errors often occur for the above reason, especially in the long sentence translation scenario. In this paper, we propose two approaches to retrieve the target sequential information for NAT to enhance its translation ability while preserving the fast-decoding property. Firstly, we propose a sequence-level training method based on a novel reinforcement algorithm for NAT (Reinforce-NAT) to reduce the variance and stabilize the training procedure. Secondly, we propose an innovative Transformer decoder named FS-decoder to fuse the target sequential information into the top layer of the decoder. Experimental results on three translation tasks show that the Reinforce-NAT surpasses the baseline NAT system by a significant margin on BLEU without decelerating the decoding speed and the FS-decoder achieves comparable translation performance to the autoregressive Transformer with considerable speedup.

Modeling Semantic Compositionality with Sememe Knowledge
Fanchao Qi, Junjie Huang, Chenghao Yang, Zhiyuan Liu, Xiao Chen, Qun Liu, Maosong Sun

Topic Sensitive Attention on Generic Corpora Corrects Sense Bias in Pretrained Embeddings link bias
Vihari Piratla, Sunita Sarawagi, Soumen Chakrabarti

Given a small corpus $\mathcal D_T$ pertaining to a limited set of focused topics, our goal is to train embeddings that accurately capture the sense of words in the topic in spite of the limited size of $\mathcal D_T$. These embeddings may be used in various tasks involving $\mathcal D_T$. A popular strategy in limited data settings is to adapt pre-trained embeddings $\mathcal E$ trained on a large corpus. To correct for sense drift, fine-tuning, regularization, projection, and pivoting have been proposed recently. Among these, regularization informed by a word's corpus frequency performed well, but we improve upon it using a new regularizer based on the stability of its cooccurrence with other words. However, a thorough comparison across ten topics, spanning three tasks, with standardized settings of hyper-parameters, reveals that even the best embedding adaptation strategies provide small gains beyond well-tuned baselines, which many earlier comparisons ignored. In a bold departure from adapting pretrained embeddings, we propose using $\mathcal D_T$ to probe, attend to, and borrow fragments from any large, topic-rich source corpus (such as Wikipedia), which need not be the corpus used to pretrain embeddings. This step is made scalable and practical by suitable indexing. We reach the surprising conclusion that even limited corpus augmentation is more useful than adapting embeddings, which suggests that non-dominant sense information may be irrevocably obliterated from pretrained embeddings and cannot be salvaged by adaptation.

Learning to Ask Unanswerable Questions for Machine Reading Comprehension link
Haichao Zhu, Li Dong, Furu Wei, Wenhui Wang, Bing Qin, Ting Liu

Machine reading comprehension with unanswerable questions is a challenging task. In this work, we propose a data augmentation technique by automatically generating relevant unanswerable questions according to an answerable question paired with its corresponding paragraph that contains the answer. We introduce a pair-to-sequence model for unanswerable question generation, which effectively captures the interactions between the question and the paragraph. We also present a way to construct training data for our question generation models by leveraging the existing reading comprehension dataset. Experimental results show that the pair-to-sequence model performs consistently better compared with the sequence-to-sequence baseline. We further use the automatically generated unanswerable questions as a means of data augmentation on the SQuAD 2.0 dataset, yielding 1.9 absolute F1 improvement with BERT-base model and 1.7 absolute F1 improvement with BERT-large model.

Graph Neural Networks with Generated Parameters for Relation Extraction link IE
Hao Zhu, Yankai Lin, Zhiyuan Liu, Jie Fu, Tat-Seng Chua, Maosong Sun

Recently, progress has been made towards improving relational reasoning in the machine learning field. Among existing models, graph neural networks (GNNs) are among the most effective approaches for multi-hop relational reasoning. In fact, multi-hop relational reasoning is indispensable in many natural language processing tasks such as relation extraction. In this paper, we propose to generate the parameters of graph neural networks (GP-GNNs) according to natural language sentences, which enables GNNs to process relational reasoning on unstructured text inputs. We verify GP-GNNs in relation extraction from text. Experimental results on a human-annotated dataset and two distantly supervised datasets show that our model achieves significant improvements compared to baselines. We also perform a qualitative analysis to demonstrate that our model could discover more accurate relations by multi-hop relational reasoning.

A Unified Multi-task Adversarial Learning Framework for Pharmacovigilance Mining
Shweta Yadav, Asif Ekbal, Sriparna Saha, Pushpak Bhattacharyya

Gated Embeddings in End-to-End Speech Recognition for Conversational-Context Fusion link dialogue/conversation
Suyoun Kim, Siddharth Dalmia, Florian Metze

We present a novel conversational-context aware end-to-end speech recognizer based on a gated neural network that incorporates conversational-context/word/speech embeddings. Unlike conventional speech recognition models, our model learns longer conversational-context information that spans across sentences and is consequently better at recognizing long conversations. Specifically, we propose to use the text-based external word and/or sentence embeddings (i.e., fastText, BERT) within an end-to-end framework, yielding a significant improvement in word error rate with better conversational-context representation. We evaluated the models on the Switchboard conversational speech corpus and show that our model outperforms standard end-to-end speech recognition models.

A Lightweight Recurrent Network for Sequence Modeling link
Biao Zhang, Rico Sennrich

Recurrent networks have achieved great success on various sequential tasks with the assistance of complex recurrent units, but suffer from severe computational inefficiency due to weak parallelization. One direction to alleviate this issue is to shift heavy computations outside the recurrence. In this paper, we propose a lightweight recurrent network, or LRN. LRN uses input and forget gates to handle long-range dependencies as well as gradient vanishing and explosion, with all parameter related calculations factored outside the recurrence. The recurrence in LRN only manipulates the weight assigned to each token, tightly connecting LRN with self-attention networks. We apply LRN as a drop-in replacement of existing recurrent units in several neural sequential models. Extensive experiments on six NLP tasks show that LRN yields the best running efficiency with little or no loss in model performance.

STACL: Simultaneous Translation with Implicit Anticipation and Controllable Latency using Prefix-to-Prefix Framework link MT
Mingbo Ma, Liang Huang, Hao Xiong, Renjie Zheng, Kaibo Liu, Baigong Zheng, Chuanqiang Zhang, Zhongjun He, Hairong Liu, Xing Li, Hua Wu, Haifeng Wang

Simultaneous translation, which translates sentences before they are finished, is useful in many scenarios but is notoriously difficult due to word-order differences. While the conventional seq-to-seq framework is only suitable for full-sentence translation, we propose a novel prefix-to-prefix framework for simultaneous translation that implicitly learns to anticipate in a single translation model. Within this framework, we present a very simple yet surprisingly effective wait-k policy trained to generate the target sentence concurrently with the source sentence, but always k words behind. Experiments show our strategy achieves low latency and reasonable quality (compared to full-sentence translation) on 4 directions: zh<->en and de<->en.
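
The wait-k schedule itself is easy to state: read the first k source tokens, then alternate between emitting one target token and reading one more source token until the source is exhausted. The sketch below drives a dummy "translator" with that schedule; the prediction function is a placeholder assumption standing in for the prefix-to-prefix model.

```python
def wait_k_decode(source_tokens, k, predict_next):
    """Simultaneous decoding with a wait-k policy: the target always lags the
    number of source tokens read by k. `predict_next(read_prefix, target_so_far)`
    stands in for a prefix-to-prefix NMT model and is purely illustrative."""
    target = []
    read = 0
    while True:
        # Read until we are k tokens ahead of the emitted target (or the source ends).
        while read < min(len(source_tokens), len(target) + k):
            read += 1
        token = predict_next(source_tokens[:read], target)
        if token == "</s>":
            break
        target.append(token)
    return target

# Toy "model": copy the source token at the current target position, then stop.
def toy_predict(prefix, target):
    return prefix[len(target)] if len(target) < len(prefix) else "</s>"

print(wait_k_decode(["ich", "bin", "hier"], k=2, predict_next=toy_predict))
```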

Generating Sentences from Disentangled Syntactic and Semantic Spaces
Yu Bao, Hao Zhou, Shujian Huang, Lei Li, Lili Mou, Olga Vechtomova, Xin-Yu Dai, Jiajun Chen

Real-Time Open-Domain Question Answering with Dense-Sparse Phrase Index (DenSPI) QA
Minjoon Seo, Jinhyuk Lee, Tom Kwiatkowski, Ankur Parikh, Ali Farhadi, Hannaneh Hajishirzi

Multi-Channel Graph Neural Network for Entity Alignment
Yixin Cao, Zhiyuan Liu, Chengjiang Li, Zhiyuan Liu, Juanzi Li, Tat-Seng Chua

Short Papers (Main Conference)

Neural-based Chinese Idiom Recommendation for Enhancing Elegance in Essay Writing
Yuanchao Liu

Figurative Usage Detection of Symptom Words to Improve Personal Health Mention Detection link
Adith Iyer, Aditya Joshi, Sarvnaz Karimi, Ross Sparks, Cecile Paris

Personal health mention detection deals with predicting whether or not a given sentence is a report of a health condition. Past work mentions errors in this prediction when symptom words, i.e. names of symptoms of interest, are used in a figurative sense. Therefore, we combine a state-of-the-art figurative usage detection method with CNN-based personal health mention detection. To do so, we present two methods: a pipeline-based approach and a feature augmentation-based approach. The introduction of figurative usage detection results in an average improvement of 2.21% F-score of personal health mention detection, in the case of the feature augmentation-based approach. This paper demonstrates the promise of using figurative usage detection to improve personal health mention detection.

Better Exploiting Latent Variables in Text Modeling
Canasai Kruengkrai

Crowdsourcing and Validating Event-focused Emotion Corpora for German and English link
Enrica Troiano, Sebastian Padó, Roman Klinger

Sentiment analysis has a range of corpora available across multiple languages. For emotion analysis, the situation is more limited, which hinders potential research on cross-lingual modeling and the development of predictive models for other languages. In this paper, we fill this gap for German by constructing deISEAR, a corpus designed in analogy to the well-established English ISEAR emotion dataset. Motivated by Scherer's appraisal theory, we implement a crowdsourcing experiment which consists of two steps. In step 1, participants create descriptions of emotional events for a given emotion. In step 2, five annotators assess the emotion expressed by the texts. We show that transferring an emotion classification model from the original English ISEAR to the German crowdsourced deISEAR via machine translation does not, on average, cause a performance drop.

Hierarchical Transfer Learning for Multi-label Text Classification transfer
Siddhartha Banerjee, Cem Akkaya, Francisco Perez-Sorrosal, Kostas Tsioutsiouliklis

Keep Meeting Summaries on Topic: Abstractive Multi-Modal Meeting Summarization summarization
Manling Li, Lingyu Zhang, Heng Ji, Richard J. Radke

Adversarial Domain Adaptation Using Artificial Titles for Abstractive Title Generation NLG
Francine Chen, Yan-Ying Chen

SemBleu: A Robust Metric for AMR Parsing Evaluation parsing
Linfeng Song, Daniel Gildea

Look Harder: A Neural Machine Translation Model with Hard Attention MT
Sathish Reddy Indurthi, Insoo Chung, Sangha Kim

Are You Looking? Grounding to Multiple Modalities in Vision-and-Language Navigation link
Ronghang Hu, Daniel Fried, Anna Rohrbach, Dan Klein, Trevor Darrell, Kate Saenko

Vision-and-Language Navigation (VLN) requires grounding instructions, such as "turn right and stop at the door", to routes in a visual environment. The actual grounding can connect language to the environment through multiple modalities, e.g. "stop at the door" might ground into visual objects, while "turn right" might rely only on the geometric structure of a route. We investigate where the natural language empirically grounds under two recent state-of-the-art VLN models. Surprisingly, we discover that visual features may actually hurt these models: models which only use route structure, ablating visual features, outperform their visual counterparts in unseen new environments on the benchmark Room-to-Room dataset. To better use all the available modalities, we propose to decompose the grounding procedure into a set of expert models with access to different modalities (including object detections) and ensemble them at prediction time, improving the performance of state-of-the-art models on the VLN task.

Robust Neural Machine Translation with Joint Textual and Phonetic Embedding link MT
Hairong Liu, Mingbo Ma, Liang Huang, Hao Xiong, Zhongjun He

Neural machine translation (NMT) is notoriously sensitive to noises, but noises are almost inevitable in practice. One special kind of noise is the homophone noise, where words are replaced by other words with similar pronunciations. We propose to improve the robustness of NMT to homophone noises by 1) jointly embedding both textual and phonetic information of source sentences, and 2) augmenting the training dataset with homophone noises. Interestingly, to achieve better translation quality and more robustness, we found that most (though not all) weights should be put on the phonetic rather than textual information. Experiments show that our method not only significantly improves the robustness of NMT to homophone noises, but also surprisingly improves the translation quality on some clean test sets.
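
The finding that most of the weight should go to the phonetic channel can be illustrated as a fixed convex combination of a word embedding and an embedding of its pronunciation (e.g. toneless pinyin). The sketch below is a toy illustration under that assumption; the weight beta and the pinyin lookup are hypothetical, not the paper's trained values.

```python
import torch

vocab = {"他": 0, "她": 1}       # homophones: both pronounced "ta"
pinyin_vocab = {"ta": 0}

word_emb = torch.nn.Embedding(len(vocab), 8)
pinyin_emb = torch.nn.Embedding(len(pinyin_vocab), 8)

beta = 0.95  # illustrative weight on the phonetic channel ("most, though not all")

def joint_embed(word, pinyin):
    w = word_emb(torch.tensor([vocab[word]]))
    p = pinyin_emb(torch.tensor([pinyin_vocab[pinyin]]))
    return (1 - beta) * w + beta * p

# Homophone noise ("他" -> "她") barely changes the joint embedding:
print(torch.dist(joint_embed("他", "ta"), joint_embed("她", "ta")))
```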

The Referential Reader: A Recurrent Entity Network for Anaphora Resolution link
Fei Liu, Luke Zettlemoyer, Jacob Eisenstein

We present a new architecture for storing and accessing entity mentions during online text processing. While reading the text, entity references are identified, and may be stored by either updating or overwriting a cell in a fixed-length memory. The update operation implies coreference with the other mentions that are stored in the same cell; the overwrite operation causes these mentions to be forgotten. By encoding the memory operations as differentiable gates, it is possible to train the model end-to-end, using both a supervised anaphora resolution objective as well as a supplementary language modeling objective. Evaluation on a dataset of pronoun-name anaphora demonstrates that the model achieves state-of-the-art performance with purely left-to-right processing of the text.

A Neural Multi-digraph Model for Chinese NER with Gazetteers NER
Ruixue Ding, Pengjun Xie, Xiaoyan Zhang, Wei Lu, Linlin Li, Luo Si

Neural Architectures for Nested NER through Linearization NER
Jana Straková, Milan Straka, Jan Hajic

Aiming beyond the Obvious: Identifying Non-Obvious Cases in Semantic Similarity Datasets corpus
Nicole Peinelt, Maria Liakata, Dong Nguyen

Bias Analysis and Mitigation in the Evaluation of Authorship Verification
Janek Bevendorff, Matthias Hagen, Benno Stein, Martin Potthast

Self-Supervised Neural Machine Translation MT
Dana Ruiter, Cristina España-Bonet, Josef van Genabith

Who Sides With Whom? Towards Computational Construction of Discourse Networks for Political Debates
Sebastian Padó, Andre Blessing, Nico Blokker, Erenay Dayanik, Sebastian Haunss, Jonas Kuhn

Visual Story Post-Editing link
Ting-Yao Hsu, Chieh-Yang Huang, Yen-Chia Hsu, Ting-Hao Huang

Quantity Tagger: A Latent-Variable Sequence Labeling Approach to Solving Addition-Subtraction Word Problems
Yanyan Zou, Wei Lu

Numeracy-600K: Learning Numeracy for Detecting Exaggerated Information in Market Comments
Chung-Chi Chen, Hen-Hsen Huang, Hiroya Takamura, Hsin-Hsi Chen

Adaptive Attention Span in Transformers link
Sainbayar Sukhbaatar, Edouard Grave, Piotr Bojanowski, Armand Joulin

We propose a novel self-attention mechanism that can learn its optimal attention span. This allows us to extend significantly the maximum context size used in Transformer, while maintaining control over their memory footprint and computational time. We show the effectiveness of our approach on the task of character level language modeling, where we achieve state-of-the-art performances on text8 and enwiki8 by using a maximum context of 8k characters.
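
The key mechanism is a learnable soft mask over relative distances that smoothly truncates how far back each head attends; a commonly described form is m_z(x) = clamp((R + z - x) / R, 0, 1), where z is the learned span and R a ramp width. The sketch below applies such a mask to toy attention scores; it illustrates the masking idea only and is not the released implementation.

```python
import torch

def span_mask(distances, z, ramp=4.0):
    """Soft masking function: 1 for distances well inside the span z,
    0 beyond z + ramp, linear in between (z is a learnable parameter in the paper)."""
    return torch.clamp((ramp + z - distances) / ramp, min=0.0, max=1.0)

seq_len = 10
scores = torch.randn(seq_len)                          # toy attention logits for one query
distances = torch.arange(seq_len, dtype=torch.float)   # distance to each past position
z = torch.tensor(3.0)                                  # current (learned) span

masked = torch.softmax(scores, dim=-1) * span_mask(distances, z)
attn = masked / masked.sum()                           # renormalise after masking
print(attn)  # positions far beyond the span receive (near-)zero attention
```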

Extracting Multiple-Relations in One-Pass with Pre-Trained Transformers
Haoyu Wang, Ming Tan, Mo Yu, Shiyu Chang, Dakuo Wang, Kun Xu, Xiaoxiao Guo, Saloni Potdar

Generalized Tuning of Distributional Word Vectors for Monolingual and Cross-Lingual Lexical Entailment
Goran Glavaš, Ivan Vulić

Inter-sentence Relation Extraction with Document-level Graph Convolutional Neural Network IE
Sunil Kumar Sahu, Fenia Christopoulou, Makoto Miwa, Sophia Ananiadou

NNE: A Dataset for Nested Named Entity Recognition in English Newswire link NER
Nicky Ringland, Xiang Dai, Sarvnaz Karimi, Ben Hachey, Cecile Paris, James R. Curran

Named entity recognition (NER) is widely used in natural language processing applications and downstream tasks. However, most NER tools target flat annotation from popular datasets, eschewing the semantic information available in nested entity mentions. We describe NNE---a fine-grained, nested named entity dataset over the full Wall Street Journal portion of the Penn Treebank (PTB). Our annotation comprises 279,795 mentions of 114 entity types with up to 6 layers of nesting. We hope the public release of this large dataset for English newswire will encourage development of new techniques for nested NER.

Misleading Failures of Partial-input Baselines link
Shi Feng, Eric Wallace, Jordan Boyd-Graber

Recent work establishes dataset difficulty and removes annotation artifacts via partial-input baselines (e.g., hypothesis-only models for SNLI or question-only models for VQA). When a partial-input baseline gets high accuracy, a dataset is cheatable. However, the converse is not necessarily true: the failure of a partial-input baseline does not mean a dataset is free of artifacts. To illustrate this, we first design artificial datasets which contain trivial patterns in the full input that are undetectable by any partial-input model. Next, we identify such artifacts in the SNLI dataset - a hypothesis-only model augmented with trivial patterns in the premise can solve 15% of the examples that are previously considered "hard". Our work provides a caveat for the use of partial-input baselines for dataset verification and creation.
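
A partial-input baseline of the kind discussed here is simply a classifier that only ever sees part of each example, e.g. the hypothesis of an NLI pair. The scikit-learn sketch below is a generic illustration with toy data, not the models or datasets used in the paper.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Toy NLI-style data: (premise, hypothesis, label). A hypothesis-only baseline
# deliberately ignores the premise to probe for annotation artifacts.
data = [
    ("a man is sleeping", "nobody is awake", "contradiction"),
    ("a dog runs in the park", "an animal is outside", "entailment"),
    ("a woman reads a book", "someone is reading", "entailment"),
    ("two kids play soccer", "nobody is playing", "contradiction"),
]
hypotheses = [h for _, h, _ in data]
labels = [y for _, _, y in data]

baseline = make_pipeline(CountVectorizer(), LogisticRegression())
baseline.fit(hypotheses, labels)               # trained on partial input only
print(baseline.predict(["nobody is reading"]))
```

High accuracy from such a model signals artifacts; the paper's point is that the converse (low accuracy) does not certify their absence.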

Improving Low-Resource Cross-lingual Document Retrieval by Reranking with Deep Bilingual Representations link RepL
Rui Zhang, Caitlin Westerfield, Sungrok Shim, Garrett Bingham, Alexander Fabbri, Neha Verma, William Hu, Dragomir Radev

In this paper, we propose to boost low-resource cross-lingual document retrieval performance with deep bilingual query-document representations. We match queries and documents in both source and target languages with four components, each of which is implemented as a term interaction-based deep neural network with cross-lingual word embeddings as input. By including query likelihood scores as extra features, our model effectively learns to rerank the retrieved documents by using a small number of relevance labels for low-resource language pairs. Due to the shared cross-lingual word embedding space, the model can also be directly applied to another language pair without any training label. Experimental results on the MATERIAL dataset show that our model outperforms the competitive translation-based baselines on English-Swahili, English-Tagalog, and English-Somali cross-lingual information retrieval tasks.

Online Infix Probability Computation for Probabilistic Finite Automata
Marco Cognetta, Yo-Sub Han, Soon Chan Kwon

Exact Hard Monotonic Attention for Character-Level Transduction link
Shijie Wu, Ryan Cotterell

Many common character-level, string-to-string transduction tasks, e.g., grapheme-to-phoneme conversion and morphological inflection, consist almost exclusively of monotonic transduction. Neural sequence-to-sequence models with soft attention, which are non-monotonic, outperform popular monotonic models. In this work, we ask the following question: Is monotonicity really a helpful inductive bias in these tasks? We develop a hard attention sequence-to-sequence model that enforces strict monotonicity and learns alignment jointly. With the help of dynamic programming, we are able to compute the exact marginalization over all alignments. Our models achieve state-of-the-art performance on morphological inflection. Furthermore, we find strong performance on two other character-level transduction tasks.

Soft Contextual Data Augmentation for Neural Machine Translation link MT
Jinhua Zhu, Fei Gao, Lijun Wu, Yingce Xia, Tao Qin, Wengang Zhou, Xueqi Cheng, Tie-Yan Liu

While data augmentation is an important trick to boost the accuracy of deep learning methods in computer vision tasks, its study in natural language tasks is still very limited. In this paper, we present a novel data augmentation method for neural machine translation. Different from previous augmentation methods that randomly drop, swap or replace words with other words in a sentence, we softly augment a randomly chosen word in a sentence by its contextual mixture of multiple related words. More accurately, we replace the one-hot representation of a word by a distribution (provided by a language model) over the vocabulary, i.e., replacing the embedding of this word by a weighted combination of multiple semantically similar words. Since the weights of those words depend on the contextual information of the word to be replaced, the newly generated sentences capture much richer information than previous augmentation methods. Experimental results on both small scale and large scale machine translation datasets demonstrate the superiority of our method over strong baselines.
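
The augmentation replaces a chosen word's one-hot vector with a distribution from a language model, so its embedding becomes a weighted mixture of the embeddings of related words. The sketch below shows only that mixing step with toy tensors; how the distribution is obtained and which positions are replaced are left abstract and should be read as assumptions.

```python
import torch

vocab_size, emb_dim = 6, 4
embedding = torch.nn.Embedding(vocab_size, emb_dim)

token_ids = torch.tensor([2, 5, 1])          # toy sentence as vocabulary ids
hard_embeds = embedding(token_ids)           # usual one-hot lookup

# Pretend a language model predicted this distribution over the vocabulary
# for the word at position 1 (illustrative numbers only):
lm_dist = torch.tensor([0.05, 0.10, 0.05, 0.50, 0.25, 0.05])

soft_embed = lm_dist @ embedding.weight      # expectation over word embeddings
augmented = hard_embeds.clone()
augmented[1] = soft_embed                    # softly replace the chosen word
print(augmented.shape)                       # torch.Size([3, 4])
```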

How to best use Syntax in Semantic Role Labelling link
Yufei Wang, Mark Johnson, Stephen Wan, Yifang Sun, Wei Wang

Leveraging Meta Information in Short Text Aggregation
He Zhao, Lan Du, Guanfeng Liu, Wray Buntine

Memory Consolidation for Contextual Spoken Language Understanding with Dialogue Logistic Inference link dialogue/conversation
He Bai, Yu Zhou, Jiajun Zhang, Chengqing Zong

Dialogue contexts have proven helpful in spoken language understanding (SLU) systems and are typically encoded with explicit memory representations. However, most of the previous models learn the context memory with only one objective of maximizing the SLU performance, leaving the context memory under-exploited. In this paper, we propose a new dialogue logistic inference (DLI) task to consolidate the context memory jointly with SLU in a multi-task framework. DLI is defined as sorting a shuffled dialogue session into its original logical order and shares the same memory encoder and retrieval mechanism as the SLU model. Our experimental results show that various popular contextual SLU models can benefit from our approach, and improvements are quite impressive, especially in slot filling.

Cross-lingual Knowledge Graph Alignment via Graph Matching Neural Network link
Kun Xu, Liwei Wang, Mo Yu, Yansong Feng, Yan Song, Zhiguo Wang, Dong Yu

Previous cross-lingual knowledge graph (KG) alignment studies rely on entity embeddings derived only from monolingual KG structural information, which may fail at matching entities that have different facts in two KGs. In this paper, we introduce the topic entity graph, a local sub-graph of an entity, to represent entities with their contextual information in KG. From this view, the KB-alignment task can be formulated as a graph matching problem; and we further propose a graph-attention based solution, which first matches all entities in two topic entity graphs, and then jointly model the local matching information to derive a graph-level matching vector. Experiments show that our model outperforms previous state-of-the-art methods by a large margin.

A Simple and Effective Approach to Automatic Post-Editing with Transfer Learning transfer
Gonçalo M. Correia, André F. T. Martins

The Risk of Racial Bias in Hate Speech Detection bias
Maarten Sap, Dallas Card, Saadia Gabriel, Yejin Choi, Noah A. Smith

Pay attention when you pay the bills. A multilingual corpus with dependency-based and semantic annotation of collocations. corpus
Marcos Garcia, Marcos García Salido, Susana Sotelo Docío, Estela Mosqueira, Margarita Alonso-Ramos

CNNs found to jump around more skillfully than RNNs: Compositional generalization in seq2seq convolutional networks link
Roberto Dessì, Marco Baroni

Lake and Baroni (2018) introduced the SCAN dataset probing the ability of seq2seq models to capture compositional generalizations, such as inferring the meaning of "jump around" 0-shot from the component words. Recurrent networks (RNNs) were found to completely fail the most challenging generalization cases. We test here a convolutional network (CNN) on these tasks, reporting hugely improved performance with respect to RNNs. Despite the big improvement, the CNN has however not induced systematic rules, suggesting that the difference between compositional and non-compositional behaviour is not clear-cut.

Do you know that Florence is packed with visitors? Evaluating state-of-the-art models of speaker commitment
Nanjiang Jiang, Marie-Catherine de Marneffe

An Imitation Learning Approach to Unsupervised Parsing link parsing
Bowen Li, Lili Mou, Frank Keller

Recently, there has been an increasing interest in unsupervised parsers that optimize semantically oriented objectives, typically using reinforcement learning. Unfortunately, the learned trees often do not match actual syntax trees well. Shen et al. (2018) propose a structured attention mechanism for language modeling (PRPN), which induces better syntactic structures but relies on ad hoc heuristics. Also, their model lacks interpretability as it is not grounded in parsing actions. In our work, we propose an imitation learning approach to unsupervised parsing, where we transfer the syntactic knowledge induced by the PRPN to a Tree-LSTM model with discrete parsing actions. Its policy is then refined by Gumbel-Softmax training towards a semantically oriented objective. We evaluate our approach on the All Natural Language Inference dataset and show that it achieves a new state of the art in terms of parsing $F$-score, outperforming our base models, including the PRPN.

Domain Adaptive Inference for Neural Machine Translation link MT
Danielle Saunders, Felix Stahlberg, Adrià de Gispert, Bill Byrne

We investigate adaptive ensemble weighting for Neural Machine Translation, addressing the case of improving performance on a new and potentially unknown domain without sacrificing performance on the original domain. We adapt sequentially across two Spanish-English and three English-German tasks, comparing unregularized fine-tuning, L2 and Elastic Weight Consolidation. We then report a novel scheme for adaptive NMT ensemble decoding by extending Bayesian Interpolation with source information, and show strong improvements across test domains without access to the domain label.

Are Girls Neko or Shōjo? Cross-Lingual Alignment of Non-Isomorphic Embeddings with Iterative Normalization link
Mozhi Zhang, Keyulu Xu, Ken-ichi Kawarabayashi, Stefanie Jegelka, Jordan Boyd-Graber

Cross-lingual word embeddings (CLWE) underlie many multilingual natural language processing systems, often through orthogonal transformations of pre-trained monolingual embeddings. However, orthogonal mapping only works on language pairs whose embeddings are naturally isomorphic. For non-isomorphic pairs, our method (Iterative Normalization) transforms monolingual embeddings to make orthogonal alignment easier by simultaneously enforcing that (1) individual word vectors are unit length, and (2) each language's average vector is zero. Iterative Normalization consistently improves word translation accuracy of three CLWE methods, with the largest improvement observed on English-Japanese (from 2% to 44% test accuracy).
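
Iterative Normalization, as summarised here, alternates between two projections, length-normalising every word vector and centering the vocabulary mean at zero, until both conditions approximately hold. A minimal numpy sketch of that preprocessing step (the iteration count is an illustrative choice):

```python
import numpy as np

def iterative_normalization(X, n_iter=5):
    """Alternately (1) scale each row to unit length and (2) subtract the mean
    row, as a preprocessing step before learning an orthogonal cross-lingual map."""
    X = X.copy()
    for _ in range(n_iter):
        X /= np.linalg.norm(X, axis=1, keepdims=True)  # unit-length word vectors
        X -= X.mean(axis=0, keepdims=True)             # zero mean vector
    return X

rng = np.random.default_rng(0)
emb = rng.normal(size=(1000, 50))          # toy monolingual embedding matrix
emb = iterative_normalization(emb)
print(np.linalg.norm(emb, axis=1).mean())  # rows are (approximately) unit length
print(np.abs(emb.mean(axis=0)).max())      # mean vector is (approximately) zero
```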

BIGPATENT: A Large-Scale Dataset for Abstractive and Coherent Summarization link summarization
Eva Sharma, Chen Li, Lu Wang

Most existing text summarization datasets are compiled from the news domain, where summaries have a flattened discourse structure. In such datasets, summary-worthy content often appears in the beginning of input articles. Moreover, large segments from input articles are present verbatim in their respective summaries. These issues impede the learning and evaluation of systems that can understand an article's global content structure as well as produce abstractive summaries with high compression ratio. In this work, we present a novel dataset, BIGPATENT, consisting of 1.3 million records of U.S. patent documents along with human written abstractive summaries. Compared to existing summarization datasets, BIGPATENT has the following properties: i) summaries contain a richer discourse structure with more recurring entities, ii) salient content is evenly distributed in the input, and iii) lesser and shorter extractive fragments are present in the summaries. Finally, we train and evaluate baselines and popular learning models on BIGPATENT to shed light on new challenges and motivate future directions for summarization research.

A Deep Reinforced Sequence-to-Set Model for Multi-Label Classification
Pengcheng Yang, Fuli Luo, Shuming Ma, Junyang Lin, Xu Sun

Cross-Modal Commentator: Automatic Machine Commenting Based on Cross-Modal Information
Pengcheng Yang, Zhihan Zhang, Fuli Luo, Lei Li, Chengyang Huang, Xu Sun

Learning to Control the Fine-grained Sentiment for Story Ending Generation NLG
Fuli Luo, Damai Dai, Pengcheng Yang, Tianyu Liu, Baobao Chang, Zhifang Sui, Xu Sun

Reranking for Neural Semantic Parsing parsing
Pengcheng Yin, Graham Neubig

An Empirical Investigation of Structured Output Modeling for Graph-based Neural Dependency Parsing parsing
Zhisong Zhang, Xuezhe Ma, Eduard Hovy

PTB Graph Parsing with Tree Approximation parsing
Yoshihide Kato, Shigeki Matsubara

Do Neural Dialog Systems Use the Conversation History Effectively? An Empirical Study link dialogue/conversation
Chinnadhurai Sankar, Sandeep Subramanian, Chris Pal, Sarath Chandar, Yoshua Bengio

Neural generative models have become increasingly popular when building conversational agents. They offer flexibility, can be easily adapted to new domains, and require minimal domain engineering. A common criticism of these systems is that they seldom understand or use the available dialog history effectively. In this paper, we take an empirical approach to understanding how these models use the available dialog history by studying the sensitivity of the models to artificially introduced unnatural changes or perturbations to their context at test time. We experiment with 10 different types of perturbations on 4 multi-turn dialog datasets and find that commonly used neural dialog architectures like recurrent and transformer-based seq2seq models are rarely sensitive to most perturbations such as missing or reordering utterances, shuffling words, etc. Also, by open-sourcing our code, we believe that it will serve as a useful diagnostic tool for evaluating dialog systems in the future.
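
The perturbations studied are simple, model-agnostic corruptions of the dialog history applied at test time. The helpers below sketch two of the ten perturbation types mentioned (utterance reordering and word shuffling), with an obvious assumed data format of one string per turn.

```python
import random

def shuffle_utterances(history, seed=0):
    """Reorder whole turns in the dialog history."""
    rng = random.Random(seed)
    perturbed = history[:]
    rng.shuffle(perturbed)
    return perturbed

def shuffle_words(history, seed=0):
    """Shuffle the words inside every utterance."""
    rng = random.Random(seed)
    out = []
    for utt in history:
        words = utt.split()
        rng.shuffle(words)
        out.append(" ".join(words))
    return out

history = ["hi how are you", "fine thanks and you", "pretty good"]
print(shuffle_utterances(history))
print(shuffle_words(history))
# A model whose next-response quality barely changes under such perturbations
# is, by the paper's argument, probably not using the history effectively.
```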

Cost-sensitive Regularization for Label Confusion-aware Event Detection
Hongyu Lin, Yaojie Lu, Xianpei Han, Le Sun

Using Human Attention to Extract Keyphrase from Microblog Post
Yingyi Zhang, Chengzhi Zhang

Does It Make Sense? And Why? A Pilot Study for Sense Making and Explanation
Cunxiang Wang, Shuailong Liang, Yue Zhang, Xiaonan Li, Tian Gao

Delta Embedding Learning link
Xiao Zhang, Ji Wu, Dejing Dou

Unsupervised word embeddings have become a popular approach of word representation in NLP tasks. However, there are limitations to the semantics represented by unsupervised embeddings, and inadequate fine-tuning of embeddings can lead to suboptimal performance. We propose a novel learning technique called Delta Embedding Learning, which can be applied to general NLP tasks to improve performance by optimized tuning of the word embeddings. A structured regularization is applied to the embeddings to ensure they are tuned in an incremental way. As a result, the tuned word embeddings become better word representations by absorbing semantic information from supervision without "forgetting." We apply the method to various NLP tasks and see a consistent improvement in performance. Evaluation also confirms the tuned word embeddings have better semantic properties.
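
The "structured regularization" can be read as learning a task-specific delta on top of frozen pretrained vectors and penalising that delta so tuning stays incremental. The penalty below (an L2,1-style norm over per-word deltas, with a small epsilon for numerical stability) is one plausible instantiation, offered as an assumption rather than the paper's exact objective.

```python
import torch

class DeltaEmbedding(torch.nn.Module):
    """Word embedding = frozen pretrained vector + small trainable delta."""
    def __init__(self, pretrained):                  # pretrained: (V, d) tensor
        super().__init__()
        self.register_buffer("base", pretrained)     # kept fixed during training
        self.delta = torch.nn.Parameter(torch.zeros_like(pretrained))

    def forward(self, ids):
        return (self.base + self.delta)[ids]

    def regularizer(self):
        # Sum of per-word L2 norms of the deltas (an L2,1-style penalty),
        # encouraging most words to keep their pretrained vectors unchanged.
        return (self.delta.pow(2).sum(dim=1) + 1e-8).sqrt().sum()

emb = DeltaEmbedding(torch.randn(100, 16))
loss = emb(torch.tensor([3, 7])).sum() + 1e-3 * emb.regularizer()
loss.backward()
print(emb.delta.grad.abs().sum() > 0)  # deltas receive gradients; base stays frozen
```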

Personalizing Dialogue Agents via Meta-Learning link dialogue/conversation
Andrea Madotto, Zhaojiang Lin, Chien-Sheng Wu, Pascale Fung

Fine-Grained Spoiler Detection from Large-Scale Review Corpora link
Mengting Wan, Rishabh Misra, Ndapa Nakashole, Julian McAuley

This paper presents computational approaches for automatically detecting critical plot twists in reviews of media products. First, we created a large-scale book review dataset that includes fine-grained spoiler annotations at the sentence-level, as well as book and (anonymized) user information. Second, we carefully analyzed this dataset, and found that: spoiler language tends to be book-specific; spoiler distributions vary greatly across books and review authors; and spoiler sentences tend to jointly appear in the latter part of reviews. Third, inspired by these findings, we developed an end-to-end neural network architecture to detect spoiler sentences in review corpora. Quantitative and qualitative results demonstrate that the proposed method substantially outperforms existing baselines.

An Investigation of Transfer Learning-Based Sentiment Analysis in Japanese link transfer
Enkhbold Bataa, Joshua Wu

Text classification approaches have usually required task-specific model architectures and huge labeled datasets. Recently, thanks to the rise of text-based transfer learning techniques, it is possible to pre-train a language model in an unsupervised manner and leverage it to perform effectively on downstream tasks. In this work we focus on Japanese and show the potential of transfer learning techniques for text classification. Specifically, we perform binary and multi-class sentiment classification on the Rakuten product review and Yahoo movie review datasets. We show that transfer learning-based approaches perform better than task-specific models trained on 3 times as much data. Furthermore, these approaches perform just as well when the language model is pre-trained on only 1/30 of the data. We release our pre-trained models and code as open source.

MAAM: A Morphology-Aware Alignment Model for Unsupervised Bilingual Lexicon Induction
Pengcheng Yang, Fuli Luo, Peng Chen, Tianyu Liu, Xu SUN

Better Character Language Modeling Through Morphology link
Terra Blevins, Luke Zettlemoyer

We incorporate morphological supervision into character language models (CLMs) via multitasking and show that this addition improves bits-per-character (BPC) performance across 24 languages, even when the morphology data and language modeling data are disjoint. Analyzing the CLMs shows that inflected words benefit more from explicitly modeling morphology than uninflected words, and that morphological supervision improves performance even as the amount of language modeling data grows. We then transfer morphological supervision across languages to improve language modeling performance in the low-resource setting.

Reading Turn by Turn: Hierarchical Attention Architecture for Spoken Dialogue Comprehension dialogue/conversation
Zhengyuan Liu, Nancy Chen

Data Programming for Learning Discourse Structure
Sonia Badene, Kate Thompson, Jean-Pierre Lorré, Nicholas Asher

Complex Word Identification as a Sequence Labelling Task
Sian Gooding, Ekaterina Kochmar

Annotation and automatic classification of aspectual categories
Markus Egg, Helena Prepens, Will Roberts

Probing Neural Network Comprehension of Natural Language Arguments
Timothy Niven, Hung-Yu Kao

Uncovering Probabilistic Implications in Typological Knowledge Bases link
Johannes Bjerva, Yova Kementchedjhieva, Ryan Cotterell, Isabelle Augenstein

The study of linguistic typology is rooted in the implications we find between linguistic features, such as the fact that languages with object-verb word ordering tend to have post-positions. Uncovering such implications typically amounts to time-consuming manual processing by trained and experienced linguists, which potentially leaves key linguistic universals unexplored. In this paper, we present a computational model which successfully identifies known universals, including Greenberg universals, but also uncovers new ones, worthy of further linguistic investigation. Our approach outperforms baselines previously used for this problem, as well as a strong baseline from knowledge base population.

Recognising Agreement and Disagreement between Stances with Reason Comparing Networks link
Chang Xu, Cecile Paris, Surya Nepal, Ross Sparks

We identify agreement and disagreement between utterances that express stances towards a topic of discussion. Existing methods focus mainly on conversational settings, where dialogic features are used for (dis)agreement inference. We extend this scope and seek to detect stance (dis)agreement in a broader setting, where independent stance-bearing utterances, which prevail in many stance corpora and real-world scenarios, are compared. To cope with such non-dialogic utterances, we find that the reasons uttered to back up a specific stance can help predict stance (dis)agreements. We propose a reason comparing network (RCN) to leverage reason information for stance comparison. Empirical results on a well-known stance corpus show that our method can discover useful reason information, enabling it to outperform several baselines in stance (dis)agreement detection.

Sequence Labeling Parsing by Learning Across Representations link parsing
Michalina Strzyz, David Vilares, Carlos Gómez-Rodríguez

We use parsing as sequence labeling as a common framework to learn across constituency and dependency syntactic abstractions. To do so, we cast the problem as multitask learning (MTL). First, we show that adding a parsing paradigm as an auxiliary loss consistently improves the performance on the other paradigm. Second, we explore an MTL sequence labeling model that parses both representations, at almost no cost in terms of performance and speed. The results across the board show that on average MTL models with auxiliary losses for constituency parsing outperform single-task ones by 1.05 F1 points, and for dependency parsing by 0.62 UAS points.
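
A minimal PyTorch sketch of the multitask setup; the BiLSTM encoder, token-level tag vocabularies for both paradigms, and the loss weighting are illustrative assumptions, not the authors' configuration. A shared encoder feeds two tagging heads, and the auxiliary paradigm's loss is simply added to the main one.

```python
import torch
import torch.nn as nn

class MultiParadigmTagger(nn.Module):
    """Shared encoder with one sequence-labeling head per syntactic paradigm."""
    def __init__(self, vocab_size, n_const_tags, n_dep_tags, dim=128):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, dim)
        self.encoder = nn.LSTM(dim, dim, batch_first=True, bidirectional=True)
        self.const_head = nn.Linear(2 * dim, n_const_tags)
        self.dep_head = nn.Linear(2 * dim, n_dep_tags)

    def forward(self, token_ids):
        h, _ = self.encoder(self.embed(token_ids))
        return self.const_head(h), self.dep_head(h)

def mtl_loss(const_logits, dep_logits, const_gold, dep_gold, aux_weight=1.0):
    ce = nn.CrossEntropyLoss()
    main = ce(dep_logits.flatten(0, 1), dep_gold.flatten())
    aux = ce(const_logits.flatten(0, 1), const_gold.flatten())
    return main + aux_weight * aux  # auxiliary constituency loss aids dependency parsing
```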

Confusionset-guided Pointer Networks for Chinese Spelling Check
Dingmin Wang, Yi Tay, Li Zhong

Historical Text Normalization with Delayed Rewards
Simon Flachs, Marcel Bollmann, Anders Søgaard

Multi-grained Attention with Object-level Grounding for Visual Question Answering QA
Pingping Huang, Jianhui Huang, Yuqing Guo, Min Qiao, Yong Zhu

Modeling Intra-Relation in Math Word Problems with Different Functional Multi-Head Attentions
Jierui Li, Lei Wang, Jipeng Zhang, Yan Wang, Bing Tian Dai, Dongxiang Zhang

Representing Schema Structure with Graph Neural Networks for Text-to-SQL Parsing link parsing
Ben Bogin, Jonathan Berant, Matt Gardner

Research on parsing language to SQL has largely ignored the structure of the database (DB) schema, either because the DB was very simple, or because it was observed at both training and test time. In Spider, a recently-released text-to-SQL dataset, new and complex DBs are given at test time, and so the structure of the DB schema can inform the predicted SQL query. In this paper, we present an encoder-decoder semantic parser, where the structure of the DB schema is encoded with a graph neural network, and this representation is later used at both encoding and decoding time. Evaluation shows that encoding the schema structure improves our parser accuracy from 33.8% to 39.4%, dramatically above the current state of the art, which is at 19.7%.

Analyzing the limitations of cross-lingual word embedding mappings RepL
Aitor Ormazabal, Mikel Artetxe, Gorka Labaka, Aitor Soroa, Eneko Agirre

Large Dataset and Language Model Fun-Tuning for Humor Recognition corpus
Vladislav Blinov, Valeria Bolotova-Baranova, Pavel Braslavski

A Working Memory Model for Task-oriented Dialog Response Generation dialogue/conversation NLG
Xiuyi Chen, Jiaming Xu, Bo Xu

Towards Automating Healthcare Question Answering in a Noisy Multilingual Low-Resource Setting QA
Jeanne E. Daniel, Willie Brink, Ryan Eloff, Charles Copley

HEAD-QA: A Healthcare Dataset for Complex Reasoning link corpus
David Vilares, Carlos Gómez-Rodríguez

We present HEAD-QA, a multi-choice question answering testbed to encourage research on complex reasoning. The questions come from exams to access a specialized position in the Spanish healthcare system, and are challenging even for highly specialized humans. We then consider monolingual (Spanish) and cross-lingual (to English) experiments with information retrieval and neural techniques. We show that: (i) HEAD-QA challenges current methods, and (ii) the results lag well behind human performance, demonstrating its usefulness as a benchmark for future work.

A Novel Bi-directional Interrelated Model for Joint Intent Detection and Slot Filling link
Haihong E, Peiqing Niu, Zhongfu Chen, Meina Song

A spoken language understanding (SLU) system includes two main tasks, slot filling (SF) and intent detection (ID). Joint models for the two tasks are becoming a trend in SLU, but the bi-directional interrelated connections between the intent and slots are not established in existing joint models. In this paper, we propose a novel bi-directional interrelated model for joint intent detection and slot filling. We introduce an SF-ID network to establish direct connections between the two tasks so that they can promote each other mutually. Besides, we design an entirely new iteration mechanism inside the SF-ID network to enhance the bi-directional interrelated connections. The experimental results show that the relative improvement in the sentence-level semantic frame accuracy of our model is 3.79% and 5.42% on the ATIS and Snips datasets, respectively, compared to the state-of-the-art model.

Improving Open Information Extraction via Iterative Rank-Aware Learning link IE
Zhengbao Jiang, Pengcheng Yin, Graham Neubig

Open information extraction (IE) is the task of extracting open-domain assertions from natural language sentences. A key step in open IE is confidence modeling, ranking the extractions based on their estimated quality to adjust precision and recall of extracted assertions. We found that the extraction likelihood, a confidence measure used by current supervised open IE systems, is not well calibrated when comparing the quality of assertions extracted from different sentences. We propose an additional binary classification loss to calibrate the likelihood to make it more globally comparable, and an iterative learning process, where extractions generated by the open IE model are incrementally included as training samples to help the model learn from trial and error. Experiments on OIE2016 demonstrate the effectiveness of our method. Code and data are available at https://github.com/jzbjyb/oie_rank.

Joint Entity Extraction and Assertion Detection for Clinical Text link IE
Parminder Bhatia, Busra Celikkaya, Mohammed Khalilia

Negative medical findings are prevalent in clinical reports, yet discriminating them from positive findings remains a challenging task for information extraction. Most of the existing systems treat this task as a pipeline of two separate tasks, i.e., named entity recognition (NER) and rule-based negation detection. We consider this as a multi-task problem and present a novel end-to-end neural model to jointly extract entities and negations. We extend a standard hierarchical encoder-decoder NER model and first adopt a shared encoder followed by separate decoders for the two tasks. This architecture performs considerably better than the previous rule-based and machine learning-based systems. To overcome the problem of increased parameter size, especially in low-resource settings, we propose the Conditional Softmax Shared Decoder architecture, which achieves state-of-the-art results for NER and negation detection on the 2010 i2b2/VA challenge dataset and a proprietary de-identified clinical dataset.

Neural News Recommendation with Topic-Aware News Representation RepL
Chuhan Wu, Fangzhao Wu, Mingxiao An, Yongfeng Huang, Xing Xie

Revisiting Low-Resource Neural Machine Translation: A Case Study link MT
Rico Sennrich, Biao Zhang

It has been shown that the performance of neural machine translation (NMT) drops starkly in low-resource conditions, underperforming phrase-based statistical machine translation (PBSMT) and requiring large amounts of auxiliary data to achieve competitive results. In this paper, we re-assess the validity of these results, arguing that they are the result of a lack of system adaptation to low-resource settings. We discuss some pitfalls to be aware of when training low-resource NMT systems, and recent techniques that have been shown to be especially helpful in low-resource settings, resulting in a set of best practices for low-resource NMT. In our experiments on German-English with different amounts of IWSLT14 training data, we show that, without the use of any auxiliary monolingual or multilingual data, an optimized NMT system can outperform PBSMT with far less data than previously claimed. We also apply these techniques to a low-resource Korean-English dataset, surpassing previously reported results by 4 BLEU.

Putting words in context: LSTM language models and lexical ambiguity link
Laura Aina, Kristina Gulordava, Gemma Boleda

In neural network models of language, words are commonly represented using context-invariant representations (word embeddings) which are then put in context in the hidden layers. Since words are often ambiguous, representing the contextually relevant information is not trivial. We investigate how an LSTM language model deals with lexical ambiguity in English, designing a method to probe its hidden representations for lexical and contextual information about words. We find that both types of information are represented to a large extent, but also that there is room for improvement for contextual information.

Interpolated Spectral NGram Language Models
Ariadna Quattoni, Xavier Carreras

Target Conditioned Sampling: Optimizing Data Selection for Multilingual Neural Machine Translation link MT
Xinyi Wang, Graham Neubig

To improve low-resource Neural Machine Translation (NMT) with multilingual corpora, training on the most related high-resource language only is often more effective than using all data available (Neubig and Hu, 2018). However, it is possible that an intelligent data selection strategy can further improve low-resource NMT with data from other auxiliary languages. In this paper, we seek to construct a sampling distribution over all multilingual data, so that it minimizes the training loss of the low-resource language. Based on this formulation, we propose an efficient algorithm, Target Conditioned Sampling (TCS), which first samples a target sentence, and then conditionally samples its source sentence. Experiments show that TCS brings significant gains of up to 2 BLEU on three of four languages we test, with minimal training overhead.
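
A minimal sketch of the two-step sampling. The target distribution and the per-target source candidates are placeholders: TCS derives the target distribution from the training objective, so the uniform default used here is only illustrative.

```python
import random

def target_conditioned_sample(parallel_data, target_probs=None):
    """parallel_data: dict mapping a target sentence to its candidate
    source sentences (from different auxiliary languages)."""
    targets = list(parallel_data)
    # step 1: sample a target sentence (uniform here; TCS learns this distribution)
    tgt = random.choices(targets, weights=target_probs, k=1)[0]
    # step 2: conditionally sample one source sentence aligned to that target
    src = random.choice(parallel_data[tgt])
    return src, tgt

# usage:
# data = {"the cat sat": ["le chat s'est assis", "die Katze sass"]}
# src, tgt = target_conditioned_sample(data)
```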

Evaluating Discourse in Structured Text Representations link RepL
Elisa Ferracane, Greg Durrett, Junyi Jessy Li, Katrin Erk

Discourse structure is integral to understanding a text and is helpful in many NLP tasks. Learning latent representations of discourse is an attractive alternative to acquiring expensive labeled discourse data. Liu and Lapata (2018) propose a structured attention mechanism for text classification that derives a tree over a text, akin to an RST discourse tree. We examine this model in detail, and evaluate on additional discourse-relevant tasks and datasets, in order to assess whether the structured attention improves performance on the end task and whether it captures a text's discourse structure. We find the learned latent trees have little to no structure and instead focus on lexical cues; even after obtaining more structured trees with proposed model modifications, the trees are still far from capturing discourse structure when compared to discourse dependency trees from an existing discourse parser. Finally, ablation studies show the structured attention provides little benefit, sometimes even hurting performance.

Know What You Don't Know: Modeling a Pragmatic Speaker that Refers to Objects of Unknown Categories link
Sina Zarrieß, David Schlangen

Zero-shot learning in Language & Vision is the task of correctly labelling (or naming) objects of novel categories. Another strand of work in L&V aims at pragmatically informative rather than "correct" object descriptions, e.g. in reference games. We combine these lines of research and model zero-shot reference games, where a speaker needs to successfully refer to a novel object in an image. Inspired by models of "rational speech acts", we extend a neural generator to become a pragmatic speaker reasoning about uncertain object categories. As a result of this reasoning, the generator produces fewer nouns and names of distractor categories as compared to a literal speaker. We show that this conversational strategy for dealing with novel objects often improves communicative success, in terms of resolution accuracy of an automatic listener.

Corpus-based Check-up for Thesaurus corpus
Natalia Loukachevitch

Poetry to Prose Conversion in Sanskrit as a Linearisation Task: A case for Low-Resource Languages
Amrith Krishna, Vishnu Sharma, Bishal Santra, Aishik Chakraborty, Pavankumar Satuluri, Pawan Goyal

Self-Attention Architectures for Answer-Agnostic Neural Question Generation NLG
Thomas Scialom, Benjamin Piwowarski, Jacopo Staiano

Toward Comprehensive Understanding of a Sentiment Based on Human Motives
Naoki Otani, Eduard Hovy

Human vs. Muppet: A Conservative Estimate of Human Performance on the GLUE Benchmark link
Nikita Nangia, Samuel R. Bowman

The GLUE benchmark (Wang et al., 2019b) is a suite of language understanding tasks which has seen dramatic progress in the past year, with average performance moving from 70.0 at launch to 83.9, state of the art at the time of writing (May 24, 2019). Here, we measure human performance on the benchmark, in order to learn whether significant headroom remains for further progress. We provide a conservative estimate of human performance on the benchmark through crowdsourcing: Our annotators are non-experts who must learn each task from a brief set of instructions and 20 examples. In spite of limited training, these annotators robustly outperform the state of the art on six of the nine GLUE tasks and achieve an average score of 87.1. Given the fast pace of progress however, the headroom we observe is quite limited. To reproduce the data-poor setting that our annotators must learn in, we also train the BERT model (Devlin et al., 2019) in limited-data regimes, and conclude that low-resource sentence classification remains a challenge for modern neural network approaches to text understanding.

Context-aware Embedding for Targeted Aspect-based Sentiment Analysis
Bin Liang, Jiachen Du, Ruifeng Xu, Binyang Li, Hejiao Huang

Reducing Word Omission Errors in Neural Machine Translation: A Contrastive Learning Approach MT
Zonghan Yang, Yong Cheng, Yang Liu, Maosong Sun

Towards Improving Neural Named Entity Recognition with Gazetteers NER
Tianyu Liu, Jin-Ge Yao, Chin-Yew Lin

Large-Scale Multi-Label Text Classification on EU Legislation link
Ilias Chalkidis, Emmanouil Fergadiotis, Prodromos Malakasiotis, Ion Androutsopoulos

We consider Large-Scale Multi-Label Text Classification (LMTC) in the legal domain. We release a new dataset of 57k legislative documents from EURLEX, annotated with ~4.3k EUROVOC labels, which is suitable for LMTC, few- and zero-shot learning. Experimenting with several neural classifiers, we show that BIGRUs with label-wise attention perform better than other current state of the art methods. Domain-specific WORD2VEC and context-sensitive ELMO embeddings further improve performance. We also find that considering only particular zones of the documents is sufficient. This allows us to bypass BERT's maximum text length limit and fine-tune BERT, obtaining the best results in all but zero-shot learning cases.

Celebrity Profiling
Matti Wiegmann, Benno Stein, Martin Potthast

Translating Translationese: A Two-Step Approach to Unsupervised Machine Translation link MT
Nima Pourdamghani, Nada Aldarrab, Marjan Ghazvininejad, Kevin Knight, Jonathan May

Compositional Semantic Parsing Across Graphbanks link parsing
Matthias Lindemann, Jonas Groschwitz, Alexander Koller

Most semantic parsers that map sentences to graph-based meaning representations are hand-designed for specific graphbanks. We present a compositional neural semantic parser which achieves, for the first time, competitive accuracies across a diverse range of graphbanks. Incorporating BERT embeddings and multi-task learning improves the accuracy further, setting new states of the art on DM, PAS, PSD, AMR 2015 and EDS.

A2N: Attending to Neighbors for Knowledge Graph Inference
Trapit Bansal, Da-Cheng Juan, Sujith Ravi, Andrew McCallum

You Only Need Attention to Traverse Trees
Mahtab Ahmed, Muhammad Rifayat Samee, Robert E. Mercer

Studying Summarization Evaluation Metrics in the Appropriate Scoring Range summarization
Maxime Peyrard

BAM! Born-Again Multi-Task Networks for Natural Language Understanding link
Kevin Clark, Minh-Thang Luong, Urvashi Khandelwal, Christopher D. Manning, Quoc V. Le

Making Fast Graph-based Algorithms with Graph Metric Embeddings link
Andrey Kutuzov, Mohammad Dorgham, Oleksiy Oliynyk, Chris Biemann, Alexander Panchenko

The computation of distance measures between nodes in graphs is inefficient and does not scale to large graphs. We explore dense vector representations as an effective way to approximate the same information: we introduce a simple yet efficient and effective approach for learning graph embeddings. Instead of directly operating on the graph structure, our method takes structural measures of pairwise node similarities into account and learns dense node representations reflecting user-defined graph distance measures, such as the shortest path distance or distance measures that take information beyond the graph structure into account. We demonstrate a speed-up of several orders of magnitude when predicting word similarity by vector operations on our embeddings as opposed to directly computing the respective path-based measures, while outperforming various other graph embeddings on semantic similarity and word sense disambiguation tasks. We present evaluations on the WordNet graph and two knowledge base graphs.
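
A toy sketch of the general idea, not the authors' training setup: precompute pairwise shortest-path distances on a small graph with networkx, then fit node embeddings whose Euclidean distances approximate them, after which distance queries become cheap vector operations.

```python
import networkx as nx
import torch

def fit_graph_embeddings(graph, dim=16, steps=500, lr=0.05):
    nodes = list(graph.nodes)
    index = {n: i for i, n in enumerate(nodes)}
    # target: shortest-path distances between all node pairs
    dist = dict(nx.all_pairs_shortest_path_length(graph))
    pairs = [(index[u], index[v], float(d))
             for u, lengths in dist.items() for v, d in lengths.items() if u != v]
    src = torch.tensor([p[0] for p in pairs])
    tgt = torch.tensor([p[1] for p in pairs])
    gold = torch.tensor([p[2] for p in pairs])

    emb = torch.nn.Embedding(len(nodes), dim)
    opt = torch.optim.Adam(emb.parameters(), lr=lr)
    for _ in range(steps):
        pred = (emb(src) - emb(tgt)).norm(dim=1)      # embedding distance
        loss = torch.nn.functional.mse_loss(pred, gold)
        opt.zero_grad(); loss.backward(); opt.step()
    return {n: emb.weight[i].detach() for n, i in index.items()}

# usage: fit_graph_embeddings(nx.karate_club_graph())
```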

Is word segmentation child's play in all languages?
Georgia R. Loukatou, Steven Moran, Damian Blasi, Sabine Stoll, Alejandrina Cristia

Every child should have parents: a taxonomy refinement algorithm based on hyperbolic term embeddings link
Rami Aly, Shantanu Acharya, Alexander Ossa, Arne Köhn, Chris Biemann, Alexander Panchenko

We introduce the use of Poincaré embeddings to improve existing state-of-the-art approaches to domain-specific taxonomy induction from text as a signal for both relocating wrong hyponym terms within a (pre-induced) taxonomy as well as for attaching disconnected terms in a taxonomy. This method substantially improves previous state-of-the-art results on the SemEval-2016 Task 13 on taxonomy extraction. We demonstrate the superiority of Poincaré embeddings over distributional semantic representations, supporting the hypothesis that they can better capture hierarchical lexical-semantic relationships than embeddings in the Euclidean space.
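
For reference, the hyperbolic distance that Poincaré embeddings rely on is easy to state in code (this is the standard formula, not code from the paper). Points live in the open unit ball, and distances grow rapidly near the boundary, which is what lets the space encode hierarchies compactly.

```python
import numpy as np

def poincare_distance(u, v, eps=1e-9):
    """Hyperbolic distance between two points in the open unit ball."""
    u, v = np.asarray(u, float), np.asarray(v, float)
    sq_u, sq_v = np.sum(u * u), np.sum(v * v)
    sq_diff = np.sum((u - v) ** 2)
    x = 1.0 + 2.0 * sq_diff / ((1.0 - sq_u) * (1.0 - sq_v) + eps)
    return np.arccosh(x)

# e.g. poincare_distance([0.1, 0.0], [0.0, 0.9]) is much larger than the
# Euclidean distance between the same points.
```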

What does BERT learn about the structure of language?
Ganesh Jawahar, Benoît Sagot, Djamé Seddah

Neural Legal Judgment Prediction in English link
Ilias Chalkidis, Ion Androutsopoulos, Nikolaos Aletras

Legal judgment prediction is the task of automatically predicting the outcome of a court case, given a text describing the case's facts. Previous work on using neural models for this task has focused on Chinese; only feature-based models (e.g., using bags of words and topics) have been considered in English. We release a new English legal judgment prediction dataset, containing cases from the European Court of Human Rights. We evaluate a broad variety of neural models on the new dataset, establishing strong baselines that surpass previous feature-based models in three tasks: (1) binary violation classification; (2) multi-label classification; (3) case importance prediction. We also explore if models are biased towards demographic information via data anonymization. As a side-product, we propose a hierarchical version of BERT, which bypasses BERT's length limitation.

Dual Supervised Learning for Natural Language Understanding and Generation link NLG
Shang-Yu Su, Chao-Wei Huang, Yun-Nung Chen

Natural language understanding (NLU) and natural language generation (NLG) are both critical research topics in the NLP field. Natural language understanding extracts the core semantic meaning from given utterances, while natural language generation does the opposite: its goal is to construct corresponding sentences based on given semantics. However, such a dual relationship has not been investigated in the literature. This paper proposes a new learning framework for language understanding and generation on top of dual supervised learning, providing a way to exploit the duality. Preliminary experiments show that the proposed approach boosts the performance of both tasks.

Yes, we can! Mining Arguments in 50 Years of US Presidential Campaign Debates
Shohreh Haddadan, Elena Cabrio, Serena Villata

Unsupervised Paraphrasing without Translation link MT
Aurko Roy, David Grangier

Paraphrasing exemplifies the ability to abstract semantic content from surface forms. Recent work on automatic paraphrasing is dominated by methods leveraging Machine Translation (MT) as an intermediate step. This contrasts with humans, who can paraphrase without being bilingual. This work proposes to learn paraphrasing models from an unlabeled monolingual corpus only. To that end, we propose a residual variant of vector-quantized variational auto-encoder. We compare with MT-based approaches on paraphrase identification, generation, and training augmentation. Monolingual paraphrasing outperforms unsupervised translation in all settings. Comparisons with supervised translation are more mixed: monolingual paraphrasing is interesting for identification and augmentation; supervised translation is superior for generation.

Reversing Gradients in Adversarial Domain Adaptation for Question Deduplication and Textual Entailment Tasks
Anush Kamath, Sparsh Gupta, Vitor Carvalho

Graph based Neural Networks for Event Factuality Prediction using Syntactic and Semantic Structures
Amir Pouran Ben Veyseh, Thien Huu Nguyen, Dejing Dou

Training Hybrid Language Models by Marginalizing over Segmentations
Edouard Grave, Sainbayar Sukhbaatar, Piotr Bojanowski, Armand Joulin

Psycholinguistics meets Continual Learning: Measuring Catastrophic Forgetting in Visual Question Answering link QA
Claudio Greco, Barbara Plank, Raquel Fernández, Raffaella Bernardi

We study the issue of catastrophic forgetting in the context of neural multimodal approaches to Visual Question Answering (VQA). Motivated by evidence from psycholinguistics, we devise a set of linguistically-informed VQA tasks, which differ by the types of questions involved (Wh-questions and polar questions). We test what impact task difficulty has on continual learning, and whether the order in which a child acquires question types facilitates computational models. Our results show that dramatic forgetting is at play and that task difficulty and order matter. Two well-known current continual learning methods mitigate the problem only to a limited degree.

On the distribution of deep clausal embeddings: a large cross-linguistic study
Damian Blasi, Ryan Cotterell, Lawrence Wolf-Sonkin, Sabine Stoll, Balthasar Bickel, Marco Baroni

Margin-based Parallel Corpus Mining with Multilingual Sentence Embeddings link corpus
Mikel Artetxe, Holger Schwenk

Machine translation is highly sensitive to the size and quality of the training data, which has led to an increasing interest in collecting and filtering large parallel corpora. In this paper, we propose a new method for this task based on multilingual sentence embeddings. Our approach uses an encoder-decoder trained over an initial parallel corpus to build multilingual sentence representations, which are then incorporated into a new margin-based method to score, mine and filter parallel sentences. In contrast to previous approaches, which rely on nearest neighbor retrieval with a hard threshold over cosine similarity, our proposed method accounts for the scale inconsistencies of this measure, considering the margin between a given sentence pair and its closest candidates instead. Our experiments show large improvements over existing methods. We outperform the best published results on the BUCC shared task on parallel corpus mining by more than 10 F1 points. We also improve the precision from 48.9 to 83.3 on the reconstruction of 11.3M English-French sentence pairs of the UN corpus. Finally, filtering the English-German ParaCrawl corpus with our approach, we obtain 31.2 BLEU points on newstest2014, an improvement of more than one point over the best official filtered version.
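
A small numpy sketch of the margin scoring, assuming L2-normalized sentence embeddings and the "ratio" margin over an in-memory candidate pool (a simplification of the paper's large-scale retrieval setup): the cosine similarity of a candidate pair is divided by the average similarity of each sentence to its k nearest neighbors, so a pair scores highly only when it stands out from its local neighborhood.

```python
import numpy as np

def margin_scores(src_emb, tgt_emb, k=4):
    """src_emb: [n, d], tgt_emb: [m, d] L2-normalized sentence embeddings.
    Returns an [n, m] matrix of ratio-margin scores."""
    sim = src_emb @ tgt_emb.T                             # cosine similarities
    # average similarity of each sentence to its k nearest neighbors
    knn_src = np.sort(sim, axis=1)[:, -k:].mean(axis=1)   # [n]
    knn_tgt = np.sort(sim, axis=0)[-k:, :].mean(axis=0)   # [m]
    denom = (knn_src[:, None] + knn_tgt[None, :]) / 2.0
    return sim / denom

# candidate pairs are then mined by taking, for each source sentence,
# the target sentence with the highest margin score above a threshold.
```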

JW300: A Wide-Coverage Parallel Corpus for Low-Resource Languages corpus
Željko Agić, Ivan Vulić

Embedding Imputation with Grounded Language Information link
Ziyi Yang, Chenguang Zhu, Vin Sachidananda, Eric Darve

Due to the ubiquitous use of embeddings as input representations for a wide range of natural language tasks, imputation of embeddings for rare and unseen words is a critical problem in language processing. Embedding imputation involves learning representations for rare or unseen words during the training of an embedding model, often in a post-hoc manner. In this paper, we propose an approach for embedding imputation which uses grounded information in the form of a knowledge graph. This is in contrast to existing approaches which typically make use of vector space properties or subword information. We propose an online method to construct a graph from grounded information and design an algorithm to map from the resulting graphical structure to the space of the pre-trained embeddings. Finally, we evaluate our approach on a range of rare and unseen word tasks across various domains and show that our model can learn better representations. For example, on the Card-660 task our method improves Pearson's and Spearman's correlation coefficients upon the state-of-the-art by 11% and 17.8% respectively using GloVe embeddings.

Rewarding Smatch: Transition-Based AMR Parsing with Reinforcement Learning link parsing
Tahira Naseem, Abhishek Shah, Hui Wan, Radu Florian, Salim Roukos, Miguel Ballesteros

Our work enriches the Stack-LSTM transition-based AMR parser (Ballesteros and Al-Onaizan, 2017) by augmenting training with policy learning and rewarding the Smatch score of sampled graphs. In addition, we combine several AMR-to-text alignments with an attention mechanism and supplement the parser with pre-processed concept identification, named entities and contextualized embeddings. We achieve highly competitive performance that is comparable to the best published results. We present an in-depth ablation study of each new component of the parser.

Compositional Questions Do Not Necessitate Multi-hop Reasoning link
Sewon Min, Eric Wallace, Sameer Singh, Matt Gardner, Hannaneh Hajishirzi, Luke Zettlemoyer

Multi-hop reading comprehension (RC) questions are challenging because they require reading and reasoning over multiple paragraphs. We argue that it can be difficult to construct large multi-hop RC datasets. For example, even highly compositional questions can be answered with a single hop if they target specific entity types, or the facts needed to answer them are redundant. Our analysis is centered on HotpotQA, where we show that single-hop reasoning can solve much more of the dataset than previously thought. We introduce a single-hop BERT-based RC model that achieves 67 F1---comparable to state-of-the-art multi-hop models. We also design an evaluation setting where humans are not shown all of the necessary paragraphs for the intended multi-hop reasoning but can still answer over 80% of questions. Together with detailed error analysis, these results suggest there should be an increasing focus on the role of evidence in multi-hop reasoning and possibly even a shift towards information retrieval style evaluations with large and diverse evidence collections.

Boosting Dialog Response Generation dialogue/conversation NLG
Wenchao Du, Alan W Black

Embedding time expressions for deep temporal ordering models link
Tanya Goyal, Greg Durrett

Data-driven models have demonstrated state-of-the-art performance in inferring the temporal ordering of events in text. However, these models often overlook explicit temporal signals, such as dates and time windows. Rule-based methods can be used to identify the temporal links between these time expressions (timexes), but they fail to capture timexes' interactions with events and are hard to integrate with the distributed representations of neural net models. In this paper, we introduce a framework to infuse temporal awareness into such models by learning a pre-trained model to embed timexes. We generate synthetic data consisting of pairs of timexes, then train a character LSTM to learn embeddings and classify the timexes' temporal relation. We evaluate the utility of these embeddings in the context of a strong neural model for event temporal ordering, and show a small increase in performance on the MATRES dataset and more substantial gains on an automatically collected dataset with more frequent event-timex interactions.

Learning to Rank for Plausible Plausibility link
Zhongyang Li, Tongfei Chen, Benjamin Van Durme

Researchers illustrate improvements in contextual encoding strategies via resultant performance on a battery of shared Natural Language Understanding (NLU) tasks. Many of these tasks are of a categorical prediction variety: given a conditioning context (e.g., an NLI premise), provide a label based on an associated prompt (e.g., an NLI hypothesis). The categorical nature of these tasks has led to common use of a cross entropy log-loss objective during training. We suggest this loss is intuitively wrong when applied to plausibility tasks, where the prompt by design is neither categorically entailed nor contradictory given the context. Log-loss naturally drives models to assign scores near 0.0 or 1.0, in contrast to our proposed use of a margin-based loss. Following a discussion of our intuition, we describe a confirmation study based on an extreme, synthetically curated task derived from MultiNLI. We find that a margin-based loss leads to a more plausible model of plausibility. Finally, we illustrate improvements on the Choice Of Plausible Alternative (COPA) task through this change in loss.
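
A minimal PyTorch sketch contrasting the two objectives (the scores and the margin value are illustrative): log-loss drives the plausibility score of the correct alternative toward 1 and the incorrect one toward 0, while a margin loss only asks that the correct alternative outscore the incorrect one by some margin.

```python
import torch
import torch.nn.functional as F

def log_loss(score_correct, score_incorrect):
    # cross entropy drives scores toward 0/1
    logits = torch.stack([score_correct, score_incorrect], dim=-1)
    target = torch.zeros(logits.shape[:-1], dtype=torch.long)
    return F.cross_entropy(logits, target)

def margin_loss(score_correct, score_incorrect, margin=0.3):
    # only requires the correct choice to outscore the incorrect one
    return F.relu(margin - (score_correct - score_incorrect)).mean()

# e.g. with batched scores from some plausibility model:
# s_pos, s_neg = model(premise, choice_a), model(premise, choice_b)
# loss = margin_loss(s_pos, s_neg)
```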

Learning Emphasis Selection for Written Text in Visual Media from Crowd-Sourced Label Distributions
Amirreza Shirani, Franck Dernoncourt, Paul Asente, Nedim Lipka, Seokhwan Kim, Jose Echevarria, Thamar Solorio

SUMBT: Slot-Utterance Matching for Universal and Scalable Belief Tracking
Hwaran Lee, Jinsik Lee, Tae-Yoon Kim

Towards Integration of Statistical Hypothesis Tests into Deep Neural Networks link
Ahmad Aghaebrahimian, Mark Cieliebak

We report our ongoing work on a new deep architecture working in tandem with a statistical test procedure for jointly training texts and their label descriptions for multi-label and multi-class classification tasks. A statistical hypothesis testing method is used to extract the most informative words for each given class. These words are used as a class description for more label-aware text classification. The intuition is to help the model concentrate on more informative words rather than more frequent ones. The model leverages label descriptions in addition to the input text to enhance text classification performance. Our method is entirely data-driven, has no dependency on sources of information other than the training data, and is adaptable to different classification problems by providing appropriate training data without major hyper-parameter tuning. We trained and tested our system on several publicly available datasets, where we improved the state of the art on one of them by a large margin and obtained competitive results on all others.

Why Didn’t You Listen to Me? Comparing User Control of Human-in-the-Loop Topic Models link
Varun Kumar, Alison Smith-Renner, Leah Findlater, Kevin Seppi, Jordan Boyd-Graber

Synthetic QA Corpora Generation with Roundtrip Consistency link NLG
Chris Alberti, Daniel Andor, Emily Pitler, Jacob Devlin, Michael Collins

We introduce a novel method of generating synthetic question answering corpora by combining models of question generation and answer extraction, and by filtering the results to ensure roundtrip consistency. By pretraining on the resulting corpora we obtain significant improvements on SQuAD2 and NQ, establishing a new state-of-the-art on the latter. Our synthetic data generation models, for both question generation and answer extraction, can be fully reproduced by finetuning a publicly available BERT model on the extractive subsets of SQuAD2 and NQ. We also describe a more powerful variant that does full sequence-to-sequence pretraining for question generation, obtaining exact match and F1 scores within 0.1% and 0.4% of human performance on SQuAD2.
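
The filtering step can be sketched directly; question_generator and answer_extractor below are hypothetical stand-ins for the paper's fine-tuned models. A synthetic example survives only if the answer re-extracted from the generated question matches the original answer.

```python
def roundtrip_filter(contexts_and_answers, question_generator, answer_extractor):
    """Keep only synthetic (context, question, answer) triples that survive the
    roundtrip: extract(context, generate(context, answer)) == answer."""
    kept = []
    for context, answer in contexts_and_answers:
        question = question_generator(context, answer)    # hypothetical model call
        recovered = answer_extractor(context, question)   # hypothetical model call
        if recovered.strip().lower() == answer.strip().lower():
            kept.append((context, question, answer))
    return kept
```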

Attention-based Conditioning Methods for External Knowledge Integration link
Katerina Margatina, Christos Baziotis, Alexandros Potamianos

In this paper, we present a novel approach for incorporating external knowledge in Recurrent Neural Networks (RNNs). We propose the integration of lexicon features into the self-attention mechanism of RNN-based architectures. This form of conditioning on the attention distribution enforces the contribution of the most salient words for the task at hand. We introduce three methods, namely attentional concatenation, feature-based gating and affine transformation. Experiments on six benchmark datasets show the effectiveness of our methods. Attentional feature-based gating yields consistent performance improvement across tasks. Our approach is implemented as a simple add-on module for RNN-based models with minimal computational overhead and can be adapted to any deep neural architecture.

Generating Diverse Translations with Sentence Codes MT
Raphael Shu, Hideki Nakayama, Kyunghyun Cho

A Transparent Framework for Evaluating Unintended Demographic Bias in Word Embeddings bias RepL
Chris Sweeney, Maryam Najafian

End-to-end Deep Reinforcement Learning Based Coreference Resolution
Hongliang Fei, Xu Li, Dingcheng Li, Ping Li

Exploring Phoneme-Level Speech Representations for End-to-End Speech Translation link MT
Elizabeth Salesky, Matthias Sperber, Alan W Black

Previous work on end-to-end translation from speech has primarily used frame-level features as speech representations, which creates longer, sparser sequences than text. We show that a naive method to create compressed phoneme-like speech representations is far more effective and efficient for translation than traditional frame-level speech features. Specifically, we generate phoneme labels for speech frames and average consecutive frames with the same label to create shorter, higher-level source sequences for translation. We see improvements of up to 5 BLEU on both our high and low resource language pairs, with a reduction in training time of 60%. Our improvements hold across multiple data sizes and two language pairs.
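
The frame-averaging step is simple enough to sketch, assuming per-frame feature vectors and phoneme labels are already available (numpy; itertools.groupby collapses runs of identical labels into one averaged vector each):

```python
import numpy as np
from itertools import groupby

def average_by_phoneme(frames, labels):
    """frames: [T, d] frame-level features; labels: length-T phoneme labels.
    Returns one averaged vector per run of consecutive identical labels."""
    segments, i = [], 0
    for label, run in groupby(labels):
        n = len(list(run))
        segments.append(frames[i:i + n].mean(axis=0))  # average the run's frames
        i += n
    return np.stack(segments)

# e.g. labels ["AH", "AH", "T", "T", "T"] collapse 5 frames into 2 vectors.
```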

A Semi-Markov Structured Support Vector Machine Model for High-Precision Named Entity Recognition NER
Chen-Tse Tsai, Ravneet Arora, Ketevan Tsereteli, Prabhanjan Kambadur, Yi Yang

Robust Zero-Shot Cross-Domain Slot Filling with Example Values link
Darsh Shah, Raghav Gupta, Amir Fayazi, Dilek Hakkani-Tur

Task-oriented dialog systems increasingly rely on deep learning-based slot filling models, usually needing extensive labeled training data for target domains. Often, however, little to no target domain training data may be available, or the training and target domain schemas may be misaligned, as is common for web forms on similar websites. Prior zero-shot slot filling models use slot descriptions to learn concepts, but are not robust to misaligned schemas. We propose utilizing both the slot description and a small number of examples of slot values, which may be easily available, to learn semantic representations of slots which are transferable across domains and robust to misaligned schemas. Our approach outperforms state-of-the-art models on two multi-domain datasets, especially in the low-data setting.

Encouraging Paragraph Embeddings to Remember Sentence Identity Improves Classification link
Tu Vu, Mohit Iyyer

While paragraph embedding models are remarkably effective for downstream classification tasks, what they learn and encode into a single vector remains opaque. In this paper, we investigate a state-of-the-art paragraph embedding method proposed by Zhang et al. (2017) and discover that it cannot reliably tell whether a given sentence occurs in the input paragraph or not. We formulate a sentence content task to probe for this basic linguistic property and find that even a much simpler bag-of-words method has no trouble solving it. This result motivates us to replace the reconstruction-based objective of Zhang et al. (2017) with our sentence content probe objective in a semi-supervised setting. Despite its simplicity, our objective improves over paragraph reconstruction in terms of (1) downstream classification accuracies on benchmark datasets, (2) faster training, and (3) better generalization ability.

Employing the Correspondence of Relations and Connectives to Identify Implicit Discourse Relations via Label Embeddings
Linh The Nguyen, Linh Van Ngo, Khoat Than, Thien Huu Nguyen

Multimodal Abstractive Summarization for How2 Videos link summarization
Shruti Palaskar, Jindřich Libovický, Spandana Gella, Florian Metze

In this paper, we study abstractive summarization for open-domain videos. Unlike traditional text news summarization, the goal is less to "compress" text information and more to provide a fluent textual summary of information that has been collected and fused from different source modalities, in our case video and audio transcripts (or text). We show how a multi-source sequence-to-sequence model with hierarchical attention can integrate information from different modalities into a coherent output, compare various models trained with different modalities and present pilot experiments on the How2 corpus of instructional videos. We also propose a new evaluation metric (Content F1) for the abstractive summarization task that measures semantic adequacy rather than fluency of the summaries, which is covered by metrics like ROUGE and BLEU.

Rumor Detection By Exploiting User Credibility Information, Attention and Multi-task Learning
Quanzhi Li, Qiong Zhang, Luo Si

Deep Unknown Intent Detection with Margin Loss link
Ting-En Lin, Hua Xu

Identifying the unknown (novel) user intents that have never appeared in the training set is a challenging task in dialogue systems. In this paper, we present a two-stage method for detecting unknown intents. We use a bidirectional long short-term memory (BiLSTM) network with the margin loss as the feature extractor. With margin loss, we can learn discriminative deep features by forcing the network to maximize inter-class variance and to minimize intra-class variance. Then, we feed the feature vectors to the density-based novelty detection algorithm, local outlier factor (LOF), to detect unknown intents. Experiments on two benchmark datasets show that our method can yield consistent improvements compared with the baseline methods.
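
A minimal sketch of the second stage, assuming feature vectors have already been produced by the margin-trained BiLSTM (random arrays stand in for them here) and using scikit-learn's LocalOutlierFactor in novelty mode to flag test utterances that fall outside the known-intent regions:

```python
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

def detect_unknown_intents(train_features, test_features, n_neighbors=20):
    """Fit LOF on known-intent features; flag test examples as unknown (-1)."""
    lof = LocalOutlierFactor(n_neighbors=n_neighbors, novelty=True)
    lof.fit(train_features)                    # only known intents at training time
    predictions = lof.predict(test_features)   # +1 = known region, -1 = outlier
    return predictions == -1                   # True where the intent looks unknown

# usage (illustrative shapes):
# detect_unknown_intents(np.random.randn(500, 128), np.random.randn(10, 128))
```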

A Simple Recipe towards Reducing Hallucination in Neural Surface Realisation
Feng Nie, Jin-Ge Yao, Jinpeng Wang, Rong Pan, Chin-Yew Lin

Training Neural Machine Translation To Apply Terminology Constraints link MT
Georgiana Dinu, Prashant Mathur, Marcello Federico, Yaser Al-Onaizan

This paper proposes a novel method to inject custom terminology into neural machine translation at run time. Previous works have mainly proposed modifications to the decoding algorithm in order to constrain the output to include run-time-provided target terms. While being effective, these constrained decoding methods add, however, significant computational overhead to the inference step, and, as we show in this paper, can be brittle when tested in realistic conditions. In this paper we approach the problem by training a neural MT system to learn how to use custom terminology when provided with the input. Comparative experiments show that our method is not only more effective than a state-of-the-art implementation of constrained decoding, but is also as fast as constraint-free decoding.
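
One hedged sketch of how the training input could be annotated with run-time terminology (the paper works with source-side annotations; the exact tagging scheme below is only illustrative): the provided target term is inlined next to the matching source word, so the model learns to copy terms supplied at run time rather than having them enforced at decoding time.

```python
def annotate_with_terminology(source_tokens, terminology):
    """Inline run-time terminology into the source sentence so the model can
    learn to copy it. The marker tokens here are illustrative only."""
    annotated = []
    for token in source_tokens:
        annotated.append(token)
        if token in terminology:
            # surround the provided target term with markers seen during training
            annotated += ["<trans>", terminology[token], "</trans>"]
    return annotated

# e.g. annotate_with_terminology(["the", "device"], {"device": "Gerät"})
# -> ["the", "device", "<trans>", "Gerät", "</trans>"]
```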

BERT Rediscovers the Classical NLP Pipeline link
Ian Tenney, Dipanjan Das, Ellie Pavlick

Pre-trained text encoders have rapidly advanced the state of the art on many NLP tasks. We focus on one such model, BERT, and aim to quantify where linguistic information is captured within the network. We find that the model represents the steps of the traditional NLP pipeline in an interpretable and localizable way, and that the regions responsible for each step appear in the expected sequence: POS tagging, parsing, NER, semantic roles, then coreference. Qualitative analysis reveals that the model can and often does adjust this pipeline dynamically, revising lower-level decisions on the basis of disambiguating information from higher-level representations.

Exploiting Sentential Context for Neural Machine Translation link MT
Xing Wang, Zhaopeng Tu, Longyue Wang, Shuming Shi

In this work, we present novel approaches to exploit sentential context for neural machine translation (NMT). Specifically, we first show that a shallow sentential context extracted from the top encoder layer only, can improve translation performance via contextualizing the encoding representations of individual words. Next, we introduce a deep sentential context, which aggregates the sentential context representations from all the internal layers of the encoder to form a more comprehensive context representation. Experimental results on the WMT14 English-to-German and English-to-French benchmarks show that our model consistently improves performance over the strong TRANSFORMER model (Vaswani et al., 2017), demonstrating the necessity and effectiveness of exploiting sentential context for NMT.

Multilingual Constituency Parsing with Self-Attention and Pre-Training link parsing
Nikita Kitaev, Steven Cao, Dan Klein

We show that constituency parsing benefits from unsupervised pre-training across a variety of languages and a range of pre-training conditions. We first compare the benefits of no pre-training, fastText, ELMo, and BERT for English and find that BERT outperforms ELMo, in large part due to increased model capacity, whereas ELMo in turn outperforms the non-contextual fastText embeddings. We also find that pre-training is beneficial across all 11 languages tested; however, large model sizes (more than 100 million parameters) make it computationally expensive to train separate models for each language. To address this shortcoming, we show that joint multilingual pre-training and fine-tuning allows sharing all but a small number of parameters between ten languages in the final model. The 10x reduction in model size compared to fine-tuning one model per language causes only a 3.2% relative error increase in aggregate. We further explore the idea of joint fine-tuning and show that it gives low-resource languages a way to benefit from the larger datasets of other languages. Finally, we demonstrate new state-of-the-art results for 11 languages, including English (95.8 F1) and Chinese (91.8 F1).

Simple Unsupervised Summarization by Contextual Matching summarization
Jiawei Zhou, Alexander Rush

Women's Syntactic Resilience and Men's Grammatical Luck: Gender-Bias in Part-of-Speech Tagging and Dependency Parsing parsing
Aparna Garimella, Carmen Banea, Dirk Hovy, Rada Mihalcea

Storyboarding of Recipes: Grounded Contextual Generation link NLG
Khyathi Chandu, Eric Nyberg, Alan W Black

Energy and Policy Considerations for Deep Learning in NLP link
Emma Strubell, Ananya Ganesh, Andrew McCallum

Recent progress in hardware and methodology for training neural networks has ushered in a new generation of large networks trained on abundant data. These models have obtained notable gains in accuracy across many NLP tasks. However, these accuracy improvements depend on the availability of exceptionally large computational resources that necessitate similarly substantial energy consumption. As a result these models are costly to train and develop, both financially, due to the cost of hardware and electricity or cloud compute time, and environmentally, due to the carbon footprint required to fuel modern tensor processing hardware. In this paper we bring this issue to the attention of NLP researchers by quantifying the approximate financial and environmental costs of training a variety of recently successful neural network models for NLP. Based on these findings, we propose actionable recommendations to reduce costs and improve equity in NLP research and practice.

Depth Growing for Neural Machine Translation link MT
Lijun Wu, Yiren Wang, Yingce Xia, Fei Tian, Fei Gao, Tao QIN, Jianhuang Lai, Tie-Yan Liu

While very deep neural networks have shown effectiveness for computer vision and text classification applications, how to increase the network depth of neural machine translation (NMT) models for better translation quality remains a challenging problem. Directly stacking more blocks onto the NMT model results in no improvement and even reduces performance. In this work, we propose an effective two-stage approach with three specially designed components to construct deeper NMT models, which yields significant improvements over the strong Transformer baselines on the WMT14 English→German and English→French translation tasks. Our code is available at https://github.com/apeterswu/Depth_Growing_NMT.

Effective Adversarial Regularization for Neural Machine Translation MT
Motoki Sato, Jun Suzuki, Shun Kiyono

Improving Question Answering over Incomplete KBs with Knowledge-Aware Reader link QA
Wenhan Xiong, Mo Yu, Shiyu Chang, Xiaoxiao Guo, William Yang Wang

We propose a new end-to-end question answering model, which learns to aggregate answer evidence from an incomplete knowledge base (KB) and a set of retrieved text snippets. Under the assumptions that the structured KB is easier to query and the acquired knowledge can help the understanding of unstructured text, our model first accumulates knowledge of entities from a question-related KB subgraph; then reformulates the question in the latent space and reads the texts with the accumulated entity knowledge at hand. The evidence from KB and texts are finally aggregated to predict answers. On the widely-used KBQA benchmark WebQSP, our model achieves consistent improvements across settings with different extents of KB incompleteness.

Span-Level Model for Relation Extraction IE
Kalpit Dixit, Yaser Al-Onaizan

Evaluating Gender Bias in Machine Translation link MT bias
Gabriel Stanovsky, Noah A. Smith, Luke Zettlemoyer

Learning to Relate from Captions and Bounding Boxes
Sarthak Garg, Joel Ruben Antony Moniz, Anshu Aviral, Priyatham Bollimpalli

Attention Is (not) All You Need for Commonsense Reasoning link
Tassilo Klein, Moin Nabi

The recently introduced BERT model exhibits strong performance on several language understanding benchmarks. In this paper, we describe a simple re-implementation of BERT for commonsense reasoning. We show that the attentions produced by BERT can be directly utilized for tasks such as the Pronoun Disambiguation Problem and Winograd Schema Challenge. Our proposed attention-guided commonsense reasoning method is conceptually simple yet empirically powerful. Experimental analysis on multiple datasets demonstrates that our proposed system performs remarkably well on all cases while outperforming the previously reported state of the art by a margin. While results suggest that BERT seems to implicitly learn to establish complex relationships between entities, solving commonsense reasoning tasks might require more than unsupervised models learned from huge text corpora.

How Multilingual is Multilingual BERT? link
Telmo Pires, Eva Schlinger, Dan Garrette

Ranking Generated Summaries by Correctness: An Interesting but Challenging Application for Natural Language Inference NLI
Tobias Falke, Leonardo F. R. Ribeiro, Prasetya Ajie Utama, Ido Dagan, Iryna Gurevych

Dataset Creation for Ranking Constructive News Comments corpus
Soichiro Fujita, Hayato Kobayashi, Manabu Okumura

Global Textual Relation Embedding for Relational Understanding link
Zhiyu Chen, Hanwen Zha, Honglei Liu, Wenhu Chen, Xifeng Yan, Yu Su

Pre-trained embeddings such as word embeddings and sentence embeddings are fundamental tools facilitating a wide range of downstream NLP tasks. In this work, we investigate how to learn a general-purpose embedding of textual relations, defined as the shortest dependency path between entities. Textual relation embedding provides a level of knowledge between word/phrase level and sentence level, and we show that it can facilitate downstream tasks requiring relational understanding of the text. To learn such an embedding, we create the largest distant supervision dataset by linking the entire English ClueWeb09 corpus to Freebase. We use global co-occurrence statistics between textual and knowledge base relations as the supervision signal to train the embedding. Evaluation on two relational understanding tasks demonstrates the usefulness of the learned textual relation embedding. The data and code can be found at https://github.com/czyssrs/GloREPlus

Self-Supervised Learning for Contextualized Extractive Summarization summarization
Hong Wang, Xin Wang, Wenhan Xiong, Mo Yu, Xiaoxiao Guo, Shiyu Chang, William Yang Wang

Unsupervised Joint Training of Bilingual Word Embeddings RepL
Benjamin Marie, Atsushi Fujita

A Just and Comprehensive Strategy for Using NLP to Address Online Abuse link
David Jurgens, Libby Hemphill, Eshwar Chandrasekharan

Online abusive behavior affects millions and the NLP community has attempted to mitigate this problem by developing technologies to detect abuse. However, current methods have largely focused on a narrow definition of abuse, to the detriment of victims who seek both validation and solutions. In this position paper, we argue that the community needs to make three substantive changes: (1) expanding our scope of problems to tackle both more subtle and more serious forms of abuse, (2) developing proactive technologies that counter or inhibit abuse before it harms, and (3) reframing our effort within a framework of justice to promote healthy communities.

Towards Near-imperceptible Steganographic Text
Falcon Dai, Zheng Cai

Analyzing Linguistic Differences between Owner and Staff Attributed Tweets
Daniel Preoţiuc-Pietro, Rita Devlin Marier

AdaNSP: Uncertainty-driven Adaptive Decoding in Neural Semantic Parsing parsing
Xiang Zhang, Shizhu He, Kang Liu, Jun Zhao

Cross-Domain Generalization of Neural Constituency Parsers parsing
Daniel Fried, Nikita Kitaev, Dan Klein

Are Red Roses Red? Evaluating Consistency of Question-Answering Models link QA
Marco Tulio Ribeiro, Carlos Guestrin, Sameer Singh

Learning Representations from Imperfect Time Series Data via Tensor Rank Regularization link RepL
Paul Pu Liang, Zhun Liu, Yao-Hung Hubert Tsai, Qibin Zhao, Ruslan Salakhutdinov, Louis-Philippe Morency

There has been an increased interest in multimodal language processing including multimodal dialog, question answering, sentiment analysis, and speech recognition. However, naturally occurring multimodal data is often imperfect as a result of imperfect modalities, missing entries or noise corruption. To address these concerns, we present a regularization method based on tensor rank minimization. Our method is based on the observation that high-dimensional multimodal time series data often exhibit correlations across time and modalities which leads to low-rank tensor representations. However, the presence of noise or incomplete values breaks these correlations and results in tensor representations of higher rank. We design a model to learn such tensor representations and effectively regularize their rank. Experiments on multimodal language data show that our model achieves good results across various levels of imperfection.

Enhancing Air Quality Prediction with Social Media and Natural Language Processing
Jyun-Yu Jiang, Xue Sun, Wei Wang, Sean Young

Improving Visual Question Answering by Referring to Generated Paragraph Captions link QA
Hyounghun Kim, Mohit Bansal

Paragraph-style image captions describe diverse aspects of an image as opposed to the more common single-sentence captions that only provide an abstract description of the image. These paragraph captions can hence contain substantial information of the image for tasks such as visual question answering. Moreover, this textual information is complementary to the visual information present in the image because it can discuss both more abstract concepts and more explicit, intermediate symbolic information about objects, events, and scenes that can directly be matched with the textual question and copied into the textual answer (i.e., via easier modality match). Hence, we propose a combined Visual and Textual Question Answering (VTQA) model which takes as input a paragraph caption as well as the corresponding image, and answers the given question based on both inputs. In our model, the inputs are fused to extract related information by cross-attention (early fusion), then fused again in the form of consensus (late fusion), and finally expected answers are given an extra score to enhance the chance of selection (later fusion). Empirical results show that paragraph captions, even when automatically generated (via an RL-based encoder-decoder model), help correctly answer more visual questions. Overall, our joint model, when trained on the Visual Genome dataset, significantly improves the VQA performance over a strong baseline model.

Putting Evaluation in Context: Contextual Embeddings improve Machine Translation Evaluation MT
Nitika Mathur, Timothy Baldwin, Trevor Cohn

Context-specific language modeling for human trafficking detection from online advertisements
Saeideh Shahrokh Esfahani, Michael J. Cafarella, Gregory DeAngelo, Elena Eneva, Maziyar Baran Pouyan, Andy E. Fano

Leveraging Local and Global Patterns for Self-Attention Networks
Mingzhou Xu, Derek F. Wong, Baosong Yang, Yue Zhang, Lidia S. Chao

Implicit Discourse Relation Identification for Open-domain Dialogues dialogue/conversation
Mingyu Derek Ma, Kevin Bowden, Jiaqi Wu, Wen Cui, Marilyn Walker

Sentence-Level Agreement for Neural Machine Translation MT
Mingming Yang, Rui Wang, Kehai Chen, Masao Utiyama, Eiichiro Sumita, Min Zhang, Tiejun Zhao

Negative Lexically Constrained Decoding for Paraphrase Generation NLG
Tomoyuki Kajiwara

Wetin dey with these comments? Modeling Sociolinguistic Factors Affecting Code-switching Behavior in Nigerian Online Discussions
Innocent Ndubuisi-Obi, Sayan Ghosh, David Jurgens

Towards Lossless Encoding of Sentences link
Gabriele Prato, Mathieu Duchesneau, Sarath Chandar, Alain Tapp

A lot of work has been done in the field of image compression via machine learning, but not much attention has been given to the compression of natural language. Compressing text into lossless representations while making features easily retrievable is not a trivial task, yet has huge benefits. Most methods designed to produce feature rich sentence embeddings focus solely on performing well on downstream tasks and are unable to properly reconstruct the original sequence from the learned embedding. In this work, we propose a near lossless method for encoding long sequences of texts as well as all of their sub-sequences into feature rich representations. We test our method on sentiment analysis and show good performance across all sub-sentence and sentence embeddings.

A Corpus for Modeling User and Language Effects in Argumentation on Online Debating link corpus
Esin Durmus, Claire Cardie

Existing argumentation datasets have succeeded in allowing researchers to develop computational methods for analyzing the content, structure and linguistic features of argumentative text. They have been much less successful in fostering studies of the effect of "user" traits -- characteristics and beliefs of the participants -- on the debate/argument outcome as this type of user information is generally not available. This paper presents a dataset of 78,376 debates generated over a 10-year period along with surprisingly comprehensive participant profiles. We also complete an example study using the dataset to analyze the effect of selected user traits on the debate outcome in comparison to the linguistic features typically employed in studies of this kind.

The Effectiveness of Simple Hybrid Systems for Hypernym Discovery
William Held, Nizar Habash

Multilingual Unsupervised NMT using Shared Encoder and Language-Specific Decoders
Sukanta Sen, Kamal Kumar Gupta, Asif Ekbal, Pushpak Bhattacharyya

TalkSumm: A Dataset and Scalable Annotation Method for Scientific Paper Summarization Based on Conference Talks link summarization
Guy Lev, Michal Shmueli-Scheuer, Jonathan Herzig, Achiya Jerbi, David Konopnicki

Currently, no large-scale training data is available for the task of scientific paper summarization. In this paper, we propose a novel method that automatically generates summaries for scientific papers by utilizing videos of talks at scientific conferences. We hypothesize that such talks constitute a coherent and concise description of the papers' content, and can form the basis for good summaries. We collected 1716 papers and their corresponding videos, and created a dataset of paper summaries. A model trained on this dataset achieves performance similar to that of models trained on a dataset of summaries created manually. In addition, we validated the quality of our summaries with human experts.

On the Summarization of Consumer Health Questions summarization
Asma Ben Abacha, Dina Demner-Fushman

Exploring Author Context for Detecting Intended vs Perceived Sarcasm
Silviu Oprea, Walid Magdy

Simple and Effective Paraphrastic Similarity from Parallel Translations MT
John Wieting, Kevin Gimpel, Graham Neubig, Taylor Berg-Kirkpatrick

Bilingual Lexicon Induction through Unsupervised Machine Translation MT
Mikel Artetxe, Gorka Labaka, Eneko Agirre

A Surprisingly Robust Trick for the Winograd Schema Challenge link
Vid Kocijan, Ana-Maria Cretu, Oana-Maria Camburu, Yordan Yordanov, Thomas Lukasiewicz

Large-Scale Transfer Learning for Natural Language Generation NLG transfer
Sergey Golovanov, Rauf Kurbanov, Sergey Nikolenko, Kyryl Truskovskyi, Alexander Tselousov, Thomas Wolf

Better OOV Translation with Bilingual Terminology Mining MT
Matthias Huck, Viktor Hangya, Alexander Fraser

A Prism Module for Semantic Disentanglement in Name Entity Recognition NER
Kun Liu, Shen Li, Daqi Zheng, Zhengdong Lu, Sheng Gao, Si Li

A Multi-Task Architecture on Relevance-based Neural Query Translation link MT
Sheikh Muhammad Sarwar, Hamed Bonab, James Allan

We describe a multi-task learning approach to train a Neural Machine Translation (NMT) model with a Relevance-based Auxiliary Task (RAT) for search query translation. The translation process for the Cross-lingual Information Retrieval (CLIR) task is usually treated as a black box and performed as an independent step. However, an NMT model trained on sentence-level parallel data is not aware of the vocabulary distribution of the retrieval corpus. We address this problem with our multi-task learning architecture, which achieves a 16% improvement over a strong NMT baseline on an Italian-English query-document dataset. We show, using both quantitative and qualitative analysis, that our model generates balanced and precise translations thanks to the regularization effect it obtains from the multi-task learning paradigm.
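
A minimal sketch of the general idea, assuming the two per-batch losses are already computed (the mixing weight and names below are hypothetical; the paper's exact formulation is not reproduced here):

```python
import torch

def multitask_loss(nmt_loss: torch.Tensor, relevance_loss: torch.Tensor,
                   alpha: float = 0.5) -> torch.Tensor:
    """Interpolate the translation loss with a relevance-based auxiliary loss."""
    return (1.0 - alpha) * nmt_loss + alpha * relevance_loss

# Toy scalars standing in for the per-batch translation and relevance losses.
print(multitask_loss(torch.tensor(2.3), torch.tensor(0.7)))
```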

Lattice-Based Transformer Encoder for Neural Machine Translation MT
Fengshun Xiao, Jiangtong Li, Hai Zhao, Rui Wang, Kehai Chen

BERT-based Lexical Substitution
Wangchunshu Zhou, Tao Ge, Ke Xu, Furu Wei, Ming Zhou

An Empirical Study of Span Representations in Argumentation Structure Parsing parsing
Tatsuki Kuribayashi, Hiroki Ouchi, Naoya Inoue, Paul Reisert, Toshinori Miyoshi, Jun Suzuki, Kentaro Inui

Collocation Classification with Unsupervised Relation Vectors
Luis Espinosa Anke, Steven Schockaert, Leo Wanner

We need to talk about standard splits link
Kyle Gorman, Steven Bedrick

Generating Fluent Adversarial Examples for Natural Languages
Huangzhao Zhang, Ning Miao, Hao Zhou, Lei Li

Modeling Semantic Relationship in Multi-turn Conversations with Hierarchical Latent Variables link dialogue/conversation
Lei Shen, Yang Feng, Haolan Zhan

Multi-turn conversations consist of complex semantic structures, and it is still a challenge to generate coherent and diverse responses given previous utterances. In practice, a conversation takes place against a background; meanwhile, the query and response are usually the most closely related utterances: they are consistent in topic but differ in content. However, little work focuses on such hierarchical relationships among utterances. To address this problem, we propose a Conversational Semantic Relationship RNN (CSRR) model to construct the dependency explicitly. The model contains latent variables at three levels. The discourse-level one captures the global background, the pair-level one stands for the common topic information between query and response, and the utterance-level ones try to represent differences in content. Experimental results show that our model significantly improves the quality of responses in terms of fluency, coherence and diversity compared to baseline methods.

Generating Summaries with Topic Templates and Structured Convolutional Decoders link
Laura Perez-Beltrachini, Yang Liu, Mirella Lapata

Existing neural generation approaches create multi-sentence text as a single sequence. In this paper we propose a structured convolutional decoder that is guided by the content structure of target summaries. We compare our model with existing sequential decoders on three data sets representing different domains. Automatic and human evaluation demonstrate that our summaries have better content coverage.

Unsupervised Rewriter for Multi-Sentence Compression
Yang Zhao, Xiaoyu Shen, Wei Bi, Akiko Aizawa

Model-Agnostic Meta-Learning for Relation Classification with Limited Supervision
Abiola Obamuyide, Andreas Vlachos

Label-Agnostic Sequence Labeling by Copying Nearest Neighbors link
Sam Wiseman, Karl Stratos

Retrieve-and-edit based approaches to structured prediction, where structures associated with retrieved neighbors are edited to form new structures, have recently attracted increased interest. However, much recent work merely conditions on retrieved structures (e.g., in a sequence-to-sequence framework), rather than explicitly manipulating them. We show we can perform accurate sequence labeling by explicitly (and only) copying labels from retrieved neighbors. Moreover, because this copying is label-agnostic, we can achieve impressive performance in zero-shot sequence-labeling tasks. We additionally consider a dynamic programming approach to sequence labeling in the presence of retrieved neighbors, which allows for controlling the number of distinct (copied) segments used to form a prediction, and leads to both more interpretable and accurate predictions.
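
A minimal sketch of the core copying step, assuming token representations for the input and for retrieved neighbor sentences are already available (the random vectors below stand in for encoder outputs; the paper's dynamic program over copied segments is omitted):

```python
import numpy as np

def copy_labels(test_vecs, neighbor_vecs, neighbor_labels):
    """Label each input token with the label of its most similar retrieved neighbor token."""
    a = test_vecs / np.linalg.norm(test_vecs, axis=1, keepdims=True)
    b = neighbor_vecs / np.linalg.norm(neighbor_vecs, axis=1, keepdims=True)
    sim = a @ b.T                      # cosine similarity, (n_test, n_neighbor)
    best = sim.argmax(axis=1)          # nearest neighbor token for each input token
    return [neighbor_labels[j] for j in best]

rng = np.random.default_rng(0)
test_tokens = rng.normal(size=(4, 8))           # 4 input tokens, 8-dim vectors
neighbor_tokens = rng.normal(size=(6, 8))       # 6 retrieved neighbor tokens
neighbor_tags = ["O", "B-PER", "I-PER", "O", "B-LOC", "O"]
print(copy_labels(test_tokens, neighbor_tokens, neighbor_tags))
```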

Rationally Reappraising ATIS-based Dialogue Systems dialogue/conversation
Jingcheng Niu, Gerald Penn

Automatic Grammatical Error Correction for Sequence-to-sequence Text Generation: An Empirical Study NLG
Tao Ge, Xingxing Zhang, Furu Wei, Ming Zhou

Coreference Resolution with Entity Equalization
Ben Kantor, Amir Globerson

Simultaneous Translation with Flexible Policy via Restricted Imitation Learning link MT
Baigong Zheng, Renjie Zheng, Mingbo Ma, Liang Huang

Simultaneous translation is widely useful but remains one of the most difficult tasks in NLP. Previous work either uses fixed-latency policies or trains a complicated two-stage model using reinforcement learning. We propose a much simpler single model that adds a 'delay' token to the target vocabulary, and design a restricted dynamic oracle to greatly simplify training. Experiments on Chinese<->English simultaneous translation show that our work leads to flexible policies that achieve better BLEU scores and lower latencies compared to both fixed and RL-learned policies.
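
A minimal sketch of how a 'delay' token can turn greedy decoding into a read/write policy; `model.step` and the special tokens are hypothetical stand-ins rather than the paper's implementation, and the dummy model below merely echoes the source for illustration:

```python
DELAY, EOS = "<delay>", "</s>"        # hypothetical special tokens

def simultaneous_decode(model, source_tokens, max_len=100):
    """Greedy simultaneous decoding: emitting DELAY reads one more source token,
    any other token is written to the output."""
    read, output = 1, []
    while len(output) < max_len:
        token = model.step(source_tokens[:read], output)   # hypothetical API
        if token == DELAY and read < len(source_tokens):
            read += 1                                      # READ action
        elif token == DELAY:
            break                                          # source exhausted
        else:
            output.append(token)                           # WRITE action
            if token == EOS:
                break
    return output

class DummyModel:
    """Stand-in model: delays until three source tokens are read, then echoes them."""
    def step(self, src, out):
        if len(src) < 3:
            return DELAY
        return src[len(out)] if len(out) < len(src) else EOS

print(simultaneous_decode(DummyModel(), ["wo", "ai", "ni", "men"]))
```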

Twitter Homophily: Network Based Prediction of User's Occupation
Jiaqi Pan, Rishabh Bhardwaj, Wei Lu, Hai Leong Chieu, Xinghao Pan, Ni Yi Puay

Exploring Numeracy in Word Embeddings RepL
Aakanksha Naik, Abhilasha Ravichander, Carolyn Rose, Eduard Hovy

MC^2: Multi-perspective Convolutional Cube for Conversational Machine Reading Comprehension dialogue/conversation
Xuanyu Zhang

Constructing Interpretive Spatio-Temporal Features for Multi-Turn Responses Selection
Junyu Lu, Chenbin Zhang, Zeying Xie, Guang Ling, Tom Chao Zhou, Zenglin Xu

System Demonstration Papers

Sakura: Large-scale Incorrect Example Retrieval System for Learners of Japanese as a Second Language
Mio Arai, Tomonori Kodaira, Mamoru Komachi

SLATE: A Super-Lightweight Annotation Tool for Experts
Jonathan K. Kummerfeld

lingvis.io - A Linguistic Visual Analytics Framework
Mennatallah El-Assady, Wolfgang Jentner, Fabian Sperrle, Rita Sevastjanova, Annette Hautli-Janisz, Miriam Butt, Daniel Keim

SARAL: A Low-Resource Cross-Lingual Domain-Focused Information Retrieval System for Effective Rapid Document Triage
Elizabeth Boschee, Joel Barry, Jayadev Billa, Marjorie Freedman, Thamme Gowda, Constantine Lignos, Chester Palen-Michel, Michael Pust, Banriskhem Kayang Khonglah, Srikanth Madikeri, Jonathan May, Scott Miller

Jiuge: A Human-Machine Cooperative Chinese Classical Poetry Generation System NLG
Guo Zhipeng, Xiaoyuan Yi, Maosong Sun, Wenhao Li, Cheng Yang, Jiannan Liang, Huimin Chen, Yuhui Zhang, Ruoyu Li

Rapid Customization for Event Extraction link IE
Yee Seng Chan, Joshua Fasching, Haoling Qiu, Bonan Min

We present a system for rapidly customizing event extraction capability to find new event types and their arguments. The system allows a user to find, expand and filter event triggers for a new event type by exploring an unannotated corpus. The system then automatically generates mention-level event annotations and trains a neural network model for finding the corresponding events. Additionally, the system uses the ACE corpus to train an argument model for extracting Actor, Place, and Time arguments for any event type, including ones not seen in its training data. Experiments show that with less than 10 minutes of human effort per event type, the system achieves good performance for 67 novel event types. The code, documentation, and a demonstration video will be released as open source on github.com.

A Multiscale Visualization Tool for Analyzing Attention in the Transformer Model
Jesse Vig

PostAc : A Visual Interactive Search, Exploration, and Analysis Platform for PhD Intensive Job Postings
Chenchen Xu, Inger Mewburn, Will J Grant, Hanna Suominen

An adaptable task-oriented dialog system for stand-alone embedded devices dialogue/conversation
Long Duong, Vu Cong Duy Hoang, Tuyen Quang Pham, Yu-Heng Hong, Vladislavs Dovgalecs, Guy Bashkansky, Jason Black, Andrew Bleeker, Serge Le Huitouze, Mark Johnson

AlpacaTag: An Active Learning-based Crowd Annotation Framework for Sequence Tagging
Bill Yuchen Lin, Dong-Ho Lee, Frank F. Xu, Ouyu Lan, Xiang Ren

ConvLab: Multi-Domain End-to-End Dialog System Platform link dialogue/conversation
Sungjin Lee, Qi Zhu, Ryuichi Takanobu, Xiang Li, Yaoqin Zhang, Zheng Zhang, Jinchao Li, Baolin Peng, Xiujun Li, Minlie Huang, Jianfeng Gao

We present ConvLab, an open-source multi-domain end-to-end dialog system platform, that enables researchers to quickly set up experiments with reusable components and compare a large set of different approaches, ranging from conventional pipeline systems to end-to-end neural models, in common environments. ConvLab offers a set of fully annotated datasets and associated pre-trained reference models. As a showcase, we extend the MultiWOZ dataset with user dialog act annotations to train all component models and demonstrate how ConvLab makes it easy and effortless to conduct complicated experiments in multi-domain end-to-end dialog settings.

Demonstration of a Neural Machine Translation System with Online Learning for Translators MT
Miguel Domingo, Mercedes García-Martínez, Amando Estela Pastor, Laurent Bié, Alexander Helle, Álvaro Peris, Francisco Casacuberta, Manuel Herranz Pérez

FASTDial: Abstracting Dialogue Policies for Fast Development of Task Oriented Agents dialogue/conversation
Serra Sinem Tekiroglu, Bernardo Magnini, Marco Guerini

A Neural, Interactive-predictive System for Multimodal Sequence to Sequence Tasks link
Álvaro Peris, Francisco Casacuberta

We present a demonstration of a neural interactive-predictive system for tackling multimodal sequence to sequence tasks. The system generates text predictions for different sequence to sequence tasks: machine translation, image and video captioning. These predictions are revised by a human agent, who introduces corrections in the form of characters. The system reacts to each correction, providing alternative hypotheses that comply with the feedback provided by the user. The final objective is to reduce the human effort required during this correction process. The system is implemented following a client-server architecture. For accessing the system, we developed a website, which communicates with the neural model, hosted on a local server. From this website, the different tasks can be tackled following the interactive-predictive framework. We open-source all the code developed for building this system. The demonstration is hosted at http://casmacat.prhlt.upv.es/interactive-seq2seq.

NeuralClassifier: An Open-source Neural Hierarchical Multi-label Text Classification Toolkit
Liqun Liu, Funan Mu, Pengyu Li, Xin Mu, Jing Tang, Xingsheng Ai, Ran Fu, Lifeng Wang, Xing Zhou

ADVISER: A Dialog System Framework for Education & Research dialogue/conversation
Daniel Ortega, Dirk Väth, Gianna Weber, Lindsey Vanderlyn, Maximilian Schmidt, Moritz Völkel, Zorica Karacevic, Ngoc Thang Vu

KCAT: A Knowledge-Constraint Typing Annotation Tool link
Sheng Lin, Luye Zheng, Bo Chen, Siliang Tang, Zhigang Chen, Guoping Hu, Yueting Zhuang, Fei Wu, Xiang Ren

Fine-grained entity typing is a difficult task that suffers from noisy samples extracted via distant supervision. Thousands of manually annotated samples can achieve greater performance than millions of samples generated by the previous distant supervision method. However, it is hard for human beings to differentiate and memorize thousands of types, making large-scale human labeling hardly feasible. In this paper, we introduce a Knowledge-Constraint Typing Annotation Tool (KCAT), which is efficient for fine-grained entity typing annotation. KCAT reduces the size of the candidate type set to an acceptable range for human beings through entity linking, and provides a Multi-step Typing scheme to revise the entity linking result. Moreover, KCAT provides an efficient Annotator Client to accelerate the annotation process and a comprehensive Manager Module to analyse crowdsourced annotations. Experiments show that KCAT can significantly improve annotation efficiency, and the time consumed grows only slowly as the size of the type set expands.

An Environment for the Relational Annotation of Political Debates
Andre Blessing, Nico Blokker, Sebastian Haunss, Jonas Kuhn, Gabriella Lapesa, Sebastian Padó

GLTR: Statistical Detection and Visualization of Generated Text
Sebastian Gehrmann, Hendrik Strobelt, Alexander Rush

OpenKiwi: An Open Source Framework for Quality Estimation link
Fabio Kepler, Jonay Trenous, Marcos Treviso, Miguel Vera, André F. T. Martins

We introduce OpenKiwi, a Pytorch-based open source framework for translation quality estimation. OpenKiwi supports training and testing of word-level and sentence-level quality estimation systems, implementing the winning systems of the WMT 2015-18 quality estimation campaigns. We benchmark OpenKiwi on two datasets from WMT 2018 (English-German SMT and NMT), yielding state-of-the-art performance on the word-level tasks and near state-of-the-art in the sentence-level tasks.

Microsoft Icecaps: An Open-Source Toolkit for Conversation Modeling dialogue/conversation
Vighnesh Leonardo Shiv, Chris Quirk, Anshuman Suri, Xiang Gao, Khuram Shahid, Nithya Govindarajan, Yizhe Zhang, Jianfeng Gao, Michel Galley, Chris Brockett, Tulasi Menon, Bill Dolan

PerspectroScope: A Window to the World of Diverse Perspectives link
Sihao Chen, Daniel Khashabi, Chris Callison-Burch, Dan Roth

This work presents PerspectroScope, a web-based system that lets users query a discussion-worthy natural language claim, and extract and visualize various perspectives in support of or against the claim, along with evidence supporting each perspective. The system thus lets users explore various perspectives that could touch upon aspects of the issue at hand. The system is built as a combination of retrieval engines and learned textual-entailment-like classifiers built using a few recent developments in natural language understanding. To make the system more adaptive, expand its coverage, and improve its decisions over time, our platform employs various mechanisms to get corrections from the users. PerspectroScope is available at github.com/CogComp/perspectroscope.

HEIDL: Learning Linguistic Expressions with Deep Learning and Human-in-the-Loop
Prithviraj Sen, Yunyao Li, Eser Kandogan, Yiwei Yang, Walter Lasecki

My Turn To Read: Interleaved E-book Reading to Boost Readers' Confidence, Fluency & Stamina
Nitin Madnani, Beata Beigman Klebanov, Anastassia Loukina, Binod Gyawali, Patrick Lange, John Sabatini, Michael Flor, Blair Lehman

GrapAL: Connecting the Dots in Scientific Literature link
Christine Betts, Joanna Power, Waleed Ammar

We introduce GrapAL (Graph database of Academic Literature), a versatile tool for exploring and investigating a knowledge base of scientific literature that was semi-automatically constructed using NLP methods. GrapAL satisfies a variety of use cases and information needs requested by researchers. At the core of GrapAL is a Neo4j graph database with an intuitive schema and a simple query language. In this paper, we describe the basic elements of GrapAL, how to use it, and several use cases such as finding experts on a given topic for peer reviewing, discovering indirect connections between biomedical entities, and computing citation-based metrics. We open-source the demo code to help other researchers develop applications that build on GrapAL.

ClaimPortal: Integrated Monitoring, Searching, Checking, and Analytics of Factual Claims on Twitter
Sarthak Majithia, Fatma Arslan, Sumeet Lubal, Damian Jimenez, Priyank Arora, Josue Caraballo, Chengkai Li

Texar: A Modularized, Versatile, and Extensible Toolkit for Text Generation link NLG
Zhiting Hu, Haoran Shi, Bowen Tan, Wentao Wang, Zichao Yang, Tiancheng Zhao, Junxian He, Lianhui Qin, Di Wang, Xuezhe Ma, Zhengzhong Liu, Xiaodan Liang, Wanrong Zhu, Devendra Sachan, Eric Xing

We introduce Texar, an open-source toolkit aiming to support the broad set of text generation tasks that transform any input into natural language, such as machine translation, summarization, dialog, content manipulation, and so forth. With the design goals of modularity, versatility, and extensibility in mind, Texar extracts common patterns underlying the diverse tasks and methodologies, creates a library of highly reusable modules and functionalities, and allows arbitrary model architectures and algorithmic paradigms. In Texar, model architecture, losses, and learning processes are fully decomposed. Modules at a high concept level can be freely assembled or plugged in/swapped out. These features make Texar particularly suitable for researchers and practitioners to do fast prototyping and experimentation, as well as to foster technique sharing across different text generation tasks. We provide case studies to demonstrate the use and advantage of the toolkit. Texar is released under Apache license 2.0 at https://github.com/asyml/texar.

Parallax: Visualizing and Understanding the Semantics of Embedding Spaces via Algebraic Formulae link
Piero Molino, Yang Wang, Jiawei Zhang

Embeddings are a fundamental component of many modern machine learning and natural language processing models. Understanding and visualizing them is essential for gathering insights about the information they capture and the behavior of the models. The state of the art in analyzing embeddings consists of projecting them onto two-dimensional planes without any interpretable semantics associated with the axes of the projection, which makes detailed analyses and comparisons among multiple sets of embeddings challenging. In this work, we propose to use explicit axes defined as algebraic formulae over embeddings to project them into a lower dimensional, but semantically meaningful, subspace, as a simple yet effective analysis and visualization methodology. This methodology assigns an interpretable semantics to the measures of variability and the axes of visualizations, allowing for both comparisons among different sets of embeddings and fine-grained inspection of the embedding spaces. We demonstrate the power of the proposed methodology through a series of case studies that make use of visualizations constructed around the underlying methodology, and through a user study. The results show how the methodology is effective at providing more profound insights than classical projection methods and how it is widely applicable to many other use cases.
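
As a small illustration of the idea (toy vectors and word choices, not the paper's data or code), one can define axes as algebraic formulae over embeddings, e.g. differences of word vectors, and project every word onto them:

```python
import numpy as np

# Toy embedding table; in practice these would be pre-trained vectors.
emb = {
    "king":  np.array([0.9, 0.1, 0.4]),
    "queen": np.array([0.8, 0.9, 0.4]),
    "man":   np.array([0.7, 0.0, 0.1]),
    "woman": np.array([0.6, 0.8, 0.1]),
}

def axis(word_plus, word_minus):
    """An interpretable axis defined by the formula emb[word_plus] - emb[word_minus]."""
    v = emb[word_plus] - emb[word_minus]
    return v / np.linalg.norm(v)

ax_gender = axis("king", "queen")
ax_royalty = axis("king", "man")
for word, vec in emb.items():
    print(f"{word:6s} -> gender={vec @ ax_gender:+.2f}, royalty={vec @ ax_royalty:+.2f}")
```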

Flambé: A Customizable Framework for Multistage Natural Language Processing Experiments
Jeremy Wohlwend, Nicholas Matthews, Ivan Itzcovich

A Modular Tool for Automatic Summarization summarization
Valentin Nyzam, Aurélien Bossard

TARGER: Neural Argument Mining at Your Fingertips
Artem Chernodub, Oleksiy Oliynyk, Philipp Heidenreich, Alexander Bondarenko, Matthias Hagen, Chris Biemann, Alexander Panchenko

MoNoise: A Multi-lingual and Easy-to-use Lexical Normalization Tool
Rob van der Goot

Level-Up: Learning to Improve Proficiency Level of Essays
Wen-Bin Han, Jhih-Jie Chen, Chingyu Yang, Jason Chang

Learning to Link Grammar and Encyclopedic Information to Assist ESL Learners
Jhih-Jie Chen, Chingyu Yang, Peichen Ho, Ming Chiao Tsai, Chia-Fang Ho, Kai-Wen Tuan, Chung-Ting Tsai, Wen-Bin Han, Jason Chang

Student Research Workshop Papers

A computational linguistic study of personal recovery in bipolar disorder link
Glorianna Jagfeld

Mental health research can benefit increasingly fruitfully from computational linguistics methods, given the abundant availability of language data on the internet and advances in computational tools. This interdisciplinary project will collect and analyse social media data of individuals diagnosed with bipolar disorder with regard to their recovery experiences. Personal recovery - living a satisfying and contributing life alongside symptoms of severe mental health issues - has so far only been investigated qualitatively with structured interviews and quantitatively with standardised questionnaires, with mainly English-speaking participants in Western countries. Complementary to this evidence, computational linguistic methods allow us to analyse first-person accounts shared online in large quantities, representing unstructured settings and a more heterogeneous, multilingual population, to draw a more complete picture of the aspects and mechanisms of personal recovery in bipolar disorder.

A Japanese Word Segmentation Proposal
Stalin Aguirre, Josafá Aguiar

A Strong and Robust Baseline for Text-Image Matching
Fangyu Liu, Rongtian Ye

Active Reading Comprehension: A dataset for learning the Question-Answer Relationship strategy QA
Diana Galvan

Analyzing and Mitigating Gender Bias in Languages with Grammatical Gender and Bilingual Word Embeddings bias RepL
Pei Zhou, Weijia Shi, Jieyu Zhao, Kuan-Hao Huang, Muhao Chen, Kai-Wei Chang

Annotating and Analyzing Semantic Role of Elementary Units and Relations in Online Persuasive Arguments
Ryo Egawa, Gaku Morio, Katsuhide Fujita

ARHNet - Leveraging Community Interaction For Detection Of Religious Hate Speech In Arabic
Arijit Ghosh Chowdhury, Aniket Didolkar, Ramit Sawhney, Rajiv Ratn Shah

Attention and Lexicon Regularized LSTM for Aspect-based Sentiment Analysis
Lingxian Bao, Patrik Lambert, Toni Badia

Attention over Heads: A Multi-Hop Attention for Neural Machine Translation MT
Shohei Iida, Ryuichiro Kimura, Hongyi Cui, Po-Hsuan Hung, Takehito Utsuro, Masaaki Nagata

Automated Cross-language Intelligibility Analysis of Parkinson's Disease Patients Using Speech Recognition Technologies
Nina Hosseini-Kivanani, Juan Camilo Vásquez-Correa, Manfred Stede, Elmar Nöth

Automatic Data-Driven Approaches for Evaluating the Phonemic Verbal Fluency Task with Healthy Adults
Hali Lindsay, Nicklas Linz, Johannes Tröger

Automatic Generation of Personalized Comment Based on User Profile NLG
Wenhuan Zeng, Abulikemu Abuduweili, Lei Li, Pengcheng Yang

BREAKING! Presenting fake news corpus for automated fact checking corpus
Archita Pathak, Rohini Srihari

Computational ad hominem detection
Pieter Delobelle, Murilo Cunha, Eric Massip Cano, Jeroen Peperkamp, Bettina Berendt

Controllable Text Simplification with Lexical Constraint Loss
Daiki Nishihara, Tomoyuki Kajiwara, Yuki Arase

Controlling Grammatical Error Correction Using Word Edit Rate
Kengo Hotate, Masahiro Kaneko, Satoru Katsumata, Mamoru Komachi

Convolutional Neural Networks for Financial Text Regression
Neşat Dereli, Murat Saraclar

Corpus Creation and Analysis for Named Entity Recognition in Telugu-English Code-Mixed Social Media Data NER
Vamshi Krishna Srirangam, Appidi Abhinav Reddy, Vinay Singh, Manish Shrivastava

Corpus Creation and Baseline System for Aggression Detection in Telugu-English Code-Mixed Social Media Data corpus
Koushik Reddy Sane, Sushmitha Reddy Sane, Sairam Kolla, Radhika Mamidi

Cross-domain and Cross-lingual Abusive Language Detection: a Hybrid Approach with Deep Learning and a Multilingual Lexicon
Endang Wahyu Pamungkas, Viviana Patti

De-Mixing Sentiment from Code-Mixed Text
Yash Kumar Lal, Vaibhav Kumar, Mrinal Dhar, Manish Shrivastava, Philipp Koehn

Deep Neural Models for Medical Concept Normalization in User-Generated Texts
Zulfat Miftahutdinov, Elena Tutubalina

Detecting Political Bias in News Articles Using Headline Attention bias
Rama Rohit Reddy Gangula, Suma Reddy Duggenpudi, Radhika Mamidi

Developing OntoSenseNet: A Verb-Centric Ontological Resource for Indian Languages and Analysing Stylometric Difference in Men and Women Writing using the Resource
Jyoti Jha, Navjyoti Singh

Dialogue-Act Prediction of Future Responses based on Conversation History dialogue/conversation
Koji Tanaka, Junya Takayama, Yuki Arase

Distributed Knowledge Based Clinical Auto-Coding System
Rajvir Kaur

Embedding Strategies for Specialized Domains: Application to Clinical Entity Recognition NER
Hicham El Boukkouri, Olivier Ferret, Thomas Lavergne, Pierre Zweigenbaum

English-Indonesian Neural Machine Translation for Spoken Language Domains MT
Meisyarah Dwiastuti

Enriching Neural Models with Targeted Features for Dementia Detection link
Flavio Di Palo, Natalie Parde

Alzheimer's disease (AD) is an irreversible brain disease that can dramatically reduce quality of life, most commonly manifesting in older adults and eventually leading to the need for full-time care. Early detection is fundamental to slowing its progression; however, diagnosis can be expensive, time-consuming, and invasive. In this work we develop a neural model based on a CNN-LSTM architecture that learns to detect AD and related dementias using targeted and implicitly-learned features from conversational transcripts. Our approach establishes the new state of the art on the DementiaBank dataset, achieving an F1 score of 0.929 when classifying participants into AD and control groups.

Exploring the potential of Neural Networks for Extracting Adverse Drug Reactions from Biomedical Texts
Ilseyar Alimova, Elena Tutubalina

Fact or Factitious? Contextualized Opinion Spam Detection
Stefan Kennedy, Niall Walsh, Kirils Sloka, Andrew McCarren, Jennifer Foster

From Bilingual to Multilingual Neural Machine Translation by Incremental Training link MT
Carlos Escolano, Marta R. Costa-jussà, José A. R. Fonollosa

Multilingual Neural Machine Translation approaches are based on the use of task-specific models, and adding one more language can only be done by retraining the whole system. In this work, we propose a new training schedule that allows the system to scale to more languages without modifying previously trained components, based on joint training and language-independent encoder/decoder modules that allow for zero-shot translation. This work in progress shows results close to the state of the art on the WMT task.

From brain space to distributional space: the perilous journeys of fMRI decoding
Gosse Minnema, Aurélie Herbelot

Gender Stereotypes Differ between Male and Female Writings
Yusu Qian

Hierarchical Multi-label Classification of Text with Capsule Networks
Rami Aly, Steffen Remus, Chris Biemann

Imperceptible Adversarial Examples for Automatic Speech Recognition
Yao Qin

Improving Mongolian-Chinese Neural Machine Translation with Morphological Noise MT
Yatu Ji, Hongxu Hou, Chen Junjie, Nier Wu

Improving Neural Entity Disambiguation with Graph Embeddings
Özge Sevgili, Alexander Panchenko, Chris Biemann

Incorporating Textual Information on User Behavior for Personality Prediction
Kosuke Yamada, Ryohei Sasano, Koichi Takeda

Investigating Political Herd Mentality: A Community Sentiment based Approach
Anjali Bhavan, Rohan Mishra, Pradyumna Prakhar Sinha, Ramit Sawhney, Rajiv Ratn Shah

Joint Learning of Named Entity Recognition and Entity Linking NER
Pedro Henrique Martins, Zita Marinho, André F. T. Martins

Knowledge discovery and hypothesis generation from online patient forums: A research proposal NLG
Anne Dirkson

Long-Distance Dependencies don't have to be Long: Simplifying through Provably (Approximately) Optimal Permutations
Rishi Bommasani

Māori Loanwords: A Corpus of New Zealand English Tweets corpus
David Trye, Andreea Calude, Felipe Bravo-Marquez, Te Taka Keegan

Measuring the Value of Linguistics: A Case Study from St. Lawrence Island Yupik
Emily Chen

Multilingual model using cross-task embedding projection
Jin Sakuma, Naoki Yoshinaga

Multimodal Logical Inference System for Visual-Textual Entailment link
Riko Suzuki, Hitomi Yanaka, Masashi Yoshikawa, Koji Mineshima, Daisuke Bekki

A large amount of research about multimodal inference across text and vision has been recently developed to obtain visually grounded word and sentence representations. In this paper, we use logic-based representations as unified meaning representations for texts and images and present an unsupervised multimodal logical inference system that can effectively prove entailment relations between them. We show that by combining semantic parsing and theorem proving, the system can handle semantically complex sentences for visual-textual inference.

Multiple Character Embeddings for Chinese Word Segmentation link
Jianing Zhou, Jingkang Wang, Gongshen Liu

Chinese word segmentation (CWS) is often regarded as a character-based sequence labeling task in most current works, which have achieved great success with the help of powerful neural networks. However, these works neglect an important clue: Chinese characters incorporate both semantic and phonetic meanings. In this paper, we introduce multiple character embeddings including Pinyin Romanization and Wubi Input, both of which are easily accessible and effective in depicting the semantics of characters. We propose a novel shared Bi-LSTM-CRF model to fuse linguistic features efficiently by sharing the LSTM network during the training procedure. Extensive experiments on five corpora show that the extra embeddings help obtain a significant improvement in labeling accuracy. Specifically, we achieve state-of-the-art performance on the AS and CityU corpora with F1 scores of 96.9 and 97.3, respectively, without leveraging any external lexical resources.
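
A minimal PyTorch sketch of the embedding-fusion idea (hypothetical vocabulary sizes and dimensions; the paper's CRF layer and exact sharing scheme are omitted): concatenate character, Pinyin, and Wubi embeddings and feed them to a BiLSTM that scores BMES tags per character.

```python
import torch
import torch.nn as nn

class MultiEmbedSegmenter(nn.Module):
    """Toy CWS tagger: fuse character / Pinyin / Wubi embeddings, then BiLSTM + linear."""

    def __init__(self, n_char, n_pinyin, n_wubi, dim=32, hidden=64, n_tags=4):
        super().__init__()
        self.char = nn.Embedding(n_char, dim)
        self.pinyin = nn.Embedding(n_pinyin, dim)
        self.wubi = nn.Embedding(n_wubi, dim)
        self.lstm = nn.LSTM(3 * dim, hidden, batch_first=True, bidirectional=True)
        self.out = nn.Linear(2 * hidden, n_tags)           # BMES tag scores

    def forward(self, chars, pinyins, wubis):
        x = torch.cat([self.char(chars), self.pinyin(pinyins), self.wubi(wubis)], dim=-1)
        h, _ = self.lstm(x)
        return self.out(h)

model = MultiEmbedSegmenter(n_char=100, n_pinyin=60, n_wubi=80)
ids = torch.randint(0, 60, (2, 7))                         # toy id sequences
print(model(ids, ids, ids).shape)                           # torch.Size([2, 7, 4])
```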

Natural Language Generation: Recently Learned Lessons, Directions for Semantic Representation-based Approaches, and the case of Brazilian Portuguese Language NLG
Marco Antonio Sobrevilla Cabezudo, Thiago Pardo

Normalizing Non-canonical Turkish Texts Using Machine Translation Approaches MT
Talha Çolakoğlu, Umut Sulubacak, Ahmet Cüneyd Tantuğ

Not All Reviews are Equal: Towards Addressing Reviewer Biases for Opinion Summarization summarization
Wenyi Tay

On Dimensional Linguistic Properties of the Word Embedding Space RepL
Vikas Raunak, Vaibhav Kumar, Vivek Gupta, Florian Metze

On the impressive performance of randomly weighted encoders in summarization tasks summarization
Jonathan Pilault, Jaehong Park, Christopher Pal

Paraphrases as Foreign Languages in Multilingual Neural Machine Translation link MT
Zhong Zhou, Matthias Sperber, Alexander Waibel

Using paraphrases, the expression of the same semantic meaning in different words, to improve generalization and translation performance is often useful. However, prior works only explore the use of paraphrases at the word or phrase level, not at the sentence or document level. Unlike previous works, we use different translations of the whole training data that are consistent in structure as paraphrases at the corpus level. Our corpus contains parallel paraphrases in multiple languages from various sources. We treat paraphrases as foreign languages, tag source sentences with paraphrase labels, and train in the style of multilingual Neural Machine Translation (NMT). Experimental results indicate that adding paraphrases improves rare-word translation and increases entropy and diversity in lexical choice. Moreover, adding the source paraphrases improves translation performance more effectively than adding the target paraphrases. Combining both the source and the target paraphrases boosts performance further; combining paraphrases with multilingual data also helps but shows mixed performance. We achieve a BLEU score of 57.2 for French-to-English translation, training on 24 paraphrases of the Bible, which is ~+27 above the WMT'14 baseline.
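
A minimal sketch of the tagging step described above (the label tokens and toy sentences are illustrative assumptions, not the paper's data): prepend a paraphrase label to each source sentence, as multilingual NMT does with language tags, and pool the tagged corpora for training.

```python
def tag_with_paraphrase_label(pairs, label):
    """Prepend a paraphrase/corpus label token to each source sentence."""
    return [(f"<{label}> {src}", tgt) for src, tgt in pairs]

# Toy parallel data from two paraphrase sources of the same target text.
corpus_a = [("au commencement", "in the beginning")]
corpus_b = [("pour commencer", "in the beginning")]
training_data = (tag_with_paraphrase_label(corpus_a, "para0")
                 + tag_with_paraphrase_label(corpus_b, "para1"))
print(training_data)
```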

Predicting the Outcome of Deliberative Democracy: A Research Proposal
Conor McKillop

Public Mood Variations in Russia based on Vkontakte Content
Sergey Smetanin

Question Answering in the Biomedical Domain QA
Vincent Nguyen

Ranking of Potential Questions
Luise Schricker, Tatjana Scheffler

Reducing Gender Bias in Word-Level Language Models with A Gender-Equalizing Loss Function bias
Yusu Qian, Urwa Muaz, Ben Zhang, Jae Won Hyun

Robust to Noise Models in Natural Language Processing Tasks
Valentin Malykh

Scheduled Sampling for Transformers
Tsvetomila Mihaylova, André F. T. Martins

Sentence Level Curriculum Learning for Improved Neural Conversational Models dialogue/conversation
Sean Paulsen

Sentiment Analysis on Naija-Tweets
Taiwo Kolajo, Olawande Daramola, Ayodele Adebiyi

Sentiment Classification using Document Embeddings trained with Cosine Similarity
Tan Thongtan, Tanasanee Phienthrakul

STRASS: A Light and Effective Method for Extractive Summarization Based on Sentence Embeddings summarization
Léo Bouscarrat, Antoine Bonnefoy, Thomas Peel, Cécile Pereira

Towards incremental learning of word embeddings using context informativeness RepL
Alexandre Kabbach, Kristina Gulordava, Aurélie Herbelot

Towards Turkish Abstract Meaning Representation RepL
Zahra Azin, Gülşen Eryiğit

Transfer Learning Based Free-Form Speech Command Classification for Low-Resource Languages transfer
Yohan Karunanayake, Uthayasanker Thayasivam, Surangika Ranathunga

Unsupervised Learning of Discourse-Aware Text Representation for Essay Scoring link RepL
Farjana Sultana Mim, Naoya Inoue, Paul Reisert, Hiroki Ouchi, Kentaro Inui

Unsupervised Pretraining for Neural Machine Translation Using Elastic Weight Consolidation MT
Dušan Variš, Ondřej Bojar

Using Semantic Similarity as Reward for Reinforcement Learning in Sentence Generation NLG
Go Yasui, Yoshimasa Tsuruoka, Masaaki Nagata

