An Investigation of Language Model Interpretability via Sentence Editing
Date
2021-05
Publisher
The Ohio State University
Abstract
Pre-trained language models (PLMs) such as BERT are used for almost all language-related tasks, but interpreting their behavior remains a significant challenge, and many important questions are still unanswered. For example, how does domain-specific pre-training change the dynamics within a model? Is task-specific fine-tuning necessary for model interpretability? Which interpretability techniques best correlate with human rationales? In this work, we repurpose a sentence editing dataset, from which high-quality human rationales can be automatically extracted and compared with model rationales, as a new testbed for interpretability. This enables us to conduct a systematic investigation of the aforementioned open questions about PLMs' interpretability and to generate new insights. The dataset and code will be released to facilitate future research on interpretability.
Keywords
natural language processing, interpretability, pre-trained language models, faithful human rationales