A Structured Self-attentive Sentence Embedding

TL;DR: This paper proposes a supervised method for learning sentence embeddings based on self-attention. The sentence embedding is represented as a matrix, which preserves multiple aspects of the sentence, with each row vector capturing one aspect. To mitigate redundancy among the row vectors (i.e., rows that are too similar and learn duplicate features), the paper introduces a penalty term that encourages the rows to learn different aspects.

Key Points

  • A single vector representation of a sentence tends to focus on one specific component, such as a particular word or phrase, and therefore captures only one aspect. The paper instead represents a sentence as a matrix, taking into account the sentence's different components (words, phrases, clauses, etc.) and thus preserving its different aspects, with each row vector encoding one aspect (see the sketch after this list):

    • The attention matrix is computed as $A=\mathrm{softmax}(W_{s2}\tanh(W_{s1}H^T))$, where $H$ is the matrix of LSTM hidden states and $W_{s1}$ and $W_{s2}$ are learned weight matrices. If $W_{s2}$ is replaced by a vector, the result is an ordinary attention vector and the computation is no different from standard attention. With $W_{s2}$ as a matrix, the softmax is applied along the second dimension of its input so that each row of the attention matrix is an independent attention vector.

    • The sentence embedding matrix is then obtained as $M=AH$.

    • Since $W_{s1}$ and $W_{s2}$ have fixed sizes, sentences of different lengths all yield a sentence embedding matrix $M$ of the same fixed size.

  • The embedding matrix can suffer from redundancy: the learned embedding vectors may be overly similar to one another, so that only a few distinct aspects are actually captured. To address this, the paper introduces a penalty term:

    • Redundancy is measured by $P=\left\Vert AA^T-I \right\Vert_F^2$, where $I$ is the identity matrix and $\left\Vert \cdot \right\Vert_F$ is the Frobenius norm: $\left\Vert A \right\Vert_F=\sqrt{\sum_{i=1}^m \sum_{j=1}^n a_{ij}^2}$.

    • Minimizing $P$ therefore pushes $AA^T-I$ toward the zero matrix. Because the identity matrix is subtracted, the diagonal entries of $AA^T$ approach 1 during training while the off-diagonal entries approach 0. According to the paper, this makes each attention vector focus as much as possible on a single component (i.e., one aspect) while keeping different attention vectors as dissimilar as possible.

  • Representing a sentence as a matrix brings a problem: the fully connected (FC) layer following the attention layer becomes very large. In fact, in the model used in the paper, 90% of the parameters are concentrated in this FC layer. If $M$ has size $r\times u$ and the FC layer has $b$ units, the layer has $r\times u\times b$ parameters.

  • Exploiting the 2D structure of the matrix, the paper proposes a weight pruning method. Concretely, $M$ is mapped row-wise to $M^v\in\mathbb{R}^{r\times p}$ with $r\times p=b$, where the $r$-th row of $M^v$ is fully connected only to the $r$-th row of $M$; this prunes a fraction $(r-1)/r$ of the parameters (the column-wise mapping is analogous). Weight pruning does, however, degrade performance.
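To make the computation concrete, here is a minimal PyTorch sketch of the attention matrix, the sentence embedding matrix, and the redundancy penalty described above. This is a sketch under assumptions: the module name `StructuredSelfAttention`, the default hyper-parameters `d_a=350` and `r=30`, and the upstream Bi-LSTM producing `H` are illustrative choices, not the authors' released code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class StructuredSelfAttention(nn.Module):
    """Sketch of the structured self-attention layer:
    A = softmax(W_s2 tanh(W_s1 H^T)), M = A H, penalty P = ||A A^T - I||_F^2."""

    def __init__(self, hidden_dim, d_a=350, r=30):
        # hidden_dim = 2u for a Bi-LSTM; d_a and r follow the paper's notation
        super().__init__()
        self.w_s1 = nn.Linear(hidden_dim, d_a, bias=False)  # W_s1: d_a x 2u
        self.w_s2 = nn.Linear(d_a, r, bias=False)           # W_s2: r x d_a
        self.r = r

    def forward(self, H):
        # H: (batch, n, 2u) -- Bi-LSTM hidden states for a sentence of n tokens
        scores = self.w_s2(torch.tanh(self.w_s1(H)))          # (batch, n, r)
        A = F.softmax(scores, dim=1).transpose(1, 2)          # (batch, r, n); softmax over tokens
        M = torch.bmm(A, H)                                   # (batch, r, 2u) sentence embedding matrix
        # Redundancy penalty: squared Frobenius norm of A A^T - I, averaged over the batch
        I = torch.eye(self.r, device=A.device).unsqueeze(0)
        P = ((torch.bmm(A, A.transpose(1, 2)) - I) ** 2).sum(dim=(1, 2)).mean()
        return M, P
```

In use, `M` is flattened and fed to the downstream FC layer/classifier, and `P` is added to the task loss with a small weighting coefficient.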

Notes/Questions

  • The embedding matrix idea is innovative and refreshing.

  • Apart from the weight pruning part, the model is not complicated: just a Bi-LSTM plus a self-attention mechanism.

  • In the experiment examining how the number of rows of $M$ affects performance, setting the number of rows to 1 (i.e., degenerating the embedding matrix into an embedding vector) causes a clear performance drop to roughly the level of the baselines. It is hard to say whether the proposed model's gains come from the self-attention mechanism or from the embedding matrix; the embedding matrix probably contributes more, which makes this feel like an ensemble in disguise.

  • The model learns sentence representations in a supervised fashion that depends on a specific downstream NLP task.

  • For the penalty term, the paper considered KL divergence; the reasons for abandoning it are instructive:

    1. For $A$, one would have to maximize a set of KL divergences between different attention vectors, and the large number of zeros in $A$ makes training unstable;

    2. The authors want each row of $A$ to focus on a single aspect of the sentence, which the KL divergence cannot provide.

  • Personally, I found the appendix hard to follow, and its notation is inconsistent with the main text. Neither the basis for $r\times p=b$ nor where the $(r-1)/r$ figure comes from is clearly explained.

Figure: Structure of the self-attentive sentence embedding model