Baseline Needs More Love: On Simple Word-Embedding-Based Models and Associated Pooling Mechanisms

Paper link: https://arxiv.org/abs/1805.09843

TL;DR

This paper proposes a family of models that learn text representations by combining word embeddings with nothing more than different pooling operations. Experiments show that such models are a strong baseline: on tasks where word order matters little, they outperform simple CNNs and LSTMs, while using far fewer parameters and training dramatically faster.

PS: Compared with last week's CNN4SC, it runs about 17x faster on a GPU (GeForce GT 750M).

Key Points

  • The question the paper sets out to explore: how to strike a balance between a model's computational complexity and its expressive power.

  • Inspired by fastText and the Deep Averaging Network, which achieve striking results on several tasks with nothing but average pooling, the paper proposes a family of models that learn text representations purely through pooling (a minimal code sketch of these poolings appears after this list), including:

    • Average pooling: take the per-dimension mean of all word vectors in the text (a sentence, passage, or document), giving a text vector with the same dimensionality as the word vectors: $z=\frac{1}{L}\sum_{i=1}^{L} v_i$;

    • Max pooling: take the per-dimension maximum over all word vectors: $z=\text{Max-pooling}(v_1, \dots, v_L)$ (the underlying intuition is that only a few key words drive the prediction);

    • Concatenated pooling: apply max pooling and average pooling separately and concatenate the results, giving a text vector twice as long as a word vector (so the two poolings complement each other);

    • Hierarchical pooling: the poolings above are all global and ignore word order; to compensate for this weakness, first apply local average pooling over windows to obtain n vectors, then apply global max pooling over them.

  • Pooling is parameter-free, so the proposed SWEM (Simple Word-Embedding-based Models) are far simpler than CNNs or RNNs that process a sequence of word vectors into a text vector. The model's parameters come from the fully connected layers above the pooling layer; for some tasks an additional fully connected layer is placed between the embedding layer and the pooling layer, so pooling is not applied directly to the word vectors (whether to add this layer is decided empirically).

  • On topic classification tasks, SWEM generally outperforms a deep CNN (29 layers), LSTM, and fastText, demonstrating the model's effectiveness; in such tasks key words contribute heavily to the prediction while word order matters little.

  • On sentiment analysis tasks, SWEM without hierarchical pooling falls behind CNN and LSTM, confirming the importance of word order, but SWEM with hierarchical pooling achieves performance better than or comparable to CNN and LSTM.

  • On document classification tasks, the word vectors learned with max pooling (pre-trained vectors can also be used) turn out to be very sparse, supporting the point above: only a few key words play a decisive role in the prediction.

  • On text-matching tasks, SWEM again generally beats CNN and LSTM; for such tasks the degree of word alignment between sentences is enough for prediction, and word order once again matters little.

  • To probe which tasks depend on word order, the paper shuffles the word order of the training set for different kinds of data while keeping the test set intact, and runs the experiment with an LSTM: tasks where the LSTM's performance drops markedly under shuffled training are the order-sensitive ones, including sentiment analysis and question answering (a small shuffling sketch also appears after this list).

  • For short-sentence classification, SWEM underperforms CNN and LSTM. The paper attributes this to word order being more important in short sentences than in long documents, and to the semantic information provided by word vectors alone being too limited.

  • Regularization helps prevent overfitting on small training sets.

  • In one experiment, the nonlinear layers after SWEM's pooling layer are replaced with a linear classifier; performance drops only slightly, demonstrating the model's robustness.
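
A minimal NumPy sketch of the pooling variants listed above, assuming each text has already been turned into an (L, d) matrix of word vectors; the function names and the window size n are illustrative choices of mine, not taken from the paper's code.

```python
import numpy as np

def average_pooling(vecs):
    # z = (1/L) * sum_i v_i : per-dimension mean over the sequence
    return vecs.mean(axis=0)

def max_pooling(vecs):
    # per-dimension max; only a few "key words" end up contributing
    return vecs.max(axis=0)

def concat_pooling(vecs):
    # concatenate average- and max-pooled vectors -> a 2d-dimensional text vector
    return np.concatenate([average_pooling(vecs), max_pooling(vecs)])

def hierarchical_pooling(vecs, n=5):
    # local average pooling over windows of size n, then global max pooling,
    # which reintroduces a little word-order information
    L, _ = vecs.shape
    if L <= n:  # fall back to plain averaging for very short texts
        return vecs.mean(axis=0)
    windows = np.stack([vecs[i:i + n].mean(axis=0) for i in range(L - n + 1)])
    return windows.max(axis=0)
```

In the full model, the pooled vector is then fed into a small fully connected classifier (with the optional fully connected layer before pooling mentioned above).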
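
And a sketch of the word-order probe: the tokens of each training example are shuffled while the test set is left untouched, and an LSTM is trained on the shuffled data; the helper below is hypothetical, not from the paper.

```python
import random

def shuffle_training_word_order(tokenized_texts, seed=0):
    # Shuffle the words inside each training example; the test set stays
    # intact. A marked drop in LSTM accuracy after training on this data
    # marks the task as word-order sensitive (e.g. sentiment analysis, QA).
    rng = random.Random(seed)
    shuffled = []
    for tokens in tokenized_texts:
        tokens = list(tokens)
        rng.shuffle(tokens)
        shuffled.append(tokens)
    return shuffled
```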

Notes/Questions

  • I feel the paper is missing one configuration: an experiment with max pooling first and average pooling afterwards. I added that comparison experiment to my CNN4SC code.
