An Empirical Evaluation of Generic Convolutional and Recurrent Networks for Sequence Modeling

TL;DR

This paper compares CNNs and RNNs on sequence modeling. In the authors' framing, the fight is staged on the RNN's home turf, and the CNN wins decisively. The contenders are a CNN purpose-built for sequence modeling, the Temporal Convolutional Network (TCN), versus vanilla RNNs, GRUs, and LSTMs.

Key Points

  • To compare RNNs and CNNs on sequence modeling, the paper constructs a simple, generic CNN architecture for the task, the TCN, which combines causal convolutions, residual connections, and dilated convolutions.

  • The TCN has two defining properties:

    1. Its convolutions are causal, so no information leaks from the future into the past;

    2. It maps an input sequence of arbitrary length to an output sequence of the same length.

  • To achieve the second property, the TCN uses a 1D fully-convolutional architecture with zero padding, so every layer keeps the same length as the input.

  • A causal convolution means that the output at time t is computed by convolving only over the previous layer's states at time t and earlier (a minimal sketch of such a causal, dilated convolution appears after this list).

  • With stacked causal convolutions, the receptive field (i.e., the accessible history) of the top layer grows only linearly with depth, so for very long sequences the network would have to be extremely deep to capture enough history. To address this, the paper uses dilated convolutions, which make the receptive field grow exponentially with depth.

  • A dilated convolution is computed as $F(s)=\sum_{i=0}^{k-1} f(i)\cdot \mathbf{x}_{s-d\cdot i}$, where $\mathbf{x}$ is the input sequence, $f$ is the filter, $d$ is the dilation factor, $k$ is the filter size, and the index $s-d\cdot i$ ensures that only past states are convolved. The figure makes this most intuitive.

  • The TCN's receptive field depends on the dilation factor d and the filter/kernel size k in the formula above, as well as the network depth n (a rough receptive-field calculation is sketched after this list). To obtain a large enough receptive field, the TCN still has to go fairly deep, so it uses residual blocks to keep the deeper network trainable. (Residual blocks are covered in detail in the ResNet note.)

  • The TCN's advantages for sequence modeling:

    • Parallelism (once recurrence is dropped, almost any neural network gains this);

    • Flexible control of the receptive field by adjusting n, k, and d, so it adapts to different tasks (some need very long-range dependencies, others rely mostly on short-term ones);

    • Stable gradients (likewise, dropping recurrence removes the exploding/vanishing gradients along the time dimension);

    • Low memory usage during training (a benefit of parameter sharing and of backpropagation running only along the network depth, not through time).

  • The TCN's drawbacks:

    • Higher memory use at inference time (an RNN only has to keep one hidden state and consume one input at a time, while a TCN must hold a sufficiently long window of the sequence to preserve the history);

    • Harder transfer across domains (tasks differ in how large a receptive field they need, so a model trained with small k and d is hard to apply to tasks that require large k and d).

  • The paper compares the TCN against vanilla RNNs, GRUs, and LSTMs, deliberately not cherry-picking the per-task SOTA models. The experiments show the TCN beating the recurrent baselines almost across the board.

  • In theory an RNN has unbounded memory, i.e., it can retain information from arbitrarily far back, but in practice this does not work well. The TCN, in contrast, actually retains a much longer history.

  • On the vast majority of tasks, gradient clipping acts as a regularizer and speeds up convergence (a typical usage sketch appears after this list).

  • The TCN is relatively insensitive to hyperparameter choices, as long as the effective receptive field stays sufficiently large.

  • The experiments in the main text use ReLU rather than gated units. The appendix shows that GLU further improves the TCN on certain tasks, but brings no improvement on others.

  • Using gated activation units doubles the model's size (as noted in the ConvS2S note, gating requires the convolution's output dimensionality to be twice that of its input; see the GLU sketch after this list).
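
A minimal PyTorch sketch of the causal, dilated convolution described in the list above (an illustration, not the authors' code): left-padding the input by (kernel_size - 1) * dilation zeros keeps the output the same length as the input and guarantees that position t never sees anything later than t.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class CausalDilatedConv1d(nn.Module):
    """One causal, dilated 1D convolution that preserves sequence length."""

    def __init__(self, in_channels, out_channels, kernel_size, dilation):
        super().__init__()
        # Pad only on the left (the past) by (k - 1) * d so that position t
        # never sees inputs after t and the output keeps the input length.
        self.left_pad = (kernel_size - 1) * dilation
        self.conv = nn.Conv1d(in_channels, out_channels,
                              kernel_size, dilation=dilation)

    def forward(self, x):                      # x: (batch, channels, time)
        x = F.pad(x, (self.left_pad, 0))       # left-pad the time dimension
        return self.conv(x)


# Quick check: the output has the same length as the input.
layer = CausalDilatedConv1d(in_channels=1, out_channels=4,
                            kernel_size=3, dilation=2)
x = torch.randn(8, 1, 100)
print(layer(x).shape)                          # torch.Size([8, 4, 100])
```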
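
A rough receptive-field calculation for the point about depth n, kernel size k, and dilation d, under the simplifying assumption of one dilated causal convolution per level with dilations 1, 2, 4, ..., 2^(n-1); the paper's residual blocks actually stack two such convolutions per level, so the true numbers are somewhat larger.

```python
def tcn_receptive_field(kernel_size: int, num_levels: int) -> int:
    """Receptive field of stacked causal convs whose dilation doubles per level.

    Each conv with kernel k and dilation d adds (k - 1) * d past positions,
    so the total is 1 + (k - 1) * (2**num_levels - 1).
    """
    return 1 + (kernel_size - 1) * (2 ** num_levels - 1)


# With kernel size 3, eight dilated levels already see 511 past steps,
# while the same depth without dilation would see only 17.
print(tcn_receptive_field(kernel_size=3, num_levels=8))  # 511
print(1 + (3 - 1) * 8)                                   # 17, undilated
```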
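
On the gradient-clipping observation, a sketch of how clipping is typically inserted into a PyTorch training step; the tiny stand-in model and the max_norm value are placeholders, not the paper's settings.

```python
import torch
import torch.nn.functional as F

model = torch.nn.Linear(10, 1)                 # stand-in for the actual TCN
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

x, y = torch.randn(32, 10), torch.randn(32, 1)
loss = F.mse_loss(model(x), y)

optimizer.zero_grad()
loss.backward()
# Clip the global gradient norm before the update; the paper reports this
# acting as a regularizer and speeding up convergence on most tasks.
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=0.5)
optimizer.step()
```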
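
And on gating doubling the model size: a gated linear unit splits the convolution output into a value half and a gate half, so producing `hidden` gated channels requires the convolution to emit 2 * hidden channels, e.g.:

```python
import torch
import torch.nn.functional as F

hidden, length = 64, 100
# To end up with `hidden` gated channels, the conv must output 2 * hidden.
conv = torch.nn.Conv1d(hidden, 2 * hidden, kernel_size=3, padding=2)

x = torch.randn(8, hidden, length)
h = conv(x)[..., :length]     # trim the right side to keep the conv causal
out = F.glu(h, dim=1)         # value * sigmoid(gate), halves the channels
print(out.shape)              # torch.Size([8, 64, 100])
```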

Notes/Questions

  • In theory an RNN can keep unlimited memory, but it does so the way a computer represents floating-point numbers: using something finite to stand in for something infinite, which inevitably causes precision loss / forgetting.

  • Computers do not think the way humans do. We make decisions from memories of the last few moments, and our memories fade; a computer does not forget. As long as storage is not constrained, it can record one memory per time step, keep an extremely long history, and every memory stays exact.

  • One reason RNNs used to be popular, I suspect, is that compute was limited: representing memory with a single vector was cheap and therefore a good choice. Now that storage is plentiful, keeping a sufficiently long history is easy, while the RNN's lack of parallelism remains a fatal flaw.

  • Perhaps RNNs really can step off the stage of history.

(Figure: dilated convolution in the TCN, dilation_conv_in_tcn.png)
(Figure: residual block in the TCN, residual_block_in_tcn.png)