High-risk learning: acquiring new word vectors from tiny data


Paper: http://aclweb.org/anthology/D17-1030

Code: https://github.com/minimalparts/nonce2vec

Key points

Word embeddings are usually trained once on a large corpus (e.g., Wikipedia), producing a single model (calling word embeddings a "model" may not be entirely accurate, but let's go with it).

This paper is the example I had long been looking for: re-training pre-trained word embeddings on new data with the original algorithm (rather than a downstream neural network). One advantage of this approach is that a new word can quickly be placed at a sensible position in the existing vector space, without having to merge corpora and retrain from scratch. (With sparse count-based representations no training is needed at all: a new word's context is easy to establish and semantically similar words can be found immediately, but such representations are rarely used nowadays.)
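
To make the "new data, old algorithm" idea concrete, here is a minimal sketch of naive incremental training using gensim's Word2Vec as an illustration. The model path, example sentence, and hyperparameters are made up, and this is the plain continue-training baseline idea, not nonce2vec itself:

```python
from gensim.models import Word2Vec

# Load embeddings pre-trained on a large corpus (hypothetical path;
# assumes the model was built with min_count=1 so a rare word survives).
model = Word2Vec.load("wiki_word2vec.model")

# A tiny amount of new text containing an unseen word.
new_sentences = [["the", "wampimuk", "sat", "on", "the", "tree"]]

# Add the new word to the vocabulary while keeping existing vectors.
model.build_vocab(new_sentences, update=True)

# Continue training with the original skip-gram/CBOW objective.
model.train(new_sentences, total_examples=len(new_sentences), epochs=5)

print(model.wv["wampimuk"])
```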

The proposed method is called nonce2vec; a nonce word is a word coined ad hoc for a particular occasion (below I will simply call these "new words"). In plain terms, the goal is to quickly learn a representation for a new word within an existing vector space. I think this mirrors how humans learn new words: a mere handful of example sentences is enough to roughly work out what a new word means (think of the slang of the post-2000 generation).

One quick way to grasp a word's meaning, taking Chinese as an example, is to infer it roughly from the characters it is composed of; in English, from roots, prefixes, and suffixes. That amounts to decomposing the new word and taking the sum or average of its components' vectors. Another idea is to sum the vectors of the new word's context words (cf. how CBOW learns).
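
A minimal numpy sketch of the second idea, additive (context-averaging) initialisation; the function name and the dict-of-vectors format are my own assumptions, not code from the paper:

```python
import numpy as np

def additive_init(context_words, embeddings):
    """Initialise an unseen word as the average of the vectors of
    those context words that are already in the vocabulary."""
    known = [embeddings[w] for w in context_words if w in embeddings]
    if not known:
        return None  # no known context: caller falls back to random init
    return np.mean(known, axis=0)

# Usage: vector for a new word given one example sentence.
# new_vec = additive_init(["sat", "on", "the", "tree"], pretrained_vectors)
```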

The paper's position is that there should be an architecture that can learn word representations from an arbitrary amount of data and keep learning on top of what it has already learned, for old words and new words alike. To learn a new word's representation quickly, the experiments start with a very high learning rate, the "high-risk" part, and apply a few tricks to compensate for the damage a high learning rate can cause.

Nonce2vec is built on word2vec; the main modifications are:

  1. Initialisation: the new word is initialised by summing its context vectors as described above (restricted to known words); the paper additionally subsamples the context words at this step;

  2. High learning rate + large window: the aim is to greedily soak up the relevant context;

  3. Window resizing: the window is shrunk progressively during training;

  4. Subsampling: applied only to extremely high-frequency words;

  5. Selective training: only the new word's vector is updated, partly to prevent the high learning rate from corrupting the representations of old words;

  6. Parameter decay: the high-learning-rate + large-window regime is used only in the very earliest stage of learning a new word, to quickly locate its rough position in the vector space; once the new word's meaning is roughly settled, the high-risk strategy is replaced by conventional parameters for a fine-tuning phase. Concretely, the paper decays the learning rate exponentially, tied to the number of times the new word has been seen, shrinks the window, and raises the subsampling rate (a rough sketch of such a schedule follows this list).
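
As referenced in item 6, here is a rough sketch of what such a per-exposure decay schedule could look like. The specific constants (initial rate 1.0, window 15, decay factor, floors) are illustrative assumptions, not the paper's exact hyperparameters:

```python
def risk_schedule(times_seen,
                  alpha0=1.0, alpha_min=0.025, alpha_decay=0.5,
                  window0=15, window_min=5):
    """Return (learning_rate, window_size) for the nonce word after it
    has been seen `times_seen` times: start high-risk (large rate and
    window), then decay the rate exponentially and shrink the window."""
    alpha = max(alpha_min, alpha0 * (alpha_decay ** times_seen))
    window = max(window_min, window0 - times_seen)
    return alpha, window

# Example: hyperparameters for the first few exposures of a new word.
# for t in range(5):
#     print(t, risk_schedule(t))
```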

The experiments in the paper show that parameter decay is necessary; keeping the parameters fixed is harmful.
