Data Distillation: Towards Omni-Supervised Learning

Paper link: https://arxiv.org/abs/1712.04440

Key Points

The "omni" in the title means "all-around" or "comprehensive", so for now think of omni-supervised learning as all-around supervised learning.

What makes it all-around? According to the paper, it uses all available data: labeled data from multiple datasets, plus unlabeled data that can be collected from the internet.

Since it learns from both labeled and unlabeled data, the paper counts omni-supervised learning as a form of semi-supervised learning, but it stresses an important distinction: most semi-supervised work simulates the semi-supervised setting by splitting a fully labeled dataset into a "labeled" part and an "unlabeled" part, so the amount of data used for supervision shrinks and the performance ceiling is the supervised model trained on all of the labeled data. Omni-supervised learning, in contrast, first trains with supervision on all of the labeled data and then learns further from unlabeled data from the web, so that fully supervised model becomes its performance floor.

To this end, the paper introduces a technique called "data distillation", which is clearly related to the previously covered "model distillation". Their relationship (see the figure model_distillation_vs_data_distillation.png) is as follows:

As its name suggests, model distillation distills (transfers) the knowledge, i.e. the generalization ability, of several models (or one large model) into a small model. Data distillation uses only a single model, model A in the figure: this model is run on multiple transformations of the same unlabeled example, and the predictions are aggregated into a single label for that example. Assuming this label is reasonably accurate and trustworthy, the data labeled this way is then used in turn to train a model, which can be model A itself (in which case the student model is model A) or a new model. The name "data distillation" comes from the fact that it distills valuable information out of multiple copies of the data.
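To make the procedure concrete, here is a minimal sketch of the labeling step under some assumptions of mine: a PyTorch-style model, transform objects with apply/invert methods, and plain averaging in place of the task-specific aggregation. None of these names come from the paper's code.

```python
# Minimal sketch of the data-distillation labeling step (assumed PyTorch-style
# model; the transform objects and merge_predictions are hypothetical stand-ins).
import torch

def merge_predictions(preds):
    # Simple averaging as a stand-in for whatever task-specific aggregation
    # is used to fuse the per-transform predictions into one label.
    return torch.stack(preds).mean(dim=0)

def distill_labels(model_a, unlabeled_images, geometric_transforms):
    """Run model A on several transformed copies of each unlabeled image
    and merge the predictions into a single generated label."""
    model_a.eval()
    generated = []
    with torch.no_grad():
        for image in unlabeled_images:
            preds = []
            for t in geometric_transforms:        # e.g. flips, rescalings
                out = model_a(t.apply(image))     # predict on the transformed copy
                preds.append(t.invert(out))       # map the prediction back to the original frame
            generated.append((image, merge_predictions(preds)))
    return generated
```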

One advantage of the method is that the data keeps rolling in while the model stands still: no changes to the model itself are required.

One detail worth noting: during training, the paper makes sure that no mini-batch consists entirely of examples with generated labels; each mini-batch mixes ground-truth-labeled and generated-labeled data so that the gradients do not degrade too much.
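As a sketch of what such a batching rule might look like (the function name and the 50/50 ratio are my own illustrative choices, not taken from the paper):

```python
# Illustrative mixed mini-batch sampler: every batch keeps some ground-truth
# examples alongside generated-label examples. The real_fraction value is an
# assumed ratio for illustration, not the schedule used in the paper.
import random

def mixed_batches(labeled_data, generated_data, batch_size, real_fraction=0.5):
    n_real = max(1, int(batch_size * real_fraction))   # ground-truth examples per batch
    n_gen = batch_size - n_real                        # generated-label examples per batch
    while True:
        batch = (random.sample(labeled_data, n_real)
                 + random.sample(generated_data, n_gen))
        random.shuffle(batch)                          # avoid a fixed real/generated ordering
        yield batch
```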

Remarks

The experiments in this paper are very thorough; exemplary work.
