NLP Annotations
- Natural Language Processing (Almost) from Scratch (Collobert et al., 2011)
- neural networks for POS tagging, chunking, NER.
- randomly initialized lookup-table word vectors (see the sketch at the end of this entry).
- CoNLL challenge.
- Multitask learning.
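- A minimal sketch of the randomly initialized lookup-table layer over a context window; vocabulary, sizes, and names are my assumptions, not the paper's code:
  ```python
  import numpy as np

  # Window-approach features in the spirit of Collobert et al. (2011):
  # word vectors live in a randomly initialized lookup table trained jointly
  # with the tagger; a token is represented by the concatenated embeddings
  # of the words in a fixed window around it.
  vocab = {"<pad>": 0, "the": 1, "cat": 2, "sat": 3}
  emb_dim = 50
  rng = np.random.default_rng(0)
  lookup = rng.normal(scale=0.1, size=(len(vocab), emb_dim))

  def window_features(words, center, size=2):
      """Concatenate embeddings of the words in a window around `center`."""
      ids = [vocab.get(words[i], 0) if 0 <= i < len(words) else 0
             for i in range(center - size, center + size + 1)]
      return np.concatenate([lookup[i] for i in ids])

  print(window_features(["the", "cat", "sat"], center=1).shape)  # (250,)
  ```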
- Better Word Representations with Recursive Neural Networks for Morphology (Luong et al., 2013)
- recursive neural networks.
- nearly the same idea as fastText.
- Context-insensitive Morphological RNN (toy composition sketch at the end of this entry).
- Context-sensitive Morphological RNN: contextual embeddings from 2013.
- morphological segmentation toolkit: Morfessor.
- pre* stm suf* instead of (pre* stm suf*)+, which is handy for words in morphologically rich languages.
- training does not start from scratch; the models are initialized with existing word representations.
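- A toy sketch of the context-insensitive composition over morphemes; the single shared composition matrix and the dimensions are my assumptions:
  ```python
  import numpy as np

  # Morphemes (from a Morfessor-style segmentation) are combined bottom-up:
  # parent = tanh(W [left; right] + b), applied repeatedly until one word vector remains.
  dim = 50
  rng = np.random.default_rng(0)
  W = rng.normal(scale=0.1, size=(dim, 2 * dim))   # shared composition matrix
  b = np.zeros(dim)
  morpheme_vec = {"un": rng.normal(size=dim),
                  "fortunate": rng.normal(size=dim),
                  "ly": rng.normal(size=dim)}

  def compose(morphemes):
      """Build a word vector by composing morpheme vectors left to right."""
      parent = morpheme_vec[morphemes[0]]
      for m in morphemes[1:]:
          parent = np.tanh(W @ np.concatenate([parent, morpheme_vec[m]]) + b)
      return parent

  print(compose(["un", "fortunate", "ly"]).shape)  # (50,)
  ```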
- On The Difficulty Of Training Recurrent Neural Networks (Pascanu et al., 2013)
- definition of exploding/vanishing gradients.
- backpropagation through time.
- examples involving matrix norms and the spectral radius.
- dynamical systems.
- L1/L2 penalties, teacher forcing, LSTM, Hessian-free optimization, echo state networks, and their deficiencies.
- gradient clipping.
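- A minimal sketch of norm-based gradient clipping (the rescaling rule discussed in the paper; the threshold value is a hyperparameter):
  ```python
  import numpy as np

  def clip_gradient(grad, threshold=1.0):
      """Rescale the gradient when its L2 norm exceeds `threshold`,
      so an exploding gradient keeps its direction but not its magnitude."""
      norm = np.linalg.norm(grad)
      if norm > threshold:
          grad = grad * (threshold / norm)
      return grad

  g = np.random.randn(1000) * 10.0          # a pretend exploding gradient
  print(np.linalg.norm(clip_gradient(g)))   # <= 1.0
  ```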
- Distributed Representations of Words and Phrases and their Compositionality (Mikolov et al., 2013)
- word vectors without corpus statistics & contextualization.
- skip-gram and CBOW.
- hierarchical softmax & negative sampling.
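- For reference, the negative-sampling objective for a training pair (w_I, w_O) with k sampled negatives:
  ```latex
  \log \sigma\!\big( {v'_{w_O}}^{\top} v_{w_I} \big)
    + \sum_{i=1}^{k} \mathbb{E}_{w_i \sim P_n(w)}
      \big[ \log \sigma\!\big( -{v'_{w_i}}^{\top} v_{w_I} \big) \big]
  ```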
- Neural Machine Translation by Jointly Learning to Align and Translate (Bahdanau et al., 2014)
- soft-attention definition (equations at the end of this entry).
- RNNencdec vs. RNNsearch (proposed).
- RNNsearch30 > RNNencdec50.
- soft-attention vs. hard-attention.
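- The soft-attention computation as defined in the paper (additive/MLP alignment, softmax weights, context vector):
  ```latex
  e_{ij} = v_a^{\top} \tanh\!\left(W_a s_{i-1} + U_a h_j\right), \qquad
  \alpha_{ij} = \frac{\exp(e_{ij})}{\sum_{k=1}^{T_x} \exp(e_{ik})}, \qquad
  c_i = \sum_{j=1}^{T_x} \alpha_{ij} h_j
  ```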
- Effective Approaches to Attention-based Neural Machine Translation (Luong et al., 2015)
- new SOTA on WMT’14 English-German.
- proposed method: local attention/alignment.
- scoring types for alignment (attention mechanisms); see the score functions at the end of this entry.
- explains which scoring is better for which type of attention.
- ensemble rocks!
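- The three alignment score functions compared in the paper:
  ```latex
  \mathrm{score}(h_t, \bar{h}_s) =
  \begin{cases}
    h_t^{\top} \bar{h}_s & \text{dot} \\
    h_t^{\top} W_a \bar{h}_s & \text{general} \\
    v_a^{\top} \tanh\!\left(W_a [h_t; \bar{h}_s]\right) & \text{concat}
  \end{cases}
  ```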
- Neural Machine Translation of Rare Words with Subword Units (Sennrich et al., 2015)
- byte pair encoding (BPE) for subword tokenization (sketch at the end of this entry).
- productive word formation processes: agglutination and compounding.
- variance in the degree of morphological synthesis between languages.
- SMT examples, must look at.
- unigram segmentation performs poorly, but the bigram model is unable to produce some tokens in the test set.
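- A runnable sketch of BPE merge learning, essentially the toy algorithm given in the paper; the vocabulary and merge count below are made up:
  ```python
  import re
  import collections

  def get_stats(vocab):
      """Count frequencies of adjacent symbol pairs in the vocabulary."""
      pairs = collections.defaultdict(int)
      for word, freq in vocab.items():
          symbols = word.split()
          for i in range(len(symbols) - 1):
              pairs[(symbols[i], symbols[i + 1])] += freq
      return pairs

  def merge_vocab(pair, vocab):
      """Merge the most frequent pair into a single symbol everywhere it occurs."""
      bigram = re.escape(' '.join(pair))
      pattern = re.compile(r'(?<!\S)' + bigram + r'(?!\S)')
      return {pattern.sub(''.join(pair), word): freq for word, freq in vocab.items()}

  # toy vocabulary: words split into characters, with an end-of-word marker </w>
  vocab = {'l o w </w>': 5, 'l o w e r </w>': 2,
           'n e w e s t </w>': 6, 'w i d e s t </w>': 3}
  for _ in range(10):                 # number of merge operations (hyperparameter)
      pairs = get_stats(vocab)
      best = max(pairs, key=pairs.get)
      vocab = merge_vocab(best, vocab)
      print(best)
  ```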
- GloVe: Global Vectors for Word Representation (Pennington et al., 2014)
- shortcomings of word2vec.
- basic knowledge on matrix factorization methods (LSA, HAL, COALS, HPCA).
- where does GloVe’s loss fn come from? (written out at the end of this entry)
- tldr: complexity of GloVe.
- semantic vs. syntactic analogy benchmarks.
- symmetric vs. asymmetric context windows vs. vector dimension.
- SVD-S and SVD-L
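- The weighted least-squares loss the paper arrives at, with the weighting function that caps frequent co-occurrences:
  ```latex
  J = \sum_{i,j=1}^{V} f(X_{ij})\left(w_i^{\top}\tilde{w}_j + b_i + \tilde{b}_j - \log X_{ij}\right)^2,
  \qquad
  f(x) = \begin{cases} (x/x_{\max})^{\alpha} & x < x_{\max} \\ 1 & \text{otherwise} \end{cases}
  ```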
- Enriching Word Vectors with Subword Information (Bojanowski et al., 2016)
- morphologically rich languages.
- char n-grams (extraction sketch at the end of this entry).
- morphological information significantly improves syntactic tasks but does not help semantic questions; choosing the n-gram sizes well helps.
- other morphological representations.
- very good vectors on small datasets.
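- A small sketch of the character n-gram extraction (an illustrative helper, not the fastText API; in the model each n-gram gets its own vector and the word vector is their sum):
  ```python
  def char_ngrams(word, n_min=3, n_max=6):
      """Extract character n-grams of a word with boundary symbols < and >."""
      wrapped = f"<{word}>"
      grams = []
      for n in range(n_min, n_max + 1):
          for i in range(len(wrapped) - n + 1):
              grams.append(wrapped[i:i + n])
      return grams

  print(char_ngrams("where", 3, 3))
  # ['<wh', 'whe', 'her', 'ere', 're>']
  ```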
- Deep contextualized word representations (Peters et al., 2018)
- char convolutions as inputs.
- One Billion Word Benchmark.
- other context dependent papers.
- SOTA on 6 benchmarks.
- first-layer vs. second-layer representations (task-weighted layer combination at the end of this entry).
- word sense disambiguation.
- GloVe vs. biLM.
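- The task-specific combination ELMo learns over the frozen biLM layers (L biLM layers plus the token layer):
  ```latex
  \mathrm{ELMo}_k^{task} = \gamma^{task} \sum_{j=0}^{L} s_j^{task}\, h_{k,j}^{LM}
  ```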
- Improving Language Understanding By Generative Pre-Training (Radford et al., 2018)
- transformer decoder.
- language modeling as objective function in pre-training.
- NLI types.
- good visualizations of fine-tuning tasks.
- 12 decoder layers (the same layer count as BERT base).
- comparison: language modeling as an auxiliary objective in fine-tuning.
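- The fine-tuning objective with the auxiliary LM term (λ weights the language-modeling loss added to the task loss):
  ```latex
  L_1(\mathcal{U}) = \sum_{i} \log P(u_i \mid u_{i-k}, \ldots, u_{i-1}; \Theta), \qquad
  L_3(\mathcal{C}) = L_2(\mathcal{C}) + \lambda \cdot L_1(\mathcal{C})
  ```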
- BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding (Devlin et al., 2018)
- masked LM as the objective fn in pre-training (masking sketch at the end of this entry).
- transformer encoder.
- like ELMo, but Transformer-based and deeper.
- BERT large vs BERT base.
- [CLS] and [SEP] tokens.
- WordPiece tokenizer.
- Task specific input representations.
- GPT vs. BERT.
- GLUE Benchmarks.
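- A toy sketch of the masked-LM corruption (the 15% selection with the 80/10/10 split from the paper; token handling is simplified and there is no WordPiece here):
  ```python
  import random

  def mask_tokens(tokens, vocab, mask_prob=0.15):
      """Corrupt a token sequence for masked-LM training:
      15% of tokens are selected; of those, 80% -> [MASK], 10% -> random token, 10% kept."""
      inputs, labels = [], []
      for tok in tokens:
          if random.random() < mask_prob:
              labels.append(tok)                 # model must predict the original token
              r = random.random()
              if r < 0.8:
                  inputs.append("[MASK]")
              elif r < 0.9:
                  inputs.append(random.choice(vocab))
              else:
                  inputs.append(tok)             # kept as-is, still predicted
          else:
              inputs.append(tok)
              labels.append(None)                # not part of the LM loss
      return inputs, labels

  toks = "the cat sat on the mat".split()
  print(mask_tokens(toks, vocab=toks))
  ```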
- Cross-lingual Language Model Pretraining (Lample et al., 2019)
- novel unsupervised method for learning cross-lingual representations.
- novel supervised method for cross-lingual pretraining.
- CLM, MLM, TLM (TLM input sketch at the end of this entry).
- XNLI
- fine-tuning on English only for sequence classification (cross-lingual transfer).
- shared subword vocab.
- low-resource language modeling.
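- A rough sketch of how a TLM training example is assembled (language IDs and the position reset follow the paper's description; tokenization and special symbols are simplified assumptions):
  ```python
  import random

  def tlm_example(src_tokens, tgt_tokens, mask_prob=0.15):
      """Concatenate a parallel sentence pair and mask tokens in BOTH languages,
      so the model can attend across the translation to predict masked words."""
      tokens = src_tokens + ["</s>"] + tgt_tokens
      langs = ["en"] * (len(src_tokens) + 1) + ["fr"] * len(tgt_tokens)             # language embeddings
      positions = list(range(len(src_tokens) + 1)) + list(range(len(tgt_tokens)))   # positions reset for the target side
      masked = [t if random.random() > mask_prob else "[MASK]" for t in tokens]
      return masked, langs, positions

  print(tlm_example("the cat sat".split(), "le chat est assis".split()))
  ```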