[김성범] Data Augmentation (NLP, 2022. 1. 21. 08:28)
[Reference video] https://www.youtube.com/watch?v=7fwxcCJsvPw&t=570s
[14:02]
1. Thesaurus-based Substitution
- Replace a random word with its synonym using a Thesaurus.
- ex) It is awesome => amazing, awe-inspiring, awing => It is amazing.
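A minimal sketch of thesaurus-based substitution, using a tiny hand-written synonym table (illustrative only; a real implementation would query a thesaurus such as WordNet, e.g. via NLTK):

```python
import random

# Toy thesaurus for illustration; in practice use WordNet or similar.
THESAURUS = {"awesome": ["amazing", "awe-inspiring", "awing"]}

def thesaurus_substitute(sentence, thesaurus, rng=random.Random(0)):
    """Replace each word that has a thesaurus entry with a random synonym."""
    out = []
    for word in sentence.split():
        synonyms = thesaurus.get(word.lower())
        out.append(rng.choice(synonyms) if synonyms else word)
    return " ".join(out)

print(thesaurus_substitute("It is awesome", THESAURUS))
```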
2. Word Embedding-based Substitution
- Use pre-trained word embeddings such as Word2Vec, GloVe, FastText, etc.
- Find the nearest neighbor words and substitute.
- ex) It is awesome => It is amazing.
It is perfect.
It is fantastic.
- Related paper: TinyBERT: Distilling BERT for Natural Language Understanding.
Xiaoqi Jiao, Yichun Yin, Lifeng Shang, Xin Jiang, Xiao Chen, Linlin Li, Fang Wang, Qun Liu. https://arxiv.org/abs/1909.10351
- Related resource: https://amitness.com/2020/05/data-augmentation-for-nlp/
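A sketch of embedding-based substitution with toy 3-dimensional vectors (the vectors and vocabulary below are made up for illustration; real use would load pre-trained Word2Vec/GloVe/FastText vectors):

```python
import math

# Toy embeddings; in practice load pre-trained vectors.
EMB = {
    "awesome":   [0.90, 0.10, 0.00],
    "amazing":   [0.88, 0.12, 0.01],
    "perfect":   [0.80, 0.20, 0.05],
    "fantastic": [0.85, 0.15, 0.02],
    "table":     [0.00, 0.10, 0.90],
}

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def nearest_neighbors(word, emb, k=3):
    """Rank the rest of the vocabulary by cosine similarity to `word`."""
    v = emb[word]
    others = sorted((w for w in emb if w != word),
                    key=lambda w: -cosine(v, emb[w]))
    return others[:k]

print(nearest_neighbors("awesome", EMB))  # synonyms rank above "table"
```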
3. Masked Language Model
- Use pre-trained masked language models such as BERT, RoBERTa, and ALBERT.
- Mask out some words & see what the model predicts.
- ex) This is [mask] cool => BERT => This is pretty cool.
This is really cool.
This is super cool.
This is kinda cool.
This is very cool.
- Related resource for the example: https://amitness.com/2020/05/data-augmentation-for-nlp/
- Related paper: BERT-based Adversarial Examples for Text Classification. https://arxiv.org/abs/2004.01970
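A sketch of the mask-and-predict loop. The `predict_mask` stub below is a stand-in with hard-coded candidates; a real implementation would call a masked LM (e.g. a Hugging Face fill-mask pipeline with BERT) to get the top-k predictions:

```python
# Stub standing in for a real masked LM; candidates are hard-coded here.
def predict_mask(text, top_k=5):
    candidates = ["pretty", "really", "super", "kinda", "very"]
    return candidates[:top_k]

def mlm_augment(text, top_k=5):
    """Generate augmented sentences by filling [mask] with the
    model's top-k predicted tokens."""
    assert "[mask]" in text
    return [text.replace("[mask]", w) for w in predict_mask(text, top_k)]

for sent in mlm_augment("This is [mask] cool"):
    print(sent)
```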
4. TF-IDF-based Replacement
- Words with low TF-IDF scores are uninformative.
- Replacing those words will not affect the original semantic information.
- ex) This virus has spread worldwide. => A virus has spread worldwide.
- Related paper: Unsupervised Data Augmentation for Consistency Training. https://arxiv.org/abs/1904.12848
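A simplified sketch of TF-IDF-based replacement over a tiny made-up corpus. Here words with near-zero IDF (those appearing in nearly every document) are swapped for a fixed placeholder; the UDA paper instead samples replacements from low-TF-IDF words, which this sketch does not reproduce:

```python
import math

# Tiny illustrative corpus; "this" appears in every document.
CORPUS = [
    "this virus has spread worldwide",
    "this is a new virus",
    "this report covers the global spread",
]

def idf_scores(corpus):
    """IDF: words that appear in many documents score near zero."""
    docs = [set(doc.split()) for doc in corpus]
    vocab = set().union(*docs)
    n = len(docs)
    return {w: math.log(n / sum(w in d for d in docs)) for w in vocab}

def replace_uninformative(sentence, idf, replacement="a", threshold=0.1):
    """Swap near-zero-IDF (uninformative) words for a placeholder."""
    return " ".join(replacement if idf.get(w, 1.0) <= threshold else w
                    for w in sentence.split())

idf = idf_scores(CORPUS)
print(replace_uninformative("this virus has spread worldwide", idf))
```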
5. Back Translation
- Translate into another language.
- Translate back to original language.
- ex) (EN) This is very cool => (FR) C'est très cool => (EN) That's very cool
- Related paper: Unsupervised Data Augmentation for Consistency Training. https://arxiv.org/abs/1904.12848
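A toy sketch of the round-trip idea using word-level lookup tables (purely illustrative; real back translation chains two trained MT models, e.g. EN→FR then FR→EN, and the paraphrase comes from the models' decoding choices):

```python
# Toy word-level "translation" tables standing in for real MT models.
EN_TO_FR = {"this": "ceci", "is": "est", "very": "très", "cool": "cool"}
FR_TO_EN = {"ceci": "that", "est": "is", "très": "very", "cool": "cool"}

def translate(sentence, table):
    return " ".join(table.get(w, w) for w in sentence.lower().split())

def back_translate(sentence):
    """Round-trip through the pivot language to produce a paraphrase."""
    pivot = translate(sentence, EN_TO_FR)
    return translate(pivot, FR_TO_EN)

print(back_translate("This is very cool"))  # -> "that is very cool"
```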
6. Word/Sentence Mixup
- Use Mixup on word embedding features
- Use Mixup on sentence embedding features
- Related paper: Augmenting Data with Mixup for Sentence Classification: An Empirical Study. https://arxiv.org/abs/1905.08941
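A minimal sketch of mixup applied to embedding features: two embedding vectors and their one-hot labels are linearly interpolated with a Beta-distributed mixing coefficient (the 2-d vectors and alpha value below are illustrative):

```python
import random

def mixup(emb_a, emb_b, label_a, label_b, alpha=0.2, rng=random.Random(0)):
    """Interpolate two embedding vectors and their one-hot labels with
    a mixing coefficient lam ~ Beta(alpha, alpha)."""
    lam = rng.betavariate(alpha, alpha)
    mixed_emb = [lam * a + (1 - lam) * b for a, b in zip(emb_a, emb_b)]
    mixed_label = [lam * a + (1 - lam) * b for a, b in zip(label_a, label_b)]
    return mixed_emb, mixed_label

emb, label = mixup([1.0, 0.0], [0.0, 1.0], [1, 0], [0, 1])
print(emb, label)
```

The same interpolation works at the word-embedding level (per token) or the sentence-embedding level, which is the word/sentence distinction in this section.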