[김성범] Data Augmentation
    NLP | 2022. 1. 21. 08:28

     

     

    [Reference video] https://www.youtube.com/watch?v=7fwxcCJsvPw&t=570s

     

    [14 min 02 sec]

    1. Thesaurus-based Substitution

    - Replace a random word with its synonym using a Thesaurus.

    - ex) It is awesome => synonyms of "awesome": amazing, awe-inspiring, awing => It is amazing.
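
    A minimal sketch of this idea in Python, assuming NLTK's WordNet as the thesaurus (the helper name synonym_replace and the example sentence are illustrative):

    import random
    from nltk.corpus import wordnet   # requires nltk.download('wordnet') beforehand

    def synonym_replace(sentence, n=1):
        # Replace up to n random words with one of their WordNet synonyms.
        words = sentence.split()
        candidates = [i for i, w in enumerate(words) if wordnet.synsets(w)]
        random.shuffle(candidates)
        for i in candidates[:n]:
            synonyms = {lemma.name().replace('_', ' ')
                        for syn in wordnet.synsets(words[i])
                        for lemma in syn.lemmas()} - {words[i]}
            if synonyms:
                words[i] = random.choice(sorted(synonyms))
        return ' '.join(words)

    print(synonym_replace("It is awesome"))   # e.g. "It is amazing"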

     

     

    2. Word Embedding-based Substitution

    - Use pre-trained word embeddings such as Word2Vec, GloVe, FastText, etc.

    - Find the nearest neighbor words and substitute.

    - ex) It is awesome => It is amazing.
                           It is perfect.
                           It is fantastic.

    - Related paper: TinyBERT: Distilling BERT for Natural Language Understanding
    Xiaoqi Jiao, Yichun Yin, Lifeng Shang, Xin Jiang, Xiao Chen, Linlin Li, Fang Wang, Qun Liu

    https://arxiv.org/abs/1909.10351

    - Related material: https://amitness.com/2020/05/data-augmentation-for-nlp/
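
    As a sketch, the nearest-neighbor lookup can be done with gensim's pre-trained vectors (the glove-wiki-gigaword-100 download and the helper name embedding_substitute are illustrative choices):

    import random
    import gensim.downloader as api

    glove = api.load("glove-wiki-gigaword-100")   # pre-trained GloVe word vectors

    def embedding_substitute(sentence, n=1):
        # Swap up to n words for their nearest neighbor in embedding space.
        words = sentence.split()
        candidates = [i for i, w in enumerate(words) if w.lower() in glove]
        for i in random.sample(candidates, min(n, len(candidates))):
            neighbor, _ = glove.most_similar(words[i].lower(), topn=1)[0]
            words[i] = neighbor
        return ' '.join(words)

    print(embedding_substitute("It is awesome"))   # e.g. "It is amazing"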

     

     

    3. Masked Language Model

    - Use pre-trained masked language models such as BERT, RoBERTa, and ALBERT.

    - Mask out some words & see what the model predicts.

    - ex) This is [mask] cool => BERT => This is pretty cool.
                                         This is really cool.
                                         This is super cool.
                                         This is kinda cool.
                                         This is very cool.

    - Related material for the example: https://amitness.com/2020/05/data-augmentation-for-nlp/

    - Related paper: BAE: BERT-based Adversarial Examples for Text Classification, https://arxiv.org/abs/2004.01970
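
    A minimal sketch with the Hugging Face fill-mask pipeline (bert-base-uncased is just one common checkpoint choice):

    from transformers import pipeline

    fill_mask = pipeline("fill-mask", model="bert-base-uncased")

    # Mask a word and keep the model's top suggestions as augmented sentences.
    for pred in fill_mask("This is [MASK] cool.", top_k=5):
        print(pred["sequence"])   # e.g. "this is pretty cool.", "this is really cool.", ...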

     

    4. TF-IDF-based Replacement

    - Words with low TF-IDF scores are uninformative.

    - Replacing those words will not affect the original semantic information.

    - ex) This virus has spread worldwide. => A virus has spread worldwide.

    - Related paper: Unsupervised Data Augmentation for Consistency Training, https://arxiv.org/abs/1904.12848
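
    A rough sketch of the idea with scikit-learn: fit IDF scores on a corpus, treat words with low scores (i.e. words common across documents) as uninformative, and swap them for other uninformative words. The toy corpus, the threshold, and the use of IDF instead of the paper's full TF-IDF weighting are simplifying assumptions:

    import random
    from sklearn.feature_extraction.text import TfidfVectorizer

    corpus = ["This virus has spread worldwide.",
              "This outbreak has caused concern worldwide.",
              "A vaccine has been developed."]

    vectorizer = TfidfVectorizer()
    vectorizer.fit(corpus)
    idf = dict(zip(vectorizer.get_feature_names_out(), vectorizer.idf_))

    def tfidf_replace(sentence, threshold=1.5):
        # Swap low-IDF (uninformative) words for other low-IDF words.
        uninformative = [w for w, score in idf.items() if score <= threshold]
        out = []
        for word in sentence.split():
            key = word.lower().strip('.,')
            if key in idf and idf[key] <= threshold and uninformative:
                out.append(random.choice(uninformative))
            else:
                out.append(word)
        return ' '.join(out)

    print(tfidf_replace("This virus has spread worldwide."))   # e.g. "worldwide virus this spread has"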

     

     

    5. Back Translation

    - Translate the sentence into another language.

    - Translate it back into the original language.

    - ex) (EN) This is very cool => (FR) C'est très cool => (EN) That's very cool

    - Related paper: Unsupervised Data Augmentation for Consistency Training, https://arxiv.org/abs/1904.12848
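
    A short sketch of back translation using two MarianMT models from the Hugging Face hub (the English-French checkpoint pair below is one possible choice):

    from transformers import pipeline

    en_to_fr = pipeline("translation", model="Helsinki-NLP/opus-mt-en-fr")
    fr_to_en = pipeline("translation", model="Helsinki-NLP/opus-mt-fr-en")

    def back_translate(sentence):
        # EN -> FR -> EN round trip produces a paraphrase of the input.
        french = en_to_fr(sentence)[0]["translation_text"]
        return fr_to_en(french)[0]["translation_text"]

    print(back_translate("This is very cool"))   # e.g. "That's very cool"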

     

     

    6. Word/Sent Mixup

    - Use Mixup on word embedding features.

    - Use Mixup on sentence embedding features.

    - ex) See the sketch below for sentence-level Mixup.

     

     

    - Related paper: Augmenting Data with Mixup for Sentence Classification: An Empirical Study, https://arxiv.org/abs/1905.08941
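
    A minimal NumPy sketch of sentence-level Mixup: two sentence embeddings and their one-hot labels are linearly interpolated with a coefficient drawn from a Beta distribution (the 300-dimensional random vectors and the alpha value are stand-ins for real features):

    import numpy as np

    def mixup(emb_a, emb_b, label_a, label_b, alpha=0.2):
        # Interpolate two embedding/label pairs with a Beta-distributed weight.
        lam = np.random.beta(alpha, alpha)            # mixing coefficient in [0, 1]
        emb = lam * emb_a + (1 - lam) * emb_b         # mixed sentence embedding
        label = lam * label_a + (1 - lam) * label_b   # mixed soft label
        return emb, label

    emb_a, emb_b = np.random.randn(300), np.random.randn(300)      # stand-in embeddings
    label_a, label_b = np.array([1.0, 0.0]), np.array([0.0, 1.0])  # one-hot class labels
    mixed_emb, mixed_label = mixup(emb_a, emb_b, label_a, label_b)
    print(mixed_label)   # e.g. [0.97 0.03]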

     

     

     
