[김성범: DMQA] Data Augmentation Techniques
NLP 2022. 1. 21. 11:33
1. Lexical Substitution
(1) Thesaurus-based substitution
(2) Word-Embeddings Substitution
(3) Masked Language Model
(4) TF-IDF based word replacement
2. Back Translation
3. Text Surface Transformation
4. Random Noise Injection
(1) Spelling error injection
(2) QWERTY Keyboard Error Injection
(3) Unigram Noising
(4) Blank Noising
(5) Sentence Shuffling
(6) Random Insertion
(7) Random Swap
(8) Random Deletion
5. Instance Crossover Augmentation
6. Syntax-tree Manipulation
7. MixUp for Text
(1) wordMixup
(2) sentMixup
8. Generative Methods
(1) Conditional Pre-trained Language Models
9. Implementation
(1) nlpaug
(2) textattack
[Reference] https://amitness.com/2020/05/data-augmentation-for-nlp/
===================================
1. Lexical Substitution
(1) Thesaurus-based substitution
[Paper] Character-level Convolutional Networks for Text Classification. Advances in Neural Information Processing Systems (Zhang X, Zhao J, & LeCun Y, 2015)
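Thesaurus-based substitution swaps a word for one of its synonyms. The following is a minimal sketch, assuming a tiny hand-written thesaurus (`TOY_THESAURUS` and `synonym_replace` are hypothetical names); a real pipeline would typically look synonyms up in a resource such as WordNet.

```python
import random

# Hypothetical toy thesaurus, used only for illustration.
TOY_THESAURUS = {
    "quick": ["fast", "rapid"],
    "happy": ["glad", "cheerful"],
    "big": ["large", "huge"],
}

def synonym_replace(tokens, n, rng=None, thesaurus=TOY_THESAURUS):
    """Replace up to n tokens that have a thesaurus entry with a random synonym."""
    rng = rng or random.Random()
    out = list(tokens)
    # Only tokens with a known synonym are candidates for replacement.
    candidates = [i for i, t in enumerate(out) if t.lower() in thesaurus]
    for i in rng.sample(candidates, min(n, len(candidates))):
        out[i] = rng.choice(thesaurus[out[i].lower()])
    return out

print(synonym_replace("the quick dog is happy".split(), n=1, rng=random.Random(0)))
```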
(2) Word-Embeddings Substitution
[Paper] TinyBERT: Distilling BERT for Natural Language Understanding (Jiao X, Yin Y, Shang L, et al., 2019)
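Word-embedding substitution replaces a word with its nearest neighbor in an embedding space. A minimal sketch, assuming made-up 2-dimensional vectors (`TOY_VECTORS` is hypothetical; in practice pre-trained embeddings such as Word2Vec or GloVe would be loaded instead):

```python
import math
import random

# Hypothetical 2-d toy vectors, used only for illustration.
TOY_VECTORS = {
    "good": (0.9, 0.1),
    "great": (0.85, 0.2),
    "bad": (-0.9, 0.1),
    "awful": (-0.8, 0.3),
}

def cosine(u, v):
    """Cosine similarity between two 2-d vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.hypot(*u) * math.hypot(*v))

def nearest_neighbor(word, vectors=TOY_VECTORS):
    """Return the most similar other vocabulary word, or the word itself if unknown."""
    if word not in vectors:
        return word
    others = [w for w in vectors if w != word]
    return max(others, key=lambda w: cosine(vectors[word], vectors[w]))

def embedding_replace(tokens, p, rng=None, vectors=TOY_VECTORS):
    """Replace each in-vocabulary token with its nearest neighbor with probability p."""
    rng = rng or random.Random()
    return [nearest_neighbor(t, vectors) if rng.random() < p else t for t in tokens]

print(embedding_replace("the food was good".split(), p=1.0))
```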
(3) Masked Language Model
[Paper] BAE: BERT-based Adversarial Examples for Text Classification (Garg S, Ramakrishnan G, 2020)
(4) TF-IDF based word replacement
[Paper] Unsupervised Data Augmentation for Consistency Training (Xie Q, Dai Z, et al., 2019)
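The idea behind TF-IDF based replacement is that words with low TF-IDF scores carry little information, so they can be replaced without changing the label. The sketch below is a simplified illustration, not the procedure from the cited paper: the smoothed IDF formula, the fixed `threshold`, and the replacement pool (the lowest-IDF words in the corpus) are all assumptions.

```python
import math
import random
from collections import Counter

def tfidf_replace(doc, corpus, threshold, rng=None):
    """Replace tokens of `doc` whose TF-IDF score is below `threshold`."""
    rng = rng or random.Random()
    n = len(corpus)
    # Document frequency: number of documents that contain each token.
    df = Counter(tok for d in corpus for tok in set(d))
    # Smoothed IDF (an assumed variant); tokens present in every document get 1.0.
    idf = {tok: math.log((1 + n) / (1 + c)) + 1.0 for tok, c in df.items()}
    unseen_idf = math.log(1 + n) + 1.0
    # Replacement pool: the least informative (lowest-IDF) words in the corpus.
    min_idf = min(idf.values())
    pool = sorted(tok for tok, v in idf.items() if v == min_idf)
    tf = Counter(doc)
    out = []
    for tok in doc:
        score = tf[tok] / len(doc) * idf.get(tok, unseen_idf)
        out.append(rng.choice(pool) if score < threshold else tok)
    return out

corpus = [
    ["the", "movie", "was", "great"],
    ["the", "plot", "was", "dull"],
    ["the", "acting", "was", "superb"],
]
print(tfidf_replace(corpus[0], corpus, threshold=0.3, rng=random.Random(0)))
```

In this toy corpus only "the" and "was" fall below the threshold, so the content words "movie" and "great" are left untouched.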
2. Back Translation
[Paper] Unsupervised Data Augmentation for Consistency Training (Xie Q, Dai Z, et al., 2019)
3. Text Surface Transformation
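Surface transformations rewrite the text with simple pattern rules, for example expanding or contracting verb forms. A minimal sketch using regular expressions, assuming a short hand-picked list of unambiguous contractions (`EXPANSIONS` is hypothetical and far from complete):

```python
import re

# Hypothetical, deliberately short list of unambiguous contractions.
EXPANSIONS = {
    "don't": "do not",
    "can't": "cannot",
    "won't": "will not",
    "isn't": "is not",
}
CONTRACTIONS = {long: short for short, long in EXPANSIONS.items()}

def _apply(text, mapping):
    """Replace every whole-word occurrence of a mapping key (case-insensitive)."""
    pattern = re.compile(
        r"\b(" + "|".join(re.escape(k) for k in mapping) + r")\b", re.IGNORECASE
    )
    # Note: the replacement is always lowercase; original casing is not preserved.
    return pattern.sub(lambda m: mapping[m.group(0).lower()], text)

def expand(text):
    return _apply(text, EXPANSIONS)

def contract(text):
    return _apply(text, CONTRACTIONS)

print(expand("I don't know"))
print(contract("It is not over and I cannot stay"))
```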
4. Random Noise Injection
(1) Spelling error injection
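Spelling error injection replaces words with plausible misspellings so the model becomes more robust to typos. A minimal sketch, assuming a tiny hypothetical dictionary of common misspellings (`MISSPELLINGS`); real lists are much longer.

```python
import random

# Hypothetical misspelling dictionary, used only for illustration.
MISSPELLINGS = {
    "because": ["becuase", "becasue"],
    "definitely": ["definately"],
    "receive": ["recieve"],
}

def inject_spelling_errors(tokens, p, rng=None):
    """Misspell each token that has a dictionary entry with probability p."""
    rng = rng or random.Random()
    return [
        rng.choice(MISSPELLINGS[t]) if t in MISSPELLINGS and rng.random() < p else t
        for t in tokens
    ]

print(inject_spelling_errors("i will definitely receive it".split(), p=1.0))
```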
(2) QWERTY Keyboard Error Injection
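QWERTY keyboard error injection simulates typos by replacing a character with a key that sits next to it on the keyboard. A minimal sketch, assuming a partial hand-written neighbor map (`QWERTY_NEIGHBORS` is hypothetical and covers only a few keys):

```python
import random

# Partial neighbor map for a QWERTY layout (illustrative, not exhaustive).
QWERTY_NEIGHBORS = {
    "a": "qwsz", "e": "wsdr", "i": "ujko", "o": "iklp",
    "n": "bhjm", "s": "awedxz", "t": "rfgy",
}

def keyboard_noise(text, p, rng=None):
    """Replace each mapped character with a neighboring key with probability p."""
    rng = rng or random.Random()
    chars = []
    for ch in text:
        neighbors = QWERTY_NEIGHBORS.get(ch.lower())
        if neighbors and rng.random() < p:
            # The replacement is lowercase even if the original was uppercase.
            chars.append(rng.choice(neighbors))
        else:
            chars.append(ch)
    return "".join(chars)

print(keyboard_noise("nation", p=0.3, rng=random.Random(0)))
```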
(3) Unigram Noising
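Unigram noising replaces words with words sampled from the unigram frequency distribution of a corpus. A minimal sketch, assuming the distribution is estimated by simple counting over `corpus_tokens` (the function name and parameters are hypothetical):

```python
import random
from collections import Counter

def unigram_noise(tokens, corpus_tokens, p, rng=None):
    """Replace each token with probability p by a word drawn from the unigram distribution."""
    rng = rng or random.Random()
    counts = Counter(corpus_tokens)
    vocab = list(counts)
    weights = [counts[w] for w in vocab]  # sampling weight is the raw frequency
    return [
        rng.choices(vocab, weights=weights, k=1)[0] if rng.random() < p else t
        for t in tokens
    ]

corpus_tokens = "the cat sat on the mat and the dog sat too".split()
print(unigram_noise("a bird flew away".split(), corpus_tokens, p=0.5, rng=random.Random(0)))
```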
(4) Blank Noising
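Blank noising replaces random words with a placeholder token. A minimal sketch, assuming `"_"` as the placeholder (the function name and default are hypothetical):

```python
import random

def blank_noise(tokens, p, rng=None, placeholder="_"):
    """Replace each token with the placeholder with probability p."""
    rng = rng or random.Random()
    return [placeholder if rng.random() < p else t for t in tokens]

print(blank_noise("this movie was really great".split(), p=0.3, rng=random.Random(0)))
```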
(5) Sentence Shuffling
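Sentence shuffling reorders the sentences of a text while leaving each sentence intact. A minimal sketch, assuming sentences end with `.`, `!`, or `?` followed by whitespace; this naive splitter is an assumption and will mis-split abbreviations.

```python
import random
import re

def split_sentences(text):
    """Naive splitter: break after ., ! or ? when followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]

def shuffle_sentences(text, rng=None):
    """Return the text with its sentences in random order."""
    rng = rng or random.Random()
    sentences = split_sentences(text)
    rng.shuffle(sentences)
    return " ".join(sentences)

text = "I went out. It rained! Did I mind? No."
print(shuffle_sentences(text, rng=random.Random(0)))
```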
(6) Random Insertion
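Random insertion picks a word from the sentence, looks up one of its synonyms, and inserts that synonym at a random position. A minimal sketch, assuming the caller supplies a thesaurus dictionary (the one in the usage line is a hypothetical toy):

```python
import random

def random_insertion(tokens, n, thesaurus, rng=None):
    """Insert n synonyms of randomly chosen words at random positions."""
    rng = rng or random.Random()
    out = list(tokens)
    for _ in range(n):
        candidates = [t for t in out if t in thesaurus]
        if not candidates:
            break  # no word in the sentence has a known synonym
        synonym = rng.choice(thesaurus[rng.choice(candidates)])
        out.insert(rng.randint(0, len(out)), synonym)
    return out

toy_thesaurus = {"happy": ["glad", "cheerful"]}  # hypothetical
print(random_insertion("i am happy today".split(), 2, toy_thesaurus, random.Random(0)))
```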
(7) Random Swap
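Random swap exchanges the positions of two randomly chosen words, repeated `n` times. A minimal sketch (the function name and parameters are hypothetical):

```python
import random

def random_swap(tokens, n, rng=None):
    """Swap two randomly chosen positions, n times."""
    rng = rng or random.Random()
    out = list(tokens)
    if len(out) < 2:
        return out  # nothing to swap
    for _ in range(n):
        i, j = rng.sample(range(len(out)), 2)  # two distinct positions
        out[i], out[j] = out[j], out[i]
    return out

print(random_swap("the cat sat on the mat".split(), n=1, rng=random.Random(0)))
```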
(8) Random Deletion
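Random deletion removes each word independently with probability `p`. A minimal sketch; keeping at least one word when everything would be deleted is a design choice assumed here so the output is never empty.

```python
import random

def random_deletion(tokens, p, rng=None):
    """Drop each token with probability p, but never return an empty list."""
    rng = rng or random.Random()
    if len(tokens) <= 1:
        return list(tokens)
    kept = [t for t in tokens if rng.random() >= p]
    # If every token was dropped, fall back to one randomly chosen token.
    return kept or [rng.choice(tokens)]

print(random_deletion("the cat sat on the mat".split(), p=0.3, rng=random.Random(0)))
```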
5. Instance Crossover Augmentation
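Instance crossover combines two instances of the same class: each text is split in half, and the halves are exchanged to form two new instances that keep the shared label. A minimal sketch, assuming token lists are split at the midpoint and instances are paired at random within each label (`crossover` and `crossover_augment` are hypothetical names):

```python
import random

def crossover(tokens_a, tokens_b):
    """Swap the second halves of two token lists."""
    mid_a, mid_b = len(tokens_a) // 2, len(tokens_b) // 2
    return tokens_a[:mid_a] + tokens_b[mid_b:], tokens_b[:mid_b] + tokens_a[mid_a:]

def crossover_augment(dataset, rng=None):
    """dataset: list of (tokens, label). Cross random pairs that share a label."""
    rng = rng or random.Random()
    by_label = {}
    for tokens, label in dataset:
        by_label.setdefault(label, []).append(tokens)
    new_instances = []
    for label, items in by_label.items():
        items = items[:]
        rng.shuffle(items)
        # Pair consecutive items; an unpaired leftover is skipped.
        for a, b in zip(items[::2], items[1::2]):
            for child in crossover(a, b):
                new_instances.append((child, label))
    return new_instances

data = [
    ("i love this phone".split(), "pos"),
    ("great battery and screen".split(), "pos"),
    ("it broke after a week".split(), "neg"),
]
print(crossover_augment(data, rng=random.Random(0)))
```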
6. Syntax-tree Manipulation
7. MixUp for Text
(1) wordMixup
[Paper] Augmenting Data with Mixup for Sentence Classification: An Empirical Study (Guo H, Mao Y, Zhang R, 2019)
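wordMixup applies mixup at the word-embedding level: two sentences are zero-padded to the same length, and their word vectors and labels are linearly interpolated with a mixing ratio. A minimal sketch in plain Python lists, assuming one-hot labels and a mixing ratio `lam` chosen by the caller (all names are hypothetical):

```python
def pad(seq, length, dim):
    """Zero-pad a list of word vectors to the given length."""
    return seq + [[0.0] * dim for _ in range(length - len(seq))]

def word_mixup(emb_a, emb_b, label_a, label_b, lam):
    """Interpolate word embeddings position by position, and the labels too."""
    dim = len(emb_a[0])
    length = max(len(emb_a), len(emb_b))
    a, b = pad(emb_a, length, dim), pad(emb_b, length, dim)
    mixed = [
        [lam * x + (1 - lam) * y for x, y in zip(wa, wb)]
        for wa, wb in zip(a, b)
    ]
    label = [lam * p + (1 - lam) * q for p, q in zip(label_a, label_b)]
    return mixed, label

emb_a = [[1.0, 0.0], [0.0, 1.0]]  # two words, 2-d toy vectors
emb_b = [[0.0, 2.0]]              # one word, padded to two
print(word_mixup(emb_a, emb_b, [1.0, 0.0], [0.0, 1.0], lam=0.5))
```

In mixup training the ratio is usually sampled per pair from a Beta distribution, for example `lam = random.betavariate(alpha, alpha)`, where `alpha` is a hyperparameter.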
(2) sentMixup
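sentMixup interpolates at the sentence-embedding level: each sentence is first encoded into a single vector, and the two sentence vectors and labels are then mixed. In the sketch below, mean pooling stands in for the sentence encoder; that is an assumption for illustration only, since a trained encoder network would normally produce the sentence vector.

```python
def mean_pool(word_vectors):
    """Stand-in sentence encoder: average the word vectors (an assumption)."""
    n = len(word_vectors)
    return [sum(col) / n for col in zip(*word_vectors)]

def sent_mixup(emb_a, emb_b, label_a, label_b, lam):
    """Encode both sentences, then interpolate sentence vectors and labels."""
    sa, sb = mean_pool(emb_a), mean_pool(emb_b)
    mixed = [lam * x + (1 - lam) * y for x, y in zip(sa, sb)]
    label = [lam * p + (1 - lam) * q for p, q in zip(label_a, label_b)]
    return mixed, label

sent_a = [[1.0, 0.0], [0.0, 1.0]]  # two words, 2-d toy vectors
sent_b = [[0.0, 2.0]]              # one word
print(sent_mixup(sent_a, sent_b, [1.0, 0.0], [0.0, 1.0], lam=0.25))
```

Because mixing happens after encoding, the two sentences do not need to be padded to the same length.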
8. Generative Methods
(1) Conditional Pre-trained Language Models
9. Implementation
(1) nlpaug
(2) textattack