[김성범] Data Augmentation
    NLP | 2022. 1. 21. 08:28

     

     

    [Reference video] https://www.youtube.com/watch?v=7fwxcCJsvPw&t=570s

     

    [14 min 02 sec]

    1. Thesaurus-based Substitution

    - Replace a random word with its synonym using a Thesaurus.

    - ex) It is awesome => synonyms of "awesome": amazing, awe-inspiring, awing => It is amazing.
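
    A minimal sketch of this idea in Python, assuming NLTK's WordNet as the thesaurus (the helper name synonym_replace and the example sentence are illustrative):

    import random
    from nltk.corpus import wordnet   # requires nltk.download('wordnet') beforehand

    def synonym_replace(sentence, n=1):
        # Replace up to n random words with one of their WordNet synonyms.
        words = sentence.split()
        candidates = [i for i, w in enumerate(words) if wordnet.synsets(w)]
        random.shuffle(candidates)
        for i in candidates[:n]:
            synonyms = {lemma.name().replace('_', ' ')
                        for syn in wordnet.synsets(words[i])
                        for lemma in syn.lemmas()} - {words[i]}
            if synonyms:
                words[i] = random.choice(sorted(synonyms))
        return ' '.join(words)

    print(synonym_replace("It is awesome"))   # e.g. "It is amazing"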

     

     

    2. Word Embedding-based Substitution

    - Use pre-trained word embeddings such as Word2Vec, GloVe, FastText, etc.

    - Find the nearest neighbor words and substitute.

    - ex) It is awesome => It is amazing.
                           It is perfect.
                           It is fantastic.

    - Related paper: TinyBERT: Distilling BERT for Natural Language Understanding
    Xiaoqi Jiao, Yichun Yin, Lifeng Shang, Xin Jiang, Xiao Chen, Linlin Li, Fang Wang, Qun Liu

    https://arxiv.org/abs/1909.10351

    - Related material: https://amitness.com/2020/05/data-augmentation-for-nlp/
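
    As a sketch, the nearest-neighbor lookup can be done with gensim's pre-trained vectors (the glove-wiki-gigaword-100 download and the helper name embedding_substitute are illustrative choices):

    import random
    import gensim.downloader as api

    glove = api.load("glove-wiki-gigaword-100")   # pre-trained GloVe word vectors

    def embedding_substitute(sentence, n=1):
        # Swap up to n words for their nearest neighbor in embedding space.
        words = sentence.split()
        candidates = [i for i, w in enumerate(words) if w.lower() in glove]
        for i in random.sample(candidates, min(n, len(candidates))):
            neighbor, _ = glove.most_similar(words[i].lower(), topn=1)[0]
            words[i] = neighbor
        return ' '.join(words)

    print(embedding_substitute("It is awesome"))   # e.g. "It is amazing"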

     

     

    3. Masked Language Model

    - Use pre-trained masked language models such as BERT, RoBERTa, and ALBERT.

    - Mask out some words & see what the model predicts.

    - ex) This is [mask] cool => BERT => This is pretty cool.
                                         This is really cool.
                                         This is super cool.
                                         This is kinda cool.
                                         This is very cool.

    - Related material for the example: https://amitness.com/2020/05/data-augmentation-for-nlp/

    - Related paper: BAE: BERT-based Adversarial Examples for Text Classification, https://arxiv.org/abs/2004.01970
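
    A minimal sketch with the Hugging Face fill-mask pipeline (bert-base-uncased is just one common checkpoint choice):

    from transformers import pipeline

    fill_mask = pipeline("fill-mask", model="bert-base-uncased")

    # Mask a word and keep the model's top suggestions as augmented sentences.
    for pred in fill_mask("This is [MASK] cool.", top_k=5):
        print(pred["sequence"])   # e.g. "this is pretty cool.", "this is really cool.", ...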

     

    4. TF-IDF-based Replacement

    - Words with low TF-IDF scores are uninformative.

    - Replacing those words will not affect the original semantic information.

    - ex) This virus has spread worldwide. => A virus has spread worldwide.

    - Related paper: Unsupervised Data Augmentation for Consistency Training, https://arxiv.org/abs/1904.12848
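
    A rough sketch of the idea with scikit-learn: fit IDF scores on a corpus, treat words with low scores (i.e. words common across documents) as uninformative, and swap them for other uninformative words. The toy corpus, the threshold, and the use of IDF instead of the paper's full TF-IDF weighting are simplifying assumptions:

    import random
    from sklearn.feature_extraction.text import TfidfVectorizer

    corpus = ["This virus has spread worldwide.",
              "This outbreak has caused concern worldwide.",
              "A vaccine has been developed."]

    vectorizer = TfidfVectorizer()
    vectorizer.fit(corpus)
    idf = dict(zip(vectorizer.get_feature_names_out(), vectorizer.idf_))

    def tfidf_replace(sentence, threshold=1.5):
        # Swap low-IDF (uninformative) words for other low-IDF words.
        uninformative = [w for w, score in idf.items() if score <= threshold]
        out = []
        for word in sentence.split():
            key = word.lower().strip('.,')
            if key in idf and idf[key] <= threshold and uninformative:
                out.append(random.choice(uninformative))
            else:
                out.append(word)
        return ' '.join(out)

    print(tfidf_replace("This virus has spread worldwide."))   # e.g. "worldwide virus this spread has"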

     

     

    5. Back Translation

    - Translate the sentence into another language.

    - Translate it back into the original language.

    - ex) (EN) This is very cool => (FR) C'est très cool => (EN) That's very cool

    - Related paper: Unsupervised Data Augmentation for Consistency Training, https://arxiv.org/abs/1904.12848
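
    A short sketch of back translation using two MarianMT models from the Hugging Face hub (the English-French checkpoint pair below is one possible choice):

    from transformers import pipeline

    en_to_fr = pipeline("translation", model="Helsinki-NLP/opus-mt-en-fr")
    fr_to_en = pipeline("translation", model="Helsinki-NLP/opus-mt-fr-en")

    def back_translate(sentence):
        # EN -> FR -> EN round trip produces a paraphrase of the input.
        french = en_to_fr(sentence)[0]["translation_text"]
        return fr_to_en(french)[0]["translation_text"]

    print(back_translate("This is very cool"))   # e.g. "That's very cool"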

     

     

    6. Word/Sent Mixup

    - Use Mixup on word embedding features.

    - Use Mixup on sentence embedding features.

    - ex) See the sketch below for sentence-level Mixup.

     

     

    - Related paper: Augmenting Data with Mixup for Sentence Classification: An Empirical Study, https://arxiv.org/abs/1905.08941
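
    A minimal NumPy sketch of sentence-level Mixup: two sentence embeddings and their one-hot labels are linearly interpolated with a coefficient drawn from a Beta distribution (the 300-dimensional random vectors and the alpha value are stand-ins for real features):

    import numpy as np

    def mixup(emb_a, emb_b, label_a, label_b, alpha=0.2):
        # Interpolate two embedding/label pairs with a Beta-distributed weight.
        lam = np.random.beta(alpha, alpha)            # mixing coefficient in [0, 1]
        emb = lam * emb_a + (1 - lam) * emb_b         # mixed sentence embedding
        label = lam * label_a + (1 - lam) * label_b   # mixed soft label
        return emb, label

    emb_a, emb_b = np.random.randn(300), np.random.randn(300)      # stand-in embeddings
    label_a, label_b = np.array([1.0, 0.0]), np.array([0.0, 1.0])  # one-hot class labels
    mixed_emb, mixed_label = mixup(emb_a, emb_b, label_a, label_b)
    print(mixed_label)   # e.g. [0.97 0.03]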

     

     

     
