[Paper] ELMER: A Non-Autoregressive Pre-trained Language Model for Efficient and Effective Text Generation

https://arxiv.org/abs/2210.13304

EMNLP 2022 was held in Abu Dhabi, overlapping a bit with the World Cup season, and among its papers there were a few interesting NLG ones, so I picked one of them up.

# Key takeaways

This is a non-autoregressive (NAR) generation study: compared to decoding a BART model autoregressively as usual, it reports similar performance on summarization tasks while being roughly 10x faster at inference.
That makes sense: instead of generating tokens one by one autoregressively, it produces the whole sentence in a single inference pass, so a roughly 10x speedup is the natural advantage.

    NAR ์ƒ์„ฑ ๋ฐฉ์‹์— ๋Œ€ํ•œ ๊ด€๋ จ ์—ฐ๊ตฌ๋กœ๋Š”, 
    ์ดˆ๊ธฐ์—” ๋‹จ์ˆœํžˆ ์ถ”๋ก  ์†๋„๋ฅผ ๊ฐœ์„ ํ•˜๊ณ ์ž ํ–ˆ๋˜๊ฒƒ์ด๋ผ์„œ single inference ๋Œ€์‹ , N ๋ฒˆ์˜ ์ถ”๋ก ์„ ํ†ตํ•ด, ๋งค ์Šคํ… ์ „์ฒด [mask] ํ† ํฐ ์ค‘์— ํ† ํฐ ๋ช‡๊ฐœ์”ฉ ์ƒ์„ฑ ํ•˜๋ฉด์„œ ์ ์ฐจ confidence๋ฅผ ๋†’์ด๋Š”์‹์œผ๋กœ, ๋งˆ์น˜ ๋ฌธ์žฅ์„ ์กฐ๊ธˆ์”ฉ ๋‹ค๋“ฌ์–ด ๋‚˜๊ฐ€๋Š”๋“ฏํ•œ ๋ฐฉ์‹์„ ์‚ฌ์šฉํ–ˆ์Šต๋‹ˆ๋‹ค.
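A minimal sketch of that iterative refinement idea (in the spirit of Mask-Predict), assuming a `model` callable that returns per-position logits; all names, shapes, and the keep-schedule are illustrative rather than any paper's exact procedure:

```python
import torch

def iterative_nar_decode(model, mask_id, length, n_iters=10):
    # Start from a sequence that is entirely [MASK].
    tokens = torch.full((length,), mask_id, dtype=torch.long)
    for step in range(1, n_iters + 1):
        # One parallel forward pass over all positions.
        probs = model(tokens).softmax(dim=-1)     # (length, vocab)
        conf, pred = probs.max(dim=-1)            # per-position confidence and argmax
        # Keep the most confident predictions (a few more each step)
        # and re-mask the rest so they can be refined in the next pass.
        n_keep = max(1, length * step // n_iters)
        keep = conf.topk(n_keep).indices
        tokens = torch.full((length,), mask_id, dtype=torch.long)
        tokens[keep] = pred[keep]
    return tokens
```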

    ์ €์ž๋Š” ๊ธฐ์กด NAR ์—ฐ๊ตฌ์—์„œ token dependency๊ฐ€ ๋‚ฎ์€๊ฒƒ์„ ๋ฌธ์ œ๋กœ ๋ณด์•˜์Šต๋‹ˆ๋‹ค.
    ๋ฌด์Šจ ๋ง์ด๋ƒ ํ•˜๋ฉด, ๊ฒฐ๊ตญ ํ•œ๋ฒˆ์— ํ† ํฐ์„ ์ขŒ์—์„œ ์šฐ๋กœ ์ˆœ์ฐจ์ ์œผ๋กœ ์ƒ์„ฑํ•˜๋Š” autoregressive ๋ฐฉ์‹ ๋Œ€๋น„, NAR์—์„  ํ† ํฐ๊ฐ„์˜ ์—ฐ๊ด€ ๊ด€๊ณ„๋ฅผ ๋ชจ๋ธ๋ง ํ•˜๊ธฐ๊ฐ€ ์–ด๋ ต๋‹ค๋Š” ๊ฒƒ์ž…๋‹ˆ๋‹ค. ๋ฌผ๋ก  ์ถ”๋ก ์„ ๋ช‡์ฐจ๋ก€ ํ•˜๋Š”(์ƒ๋Œ€์ ์œผ๋กœ autoregressive๋ณด๋‹ค๋Š” ์ ๊ฒŒ) ๋ฐฉ์‹์—์„œ๋Š” alignment ๋ฌธ์ œ๋กœ ํ’€์–ด๋‚ผ ์ˆ˜๋„ ์žˆ๊ฒ ์ง€๋งŒ, ๋ฌธ์žฅ์„ ๋ชจ๋ธ ์ถ”๋ก  ํ•œ๋ฒˆ์— ์ƒ์„ฑํ•˜๊ฒŒ ๋˜๋ฉด ์‰ฝ์ง€ ์•Š๊ฒ ์ฃ .

    ## ํฌ์ธํŠธ1.

    ๋”ฐ๋ผ์„œ ์ €์ž์˜ ์•„์ด๋””์–ด๋Š” (ํ˜„์žฌ ๋Œ€์„ธ์ธ transformer ๊ธฐ๋ฐ˜ ์–ธ์–ด๋ชจ๋ธ ๊ตฌ์กฐ์˜ ๋””์ฝ”๋”๋ฅผ ์‚ฌ์šฉํ•œ๋‹ค๋Š” ์ „์ œ) output ๋ฌธ์žฅ์„ ์ƒ์„ฑํ•˜๋Š” ๊ณผ์ •์—์„œ ๋ชจ๋“  ๋ ˆ์ด์–ด๋ฅผ ๊ฑฐ์นœ ๋‹ค์Œ ๋งˆ์ง€๋ง‰ ๋ ˆ์ด์–ด์—์„œ ํ† ํฐ์„ ๊ฒฐ์ •ํ•˜๋Š” ๋Œ€์‹ , ๊ฐ ๋ ˆ์ด์–ด์—์„œ ๋จผ์ € ํ† ํฐ์„ ์ƒ์„ฑํ•˜๋ฉด, ํ•ด๋‹น position์˜ hiddenstate๋Š” ๋‹ค์Œ ๋ ˆ์ด์–ด์—์„œ ์—ฐ์‚ฐ์„ ํ•˜์ง€ ์•Š๊ณ  copyํ•ด์„œ ๋‚ด๋ ค์ฃผ๋Š”๊ฒƒ์ž…๋‹ˆ๋‹ค. ์ด๋Ÿฐ ๋™์ž‘ ํ…Œํฌ๋‹‰์„ early exit์ด๋ผ๊ณ  ํ•ฉ๋‹ˆ๋‹ค. ๊ฐ ํ† ํฐ์„ ๋””์ฝ”๋” ๋ ˆ์ด์–ด์˜ ์ค‘๊ฐ„ ์ค‘๊ฐ„์— ์˜ˆ์ธก ํ•˜๊ฒ ๋‹ค๋Š”๊ฒƒ์ž…๋‹ˆ๋‹ค.
    ์ด๋Ÿฐ ์ƒ์„ฑ ๊ตฌ์กฐํ•˜์—๋Š” ๋จผ์ € ์ƒ์„ฑ๋œ ํ† ํฐ์ด ์•ž๋’ค์˜ ํ† ํฐ ์˜ˆ์ธก์— dependency๋ฅผ ์ฃผ๊ฒŒ ๋˜๋Š”๊ฒƒ์ด๊ณ ์š”.
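To make Point 1 concrete, here is a minimal sketch of the early-exit forward pass, assuming a list of decoder `layers` and a `classifier` standing in for the $W_{c}$ softmax; the simple confidence-threshold exit rule and all names are my own illustration, not the paper's exact procedure:

```python
import torch

def early_exit_forward(layers, classifier, hidden, threshold=0.9):
    seq_len = hidden.size(0)
    exited = torch.zeros(seq_len, dtype=torch.bool)        # positions that already emitted a token
    tokens = torch.full((seq_len,), -1, dtype=torch.long)
    for layer in layers:
        new_hidden = layer(hidden)
        # Positions that exited earlier are not recomputed: their hidden state
        # is simply copied through to the next layer unchanged.
        hidden = torch.where(exited.unsqueeze(-1), hidden, new_hidden)
        probs = classifier(hidden).softmax(dim=-1)          # (seq_len, vocab)
        conf, pred = probs.max(dim=-1)
        exit_now = (~exited) & (conf > threshold)            # exit at this intermediate layer
        tokens[exit_now] = pred[exit_now]
        exited |= exit_now
    # Any position that never exited is decided at the final layer as usual.
    tokens[~exited] = classifier(hidden).argmax(dim=-1)[~exited]
    return tokens
```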

    ## ํฌ์ธํŠธ2.

On top of that, the layer at which a token early-exits changes how much dependency it can pass on to its neighbors, so the authors also propose a new LM pretraining objective called LPLM. The strategy is to permute each token's exit layer so that the model gets to see these dependency patterns more broadly.
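A rough sketch of the LPLM sampling step as I understand it: during pretraining, every target token is assigned an exit layer drawn from a shuffled permutation, so the model is trained under many different exit configurations. The names, and the way sequences longer than the layer count are handled, are assumptions of this sketch:

```python
import random

def sample_exit_layers(seq_len, n_layers):
    # One exit layer per target token, obtained by shuffling the layer indices;
    # if the sequence is longer than the layer count, draw fresh permutations.
    exits = []
    while len(exits) < seq_len:
        perm = list(range(1, n_layers + 1))
        random.shuffle(perm)
        exits.extend(perm)
    return exits[:seq_len]

# Example: an 8-token target with a 12-layer decoder
# sample_exit_layers(8, 12) -> [7, 2, 11, 5, 9, 1, 12, 4]   (one possible draw)
```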

    ์œ„์˜ ๋‹ค์ด์–ด๊ทธ๋žจ์€ ์ „์ฒด ๋ชจ๋ธ ๊ตฌ์กฐ์ธ๋ฐ, ๋””์ฝ”๋” ์ž…๋ ฅ์œผ๋กœ๋Š” [MASK] ์˜ ์‹œํ€€์Šค๋ฅผ ์ž…๋ ฅ ๋ฐ›๊ณ , ๊ฐ ๋ ˆ์ด์–ด์—์„œ early exit์œผ๋กœ softmax๋ฅผ ์‚ฌ์šฉํ•ฉ๋‹ˆ๋‹ค. Layer level exit ์— ์‚ฌ์šฉ๋˜๋Š” softmax ๋ ˆ์ด์–ด์—์„œ ์‚ฌ์šฉ๋˜๋Š” $W_{c}$ ํŒŒ๋ผ๋ฏธํ„ฐ๋Š” ๋ ˆ์ด์–ด๋งˆ๋‹ค ๋‘˜ ์ˆ˜ ๋„ ์žˆ๊ณ , ๊ณตํ†ต ํŒŒ๋ผ๋ฏธํ„ฐ๋กœ๋„ ์‚ฌ์šฉ ๊ฐ€๋Šฅํ•ฉ๋‹ˆ๋‹ค.

    ๋ ˆ์ด์–ด permutation์„ ํ†ตํ•ด LPLM(Layer Permutation Language Modeling) ํ•™์Šต์„ ํ•˜๋Š” ๋ฐฉ์‹์— ๋Œ€ํ•œ๊ฒƒ๋„ ๋‹ค์ด์–ด๊ทธ๋žจ์„ ํ†ตํ•ด ์ดํ•ดํ•  ์ˆ˜ ์žˆ๊ฒ ๋„ค์š”.

# Summary

Advantages

1. Faster inference: a sentence is generated in a single model pass.
2. Parallelizable: because generation is non-autoregressive, it can be processed in parallel in batches.
3. Simple implementation: only layer early exit needs to be added to an existing language-model implementation, so extending beyond BART to other models is easy.
4. No length prediction needed: existing NAR implementations required a separate length predictor, whereas ELMER can generate arbitrary lengths via the [EOS] token (within the limits of the model's decoder size, of course); see the small helper below.
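As mentioned in item 4, a tiny illustrative helper (hypothetical name) showing how a fixed-width NAR output is simply cut at the first [EOS], so no separate length predictor is needed:

```python
def truncate_at_eos(token_ids, eos_id):
    # Everything after the first [EOS] in the fixed-width output is discarded.
    return token_ids[:token_ids.index(eos_id)] if eos_id in token_ids else token_ids
```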

Disadvantages

1. Still at a disadvantage for generating long texts.
2. When fine-tuning, the early-exit strategy has to be designed separately:
 - Pretraining generalizes through layer permutation, but the authors note this may not carry over when training on downstream tasks.

     

    # ๊ฐœ์ธ์ ์ธ ๋Š๋‚Œ

It bothers me a bit that the evaluation is shown only as single metric scores, without any generated sentences. Usually in generation papers, when the results are judged to be really strong, the authors show actual generated sentences for comparison, cherry-picked or not, but here the evaluation is done only with BLEU and ROUGE scores.
My worry is whether grammatically odd sentences might slip through, since both BLEU and ROUGE evaluate with n-grams regardless of grammaticality.
Then again, the pretraining itself is LM training, so perhaps completely broken sentences won't appear.
Still, this is work whose advantages far outweigh its drawbacks, so I'd like to check the generated sentences myself at some point.

     

The overall approach is a bit different, but the core idea is early exit, which makes it similar to the work below as well; seeing similar lines of research appear in different parts of the world at around the same time makes me think that research, and human thinking in general, really does run in parallel.

Accelerating Text Generation with Confident Adaptive Language Modeling (CALM)

https://ai.googleblog.com/2022/12/accelerating-text-generation-with.html?m=1

To summarize briefly: CALM still decodes autoregressively, but instead of running every layer for every token, it computes only as many layers as needed and then early-exits, boosting decoding speed.
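A minimal sketch of that idea under my own simplifying assumptions (a plain max-probability confidence threshold, hypothetical `layers`/`classifier`/`embed` callables, and no KV cache, which a real implementation would need for the speedup on the prefix):

```python
import torch

def confident_adaptive_generate(layers, classifier, embed, bos_id, eos_id,
                                max_len=50, threshold=0.9):
    tokens = [bos_id]
    for _ in range(max_len):
        hidden = embed(torch.tensor(tokens))            # hidden states for the current prefix
        for layer in layers:
            hidden = layer(hidden)
            probs = classifier(hidden[-1]).softmax(dim=-1)
            conf, pred = probs.max(dim=-1)
            if conf.item() > threshold:                 # confident enough: skip the remaining layers
                break
        tokens.append(int(pred))
        if tokens[-1] == eos_id:
            break
    return tokens
```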

    ๋ฐ˜์‘ํ˜•

    ๋Œ“๊ธ€

Designed by naubull2.