
[Paper] ELMER: A Non-Autoregressive Pre-trained Language Model for Efficient and Effective Text Generation

naubull2 2023. 1. 5. 09:08
๋ฐ˜์‘ํ˜•

🕓 3 mins read

https://arxiv.org/abs/2210.13304

 


Coinciding a bit with the World Cup season, EMNLP 2022 was held in Abu Dhabi, and among its papers there were several interesting NLG works, so I picked one to share.

# Key Takeaways

This is a non-autoregressive (NAR) generation paper: compared to plain autoregressive decoding with a BART model, it reports similar performance on summarization tasks while being roughly 10x faster at inference.
The speedup is to be expected, since the whole sentence is produced in a single forward pass instead of token by token as in autoregressive decoding.

NAR ์ƒ์„ฑ ๋ฐฉ์‹์— ๋Œ€ํ•œ ๊ด€๋ จ ์—ฐ๊ตฌ๋กœ๋Š”, 
์ดˆ๊ธฐ์—” ๋‹จ์ˆœํžˆ ์ถ”๋ก  ์†๋„๋ฅผ ๊ฐœ์„ ํ•˜๊ณ ์ž ํ–ˆ๋˜๊ฒƒ์ด๋ผ์„œ single inference ๋Œ€์‹ , N ๋ฒˆ์˜ ์ถ”๋ก ์„ ํ†ตํ•ด, ๋งค ์Šคํ… ์ „์ฒด [mask] ํ† ํฐ ์ค‘์— ํ† ํฐ ๋ช‡๊ฐœ์”ฉ ์ƒ์„ฑ ํ•˜๋ฉด์„œ ์ ์ฐจ confidence๋ฅผ ๋†’์ด๋Š”์‹์œผ๋กœ, ๋งˆ์น˜ ๋ฌธ์žฅ์„ ์กฐ๊ธˆ์”ฉ ๋‹ค๋“ฌ์–ด ๋‚˜๊ฐ€๋Š”๋“ฏํ•œ ๋ฐฉ์‹์„ ์‚ฌ์šฉํ–ˆ์Šต๋‹ˆ๋‹ค.
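
A minimal sketch of that iterative-refinement style of NAR decoding (in the spirit of Mask-Predict; `model`, its call signature, and the re-masking schedule here are illustrative assumptions, not ELMER's code):

```python
import torch

def mask_predict_decode(model, src, tgt_len, mask_id, num_iters=4):
    """Iterative NAR refinement: start from an all-[MASK] target and, at each
    pass, keep the most confident predictions while re-masking the rest."""
    tokens = torch.full((1, tgt_len), mask_id, dtype=torch.long)
    scores = torch.zeros(1, tgt_len)
    for t in range(num_iters):
        logits = model(src, tokens)                  # hypothetical: (1, T, V)
        conf, pred = logits.softmax(dim=-1).max(dim=-1)
        masked = tokens.eq(mask_id)
        tokens = torch.where(masked, pred, tokens)   # fill masked slots only
        scores = torch.where(masked, conf, scores)
        n_mask = int(tgt_len * (1.0 - (t + 1) / num_iters))
        if n_mask == 0:
            break
        worst = scores[0].topk(n_mask, largest=False).indices
        tokens[0, worst] = mask_id                   # re-mask least confident
        scores[0, worst] = 0.0
    return tokens
```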

์ €์ž๋Š” ๊ธฐ์กด NAR ์—ฐ๊ตฌ์—์„œ token dependency๊ฐ€ ๋‚ฎ์€๊ฒƒ์„ ๋ฌธ์ œ๋กœ ๋ณด์•˜์Šต๋‹ˆ๋‹ค.
๋ฌด์Šจ ๋ง์ด๋ƒ ํ•˜๋ฉด, ๊ฒฐ๊ตญ ํ•œ๋ฒˆ์— ํ† ํฐ์„ ์ขŒ์—์„œ ์šฐ๋กœ ์ˆœ์ฐจ์ ์œผ๋กœ ์ƒ์„ฑํ•˜๋Š” autoregressive ๋ฐฉ์‹ ๋Œ€๋น„, NAR์—์„  ํ† ํฐ๊ฐ„์˜ ์—ฐ๊ด€ ๊ด€๊ณ„๋ฅผ ๋ชจ๋ธ๋ง ํ•˜๊ธฐ๊ฐ€ ์–ด๋ ต๋‹ค๋Š” ๊ฒƒ์ž…๋‹ˆ๋‹ค. ๋ฌผ๋ก  ์ถ”๋ก ์„ ๋ช‡์ฐจ๋ก€ ํ•˜๋Š”(์ƒ๋Œ€์ ์œผ๋กœ autoregressive๋ณด๋‹ค๋Š” ์ ๊ฒŒ) ๋ฐฉ์‹์—์„œ๋Š” alignment ๋ฌธ์ œ๋กœ ํ’€์–ด๋‚ผ ์ˆ˜๋„ ์žˆ๊ฒ ์ง€๋งŒ, ๋ฌธ์žฅ์„ ๋ชจ๋ธ ์ถ”๋ก  ํ•œ๋ฒˆ์— ์ƒ์„ฑํ•˜๊ฒŒ ๋˜๋ฉด ์‰ฝ์ง€ ์•Š๊ฒ ์ฃ .

## Point 1

๋”ฐ๋ผ์„œ ์ €์ž์˜ ์•„์ด๋””์–ด๋Š” (ํ˜„์žฌ ๋Œ€์„ธ์ธ transformer ๊ธฐ๋ฐ˜ ์–ธ์–ด๋ชจ๋ธ ๊ตฌ์กฐ์˜ ๋””์ฝ”๋”๋ฅผ ์‚ฌ์šฉํ•œ๋‹ค๋Š” ์ „์ œ) output ๋ฌธ์žฅ์„ ์ƒ์„ฑํ•˜๋Š” ๊ณผ์ •์—์„œ ๋ชจ๋“  ๋ ˆ์ด์–ด๋ฅผ ๊ฑฐ์นœ ๋‹ค์Œ ๋งˆ์ง€๋ง‰ ๋ ˆ์ด์–ด์—์„œ ํ† ํฐ์„ ๊ฒฐ์ •ํ•˜๋Š” ๋Œ€์‹ , ๊ฐ ๋ ˆ์ด์–ด์—์„œ ๋จผ์ € ํ† ํฐ์„ ์ƒ์„ฑํ•˜๋ฉด, ํ•ด๋‹น position์˜ hiddenstate๋Š” ๋‹ค์Œ ๋ ˆ์ด์–ด์—์„œ ์—ฐ์‚ฐ์„ ํ•˜์ง€ ์•Š๊ณ  copyํ•ด์„œ ๋‚ด๋ ค์ฃผ๋Š”๊ฒƒ์ž…๋‹ˆ๋‹ค. ์ด๋Ÿฐ ๋™์ž‘ ํ…Œํฌ๋‹‰์„ early exit์ด๋ผ๊ณ  ํ•ฉ๋‹ˆ๋‹ค. ๊ฐ ํ† ํฐ์„ ๋””์ฝ”๋” ๋ ˆ์ด์–ด์˜ ์ค‘๊ฐ„ ์ค‘๊ฐ„์— ์˜ˆ์ธก ํ•˜๊ฒ ๋‹ค๋Š”๊ฒƒ์ž…๋‹ˆ๋‹ค.
์ด๋Ÿฐ ์ƒ์„ฑ ๊ตฌ์กฐํ•˜์—๋Š” ๋จผ์ € ์ƒ์„ฑ๋œ ํ† ํฐ์ด ์•ž๋’ค์˜ ํ† ํฐ ์˜ˆ์ธก์— dependency๋ฅผ ์ฃผ๊ฒŒ ๋˜๋Š”๊ฒƒ์ด๊ณ ์š”.

## Point 2

There is, however, the issue that how much dependency a token can exert on its neighbors depends on which layer it exits at. The authors therefore also propose a new LM pre-training objective, Layer Permutation Language Modeling (LPLM): by permuting each token's exit layer during pre-training, the model is exposed to a much wider range of dependency patterns.

์œ„์˜ ๋‹ค์ด์–ด๊ทธ๋žจ์€ ์ „์ฒด ๋ชจ๋ธ ๊ตฌ์กฐ์ธ๋ฐ, ๋””์ฝ”๋” ์ž…๋ ฅ์œผ๋กœ๋Š” [MASK] ์˜ ์‹œํ€€์Šค๋ฅผ ์ž…๋ ฅ ๋ฐ›๊ณ , ๊ฐ ๋ ˆ์ด์–ด์—์„œ early exit์œผ๋กœ softmax๋ฅผ ์‚ฌ์šฉํ•ฉ๋‹ˆ๋‹ค. Layer level exit ์— ์‚ฌ์šฉ๋˜๋Š” softmax ๋ ˆ์ด์–ด์—์„œ ์‚ฌ์šฉ๋˜๋Š” $W_{c}$ ํŒŒ๋ผ๋ฏธํ„ฐ๋Š” ๋ ˆ์ด์–ด๋งˆ๋‹ค ๋‘˜ ์ˆ˜ ๋„ ์žˆ๊ณ , ๊ณตํ†ต ํŒŒ๋ผ๋ฏธํ„ฐ๋กœ๋„ ์‚ฌ์šฉ ๊ฐ€๋Šฅํ•ฉ๋‹ˆ๋‹ค.

๋ ˆ์ด์–ด permutation์„ ํ†ตํ•ด LPLM(Layer Permutation Language Modeling) ํ•™์Šต์„ ํ•˜๋Š” ๋ฐฉ์‹์— ๋Œ€ํ•œ๊ฒƒ๋„ ๋‹ค์ด์–ด๊ทธ๋žจ์„ ํ†ตํ•ด ์ดํ•ดํ•  ์ˆ˜ ์žˆ๊ฒ ๋„ค์š”.

# Summary

Pros

1. Improved inference speed: the whole sentence is generated in a single model forward pass.
2. Parallelizability: because generation is non-autoregressive, all positions can be processed in parallel within a batch.
3. Simple implementation: only layer early exit needs to be added to an existing language model implementation, so it extends easily from BART to other models.
4. No length prediction needed: earlier NAR approaches required a separate length predictor, whereas ELMER generates sequences of arbitrary length by emitting the [EOS] token (bounded by the decoder width, of course); see the sketch after this list.
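
A tiny sketch of that [EOS]-based length handling under my reading: generate a full-width sequence in one pass, then cut each sample at its first [EOS].

```python
import torch

def truncate_at_eos(token_ids, eos_id):
    """token_ids: (B, T_max) tokens from a single NAR pass; return each
    sample truncated at its first [EOS] instead of using a length predictor."""
    outputs = []
    for seq in token_ids.tolist():
        cut = seq.index(eos_id) if eos_id in seq else len(seq)
        outputs.append(seq[:cut])
    return outputs

# toy usage: eos_id = 2, decoder width 8
print(truncate_at_eos(torch.tensor([[5, 9, 4, 2, 0, 0, 0, 0]]), eos_id=2))
# -> [[5, 9, 4]]
```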

Cons

1. Still at a disadvantage when generating long text.
2. The early-exit strategy for fine-tuning has to be designed separately:
 - Pre-training generalizes across exit layers via layer permutation, but the authors note this may not carry over when training on downstream tasks.

 

# ๊ฐœ์ธ์ ์ธ ๋А๋‚Œ

It bothers me a bit that the evaluation is reported only as single metric scores, with no generated sentences. Usually, when a generation paper's results really are strong, the authors show actual generated outputs for comparison, cherry-picked or not; here the evaluation is only BLEU and ROUGE scores.
My worry is that grammatically odd sentences might slip through, since both BLEU and ROUGE score n-gram overlap regardless of grammaticality.
Then again, the pre-training itself is language modeling, so perhaps completely broken sentences are unlikely.
Either way, the advantages far outweigh the drawbacks here, so I would like to inspect the generated sentences myself at some point.

 

The overall approach is somewhat different, but the core idea of early exit also resembles the work below. Seeing similar lines of research emerge around the world at roughly the same time makes me think that research, and human thinking, really do converge.

Accelerating Text Generation with Confident Adaptive Language Modeling (CALM)

https://ai.googleblog.com/2022/12/accelerating-text-generation-with.html?m=1 

 


To summarize briefly: CALM still decodes autoregressively, but instead of running every layer for every token, it exits early once the intermediate computation looks confident enough, boosting decoding speed.
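
A rough sketch of the CALM idea under the same assumed interfaces as before (not CALM's actual implementation, which calibrates its exit criteria much more carefully):

```python
import torch

def calm_style_step(decoder_layers, exit_head, hidden, threshold=0.95):
    """One autoregressive decoding step with adaptive depth (batch size 1):
    `hidden` is the current position's state (1, 1, D); stop stacking layers
    as soon as an intermediate prediction is confident enough."""
    for depth, layer in enumerate(decoder_layers, start=1):
        hidden = layer(hidden)
        conf, pred = exit_head(hidden).softmax(dim=-1).max(dim=-1)
        if conf.item() >= threshold:      # confident: skip the remaining layers
            return pred, depth
    return pred, depth                    # fell through to the top layer
```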

๋ฐ˜์‘ํ˜•