Fuzhao Xue

National University of Singapore

Sep 27, 2023

This blog is a detailed discussion of Appendix F of our Token-Crisis Paper (NeurIPS 2023):

To Repeat or Not To Repeat: Insights from Scaling LLM under Token-Crisis [arxiv]

Background

First, I have to clarify that the open-sourced Encoder-Decoder (En-De) language models (e.g., T5) differ from the open-sourced Decoder-only (De-only) models (e.g., LLaMA) in many respects:

(Encoder_Decoder vs Decoder_only) != (T5 vs LLaMA)

To be more specific:

class T5:
    def __init__(self):
        self.architecture = Encoder_Decoder()
        self.architecture.encoder_attention_type = "bidirectional"
        self.architecture.decoder_attention_type = "unidirectional"
        self.position_embedding = "Relative_Positional_Embedding"
        self.pretraining_data = "C4"
        self.pretraining_objective = "span_corruption"
        # And also many minor modifications

class LLaMA:
    def __init__(self):
        self.architecture = DecoderOnly()
        self.architecture.decoder_attention_type = "unidirectional"
        self.position_embedding = "RoPE"
        self.pretraining_data = "mixture_of_many_different_datasets"
        self.pretraining_objective = "next_token_prediction"
        # And also many minor modifications

So, the performance difference between the open-sourced En-De and De-only models is the combined result of all these factors. In this blog, we focus on the transformer architecture itself (En-De vs De-only).

How to fairly compare encoder-decoder and decoder-only models?

The UL2 paper suggests that comparing decoder-only models with encoder-decoder models is not straightforward (see UL2 Section 4.3.1). The reason is that when an encoder-decoder (En-De) model is computation-matched with a decoder-only architecture, the En-De model always has around 2 times the trainable parameters. To put it simply (if imprecisely), En-De uses different trainable parameters for input tokens and output tokens, while the decoder-only model uses the same trainable parameters for both. Therefore, by default, we can only compare them under two conditions: (1) comparable computational complexity (FLOPs), in which case En-De has 2 times the parameters, or (2) a comparable number of parameters, in which case the decoder-only model has 2 times the FLOPs.
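To make this trade-off concrete, below is a rough back-of-the-envelope sketch (the formulas and numbers are my own illustration, not from the paper): per-layer parameters scale as roughly 12 * d_model^2, and forward FLOPs per token are roughly 2x the parameters that token passes through. Because input tokens only go through the encoder and output tokens only go through the decoder, a FLOPs-matched En-De model ends up with about twice the trainable parameters.

def dense_layer_params(d_model: int) -> int:
    # Approximate parameters of one Transformer layer (attention + FFN).
    attention = 4 * d_model * d_model  # Q, K, V, and output projections
    ffn = 8 * d_model * d_model        # d -> 4d -> d feed-forward
    return attention + ffn

def decoder_only(d_model, n_layers, n_tokens):
    params = n_layers * dense_layer_params(d_model)
    flops = 2 * params * n_tokens  # every token passes through the single stack
    return params, flops

def encoder_decoder(d_model, n_layers, n_input, n_output):
    enc = n_layers * dense_layer_params(d_model)  # encoder stack
    dec = n_layers * dense_layer_params(d_model)  # decoder stack (cross-attention ignored)
    # Input tokens only pass through the encoder; output tokens only through the decoder.
    flops = 2 * enc * n_input + 2 * dec * n_output
    return enc + dec, flops

d, L, n_in, n_out = 1024, 24, 512, 512
p_dec, f_dec = decoder_only(d, L, n_in + n_out)
p_ende, f_ende = encoder_decoder(d, L, n_in, n_out)
print(f"De-only: {p_dec / 1e6:.0f}M params, {f_dec / 1e9:.0f} GFLOPs")
print(f"En-De:   {p_ende / 1e6:.0f}M params, {f_ende / 1e9:.0f} GFLOPs")
# -> comparable FLOPs, but the En-De model has roughly 2x the trainable parameters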

To address this issue, we utilized Mixture-of-Experts (MoE) to ensure that the De-only model matches the En-De model in terms of both FLOPs and the number of parameters. (Note: MoE increases the number of parameters while keeping FLOPs comparable to the dense model. For more information, refer to the Switch Transformer paper.)
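For intuition, here is a minimal top-1 (Switch-Transformer-style) MoE feed-forward layer in PyTorch. This is only a sketch with my own module and variable names, not the implementation used in our experiments, and the auxiliary load-balancing loss is omitted for brevity. The point is that with E experts the FFN parameters grow roughly E times, while each token is still processed by exactly one expert, so per-token FLOPs stay close to the dense FFN.

import torch
import torch.nn as nn
import torch.nn.functional as F

class SwitchFFN(nn.Module):
    def __init__(self, d_model: int, d_ff: int, n_experts: int):
        super().__init__()
        self.router = nn.Linear(d_model, n_experts)  # routing adds negligible FLOPs
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model))
            for _ in range(n_experts)
        )

    def forward(self, x):                          # x: (n_tokens, d_model)
        probs = F.softmax(self.router(x), dim=-1)  # (n_tokens, n_experts)
        expert_idx = probs.argmax(dim=-1)          # top-1 routing: one expert per token
        out = torch.zeros_like(x)
        for e, expert in enumerate(self.experts):
            mask = expert_idx == e
            if mask.any():
                # Scale by the router probability so the router receives gradients.
                out[mask] = expert(x[mask]) * probs[mask, e].unsqueeze(-1)
        return out

With, say, 8 experts, the FFN parameter count grows about 8x while each token still runs through a single expert, which is what lets a De-only MoE model close the parameter gap with a dense En-De model without increasing FLOPs.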

Results