Fuzhao Xue

National University of Singapore

May 19, 2023

Other versions:

Paper: A Study on Transformer Configuration and Training Objective [arxiv]

TL;DR for Practitioners

The token-level training objective scales better with model depth, while the sequence-level training objective performs worse on deeper transformers because of the over-smoothing issue.
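To make the token-level vs. sequence-level distinction concrete, here is a minimal PyTorch-style sketch (our own illustration, not the paper's exact formulation). It assumes hidden_states are the encoder's per-token outputs, lm_head and cls_head are linear heads supplied by the caller, and mean pooling stands in for whatever sequence-level aggregation is used.

```python
import torch.nn.functional as F

def token_level_loss(hidden_states, lm_head, targets, mask):
    # Token-level objective: every masked position has its own target,
    # which pushes the encoder to keep token representations distinct
    # even in very deep models (counteracting over-smoothing).
    logits = lm_head(hidden_states)            # (B, T, vocab_size)
    return F.cross_entropy(logits[mask], targets[mask])

def sequence_level_loss(hidden_states, cls_head, labels):
    # Sequence-level objective: token states are pooled into a single
    # vector before the loss, so nothing discourages deep layers from
    # making all token representations similar (over-smoothing).
    pooled = hidden_states.mean(dim=1)         # (B, d_model)
    return F.cross_entropy(cls_head(pooled), labels)
```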

In addition, do not underestimate a training objective before tuning the model configuration. For a fair comparison, we usually just adopt the configuration used in previous work. However, a training objective may only look decent because it wins the “configuration lottery”, i.e., the previously used configuration happens to match the new training objective well. For a different objective, the effectiveness would be underestimated without a configuration sweep, and the community may then miss a good training objective. Therefore, to understand the potential of each novel training objective, we strongly suggest that practitioners analyze its inductive bias and customize the configuration accordingly.

What is a training objective?

Figure: Vanilla Classification

Figure: Masked Autoencoder

Figure: Next Token Prediction

The training objective refers to the goal the model is optimized for during training. Several different training objectives can be used; we list three here: vanilla classification, masked autoencoder, and next token prediction. Vanilla classification is a form of supervised learning in which the model is trained to correctly classify inputs into different categories. Masked autoencoder is an unsupervised learning technique in which the model is trained to reconstruct a partially masked input. Next token prediction is also an unsupervised learning technique, in which the model is trained to predict the next token in a sequence.
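The sketch below is PyTorch-flavored pseudocode of our own; the encoder, the projection heads, the masking ratio, and the pooling are illustrative choices rather than the paper's exact setup. It only shows how a loss could be computed under each of the three objectives.

```python
import torch
import torch.nn.functional as F

def vanilla_classification(encoder, cls_head, tokens, labels):
    # Supervised: map the whole input to one of C categories.
    states = encoder(tokens)                               # (B, T, d)
    return F.cross_entropy(cls_head(states.mean(dim=1)), labels)

def masked_autoencoder(encoder, recon_head, tokens, mask_token_id, mask_prob=0.15):
    # Unsupervised: hide a random subset of positions (ratio is
    # illustrative), then reconstruct the original tokens there.
    mask = torch.rand(tokens.shape, device=tokens.device) < mask_prob
    corrupted = tokens.masked_fill(mask, mask_token_id)
    logits = recon_head(encoder(corrupted))                # (B, T, vocab_size)
    return F.cross_entropy(logits[mask], tokens[mask])

def next_token_prediction(encoder, lm_head, tokens):
    # Unsupervised: predict token t+1 from the tokens up to t
    # (assumes the encoder applies a causal attention mask).
    logits = lm_head(encoder(tokens))[:, :-1]              # (B, T-1, vocab_size)
    targets = tokens[:, 1:]
    return F.cross_entropy(logits.reshape(-1, logits.size(-1)), targets.reshape(-1))
```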

What is a Transformer Configuration?


Transformer configuration refers to the hyperparameters that define the structure and size of the transformer model, such as the number of layers, the dimension of the hidden states, and the number of attention heads.
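As a small, hypothetical illustration, such a configuration can be written down as a dataclass; the field names are our own and the default values only echo a common base-size setup, not values from the paper.

```python
from dataclasses import dataclass

@dataclass
class TransformerConfig:
    num_layers: int = 12    # depth: number of transformer blocks
    hidden_dim: int = 768   # dimension of the hidden states
    num_heads: int = 12     # attention heads per layer
    ffn_dim: int = 3072     # width of the feed-forward sub-layer

# A deeper-and-narrower variant, the kind of alternative one might
# include in a configuration sweep.
deep_narrow = TransformerConfig(num_layers=24, hidden_dim=512, num_heads=8, ffn_dim=2048)
```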

Where are the configurations from?