National University of Singapore
Email: f.xue at u.nus.edu
Twitter: @XueFz
Joint work with: Zian Zheng, Yao Fu, Jinjie Ni, Zangwei Zheng, Wangchunshu Zhou, Yang You
Mar 27, 2024
GitHub: XueFuzhao/OpenMoE — A family of open-sourced Mixture-of-Experts (MoE) Large Language Models
Mixture-of-Experts (MoE)-based LLMs are known for their cost-effectiveness compared to dense LLMs. However, there is still a knowledge gap in understanding how MoE LLMs actually function. At a high level, MoE is typically viewed as an effective way to increase the parameter count within a fixed computation budget. The key difference between dense and MoE LLMs is that an MoE model uses a router to selectively and adaptively assign each token to a few experts (i.e., feed-forward network layers). However, how this routing behaves in practice is not well understood. In this blog, we take a closer look at MoE LLM routing by examining our OpenMoE, Mixtral 8x7B, and DeepSeek checkpoints, aiming to answer the following questions:
We introduce these findings one by one in the following sections.
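To make the routing mechanism concrete before diving into the results, here is a minimal sketch of a top-k MoE layer in PyTorch. It is not taken from the OpenMoE codebase; the class name `TopKMoE` and its hyperparameters are illustrative only. A linear router scores each token, the top-k experts are selected, and their outputs are combined using the normalized router weights.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TopKMoE(nn.Module):
    """Minimal top-k MoE layer: a linear router scores each token and
    dispatches it to its k highest-scoring expert FFNs (illustrative sketch)."""

    def __init__(self, d_model: int, d_ff: int, num_experts: int = 8, k: int = 2):
        super().__init__()
        self.k = k
        self.router = nn.Linear(d_model, num_experts, bias=False)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
            for _ in range(num_experts)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (num_tokens, d_model); real implementations flatten (batch, seq) first.
        logits = self.router(x)                           # (num_tokens, num_experts)
        weights, expert_ids = logits.topk(self.k, dim=-1) # per-token top-k experts
        weights = F.softmax(weights, dim=-1)              # normalize over chosen experts
        out = torch.zeros_like(x)
        for slot in range(self.k):
            for e, expert in enumerate(self.experts):
                mask = expert_ids[:, slot] == e           # tokens sent to expert e in this slot
                if mask.any():
                    out[mask] += weights[mask, slot, None] * expert(x[mask])
        return out

# Example: route 16 tokens of width 64 through 8 experts, top-2 per token.
tokens = torch.randn(16, 64)
layer = TopKMoE(d_model=64, d_ff=256, num_experts=8, k=2)
print(layer(tokens).shape)  # torch.Size([16, 64])
```

Different models differ in details (e.g., whether the softmax is applied before or after the top-k selection), but the routing decisions we analyze below are exactly these per-token expert indices.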
We use a subset of RedPajama (covering different general domains), a subset of The Stack (different coding languages), a subset of the Multilingual TED corpus (different natural languages), and MT-Bench (different real-world tasks) to investigate MoE LLM routing behaviour.
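As a rough illustration of how such routing statistics can be collected, the sketch below counts how often each expert appears among the top-k choices for tokens from one dataset subset. The helper `get_router_logits` is an assumption, standing in for whatever hook or flag exposes the per-layer router logits of a given checkpoint (Hugging Face's Mixtral implementation, for instance, can return them when `output_router_logits=True` is passed to the forward call).

```python
import torch
from collections import Counter

def expert_usage(router_logits: torch.Tensor, k: int = 2) -> Counter:
    """Count how many times each expert is selected in the top-k choices.

    router_logits: (num_tokens, num_experts) scores from one MoE layer.
    """
    top_k = router_logits.topk(k, dim=-1).indices   # (num_tokens, k) expert ids
    return Counter(top_k.flatten().tolist())

# Hypothetical usage: compare expert usage between two domains at one layer.
# `get_router_logits` is an assumed helper returning one tensor per MoE layer.
# logits_code = get_router_logits(model, tokenize(stack_subset))[layer_idx]
# logits_wiki = get_router_logits(model, tokenize(redpajama_subset))[layer_idx]
# print(expert_usage(logits_code), expert_usage(logits_wiki))
```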
We start by checking expert specialization across different general domains using RedPajama: