Fuzhao Xue

National University of Singapore

Email: f.xue at u.nus.edu

Twitter: @XueFz

Joint work with: Zian Zheng, Yao Fu, Jinjie Ni, Zangwei Zheng, Wangchunshu Zhou, Yang You

Mar 27, 2024

GitHub: https://github.com/XueFuzhao/OpenMoE (OpenMoE: a family of open-sourced Mixture-of-Experts (MoE) Large Language Models)

Introduction

Mixture-of-Experts (MoE) based LLMs are known for their cost-effectiveness compared to dense LLMs. However, there is a gap in our understanding of how MoE LLMs actually function. At a high level, MoE is typically viewed as an effective way to increase the number of parameters within a fixed computation budget. The key difference between dense and MoE LLMs is that an MoE model uses a router to selectively and adaptively assign each token to a few experts (i.e., feed-forward network layers). However, how this routing behaves is not well understood. In this blog, we take a closer look at MoE LLM routing by examining our OpenMoE, Mixtral 8x7B, and DeepSeek checkpoints, aiming to answer the following questions:

We introduce these findings one by one in the following sections.
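For readers less familiar with MoE internals, the sketch below shows concretely what "routing" means: a learned linear router scores every token against every expert, and each token is dispatched to its top-k experts, with the outputs combined using the router's probabilities. This is a minimal illustrative PyTorch implementation under simplifying assumptions, not the exact OpenMoE or Mixtral code; layer sizes and names are placeholders, and details such as top-k weight renormalization, capacity limits, and auxiliary load-balancing losses are omitted.

```python
# Minimal sketch of top-k token routing in an MoE layer (illustrative, not OpenMoE's code).
import torch
import torch.nn as nn
import torch.nn.functional as F

class TopKMoELayer(nn.Module):
    def __init__(self, d_model=512, d_ff=2048, num_experts=8, top_k=2):
        super().__init__()
        self.top_k = top_k
        # Router: one logit per expert for each token.
        self.router = nn.Linear(d_model, num_experts, bias=False)
        # Experts: independent feed-forward networks.
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
            for _ in range(num_experts)
        ])

    def forward(self, x):  # x: (num_tokens, d_model)
        logits = self.router(x)                              # (num_tokens, num_experts)
        probs = F.softmax(logits, dim=-1)
        topk_probs, topk_idx = probs.topk(self.top_k, dim=-1)  # routing decision per token
        out = torch.zeros_like(x)
        # Each token is processed only by its top-k experts, weighted by the router probability.
        for e, expert in enumerate(self.experts):
            token_rows, slots = (topk_idx == e).nonzero(as_tuple=True)
            if token_rows.numel() == 0:
                continue
            out[token_rows] += topk_probs[token_rows, slots].unsqueeze(-1) * expert(x[token_rows])
        return out, topk_idx
```

The `topk_idx` tensor returned above is exactly the kind of routing decision we analyze in the rest of this blog: which experts each token is sent to, and whether those assignments show any specialization pattern.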

Context-independent Specialization

We use a subset of RedPajama (covering different general domains), a subset of The Stack (different programming languages), a subset of the Multi-lingual TED corpus (different natural languages), and MT-Bench (different real-world tasks) to investigate MoE LLM routing behaviour.

We start by checking expert specialization across the different general domains in RedPajama:
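Before turning to the results, here is a hedged sketch of how such per-domain routing statistics can be aggregated. The data structure and function names are hypothetical: it assumes the selected expert IDs for every token have already been extracted from a checkpoint (e.g., by hooking the router outputs of one MoE layer) and grouped by domain.

```python
# Illustrative aggregation of per-domain expert usage (names and inputs are hypothetical).
import numpy as np

def expert_usage(routing_ids, num_experts):
    """Fraction of routed tokens assigned to each expert, for one MoE layer.

    routing_ids: flat list/array of expert IDs chosen by the router for each token.
    """
    counts = np.bincount(np.asarray(routing_ids).ravel(), minlength=num_experts)
    return counts / counts.sum()

def specialization_table(routing_ids_by_domain, num_experts):
    """Per-domain expert-usage distributions, e.g.
    routing_ids_by_domain = {"arxiv": [...], "c4": [...], "github": [...]}.
    """
    # If experts were domain-specialized, different domains would concentrate their
    # routing mass on different experts; if routing is largely context-independent,
    # the per-domain distributions look similar.
    return {domain: expert_usage(ids, num_experts)
            for domain, ids in routing_ids_by_domain.items()}
```

Comparing these per-domain distributions (e.g., plotting them side by side per expert) is what the figures in this section visualize.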