National University of Singapore
Email: f.xue at u.nus.edu
Twitter: @XueFz
Joint work with: Zian Zheng, Yao Fu, Jinjie Ni, Zangwei Zheng, Wangchunshu Zhou, Yang You
Mar 27, 2024
GitHub: XueFuzhao/OpenMoE — A family of open-sourced Mixture-of-Experts (MoE) Large Language Models
Mixture-of-Experts (MoE)-based LLMs are known for their cost-effectiveness compared to dense LLMs. However, there is still a knowledge gap in understanding how MoE LLMs actually function. At a high level, MoE is typically viewed as an effective way to increase the parameter count within a fixed computation budget. The key difference between dense and MoE LLMs is that an MoE model uses a router to selectively and adaptively assign each token to a few experts (i.e., feed-forward network layers). However, how this routing behaves in practice is not well understood. In this blog, we take a closer look at MoE LLM routing by examining our OpenMoE, Mixtral 8x7B, and DeepSeek checkpoints, aiming to answer the following questions:
We introduce these findings one by one in the following sections.
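To make the routing mechanism concrete before diving into the results, here is a minimal sketch of a top-k MoE layer in PyTorch. It is not taken from the OpenMoE codebase; the class name `TopKMoE` and its hyperparameters are illustrative only. A linear router scores each token, the top-k experts are selected, and their outputs are combined using the normalized router weights.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TopKMoE(nn.Module):
    """Minimal top-k MoE layer: a linear router scores each token and
    dispatches it to its k highest-scoring expert FFNs (illustrative sketch)."""

    def __init__(self, d_model: int, d_ff: int, num_experts: int = 8, k: int = 2):
        super().__init__()
        self.k = k
        self.router = nn.Linear(d_model, num_experts, bias=False)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
            for _ in range(num_experts)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (num_tokens, d_model); real implementations flatten (batch, seq) first.
        logits = self.router(x)                           # (num_tokens, num_experts)
        weights, expert_ids = logits.topk(self.k, dim=-1) # per-token top-k experts
        weights = F.softmax(weights, dim=-1)              # normalize over chosen experts
        out = torch.zeros_like(x)
        for slot in range(self.k):
            for e, expert in enumerate(self.experts):
                mask = expert_ids[:, slot] == e           # tokens sent to expert e in this slot
                if mask.any():
                    out[mask] += weights[mask, slot, None] * expert(x[mask])
        return out

# Example: route 16 tokens of width 64 through 8 experts, top-2 per token.
tokens = torch.randn(16, 64)
layer = TopKMoE(d_model=64, d_ff=256, num_experts=8, k=2)
print(layer(tokens).shape)  # torch.Size([16, 64])
```

Different models differ in details (e.g., whether the softmax is applied before or after the top-k selection), but the routing decisions we analyze below are exactly these per-token expert indices.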
We use a subset of RedPajama (covering different general domains), a subset of The Stack (different coding languages), a subset of the Multilingual TED corpus (different natural languages), and MT-Bench (different real-world tasks) to investigate MoE LLM routing behaviour.
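As a rough illustration of how such routing statistics can be collected, the sketch below counts how often each expert appears among the top-k choices for tokens from one dataset subset. The helper `get_router_logits` is an assumption, standing in for whatever hook or flag exposes the per-layer router logits of a given checkpoint (Hugging Face's Mixtral implementation, for instance, can return them when `output_router_logits=True` is passed to the forward call).

```python
import torch
from collections import Counter

def expert_usage(router_logits: torch.Tensor, k: int = 2) -> Counter:
    """Count how many times each expert is selected in the top-k choices.

    router_logits: (num_tokens, num_experts) scores from one MoE layer.
    """
    top_k = router_logits.topk(k, dim=-1).indices   # (num_tokens, k) expert ids
    return Counter(top_k.flatten().tolist())

# Hypothetical usage: compare expert usage between two domains at one layer.
# `get_router_logits` is an assumed helper returning one tensor per MoE layer.
# logits_code = get_router_logits(model, tokenize(stack_subset))[layer_idx]
# logits_wiki = get_router_logits(model, tokenize(redpajama_subset))[layer_idx]
# print(expert_usage(logits_code), expert_usage(logits_wiki))
```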
We start by checking expert specialization across different general domains using RedPajama: