DBRX: A New State-of-the-Art Open LLM (a Mixture-of-Experts LLM)

Latest recommended article published on 2025-05-23 20:21:40

Re:fused

Article link: https://blog.csdn.net/REfusing/article/details/137122253

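DBRX is an efficient general-purpose language model created by Databricks. It surpasses GPT-3.5 and is competitive with Gemini 1.0 Pro, and on programming tasks it outperforms specialized code models. Built on an MoE architecture, DBRX does well on inference speed and parameter efficiency, and its code is similar to the Mixtral MoE implementation, with some optimizations.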

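While browsing Zhihu, I noticed that a new large model, DBRX, had recently been open-sourced, and that it is also a Mixture-of-Experts (MoE) model. I had looked for MoE source code before without finding any. All I found were a few blog posts, which gave me a rough idea of how it works. My earlier post on the topic is "Understanding the Mixtral MoE Code". After a quick look, the DBRX implementation is essentially consistent with the principles described there.
Technical blog:
Introducing DBRX: A New State-of-the-Art Open LLM
Model code: click here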

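Introduction (excerpted from the original blog post):

DBRX is an open, general-purpose LLM created by Databricks. Across a range of standard benchmarks, DBRX sets a new state of the art for established open LLMs. Moreover, it provides the open community and enterprises building their own LLMs with capabilities that were previously limited to closed model APIs; according to our measurements, it surpasses GPT-3.5 and is competitive with Gemini 1.0 Pro. It is an especially capable code model, surpassing specialized models such as CodeLLaMA-70B on programming, in addition to its strength as a general-purpose LLM. This state-of-the-art quality comes with marked improvements in training and inference performance. Thanks to its fine-grained mixture-of-experts (MoE) architecture, DBRX advances the state of the art in efficiency among open models. Inference is up to 2x faster than LLaMA2-70B, and DBRX is about 40% of the size of Grok-1 in terms of both total and active parameter counts. When hosted on Mosaic AI Model Serving, DBRX can generate text at up to 150 tok/s/user. Our customers will find that training MoEs is also about 2x more FLOP-efficient than training dense models for the same final model quality. End to end, our overall recipe for DBRX (including the pretraining data, model architecture, and optimization strategy) can match the quality of our previous-generation MPT models with nearly 4x less compute.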

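Technical implementation:

Model structure: DBRX is a Transformer-based, decoder-only large language model (LLM) that uses a fine-grained mixture-of-experts (MoE) architecture
Number of experts: 16
Total parameters: 132B
Number of active experts: 4
Parameters active at inference: 36B (because of the expert structure, only 4 experts are activated at inference time)
Positional encoding: RoPE
Feed-forward network: GLU (gated linear unit)
Attention mechanism: grouped-query attention (GQA)
Tokenizer: tiktoken
Maximum context length: 32k tokens
layers: 40
Number of heads: 48
rope_theta: 500000 (LLaMA uses 10000, so this is 50x larger, presumably to support longer context lengths; I don't know much about this part)
Hidden dimension: 6144
vocab_size: 100352
ffn_hidden_size: 10752
Norm position: pre-norm
Norm type: LayerNorm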
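
The `rope_theta` entry can be made a little more concrete with a small calculation. This is only a sketch, assuming the standard RoPE formulation in which rotary pair i of a head of dimension d rotates with inverse frequency theta^(-2i/d). The head dimension 128 is derived from 6144 / 48 in the list above, and the helper names `rope_inv_freq` and `longest_wavelength` are made up for illustration. A larger base stretches the slowest rotation over many more positions, which is the reason usually given for raising it in long-context models.

```python
import math

def rope_inv_freq(head_dim: int, theta: float) -> list:
    """Inverse frequencies of standard RoPE: theta ** (-2i / head_dim)."""
    return [theta ** (-(2 * i) / head_dim) for i in range(head_dim // 2)]

def longest_wavelength(head_dim: int, theta: float) -> float:
    """Positions needed for the slowest rotary pair to complete one full turn."""
    return 2 * math.pi / rope_inv_freq(head_dim, theta)[-1]

head_dim = 6144 // 48  # hidden dimension / number of heads = 128 (derived, not quoted)
for theta in (10_000.0, 500_000.0):
    wl = longest_wavelength(head_dim, theta)
    print(f"theta={theta:>9.0f}  longest wavelength ~ {wl:,.0f} positions")
```

The second line printed is much larger than the first, so positions far apart remain distinguishable by the slowest rotary pair.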
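Other open MoE models such as Mixtral and Grok-1 use 8 experts and select 2 of them. DBRX uses 16 and selects 4, which provides more possible expert combinations.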
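
The number of expert combinations mentioned above can be checked directly. The snippet below simply counts the unordered ways of choosing the active experts, using only the expert counts given in this post.

```python
from math import comb

dbrx = comb(16, 4)     # ways to choose 4 active experts out of 16
mixtral = comb(8, 2)   # ways to choose 2 active experts out of 8

print(dbrx, mixtral, dbrx // mixtral)  # 1820 28 65
```

So the 16-choose-4 design offers 65 times as many expert combinations as the 8-choose-2 design.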
Structurally it is similar to Mixtral's MoE: the MoE is applied to the feed-forward network.
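
As a rough sanity check, the 36B active-parameter figure can be approximated from the numbers listed above. This is a back-of-the-envelope sketch, not an official breakdown. It assumes each expert is a GLU with three bias-free weight matrices of shape ffn_hidden_size x hidden size (as in the `DbrxExpertGLU` code later in this post), and it estimates all non-expert parameters as the remainder of the rounded 132B total.

```python
hidden_size = 6144
ffn_hidden_size = 10752
n_layers = 40
n_experts = 16
top_k = 4
total_params = 132e9  # rounded figure from the list above

# Each expert GLU holds three matrices (w1, v1, w2) of shape ffn_hidden_size x hidden_size.
per_expert_per_layer = 3 * hidden_size * ffn_hidden_size
all_expert_params = per_expert_per_layer * n_experts * n_layers
active_expert_params = per_expert_per_layer * top_k * n_layers

# Everything that is not an expert (attention, embeddings, norms, router),
# estimated as the remainder of the rounded total. This is an assumption.
non_expert_params = total_params - all_expert_params
active_params = active_expert_params + non_expert_params

print(f"all experts:    {all_expert_params / 1e9:.1f}B")
print(f"active experts: {active_expert_params / 1e9:.1f}B")
print(f"active total:   {active_params / 1e9:.1f}B (estimate)")
```

The estimate lands close to the quoted 36B. The small gap is expected, since the 132B total is itself rounded.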

Programming and math ability

Those are the points I care about most. Next, let's look at how the experts are implemented.

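Code analysis:

Expert selection works essentially the same way as in my earlier reading of the MoE code, so I will only describe it briefly. See:
Understanding the Mixtral MoE Code
DBRX's expert implementation matches Mixtral MoE at the code level. Only the number of experts and the number of activated experts differ; as far as I can tell, everything else is the same.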

Router

This module decides which 4 of the 16 experts each token is routed to.

from typing import Optional, Tuple

import torch
import torch.nn as nn


class DbrxRouter(nn.Module):

    def __init__(self, hidden_size: int, moe_num_experts: int, moe_top_k: int,
                 moe_jitter_eps: Optional[float],
                 moe_normalize_expert_weights: Optional[float],
                 uniform_expert_assignment: bool):
        super().__init__()
        self.hidden_size = hidden_size
        self.moe_num_experts = moe_num_experts
        self.moe_top_k = moe_top_k
        self.moe_jitter_eps = moe_jitter_eps
        self.moe_normalize_expert_weights = moe_normalize_expert_weights
        self.uniform_expert_assignment = uniform_expert_assignment

        self.layer = nn.Linear(self.hidden_size,
                               self.moe_num_experts,
                               bias=False)

    # Adds jitter (multiplicative noise); can be skipped if you only want the principle.
    def jitter(self, x: torch.Tensor) -> torch.Tensor:
        if self.moe_jitter_eps is None:
            raise RuntimeError('The router does not have moe_jitter_eps set.')
        low = 1.0 - self.moe_jitter_eps
        high = 1.0 + self.moe_jitter_eps
        noise = torch.rand(x.size(), dtype=x.dtype, device=x.device)
        return low + noise * (high - low)

    def forward(
            self, x: torch.Tensor
    ) -> Tuple[torch.Tensor, torch.Tensor, torch.LongTensor]:
        if self.training and self.moe_jitter_eps is not None:
            x = x * self.jitter(x)

        weights = self.layer(x.view(-1,
                                    x.shape[-1])).softmax(dim=-1,
                                                          dtype=torch.float32)
        top_weights, top_experts = torch.topk(weights, self.moe_top_k, dim=-1)

        if self.moe_normalize_expert_weights:
            # Normalize the top-k routing weights by their p-norm.
            top_weights = top_weights / torch.norm(
                top_weights,
                p=self.moe_normalize_expert_weights,
                dim=-1,
                keepdim=True)

        if self.uniform_expert_assignment:
            with torch.no_grad():
                uniform_tensor = torch.arange(
                    0,
                    top_experts.numel(),
                    device=top_experts.device,
                    dtype=top_experts.dtype) % self.moe_num_experts
                top_experts = uniform_tensor.reshape(top_experts.shape)
                # Note, weights and top_weights are not changed

        weights = weights.to(x.dtype)
        top_weights = top_weights.to(x.dtype)
        return weights, top_weights, top_experts  # type: ignore
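
The routing steps above (softmax over all experts, keep the top k, optionally normalize) can be traced by hand on a single token. The following is a minimal pure-Python sketch rather than the DBRX code itself. The function name `route`, the toy logits, and the use of 8 experts instead of 16 are assumptions made for illustration.

```python
import math

def route(logits, top_k, normalize_p=None):
    """Return (expert_indices, weights) for one token's router logits."""
    # Softmax over all experts (shifted by the max for numerical stability).
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    probs = [e / total for e in exps]

    # Keep the top_k experts with the highest probability.
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:top_k]
    weights = [probs[i] for i in top]

    # Optionally divide by the p-norm of the kept weights (p=1 makes them sum to 1).
    if normalize_p:
        norm = sum(abs(w) ** normalize_p for w in weights) ** (1.0 / normalize_p)
        weights = [w / norm for w in weights]
    return top, weights

logits = [0.1, 2.0, -1.0, 0.5, 1.5, 0.0, 3.0, -0.5]  # toy logits for 8 experts
experts, weights = route(logits, top_k=4, normalize_p=1)
print(experts)       # [6, 1, 4, 3]
print(sum(weights))  # approximately 1.0
```

With `normalize_p=1` the kept weights are rescaled to sum to 1, which corresponds to the `torch.norm(..., p=...)` division in the router when `moe_normalize_expert_weights` is 1.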

ExpertGLU

Applies the GLU (that is, the feed-forward network) of one selected expert to the tokens routed to it.

# A regular GLU structure, except that it holds the GLU weights of all experts.
class DbrxExpertGLU(nn.Module):

    def __init__(self, hidden_size: int, ffn_hidden_size: int,
                 moe_num_experts: int, ffn_act_fn: dict):
        super().__init__()
        self.hidden_size = hidden_size
        self.ffn_hidden_size = ffn_hidden_size
        self.moe_num_experts = moe_num_experts

        # All 16 experts are stored in one tensor, so the first dimension is
        # moe_num_experts * ffn_hidden_size.
        self.w1 = nn.Parameter(
            torch.empty(moe_num_experts * ffn_hidden_size, hidden_size))
        self.v1 = nn.Parameter(
            torch.empty(moe_num_experts * ffn_hidden_size, hidden_size))
        self.w2 = nn.Parameter(
            torch.empty(moe_num_experts * ffn_hidden_size, hidden_size))
        self.activation_fn = resolve_ffn_act_fn(ffn_act_fn)

    def forward(self, x: torch.Tensor, expert_idx: int) -> torch.Tensor:
        expert_w1 = self.w1.view(self.moe_num_experts, self.ffn_hidden_size,
                                 self.hidden_size)[expert_idx]
        expert_v1 = self.v1.view(self.moe_num_experts, self.ffn_hidden_size,
                                 self.hidden_size)[expert_idx]
        expert_w2 = self.w2.view(self.moe_num_experts, self.ffn_hidden_size,
                                 self.hidden_size)[expert_idx]

        x1 = x.matmul(expert_w1.t())
        x2 = x.matmul(expert_v1.t())
        x1 = self.activation_fn(x1)
        x1 = x1 * x2
        x1 = x1.matmul(expert_w2)
        return x1
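
To see what one expert computes, here is a small pure-Python sketch of the same GLU arithmetic on a single token vector. It assumes SiLU as the activation, whereas the code above resolves the activation from `ffn_act_fn`. The helper names and the toy weights are invented for illustration.

```python
import math

def silu(v):
    """SiLU activation: v * sigmoid(v). Assumed here; the real choice comes from ffn_act_fn."""
    return v / (1.0 + math.exp(-v))

def matvec(matrix, vec):
    """Multiply a (rows x cols) matrix by a vector of length cols."""
    return [sum(m * v for m, v in zip(row, vec)) for row in matrix]

def expert_glu(x, w1, v1, w2):
    """out = (silu(w1 @ x) * (v1 @ x)) @ w2, with all three weights shaped (ffn, hidden)."""
    gate = [silu(g) for g in matvec(w1, x)]   # length ffn
    up = matvec(v1, x)                        # length ffn
    h = [g * u for g, u in zip(gate, up)]     # elementwise gating
    # w2 is stored as (ffn, hidden), so the output is h @ w2 with no transpose.
    hidden = len(w2[0])
    return [sum(h[j] * w2[j][k] for j in range(len(h))) for k in range(hidden)]

# Toy sizes: hidden=2, ffn=3.
x = [1.0, -1.0]
w1 = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
v1 = [[1.0, 1.0], [2.0, 0.0], [0.0, 2.0]]
w2 = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
out = expert_glu(x, w1, v1, w2)
print(out)
```

Note that `w1` and `v1` are applied transposed while `w2` is not, mirroring `x.matmul(expert_w1.t())` and `x1.matmul(expert_w2)` in the class above.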

Experts

This is the core of the MoE layer: the control code that dispatches each token to its selected experts and runs the corresponding GLU.

class DbrxExperts(nn.Module):

    def __init__(self, hidden_size: int, ffn_hidden_size: int,
                 moe_num_experts: int, ffn_act_fn: dict):
        super().__init__()
        self.moe_num_experts = moe_num_experts
        self.mlp = DbrxExpertGLU(hidden_size=hidden_size,
                                 ffn_hidden_size=ffn_hidden_size,
                                 moe_num_experts=moe_num_experts,
                                 ffn_act_fn=ffn_act_fn)

    def forward(self, x: torch.Tensor, weights: torch.Tensor,
                top_weights: torch.Tensor,
                top_experts: torch.LongTensor) -> torch.Tensor:
        bsz, q_len, hidden_size = x.shape
        x = x.view(-1, hidden_size)
        out = torch.zeros_like(x)

        expert_mask = nn.functional.one_hot(
            top_experts, num_classes=self.moe_num_experts).permute(2, 1, 0)
        for expert_idx in range(0, self.moe_num_experts):
            topk_idx, token_idx = torch.where(expert_mask[expert_idx])
            if token_idx.shape[0] == 0:
                continue

            token_list = token_idx.tolist()
            topk_list = topk_idx.tolist()

            expert_tokens = x[None, token_list].reshape(-1, hidden_size)
            expert_out = self.mlp(
                expert_tokens, expert_idx) * top_weights[token_list, topk_list,
                                                         None]

            out.index_add_(0, token_idx, expert_out)

        out = out.reshape(bsz, q_len, hidden_size)
        return out
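
The dispatch-and-combine loop above can be mimicked without tensors. The sketch below is a simplified illustration, not the DBRX implementation. It loops over experts, finds the (token, slot) pairs routed to each one, and accumulates the weighted outputs, which is the role `torch.where` and `index_add_` play in the class above. The function `moe_combine` and the toy experts are hypothetical.

```python
def moe_combine(tokens, top_experts, top_weights, experts):
    """tokens: list of vectors; top_experts/top_weights: per-token lists of length top_k;
    experts: list of callables mapping a vector to a vector."""
    out = [[0.0] * len(t) for t in tokens]
    for expert_idx, expert in enumerate(experts):
        # Collect every (token, slot) pair routed to this expert.
        routed = [(t, k) for t, chosen in enumerate(top_experts)
                  for k, e in enumerate(chosen) if e == expert_idx]
        for t, k in routed:
            y = expert(tokens[t])
            w = top_weights[t][k]
            # Accumulate the weighted expert output into the token's row.
            out[t] = [o + w * v for o, v in zip(out[t], y)]
    return out

# Toy experts (hypothetical): expert i scales its input by i + 1.
experts = [lambda v, s=i + 1: [s * a for a in v] for i in range(4)]
tokens = [[1.0, 2.0], [3.0, 4.0]]
top_experts = [[0, 2], [3, 1]]
top_weights = [[0.75, 0.25], [0.5, 0.5]]
result = moe_combine(tokens, top_experts, top_weights, experts)
print(result)  # [[1.5, 3.0], [9.0, 12.0]]
```

Experts that receive no tokens are skipped naturally, which matches the `continue` branch in the original loop.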