Speculative Decoding: A Summary of the Main Papers

Observations:

1. For many tokens in a text, a small model is already capable of predicting them correctly; a model as powerful as the large one is not needed for every token.

2. The decode phase is memory-bound, not compute-bound; it is far from saturating the GPU's compute capacity (see the rough estimate below).
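
As a back-of-the-envelope illustration of the memory-bound point (a sketch with hypothetical numbers: a 7B-parameter model in fp16 on a GPU with roughly 2 TB/s of HBM bandwidth and 300 TFLOPS of fp16 compute), the time to generate one token is dominated by streaming the weights, not by the arithmetic:

```python
# Rough roofline-style estimate for single-token decode (illustrative numbers only).
params = 7e9                      # 7B-parameter model
bytes_per_param = 2               # fp16 weights
hbm_bandwidth = 2e12              # ~2 TB/s HBM bandwidth (hypothetical GPU)
peak_flops = 300e12               # ~300 TFLOPS fp16 (hypothetical GPU)

weight_bytes = params * bytes_per_param
flops_per_token = 2 * params      # roughly 2 FLOPs per parameter per generated token

t_memory = weight_bytes / hbm_bandwidth   # time just to read all weights once
t_compute = flops_per_token / peak_flops  # time for the arithmetic itself

print(f"memory-bound time : {t_memory * 1e3:.2f} ms/token")   # ~7 ms
print(f"compute-bound time: {t_compute * 1e3:.3f} ms/token")  # ~0.05 ms
```

Since the weight-reading time is two orders of magnitude larger than the compute time, most of the GPU's arithmetic units sit idle during decode.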

Basic principle

First, a small, fast draft model generates a few tokens. Then the large, slow target model takes the input plus these drafted tokens as one batch and, in a single forward pass, produces the next-token prediction at every position. Only the prefix of drafted tokens that exactly matches the target model's predictions is kept; the mismatched remainder is discarded.
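
A minimal sketch of one draft-then-verify round under these assumptions (batch size 1, greedy decoding with exact-match acceptance; the rejection-sampling acceptance rule used for stochastic sampling in the papers is omitted; `draft_model` and `target_model` are hypothetical HuggingFace-style causal LMs whose call returns an object with per-position `.logits`):

```python
import torch

@torch.no_grad()
def speculative_step(target_model, draft_model, input_ids, k=4):
    """One speculative-decoding round: draft k tokens, verify in a single target forward pass."""
    # 1. The small draft model proposes k tokens autoregressively (cheap).
    draft_ids = input_ids
    for _ in range(k):
        logits = draft_model(draft_ids).logits[:, -1, :]
        next_id = logits.argmax(dim=-1, keepdim=True)
        draft_ids = torch.cat([draft_ids, next_id], dim=-1)

    # 2. The large target model scores prompt + drafted tokens in ONE forward pass.
    target_logits = target_model(draft_ids).logits
    # Target predictions at the k drafted positions, plus one extra position at the end.
    target_pred = target_logits[:, input_ids.shape[1] - 1:, :].argmax(dim=-1)
    drafted = draft_ids[:, input_ids.shape[1]:]

    # 3. Keep the longest prefix where draft and target agree; discard the rest.
    agree = (target_pred[:, :k] == drafted)[0]          # assumes batch size 1
    n_accept = int(agree.cumprod(dim=0).sum())

    # The target model also supplies one "free" token right after the accepted prefix.
    bonus = target_pred[:, n_accept:n_accept + 1]
    return torch.cat([input_ids, drafted[:, :n_accept], bonus], dim=-1)
```

Each call advances the sequence by between 1 and k+1 tokens at the cost of a single target-model forward pass, which is why the method pays off when the draft model's acceptance rate is high.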

SpecInfer

Pain point: the paper emphasizes that conventional decoding is memory-bound because every decoding step must read the full set of model parameters once.

Either one small model samples several tokens per step, or several small models each sample one candidate sequence.

The purpose of generating multiple candidate sequences is to increase the probability that the target model accepts the draft (the hit rate).

The purpose of merging the candidate sequences into a single tree is to avoid recomputing the KV cache of shared prefix tokens, which also reduces GPU memory usage.

With tree attention, all tokens of the tree are packed into one batch and verified by the large model in a single forward pass.

The number of tokens sampled at each node is capped, e.g. by an expansion configuration such as <2, 2, 1> (see the sketch below).
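
A minimal sketch of merging candidate sequences into a token tree and deriving the tree-attention mask (plain Python/PyTorch following the idea described above, not SpecInfer's actual implementation; the candidate sequences are hypothetical and would in practice be produced under an expansion config like <2, 2, 1>):

```python
import torch

def build_token_tree(candidate_seqs):
    """Merge candidate token sequences into a prefix tree (token tree).

    Each node is (token_id, parent_index); shared prefixes are stored only once,
    so their KV cache entries are also computed only once.
    """
    nodes = []        # list of (token_id, parent_index); parent_index == -1 for roots
    children = {}     # (parent_index, token_id) -> node index
    leaves = []
    for seq in candidate_seqs:
        parent = -1
        for tok in seq:
            key = (parent, tok)
            if key not in children:
                children[key] = len(nodes)
                nodes.append((tok, parent))
            parent = children[key]
        leaves.append(parent)
    return nodes, leaves

def tree_attention_mask(nodes):
    """Causal mask over the tree: each token may attend only to its ancestors and itself."""
    n = len(nodes)
    mask = torch.zeros(n, n, dtype=torch.bool)
    for i, (_, parent) in enumerate(nodes):
        j = i
        while j != -1:
            mask[i, j] = True
            j = nodes[j][1]
    return mask

# Example: three candidate continuations drafted from the same prompt.
candidates = [[5, 8, 3], [5, 8, 7], [5, 2, 9]]
nodes, leaves = build_token_tree(candidates)
print(nodes)                            # 6 nodes instead of 9: "5" and "5,8" prefixes are shared
print(tree_attention_mask(nodes).int()) # mask passed to the target model's single forward pass
```

The mask is what makes the single batched forward pass valid: a token on one branch never attends to tokens on a sibling branch, only to its own ancestors.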

### Speculative Decoding in LMStudio: Implementation and Usage

In the context of large language models (LLMs), speculative decoding is an optimization technique aimed at improving inference efficiency by predicting future tokens before they are fully computed. This approach leverages parallelism within GPUs or TPUs to reduce latency during text generation.

#### Key Concepts Behind Speculative Decoding

To understand how speculative decoding works, recall that traditional autoregressive models generate one token at a time based on the previous tokens. This sequential nature leads to inefficiencies, since each new prediction must wait for the preceding computation to complete. By contrast, speculative decoding evaluates multiple potential next tokens simultaneously while maintaining reasonable accuracy[^1].

#### Memory Optimization Techniques

Training multi-token predictors poses significant challenges for GPU memory utilization, due to the disparity between the vocabulary size \( V \) and the embedding dimension \( d \). Naive implementations materialize all logits along with their gradients, leading to substantial memory consumption. To mitigate this:

- Forward and backward operations are carefully orchestrated.
- After the trunk computation is shared across the independent output heads, each head's forward pass is immediately followed by its backward pass, sequentially.
- The logits produced by each head are discarded as soon as they are no longer needed, retaining only the gradient information required for the main model parameters[^2].

This strategy reduces peak GPU memory usage without introducing additional runtime overhead.

#### Practical Application Within LMStudio

Within LMStudio, implementing speculative decoding involves several considerations:

```python
def speculative_decode(model, input_sequence, max_length=50):
    # Simplified sketch: top_k_sampling, select_best_continuation and
    # check_end_of_sequence are assumed helpers, not actual LMStudio APIs.
    predictions = []
    for _ in range(max_length):
        # Standard forward pass over the sequence known so far
        hidden_state = model.forward(input_sequence)

        # Generate top-k candidate tokens from the current state
        candidate_tokens = model.top_k_sampling(hidden_state, k=5)

        # Select the most probable continuation among the candidates
        best_token_id = select_best_continuation(candidate_tokens)

        # Append the chosen token ID so subsequent iterations build on it
        input_sequence.append(best_token_id)
        predictions.append(best_token_id)

        # Early stopping when an end-of-sequence token is produced
        if check_end_of_sequence(best_token_id):
            break

    return predictions
```

The snippet above is a simplified version of speculative decoding: after computing the intermediate state, multiple possible continuations are evaluated probabilistically rather than committing to a single deterministic outcome step by step.
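
Returning to the memory-optimization technique described above, a minimal PyTorch-style sketch (the trunk/head structure and all names are hypothetical, not the paper's code): each head's forward pass is immediately followed by its backward pass against a detached copy of the trunk output, so only one head's logits exist in memory at a time, and the trunk receives a single backward pass with the accumulated gradient.

```python
import torch
import torch.nn.functional as F

def multi_head_loss_backward(trunk, heads, x, targets):
    """Backpropagate through several output heads one at a time to cap peak memory.

    `trunk` maps inputs to hidden states; `heads` is a list of nn.Linear(d, V) modules,
    one per predicted future token; `targets[i]` holds the labels for head i.
    """
    hidden = trunk(x)                                   # shared trunk computation
    hidden_detached = hidden.detach().requires_grad_(True)

    total_loss = 0.0
    for i, head in enumerate(heads):
        logits = head(hidden_detached)                  # [batch, seq, V] logits for this head only
        loss = F.cross_entropy(logits.flatten(0, 1), targets[i].flatten())
        # Backward through this head immediately; its logits and graph are freed right after,
        # while the gradient w.r.t. the trunk output accumulates in hidden_detached.grad.
        loss.backward()
        total_loss += loss.item()

    # One backward pass through the trunk with the gradient accumulated from all heads.
    hidden.backward(hidden_detached.grad)
    return total_loss
```

Only one head's `[batch, seq, V]` logits tensor is alive at any moment, which is where the peak-memory saving comes from; the trunk activations are reused across heads instead of being recomputed.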