LLM Inference 01: Lookahead Decoding Performance Test

Terminology:

Download and install:

git clone https://github.com/hao-ai-lab/LookaheadDecoding.git
cd LookaheadDecoding
pip install -r requirements.txt
pip install -e .

Run the official demo:

# Without LADE
python minimal.py

# With LADE
USE_LADE=1 LOAD_LADE=1 python minimal.py
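
To repeat the two runs above from a script, a small wrapper along the following lines may help. It is only a sketch: `build_env` and `run_demo` are hypothetical helper names, not part of the LookaheadDecoding repository, and it assumes that leaving `USE_LADE` and `LOAD_LADE` unset is equivalent to the plain `python minimal.py` run. The measured wall-clock time includes model loading, so it is not the same as pure decoding throughput.

```python
import os
import subprocess
import sys
import time


def build_env(use_lade, base_env=None):
    """Return a copy of the environment with the LADE switches set or cleared."""
    env = dict(os.environ if base_env is None else base_env)
    for name in ("USE_LADE", "LOAD_LADE"):
        if use_lade:
            env[name] = "1"
        else:
            # Assumption: unset variables behave like plain `python minimal.py`.
            env.pop(name, None)
    return env


def run_demo(use_lade, script="minimal.py"):
    """Run the demo script once; return (wall-clock seconds, captured stdout)."""
    start = time.perf_counter()
    result = subprocess.run(
        [sys.executable, script],
        env=build_env(use_lade),
        capture_output=True,
        text=True,
        check=True,  # raise if the demo exits with a non-zero status
    )
    return time.perf_counter() - start, result.stdout
```

Calling `run_demo(False)` and then `run_demo(True)` from the repository root would give one timing per mode, along with the captured output for later comparison.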

Tip: if you see the error "We couldn't connect to 'https://huggingface.co' to load this file", check your proxy configuration and confirm that the proxy can actually reach huggingface.co.
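
One way to check the proxy before rerunning the demo is a quick reachability probe. The sketch below uses only the Python standard library; `can_reach` is a hypothetical helper name, and any proxy address you pass in (for example `http://127.0.0.1:7890`) is a placeholder for your own setup.

```python
import urllib.error
import urllib.request


def can_reach(url="https://huggingface.co", proxy=None, timeout=10.0):
    """Return True if `url` answers, optionally through an explicit proxy.

    With proxy=None, urllib falls back to the HTTP_PROXY / HTTPS_PROXY
    environment variables, if they are set.
    """
    handlers = []
    if proxy is not None:
        handlers.append(urllib.request.ProxyHandler({"http": proxy, "https": proxy}))
    opener = urllib.request.build_opener(*handlers)
    try:
        with opener.open(url, timeout=timeout):
            return True
    except urllib.error.HTTPError:
        return True  # the server answered, so the network path works
    except OSError:
        return False  # refused, timed out, DNS failure, ...
```

If `can_reach(proxy="http://127.0.0.1:7890")` returns True, exporting the same address as `HTTPS_PROXY` before running `minimal.py` is a reasonable next step.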

1. Local GTX 1050 Ti with 4 GB GPU memory

rtx1050ti_not_use_lade

rtx1050ti_use_lade

2. NVIDIA P100 with 16 GB GPU memory

p100_not_use_lade

p100_use_lade

3. NVIDIA T4 x2 with 16 GB GPU memory

t4x2_not_use_lade

t4x2_use_lade

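Performance summary:

Measured results with the official TinyLlama example:
(1) Of the three hardware setups, LADE brings an improvement (26%) only on T4 x2; the other two show a significant slowdown.
(2) On the 1050 Ti and the P100, the last line of the output (the generated text) is identical with and without LADE, but on T4 x2 the last line differs.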
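
The two comparisons above (throughput change and agreement of the final generated line) can be sketched as follows. The function names are hypothetical, and the numbers in the usage note are illustrative placeholders, not the measurements from this test.

```python
def relative_change(baseline_tps, lade_tps):
    """Throughput change of LADE relative to the baseline, in percent.

    Positive means LADE is faster; negative means it is slower.
    """
    if baseline_tps <= 0:
        raise ValueError("baseline throughput must be positive")
    return (lade_tps / baseline_tps - 1.0) * 100.0


def last_line(text):
    """Last non-empty line of captured program output ('' if there is none)."""
    lines = [line.strip() for line in text.splitlines() if line.strip()]
    return lines[-1] if lines else ""


def same_final_text(baseline_output, lade_output):
    """True if both runs end with the same non-empty line."""
    return last_line(baseline_output) == last_line(lade_output)
```

For example, with placeholder throughputs of 100 and 126 tokens/s, `relative_change(100.0, 126.0)` is roughly 26, and `same_final_text` can be fed the captured stdout of the two runs.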

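The official numbers are based on llama2-7b-chat, so a next step is to rerun the test with the same model and compare against them.
Interestingly, the official default example itself shows a performance regression here, which raises the question of whether the method's general applicability is limited.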