LLM
Structure (see the sketch after this outline)
Embedding Layer
- Token Embedding
- Positional Embedding
Transformer
- Multi-Headed Self-Attention
- Query, Key, Value
- Feed-Forward Neural Networks
- Linear Layer
- Up-projection
- Down-projection
- Activation Function
- ReLU
- Linear Layer
- Residual Connection & Layer Normalization
Decoding & Softmax
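The outline above maps one-to-one onto a minimal decoder-only model. The PyTorch sketch below is only an illustration of how the pieces fit together; the hyperparameters (d_model=512, 8 heads, ReLU FFN, learned positional embeddings) are made up and not tied to any particular production model.

```python
# Minimal decoder-only LM matching the outline: embeddings -> (attention + FFN with
# residual/LayerNorm) x N -> decoding head + softmax. Illustrative sizes only.
import math
import torch
import torch.nn as nn
import torch.nn.functional as F

class DecoderBlock(nn.Module):
    def __init__(self, d_model=512, n_heads=8, d_ff=2048):
        super().__init__()
        self.n_heads, self.d_head = n_heads, d_model // n_heads
        self.qkv = nn.Linear(d_model, 3 * d_model)      # Query, Key, Value projections
        self.proj = nn.Linear(d_model, d_model)         # attention output projection
        self.ffn = nn.Sequential(                       # up-projection -> ReLU -> down-projection
            nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model))
        self.ln1, self.ln2 = nn.LayerNorm(d_model), nn.LayerNorm(d_model)

    def forward(self, x):                               # x: [batch, seq, d_model]
        b, t, d = x.shape
        q, k, v = self.qkv(self.ln1(x)).chunk(3, dim=-1)
        # split into heads: [batch, n_heads, seq, d_head]
        q, k, v = (z.reshape(b, t, self.n_heads, self.d_head).transpose(1, 2) for z in (q, k, v))
        scores = q @ k.transpose(-2, -1) / math.sqrt(self.d_head)
        causal = torch.triu(torch.ones(t, t, dtype=torch.bool, device=x.device), diagonal=1)
        attn = F.softmax(scores.masked_fill(causal, float("-inf")), dim=-1)
        x = x + self.proj((attn @ v).transpose(1, 2).reshape(b, t, d))  # residual connection
        x = x + self.ffn(self.ln2(x))                                   # residual connection
        return x

class TinyLM(nn.Module):
    def __init__(self, vocab=1000, d_model=512, n_layers=2, max_len=128):
        super().__init__()
        self.tok = nn.Embedding(vocab, d_model)         # token embedding
        self.pos = nn.Embedding(max_len, d_model)       # learned positional embedding
        self.blocks = nn.ModuleList(DecoderBlock(d_model) for _ in range(n_layers))
        self.head = nn.Linear(d_model, vocab)           # decoding: hidden state -> vocab logits

    def forward(self, ids):                             # ids: [batch, seq]
        x = self.tok(ids) + self.pos(torch.arange(ids.size(1), device=ids.device))
        for blk in self.blocks:
            x = blk(x)
        return F.softmax(self.head(x), dim=-1)          # next-token probability distribution

probs = TinyLM()(torch.randint(0, 1000, (1, 16)))       # [1, 16, 1000]
```

Production models differ in details (RoPE instead of learned positions, RMSNorm, gated activations), but the data flow is the same as above.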
Memory Usage
- Model Weights
- KV Cache
- Activation/Overhead Buffers
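A rough back-of-the-envelope estimate of the first two buckets above. The model shape (a 7B-class decoder with 32 layers, 32 KV heads, head dim 128) and FP16 storage are assumptions chosen only to make the arithmetic concrete; activation/overhead buffers are engine-dependent and not estimated here.

```python
# Rough memory estimate for weights and KV cache; shapes and dtype are assumptions.
def kv_cache_bytes(batch, seq_len, n_layers=32, n_kv_heads=32, head_dim=128, bytes_per_elem=2):
    # 2x for K and V, stored per layer, per head, per token
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * batch * bytes_per_elem

def weight_bytes(n_params=7e9, bytes_per_elem=2):
    return n_params * bytes_per_elem

gib = 1024 ** 3
print(f"weights .............. {weight_bytes() / gib:.1f} GiB")            # ~13.0 GiB
print(f"KV cache (1 x 4096) .. {kv_cache_bytes(1, 4096) / gib:.2f} GiB")   # ~2.00 GiB
print(f"KV cache (16 x 4096) . {kv_cache_bytes(16, 4096) / gib:.1f} GiB")  # ~32.0 GiB
```

The point of the example: at long contexts and large batch sizes the KV cache, not the weights, dominates memory, which is why so many of the techniques below target it.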
Inference Optimization
Metrics
When we talk about efficient inference, what are we really discussing?
- Cost
- Computational Cost
- Memory Access Cost
- Memory Cost
- Latency
- First-token latency
- Per-output-token latency
- Generation latency
- Performance
- Accuracy & Precision
- Robustness
- Generalization
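Assuming a roughly constant decode speed, the three latency metrics above relate approximately as below; the numbers are invented purely for illustration.

```python
# Common approximation: total generation latency = time to first token (prefill)
# plus one decode step per additional output token. Numbers are made up.
def generation_latency(ttft_s, tpot_s, n_output_tokens):
    return ttft_s + (n_output_tokens - 1) * tpot_s

print(generation_latency(ttft_s=0.35, tpot_s=0.02, n_output_tokens=256))  # ~5.45 s
```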
Overview
Data-level Optimization
- Input Compression
- Prompt Pruning
- Prompt Summary
- Soft Prompt-based Compression
- Retrieval-Augmented Generation (RAG)
- Output Organization
- Skeleton-of-Thought (SoT)
- Directed Acyclic Graph (DAG)
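As a concrete instance of output organization, here is a sketch of the two-stage Skeleton-of-Thought pattern: request a short skeleton first, then expand each point independently so the expansions can run in parallel. `llm_complete` is a hypothetical placeholder for whatever completion client you use, not a real API.

```python
# Sketch of Skeleton-of-Thought (SoT): skeleton first, then parallel point expansion.
from concurrent.futures import ThreadPoolExecutor

def llm_complete(prompt: str) -> str:
    raise NotImplementedError("hypothetical placeholder: plug in your model/API client here")

def skeleton_of_thought(question: str, max_points: int = 5) -> str:
    skeleton = llm_complete(
        f"Give at most {max_points} short bullet points outlining an answer to:\n{question}")
    points = [p.strip("-• ").strip() for p in skeleton.splitlines() if p.strip()]
    with ThreadPoolExecutor() as pool:        # expansions are independent, so run them in parallel
        bodies = list(pool.map(
            lambda p: llm_complete(f"Question: {question}\nExpand this point in 2-3 sentences: {p}"),
            points))
    return "\n\n".join(bodies)
```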
Model-level Optimization
- Efficient Structure Design
- Efficient FFN Design
- Mixture-of-Experts (MoE)
- Efficient Attention Design
- Low-Complexity Attention
- Kernel-Based Attention
- Low-Rank Attention
- Multi/Group-Query Attention
- Transformer Alternates
- Model Compression
- Quantization
- Post-Training Quantization (PTQ)
- Quantization-aware Training (QAT)
- Sparsification
- Weight Pruning
- Sparse Attention
- Structure Optimization
- Structure Factorization
- Neural Architecture Search
- Knowledge Distillation
- White-box KD
- Black-box KD
- Dynamic Inference
- Sample-level Early Exiting
- Token-level Early Exiting
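To make one of the compression techniques above concrete, here is a minimal sketch of symmetric per-tensor INT8 post-training quantization (PTQ). Real PTQ methods use per-channel or group-wise scales and calibration data, so treat this as the idea only, not a usable scheme.

```python
# Symmetric per-tensor INT8 PTQ sketch: one scale per tensor, dequantize at use time.
import numpy as np

def quantize_int8(w: np.ndarray):
    scale = np.abs(w).max() / 127.0                      # single scale for the whole tensor
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    return q.astype(np.float32) * scale

w = np.random.randn(4096, 4096).astype(np.float32)
q, s = quantize_int8(w)
err = np.abs(dequantize(q, s) - w).mean()
print(f"memory: {w.nbytes / 2**20:.0f} MiB -> {q.nbytes / 2**20:.0f} MiB, mean abs error {err:.4f}")
```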
System-level Optimization
- Inference Engine
- Graph and Operator Optimization
- Attention Operator Optimization
- #FlashAttention
- Graph-Level Optimization
- Linear Operator Optimization
- Speculative Decoding
- Serving System
- Memory Management
- #LightLLM #TokenAttention
- #vLLM #PagedAttention
- #SGLang #RadixAttention
- Continuous Batching
- Scheduling
- Distributed Systems
- Data Parallelism
- Model Parallelism
- #OPER
- Pipeline Parallelism
- #PipeLLM
- Layer-wise Parallelism
- Tensor Parallelism
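Of the parallelism strategies above, tensor parallelism is the easiest to show in a few lines: split a linear layer's weight column-wise across devices, compute each shard locally, and gather the partial outputs. The NumPy toy below only simulates the sharding on one machine; the sizes are arbitrary.

```python
# Toy tensor parallelism for one linear layer: column-wise weight sharding + "all-gather".
import numpy as np

def tensor_parallel_linear(x, w, n_devices=4):
    shards = np.split(w, n_devices, axis=1)      # each device holds d_out / n_devices columns
    partial = [x @ shard for shard in shards]    # per-device matmul on its own shard
    return np.concatenate(partial, axis=-1)      # all-gather of the partial outputs

x = np.random.randn(2, 8, 4096).astype(np.float32)
w = np.random.randn(4096, 11008).astype(np.float32)
print(np.allclose(tensor_parallel_linear(x, w), x @ w, rtol=1e-4, atol=1e-4))  # True
```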
Methods
FlashAttention
#LightLLM
https://arxiv.org/pdf/2205.14135
https://github.com/Dao-AILab/flash-attention
⏱️ FlashAttention Explained in 78 Seconds [有点意思 · 1]
https://www.bilibili.com/video/BV1Zz4y1q7FX
Illustrated LLM Acceleration Series: FlashAttention V1, from Hardware to Compute Logic
https://zhuanlan.zhihu.com/p/669926191
Illustrated LLM Acceleration Series: FlashAttention V2, from Principles to Parallel Computation
https://zhuanlan.zhihu.com/p/691067658
FlashAttention: Faster Training of Longer-Context GPTs [Paper Skim · 6]
https://www.bilibili.com/video/BV1SW4y1X7kh
Why Is Flash Attention So Fast? An Explanation of the Principle
https://www.bilibili.com/video/BV1UT421k7rA
FlashAttention: Faster Training of Longer-Context GPTs
https://readpaper.feishu.cn/docx/AC7JdtLrhoKpgxxSRM8cfUounsh
MedAI #54: FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness | Tri Dao
https://www.youtube.com/watch?v=FThvfkXWqtE
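The links above cover the full kernel; the NumPy sketch below shows only the core online-softmax tiling idea: scan K/V in blocks while keeping a running max and running denominator per query row, so the seq_len × seq_len score matrix is never materialized. The real speedup comes from doing these tiles in GPU SRAM with fused kernels, which a NumPy loop obviously does not capture.

```python
# Online-softmax tiling, the math at the heart of FlashAttention (single head, no mask).
import numpy as np

def flash_attention_single_head(q, k, v, block=64):
    n, d = q.shape                               # q, k, v: [seq_len, d_head]
    scale = 1.0 / np.sqrt(d)
    out = np.zeros_like(q)
    row_max = np.full(n, -np.inf)                # running max of scores per query row
    row_sum = np.zeros(n)                        # running softmax denominator per query row
    for start in range(0, k.shape[0], block):
        kb, vb = k[start:start + block], v[start:start + block]
        s = (q @ kb.T) * scale                   # scores for this K/V tile only
        new_max = np.maximum(row_max, s.max(axis=1))
        correction = np.exp(row_max - new_max)   # rescale previously accumulated results
        p = np.exp(s - new_max[:, None])
        row_sum = row_sum * correction + p.sum(axis=1)
        out = out * correction[:, None] + p @ vb
        row_max = new_max
    return out / row_sum[:, None]

# sanity check against naive attention
rng = np.random.default_rng(0)
q, k, v = (rng.standard_normal((256, 64)) for _ in range(3))
scores = (q @ k.T) / np.sqrt(64)
weights = np.exp(scores - scores.max(axis=1, keepdims=True))
weights /= weights.sum(axis=1, keepdims=True)
print(np.allclose(flash_attention_single_head(q, k, v), weights @ v))   # True: exact, not approximate
```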
PagedAttention
#vLLM
Illustrated LLM Acceleration Series: PagedAttention, the Core Technique Behind vLLM
https://zhuanlan.zhihu.com/p/691038809
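A toy sketch of the block-table idea: the KV cache lives in fixed-size physical blocks, and each sequence maps logical token positions to (possibly non-contiguous) physical blocks, much like virtual-memory paging. The class and shapes below are illustrative, not vLLM's actual interfaces.

```python
# Toy paged KV cache: a physical block pool, a free list, and a per-sequence block table.
import numpy as np

BLOCK, D = 16, 128                          # tokens per block, per-head KV width (illustrative)
pool = np.zeros((1024, BLOCK, D))           # physical block pool (one head, K only, for brevity)
free = list(range(1024))                    # free list of physical block ids

class SeqKVCache:
    def __init__(self):
        self.block_table, self.length = [], 0
    def append(self, kv_vec):
        if self.length % BLOCK == 0:                 # current block full (or first token): allocate
            self.block_table.append(free.pop())
        pool[self.block_table[-1], self.length % BLOCK] = kv_vec
        self.length += 1
    def gather(self):
        # reassemble the logically contiguous cache for attention over past tokens
        return pool[self.block_table].reshape(-1, D)[: self.length]

seq = SeqKVCache()
for _ in range(40):                                  # 40 tokens -> ceil(40/16) = 3 blocks
    seq.append(np.random.randn(D))
print(len(seq.block_table), seq.gather().shape)      # 3 (40, 128)
```

Because allocation is per block rather than per maximum sequence length, memory fragmentation drops sharply and blocks of a shared prefix can be reused across sequences.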
TokenAttention
#LightLLM
RadixAttention
#SGLang
Recent Advances in LLM Inference: Attention Operators and KV-Cache Reuse
https://www.zhihu.com/question/637480772/answer/3391893087
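A toy sketch of the prefix-reuse idea: cached prompts are organized in a tree keyed by token ids, so a new request matches its longest cached prefix and only needs to prefill the remaining suffix. Real RadixAttention stores KV-block references at the nodes and handles eviction; the code below only shows the matching.

```python
# Toy prefix tree for KV-cache reuse: insert cached prompts, match the longest shared prefix.
class PrefixNode:
    def __init__(self):
        self.children = {}                 # token id -> PrefixNode

class PrefixTree:
    def __init__(self):
        self.root = PrefixNode()
    def insert(self, tokens):
        node = self.root
        for t in tokens:
            node = node.children.setdefault(t, PrefixNode())
    def longest_prefix(self, tokens):
        node, n = self.root, 0
        for t in tokens:
            if t not in node.children:
                break
            node, n = node.children[t], n + 1
        return n                           # number of tokens whose KV can be reused

tree = PrefixTree()
tree.insert([1, 2, 3, 4, 5])                         # e.g. a cached system prompt
hit = tree.longest_prefix([1, 2, 3, 9, 9])
print(f"reuse {hit} cached tokens, prefill {5 - hit} new ones")   # reuse 3, prefill 2
```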
Continuous Batching
#llamacpp #vLLM #LightLLM
How continuous batching enables 23x throughput in LLM inference while reducing p50 latency
https://www.anyscale.com/blog/continuous-batching-llm-inference
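A toy simulation of iteration-level scheduling, the core of continuous batching: admission happens every decode step, so a finished sequence frees its slot immediately and a waiting request joins mid-flight instead of waiting for the whole static batch to drain. Request lengths and batch size are made up; no model is called.

```python
# Continuous (iteration-level) batching simulation: per-step admission and release.
from collections import deque

def continuous_batching(request_lengths, max_batch=4):
    waiting = deque(enumerate(request_lengths))        # (request id, tokens still to generate)
    running, steps = [], 0
    while waiting or running:
        while waiting and len(running) < max_batch:    # admit new requests every iteration
            running.append(list(waiting.popleft()))
        for r in running:                              # one decode step for every running sequence
            r[1] -= 1
        finished = [r[0] for r in running if r[1] == 0]
        running = [r for r in running if r[1] > 0]     # free slots as soon as a sequence finishes
        steps += 1
        if finished:
            print(f"step {steps:3d}: finished {finished}, running {len(running)}, waiting {len(waiting)}")
    return steps

continuous_batching([8, 64, 16, 32, 8, 8])
```

Compared with static batching, short requests no longer wait for the longest request in their batch, which is where the throughput and p50 latency gains in the article above come from.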
Inference Frameworks
SGLang
https://github.com/sgl-project/sglang
SGLang Code Walk Through
https://github.com/zhaochenyang20/Awesome-ML-SYS-Tutorial/tree/main/sglang/code-walk-through
vLLM
https://github.com/vllm-project/vllm
Concurrent Accelerated Deployment of LLMs: An Analysis of Several Widely Used Deployment Schemes
https://www.bilibili.com/video/BV1du4m1K7eF
The AI Acceleration Showdown: vLLM vs. TGI in the Race for Efficient LLM Deployment
https://runaker.medium.com/the-ai-acceleration-showdown-vllm-vs-tgi-in-the-race-for-efficient-llm-deployment-13fe90c635be
Illustrated LLM Acceleration Series: vLLM Source Code Analysis 1, Overall Architecture
https://zhuanlan.zhihu.com/p/691045737
Illustrated LLM Acceleration Series: vLLM Source Code Analysis 2, the Scheduler
https://zhuanlan.zhihu.com/p/692540949
Illustrated LLM Acceleration Series: vLLM Source Code Analysis 3, the BlockManager (Part 1)
https://zhuanlan.zhihu.com/p/700780161
Illustrated LLM Acceleration Series: vLLM Source Code Analysis 3, Prefix Caching
https://zhuanlan.zhihu.com/p/707228704
LightLLM
https://github.com/ModelTC/lightllm
LightLLM, a Lightweight High-Performance Inference Framework: How Does It Compare with vLLM?
https://www.bilibili.com/video/BV18G411U77C
lightllm Code Walkthrough
https://www.zhihu.com/people/robindu/posts
llama.cpp
https://github.com/ggerganov/llama.cpp
Notes: A Brief Look at the Llama.cpp Code (1): Parallelism and the KV Cache
https://zhuanlan.zhihu.com/p/670515231
Notes: A Brief Look at the Llama.cpp Code (2): Data Structures and Sampling Methods
https://zhuanlan.zhihu.com/p/671761052
Notes: A Brief Look at the Llama.cpp Code (3): Computational Cost
https://zhuanlan.zhihu.com/p/672289691
Notes: A Brief Look at the Llama.cpp Code (4): All About Quantization
https://zhuanlan.zhihu.com/p/672983861
llama.cpp Source Code Analysis: CUDA Execution Flow and an Operator-by-Operator Walkthrough
https://zhuanlan.zhihu.com/p/665027154
https://www.bilibili.com/video/BV1Ez4y1w7fc
DeepSpeed
https://github.com/microsoft/DeepSpeed
From Zero to DeepSpeed: A Summary of Learning Distributed LLM Training
https://zhuanlan.zhihu.com/p/688873027
ZeRO: Zero Redundancy Optimizer, One Article Is All You Need
https://zhuanlan.zhihu.com/p/663517415
[Plain-Language] LLM Training: From GPU Memory Analysis to the Three Stages of DeepSpeed ZeRO
https://zhuanlan.zhihu.com/p/694880795
Breaking the memory barrier: how ZeRO revolutionizes large language model training
https://medium.com/@Shrishml/breaking-the-gmemory-barrier-how-zero-revolutionizes-large-language-model-training-8e00d2e2f30a
TensorRT-LLM
https://github.com/NVIDIA/TensorRT-LLM
TGI (text-generation-inference)
https://github.com/huggingface/text-generation-inference
Others
Index
An Overview of LLM Inference Optimization Systems Engineering
https://zhuanlan.zhihu.com/p/680635901
A Survey of LLM Inference Deployment Optimization Techniques
https://zhuanlan.zhihu.com/p/655557420
Modest Understandings on LLM
https://bytedance.larkoffice.com/docx/doxcn3zm448MK9sK6pHuPsqtH8f
NVIDIA Deep Learning Performance
https://docs.nvidia.com/deeplearning/performance/index.html
刀刀宁
https://www.zhihu.com/people/zzningxp
猛猿
https://www.zhihu.com/people/lemonround
手抓饼熊
https://www.zhihu.com/people/tongsanpang
Papers
A Survey on Efficient Inference for Large Language Models
http://arxiv.org/abs/2404.14294
Attention Is All You Need
https://arxiv.org/pdf/1706.03762
FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness
https://arxiv.org/pdf/2205.14135
MapReduce: Simplified Data Processing on Large Clusters
https://storage.googleapis.com/gweb-research2023-media/pubtools/4449.pdf
LLM Basics
LLM Visualization
https://bbycroft.net/llm
From Zero to GPT-3 and InstructGPT: A Summary of Learning LLMs
https://zhuanlan.zhihu.com/p/684034047
This post is all you need (Part 1): Peeling Back the Transformer Layer by Layer
https://zhuanlan.zhihu.com/p/420820453
This post is all you need (Part 2): Stepping into BERT
https://zhuanlan.zhihu.com/p/519432336
How large language models work, a visual intro to transformers | Chapter 5, Deep Learning
https://www.youtube.com/watch?v=wjZofJX0v4M
Attention in transformers, visually explained | Chapter 6, Deep Learning
https://www.youtube.com/watch?v=eMlx5fFNoYc
How might LLMs store facts | Chapter 7, Deep Learning
https://www.youtube.com/watch?v=9-Jl0dxWQs8
LLM: An Intuitive Understanding of Rotary Position Embedding (RoPE)
https://zhuanlan.zhihu.com/p/690610231
Q, K, V and the Multi-Head Attention Mechanism
https://zhuanlan.zhihu.com/p/669027091
LLM Inference Optimization Techniques: KV Cache
https://zhuanlan.zhihu.com/p/700197845