LLM

Structure

  • Embedding Layer

    • Token Embedding
    • Positional Embedding
  • Transformer

    • Multi-Headed Self-Attention
      • Query, Key, Value
    • Feed-Forward Neural Networks
      • Linear Layer
        • Up-projection
        • Down-projection
      • Activation Function
        • ReLU
    • Residual Connection & Layer Normalization
  • Decoding & Softmax
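
A minimal end-to-end sketch of the structure outlined above, in NumPy with toy dimensions and random weights: token + positional embedding, one single-head self-attention / FFN block with residual connections and layer norm, and a softmax over the vocabulary at the end. It uses learned absolute positions and ReLU for brevity, where most current LLMs use RoPE and GELU/SwiGLU variants; it is a data-flow illustration, not any particular model.

```python
import numpy as np

rng = np.random.default_rng(0)
V, D, D_FF, T = 100, 16, 64, 8   # vocab size, model dim, FFN dim, sequence length

# Embedding layer: token embedding + (learned) positional embedding
tok_emb = rng.normal(0, 0.02, (V, D))
pos_emb = rng.normal(0, 0.02, (T, D))

# Transformer block parameters (single head, no biases, for brevity)
W_q, W_k, W_v, W_o = (rng.normal(0, 0.02, (D, D)) for _ in range(4))
W_up, W_down = rng.normal(0, 0.02, (D, D_FF)), rng.normal(0, 0.02, (D_FF, D))

def layer_norm(x, eps=1e-5):
    mu, var = x.mean(-1, keepdims=True), x.var(-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def softmax(x):
    e = np.exp(x - x.max(-1, keepdims=True))
    return e / e.sum(-1, keepdims=True)

def block(x):
    # Multi-headed self-attention (one head here) with a causal mask, plus residual
    q, k, v = x @ W_q, x @ W_k, x @ W_v
    causal = np.where(np.arange(T)[None, :] > np.arange(T)[:, None], -np.inf, 0.0)
    scores = q @ k.T / np.sqrt(D) + causal
    x = x + softmax(scores) @ v @ W_o
    x = layer_norm(x)
    # FFN: up-projection, activation (ReLU here), down-projection, plus residual
    x = x + np.maximum(x @ W_up, 0) @ W_down
    return layer_norm(x)

tokens = rng.integers(0, V, T)                 # toy input token ids
h = tok_emb[tokens] + pos_emb[np.arange(T)]    # embedding layer
h = block(h)                                   # one transformer block (real models stack many)
logits = h @ tok_emb.T                         # decode via tied embeddings
next_token_probs = softmax(logits[-1])         # softmax over the vocabulary
print(next_token_probs.shape)                  # (100,)
```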

Memory Usage

  • Model Weights
  • KV Cache
  • Activation/Overhead Buffers
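
A back-of-the-envelope sketch of the three buckets above, for an assumed 7B-parameter, 32-layer model served in fp16 (the config numbers are illustrative, not any specific model's):

```python
def gib(n_bytes):
    return n_bytes / 2**30

# Illustrative 7B-class config (assumed, not any particular model)
params      = 7e9
n_layers    = 32
n_kv_heads  = 32          # fewer with MQA/GQA, which shrinks the KV cache
head_dim    = 128
dtype_bytes = 2           # fp16/bf16

# 1) Model weights: parameters * bytes per parameter
weights_bytes = params * dtype_bytes

# 2) KV cache: 2 (K and V) * layers * kv_heads * head_dim * tokens * bytes
def kv_cache_bytes(batch, seq_len):
    return 2 * n_layers * n_kv_heads * head_dim * batch * seq_len * dtype_bytes

print(f"weights:              {gib(weights_bytes):6.1f} GiB")   # ~13.0 GiB
print(f"KV cache (1 x 4096):  {gib(kv_cache_bytes(1, 4096)):6.1f} GiB")   # ~2.0 GiB
print(f"KV cache (16 x 4096): {gib(kv_cache_bytes(16, 4096)):6.1f} GiB")  # ~32.0 GiB
# 3) Activation/overhead buffers add a further workload-dependent amount on top,
#    which is why serving systems manage GPU memory explicitly.
```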

Inference Optimization

Metrics

When we talk about efficient inference, what are we actually discussing?

  • Cost
    • Computational Cost
    • Memory Access Cost
    • Memory Cost
  • Latency
    • First-token latency (TTFT)
    • Per-output-token latency (TPOT)
    • Generation latency
  • Performance
    • Accuracy & Precision
    • Robustness
    • Generalization
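
These latency metrics relate simply: generation latency ≈ first-token latency (prefill) + per-output-token latency (decode) × number of generated tokens. A small sketch of measuring them, with `fake_stream` as a hypothetical stand-in for a real streaming engine:

```python
import time

def measure_latency(generate_stream, prompt):
    """generate_stream is assumed to yield output tokens one at a time."""
    t0 = time.perf_counter()
    token_times = []
    for _ in generate_stream(prompt):
        token_times.append(time.perf_counter())

    first_token_latency = token_times[0] - t0          # TTFT, dominated by prefill
    generation_latency  = token_times[-1] - t0         # end-to-end
    per_output_token    = (generation_latency - first_token_latency) / max(len(token_times) - 1, 1)
    return first_token_latency, per_output_token, generation_latency

# Toy stand-in for a real engine, just to make the sketch runnable.
def fake_stream(prompt):
    time.sleep(0.05)          # "prefill"
    yield "Hello"
    for _ in range(9):
        time.sleep(0.01)      # "decode" steps
        yield "..."

ttft, tpot, total = measure_latency(fake_stream, "hi")
print(f"TTFT {ttft*1000:.0f} ms, TPOT {tpot*1000:.1f} ms, total {total*1000:.0f} ms")
```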

Overview

  • Data-level Optimization

    • Input Compression
      • Prompt Pruning
      • Prompt Summary
      • Soft Prompt-based Compression
      • Retrieval-Augmented Generation (RAG)
    • Output Organization
      • Skeleton-of-Thought (SoT)
      • Directed Acyclic Graph (DAG)
  • Model-level Optimization

    • Efficient Structure Design
      • Efficient FFN Design
        • Mixture-of-Experts (MoE)
      • Efficient Attention Design
        • Low-Complexity Attention
          • Kernel-Based Attention
          • Low-Rank Attention
        • Multi/Group-Query Attention (see the sketch after this outline)
      • Transformer Alternatives
    • Model Compression
      • Quantization
        • Post-Training Quantization (PTQ)
        • Quantization-aware Training (QAT)
      • Sparsification
        • Weight Pruning
        • Sparse Attention
      • Structure Optimization
        • Structure Factorization
        • Neural Architecture Search
      • Knowledge Distillation
        • White-box KD
        • Black-box KD
      • Dynamic Inference
        • Sample-level early exiting
        • Token-level early exiting
  • System-level Optimization

    • Inference Engine
      • Graph and Operator Optimization
        • Attention Operator Optimization
          • #FlashAttention
        • Graph-Level Optimization
        • Linear Operator Optimization
      • Speculative Decoding
    • Serving System
      • Memory Management
        • #LightLLM #TokenAttention
        • #vLLM #PagedAttention
        • #SGLang #RadixAttention
      • Continuous Batching
      • Scheduling
      • Distributed Systems
        • Data Parallelism
        • Model Parallelism
          • #OPER
        • Pipeline Parallelism
          • #PipeLLM
          • Layer-wise Parallelism
        • Tensor Parallelism
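
Of the model-level items above, Multi/Group-Query Attention is easy to show concretely: the model keeps all query heads but shares a smaller set of K/V heads among them, so the KV cache (and its memory traffic) shrinks by the ratio of query heads to KV heads. A toy NumPy sketch of the grouping step (causal mask omitted; sizes and weights are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
T, n_q_heads, n_kv_heads, head_dim = 8, 8, 2, 16   # 8 query heads share 2 KV heads
group = n_q_heads // n_kv_heads

q = rng.normal(size=(n_q_heads, T, head_dim))      # one Q tensor per query head
k = rng.normal(size=(n_kv_heads, T, head_dim))     # only n_kv_heads K/V tensors are cached
v = rng.normal(size=(n_kv_heads, T, head_dim))

def softmax(x):
    e = np.exp(x - x.max(-1, keepdims=True))
    return e / e.sum(-1, keepdims=True)

# Each group of query heads attends against the same shared K/V head.
k_shared = np.repeat(k, group, axis=0)             # (n_q_heads, T, head_dim)
v_shared = np.repeat(v, group, axis=0)
scores = q @ k_shared.transpose(0, 2, 1) / np.sqrt(head_dim)
out = softmax(scores) @ v_shared                   # (n_q_heads, T, head_dim)

# KV cache is n_q_heads / n_kv_heads = 4x smaller than with one K/V per query head.
print(out.shape)
```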

Methods

FlashAttention

#LightLLM

https://arxiv.org/pdf/2205.14135
https://github.com/Dao-AILab/flash-attention

⏱️ Understand FlashAttention in 78 Seconds [Kind of Interesting · 1]
https://www.bilibili.com/video/BV1Zz4y1q7FX

Illustrated LLM Inference Acceleration Series: FlashAttention V1, from Hardware to Computation Logic
https://zhuanlan.zhihu.com/p/669926191

Illustrated LLM Inference Acceleration Series: FlashAttention V2, from Principles to Parallel Computation
https://zhuanlan.zhihu.com/p/691067658

FlashAttention: Faster Training of GPT with Longer Context [Paper Skim · 6]
https://www.bilibili.com/video/BV1SW4y1X7kh

Why Is Flash Attention So Fast? The Principles Explained
https://www.bilibili.com/video/BV1UT421k7rA

FlashAttention: Faster Training of GPT with Longer Context
https://readpaper.feishu.cn/docx/AC7JdtLrhoKpgxxSRM8cfUounsh

MedAI #54: FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness | Tri Dao
https://www.youtube.com/watch?v=FThvfkXWqtE
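
The common thread of the paper and posts above: attention is computed tile by tile over K/V blocks that fit in fast on-chip memory, with the softmax accumulated online via a running row max and row sum, so the full T×T score matrix is never materialized in slow HBM. A NumPy sketch of that accumulation; it reproduces the numerics only, not the IO-aware kernel where the actual speedup comes from:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(-1, keepdims=True))
    return e / e.sum(-1, keepdims=True)

def attention_tiled(q, k, v, block=64):
    """Numerically equivalent to softmax(q k^T / sqrt(d)) v, computed over K/V tiles."""
    T, d = q.shape
    scale = 1.0 / np.sqrt(d)
    out = np.zeros_like(q)
    row_max = np.full(T, -np.inf)          # running max of scores per query row
    row_sum = np.zeros(T)                  # running softmax denominator per query row

    for start in range(0, T, block):       # iterate over K/V tiles
        k_blk, v_blk = k[start:start + block], v[start:start + block]
        s = (q @ k_blk.T) * scale                          # scores for this tile only
        new_max = np.maximum(row_max, s.max(axis=1))
        correction = np.exp(row_max - new_max)             # rescale previous accumulators
        p = np.exp(s - new_max[:, None])
        out = out * correction[:, None] + p @ v_blk
        row_sum = row_sum * correction + p.sum(axis=1)
        row_max = new_max
    return out / row_sum[:, None]                          # normalize once at the end

rng = np.random.default_rng(0)
q, k, v = (rng.normal(size=(256, 32)) for _ in range(3))
ref = softmax(q @ k.T / np.sqrt(32)) @ v
print(np.allclose(attention_tiled(q, k, v), ref))          # True
```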

PagedAttention

#vLLM

Illustrated LLM Inference Acceleration Series: PagedAttention, vLLM's Core Technique
https://zhuanlan.zhihu.com/p/691038809
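
The idea, borrowed from OS paging: the KV cache is stored in fixed-size blocks that need not be contiguous, and each sequence keeps a block table from logical token positions to physical blocks, so memory is allocated on demand and fragmentation stays bounded. A toy allocator sketch (illustrative only, not vLLM's implementation):

```python
BLOCK_SIZE = 16                                      # tokens per KV block

class PagedKVCache:
    """Toy block-table allocator in the spirit of PagedAttention (not vLLM's code)."""
    def __init__(self, num_blocks):
        self.free_blocks = list(range(num_blocks))   # physical block ids
        self.block_tables = {}                       # seq_id -> [physical block ids]
        self.lengths = {}                            # seq_id -> tokens stored so far

    def append_token(self, seq_id):
        table = self.block_tables.setdefault(seq_id, [])
        n = self.lengths.get(seq_id, 0)
        if n % BLOCK_SIZE == 0:                      # current block full: map a new one
            if not self.free_blocks:
                raise MemoryError("KV cache exhausted; preempt or swap a sequence")
            table.append(self.free_blocks.pop())
        self.lengths[seq_id] = n + 1
        block, offset = table[n // BLOCK_SIZE], n % BLOCK_SIZE
        return block, offset                         # where this token's K/V would be written

    def free(self, seq_id):
        self.free_blocks.extend(self.block_tables.pop(seq_id, []))
        self.lengths.pop(seq_id, None)

cache = PagedKVCache(num_blocks=4)
for _ in range(20):                                  # 20 tokens -> 2 blocks for sequence 0
    cache.append_token(seq_id=0)
print(cache.block_tables[0], cache.lengths[0])       # [3, 2] 20
cache.free(0)                                        # blocks return to the free pool
```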

TokenAttention

#LightLLM

RadixAttention

#SGLang

Recent Progress in LLM Inference on Attention Operators and KV Cache Reuse
https://www.zhihu.com/question/637480772/answer/3391893087
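
RadixAttention indexes the KV cache of past and running requests with a radix tree over token sequences, so a new request that shares a prefix (e.g. the same system prompt or few-shot examples) reuses that prefix's KV instead of recomputing it. A toy sketch with a plain per-token trie standing in for the radix tree (not SGLang's implementation):

```python
class PrefixCacheNode:
    """Toy per-token trie node; the kv_handle would point at real cached K/V."""
    def __init__(self):
        self.children = {}        # token id -> child node
        self.kv_handle = None

def cached_prefix_len(root, tokens):
    """How many leading tokens already have KV indexed in the tree."""
    node, matched = root, 0
    for t in tokens:
        if t not in node.children:
            break
        node, matched = node.children[t], matched + 1
    return matched

def insert(root, tokens):
    """Index a (now computed) token sequence so later requests can reuse its KV."""
    node = root
    for i, t in enumerate(tokens):
        node = node.children.setdefault(t, PrefixCacheNode())
        node.kv_handle = node.kv_handle or f"kv[0:{i + 1}]"   # placeholder handle

root = PrefixCacheNode()
system_prompt = [1, 2, 3, 4, 5]
print(cached_prefix_len(root, system_prompt + [10, 11]))  # 0: nothing cached yet
insert(root, system_prompt + [10, 11])
print(cached_prefix_len(root, system_prompt + [20, 21]))  # 5: shared system prompt is reused
```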

Continuous Batching

#llamacpp #vLLM #LightLLM

How continuous batching enables 23x throughput in LLM inference while reducing p50 latency
https://www.anyscale.com/blog/continuous-batching-llm-inference
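
In contrast to static batching (collect a batch, run it to completion), continuous batching re-forms the batch at every decode step: finished sequences leave immediately and queued requests take their slots, which is where the throughput gain in the post above comes from. A toy scheduler loop (illustrative; real engines also interleave prefill and respect KV cache limits):

```python
from collections import deque
from dataclasses import dataclass, field

@dataclass
class Request:
    rid: int
    max_new_tokens: int
    generated: list = field(default_factory=list)

def decode_step(batch):
    """Stand-in for one forward pass that appends one token to every running request."""
    for req in batch:
        req.generated.append(f"tok{len(req.generated)}")

def continuous_batching(requests, max_batch_size=4):
    waiting = deque(requests)
    running, finished = [], []
    while waiting or running:
        # Admit queued requests into free slots at every step, not once per batch.
        while waiting and len(running) < max_batch_size:
            running.append(waiting.popleft())
        decode_step(running)
        # Retire finished sequences immediately so their slots are reused next step.
        still_running = []
        for req in running:
            (finished if len(req.generated) >= req.max_new_tokens else still_running).append(req)
        running = still_running
    return finished

done = continuous_batching([Request(i, max_new_tokens=2 + 3 * (i % 3)) for i in range(8)])
print([(r.rid, len(r.generated)) for r in done])
```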

Inference Frameworks

SGLang

https://github.com/sgl-project/sglang

SGLang Code Walk Through
https://github.com/zhaochenyang20/Awesome-ML-SYS-Tutorial/tree/main/sglang/code-walk-through

vLLM

https://github.com/vllm-project/vllm

Concurrent Accelerated LLM Deployment: An Analysis of Several Widely Used Serving Solutions
https://www.bilibili.com/video/BV1du4m1K7eF

The AI Acceleration Showdown: vLLM vs. TGI in the Race for Efficient LLM Deployment
https://runaker.medium.com/the-ai-acceleration-showdown-vllm-vs-tgi-in-the-race-for-efficient-llm-deployment-13fe90c635be

Illustrated LLM Inference Acceleration Series: vLLM Source Code Analysis 1, Overall Architecture
https://zhuanlan.zhihu.com/p/691045737

Illustrated LLM Inference Acceleration Series: vLLM Source Code Analysis 2, the Scheduler
https://zhuanlan.zhihu.com/p/692540949

Illustrated LLM Inference Acceleration Series: vLLM Source Code Analysis 3, the BlockManager (Part 1)
https://zhuanlan.zhihu.com/p/700780161

Illustrated LLM Inference Acceleration Series: vLLM Source Code Analysis 3, Prefix Caching
https://zhuanlan.zhihu.com/p/707228704
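
For orientation alongside the source-code walkthroughs above, the offline entry point is small. A minimal usage sketch following vLLM's documented quickstart (model name and sampling values are arbitrary examples; exact APIs can shift between vLLM versions):

```python
from vllm import LLM, SamplingParams

# Engine construction loads the weights and pre-allocates the paged KV cache.
llm = LLM(model="facebook/opt-125m")          # any HF model id; a small one chosen arbitrarily
params = SamplingParams(temperature=0.8, top_p=0.95, max_tokens=64)

outputs = llm.generate(["The key idea of PagedAttention is"], params)
for out in outputs:
    print(out.prompt, "->", out.outputs[0].text)
```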

LightLLM

https://github.com/ModelTC/lightllm

LightLLM, a Lightweight High-Performance Inference Framework: How Does It Compare to vLLM?
https://www.bilibili.com/video/BV18G411U77C

lightllm Code Walkthrough
https://www.zhihu.com/people/robindu/posts

llama.cpp

https://github.com/ggerganov/llama.cpp

Notes: A Look at the Llama.cpp Code (1): Parallelism and the KV Cache
https://zhuanlan.zhihu.com/p/670515231

Notes: A Look at the Llama.cpp Code (2): Data Structures and Sampling Methods
https://zhuanlan.zhihu.com/p/671761052

Notes: A Look at the Llama.cpp Code (3): Computation Costs
https://zhuanlan.zhihu.com/p/672289691

Notes: A Look at the Llama.cpp Code (4): All About Quantization
https://zhuanlan.zhihu.com/p/672983861

llama.cpp Source Code Analysis: CUDA Execution Flow and an Operator-by-Operator Walkthrough
https://zhuanlan.zhihu.com/p/665027154
https://www.bilibili.com/video/BV1Ez4y1w7fc

DeepSpeed

https://github.com/microsoft/DeepSpeed

From Knowing Nothing to DeepSpeed: A Summary of Learning Distributed Training for Large Models
https://zhuanlan.zhihu.com/p/688873027

ZeRO: Zero Redundancy Optimizer, One Article Is All You Need
https://zhuanlan.zhihu.com/p/663517415

[Plain Language] LLM Training: From GPU Memory Usage Analysis to the Three Stages of DeepSpeed ZeRO
https://zhuanlan.zhihu.com/p/694880795

Breaking the memory barrier: how ZeRO revolutionizes large model training
https://medium.com/@Shrishml/breaking-the-gmemory-barrier-how-zero-revolutionizes-large-language-model-training-8e00d2e2f30a

TensorRT-LLM

https://github.com/NVIDIA/TensorRT-LLM

TGI (text-generation-inference)

https://github.com/huggingface/text-generation-inference

Others

Index

An Overview of the Systems Engineering of LLM Inference Optimization
https://zhuanlan.zhihu.com/p/680635901

A Survey of LLM Inference and Deployment Optimization Techniques
https://zhuanlan.zhihu.com/p/655557420

Modest Understandings on LLM
https://bytedance.larkoffice.com/docx/doxcn3zm448MK9sK6pHuPsqtH8f

NVIDIA Deep Learning Performance
https://docs.nvidia.com/deeplearning/performance/index.html

刀刀宁
https://www.zhihu.com/people/zzningxp

猛猿
https://www.zhihu.com/people/lemonround

手抓饼熊
https://www.zhihu.com/people/tongsanpang

Papers

A Survey on Efficient Inference for Large Language Models
http://arxiv.org/abs/2404.14294

Attention Is All You Need
https://arxiv.org/pdf/1706.03762

FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness
https://arxiv.org/pdf/2205.14135

MapReduce: Simplified Data Processing on Large Clusters
https://storage.googleapis.com/gweb-research2023-media/pubtools/4449.pdf

LLM Basics

LLM Visualization
https://bbycroft.net/llm

From Knowing Nothing to GPT-3 and InstructGPT: A Summary of Learning LLMs
https://zhuanlan.zhihu.com/p/684034047

This post is all you need (Part 1): Peeling Back the Transformer Layer by Layer
https://zhuanlan.zhihu.com/p/420820453

This post is all you need (Part 2): Stepping into BERT
https://zhuanlan.zhihu.com/p/519432336

How large language models work, a visual intro to transformers | Chapter 5, Deep Learning
https://www.youtube.com/watch?v=wjZofJX0v4M

Attention in transformers, visually explained | Chapter 6, Deep Learning
https://www.youtube.com/watch?v=eMlx5fFNoYc

How might LLMs store facts | Chapter 7, Deep Learning
https://www.youtube.com/watch?v=9-Jl0dxWQs8

LLM: An Intuitive Understanding of Rotary Position Embedding (RoPE)
https://zhuanlan.zhihu.com/p/690610231

Q, K, V and the Multi-Head Attention Mechanism
https://zhuanlan.zhihu.com/p/669027091

LLM Inference Optimization Techniques: the KV Cache
https://zhuanlan.zhihu.com/p/700197845
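
On the KV cache topic in the last link: during decoding, K and V of past tokens are cached, so each step computes Q/K/V only for the new token and one row of attention scores against the cached keys, instead of re-running attention over the whole prefix. A single-head NumPy sketch of one decode step (toy shapes, random weights):

```python
import numpy as np

rng = np.random.default_rng(0)
D = 16
W_q, W_k, W_v = (rng.normal(0, 0.02, (D, D)) for _ in range(3))

def softmax(x):
    e = np.exp(x - x.max(-1, keepdims=True))
    return e / e.sum(-1, keepdims=True)

def decode_step(x_new, k_cache, v_cache):
    """Attend one new token against all previously cached keys/values."""
    q = x_new @ W_q                                  # (D,)  only the new token's query
    k_cache = np.vstack([k_cache, x_new @ W_k])      # append the new K row to the cache
    v_cache = np.vstack([v_cache, x_new @ W_v])      # append the new V row to the cache
    scores = k_cache @ q / np.sqrt(D)                # (T,)  one row of attention scores
    out = softmax(scores) @ v_cache                  # (D,)
    return out, k_cache, v_cache

# Prefill: cache K/V for a 5-token prompt, then decode 3 more tokens one at a time.
prompt = rng.normal(size=(5, D))
k_cache, v_cache = prompt @ W_k, prompt @ W_v
for step in range(3):
    x_new = rng.normal(size=D)                       # stand-in for the next token's embedding
    out, k_cache, v_cache = decode_step(x_new, k_cache, v_cache)
print(k_cache.shape)   # (8, 16): the cache grows by one row per decoded token
```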