Structure

  • Embedding Layer

    • Token Embedding
    • Positional Embedding
  • Transformer

    • Multi-Head Self-Attention
      • Query, Key, Value
    • Feed-Forward Neural Networks
      • Linear Layer
        • Up-projection
        • Down-projection
      • Activation Function
        • ReLU
    • Residual Connection & Layer Normalization
  • Decoding & Softmax
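
A minimal PyTorch sketch of the structure outlined above (one decoder block plus embeddings and the output softmax; no dropout, no KV cache, and all sizes are illustrative rather than any particular model's):

  import torch
  import torch.nn as nn

  class DecoderBlock(nn.Module):
      """One Transformer block: self-attention and FFN, each wrapped in residual + LayerNorm."""
      def __init__(self, d_model=512, n_heads=8, d_ff=2048):
          super().__init__()
          self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
          self.ffn = nn.Sequential(          # feed-forward network
              nn.Linear(d_model, d_ff),      # up-projection
              nn.ReLU(),                     # activation function
              nn.Linear(d_ff, d_model),      # down-projection
          )
          self.ln1 = nn.LayerNorm(d_model)
          self.ln2 = nn.LayerNorm(d_model)

      def forward(self, x):
          # causal mask: each token attends only to itself and earlier positions
          mask = torch.triu(torch.ones(x.size(1), x.size(1), dtype=torch.bool,
                                       device=x.device), diagonal=1)
          a, _ = self.attn(x, x, x, attn_mask=mask)    # Q, K, V all derived from x
          x = self.ln1(x + a)                          # residual connection + LayerNorm
          x = self.ln2(x + self.ffn(x))
          return x

  class TinyLM(nn.Module):
      def __init__(self, vocab=32000, d_model=512, n_layers=2, max_len=1024):
          super().__init__()
          self.tok = nn.Embedding(vocab, d_model)      # token embedding
          self.pos = nn.Embedding(max_len, d_model)    # learned positional embedding
          self.blocks = nn.ModuleList(DecoderBlock(d_model) for _ in range(n_layers))
          self.head = nn.Linear(d_model, vocab)        # decoding: hidden state -> logits

      def forward(self, ids):
          x = self.tok(ids) + self.pos(torch.arange(ids.size(1), device=ids.device))
          for blk in self.blocks:
              x = blk(x)
          return self.head(x).softmax(dim=-1)          # softmax over the vocabulary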

Memory Usage

  • Model Weights
  • KV Cache
  • Activation/Overhead Buffers
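
A back-of-the-envelope sketch of the first two buckets (the activation/overhead share depends heavily on the engine). KV cache size follows 2 (K and V) × layers × kv_heads × head_dim × seq_len × batch × bytes; the numbers below are illustrative, roughly 7B-model-shaped:

  def kv_cache_bytes(layers, kv_heads, head_dim, seq_len, batch, bytes_per=2):
      # two tensors (K and V) per layer, each [batch, kv_heads, seq_len, head_dim]
      return 2 * layers * kv_heads * head_dim * seq_len * batch * bytes_per

  weights = 7e9 * 2                                    # 7B params @ FP16 ~= 13 GiB
  kv = kv_cache_bytes(layers=32, kv_heads=32, head_dim=128,
                      seq_len=4096, batch=8)           # = 16 GiB
  print(f"weights ~{weights / 2**30:.1f} GiB, KV cache ~{kv / 2**30:.1f} GiB")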

Inference Optimization

Metrics

When we talk about efficient inference, what are we actually discussing?

  • Cost
    • Computational Cost
    • Memory Access Cost
    • Memory Cost
  • Latency
    • First token latency
    • Per-output token latency
    • Generation latency
  • Performance
    • Accuracy & Precision
    • Robustness
    • Generalization
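
The three latency terms compose simply. A hypothetical sketch of how they relate, given the request arrival time and a completion timestamp per output token:

  def latency_metrics(t_request, t_tokens):
      """t_request: prompt arrival time; t_tokens: completion time of each output token."""
      first_token = t_tokens[0] - t_request             # first-token latency
      generation = t_tokens[-1] - t_request             # end-to-end generation latency
      per_token = (t_tokens[-1] - t_tokens[0]) / max(len(t_tokens) - 1, 1)
      return first_token, per_token, generation         # per_token ~ per-output-token latency

  print(latency_metrics(0.0, [0.8, 0.85, 0.9, 0.95]))   # -> (0.8, 0.05, 0.95)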

Overview

  • Data-level Optimization

    • Input Compression
      • Prompt Pruning
      • Prompt Summary
      • Soft Prompt-based Compression
      • Retrieval-Augmented Generation (RAG)
    • Output Organization
      • Skeleton-of-Thought (SoT)
      • Directed Acyclic Graph (DAG)
  • Model-level Optimization

    • Efficient Structure Design
      • Efficient FFN Design
        • Mixture-of-Experts (MoE)
      • Efficient Attention Design
        • Low-Complexity Attention
          • Kernel-Based Attention
          • Low-Rank Attention
        • Multi/Group-Query Attention (see the sketch after this outline)
      • Transformer Alternatives
    • Model Compression
      • Quantization
        • Post-Training Quantization (PTQ)
        • Quantization-aware Training (QAT)
      • Sparsification
        • Weight Pruning
        • Sparse Attention
      • Structure Optimization
        • Structure Factorization
        • Neural Architecture Search
      • Knowledge Distillation
        • White-box KD
        • Black-box KD
      • Dynamic Inference
        • Sample-level early exiting
        • Token-level early exiting
  • System-level Optimization

    • Inference Engine
      • Graph and Operator Optimization
        • Attention Operator Optimization
          • #FlashAttention
        • Graph-Level Optimization
        • Linear Operator Optimization
      • Speculative Decoding
    • Serving System
      • Memory Management
        • #LightLLM #TokenAttention
        • #vLLM #PagedAttention
        • #SGLang #RadixAttention
      • Continuous Batching
      • Scheduling
      • Distributed Systems
        • Data Parallelism
        • Model Parallelism
          • #OPER
        • Pipeline Parallelism
          • #PipeLLM
          • Layer-wise Parallelism
        • Tensor Parallelism
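
Of the items above, Multi/Group-Query Attention is compact enough to sketch inline: several query heads share a single K/V head, shrinking the KV cache by the group factor (MQA is the extreme case of one K/V head). A minimal NumPy sketch with illustrative shapes:

  import numpy as np

  def gqa(q, k, v, n_groups):
      """q: [n_q_heads, seq, d]; k, v: [n_kv_heads, seq, d], n_kv_heads = n_q_heads // n_groups."""
      # every group of query heads reuses the same K/V head -> KV cache is n_groups x smaller
      k = np.repeat(k, n_groups, axis=0)                # broadcast KV heads up to query heads
      v = np.repeat(v, n_groups, axis=0)
      scores = q @ k.transpose(0, 2, 1) / np.sqrt(q.shape[-1])
      w = np.exp(scores - scores.max(-1, keepdims=True))
      w /= w.sum(-1, keepdims=True)                     # softmax over the key positions
      return w @ v

  q = np.random.randn(8, 16, 64)                        # 8 query heads
  out = gqa(q, np.random.randn(2, 16, 64), np.random.randn(2, 16, 64), n_groups=4)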

Methods

FlashAttention

#LightLLM

https://arxiv.org/pdf/2205.14135
https://github.com/Dao-AILab/flash-attention

⏱️ Understand FlashAttention in 78 Seconds [Kind of Interesting · 1]
https://www.bilibili.com/video/BV1Zz4y1q7FX

Illustrated LLM Inference Acceleration Series: FlashAttention V1, from Hardware to Computation Logic
https://zhuanlan.zhihu.com/p/669926191

Illustrated LLM Inference Acceleration Series: Flash Attention V2, from Principles to Parallel Computation
https://zhuanlan.zhihu.com/p/691067658

FlashAttention: Faster Training of Longer-Context GPT [Paper Skim · 6]
https://www.bilibili.com/video/BV1SW4y1X7kh

Why Is Flash Attention So Fast? An Explanation of the Principles
https://www.bilibili.com/video/BV1UT421k7rA

FlashAttention: Faster Training of Longer-Context GPT
https://readpaper.feishu.cn/docx/AC7JdtLrhoKpgxxSRM8cfUounsh

MedAI #54: FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness | Tri Dao
https://www.youtube.com/watch?v=FThvfkXWqtE
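
As a reading aid for the links above, a minimal NumPy sketch of the core idea (tile K/V through fast memory while maintaining running softmax statistics, so the full seq×seq score matrix never materializes; single head, no masking, and this is an emulation of the math, not the fused CUDA kernel):

  import numpy as np

  def flash_attention(q, k, v, block=64):
      """q, k, v: [seq, d]. Exact attention, computed one K/V tile at a time."""
      seq, d = q.shape
      out = np.zeros_like(q)
      m = np.full(seq, -np.inf)                     # running row-wise max of the scores
      l = np.zeros(seq)                             # running softmax denominator
      for s in range(0, seq, block):
          kb, vb = k[s:s+block], v[s:s+block]       # one tile, sized to fit in SRAM
          scores = q @ kb.T / np.sqrt(d)            # [seq, block]
          m_new = np.maximum(m, scores.max(-1))
          p = np.exp(scores - m_new[:, None])       # this tile's unnormalized weights
          scale = np.exp(m - m_new)                 # rescale old stats to the new max
          l = l * scale + p.sum(-1)
          out = out * scale[:, None] + p @ vb
          m = m_new
      return out / l[:, None]

  q = np.random.randn(256, 64)
  o = flash_attention(q, np.random.randn(256, 64), np.random.randn(256, 64))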

PagedAttention

#vLLM

Illustrated LLM Inference Acceleration Series: How PagedAttention, vLLM's Core Technique, Works
https://zhuanlan.zhihu.com/p/691038809
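
A toy sketch of the idea: the KV cache lives in fixed-size physical blocks, and a per-sequence block table maps logical token positions to physical blocks, much like OS paging. The structures below are illustrative, not vLLM's actual API:

  BLOCK = 16                                    # tokens per physical KV block

  class PagedKVCache:
      def __init__(self, n_blocks):
          self.free = list(range(n_blocks))     # pool of physical block ids
          self.tables = {}                      # seq_id -> list of physical block ids

      def append(self, seq_id, n_tokens_so_far):
          """Allocate a new physical block only when a sequence crosses a block boundary."""
          table = self.tables.setdefault(seq_id, [])
          if n_tokens_so_far % BLOCK == 0:      # current block full (or first token)
              table.append(self.free.pop())     # no contiguous reservation, no fragmentation

      def locate(self, seq_id, pos):
          """Map a logical token position to (physical block, offset) for the attention kernel."""
          return self.tables[seq_id][pos // BLOCK], pos % BLOCK

  cache = PagedKVCache(n_blocks=1024)
  for t in range(40):                           # generate 40 tokens for sequence 0
      cache.append(0, t)
  print(cache.locate(0, 37))                    # third allocated block, offset 5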

TokenAttention

#LightLLM

RadixAttention

#SGLang

Recent Progress in LLM Inference on Attention Operators and KV Cache Reuse
https://www.zhihu.com/question/637480772/answer/3391893087
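
A toy sketch of the prefix-reuse idea: a radix tree (simplified here to a plain trie, with no eviction) over token ids, where each node holds the cached KV entries for its prefix, so a request sharing a prompt prefix skips recomputing it. Names are illustrative, not SGLang's implementation:

  class RadixNode:
      def __init__(self):
          self.children = {}            # token id -> RadixNode
          self.kv = None                # handle to the cached KV entries for this token

  def match_prefix(root, tokens):
      """Count how many leading tokens already have cached KV."""
      node, hit = root, 0
      for t in tokens:
          if t not in node.children:
              break
          node = node.children[t]
          hit += 1
      return hit                        # only tokens[hit:] need a prefill pass

  def insert(root, tokens):
      node = root
      for t in tokens:
          node = node.children.setdefault(t, RadixNode())

  root = RadixNode()
  insert(root, [1, 2, 3, 4, 5])               # first request's prompt
  print(match_prefix(root, [1, 2, 3, 9]))     # -> 3: shared prefix [1, 2, 3] is reused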

Continuous Batching

#llamacpp #vLLM #LightLLM

How continuous batching enables 23x throughput in LLM inference while reducing p50 latency
https://www.anyscale.com/blog/continuous-batching-llm-inference
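
A schematic of the scheduling loop: new requests join the running batch at every decode step and finished sequences leave immediately, instead of waiting for a whole static batch to drain. The model call and request objects below are stand-ins:

  from collections import deque

  def serve(waiting, decode_step, max_batch=8):
      """decode_step(batch) advances every sequence by one token and returns the finished ones."""
      running = []
      while waiting or running:
          # iteration-level scheduling: top the batch up before every single step
          while waiting and len(running) < max_batch:
              running.append(waiting.popleft())
          done = decode_step(running)                      # one token per running sequence
          running = [s for s in running if s not in done]  # free slots immediately

  # toy "model": each sequence is just a countdown of tokens left to generate
  def decode_step(batch):
      for seq in batch:
          seq["left"] -= 1
      return [s for s in batch if s["left"] == 0]

  serve(deque({"left": n} for n in (3, 1, 4, 1, 5)), decode_step)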

Inference Frameworks

SGLang

https://github.com/sgl-project/sglang

SGLang Code Walk Through
https://github.com/zhaochenyang20/Awesome-ML-SYS-Tutorial/tree/main/sglang/code-walk-through

vLLM

https://github.com/vllm-project/vllm

Concurrent Accelerated LLM Deployment: Analyzing Several Widely Used Schemes
https://www.bilibili.com/video/BV1du4m1K7eF

The AI Acceleration Showdown: vLLM vs. TGI in the Race for Efficient LLM Deployment
https://runaker.medium.com/the-ai-acceleration-showdown-vllm-vs-tgi-in-the-race-for-efficient-llm-deployment-13fe90c635be

Illustrated LLM Inference Acceleration Series: vLLM Source Code Analysis 1, Overall Architecture
https://zhuanlan.zhihu.com/p/691045737

Illustrated LLM Inference Acceleration Series: vLLM Source Code Analysis 2, Scheduler Policy
https://zhuanlan.zhihu.com/p/692540949

Illustrated LLM Inference Acceleration Series: vLLM Source Code Analysis 3, the BlockManager (Part 1)
https://zhuanlan.zhihu.com/p/700780161

Illustrated LLM Inference Acceleration Series: vLLM Source Code Analysis 3, Prefix Caching
https://zhuanlan.zhihu.com/p/707228704

LightLLM

https://github.com/ModelTC/lightllm

LightLLM, a Lightweight High-Performance Inference Framework: Is It Stronger than vLLM?
https://www.bilibili.com/video/BV18G411U77C

lightllm Code Walkthrough
https://www.zhihu.com/people/robindu/posts

llama.cpp

https://github.com/ggerganov/llama.cpp

Notes: A Brief Analysis of the Llama.cpp Code (1): Parallelism and the KV Cache
https://zhuanlan.zhihu.com/p/670515231

Notes: A Brief Analysis of the Llama.cpp Code (2): Data Structures and Sampling Methods
https://zhuanlan.zhihu.com/p/671761052

Notes: A Brief Analysis of the Llama.cpp Code (3): Computation Costs
https://zhuanlan.zhihu.com/p/672289691

Notes: A Brief Analysis of the Llama.cpp Code (4): All About Quantization
https://zhuanlan.zhihu.com/p/672983861

llama.cpp Source Code Analysis: CUDA Version Workflow and a Detailed Operator-by-Operator Walkthrough
https://zhuanlan.zhihu.com/p/665027154
https://www.bilibili.com/video/BV1Ez4y1w7fc

DeepSpeed

https://github.com/microsoft/DeepSpeed

From Knowing Nothing to DeepSpeed: A Summary of Learning Distributed LLM Training
https://zhuanlan.zhihu.com/p/688873027

ZeRO: Zero Redundancy Optimizer, One Article Is Enough
https://zhuanlan.zhihu.com/p/663517415

[Plain and Readable] LLM Training: From VRAM Usage Analysis to the Three Stages of DeepSpeed ZeRO
https://zhuanlan.zhihu.com/p/694880795

Breaking the memory barrier: how ZeRO revolutionizes large model training
https://medium.com/@Shrishml/breaking-the-gmemory-barrier-how-zero-revolutionizes-large-language-model-training-8e00d2e2f30a

TensorRT-LLM

https://github.com/NVIDIA/TensorRT-LLM

TGI (text-generation-inference)

https://github.com/huggingface/text-generation-inference

Others

Index

An Overview of LLM Inference Optimization Systems Engineering
https://zhuanlan.zhihu.com/p/680635901

A Survey of LLM Inference and Deployment Optimization Techniques
https://zhuanlan.zhihu.com/p/655557420

Modest Understandings on LLM
https://bytedance.larkoffice.com/docx/doxcn3zm448MK9sK6pHuPsqtH8f

NVIDIA Deep Learning Performance
https://docs.nvidia.com/deeplearning/performance/index.html

刀刀宁
https://www.zhihu.com/people/zzningxp

猛猿
https://www.zhihu.com/people/lemonround

手抓饼熊
https://www.zhihu.com/people/tongsanpang

Papers

A Survey on Efficient Inference for Large Language Models
http://arxiv.org/abs/2404.14294

Attention Is All You Need
https://arxiv.org/pdf/1706.03762

FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness
https://arxiv.org/pdf/2205.14135

MapReduce: Simplified Data Processing on Large Clusters
https://storage.googleapis.com/gweb-research2023-media/pubtools/4449.pdf

LLM Basic

LLM Visualization
https://bbycroft.net/llm

From Knowing Nothing to GPT-3 and InstructGPT: A Summary of My LLM Learning Process
https://zhuanlan.zhihu.com/p/684034047

This post is all you need (Part 1): Peeling Back the Transformer Layer by Layer
https://zhuanlan.zhihu.com/p/420820453

This post is all you need (Part 2): Stepping into BERT
https://zhuanlan.zhihu.com/p/519432336

How large language models work, a visual intro to transformers | Chapter 5, Deep Learning
https://www.youtube.com/watch?v=wjZofJX0v4M

Attention in transformers, visually explained | Chapter 6, Deep Learning
https://www.youtube.com/watch?v=eMlx5fFNoYc

How might LLMs store facts | Chapter 7, Deep Learning
https://www.youtube.com/watch?v=9-Jl0dxWQs8

LLM: An Intuitive Understanding of Rotary Position Embedding (RoPE)
https://zhuanlan.zhihu.com/p/690610231

Q, K, V and the Multi-Head Attention Mechanism
https://zhuanlan.zhihu.com/p/669027091

LLM Inference Optimization Techniques: KV Cache
https://zhuanlan.zhihu.com/p/700197845

Overview

  • Nvidia Graphics Card
    • GPU
      • GPC (Graphics Processing Clusters)
        • Raster Engine
        • TPC (Texture Processing Clusters)
          • SM (Streaming Multiprocessor)
            • Warp Scheduler
            • Dispatch Unit
            • L0 Instruction Cache
            • L1 Data Cache / Shared Memory
            • Register File
            • CUDA Core (Compute Unified Device Architecture)
            • Tensor Core
            • RT Core (Ray Tracing Core)
            • Tex (Texture Unit)
            • LD/ST (Load/Store Unit)
            • SFU (Special Function Unit)
      • L2 Cache
      • NVENC
      • NVDEC
      • Memory Controller
      • PCIe Host Interface
    • VRAM
    • Interface
      • PCIe
      • NVLink
      • Display Output
        • HDMI
        • DP
    • Power
    • Cooling

GPU

Architecture

  • Nvidia Fermi

  • Nvidia Kepler

  • Nvidia Maxwell

  • Nvidia Pascal

    • Add NVLink
    • e.g. Nvidia P100, Nvidia GTX 10 Series
  • Nvidia Volta

    • Add Tensor Core
    • e.g. Nvidia V100
  • Nvidia Turing

    • Add RT Core
    • e.g. Nvidia GTX 16 Series, Nvidia RTX 20 Series
  • Nvidia Ampere

    • e.g. Nvidia RTX 30 Series, Nvidia A100
  • Nvidia Ada-Lovelace

    • e.g. Nvidia RTX 40 Series
  • Nvidia Hopper

    • e.g. Nvidia H100
  • Nvidia Blackwell

CUDA Core

  • High precision: FP64, FP32, FP16, INT32

Tensor Core

  • Low precision: FP16, INT8
  • Specialized for matrix multiply-accumulate (MMA) operations
  • Mixed Precision
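
A minimal NumPy emulation of the mixed-precision pattern Tensor Cores implement in hardware (D = A×B + C with FP16 inputs and FP32 accumulation); this only illustrates the numerics, it is not Tensor Core code:

  import numpy as np

  a = np.random.randn(16, 16).astype(np.float16)    # FP16 inputs, as the hardware reads them
  b = np.random.randn(16, 16).astype(np.float16)
  c = np.zeros((16, 16), dtype=np.float32)          # FP32 accumulator

  # accumulating in FP32 keeps long sums of FP16 products from losing precision
  d = a.astype(np.float32) @ b.astype(np.float32) + c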

VRAM

Parameter

  • Capacity
  • Latency
  • Bandwidth
    • Data Rate (effective transfer rate per pin)
    • Memory Bus Width

128-bit = 4/8/16 GB
160-bit = 10 GB
192-bit = 3/6/12 GB
256-bit = 4/8/16 GB
320-bit = 10/20 GB
352-bit = 11/22 GB
384-bit = 6/12/24 GB
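
Peak bandwidth follows from these two parameters: bandwidth = per-pin data rate × bus width / 8. A worked example with approximate RTX 4090-class numbers:

  data_rate_gbps = 21                 # GDDR6X effective data rate per pin, ~21 Gbit/s
  bus_width_bits = 384
  bandwidth_gbs = data_rate_gbps * bus_width_bits / 8   # = 1008 GB/s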

Type

  • GDDR (Graphics Double Data Rate)

    • GDDR5
    • GDDR5X
    • GDDR6
    • GDDR6X
  • HBM (High Bandwidth Memory)

    • HBM
    • HBM2
    • HBM2e
    • HBM3
    • HBM3e

Power

  • Power Phases

Interconnect

  • Hardware
    • Intra-Machine
      • Shared Memory
      • PCIe
      • NVLink
    • Inter-Machine
      • InfiniBand
      • TCP/IP Sockets
      • RDMA (Remote Direct Memory Access)
        • RoCE
  • Software
    • MPI
    • GLOO
    • XCCL

RDMA

  • CPU Offload
  • Kernel Bypass
  • Zero Copy

Software Tech

  • Ray Tracing
  • DLSS (Deep Learning Super Sampling)
  • CUDA
    • cuDNN (CUDA Deep Neural Network Library)
    • TensorRT

References

A Deep Dive into GPU Principles
https://www.bilibili.com/video/BV1bm4y1m7Ki

How GPUs Work
https://www.bilibili.com/video/BV17L4y1a7Xy

RTX 40 Series GPU Review, Prologue: How Big Are the Changes in the New Ada Architecture?
https://www.bilibili.com/video/BV1W8411W7aM

Pascal Architecture Whitepaper
https://images.nvidia.com/content/pdf/tesla/whitepaper/pascal-architecture-whitepaper.pdf

Volta Architecture Whitepaper
https://images.nvidia.com/content/volta-architecture/pdf/volta-architecture-whitepaper.pdf

Turing Architecture Whitepaper
https://images.nvidia.com/aem-dam/en-zz/Solutions/design-visualization/technologies/turing-architecture/NVIDIA-Turing-Architecture-Whitepaper.pdf

Ampere Architecture Whitepaper
https://www.nvidia.com/content/PDF/nvidia-ampere-ga-102-gpu-architecture-whitepaper-v2.1.pdf

Hopper Architecture Whitepaper
https://resources.nvidia.com/en-us-tensor-core

Ada-Lovelace Architecture Whitepaper
https://images.nvidia.com/aem-dam/Solutions/geforce/ada/nvidia-ada-gpu-architecture.pdf

Blackwell Architecture Technical Brief
https://resources.nvidia.com/en-us-blackwell-architecture
