LLM Quantization Methods and How to Implement Them

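This article explains quantization techniques for LLMs and how to implement them. Quantization addresses the model-size and processing-time problems that arise when AI models are run in production. Specifically, it covers three methods (AWQ, GPTQ, and SmoothQuant), how to run each of them, and the results.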

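When you put an AI model into production, model size and processing time can become bottlenecks. Model compression addresses this problem, and its three representative techniques are distillation, pruning, and quantization. This article covers quantization of LLMs (Large Language Models), how to implement it, and the results.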

LLM Quantization

AWQ(Activation-aware Weight Quantization)

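AWQ builds on the idea of leaving the important (salient) weights unquantized. Rather than actually keeping them in higher precision, it multiplies the salient weights, identified from the activation magnitudes, by a scale factor before quantizing. This preserves the information carried by those weights and maintains accuracy. As a result, every weight is stored as INT4, yet the accuracy is comparable to a mixed-precision scheme that leaves some weights unquantized.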
AWQ: Activation-aware Weight Quantization for On-Device LLM Compression and Acceleration
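
The scaling idea can be illustrated with a small NumPy experiment. This is a minimal sketch, not the AWQ reference implementation and not what TensorRT-LLM runs internally: the layer sizes, the group size of 32, the two artificially amplified channels, and the helper names fake_quant_int4 and output_mse are all assumptions made for illustration. It searches a per-channel scale s = (mean |activation|)^alpha over a grid of alpha values and keeps the one with the smallest output error. Because alpha = 0 reproduces plain round-to-nearest, the search cannot do worse than that baseline on the calibration data.

```python
import numpy as np

def fake_quant_int4(w, group_size=32):
    """Symmetric round-to-nearest INT4 per group of input channels.
    Returns the dequantized weights so errors can be measured directly."""
    out_f, in_f = w.shape
    groups = w.reshape(out_f, in_f // group_size, group_size)
    step = np.abs(groups).max(axis=-1, keepdims=True) / 7.0
    step = np.where(step == 0.0, 1.0, step)        # avoid division by zero
    q = np.clip(np.round(groups / step), -8, 7)    # INT4 integer range
    return (q * step).reshape(out_f, in_f)

def output_mse(w_hat, w, x):
    """Mean squared error of the layer output x @ w.T after quantization."""
    return float(np.mean((x @ w_hat.T - x @ w.T) ** 2))

rng = np.random.default_rng(0)
x = rng.normal(size=(256, 64))     # calibration activations: (tokens, in_features)
x[:, [5, 40]] *= 30.0              # two salient channels with large activations
w = rng.normal(size=(128, 64))     # weights: (out_features, in_features)

act_mag = np.abs(x).mean(axis=0)   # per-channel activation magnitude

# Baseline: plain round-to-nearest INT4 without any scaling.
rtn_err = output_mse(fake_quant_int4(w), w, x)

# AWQ-style search: scale weights by s, quantize, then fold 1/s back.
# Mathematically, (x / s) @ Q(w * s).T equals x @ (Q(w * s) / s).T.
best_err, best_alpha = np.inf, None
for alpha in np.linspace(0.0, 1.0, 21):
    s = act_mag ** alpha
    err = output_mse(fake_quant_int4(w * s) / s, w, x)
    if err < best_err:
        best_err, best_alpha = err, float(alpha)

print(f"round-to-nearest MSE: {rtn_err:.4f}")
print(f"scaled (alpha={best_alpha:.2f}) MSE: {best_err:.4f}")
```

In a real deployment the 1/s factor is absorbed into the preceding operation, so the stored weights stay on the INT4 grid; the sketch folds it back into the weights only to keep the error measurement simple.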

GPTQ(Generative Pre-trained Transformers Quantization)

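A post-training weight quantization method widely used for GPU inference. It uses approximate second-order (Hessian) information, rather than gradients, to minimize the quantization error of each layer's output. It is mainly used for 3-bit and 4-bit quantization.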
GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers (arXiv:2210.17323)
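
The role of the Hessian can be sketched in a few lines of NumPy. This is a simplified illustration under stated assumptions, not the GPTQ reference code: it omits the blocking, grouping, and activation-ordering options, uses a single fixed per-row step, and the function names rtn and gptq_like, the layer sizes, and the correlated toy inputs are all made up for the example. Columns are quantized one at a time, and each column's rounding error is pushed onto the not-yet-quantized columns using the Cholesky factor of the inverse Hessian H = 2 X X^T.

```python
import numpy as np

def rtn(col, step):
    """Round-to-nearest onto a symmetric INT4 grid with a fixed per-row step."""
    return np.clip(np.round(col / step), -8, 7) * step

def gptq_like(w, x, damp=0.01):
    """Quantize w (out, in) column by column with error compensation.
    x holds calibration inputs with shape (in, n_samples)."""
    n_in = w.shape[1]
    h = 2.0 * x @ x.T                                 # Hessian of the layer-wise squared error
    h += damp * np.mean(np.diag(h)) * np.eye(n_in)    # damping for numerical stability
    u = np.linalg.cholesky(np.linalg.inv(h)).T        # upper factor: inv(h) = u.T @ u
    step = np.abs(w).max(axis=1) / 7.0                # per-row step, fixed up front
    work = w.copy()
    q = np.zeros_like(w)
    for j in range(n_in):
        q[:, j] = rtn(work[:, j], step)
        err = (work[:, j] - q[:, j]) / u[j, j]
        # Compensate: adjust the remaining columns to absorb this column's error.
        work[:, j + 1:] -= np.outer(err, u[j, j + 1:])
    return q

rng = np.random.default_rng(0)
n_in, n_out, n_samples = 32, 64, 512
shared = rng.normal(size=(1, n_samples))
x = rng.normal(size=(n_in, n_samples)) + 2.0 * shared   # correlated input channels
w = rng.normal(size=(n_out, n_in))

row_step = np.abs(w).max(axis=1, keepdims=True) / 7.0
w_rtn = np.clip(np.round(w / row_step), -8, 7) * row_step   # baseline without compensation
w_gptq = gptq_like(w, x)

err_rtn = float(np.mean((w_rtn @ x - w @ x) ** 2))
err_gptq = float(np.mean((w_gptq @ x - w @ x) ** 2))
print(f"round-to-nearest output MSE: {err_rtn:.4f}")
print(f"GPTQ-style output MSE:       {err_gptq:.4f}")
```

The toy inputs are deliberately correlated across channels, because that is the situation in which compensating later columns pays off; with uncorrelated inputs the Hessian is close to diagonal and the result approaches plain rounding.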

SmoothQuant

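A quantization method designed to handle outliers in the activations. It rewrites Y = XW as Y = (X / s)(s * W), where s is a per-channel scale factor. The product is mathematically unchanged, but the activations are transformed into values that are easier to quantize, with part of the difficulty shifted onto the weights.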
SmoothQuant: Accurate and Efficient Post-Training Quantization for Large Language Models
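
The transformation can be checked numerically with a short NumPy sketch. It is an illustration under assumptions, not the library's implementation: the tensor sizes, the single outlier channel, per-tensor INT8 rounding, and the helper name fake_quant_int8 are chosen for the example. The smoothing factor follows the formula from the SmoothQuant paper, s_j = max|X_j|^alpha / max|W_j|^(1 - alpha), with alpha = 0.5.

```python
import numpy as np

def fake_quant_int8(t):
    """Symmetric per-tensor INT8 round-to-nearest; returns dequantized values."""
    step = np.abs(t).max() / 127.0
    return np.clip(np.round(t / step), -128, 127) * step

rng = np.random.default_rng(0)
x = rng.normal(size=(128, 64))    # activations X: (tokens, channels)
x[:, 3] *= 50.0                   # one outlier channel dominates the range
w = rng.normal(size=(64, 32))     # weights W: (channels, out_features)

alpha = 0.5
s = np.abs(x).max(axis=0) ** alpha / np.abs(w).max(axis=1) ** (1.0 - alpha)

x_s = x / s               # X / s: outlier channel is scaled down
w_s = w * s[:, None]      # s * W: the matching weight row is scaled up

y = x @ w
# Before quantization the two forms give the same output.
same_output = bool(np.allclose(x_s @ w_s, y))

err_naive = float(np.mean((fake_quant_int8(x) @ fake_quant_int8(w) - y) ** 2))
err_smooth = float(np.mean((fake_quant_int8(x_s) @ fake_quant_int8(w_s) - y) ** 2))
print(f"equivalent before quantization: {same_output}")
print(f"W8A8 MSE without smoothing: {err_naive:.4f}")
print(f"W8A8 MSE with smoothing:    {err_smooth:.4f}")
```

Without smoothing, the outlier channel sets the quantization step for the whole activation tensor, so the ordinary channels are rounded very coarsely; dividing by s narrows the activation range and spreads the burden between X and W.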

Implementation

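I ran the BF16 (Brain Floating Point) baseline alongside AWQ, GPTQ, and SmoothQuant. The LLM used was Qwen2.5-1.5B. Note that the commands listed below use Qwen-7B-Chat paths; replace the model paths with those of the model you actually use.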
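
As a rough guide to what each format buys in size, the weight storage can be estimated with simple arithmetic. The sketch below is a back-of-the-envelope calculation under assumptions: it takes a nominal 1.5e9 parameters, assumes every weight is quantized, and for the grouped INT4 case assumes one 16-bit scale per group of 128 weights with no zero-point. Real engine sizes will differ, since some layers may stay in higher precision. The function name weight_gigabytes is made up for the example.

```python
def weight_gigabytes(n_params, bits_per_weight, group_size=None, scale_bits=16):
    """Approximate weight storage in GB (1 GB = 1e9 bytes).
    With group_size set, one scale of scale_bits is added per group."""
    bits = float(bits_per_weight)
    if group_size is not None:
        bits += scale_bits / group_size
    return n_params * bits / 8 / 1e9

n_params = 1.5e9  # nominal parameter count (assumption)

sizes = {
    "BF16": weight_gigabytes(n_params, 16),
    "INT8 weights": weight_gigabytes(n_params, 8),
    "INT4 weights, group size 128": weight_gigabytes(n_params, 4, group_size=128),
}

for name, gb in sizes.items():
    print(f"{name}: {gb:.2f} GB")
```

Under these assumptions the grouped INT4 format costs 4.125 bits per weight, so the per-group scales add only about 3% on top of the raw 4 bits.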

This section explains how to run Qwen with the TensorRT-LLM library. With this library, quantization can be applied with little effort.
TensorRT-LLM/examples/models/core/qwen at v1.1.0rc5 · NVIDIA/TensorRT-LLM · GitHub

Set up the environment by following the link below, then implement the steps described in the link above.
Installing on Linux via pip — TensorRT LLM

Common steps

pip install -r requirements.txt
git lfs install
git clone https://huggingface.co/Qwen/Qwen-7B-Chat ./tmp/Qwen/7B
(The following is required only for GPTQ.)
git clone https://huggingface.co/Qwen/Qwen-7B-Chat-Int4 ./tmp/Qwen-7B-Chat-Int4

How to run each variant

BF16

python convert_checkpoint.py --model_dir ./tmp/Qwen/7B/ \
                              --output_dir ./tllm_checkpoint_1gpu_bf16 \
                              --dtype bfloat16

trtllm-build --checkpoint_dir ./tllm_checkpoint_1gpu_bf16 \
            --output_dir ./tmp/Qwen/7B/trt_engines/bf16/1-gpu \
            --gpt_attention_plugin bfloat16 \
            --gemm_plugin bfloat16

python3 ../../../run.py --input_text "Hello, what is your name?" \
                  --max_output_len=50 \
                  --tokenizer_dir ./tmp/Qwen/7B/ \
                  --engine_dir=./tmp/Qwen/7B/trt_engines/bf16/1-gpu

AWQ

python ../../../quantization/quantize.py --model_dir ./tmp/Qwen/7B/ \
                                   --dtype float16 \
                                   --qformat int4_awq \
                                   --awq_block_size 128 \
                                   --output_dir ./quantized_int4-awq \
                                   --calib_size 32

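# Alternative to quantize.py above (assumes a pre-quantized AWQ checkpoint
# has already been placed in ./tmp/Qwen2-7B-Instruct-AWQ):
# python convert_checkpoint.py --model_dir ./tmp/Qwen2-7B-Instruct-AWQ \
#                              --output_dir ./quantized_int4-awq

# Build the engine that the run step below loads.
trtllm-build --checkpoint_dir ./quantized_int4-awq \
             --output_dir ./tmp/Qwen/7B/trt_engines/int4_AWQ/1-gpu/ \
             --gemm_plugin float16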

python3 ../../../run.py --input_text "Hello, what is your name?" \
                  --max_output_len=50 \
                  --tokenizer_dir ./tmp/Qwen/7B/ \
                  --engine_dir=./tmp/Qwen/7B/trt_engines/int4_AWQ/1-gpu/

GPTQ

python3 convert_checkpoint.py --model_dir ./tmp/Qwen-7B-Chat-Int4 \
                              --output_dir ./tllm_checkpoint_1gpu_gptq \
                              --dtype float16 \
                              --use_weight_only \
                              --weight_only_precision int4_gptq \
                              --per_group

trtllm-build --checkpoint_dir ./tllm_checkpoint_1gpu_gptq \
                --output_dir ./tmp/Qwen/7B/trt_engines/int4_GPTQ/1-gpu/ \
                --gemm_plugin float16

python3 ../../../run.py --input_text "Hello, what is your name?" \
                  --max_output_len=50 \
                  --tokenizer_dir ./tmp/Qwen-7B-Chat-Int4 \
                  --engine_dir=./tmp/Qwen/7B/trt_engines/int4_GPTQ/1-gpu/

SmoothQuant

python3 convert_checkpoint.py --model_dir ./tmp/Qwen/7B/ \
                              --output_dir ./tllm_checkpoint_1gpu_sq \
                              --dtype float16 \
                              --smoothquant 0.5 \
                              --per_token \
                              --per_channel

trtllm-build --checkpoint_dir ./tllm_checkpoint_1gpu_sq \
             --output_dir ./engine_outputs \
             --gemm_plugin float16

python3 ../../../run.py --input_text "Hello, what is your name?" \
                  --max_output_len=150 \
                  --tokenizer_dir ./tmp/Qwen/7B/ \
                  --engine_dir=./engine_outputs
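
All four variants follow the same three stages: produce a TensorRT-LLM checkpoint, build an engine with trtllm-build, and run it with run.py. The Python sketch below only assembles and prints the commands for one variant so the shared structure is visible; it executes nothing. The helper name pipeline is hypothetical, and the flags and paths are copied from the SmoothQuant commands above rather than taken from any additional documentation.

```python
import shlex

def pipeline(checkpoint_cmd, checkpoint_dir, engine_dir, tokenizer_dir,
             max_output_len=50, gemm_plugin="float16"):
    """Return the checkpoint, build, and run commands as argument lists.
    Nothing is executed; the caller decides how to run them."""
    build_cmd = ["trtllm-build",
                 "--checkpoint_dir", checkpoint_dir,
                 "--output_dir", engine_dir,
                 "--gemm_plugin", gemm_plugin]
    run_cmd = ["python3", "../../../run.py",
               "--input_text", "Hello, what is your name?",
               f"--max_output_len={max_output_len}",
               "--tokenizer_dir", tokenizer_dir,
               f"--engine_dir={engine_dir}"]
    return [checkpoint_cmd, build_cmd, run_cmd]

# Stage 1 for SmoothQuant, as listed in the section above.
smoothquant_convert = ["python3", "convert_checkpoint.py",
                       "--model_dir", "./tmp/Qwen/7B/",
                       "--output_dir", "./tllm_checkpoint_1gpu_sq",
                       "--dtype", "float16",
                       "--smoothquant", "0.5",
                       "--per_token",
                       "--per_channel"]

commands = pipeline(smoothquant_convert,
                    checkpoint_dir="./tllm_checkpoint_1gpu_sq",
                    engine_dir="./engine_outputs",
                    tokenizer_dir="./tmp/Qwen/7B/",
                    max_output_len=150)

for cmd in commands:
    print(shlex.join(cmd))
```

For the other variants only the first stage and the directory names change; the BF16 baseline additionally passes bfloat16 to the plugin options in its build step.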