ONNX and Model Quantization

Installation

pip install onnx onnxruntime -i http://mirrors.aliyun.com/pypi/simple/ --trusted-host=mirrors.aliyun.com
# pip install onnxruntime-openvino -i http://mirrors.cloud.aliyuncs.com/pypi/simple/ --trusted-host=mirrors.cloud.aliyuncs.com
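
After installing, a quick sanity check (a minimal sketch; no versions are pinned above, so use whatever pip resolved) is to import both packages and list the execution providers onnxruntime can see:

import onnx
import onnxruntime

# Confirm both packages import and print their versions
print(onnx.__version__, onnxruntime.__version__)

# CPUExecutionProvider is always available; OpenVINOExecutionProvider only
# appears if the onnxruntime-openvino build was installed instead
print(onnxruntime.get_available_providers())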

Exporting an ONNX model from PyTorch

# Reference: https://pytorch.ac.cn/tutorials/advanced/super_resolution_with_onnxruntime.html
import os
import numpy as np
import torch
import onnx
import onnxruntime

# 1. Define the model input and compute the reference PyTorch output
# torch_model is assumed to be an already-constructed PyTorch model (e.g. a ViT) in eval() mode
batch_size = 5
x = torch.randn(batch_size, 3, 224, 224, requires_grad=True)
torch_out = torch_model(x)

# 2. Export the model to ONNX with PyTorch's built-in exporter
torch.onnx.export(torch_model,              # model being run
                  x,                        # model input (or a tuple for multiple inputs)
                  "vit.onnx",               # where to save the model (can be a file or file-like object)
                  export_params=True,       # store the trained parameter weights inside the model file
                  opset_version=16,         # the ONNX opset version to export to (>= 14 here)
                  do_constant_folding=True, # whether to execute constant folding for optimization
                  input_names=['input'],    # the model's input names
                  output_names=['output'],  # the model's output names
                  dynamic_axes={'input': {0: 'batch_size'},      # variable-length axes
                                'output': {0: 'batch_size'}})

# 3. Check the exported ONNX model
onnx_model = onnx.load("vit.onnx")
onnx.checker.check_model(onnx_model)

# 4. Verify that the ONNX model's output matches the PyTorch output
## 4.1 Configure onnxruntime session options and load the model just exported
num_logical_cpus = 4  # os.cpu_count()
sess_options = onnxruntime.SessionOptions()
sess_options.intra_op_num_threads = num_logical_cpus
# sess_options.graph_optimization_level = onnxruntime.GraphOptimizationLevel.ORT_ENABLE_ALL  # enable graph optimizations
ort_session = onnxruntime.InferenceSession("vit.onnx", sess_options, providers=["CPUExecutionProvider"])
# ort_session = onnxruntime.InferenceSession("vit.onnx", sess_options, providers=["OpenVINOExecutionProvider"], provider_options=[{"cache_dir": "./cache", "device_type": "CPU", "num_of_threads": num_logical_cpus, "precision": "FP32"}])  # OpenVINO inference, for supported Intel CPUs

## 4.2 Run inference with onnxruntime and compare the result with the PyTorch output
def to_numpy(tensor):
    return tensor.detach().cpu().numpy() if tensor.requires_grad else tensor.cpu().numpy()

ort_inputs = {ort_session.get_inputs()[0].name: to_numpy(x)}
ort_outs = ort_session.run(None, ort_inputs)
print(torch_out.detach().numpy(), "\n", ort_outs[0])

np.testing.assert_allclose(to_numpy(torch_out), ort_outs[0], rtol=1e-03, atol=1e-05)
print("Exported model has been tested with ONNXRuntime, and the result looks good!")

Dynamic quantization of the ONNX model

INT8 dynamic quantization

  1. Weight quantization
    • For each layer's weights $W$, apply min-max quantization to map the values into [0, 255] or [-128, 127].
    • Store the quantized weights $W_q$ along with each layer's min-max scale and zero_point.
  2. Input/activation (dynamic) quantization
    • Each layer's input $X$ is quantized the same way, via min-max quantization into [0, 255] or [-128, 127].
    • Because the inputs and activations differ on every call, $X_q$ and its scale and zero_point must be computed dynamically at inference time.
  3. Compute & dequantize (see the NumPy sketch after this list)
    • **Compute (using a linear layer as the example)**: $X @ W = (X_q * scale_x) @ (W_q * scale_w) = (scale_x * scale_w) * (X_q @ W_q)$ (zero_point terms omitted for brevity)
      • Here the scales are scalars, $X$ and $W$ are matrices, and @ denotes matrix multiplication.
      • This identity turns a floating-point matrix multiplication into an INT8 matrix multiplication, which reduces the compute cost and speeds up inference.
    • Dequantize: convert the result back to FP32, apply the activation function (sigmoid, tanh, etc.), and feed it to the next layer, where the "input/activation quantization" step is applied again.

Symmetric quantization: values are mapped to a signed range such as [-128, 127], with zero_point fixed at 0.
Asymmetric quantization: values are mapped to an unsigned range such as [0, 255], using both a scale and a non-zero zero_point.
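
The sketch below walks through steps 1-3 in plain NumPy, using asymmetric (uint8) quantization for the input and symmetric (int8) quantization for the weights. The helper names and shapes are illustrative only; this is not onnxruntime's actual kernel, just the arithmetic it relies on.

import numpy as np

def quantize_asym(x, qmin=0, qmax=255):
    # Asymmetric min-max quantization: [x.min(), x.max()] -> [qmin, qmax]
    scale = (x.max() - x.min()) / (qmax - qmin)
    zero_point = np.round(qmin - x.min() / scale)
    x_q = np.clip(np.round(x / scale + zero_point), qmin, qmax).astype(np.uint8)
    return x_q, scale, zero_point

def quantize_sym(w, qmax=127):
    # Symmetric quantization: zero_point is fixed at 0, range is [-127, 127]
    scale = np.abs(w).max() / qmax
    w_q = np.clip(np.round(w / scale), -qmax, qmax).astype(np.int8)
    return w_q, scale

X = np.random.randn(5, 64).astype(np.float32)   # input/activation: quantized at run time
W = np.random.randn(64, 10).astype(np.float32)  # weights: quantized once, offline

X_q, scale_x, zp_x = quantize_asym(X)
W_q, scale_w = quantize_sym(W)

# Integer matmul (accumulated in int32); the input's zero_point must be
# subtracted before multiplying, then the result is dequantized back to FP32
acc = (X_q.astype(np.int32) - int(zp_x)) @ W_q.astype(np.int32)
Y_approx = scale_x * scale_w * acc

print(np.abs(Y_approx - X @ W).max())  # quantization error, small relative to |X @ W|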

import onnx
from onnxruntime.quantization import quantize_dynamic, QuantType

# Collect the names of the compute-heavy nodes worth quantizing
model = onnx.load("vit.onnx")
all_nodes = [n.name for n in model.graph.node if n.op_type in ('Conv', 'MatMul', 'Gemm')]
nodes_to_quantize = all_nodes[:]  # all_nodes[:len(all_nodes)//4*3] would quantize only the first 3/4 of them

quantize_dynamic(
    "vit.onnx",        # input: the FP32 model exported above
    "vit_quant.onnx",  # output: the dynamically quantized model
    # op_types_to_quantize=['Conv', 'MatMul', 'Add', 'Gemm'],
    nodes_to_quantize=nodes_to_quantize,
    weight_type=QuantType.QUInt8,  # quantize weights to unsigned INT8
)
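
To see what the quantization bought, compare the two file sizes and re-run inference on the quantized model. This is a sketch that reuses x, torch_out, to_numpy and np from the export script above; INT8 weights make the file roughly 4x smaller, while the outputs drift further from FP32 than the 1e-3 tolerance used earlier.

import os
import onnxruntime

print("fp32:", os.path.getsize("vit.onnx") / 1e6, "MB")
print("int8:", os.path.getsize("vit_quant.onnx") / 1e6, "MB")

# Load and run the quantized model the same way as the FP32 one
quant_session = onnxruntime.InferenceSession("vit_quant.onnx", providers=["CPUExecutionProvider"])
quant_outs = quant_session.run(None, {quant_session.get_inputs()[0].name: to_numpy(x)})

# Compare against the PyTorch output; expect a larger gap than the FP32 check
print(np.abs(quant_outs[0] - to_numpy(torch_out)).max())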

References