ONNX and Model Quantization

Installation

pip install onnx onnxruntime -i http://mirrors.aliyun.com/pypi/simple/ --trusted-host=mirrors.aliyun.com
# pip install onnxruntime-openvino -i http://mirrors.cloud.aliyuncs.com/pypi/simple/ --trusted-host=mirrors.cloud.aliyuncs.com
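
After installing, a quick sanity check (a minimal sketch; no versions are pinned above, so use whatever pip resolved) is to import both packages and list the execution providers onnxruntime can see:

import onnx
import onnxruntime

# Confirm both packages import and print their versions
print(onnx.__version__, onnxruntime.__version__)

# CPUExecutionProvider is always available; OpenVINOExecutionProvider only
# appears if the onnxruntime-openvino build was installed instead
print(onnxruntime.get_available_providers())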

Exporting an ONNX model from PyTorch

# Reference: https://pytorch.ac.cn/tutorials/advanced/super_resolution_with_onnxruntime.html
import os
import numpy as np
import torch
import onnx
import onnxruntime

# 1. Define the model input and compute the reference PyTorch output
# torch_model is assumed to be an already-constructed PyTorch model (e.g. a ViT) in eval() mode
batch_size = 5
x = torch.randn(batch_size, 3, 224, 224, requires_grad=True)
torch_out = torch_model(x)

# 2. Export the model to ONNX with PyTorch's built-in exporter
torch.onnx.export(torch_model,              # model being run
                  x,                        # model input (or a tuple for multiple inputs)
                  "vit.onnx",               # where to save the model (can be a file or file-like object)
                  export_params=True,       # store the trained parameter weights inside the model file
                  opset_version=16,         # the ONNX opset version to export to (>= 14 here)
                  do_constant_folding=True, # whether to execute constant folding for optimization
                  input_names=['input'],    # the model's input names
                  output_names=['output'],  # the model's output names
                  dynamic_axes={'input': {0: 'batch_size'},      # variable-length axes
                                'output': {0: 'batch_size'}})

# 3. Check the exported ONNX model
onnx_model = onnx.load("vit.onnx")
onnx.checker.check_model(onnx_model)

# 4. Verify that the ONNX model's output matches the PyTorch output
## 4.1 Configure onnxruntime session options and load the model just exported
num_logical_cpus = 4  # os.cpu_count()
sess_options = onnxruntime.SessionOptions()
sess_options.intra_op_num_threads = num_logical_cpus
# sess_options.graph_optimization_level = onnxruntime.GraphOptimizationLevel.ORT_ENABLE_ALL  # enable graph optimizations
ort_session = onnxruntime.InferenceSession("vit.onnx", sess_options, providers=["CPUExecutionProvider"])
# ort_session = onnxruntime.InferenceSession("vit.onnx", sess_options, providers=["OpenVINOExecutionProvider"], provider_options=[{"cache_dir": "./cache", "device_type": "CPU", "num_of_threads": num_logical_cpus, "precision": "FP32"}])  # OpenVINO inference, for supported Intel CPUs

## 4.2 Run inference with onnxruntime and compare the result with the PyTorch output
def to_numpy(tensor):
    return tensor.detach().cpu().numpy() if tensor.requires_grad else tensor.cpu().numpy()

ort_inputs = {ort_session.get_inputs()[0].name: to_numpy(x)}
ort_outs = ort_session.run(None, ort_inputs)
print(torch_out.detach().numpy(), "\n", ort_outs[0])

np.testing.assert_allclose(to_numpy(torch_out), ort_outs[0], rtol=1e-03, atol=1e-05)
print("Exported model has been tested with ONNXRuntime, and the result looks good!")

Dynamic quantization of the ONNX model

INT8 dynamic quantization

  1. Weight quantization
    • For each layer's weights $W$, apply min-max quantization to map the values into [0, 255] or [-128, 127].
    • Store the quantized weights $W_q$ along with each layer's min-max scale and zero_point.
  2. Input/activation (dynamic) quantization
    • Each layer's input $X$ is quantized the same way, via min-max quantization into [0, 255] or [-128, 127].
    • Because the inputs and activations differ on every call, $X_q$ and its scale and zero_point must be computed dynamically at inference time.
  3. Compute & dequantize (see the NumPy sketch after this list)
    • **Compute (using a linear layer as the example)**: $X @ W = (X_q * scale_x) @ (W_q * scale_w) = (scale_x * scale_w) * (X_q @ W_q)$ (zero_point terms omitted for brevity)
      • Here the scales are scalars, $X$ and $W$ are matrices, and @ denotes matrix multiplication.
      • This identity turns a floating-point matrix multiplication into an INT8 matrix multiplication, which reduces the compute cost and speeds up inference.
    • Dequantize: convert the result back to FP32, apply the activation function (sigmoid, tanh, etc.), and feed it to the next layer, where the "input/activation quantization" step is applied again.

Symmetric quantization: values are mapped to a signed range such as [-128, 127], with zero_point fixed at 0.
Asymmetric quantization: values are mapped to an unsigned range such as [0, 255], using both a scale and a non-zero zero_point.
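
The sketch below walks through steps 1-3 in plain NumPy, using asymmetric (uint8) quantization for the input and symmetric (int8) quantization for the weights. The helper names and shapes are illustrative only; this is not onnxruntime's actual kernel, just the arithmetic it relies on.

import numpy as np

def quantize_asym(x, qmin=0, qmax=255):
    # Asymmetric min-max quantization: [x.min(), x.max()] -> [qmin, qmax]
    scale = (x.max() - x.min()) / (qmax - qmin)
    zero_point = np.round(qmin - x.min() / scale)
    x_q = np.clip(np.round(x / scale + zero_point), qmin, qmax).astype(np.uint8)
    return x_q, scale, zero_point

def quantize_sym(w, qmax=127):
    # Symmetric quantization: zero_point is fixed at 0, range is [-127, 127]
    scale = np.abs(w).max() / qmax
    w_q = np.clip(np.round(w / scale), -qmax, qmax).astype(np.int8)
    return w_q, scale

X = np.random.randn(5, 64).astype(np.float32)   # input/activation: quantized at run time
W = np.random.randn(64, 10).astype(np.float32)  # weights: quantized once, offline

X_q, scale_x, zp_x = quantize_asym(X)
W_q, scale_w = quantize_sym(W)

# Integer matmul (accumulated in int32); the input's zero_point must be
# subtracted before multiplying, then the result is dequantized back to FP32
acc = (X_q.astype(np.int32) - int(zp_x)) @ W_q.astype(np.int32)
Y_approx = scale_x * scale_w * acc

print(np.abs(Y_approx - X @ W).max())  # quantization error, small relative to |X @ W|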

import onnx
from onnxruntime.quantization import quantize_dynamic, QuantType

# Collect the names of the compute-heavy nodes worth quantizing
model = onnx.load("vit.onnx")
all_nodes = [n.name for n in model.graph.node if n.op_type in ('Conv', 'MatMul', 'Gemm')]
nodes_to_quantize = all_nodes[:]  # all_nodes[:len(all_nodes)//4*3] would quantize only the first 3/4 of them

quantize_dynamic(
    "vit.onnx",        # input: the FP32 model exported above
    "vit_quant.onnx",  # output: the dynamically quantized model
    # op_types_to_quantize=['Conv', 'MatMul', 'Add', 'Gemm'],
    nodes_to_quantize=nodes_to_quantize,
    weight_type=QuantType.QUInt8,  # quantize weights to unsigned INT8
)
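
To see what the quantization bought, compare the two file sizes and re-run inference on the quantized model. This is a sketch that reuses x, torch_out, to_numpy and np from the export script above; INT8 weights make the file roughly 4x smaller, while the outputs drift further from FP32 than the 1e-3 tolerance used earlier.

import os
import onnxruntime

print("fp32:", os.path.getsize("vit.onnx") / 1e6, "MB")
print("int8:", os.path.getsize("vit_quant.onnx") / 1e6, "MB")

# Load and run the quantized model the same way as the FP32 one
quant_session = onnxruntime.InferenceSession("vit_quant.onnx", providers=["CPUExecutionProvider"])
quant_outs = quant_session.run(None, {quant_session.get_inputs()[0].name: to_numpy(x)})

# Compare against the PyTorch output; expect a larger gap than the FP32 check
print(np.abs(quant_outs[0] - to_numpy(torch_out)).max())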

References