https://docs.qualcomm.com/bundle/publicresource/topics/80-64748-1/introduction.html
Following the guide at the link above, I prepared and optimized the Stable Diffusion model with the Qualcomm AI Stack for deployment and inference on Snapdragon devices.
The overall workflow is: model optimization with the AI Model Efficiency Toolkit (AIMET) for Snapdragon devices ==> model preparation with Qualcomm AI Engine Direct on Linux ==> model execution with Qualcomm AI Engine Direct on Android on Snapdragon.
My related settings are as follows:
AIMET A16W8 config:
{
  "activation_bit_width": 16,
  "apply_adaround": false,
  "calibration_prompts": "calibration_prompts.txt",
  "config_file": null,
  "gn_exceptions": true,
  "half_precision": false,
  "in_place": true,
  "parameter_bit_width": 8,
  "quant_scheme": "tf",
  "remove_quantsim_after_inference": true,
  "replace_attn_singlehead_conv": true,
  "silu_sigmoid_encoding_override": true,
  "softmax_encoding_override": true,
  "text_encoder_exception_type": "text_encoder_attn_q=16_k=8_sm=8_v=16_as=16",
  "unet_exception_type": "UNET_attn_q=16_k=8_sm=16_v=8_as=16",
  "use_asymmetric_layernorm_weights": true,
  "use_symmetric_matmul": true,
  "vae_exception_type": "VAE_attn_q=16_k=8_sm=16_v=8_as=16",
  "apply_adaround_text_encoder": true,
  "adaround_iter_text_encoder": 1
}
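To make the A16W8 trade-off above concrete, here is a small stdlib-only sketch of the uniform quantization step sizes implied by those bit widths. The tensor ranges are illustrative placeholders, not values taken from the model, and this is only the textbook asymmetric-grid formula, not AIMET's internal encoding computation:

```python
import json

# Bit widths as set in the AIMET config above (A16W8).
cfg = json.loads('{"activation_bit_width": 16, "parameter_bit_width": 8}')

def step_size(value_range: float, bit_width: int) -> float:
    """Uniform quantization step for an asymmetric grid spanning value_range."""
    return value_range / (2 ** bit_width - 1)

# Illustrative ranges: activations spanning [-6, 6], weights spanning [-1, 1].
act_step = step_size(12.0, cfg["activation_bit_width"])
w_step = step_size(2.0, cfg["parameter_bit_width"])
print(f"activation step: {act_step:.6g}, weight step: {w_step:.6g}")
```

The ~256x finer activation grid is why A16W8 preserves accuracy on attention and normalization outputs, but 16-bit activations also roughly double activation memory traffic versus A8W8, which is one knob that directly affects HTP latency.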
htp_config.json:
{
  "graphs": [
    {
      "vtcm_mb": 8,
      "graph_names": ["qnn_model"]
    }
  ],
  "devices": [
    {
      "soc_id": 69,
      "dsp_arch": "v79",
      "cores": [
        {
          "core_id": 0,
          "perf_profile": "burst",
          "rpc_control_latency": 100
        }
      ]
    }
  ]
}
The mobile device I use: SM8750
However, after pushing the quantized A16W8 context binary to the device, inference takes about 20 seconds for 20 inference steps to generate a 512x512 image, which is significantly slower than the officially reported speed. (Qualcomm's full-stack optimization runs Stable Diffusion on a smartphone in under 15 seconds for 20 inference steps at 512x512 — described as the fastest inference on a smartphone and comparable to cloud latency. Reference: https://www.qualcomm.com/news/onq/2023/02/worlds-first-on-device-demonstration-of-stable-diffusion-on-android)
Each UNet execution takes about 0.5 s; with two UNet runs per step that is about 1 s per step, so roughly 20 seconds for 20 steps.
Therefore, what other operations or configurations in this demo workflow could speed up inference, or otherwise affect inference speed?