How to improve the inference speed of Stable Diffusion v1.5 deployed on an SM8750 mobile device

污力大魔王 2025-04-29 16:10:58

https://docs.qualcomm.com/bundle/publicresource/topics/80-64748-1/introduction.html 

Following the guide at the link above, I performed the Qualcomm AI Stack preparation and optimization of the Stable Diffusion model for deployment and inference on Snapdragon devices.

The entire workflow is: AI Model Efficiency Toolkit (AIMET) model optimization for Snapdragon devices ==> Qualcomm AI Engine Direct model preparation on Linux ==> Qualcomm AI Engine Direct model execution on Android on Snapdragon.
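For reference, the second stage (model preparation on Linux) runs through the QNN converter and compiler tools. Below is a minimal sketch of that command sequence in Python, assuming an ONNX export of the UNet with AIMET encodings; the tool names come from the QNN SDK, but the exact paths and flags are illustrative and should be checked against your SDK version:

# Sketch of the QNN model-preparation stage (workflow step 2).
# Tool names are from the QNN SDK; the paths and some flags are
# illustrative assumptions -- verify them against your SDK's --help.
import subprocess

def run(cmd):
    print(" ".join(cmd))
    subprocess.run(cmd, check=True)

# 1. Convert the ONNX model, applying the AIMET-exported encodings.
run(["qnn-onnx-converter",
     "--input_network", "unet.onnx",
     "--quantization_overrides", "unet.encodings",
     "--output_path", "unet.cpp"])

# 2. Compile the converted sources into an Android-loadable library.
run(["qnn-model-lib-generator",
     "-c", "unet.cpp", "-b", "unet.bin",
     "-t", "aarch64-android", "-o", "model_libs"])

# 3. Serialize an HTP context binary using the htp_config.json below.
run(["qnn-context-binary-generator",
     "--model", "model_libs/aarch64-android/libunet.so",  # illustrative path
     "--backend", "libQnnHtp.so",
     "--binary_file", "unet_a16w8",
     "--config_file", "htp_backend_ext.json"])  # wraps htp_config.json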

My related settings are as follows: 

AIMET A16W8 config:

{
    "activation_bit_width": 16,
    "apply_adaround": false,
    "calibration_prompts": "calibration_prompts.txt",
    "config_file": null,
    "gn_exceptions": true,
    "half_precision": false,
    "in_place": true,
    "parameter_bit_width": 8,
    "quant_scheme": "tf",
    "remove_quantsim_after_inference": true,
    "replace_attn_singlehead_conv": true,
    "silu_sigmoid_encoding_override": true,
    "softmax_encoding_override": true,
    "text_encoder_exception_type": "text_encoder_attn_q=16_k=8_sm=8_v=16_as=16",
    "unet_exception_type": "UNET_attn_q=16_k=8_sm=16_v=8_as=16",
    "use_asymmetric_layernorm_weights": true,
    "use_symmetric_matmul": true,
    "vae_exception_type": "VAE_attn_q=16_k=8_sm=16_v=8_as=16",
    "apply_adaround_text_encoder": true,
    "adaround_iter_text_encoder": 1
}
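For context, the A16W8 settings above map onto an AIMET QuantizationSimModel with 16-bit activations, 8-bit parameters, and the TF quant scheme. A minimal AIMET-Torch sketch of that mapping follows; the unet handle, the dummy-input shapes, and the calibrate callback are illustrative assumptions, not the pipeline's actual code:

# Minimal AIMET-Torch sketch matching the A16W8 config above.
# `unet`, the dummy-input shapes, and `calibrate` are placeholders.
import torch
from aimet_common.defs import QuantScheme
from aimet_torch.quantsim import QuantizationSimModel

unet = ...  # the prepared Stable Diffusion UNet (hypothetical handle)
dummy_input = (torch.randn(1, 4, 64, 64),   # latents
               torch.tensor([1.0]),         # timestep
               torch.randn(1, 77, 768))     # text embeddings

sim = QuantizationSimModel(
    unet, dummy_input,
    quant_scheme=QuantScheme.post_training_tf,  # "quant_scheme": "tf"
    default_output_bw=16,                       # "activation_bit_width": 16
    default_param_bw=8)                         # "parameter_bit_width": 8

def calibrate(model, prompts_file):
    # Forward passes over the calibration prompts to collect ranges.
    ...

sim.compute_encodings(calibrate, "calibration_prompts.txt")
sim.export("./export", "unet_a16w8", dummy_input)  # writes .onnx + .encodings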

htp_config.json

{
    "graphs": [{
        "vtcm_mb": 8,
        "graph_names": ["qnn_model"]
    }],
    "devices": [{
        "soc_id": 69,
        "dsp_arch": "v79",
        "cores": [{
            "core_id": 0,
            "perf_profile": "burst",
            "rpc_control_latency": 100
        }]
    }]
}
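Note that htp_config.json is not passed to the QNN tools directly; it is referenced from a backend-extensions wrapper config that qnn-context-binary-generator and qnn-net-run consume via --config_file. A sketch of generating that wrapper, with the stock HTP extension library name assumed (verify against your SDK):

# Sketch: the wrapper config that points the QNN tools at
# htp_config.json. The extension library name assumes the stock
# QNN HTP backend shipped with the SDK.
import json

wrapper = {
    "backend_extensions": {
        "shared_library_path": "libQnnHtpNetRunExtensions.so",
        "config_file_path": "htp_config.json",
    }
}
with open("htp_backend_ext.json", "w") as f:
    json.dump(wrapper, f, indent=4)

# On-device run (illustrative adb invocation):
#   adb shell /data/local/tmp/sd/qnn-net-run \
#       --retrieve_context unet_a16w8.bin \
#       --backend libQnnHtp.so \
#       --input_list input_list.txt \
#       --config_file htp_backend_ext.json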

The mobile device I am using: SM8750 (Snapdragon 8 Elite).

However, after pushing the quantized A16W8 context binary to the device, inference takes about 20 seconds for 20 steps to generate a 512x512 image, which is significantly slower than the officially reported speed. (Qualcomm's full-stack optimization runs Stable Diffusion on a smartphone in under 15 seconds for 20 inference steps generating a 512x512 image, described as the fastest inference on a smartphone and comparable to cloud latency. Reference: https://www.qualcomm.com/news/onq/2023/02/worlds-first-on-device-demonstration-of-stable-diffusion-on-android)


A single UNet execution takes about 0.5 s; with two UNet passes per step (presumably the conditional and unconditional passes for classifier-free guidance) that is about 1 s per step, so 20 steps take roughly 20 seconds.

Therefore, what other operations or configurations in this demo workflow can speed up, or otherwise affect, the inference speed?

 

 

1 reply:

You can try the following optimizations:
1. Your current config uses activation_bit_width=16, which significantly increases compute. Consider lowering it to 8 bits (evaluate the accuracy loss carefully); a sketch follows this list.
2. You currently apply a mixed-precision strategy to the UNet and VAE attention layers (e.g. UNET_attn_q=16_k=8_sm=16_v=8_as=16); try simplifying it:
{
    "unet_exception_type": "UNET_attn_q=8_k=8_sm=8_v=8_as=8",  // unify at 8 bits
    "vae_exception_type": "VAE_attn_q=8_k=8_sm=8_v=8_as=8",    // unify at 8 bits
    // keep the other parameters unchanged...
}
3. Enable more optimization techniques:
{
    "use_symmetric_activations": true,          // symmetric activation quantization
    "enable_per_channel_quantization": true,    // per-channel quantization
    "fuse_bn_with_conv": true,                  // fuse BatchNorm into conv layers
    // keep the other parameters unchanged...
}
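As referenced in item 1, here is a minimal sketch of the A8W8 re-quantization, reusing the hypothetical unet, dummy_input, and calibrate placeholders from the quantsim sketch in the question; expect some image-quality loss that should be checked against the FP32 outputs:

# Sketch of suggestion 1: re-quantize with 8-bit activations (A8W8),
# reusing the hypothetical `unet`, `dummy_input`, and `calibrate`
# from the quantsim sketch in the question.
from aimet_common.defs import QuantScheme
from aimet_torch.quantsim import QuantizationSimModel

sim = QuantizationSimModel(
    unet, dummy_input,
    quant_scheme=QuantScheme.post_training_tf,
    default_output_bw=8,   # was 16: roughly halves activation traffic on HTP
    default_param_bw=8)

sim.compute_encodings(calibrate, "calibration_prompts.txt")
sim.export("./export", "unet_a8w8", dummy_input)

# Compare a few generated images against the FP32 pipeline before
# committing to A8 -- the 16-bit activations exist to protect quality.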
