How to improve the inference speed of Stable Diffusion v1.5 deployed on an SM8750 mobile device

污力大魔王 2025-04-29 16:10:58

https://docs.qualcomm.com/bundle/publicresource/topics/80-64748-1/introduction.html 

Following the guide at the link above, I prepared and optimized the Stable Diffusion model with the Qualcomm AI Stack for deployment and inference on Snapdragon devices.

The entire execution process is: AI Model Efficiency Toolkit (AIMET) Model Optimization for Snapdragon Devices ==> Qualcomm AI Engine Direct Model Preparation on Linux ==> Qualcomm AI Engine Direct Model Execution on Android on Snapdragon.

My related settings are as follows: 

AIMET A16W8 config:

{
    "activation_bit_width": 16,
    "apply_adaround": false,
    "calibration_prompts": "calibration_prompts.txt",
    "config_file": null,
    "gn_exceptions": true,
    "half_precision": false,
    "in_place": true,
    "parameter_bit_width": 8,
    "quant_scheme": "tf",
    "remove_quantsim_after_inference": true,
    "replace_attn_singlehead_conv": true,
    "silu_sigmoid_encoding_override": true,
    "softmax_encoding_override": true,
    "text_encoder_exception_type": "text_encoder_attn_q=16_k=8_sm=8_v=16_as=16",
    "unet_exception_type": "UNET_attn_q=16_k=8_sm=16_v=8_as=16",
    "use_asymmetric_layernorm_weights": true,
    "use_symmetric_matmul": true,
    "vae_exception_type": "VAE_attn_q=16_k=8_sm=16_v=8_as=16",
    "apply_adaround_text_encoder": true,
    "adaround_iter_text_encoder": 1
}
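For reference, the bit-width settings above roughly correspond to the following AIMET usage. This is only a minimal sketch based on my understanding of the toolkit, not the tutorial's actual quantization script; the placeholder model, dummy input, and calibration callback are mine and would be replaced by the Stable Diffusion UNet / text encoder / VAE and the calibration prompts.

import torch
from aimet_common.defs import QuantScheme
from aimet_torch.quantsim import QuantizationSimModel

# Placeholder model and input -- in the real pipeline this is the SD v1.5
# UNet / text encoder / VAE traced for 512x512 generation.
model = torch.nn.Conv2d(4, 4, kernel_size=3, padding=1).eval()
dummy_input = torch.randn(1, 4, 64, 64)

sim = QuantizationSimModel(
    model,
    dummy_input=dummy_input,
    quant_scheme=QuantScheme.post_training_tf,  # "quant_scheme": "tf"
    default_output_bw=16,                       # "activation_bit_width": 16
    default_param_bw=8,                         # "parameter_bit_width": 8
    in_place=True,                              # "in_place": true
    config_file=None,                           # "config_file": null
)

# Calibration: run representative inputs (the calibration prompts) through
# the model so AIMET can compute activation encodings.
def forward_pass(quant_model, _):
    with torch.no_grad():
        quant_model(dummy_input)

sim.compute_encodings(forward_pass, forward_pass_callback_args=None)

# Export the model plus encodings for the QNN model-preparation step.
sim.export(path="./export", filename_prefix="qnn_model", dummy_input=dummy_input)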

htp_config.json

{
    "graphs": [
        {
            "vtcm_mb": 8,
            "graph_names": ["qnn_model"]
        }
    ],
    "devices": [
        {
            "soc_id": 69,
            "dsp_arch": "v79",
            "cores": [
                {
                    "core_id": 0,
                    "perf_profile": "burst",
                    "rpc_control_latency": 100
                }
            ]
        }
    ]
}
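Of the settings above, vtcm_mb, perf_profile, and rpc_control_latency are the knobs most likely to influence on-device latency. The small sketch below (my own, not part of the tutorial) generates htp_config.json variants for A/B benchmarking; it only varies keys that already appear above, and the candidate values are assumptions that should be checked against the QNN SDK documentation for the SM8750.

import itertools
import json
from pathlib import Path

# Base config copied from the htp_config.json shown above.
base = {
    "graphs": [{"vtcm_mb": 8, "graph_names": ["qnn_model"]}],
    "devices": [{
        "soc_id": 69,
        "dsp_arch": "v79",
        "cores": [{"core_id": 0, "perf_profile": "burst", "rpc_control_latency": 100}],
    }],
}

# Candidate values to sweep -- assumed ranges, not official recommendations.
vtcm_sizes = [4, 8]
perf_profiles = ["burst", "sustained_high_performance"]

out_dir = Path("htp_config_variants")
out_dir.mkdir(exist_ok=True)

for vtcm, profile in itertools.product(vtcm_sizes, perf_profiles):
    cfg = json.loads(json.dumps(base))  # simple deep copy
    cfg["graphs"][0]["vtcm_mb"] = vtcm
    cfg["devices"][0]["cores"][0]["perf_profile"] = profile
    out_file = out_dir / f"htp_config_vtcm{vtcm}_{profile}.json"
    out_file.write_text(json.dumps(cfg, indent=4))
    print("wrote", out_file)

Each variant can then be supplied to the on-device execution step in place of the original htp_config.json and the 20-step latency compared.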

The mobile device I use: SM8750 

However, after pushing the quantized A16W8 context binary model to the device, inference takes about 20 seconds for 20 inference steps to generate a 512x512 pixel image, which is significantly slower than the officially reported speed ("The result of this full-stack optimization is running Stable Diffusion on a smartphone in under 15 seconds for 20 inference steps to generate a 512x512 pixel image — this is the fastest inference on a smartphone and comparable to cloud latency." Reference: https://www.qualcomm.com/news/onq/2023/02/worlds-first-on-device-demonstration-of-stable-diffusion-on-android).


A single UNet forward pass takes about 0.5 s; since each step runs the UNet twice (the conditional and unconditional passes for classifier-free guidance), one step costs about 1 s, and 20 steps therefore take about 20 seconds.
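As a back-of-the-envelope check of where the 20 seconds go (the text-encoder and VAE-decoder times below are my assumptions, since only the UNet time was measured):

steps = 20
unet_passes_per_step = 2      # conditional + unconditional pass (classifier-free guidance)
unet_latency_s = 0.5          # measured: ~0.5 s per UNet pass

text_encoder_s = 0.05         # assumption: run once, small model
vae_decoder_s = 0.4           # assumption: run once at the end

total_s = steps * unet_passes_per_step * unet_latency_s + text_encoder_s + vae_decoder_s
print(f"estimated end-to-end latency: {total_s:.1f} s")  # ~20.5 s, dominated by the UNet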

Therefore, what other operations or configurations in this demo workflow can speed up, or otherwise affect, the inference speed?

 

 
