https://docs.qualcomm.com/bundle/publicresource/topics/80-64748-1/introduction.html
Following the guide at the link above, I prepared and optimized the Stable Diffusion model with the Qualcomm AI Stack for deployment and inference on Snapdragon devices.
The overall workflow is: model optimization with the AI Model Efficiency Toolkit (AIMET) for Snapdragon devices ==> model preparation with Qualcomm AI Engine Direct on Linux ==> model execution with Qualcomm AI Engine Direct on Android on Snapdragon.
My related settings are as follows:
AIMET A16W8 config:
{
  "activation_bit_width": 16,
  "apply_adaround": false,
  "calibration_prompts": "calibration_prompts.txt",
  "config_file": null,
  "gn_exceptions": true,
  "half_precision": false,
  "in_place": true,
  "parameter_bit_width": 8,
  "quant_scheme": "tf",
  "remove_quantsim_after_inference": true,
  "replace_attn_singlehead_conv": true,
  "silu_sigmoid_encoding_override": true,
  "softmax_encoding_override": true,
  "text_encoder_exception_type": "text_encoder_attn_q=16_k=8_sm=8_v=16_as=16",
  "unet_exception_type": "UNET_attn_q=16_k=8_sm=16_v=8_as=16",
  "use_asymmetric_layernorm_weights": true,
  "use_symmetric_matmul": true,
  "vae_exception_type": "VAE_attn_q=16_k=8_sm=16_v=8_as=16",
  "apply_adaround_text_encoder": true,
  "adaround_iter_text_encoder": 1
}
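To make the A16W8 trade-off above concrete, here is a small stdlib-only sketch of the uniform quantization step sizes implied by those bit widths. The tensor ranges are illustrative placeholders, not values taken from the model, and this is only the textbook asymmetric-grid formula, not AIMET's internal encoding computation:

```python
import json

# Bit widths as set in the AIMET config above (A16W8).
cfg = json.loads('{"activation_bit_width": 16, "parameter_bit_width": 8}')

def step_size(value_range: float, bit_width: int) -> float:
    """Uniform quantization step for an asymmetric grid spanning value_range."""
    return value_range / (2 ** bit_width - 1)

# Illustrative ranges: activations spanning [-6, 6], weights spanning [-1, 1].
act_step = step_size(12.0, cfg["activation_bit_width"])
w_step = step_size(2.0, cfg["parameter_bit_width"])
print(f"activation step: {act_step:.6g}, weight step: {w_step:.6g}")
```

The ~256x finer activation grid is why A16W8 preserves accuracy on attention and normalization outputs, but 16-bit activations also roughly double activation memory traffic versus A8W8, which is one knob that directly affects HTP latency.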
htp_config.json:
{
  "graphs": [
    {
      "vtcm_mb": 8,
      "graph_names": ["qnn_model"]
    }
  ],
  "devices": [
    {
      "soc_id": 69,
      "dsp_arch": "v79",
      "cores": [
        {
          "core_id": 0,
          "perf_profile": "burst",
          "rpc_control_latency": 100
        }
      ]
    }
  ]
}
The mobile device I use: SM8750
However, after pushing the quantized A16W8 context binary to the device, inference takes about 20 seconds for 20 inference steps to generate a 512x512 image, which is significantly slower than the officially reported speed. (Qualcomm's full-stack optimization runs Stable Diffusion on a smartphone in under 15 seconds for 20 inference steps at 512x512 — described as the fastest inference on a smartphone and comparable to cloud latency. Reference: https://www.qualcomm.com/news/onq/2023/02/worlds-first-on-device-demonstration-of-stable-diffusion-on-android)
Each UNet execution takes about 0.5 s; with two UNet runs per step that is about 1 s per step, so roughly 20 seconds for 20 steps.
Therefore, what other operations or configurations in this demo workflow could speed up inference, or otherwise affect inference speed?