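"inception_v3 runs slowly on QRB5165. How can we reach the throughput Qualcomm advertises?
Qualcomm's official documentation claims 337 inf/s, but our own test reached only 90 inf/s. This is the test command: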
sh-5.0# snpe-parallel-run --container dlc-model/inception-v3/inception_v3_quantized.dlc \
> --input_list dlc-model/inception-v3/target_raw_list.txt \
> --perf_profile burst --cpu_fallback false --enable_init_cache \
> --userbuffer_tf8 --profiling_level basic --use_aip \
> --perf_profile burst --cpu_fallback false --enable_init_cache \
> --userbuffer_tf8 --profiling_level basic --use_aip \
> --perf_profile burst --cpu_fallback false --enable_init_cache \
> --userbuffer_tf8 --profiling_level basic --use_aip \
> --perf_profile burst --cpu_fallback false --enable_init_cache \
> --userbuffer_tf8 --profiling_level basic --use_aip \
> --duration 10
The number of input image is: 1
CONTAINER SAVE SUCCESS
Saved container into archive successfully
PSNPE inputDimensions is: [ 1 299 299 3 ]
Batch size for the container is: 1
Input/output buffer number is: 1
Processing DNN input(s):
./dlc-model/inception-v3/chairs.raw
PSNPE start executing...
runtimes: aip_fixed8_tf aip_fixed8_tf aip_fixed8_tf aip_fixed8_tf CPU Fxp Mode: 0 - Mode :0- Number of images processed: 903
Build time: 0.099488 seconds.
Start timestamp of the first input loading (0.0s): 1717741692765047
End time of the last input loading: 0.005729
Start time of the first execution: 0.005755
Start time of the last getOutputCallback: 10.0094
Start time of the first getOutputCallback: -1.71774e+09
End time of the last getOutputCallback: 10.01
Execution Time: 10.0036
Execution Time + getOutput Time: 10.0043
LoadInput time + Execution Time + getOutput Time: 10.01
Mean output time: 10.01
90.2673 infs/sec
Successfully executed!"
1. Process your model with snpe-dlc-graph-prepare so that it can run on AIP+DSP.
2. Experiment to find the most effective split of instances between AIP and DSP:
Testing shows that the following command:
snpe-throughput-net-run \
  --container inception_v3_quantized.dlc --perf_profile burst --userbuffer_tf8 --use_dsp \
  --container inception_v3_quantized.dlc --perf_profile burst --userbuffer_tf8 --use_aip \
  --container inception_v3_quantized.dlc --perf_profile burst --userbuffer_tf8 --use_aip \
  --container inception_v3_quantized.dlc --perf_profile burst --userbuffer_tf8 --use_aip \
  --container inception_v3_quantized.dlc --perf_profile burst --userbuffer_tf8 --use_aip \
  --duration 20
gives the highest throughput (one DSP instance plus four AIP instances). The result is roughly:
Output:
/prj/qct/webtech_hyd18/mlg_user_admin/qaisw_source_repo/qaisw_repo_release/snpe_src/avante-tools/prebuilt/dsp/hexagon-sdk-4.1.0/ipc/fastrpc/rpcmem/src/rpcmem_android.c:38:dummy call to rpcmem_init, rpcmem APIs will be used from libxdsprpc
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
Processing DNN input(s):
Processing DNN input(s):
Processing DNN input(s):
Processing DNN input(s):
Processing DNN input(s):
[Thread 0 - dsp_fixed8_tf] 82.7196 infs/sec - Number of images processed: 1655 - Build time: 342806 microseconds - Elapsed time: 20009778 microseconds - Real time: 20007353 microseconds - Teardown time: 98853 microseconds - Batch : 1
[Thread 1 - aip_fixed8_tf] 62.4897 infs/sec - Number of images processed: 1250 - Build time: 42899 microseconds - Elapsed time: 20005349 microseconds - Real time: 20003308 microseconds - Teardown time: 14370 microseconds - Batch : 1
[Thread 2 - aip_fixed8_tf] 62.9518 infs/sec - Number of images processed: 1259 - Build time: 24320 microseconds - Elapsed time: 20001755 microseconds - Real time: 19999433 microseconds - Teardown time: 10066 microseconds - Batch : 1
[Thread 3 - aip_fixed8_tf] 62.7556 infs/sec - Number of images processed: 1256 - Build time: 13685 microseconds - Elapsed time: 20016268 microseconds - Real time: 20014137 microseconds - Teardown time: 103485 microseconds - Batch : 1
[Thread 4 - aip_fixed8_tf] 62.7268 infs/sec - Number of images processed: 1255 - Build time: 14063 microseconds - Elapsed time: 20009735 microseconds - Real time: 20007390 microseconds - Teardown time: 99701 microseconds - Batch : 1
Total throughput: 333.644 infs/sec
This essentially reaches the officially advertised 330+ inf/s.
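To try other DSP/AIP splits as described in step 2, a small helper script can save some typing. The following is only a sketch: the function names build_cmd and sum_throughput are made up for illustration, it uses only the flags that appear in the command above, and it assumes the per-thread log lines keep the "[Thread N - runtime] X infs/sec" layout shown in the output.

```shell
#!/bin/sh
# Hypothetical helpers for experimenting with the DSP/AIP instance split.
# Only flags already used in this thread are emitted.

# build_cmd DLC N_DSP N_AIP DURATION
# Prints a snpe-throughput-net-run command line with N_DSP DSP instances
# followed by N_AIP AIP instances of the same container.
build_cmd() (
  dlc=$1; n_dsp=$2; n_aip=$3; duration=$4
  cmd="snpe-throughput-net-run"
  i=0
  while [ "$i" -lt "$n_dsp" ]; do
    cmd="$cmd --container $dlc --perf_profile burst --userbuffer_tf8 --use_dsp"
    i=$((i + 1))
  done
  i=0
  while [ "$i" -lt "$n_aip" ]; do
    cmd="$cmd --container $dlc --perf_profile burst --userbuffer_tf8 --use_aip"
    i=$((i + 1))
  done
  printf '%s --duration %s\n' "$cmd" "$duration"
)

# sum_throughput
# Reads tool output on stdin and adds up the per-thread infs/sec values
# (field 5 of each "[Thread N - runtime]" line), printed to 2 decimals.
sum_throughput() {
  awk '/^\[Thread [0-9]+ - / { total += $5 } END { printf "%.2f\n", total }'
}
```

For example, `build_cmd inception_v3_quantized.dlc 1 4 20` prints the same 1 DSP + 4 AIP command used above. Feeding the five thread lines from the output into `sum_throughput` gives 333.64, which agrees with the reported "Total throughput: 333.644 infs/sec", so the total is simply the sum of the per-thread rates.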