Following the tutorial_for_llama2_ssd_auto tutorial, I generated the QNN context binaries and model files such as kv-cache.primary.qnn-htp. What is the next step to run inference on the Qualcomm 8550 chip?
A comment in the tutorials' qnn_model_prepare.ipynb says:
After preparing the LLaMA with SSD models for inference, the next step is to execute the QNN context binaries for inference on a Snapdragon Android device. See qnn_model_execution.ipynb.
However, the tutorial does not include the corresponding qnn_model_execution.ipynb file.
Previously, using llama3_tutorials, I generated weight_sharing_.serialized.bin files and was able to run inference through Genie. But SSD introduces many new modules, so how should the Genie config file be written? Below is the config file I previously wrote for llama3.
{
  "dialog": {
    "version": 1,
    "type": "basic",
    "context": {
      "version": 1,
      "size": 4096,
      "n-vocab": 128256,
      "bos-token": 128000,
      "eos-token": 128001,
      "eot-token": 128009
    },
    "sampler": {
      "version": 1,
      "seed": 42,
      "temp": 0.8,
      "top-k": 40,
      "top-p": 0.95
    },
    "tokenizer": {
      "version": 1,
      "path": "/models/llama3-8b/tokenizer.json"
    },
    "engine": {
      "version": 1,
      "n-threads": 3,
      "backend": {
        "version": 1,
        "type": "QnnHtp",
        "QnnHtp": {
          "version": 1,
          "use-mmap": false,
          "spill-fill-bufsize": 0,
          "mmap-budget": 0,
          "poll": true,
          "pos-id-dim": 64,
          "cpu-mask": "0xe0",
          "kv-dim": 128,
          "rope-theta": 10000
        },
        "extensions": "htp_backend_ext_config.json"
      },
      "model": {
        "version": 1,
        "type": "binary",
        "binary": {
          "version": 1,
          "ctx-bins": [
            "/models/llama3-8b/weight_sharing_model_1_of_5.serialized.bin",
            "/models/llama3-8b/weight_sharing_model_2_of_5.serialized.bin",
            "/models/llama3-8b/weight_sharing_model_3_of_5.serialized.bin",
            "/models/llama3-8b/weight_sharing_model_4_of_5.serialized.bin",
            "/models/llama3-8b/weight_sharing_model_5_of_5.serialized.bin"
          ]
        }
      }
    }
  }
}
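For context, this is roughly how I drive the llama3 config on-device today, so the question is what changes for SSD. This is only a sketch from my own setup: the on-device paths, the pushed file layout, and the exact genie-t2t-run flags are assumptions that may differ across SDK versions.

```
# Push the Genie config, backend extension config, tokenizer, and
# context binaries to the device (paths are from my setup; adjust).
adb push genie_config.json /data/local/tmp/llama3/
adb push htp_backend_ext_config.json /data/local/tmp/llama3/
adb push models/llama3-8b /data/local/tmp/llama3/models/llama3-8b

# Run a single prompt through the Genie text-to-text runner,
# pointing LD_LIBRARY_PATH at the pushed QNN/Genie libraries.
adb shell "cd /data/local/tmp/llama3 && \
  LD_LIBRARY_PATH=/data/local/tmp/llama3/lib \
  ./genie-t2t-run -c genie_config.json -p 'Hello, how are you?'"
```

With the SSD context binaries I am unsure which sections of genie_config.json need to change, e.g. how the draft/verification models and the new kv-cache files should be referenced.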