Labels: question (Response providing clarification needed. Will not be assigned to a release.)
Description
❓Question
Hello,
I am trying to run Llama-3.2-3B on the ANE on my M2 Max Mac running macOS 15.6.1.
Is LLM inference with the KV cache as a state, flexible input ranges, and int4 quantization/palettization supported on CPU + Neural Engine? If so, could you please point me to an example explaining the flow?
I tried to use this tutorial (which describes the deployment flow for the GPU) as a reference, but I could not get the converted model to load on the ANE: all ops fall back to the CPU (I verified the compute-unit mapping through the Xcode performance report).
Thank you.