LLMs on ANE with Flexible Inputs and States #2600

@RachidFK

Description

❓Question

Hello,

I am trying to run Llama-3.2-3B on the ANE on my M2 Max Mac running macOS 15.6.1.
Is LLM inference with the KV cache as a state, flexible input ranges, and int4 quantization/palettization supported on CPU + Neural Engine? If so, could you please point me to an example explaining the flow?

I tried to use this tutorial (which describes the deployment flow on the GPU) as a reference, but I could not get the converted model to run on the ANE: all of the nodes fall back to the CPU (I verified the compute-unit mapping through Xcode).

Thank you.

Metadata


Labels

question: Response providing clarification needed. Will not be assigned to a release.
