LLMs on ANE with Flexible Inputs and States #2600

@RachidFK

Description

❓Question

Hello,

I am trying to run Llama-3.2-3B on the ANE on my M2 Max Mac running macOS 15.6.1.
Is LLM inference with the KV cache as a state, flexible input ranges, and int4 quantization/palettization supported on CPU + Neural Engine? If so, could you please point me to an example explaining the flow?

I tried to use this tutorial (which describes the deployment flow on the GPU) as a reference, but I could not get the converted model to run on the ANE: all of the nodes fall back to the CPU (I verified the compute-unit mapping through Xcode).

Thank you.

Metadata


Labels

question: Response providing clarification needed. Will not be assigned to a release.
