ggml is compiled with cublas, but GPU is not used #106

@artur-ag

Description

I compiled ggml with -DGGML_CUBLAS=ON and then clip.cpp, and used it to get text encodings, but the GPU is not being used. The code takes the same amount of time as it did with CPU-only. Is this expected? Does clip_text_encode always use the CPU no matter what? Or did I forget to do something?
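For reference, a build along these lines should produce a cuBLAS-enabled ggml (a sketch: the flag comes from the issue, but the build directory and configuration names are illustrative):

```shell
# Configure ggml with the cuBLAS backend enabled (flag as named in the issue)
cmake -B build -DGGML_CUBLAS=ON
# Compile in Release mode
cmake --build build --config Release
```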

Details:
ggml is detecting the GPU without problem (Nvidia AGX Orin):

$ ./myapp
ggml_init_cublas: found 1 CUDA devices:
  Device 0: Orin, compute capability 8.7

Simplified version of my code:

#include "clip.h"
#include <string>
// ...
std::string model = "clip-vit-base-patch32_ggml-text-model-f16.gguf";
clip_ctx *ctx = clip_model_load(model.c_str(), verbosity);
clip_tokens tokens;   // token buffer reused across iterations
float txt_vec[512];
for (int i = 0; i < 1000; i++) {
  clip_tokenize(ctx, "person", &tokens);
  clip_text_encode(ctx, /*threads:*/4, &tokens, txt_vec, true);
}

This takes 8 seconds to finish. While it runs I have jtop open, and the GPU is only active during the first ~3 seconds, while ggml queries the device name and compute capability to print them. After that the GPU goes idle, and GPU usage stays at 0%.
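To narrow down where the 8 seconds go, it can help to time the tokenize and encode calls separately with std::chrono. A minimal, library-independent sketch of that timing pattern (the sleeping lambda here is only a stand-in for a call like clip_text_encode):

#include <chrono>
#include <cstdio>
#include <thread>

// Time a callable and return the elapsed wall-clock time in milliseconds.
template <typename F>
double time_ms(F&& f) {
    auto t0 = std::chrono::steady_clock::now();
    f();
    auto t1 = std::chrono::steady_clock::now();
    return std::chrono::duration<double, std::milli>(t1 - t0).count();
}

int main() {
    // Stand-in for the encode call: sleep for 50 ms.
    double ms = time_ms([] {
        std::this_thread::sleep_for(std::chrono::milliseconds(50));
    });
    std::printf("encode took %.1f ms\n", ms);
    return 0;
}

Wrapping each clip_* call this way shows whether the time is spent in tokenization, graph execution, or model load, which is useful to report alongside the jtop observation.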
