
[Python] clip_tokenize: unknown token ' ' #107

@h3ndrik

I copy-pasted the example Python code from examples/python_bindings/README.md.

The tokenizer complains about the spaces in 'cat on a Turtle', although the script still runs to completion and prints a similarity score.

I've tried both "mys/ggml_CLIP-ViT-B-32-laion2B-s34B-b79K/CLIP-ViT-B-32-laion2B-s34B-b79K_ggml-model-f16.gguf" and the q8_0 variant.

Full log:

(venv)$ python test_clip.py 
[File Info] models/CLIP-ViT-B-32-laion2B-s34B-b79K_ggml-model-q8_0.gguf
clip_model_load: description:  two-tower CLIP model
clip_model_load: GGUF version: 2
clip_model_load: alignment:    32
clip_model_load: n_tensors:    397
clip_model_load: n_kv:         25
clip_model_load: ftype:        q8_0

clip_model_load: text_encoder:   1
clip_model_load: vision_encoder: 1
clip_model_load: model size:     156.10 MB
clip_model_load: metadata size:  0.13 MB

clip_model_load: text model hparams
n_vocab            49408
num_positions      77
t_hidden_size      512
t_n_intermediate   2048
t_projection_dim   512
t_n_head           8
t_n_layer          12

clip_model_load: vision model hparams
image_size         224
patch_size         32
v_hidden_size      768
v_n_intermediate   3072
v_projection_dim   512
v_n_head           12
v_n_layer          12

clip_model_load: 24 MB of memory allocated
clip_tokenize: unknown token ' '
Similarity score: 0.1402660459280014
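
The log does not show why the space is rejected. One possible explanation, offered purely as an assumption, is that the prompt is being matched against the vocabulary piece by piece, including the spaces themselves. CLIP-style BPE vocabularies typically mark word ends with a `</w>` suffix rather than storing a literal space, so a lookup of `' '` would find nothing. The toy sketch below illustrates that failure mode only; `TOY_VOCAB` and `toy_tokenize` are hypothetical names and are not clip.cpp's actual tokenizer or vocabulary.

```python
import re

# Hypothetical four-entry vocabulary for illustration; real CLIP vocabularies
# have 49408 entries. Word ends are marked with "</w>", and there is
# deliberately no entry for a literal space.
TOY_VOCAB = {"cat</w>": 0, "on</w>": 1, "a</w>": 2, "turtle</w>": 3}


def toy_tokenize(text, vocab, split_on_whitespace):
    """Return (token_ids, unknown_pieces) for a toy word-level vocabulary."""
    text = text.lower()
    if split_on_whitespace:
        # Whitespace is consumed as a separator and never looked up.
        pieces = text.split()
    else:
        # Whitespace characters are kept as pieces of their own.
        pieces = re.findall(r"\s|\S+", text)

    token_ids, unknown = [], []
    for piece in pieces:
        key = piece if piece.isspace() else piece + "</w>"
        if key in vocab:
            token_ids.append(vocab[key])
        else:
            unknown.append(piece)
    return token_ids, unknown


ids, unknown = toy_tokenize("cat on a Turtle", TOY_VOCAB, split_on_whitespace=False)
for piece in unknown:
    print(f"unknown token {piece!r}")
```

In this sketch both modes produce the same token ids for the words; the only difference is that the second mode reports each space as unknown. If something similar were happening in the real tokenizer, it would be consistent with the script still printing a similarity score after the warning, though that remains unverified here.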
