Vlm #436 (Merged)

README.md (2 changes: 2 additions & 0 deletions)
@@ -33,6 +33,8 @@ docker pull registry.cn-hangzhou.aliyuncs.com/yongyang/llmcompression:pure-lates

## :fire: Latest News

- **August 13, 2025:** 🚀 We have open-sourced our compression solution for **vision-language models (VLMs)**, supporting over a total of **20 algorithms** that cover both **token reduction** and **quantization**. This release enables flexible, plug-and-play compression strategies for a wide range of multimodal tasks. please refer to the [documentation](https://llmc-en.readthedocs.io/en/latest/advanced/token_reduction.html).


medium

The phrasing "supporting over a total of" is a bit redundant. For better readability and grammar, I suggest simplifying it. Also, "please" at the beginning of a sentence should be capitalized.

Suggested change
- **August 13, 2025:** 🚀 We have open-sourced our compression solution for **vision-language models (VLMs)**, supporting over a total of **20 algorithms** that cover both **token reduction** and **quantization**. This release enables flexible, plug-and-play compression strategies for a wide range of multimodal tasks. please refer to the [documentation](https://llmc-en.readthedocs.io/en/latest/advanced/token_reduction.html).
- **August 13, 2025:** 🚀 We have open-sourced our compression solution for **vision-language models (VLMs)**, supporting over **20 algorithms** that cover both **token reduction** and **quantization**. This release enables flexible, plug-and-play compression strategies for a wide range of multimodal tasks. Please refer to the [documentation](https://llmc-en.readthedocs.io/en/latest/advanced/token_reduction.html).


- **May 12, 2025:** 🔥 We now fully support quantization for the **`Wan2.1`** series of video generation models and provide export of truly quantized **INT8/FP8** weights, compatible with the [lightx2v](https://github.com/ModelTC/lightx2v) inference framework. For details, please refer to the [lightx2v documentation](https://llmc-en.readthedocs.io/en/latest/backend/lightx2v.html).

- **Feb 07, 2025:** 🔥 We now fully support quantization of large-scale **`MOE`** models like **`DeepSeekv3`**, **`DeepSeek-R1`**, and **`DeepSeek-R1-zero`** with **`671B`** parameters. You can now directly load FP8 weights without any extra conversion. AWQ and RTN quantization can run on a single 80GB GPU, and we also support the export of true quantized **INT4/INT8** weights.
README_zh.md (2 changes: 2 additions & 0 deletions)
@@ -33,6 +33,8 @@ docker pull registry.cn-hangzhou.aliyuncs.com/yongyang/llmcompression:pure-lates

## :fire: Latest News

- **August 13, 2025:** 🚀 We have open-sourced our compression solution for **vision-language models (VLMs)**, supporting a total of more than **20 algorithms** covering **token reduction** and **quantization**. This release provides flexible, plug-and-play compression strategies for multimodal tasks. For details, please refer to the [documentation](https://llmc-zhcn.readthedocs.io/en/latest/advanced/token_reduction.html).

- **May 12, 2025:** 🔥 We now fully support quantization for the **`Wan2.1`** series of video generation models, with export of truly quantized **INT8/FP8** weights, compatible with the [lightx2v](https://github.com/ModelTC/lightx2v) inference framework. For details, please refer to the [lightx2v documentation](https://llmc-zhcn.readthedocs.io/en/latest/backend/lightx2v.html).

- **February 7, 2025:** 🔥 We now fully support quantization of large-scale **`MOE`** models such as **`DeepSeekv3`**, **`DeepSeek-R1`**, and **`DeepSeek-R1-zero`** at 671B parameters. You can load `FP8` weights directly without extra conversion, run `AWQ` and `RTN` quantization on a single 80GB GPU, and export truly quantized **INT4/INT8** weights.
configs/sparsification/methods/Holitom/holitom.yml (26 changes: 26 additions & 0 deletions)
@@ -0,0 +1,26 @@
base:
    seed: &seed 42
model:
    type: Llava OneVision
    path: model path
    torch_dtype: auto
eval:
    eval_pos: [pretrain, transformed]
    type: vqa
    name: [mme]
    download: False
    path: MME dataset path
    bs: 1
    inference_per_block: False
sparse:
    method: TokenReduction
    special:
        method: HoliTom
        RETAIN_RATIO: 0.20
        T: 0.65
        HOLITOM_k: 18
        HOLITOM_r: 0.5
Comment on lines +19 to +22


medium

The hyperparameter keys use a mix of uppercase and lowercase letters. For consistency with other configurations in the project (e.g., FastV uses lowercase keys like rate), it's recommended to use lowercase snake_case for these keys. This improves readability and maintainability.

        retain_ratio: 0.20
        t: 0.65
        holitom_k: 18
        holitom_r: 0.5

save:
    save_trans: False
    save_fake: False
    save_path: /path/to/save/
docs/en/source/advanced/token_reduction.md (72 changes: 72 additions & 0 deletions)
@@ -0,0 +1,72 @@


# Token Reduction

LightCompress currently supports token reduction for mainstream multimodal large language models. Configuration is very simple—plug and play.

Here is an example configuration

```yaml
base:
    seed: &seed 42
model:
    type: Llava
    path: model path
    torch_dtype: auto
eval:
    eval_pos: [pretrain, transformed]
    type: vqa
    name: [gqa, mmbench_en_dev, mme]
    bs: 1
    inference_per_block: False
sparse:
    method: TokenReduction
    special:
        method: FastV
        pruning_loc: 3
        rate: 0.778
save:
    save_trans: False
    save_fake: False
    save_path: /path/to/save/
```
The configuration file contains three core sections, including:
1. **`model`**
For model selection, you can choose LLaVA, LLaVA-NeXT, Qwen2.5VL, and LLaVA OneVision, etc. These models cover both image and video tasks. For the detailed list of supported models, see the file. LightCompress will support more models in the future.

2. **`eval`**
For the `eval_pos` parameter:
- `pretrain` denotes the original model that keeps all visual tokens.
- `transformed` denotes the model with token reduction applied.
LightCompress integrates lmms-eval to evaluate various downstream datasets. Set `type` to `vqa`, and specify the datasets in `name` following the naming conventions in the lmms-eval documentation.

3. **`sparse`**
Set `method` to `TokenReduction` first, and then specify the concrete algorithm and related hyperparameters under `special`. Since each algorithm has different hyperparameters, refer to the configuration files for details.
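
As a concrete illustration of how only the `special` block changes between algorithms, here is a minimal sketch of the `sparse` section for HoliTom, with hyperparameter values copied from the `configs/sparsification/methods/Holitom/holitom.yml` file added in this PR (the values are illustrative defaults rather than tuned settings, and the keys are algorithm-specific):

```yaml
sparse:
    method: TokenReduction
    special:
        method: HoliTom        # algorithm name, as in the shipped config
        RETAIN_RATIO: 0.20     # fraction of visual tokens retained (algorithm-specific key)
        T: 0.65                # remaining keys are HoliTom-specific hyperparameters
        HOLITOM_k: 18
        HOLITOM_r: 0.5
```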

## Combining Quantization

LightCompress also supports an extreme compression scheme that combines token reduction with quantization. First, choose a quantization algorithm to save a `fake_qunat` model (see the quantization section of the docs). Then load this model and add the `token_reduction` field under `quant`.
Comment on lines +37 to +50


medium

This documentation file has a few areas for improvement:

  1. On line 37, "see the file" is vague. Please provide a direct markdown link to the file listing supported models.
  2. On line 46, "refer to the configuration files" should also be a link to the relevant directory for easier navigation.
  3. On line 50, there's a typo: fake_qunat should be fake_quant.


```yaml
quant:
    method: RTN
    weight:
        bit: 4
        symmetric: False
        granularity: per_group
        group_size: 128
    special:
        actorder: True
        static_groups: True
        percdamp: 0.01
        blocksize: 128
        true_sequential: True
        quant_out: True
    token_reduction:
        method: FastV
        special:
            pruning_loc: 3
            rate: 0.778
```
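
For context, the first stage mentioned above is an ordinary quantization run whose output is exported as the fake-quant checkpoint. Below is a minimal sketch of such a stage-one config, assuming the `save_fake` switch shown in the configs in this PR is what triggers that export (paths are placeholders, not real locations):

```yaml
quant:
    method: RTN
    weight:
        bit: 4
        symmetric: False
        granularity: per_group
        group_size: 128
save:
    save_fake: True            # assumed switch for exporting the fake_quant model loaded in stage two
    save_path: /path/to/save/
```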
docs/en/source/configs.md (20 changes: 20 additions & 0 deletions)
@@ -360,6 +360,26 @@ quant:
static: True
```

## sparse

<font color=792ee5> sparse.method </font>

The name of the sparsification algorithm used. This includes both [model sparsification](https://github.com/ModelTC/LightCompress/blob/main/llmc/compression/sparsification/__init__.pyn) and [reduction](https://github.com/ModelTC/LightCompress/blob/main/llmc/compression/token_reduction/__init__.py) of visual tokens. All supported algorithms can be found in the corresponding files.


high

The link to the __init__ file has a typo in the extension. It should be .py, not .pyn.

Suggested change
The name of the sparsification algorithm used. This includes both [model sparsification](https://github.com/ModelTC/LightCompress/blob/main/llmc/compression/sparsification/__init__.pyn) and [reduction](https://github.com/ModelTC/LightCompress/blob/main/llmc/compression/token_reduction/__init__.py) of visual tokens. All supported algorithms can be found in the corresponding files.
The name of the sparsification algorithm used. This includes both [model sparsification](https://github.com/ModelTC/LightCompress/blob/main/llmc/compression/sparsification/__init__.py) and [reduction](https://github.com/ModelTC/LightCompress/blob/main/llmc/compression/token_reduction/__init__.py) of visual tokens. All supported algorithms can be found in the corresponding files.


It’s worth noting that for model sparsification, you need to specify the exact algorithm name, whereas for token reduction, you only need to set it to `TokenReduction` first, and then specify the exact algorithm under `special`.

```yaml
sparse:
    method: Wanda
```

```yaml
sparse:
    method: TokenReduction
    special:
        method: FastV
```

## save

<font color=792ee5> save.save_vllm</font>
docs/en/source/index.rst (1 change: 1 addition & 0 deletions)
@@ -36,6 +36,7 @@ arxiv: https://arxiv.org/abs/2405.06001
advanced/VLM_quant&img-txt_dataset.md
advanced/mix_bits.md
advanced/sparsification.md
advanced/token_reduction.md

.. toctree::
:maxdepth: 2
docs/zh_cn/source/advanced/VLM_quant&img-txt_dataset.md (6 changes: 3 additions & 3 deletions)
@@ -1,8 +1,8 @@
# VLM quant and custom_mm datatsets


medium

There's a typo in the original heading: datatsets should be datasets.

# VLM quantization and custom_mm datasets

llmc currently supports calibrating and quantizing VLM models with image-text datasets

## VLM quant
## VLM quantization
The currently supported models are as follows:
1. llava

@@ -34,7 +34,7 @@ calib:
padding: True
```

## custom_mm datatsets


medium

There's a typo in the original heading: datatsets should be datasets.

## custom_mm datasets
The custom_mm dataset format is as follows:
```
custom_mm-datasets/
docs/zh_cn/source/advanced/Vit_quant&img_dataset.md (6 changes: 3 additions & 3 deletions)
@@ -1,8 +1,8 @@
# Vit quant and img datatsets


medium

There's a typo in the original heading: datatsets should be datasets.

# Vit quantization and img datasets

llmc currently supports calibrating and quantizing Vit models with image datasets

## Vit quant
## Vit quantization

Below is an example configuration

@@ -33,7 +33,7 @@ eval:
eval_token_consist: False
```

## img datatsets


medium

There's a typo in the original heading: datatsets should be datasets.

## img datasets
img dataset format requirement: images must be present in the img dataset directory

Example img dataset layout:
docs/zh_cn/source/advanced/token_reduction.md (68 changes: 68 additions & 0 deletions)
@@ -0,0 +1,68 @@
# Token Reduction

LightCompress currently supports token reduction for mainstream multimodal large language models; configuration is very simple and plug-and-play.

Below is an example configuration

```yaml
base:
    seed: &seed 42
model:
    type: Llava
    path: model path
    torch_dtype: auto
eval:
    eval_pos: [pretrain, transformed]
    type: vqa
    name: [gqa, mmbench_en_dev, mme]
    bs: 1
    inference_per_block: False
sparse:
    method: TokenReduction
    special:
        method: FastV
        pruning_loc: 3
        rate: 0.778
save:
    save_trans: False
    save_fake: False
    save_path: /path/to/save/
```
The configuration file contains three core sections:
1. `model`
For model selection, you can choose LLaVA, LLaVA-NeXT, Qwen2.5VL, LLaVA OneVision, and others; these models cover both image and video tasks. The detailed list of supported models is available in this [file](https://github.com/ModelTC/LightCompress/blob/main/llmc/models/__init__.py). LightCompress will support more models in the future.

2. `eval`
For the `eval_pos` parameter, `pretrain` denotes the original model that keeps all visual tokens, while `transformed` denotes the model with the corresponding token reduction algorithm applied. LightCompress integrates [lmms-eval](https://github.com/EvolvingLMMs-Lab/lmms-eval) to evaluate various downstream datasets: set `type` to `vqa`, and name the downstream evaluation datasets in `name` following the naming conventions in the lmms-eval [documentation](https://github.com/EvolvingLMMs-Lab/lmms-eval/blob/main/docs/current_tasks.md).

3. `sparse`
`method` must first be set to TokenReduction; the concrete algorithm and its hyperparameters are then specified under `special`. Since each algorithm has different hyperparameters, see the [configuration files](https://github.com/ModelTC/LightCompress/tree/main/configs/sparsification/methods) for details.


## Combining Quantization

LightCompress also supports an extreme compression scheme that combines token reduction and quantization. First, choose a quantization algorithm and save a `fake_qunat` model (see the quantization section of the docs). Then load this model and add a `token_reduction` field under `quant`.


medium

There is a typo in fake_qunat. It should be fake_quant.

Suggested change
LightCompress also supports an extreme compression scheme that combines token reduction and quantization. First, choose a quantization algorithm and save a `fake_qunat` model (see the quantization section of the docs). Then load this model and add a `token_reduction` field under `quant`.
LightCompress also supports an extreme compression scheme that combines token reduction and quantization. First, choose a quantization algorithm and save a `fake_quant` model (see the quantization section of the docs). Then load this model and add a `token_reduction` field under `quant`.


```yaml
quant:
    method: RTN
    weight:
        bit: 4
        symmetric: False
        granularity: per_group
        group_size: 128
    special:
        actorder: True
        static_groups: True
        percdamp: 0.01
        blocksize: 128
        true_sequential: True
        quant_out: True
    token_reduction:
        method: FastV
        special:
            pruning_loc: 3
            rate: 0.778
```
docs/zh_cn/source/configs.md (20 changes: 20 additions & 0 deletions)
@@ -401,6 +401,26 @@ quant:
granularity: per_token
```

## sparse

<font color=792ee5> sparse.method </font>

The name of the sparsification algorithm to use. This covers both [sparsification of the model](https://github.com/ModelTC/LightCompress/blob/main/llmc/compression/sparsification/__init__.pyn) and [reduction](https://github.com/ModelTC/LightCompress/blob/main/llmc/compression/token_reduction/__init__.py) of visual tokens; all supported algorithms can be found in those files.


high

The link to the __init__ file has a typo in the extension. It should be .py, not .pyn.

Suggested change
The name of the sparsification algorithm to use. This covers both [sparsification of the model](https://github.com/ModelTC/LightCompress/blob/main/llmc/compression/sparsification/__init__.pyn) and [reduction](https://github.com/ModelTC/LightCompress/blob/main/llmc/compression/token_reduction/__init__.py) of visual tokens; all supported algorithms can be found in those files.
The name of the sparsification algorithm to use. This covers both [sparsification of the model](https://github.com/ModelTC/LightCompress/blob/main/llmc/compression/sparsification/__init__.py) and [reduction](https://github.com/ModelTC/LightCompress/blob/main/llmc/compression/token_reduction/__init__.py) of visual tokens; all supported algorithms can be found in those files.


Note that for model sparsification you must specify the exact algorithm name, whereas for token reduction you only need to set it to `TokenReduction` first and then specify the concrete algorithm under `special`.

```yaml
sparse:
    method: Wanda
```

```yaml
sparse:
    method: TokenReduction
    special:
        method: FastV
```

## save

<font color=792ee5> save.save_vllm </font>
docs/zh_cn/source/index.rst (1 change: 1 addition & 0 deletions)
@@ -37,6 +37,7 @@ arxiv链接: https://arxiv.org/abs/2405.06001
advanced/VLM_quant&img-txt_dataset.md
advanced/mix_bits.md
advanced/sparsification.md
advanced/token_reduction.md

.. toctree::
:maxdepth: 2