Vlm #436 (Merged)

README.md (2 changes: 2 additions & 0 deletions)
@@ -33,6 +33,8 @@ docker pull registry.cn-hangzhou.aliyuncs.com/yongyang/llmcompression:pure-lates

## :fire: Latest News

- **August 13, 2025:** 🚀 We have open-sourced our compression solution for **vision-language models (VLMs)**, supporting over a total of **20 algorithms** that cover both **token reduction** and **quantization**. This release enables flexible, plug-and-play compression strategies for a wide range of multimodal tasks. please refer to the [documentation](https://llmc-en.readthedocs.io/en/latest/advanced/token_reduction.html).


medium

The phrasing "supporting over a total of" is a bit redundant. For better readability and grammar, I suggest simplifying it. Also, "please" at the beginning of a sentence should be capitalized.

Suggested change
- **August 13, 2025:** 🚀 We have open-sourced our compression solution for **vision-language models (VLMs)**, supporting over a total of **20 algorithms** that cover both **token reduction** and **quantization**. This release enables flexible, plug-and-play compression strategies for a wide range of multimodal tasks. please refer to the [documentation](https://llmc-en.readthedocs.io/en/latest/advanced/token_reduction.html).
- **August 13, 2025:** 🚀 We have open-sourced our compression solution for **vision-language models (VLMs)**, supporting over **20 algorithms** that cover both **token reduction** and **quantization**. This release enables flexible, plug-and-play compression strategies for a wide range of multimodal tasks. Please refer to the [documentation](https://llmc-en.readthedocs.io/en/latest/advanced/token_reduction.html).


- **May 12, 2025:** 🔥 We now fully support quantization for the **`Wan2.1`** series of video generation models and provide export of truly quantized **INT8/FP8** weights, compatible with the [lightx2v](https://github.com/ModelTC/lightx2v) inference framework. For details, please refer to the [lightx2v documentation](https://llmc-en.readthedocs.io/en/latest/backend/lightx2v.html).

- **Feb 07, 2025:** 🔥 We now fully support quantization of large-scale **`MOE`** models like **`DeepSeekv3`**, **`DeepSeek-R1`**, and **`DeepSeek-R1-zero`** with **`671B`** parameters. You can now directly load FP8 weights without any extra conversion. AWQ and RTN quantization can run on a single 80GB GPU, and we also support the export of true quantized **INT4/INT8** weights.
README_zh.md (2 changes: 2 additions & 0 deletions)
@@ -33,6 +33,8 @@ docker pull registry.cn-hangzhou.aliyuncs.com/yongyang/llmcompression:pure-lates

## :fire: Latest News

- **August 13, 2025:** 🚀 We have open-sourced our compression solution for **vision-language models (VLMs)**, supporting a total of more than **20 algorithms** covering **token reduction** and **quantization**. This release provides flexible, plug-and-play compression strategies for multimodal tasks. For details, please refer to the [documentation](https://llmc-zhcn.readthedocs.io/en/latest/advanced/token_reduction.html).

- **May 12, 2025:** 🔥 We now fully support quantization for the **`Wan2.1`** series of video generation models, with export of truly quantized **INT8/FP8** weights, compatible with the [lightx2v](https://github.com/ModelTC/lightx2v) inference framework. For details, please refer to the [lightx2v documentation](https://llmc-zhcn.readthedocs.io/en/latest/backend/lightx2v.html).

- **February 7, 2025:** 🔥 We now fully support quantization of large-scale **`MOE`** models such as **`DeepSeekv3`**, **`DeepSeek-R1`**, and **`DeepSeek-R1-zero`** at 671B parameters. You can load `FP8` weights directly without extra conversion, run `AWQ` and `RTN` quantization on a single 80GB GPU, and export truly quantized **INT4/INT8** weights.
configs/sparsification/methods/Holitom/holitom.yml (26 changes: 26 additions & 0 deletions)
@@ -0,0 +1,26 @@
base:
    seed: &seed 42
model:
    type: Llava OneVision
    path: model path
    torch_dtype: auto
eval:
    eval_pos: [pretrain, transformed]
    type: vqa
    name: [mme]
    download: False
    path: MME dataset path
    bs: 1
    inference_per_block: False
sparse:
    method: TokenReduction
    special:
        method: HoliTom
        RETAIN_RATIO: 0.20
        T: 0.65
        HOLITOM_k: 18
        HOLITOM_r: 0.5
Comment on lines +19 to +22


medium

The hyperparameter keys use a mix of uppercase and lowercase letters. For consistency with other configurations in the project (e.g., FastV uses lowercase keys like rate), it's recommended to use lowercase snake_case for these keys. This improves readability and maintainability.

        retain_ratio: 0.20
        t: 0.65
        holitom_k: 18
        holitom_r: 0.5

save:
    save_trans: False
    save_fake: False
    save_path: /path/to/save/
docs/en/source/advanced/token_reduction.md (72 changes: 72 additions & 0 deletions)
@@ -0,0 +1,72 @@


# Token Reduction

LightCompress currently supports token reduction for mainstream multimodal large language models. Configuration is very simple—plug and play.

Here is an example configuration

```yaml
base:
    seed: &seed 42
model:
    type: Llava
    path: model path
    torch_dtype: auto
eval:
    eval_pos: [pretrain, transformed]
    type: vqa
    name: [gqa, mmbench_en_dev, mme]
    bs: 1
    inference_per_block: False
sparse:
    method: TokenReduction
    special:
        method: FastV
        pruning_loc: 3
        rate: 0.778
save:
    save_trans: False
    save_fake: False
    save_path: /path/to/save/
```
The configuration file contains three core sections, including:
1. **`model`**
For model selection, you can choose LLaVA, LLaVA-NeXT, Qwen2.5VL, and LLaVA OneVision, etc. These models cover both image and video tasks. For the detailed list of supported models, see the file. LightCompress will support more models in the future.

2. **`eval`**
For the `eval_pos` parameter:
- `pretrain` denotes the original model that keeps all visual tokens.
- `transformed` denotes the model with token reduction applied.
LightCompress integrates lmms-eval to evaluate various downstream datasets. Set `type` to `vqa`, and specify the datasets in `name` following the naming conventions in the lmms-eval documentation.

3. **`sparse`**
Set `method` to `TokenReduction` first, and then specify the concrete algorithm and related hyperparameters under `special`. Since each algorithm has different hyperparameters, refer to the configuration files for details.
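
As a concrete illustration of how only the `special` block changes between algorithms, here is a minimal sketch of the `sparse` section for HoliTom, with hyperparameter values copied from the `configs/sparsification/methods/Holitom/holitom.yml` file added in this PR (the values are illustrative defaults rather than tuned settings, and the keys are algorithm-specific):

```yaml
sparse:
    method: TokenReduction
    special:
        method: HoliTom        # algorithm name, as in the shipped config
        RETAIN_RATIO: 0.20     # fraction of visual tokens retained (algorithm-specific key)
        T: 0.65                # remaining keys are HoliTom-specific hyperparameters
        HOLITOM_k: 18
        HOLITOM_r: 0.5
```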

## Combining Quantization

LightCompress also supports an extreme compression scheme that combines token reduction with quantization. First, choose a quantization algorithm to save a `fake_qunat` model (see the quantization section of the docs). Then load this model and add the `token_reduction` field under `quant`.
Comment on lines +37 to +50


medium

This documentation file has a few areas for improvement:

  1. On line 37, "see the file" is vague. Please provide a direct markdown link to the file listing supported models.
  2. On line 46, "refer to the configuration files" should also be a link to the relevant directory for easier navigation.
  3. On line 50, there's a typo: fake_qunat should be fake_quant.


```yaml
quant:
    method: RTN
    weight:
        bit: 4
        symmetric: False
        granularity: per_group
        group_size: 128
    special:
        actorder: True
        static_groups: True
        percdamp: 0.01
        blocksize: 128
        true_sequential: True
        quant_out: True
    token_reduction:
        method: FastV
        special:
            pruning_loc: 3
            rate: 0.778
```
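
For context, the first stage mentioned above is an ordinary quantization run whose output is exported as the fake-quant checkpoint. Below is a minimal sketch of such a stage-one config, assuming the `save_fake` switch shown in the configs in this PR is what triggers that export (paths are placeholders, not real locations):

```yaml
quant:
    method: RTN
    weight:
        bit: 4
        symmetric: False
        granularity: per_group
        group_size: 128
save:
    save_fake: True            # assumed switch for exporting the fake_quant model loaded in stage two
    save_path: /path/to/save/
```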
docs/en/source/configs.md (20 changes: 20 additions & 0 deletions)
@@ -360,6 +360,26 @@ quant:
static: True
```

## sparse

<font color=792ee5> sparse.method </font>

The name of the sparsification algorithm used. This includes both [model sparsification](https://github.com/ModelTC/LightCompress/blob/main/llmc/compression/sparsification/__init__.pyn) and [reduction](https://github.com/ModelTC/LightCompress/blob/main/llmc/compression/token_reduction/__init__.py) of visual tokens. All supported algorithms can be found in the corresponding files.


high

The link to the __init__ file has a typo in the extension. It should be .py, not .pyn.

Suggested change
The name of the sparsification algorithm used. This includes both [model sparsification](https://github.com/ModelTC/LightCompress/blob/main/llmc/compression/sparsification/__init__.pyn) and [reduction](https://github.com/ModelTC/LightCompress/blob/main/llmc/compression/token_reduction/__init__.py) of visual tokens. All supported algorithms can be found in the corresponding files.
The name of the sparsification algorithm used. This includes both [model sparsification](https://github.com/ModelTC/LightCompress/blob/main/llmc/compression/sparsification/__init__.py) and [reduction](https://github.com/ModelTC/LightCompress/blob/main/llmc/compression/token_reduction/__init__.py) of visual tokens. All supported algorithms can be found in the corresponding files.


It’s worth noting that for model sparsification, you need to specify the exact algorithm name, whereas for token reduction, you only need to set it to `TokenReduction` first, and then specify the exact algorithm under `special`.

```yaml
sparse:
    method: Wanda
```

```yaml
sparse:
    method: TokenReduction
    special:
        method: FastV
```

## save

<font color=792ee5> save.save_vllm</font>
docs/en/source/index.rst (1 change: 1 addition & 0 deletions)
@@ -36,6 +36,7 @@ arxiv: https://arxiv.org/abs/2405.06001
advanced/VLM_quant&img-txt_dataset.md
advanced/mix_bits.md
advanced/sparsification.md
advanced/token_reduction.md

.. toctree::
:maxdepth: 2
docs/zh_cn/source/advanced/VLM_quant&img-txt_dataset.md (6 changes: 3 additions & 3 deletions)
@@ -1,8 +1,8 @@
# VLM quant and custom_mm datatsets


medium

There's a typo in the original heading: datatsets should be datasets.

# VLM quantization and custom_mm datasets

llmc currently supports calibrating and quantizing VLM models with image-text datasets

## VLM quant
## VLM quantization
The currently supported models are as follows:
1. llava

@@ -34,7 +34,7 @@ calib:
padding: True
```

## custom_mm datatsets


medium

There's a typo in the original heading: datatsets should be datasets.

## custom_mm datasets
The custom_mm dataset format is as follows:
```
custom_mm-datasets/
docs/zh_cn/source/advanced/Vit_quant&img_dataset.md (6 changes: 3 additions & 3 deletions)
@@ -1,8 +1,8 @@
# Vit quant and img datatsets


medium

There's a typo in the original heading: datatsets should be datasets.

# Vit quantization and img datasets

llmc currently supports calibrating and quantizing Vit models with image datasets

## Vit quant
## Vit quantization

Below is an example configuration

@@ -33,7 +33,7 @@ eval:
eval_token_consist: False
```

## img datatsets


medium

There's a typo in the original heading: datatsets should be datasets.

## img datasets
img dataset format requirement: images must be present in the img dataset directory

Example img dataset layout:
docs/zh_cn/source/advanced/token_reduction.md (68 changes: 68 additions & 0 deletions)
@@ -0,0 +1,68 @@
# Token Reduction

LightCompress currently supports token reduction for mainstream multimodal large language models; configuration is very simple and plug-and-play.

Below is an example configuration

```yaml
base:
    seed: &seed 42
model:
    type: Llava
    path: model path
    torch_dtype: auto
eval:
    eval_pos: [pretrain, transformed]
    type: vqa
    name: [gqa, mmbench_en_dev, mme]
    bs: 1
    inference_per_block: False
sparse:
    method: TokenReduction
    special:
        method: FastV
        pruning_loc: 3
        rate: 0.778
save:
    save_trans: False
    save_fake: False
    save_path: /path/to/save/
```
The configuration file contains three core sections:
1. `model`
For model selection, you can choose LLaVA, LLaVA-NeXT, Qwen2.5VL, LLaVA OneVision, and others; these models cover both image and video tasks. The detailed list of supported models is available in this [file](https://github.com/ModelTC/LightCompress/blob/main/llmc/models/__init__.py). LightCompress will support more models in the future.

2. `eval`
For the `eval_pos` parameter, `pretrain` denotes the original model that keeps all visual tokens, while `transformed` denotes the model with the corresponding token reduction algorithm applied. LightCompress integrates [lmms-eval](https://github.com/EvolvingLMMs-Lab/lmms-eval) to evaluate various downstream datasets: set `type` to `vqa`, and name the downstream evaluation datasets in `name` following the naming conventions in the lmms-eval [documentation](https://github.com/EvolvingLMMs-Lab/lmms-eval/blob/main/docs/current_tasks.md).

3. `sparse`
`method` must first be set to TokenReduction; the concrete algorithm and its hyperparameters are then specified under `special`. Since each algorithm has different hyperparameters, see the [configuration files](https://github.com/ModelTC/LightCompress/tree/main/configs/sparsification/methods) for details.


## Combining Quantization

LightCompress also supports an extreme compression scheme that combines token reduction and quantization. First, choose a quantization algorithm and save a `fake_qunat` model (see the quantization section of the docs). Then load this model and add a `token_reduction` field under `quant`.


medium

There is a typo in fake_qunat. It should be fake_quant.

Suggested change
LightCompress also supports an extreme compression scheme that combines token reduction and quantization. First, choose a quantization algorithm and save a `fake_qunat` model (see the quantization section of the docs). Then load this model and add a `token_reduction` field under `quant`.
LightCompress also supports an extreme compression scheme that combines token reduction and quantization. First, choose a quantization algorithm and save a `fake_quant` model (see the quantization section of the docs). Then load this model and add a `token_reduction` field under `quant`.


```yaml
quant:
    method: RTN
    weight:
        bit: 4
        symmetric: False
        granularity: per_group
        group_size: 128
    special:
        actorder: True
        static_groups: True
        percdamp: 0.01
        blocksize: 128
        true_sequential: True
        quant_out: True
    token_reduction:
        method: FastV
        special:
            pruning_loc: 3
            rate: 0.778
```
docs/zh_cn/source/configs.md (20 changes: 20 additions & 0 deletions)
@@ -401,6 +401,26 @@ quant:
granularity: per_token
```

## sparse

<font color=792ee5> sparse.method </font>

The name of the sparsification algorithm to use. This covers both [sparsification of the model](https://github.com/ModelTC/LightCompress/blob/main/llmc/compression/sparsification/__init__.pyn) and [reduction](https://github.com/ModelTC/LightCompress/blob/main/llmc/compression/token_reduction/__init__.py) of visual tokens; all supported algorithms can be found in those files.


high

The link to the __init__ file has a typo in the extension. It should be .py, not .pyn.

Suggested change
The name of the sparsification algorithm to use. This covers both [sparsification of the model](https://github.com/ModelTC/LightCompress/blob/main/llmc/compression/sparsification/__init__.pyn) and [reduction](https://github.com/ModelTC/LightCompress/blob/main/llmc/compression/token_reduction/__init__.py) of visual tokens; all supported algorithms can be found in those files.
The name of the sparsification algorithm to use. This covers both [sparsification of the model](https://github.com/ModelTC/LightCompress/blob/main/llmc/compression/sparsification/__init__.py) and [reduction](https://github.com/ModelTC/LightCompress/blob/main/llmc/compression/token_reduction/__init__.py) of visual tokens; all supported algorithms can be found in those files.


Note that for model sparsification you must specify the exact algorithm name, whereas for token reduction you only need to set it to `TokenReduction` first and then specify the concrete algorithm under `special`.

```yaml
sparse:
    method: Wanda
```

```yaml
sparse:
    method: TokenReduction
    special:
        method: FastV
```

## save

<font color=792ee5> save.save_vllm </font>
docs/zh_cn/source/index.rst (1 change: 1 addition & 0 deletions)
@@ -37,6 +37,7 @@ arxiv链接: https://arxiv.org/abs/2405.06001
advanced/VLM_quant&img-txt_dataset.md
advanced/mix_bits.md
advanced/sparsification.md
advanced/token_reduction.md

.. toctree::
:maxdepth: 2