Conversation

@WANDY666 (Contributor) commented Jan 5, 2026

No description provided.

@gemini-code-assist (Contributor) commented

Summary of Changes

Hello @WANDY666, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request addresses state management within the Qwen2-VL model by ensuring that the maximum sequence length (max_seq_len) is correctly initialized during the prefill stage. This keeps the model's internal inference state consistent when the initial input sequence is processed.

Highlights

  • Qwen2-VL Model State Initialization: A conditional assignment for self.max_seq_len has been added within the init_some_extra_state method for the Qwen2-VL model. This ensures that max_seq_len is explicitly set to self.max_kv_seq_len when the model is in its prefill phase, improving state management during initial token processing.
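For readers who want to see the shape of the change in isolation, the following is a minimal, self-contained Python sketch of such a conditional initialization. The class name and constructor are illustrative stand-ins, not the actual LightLLM Qwen2-VL infer-state class; only the conditional assignments mirror the pattern in the diff reviewed below.

# Illustrative sketch only: the class and constructor are stand-ins;
# just the conditional assignments mirror the change described above.
class Qwen2VLInferStateSketch:
    def __init__(self, is_prefill, max_kv_seq_len, max_q_seq_len, enable_fa3):
        self.is_prefill = is_prefill
        self.max_kv_seq_len = max_kv_seq_len
        self.max_q_seq_len = max_q_seq_len
        self.enable_fa3 = enable_fa3
        self.max_seq_len = None
        self.q_max_seq_len = None

    def init_some_extra_state(self):
        # New behavior: populate max_seq_len during prefill,
        # not only when FlashAttention-3 is enabled.
        if self.is_prefill:
            self.max_seq_len = self.max_kv_seq_len
        if self.enable_fa3:
            self.max_seq_len = self.max_kv_seq_len
            self.q_max_seq_len = self.max_q_seq_len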

@gemini-code-assist (bot) left a comment

Code Review

This pull request adds logic to set infer_state.max_seq_len for qwen2_vl models during prefill. While the change is functionally correct, it introduces some redundancy. I've suggested a refactoring to improve code clarity and remove the redundant assignment while preserving the logic.

Comment on lines +38 to 42
if self.is_prefill:
    self.max_seq_len = self.max_kv_seq_len
if get_env_start_args().enable_fa3:
    self.max_seq_len = self.max_kv_seq_len
    self.q_max_seq_len = self.max_q_seq_len

Severity: medium

The logic for setting self.max_seq_len is redundant. It's set if self.is_prefill is true, and then potentially set again to the same value if get_env_start_args().enable_fa3 is true. This can be refactored to avoid the redundant assignment and make the logic clearer, while preserving the existing behavior.

Suggested change

Before:

if self.is_prefill:
    self.max_seq_len = self.max_kv_seq_len
if get_env_start_args().enable_fa3:
    self.max_seq_len = self.max_kv_seq_len
    self.q_max_seq_len = self.max_q_seq_len

After:

if self.is_prefill or get_env_start_args().enable_fa3:
    self.max_seq_len = self.max_kv_seq_len
if get_env_start_args().enable_fa3:
    self.q_max_seq_len = self.max_q_seq_len
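As a quick sanity check (not part of the PR) that the suggestion preserves behavior: since both branches write the same value, self.max_kv_seq_len, into self.max_seq_len, only whether the assignment happens matters, and the collapsed condition fires in exactly the same cases as the original pair of if-statements. A standalone Python check, with hypothetical helper names:

# Standalone check (not part of the PR): the collapsed condition assigns
# max_seq_len in exactly the same cases as the original two if-statements.
from itertools import product

def original_assigns(is_prefill: bool, enable_fa3: bool) -> bool:
    assigned = False
    if is_prefill:
        assigned = True   # first assignment in the current code
    if enable_fa3:
        assigned = True   # second, possibly redundant, assignment
    return assigned

def refactored_assigns(is_prefill: bool, enable_fa3: bool) -> bool:
    return is_prefill or enable_fa3   # single condition in the suggestion

assert all(
    original_assigns(p, f) == refactored_assigns(p, f)
    for p, f in product([False, True], repeat=2)
)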

@AuZhoomLee commented Jan 5, 2026

Environment

The environment is 2x RTX 4090, with CUDA driver version 13.0.

Command

The launch command is:

#!/bin/fish
set -x MODEL_PATH '/model/Qwen/Qwen3-VL-30B-A3B-Instruct'
set -x INTERNVL_IMAGE_LENGTH 256
set -x LOADWORKER 12
python -m lightllm.server.api_server --port 8080 --tp 2 --model_dir "$MODEL_PATH" --mem_fraction 0.95 --trust_remote_code --enable_multimodal --quant_type fp8w8a8 --mode ppl_int8kv_flashdecoding --visual_dp 2 --visual_nccl_ports 29500 29501 --batch_max_tokens 8192 --visual_infer_batch_size 2 --visual_tp 1

Result

The error is:

ERROR 01-04 15:23:55 [model_rpc.py:99] Traceback (most recent call last):
ERROR 01-04 15:23:55 [model_rpc.py:99]   File "/LightLLM/lightllm/server/router/model_infer/model_rpc.py", line 80, in rpc_loop
ERROR 01-04 15:23:55 [model_rpc.py:99]     ans = getattr(self, func_name)(*args)
ERROR 01-04 15:23:55 [model_rpc.py:99]   File "/LightLLM/lightllm/server/router/model_infer/model_rpc.py", line 169, in init_model
ERROR 01-04 15:23:55 [model_rpc.py:99]     self.backend.init_model(kvargs)
ERROR 01-04 15:23:55 [model_rpc.py:99]   File "/LightLLM/lightllm/server/router/model_infer/mode_backend/base_backend.py", line 168, in init_model
ERROR 01-04 15:23:55 [model_rpc.py:99]     self.model, self.is_multimodal = get_model(model_cfg, model_kvargs)
ERROR 01-04 15:23:55 [model_rpc.py:99]   File "/LightLLM/lightllm/models/registry.py", line 97, in get_model
ERROR 01-04 15:23:55 [model_rpc.py:99]     model, is_multimodal = ModelRegistry.get_model(model_cfg, model_kvargs)
ERROR 01-04 15:23:55 [model_rpc.py:99]   File "/LightLLM/lightllm/models/registry.py", line 66, in get_model
ERROR 01-04 15:23:55 [model_rpc.py:99]     model = matches[0].model_class(model_kvargs)
ERROR 01-04 15:23:55 [model_rpc.py:99]   File "/LightLLM/lightllm/models/qwen3_vl_moe/model.py", line 25, in __init__
ERROR 01-04 15:23:55 [model_rpc.py:99]     super().__init__(kvargs)
ERROR 01-04 15:23:55 [model_rpc.py:99]   File "/LightLLM/lightllm/models/qwen3_moe/model.py", line 23, in __init__
ERROR 01-04 15:23:55 [model_rpc.py:99]     super().__init__(kvargs)
ERROR 01-04 15:23:55 [model_rpc.py:99]   File "/LightLLM/lightllm/models/qwen3/model.py", line 22, in __init__
ERROR 01-04 15:23:55 [model_rpc.py:99]     super().__init__(kvargs)
ERROR 01-04 15:23:55 [model_rpc.py:99]   File "/LightLLM/lightllm/models/qwen2/model.py", line 16, in __init__
ERROR 01-04 15:23:55 [model_rpc.py:99]     super().__init__(kvargs)
ERROR 01-04 15:23:55 [model_rpc.py:99]   File "/LightLLM/lightllm/models/llama/model.py", line 63, in __init__
ERROR 01-04 15:23:55 [model_rpc.py:99]     super().__init__(kvargs)
ERROR 01-04 15:23:55 [model_rpc.py:99]   File "/LightLLM/lightllm/common/basemodel/basemodel.py", line 107, in __init__
ERROR 01-04 15:23:55 [model_rpc.py:99]     self._init_weights()
ERROR 01-04 15:23:55 [model_rpc.py:99]   File "/LightLLM/lightllm/common/basemodel/basemodel.py", line 178, in _init_weights
ERROR 01-04 15:23:55 [model_rpc.py:99]     load_hf_weights(
ERROR 01-04 15:23:55 [model_rpc.py:99]   File "/LightLLM/lightllm/common/basemodel/layer_weights/hf_load_utils.py", line 69, in load_hf_weights
ERROR 01-04 15:23:55 [model_rpc.py:99]     for _ in iterator:
ERROR 01-04 15:23:55 [model_rpc.py:99]   File "/opt/conda/lib/python3.10/site-packages/tqdm/std.py", line 1178, in __iter__
ERROR 01-04 15:23:55 [model_rpc.py:99]     for obj in iterable:
ERROR 01-04 15:23:55 [model_rpc.py:99]   File "/opt/conda/lib/python3.10/multiprocessing/pool.py", line 873, in next
ERROR 01-04 15:23:55 [model_rpc.py:99]     raise value
ERROR 01-04 15:23:55 [model_rpc.py:99]   File "/opt/conda/lib/python3.10/multiprocessing/pool.py", line 125, in worker
ERROR 01-04 15:23:55 [model_rpc.py:99]     result = (True, func(*args, **kwds))
ERROR 01-04 15:23:55 [model_rpc.py:99]   File "/LightLLM/lightllm/common/basemodel/layer_weights/hf_load_utils.py", line 26, in load_func
ERROR 01-04 15:23:55 [model_rpc.py:99]     layer.load_hf_weights(weights)
ERROR 01-04 15:23:55 [model_rpc.py:99]   File "/LightLLM/lightllm/models/qwen3_vl_moe/layer_weights/transformers_layer_weight.py", line 43, in load_hf_weights
ERROR 01-04 15:23:55 [model_rpc.py:99]     super().load_hf_weights(weights)
ERROR 01-04 15:23:55 [model_rpc.py:99]   File "/LightLLM/lightllm/models/qwen3_moe/layer_weights/transformer_layer_weight.py", line 42, in load_hf_weights
ERROR 01-04 15:23:55 [model_rpc.py:99]     return super().load_hf_weights(weights)
ERROR 01-04 15:23:55 [model_rpc.py:99]   File "/LightLLM/lightllm/models/qwen2/layer_weights/transformer_layer_weight.py", line 51, in load_hf_weights
ERROR 01-04 15:23:55 [model_rpc.py:99]     return super().load_hf_weights(weights)
ERROR 01-04 15:23:55 [model_rpc.py:99]   File "/LightLLM/lightllm/common/basemodel/layer_weights/transformer_layer_weight.py", line 43, in load_hf_weights
ERROR 01-04 15:23:55 [model_rpc.py:99]     attr.load_hf_weights(weights)
ERROR 01-04 15:23:55 [model_rpc.py:99]   File "/LightLLM/lightllm/common/basemodel/layer_weights/meta_weights/fused_moe_weight_tp.py", line 252, in load_hf_weights
ERROR 01-04 15:23:55 [model_rpc.py:99]     self._fuse()
ERROR 01-04 15:23:55 [model_rpc.py:99]   File "/LightLLM/lightllm/common/basemodel/layer_weights/meta_weights/fused_moe_weight_tp.py", line 185, in _fuse
ERROR 01-04 15:23:55 [model_rpc.py:99]     qw1, qw1_scale, qw1_zero_point = self.quant_method.quantize(w1)
ERROR 01-04 15:23:55 [model_rpc.py:99]   File "/LightLLM/lightllm/common/quantization/w8a8_quant.py", line 114, in quantize
ERROR 01-04 15:23:55 [model_rpc.py:99]     return self.quantize_moe(weight)
ERROR 01-04 15:23:55 [model_rpc.py:99]   File "/LightLLM/lightllm/common/quantization/w8a8_quant.py", line 126, in quantize_moe
ERROR 01-04 15:23:55 [model_rpc.py:99]     qweight, weight_scale = scaled_fp8_quant(
ERROR 01-04 15:23:55 [model_rpc.py:99]   File "/LightLLM/lightllm/common/quantization/w8a8_quant.py", line 20, in scaled_fp8_quant
ERROR 01-04 15:23:55 [model_rpc.py:99]     return light_ops.per_token_quant_bf16_fp8(tensor)
ERROR 01-04 15:23:55 [model_rpc.py:99]   File "/opt/conda/lib/python3.10/site-packages/lightllm_kernel/ops/quant.py", line 10, in per_token_quant_bf16_fp8
ERROR 01-04 15:23:55 [model_rpc.py:99]     _C.per_token_quant_bf16_fp8(output, input, scales)
ERROR 01-04 15:23:55 [model_rpc.py:99] RuntimeError: Input must be BF16 type
ERROR 01-04 15:23:55 [model_rpc.py:99] Exception raised from per_token_quant_bf16_fp8 at /workspace/csrc/quant/per_token_quantize_bf16_fp8.cu:224 (most recent call first):
ERROR 01-04 15:23:55 [model_rpc.py:99] frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) + 0x80 (0x7f83e2bdeeb0 in /opt/conda/lib/python3.10/site-packages/torch/lib/libc10.so)
ERROR 01-04 15:23:55 [model_rpc.py:99] frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, char const*) + 0x69 (0x7f83e2b7bb5f in /opt/conda/lib/python3.10/site-packages/torch/lib/libc10.so)
ERROR 01-04 15:23:55 [model_rpc.py:99] frame #2: lightllm::ops::per_token_quant_bf16_fp8(at::Tensor&, at::Tensor const&, at::Tensor&) + 0x273d (0x7f82b7e440cd in /opt/conda/lib/python3.10/site-packages/lightllm_kernel/_C.so)

Enabled CUDA synchronization, hoping to get more information:

set -x CUDA_LAUNCH_BLOCKING 1
set -x TORCH_USE_CUDA_DSA 1

With the debug settings added, the result is an OOM:

ERROR 01-05 07:59:48 [registry.py:100] torch.AcceleratorError: CUDA error: unknown error
ERROR 01-05 07:59:48 [registry.py:100] CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
ERROR 01-05 07:59:48 [registry.py:100] For debugging consider passing CUDA_LAUNCH_BLOCKING=1
ERROR 01-05 07:59:48 [registry.py:100] Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.

ERROR 01-05 07:59:48 [model_rpc.py:99] Traceback (most recent call last):
ERROR 01-05 07:59:48 [model_rpc.py:99]   File "/LightLLM/lightllm/server/router/model_infer/model_rpc.py", line 80, in rpc_loop
ERROR 01-05 07:59:48 [model_rpc.py:99]     ans = getattr(self, func_name)(*args)
ERROR 01-05 07:59:48 [model_rpc.py:99]   File "/LightLLM/lightllm/server/router/model_infer/model_rpc.py", line 169, in init_model
ERROR 01-05 07:59:48 [model_rpc.py:99]     self.backend.init_model(kvargs)
ERROR 01-05 07:59:48 [model_rpc.py:99]   File "/LightLLM/lightllm/server/router/model_infer/mode_backend/base_backend.py", line 168, in init_model
ERROR 01-05 07:59:48 [model_rpc.py:99]     self.model, self.is_multimodal = get_model(model_cfg, model_kvargs)
ERROR 01-05 07:59:48 [model_rpc.py:99]   File "/LightLLM/lightllm/models/registry.py", line 97, in get_model
ERROR 01-05 07:59:48 [model_rpc.py:99]     model, is_multimodal = ModelRegistry.get_model(model_cfg, model_kvargs)
ERROR 01-05 07:59:48 [model_rpc.py:99]   File "/LightLLM/lightllm/models/registry.py", line 66, in get_model
ERROR 01-05 07:59:48 [model_rpc.py:99]     model = matches[0].model_class(model_kvargs)
ERROR 01-05 07:59:48 [model_rpc.py:99]   File "/LightLLM/lightllm/models/qwen3_vl_moe/model.py", line 25, in __init__
ERROR 01-05 07:59:48 [model_rpc.py:99]     super().__init__(kvargs)
ERROR 01-05 07:59:48 [model_rpc.py:99]   File "/LightLLM/lightllm/models/qwen3_moe/model.py", line 23, in __init__
ERROR 01-05 07:59:48 [model_rpc.py:99]     super().__init__(kvargs)
ERROR 01-05 07:59:48 [model_rpc.py:99]   File "/LightLLM/lightllm/models/qwen3/model.py", line 22, in __init__
ERROR 01-05 07:59:48 [model_rpc.py:99]     super().__init__(kvargs)
ERROR 01-05 07:59:48 [model_rpc.py:99]   File "/LightLLM/lightllm/models/qwen2/model.py", line 16, in __init__
ERROR 01-05 07:59:48 [model_rpc.py:99]     super().__init__(kvargs)
ERROR 01-05 07:59:48 [model_rpc.py:99]   File "/LightLLM/lightllm/models/llama/model.py", line 63, in __init__
ERROR 01-05 07:59:48 [model_rpc.py:99]     super().__init__(kvargs)
ERROR 01-05 07:59:48 [model_rpc.py:99]   File "/LightLLM/lightllm/common/basemodel/basemodel.py", line 124, in __init__
ERROR 01-05 07:59:48 [model_rpc.py:99]     self._init_cudagraph()
ERROR 01-05 07:59:48 [model_rpc.py:99]   File "/LightLLM/lightllm/common/basemodel/basemodel.py", line 249, in _init_cudagraph
ERROR 01-05 07:59:48 [model_rpc.py:99]     self.graph.warmup(self)
ERROR 01-05 07:59:48 [model_rpc.py:99]   File "/opt/conda/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 120, in decorate_context
ERROR 01-05 07:59:48 [model_rpc.py:99]     return func(*args, **kwargs)
ERROR 01-05 07:59:48 [model_rpc.py:99]   File "/LightLLM/lightllm/common/basemodel/cuda_graph.py", line 211, in warmup
ERROR 01-05 07:59:48 [model_rpc.py:99]     model_output: ModelOutput = model.forward(model_input)
ERROR 01-05 07:59:48 [model_rpc.py:99]   File "/LightLLM/lightllm/common/basemodel/basemodel.py", line 274, in forward
ERROR 01-05 07:59:48 [model_rpc.py:99]     return self._decode(model_input)
ERROR 01-05 07:59:48 [model_rpc.py:99]   File "/LightLLM/lightllm/common/basemodel/basemodel.py", line 501, in _decode
ERROR 01-05 07:59:48 [model_rpc.py:99]     model_output: ModelOutput = self.graph.capture_decode(self._token_forward, infer_state)
ERROR 01-05 07:59:48 [model_rpc.py:99]   File "/LightLLM/lightllm/common/basemodel/cuda_graph.py", line 140, in capture_decode
ERROR 01-05 07:59:48 [model_rpc.py:99]     return self._capture_decode(decode_func, infer_state)
ERROR 01-05 07:59:48 [model_rpc.py:99]   File "/LightLLM/lightllm/common/basemodel/cuda_graph.py", line 87, in _capture_decode
ERROR 01-05 07:59:48 [model_rpc.py:99]     model_output = decode_func(infer_state)
ERROR 01-05 07:59:48 [model_rpc.py:99]   File "/LightLLM/lightllm/common/basemodel/basemodel.py", line 585, in _token_forward
ERROR 01-05 07:59:48 [model_rpc.py:99]     input_embs: torch.Tensor = layer_method(input_embs, infer_state, self.trans_layers_weight[i])
ERROR 01-05 07:59:48 [model_rpc.py:99]   File "/LightLLM/lightllm/common/basemodel/layer_infer/template/transformer_layer_infer_template.py", line 97, in token_forward
ERROR 01-05 07:59:48 [model_rpc.py:99]     o = self._token_attention_kernel(q, infer_state, layer_weight)
ERROR 01-05 07:59:48 [model_rpc.py:99]   File "/LightLLM/lightllm/models/llama/layer_infer/transformer_layer_infer.py", line 781, in _token_decode_attention_ppl_int8kv_flashdecoding
ERROR 01-05 07:59:48 [model_rpc.py:99]     return token_decode_attention_flash_decoding(
ERROR 01-05 07:59:48 [model_rpc.py:99]   File "/LightLLM/lightllm/models/llama/triton_kernel/ppl_int8kv_flash_decoding.py", line 26, in token_decode_attention_flash_decoding
ERROR 01-05 07:59:48 [model_rpc.py:99]     mid_o = alloc_tensor_func(

ERROR 01-05 05:35:54 [basemodel.py:924] torch.AcceleratorError: CUDA error: unknown error
ERROR 01-05 05:35:54 [basemodel.py:924] CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
ERROR 01-05 05:35:54 [basemodel.py:924] For debugging consider passing CUDA_LAUNCH_BLOCKING=1
ERROR 01-05 05:35:54 [basemodel.py:924] Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.

ERROR 01-05 05:35:54 [basemodel.py:924] Traceback (most recent call last):
ERROR 01-05 05:35:54 [basemodel.py:924]   File "/LightLLM/lightllm/common/basemodel/basemodel.py", line 914, in _autotune_warmup
ERROR 01-05 05:35:54 [basemodel.py:924]     model_output = self.forward(
ERROR 01-05 05:35:54 [basemodel.py:924]   File "/LightLLM/lightllm/common/basemodel/basemodel.py", line 563, in _context_forward
ERROR 01-05 05:35:54 [basemodel.py:924]     predict_logits = post_method(input_embs, infer_state, self.pre_post_weight)
ERROR 01-05 05:35:54 [basemodel.py:924]   File "/LightLLM/lightllm/models/llama/layer_infer/post_layer_infer.py", line 64, in token_forward
ERROR 01-05 05:35:54 [basemodel.py:924]     last_input, token_num = self._slice_get_last_input(input_embdings, infer_state)
ERROR 01-05 05:35:54 [basemodel.py:924]   File "/LightLLM/lightllm/models/llama/layer_infer/post_layer_infer.py", line 46, in _slice_get_last_input
ERROR 01-05 05:35:54 [basemodel.py:924]     torch.cumsum(infer_state.b_seq_len - infer_state.b_ready_cache_len, dim=0, dtype=torch.long) - 1

@AuZhoomLee commented Jan 5, 2026

Baseline test

#!/bin/fish
set -x MODEL_PATH '/model/Qwen/Qwen3-VL-30B-A3B-Instruct'
set -x LOADWORKER 12
python -m lightllm.server.api_server --port 8080 --tp 2 --model_dir "$MODEL_PATH" --mem_fraction 0.95 --trust_remote_code --enable_multimodal --quant_type fp8w8a8 --mode ppl_int8kv_flashdecoding --visual_dp 2 --visual_nccl_ports 29500 29501 --batch_max_tokens 8192 --visual_infer_batch_size 2 --visual_tp 1

Loading log (already deduplicated across the two GPUs):

WARNING 01-05 08:23:04 [basemodel.py:923] Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
ERROR 01-05 08:23:04 [basemodel.py:924] CUDA error: unknown error
ERROR 01-05 08:23:04 [basemodel.py:924] CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
ERROR 01-05 08:23:04 [basemodel.py:924] For debugging consider passing CUDA_LAUNCH_BLOCKING=1
ERROR 01-05 08:23:04 [basemodel.py:924] Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.

ERROR 01-05 08:23:04 [basemodel.py:924] Traceback (most recent call last):
ERROR 01-05 08:23:04 [basemodel.py:924]   File "/LightLLM/lightllm/common/basemodel/basemodel.py", line 914, in _autotune_warmup
ERROR 01-05 08:23:04 [basemodel.py:924]     model_output = self.forward(
ERROR 01-05 08:23:04 [basemodel.py:924]   File "/opt/conda/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 120, in decorate_context
ERROR 01-05 08:23:04 [basemodel.py:924]     return func(*args, **kwargs)
ERROR 01-05 08:23:04 [basemodel.py:924]   File "/LightLLM/lightllm/common/basemodel/basemodel.py", line 272, in forward
ERROR 01-05 08:23:04 [basemodel.py:924]     return self._prefill(model_input)
ERROR 01-05 08:23:04 [basemodel.py:924]   File "/LightLLM/lightllm/common/basemodel/basemodel.py", line 467, in _prefill
ERROR 01-05 08:23:04 [basemodel.py:924]     model_output = self._context_forward(infer_state)
ERROR 01-05 08:23:04 [basemodel.py:924]   File "/LightLLM/lightllm/common/basemodel/basemodel.py", line 563, in _context_forward
ERROR 01-05 08:23:04 [basemodel.py:924]     predict_logits = post_method(input_embs, infer_state, self.pre_post_weight)
ERROR 01-05 08:23:04 [basemodel.py:924]   File "/LightLLM/lightllm/models/llama/layer_infer/post_layer_infer.py", line 64, in token_forward
ERROR 01-05 08:23:04 [basemodel.py:924]     last_input, token_num = self._slice_get_last_input(input_embdings, infer_state)
ERROR 01-05 08:23:04 [basemodel.py:924]   File "/LightLLM/lightllm/models/llama/layer_infer/post_layer_infer.py", line 46, in _slice_get_last_input
ERROR 01-05 08:23:04 [basemodel.py:924]     torch.cumsum(infer_state.b_seq_len - infer_state.b_ready_cache_len, dim=0, dtype=torch.long) - 1
ERROR 01-05 08:23:04 [basemodel.py:924] torch.AcceleratorError: CUDA error: unknown error

WARNING 01-05 08:23:04 [basemodel.py:923] autotune warmup for length 256 failed: CUDA error: unknown error
warming up:  75%|█████████████████████████████████████████████████████████████████████████████████████████████████████████                                   | 9/12 [00:03<00:00,  3.66it/s]

In the end it got stuck at 3/4.

@AuZhoomLee commented

With explicit parameter control, it can now load and start:

#!/bin/fish
set -x MODEL_PATH '/model/Qwen/Qwen3-VL-30B-A3B-Instruct'
set -x LOADWORKER 12
python -m lightllm.server.api_server --port 8080 --tp 2 --model_dir "$MODEL_PATH" --trust_remote_code --enable_multimodal --quant_type fp8w8a8 --mode ppl_int8kv_flashdecoding --visual_dp 2 --visual_nccl_ports 29500 29501 --batch_max_tokens 8192 --visual_infer_batch_size 2 --visual_tp 1 --max_total_token_num 30000

Result

[2026-01-05 08:52:32 +0000] [6777] [INFO] Starting gunicorn 23.0.0
[2026-01-05 08:52:32 +0000] [6777] [INFO] Listening at: http://127.0.0.1:8080 (6777)
[2026-01-05 08:52:32 +0000] [6777] [INFO] Using worker: uvicorn.workers.UvicornWorker
[2026-01-05 08:52:32 +0000] [6779] [INFO] Booting worker with pid: 6779
INFO 01-05 08:52:37 [__init__.py:216] Automatically detected platform cuda.
INFO 01-05 08:52:37 [communication_op.py:57] deep_ep is not installed, you can't use the api of it.
INFO 01-05 08:52:37 [cache_tensor_manager.py:17] USE_GPU_TENSOR_CACHE is On
All deep_gemm operations loaded successfully!
WARNING 01-05 08:52:37 [grouped_fused_moe_ep.py:28] no deepep or deep_gemm
[2026-01-05 08:52:38 +0000] [6779] [INFO] Started server process [6779]
[2026-01-05 08:52:38 +0000] [6779] [INFO] Waiting for application startup.
INFO 01-05 08:52:38 [api_http.py:359] server start up
INFO 01-05 08:52:39 [rpyc_fix_utils.py:107] set nodelay mode
INFO 01-05 08:52:39 [rpyc_fix_utils.py:113] change socket buffer from 2626560 131072 change to 4194304
INFO 01-05 08:52:40 [req_id_generator.py:34] ReqIDGenerator init finished
INFO 01-05 08:52:40 [api_http.py:363] server start up ok, loop use is <uvloop.Loop running=True closed=False debug=False>
[2026-01-05 08:52:40 +0000] [6779] [INFO] Application startup complete.
DEBUG 01-05 08:53:44 [manager.py:285] dp_i 0 frozen token num: 0 
DEBUG 01-05 08:53:44 [manager.py:285] 
DEBUG 01-05 08:53:44 [manager.py:286] dp_i 0 estimated_peak_token_count: 0 
