[fix]qwen3_vl add infer_state.max_seq_len #1164
base: main
Conversation
Summary of Changes

Hello @WANDY666, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed. This pull request addresses a specific aspect of state management within the Qwen2-VL model by ensuring that the maximum sequence length (infer_state.max_seq_len) is set during prefill.
Code Review
This pull request adds logic to set infer_state.max_seq_len for qwen2_vl models during prefill. While the change is functionally correct, it introduces some redundancy. I've suggested a refactoring to improve code clarity and remove the redundant assignment while preserving the logic.
```python
if self.is_prefill:
    self.max_seq_len = self.max_kv_seq_len
if get_env_start_args().enable_fa3:
    self.max_seq_len = self.max_kv_seq_len
    self.q_max_seq_len = self.max_q_seq_len
```
The logic for setting self.max_seq_len is redundant. It's set if self.is_prefill is true, and then potentially set again to the same value if get_env_start_args().enable_fa3 is true. This can be refactored to avoid the redundant assignment and make the logic clearer, while preserving the existing behavior.
Suggested change:

```diff
-if self.is_prefill:
-    self.max_seq_len = self.max_kv_seq_len
-if get_env_start_args().enable_fa3:
-    self.max_seq_len = self.max_kv_seq_len
-    self.q_max_seq_len = self.max_q_seq_len
+if self.is_prefill or get_env_start_args().enable_fa3:
+    self.max_seq_len = self.max_kv_seq_len
+if get_env_start_args().enable_fa3:
+    self.q_max_seq_len = self.max_q_seq_len
```
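As a quick sanity check of the suggestion (not part of the PR, just an illustrative sketch), the original and refactored branches can be compared exhaustively over the four combinations of the two flags; the attribute names mirror the diff, and the numeric values are made up:

```python
# Standalone check that the original and refactored branches set identical state.
MAX_KV_SEQ_LEN, MAX_Q_SEQ_LEN = 4096, 1024

def original(is_prefill: bool, enable_fa3: bool) -> dict:
    state = {}
    if is_prefill:
        state["max_seq_len"] = MAX_KV_SEQ_LEN
    if enable_fa3:
        state["max_seq_len"] = MAX_KV_SEQ_LEN
        state["q_max_seq_len"] = MAX_Q_SEQ_LEN
    return state

def refactored(is_prefill: bool, enable_fa3: bool) -> dict:
    state = {}
    if is_prefill or enable_fa3:
        state["max_seq_len"] = MAX_KV_SEQ_LEN
    if enable_fa3:
        state["q_max_seq_len"] = MAX_Q_SEQ_LEN
    return state

for is_prefill in (False, True):
    for enable_fa3 in (False, True):
        assert original(is_prefill, enable_fa3) == refactored(is_prefill, enable_fa3)
print("both variants set identical state for all four flag combinations")
```

If all four cases agree, the refactor preserves the existing behavior, as the review claims.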
Environment: RTX 4090 x2, driver version CUDA 13.0.

Launch command:

```fish
#!/bin/fish
set -x MODEL_PATH '/model/Qwen/Qwen3-VL-30B-A3B-Instruct'
set -x INTERNVL_IMAGE_LENGTH 256
set -x LOADWORKER 12
python -m lightllm.server.api_server --port 8080 --tp 2 --model_dir "$MODEL_PATH" --mem_fraction 0.95 --trust_remote_code --enable_multimodal --quant_type fp8w8a8 --mode ppl_int8kv_flashdecoding --visual_dp 2 --visual_nccl_ports 29500 29501 --batch_max_tokens 8192 --visual_infer_batch_size 2 --visual_tp 1
```

Result: an error is raised.

To get more information, I enabled CUDA synchronization:

```fish
set -x CUDA_LAUNCH_BLOCKING 1
set -x TORCH_USE_CUDA_DSA 1
```

With the debug info added, the result is an OOM.
Baseline test:

```fish
#!/bin/fish
set -x MODEL_PATH '/model/Qwen/Qwen3-VL-30B-A3B-Instruct'
set -x LOADWORKER 12
python -m lightllm.server.api_server --port 8080 --tp 2 --model_dir "$MODEL_PATH" --mem_fraction 0.95 --trust_remote_code --enable_multimodal --quant_type fp8w8a8 --mode ppl_int8kv_flashdecoding --visual_dp 2 --visual_nccl_ports 29500 29501 --batch_max_tokens 8192 --visual_infer_batch_size 2 --visual_tp 1
```

The loading log (already deduplicated across the two GPUs) shows it eventually gets stuck at 3/4.
After setting the parameters explicitly, the server can now load and start:

```fish
#!/bin/fish
set -x MODEL_PATH '/model/Qwen/Qwen3-VL-30B-A3B-Instruct'
set -x LOADWORKER 12
python -m lightllm.server.api_server --port 8080 --tp 2 --model_dir "$MODEL_PATH" --trust_remote_code --enable_multimodal --quant_type fp8w8a8 --mode ppl_int8kv_flashdecoding --visual_dp 2 --visual_nccl_ports 29500 29501 --batch_max_tokens 8192 --visual_infer_batch_size 2 --visual_tp 1 --max_total_token_num 30000
```

Result: the server loads and starts successfully.
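To smoke-test the running server, a minimal text-only request can be sent to the HTTP API. This is only an illustrative sketch: it assumes lightllm's default /generate endpoint and payload shape, which may differ between versions, and the prompt and sampling parameters are placeholders.

```python
# Minimal text-only smoke test against the server started above.
# Assumes lightllm's /generate HTTP API; adjust fields if your version differs.
import requests

resp = requests.post(
    "http://127.0.0.1:8080/generate",
    json={
        "inputs": "Describe what a vision-language model does.",
        "parameters": {"max_new_tokens": 64, "temperature": 0.7},
    },
    timeout=120,
)
print(resp.status_code)
print(resp.text)
```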