Optimize exception raising and error process #4236

grimoire · 2025-12-25T09:36:54Z

Three new APIs have been added to Engine / EngineLoop / Executor / ModelAgent.

start starts all the background tasks in the module and its submodules.
wait_tasks awaits all background tasks until one of them raises an exception (the wait_tasks of each submodule is also treated as a background task). All other tasks are then cancelled.
stop stops all the background tasks.

If an exception is raised in ModelAgent, wait_tasks propagates it to the parent module (ModelAgent -> Executor), and the other tasks in ModelAgent are cancelled. Executor does the same when it receives the exception from ModelAgent, then EngineLoop, and so on, all the way up to the Engine.

Users can manually stop the background tasks by calling stop, which raises a CancelledError in the background tasks.
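
The start / wait_tasks / stop chain described above can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: the `Module` class, its `children` and `workers` arguments, and `demo` are hypothetical names, and the real Engine / EngineLoop / Executor / ModelAgent classes carry much more state.

```python
import asyncio


class Module:
    """Hypothetical stand-in for Engine / EngineLoop / Executor / ModelAgent."""

    def __init__(self, name, children=(), workers=()):
        self.name = name
        self.children = list(children)
        self.workers = list(workers)  # coroutine functions to run in the background
        self.tasks = set()

    def start(self):
        # Must be called from inside a running event loop.
        for child in self.children:
            child.start()
            # The child's wait_tasks is tracked like any other background task.
            self.tasks.add(asyncio.create_task(child.wait_tasks()))
        for worker in self.workers:
            self.tasks.add(asyncio.create_task(worker()))

    async def wait_tasks(self):
        if not self.tasks:
            return
        try:
            done, _ = await asyncio.wait(self.tasks, return_when=asyncio.FIRST_EXCEPTION)
            for task in done:
                # Check cancelled() first: exception() raises on a cancelled task.
                if not task.cancelled() and task.exception() is not None:
                    raise task.exception()
        finally:
            # Runs on failure, on normal completion, and when this coroutine is cancelled.
            self.stop()

    def stop(self):
        for child in self.children:
            child.stop()
        for task in self.tasks:
            task.cancel()  # no-op for tasks that already finished
        self.tasks.clear()


async def demo():
    state = {"cancelled": False}

    async def failing():
        await asyncio.sleep(0.01)
        raise RuntimeError("agent failed")

    async def forever():
        try:
            await asyncio.sleep(3600)
        except asyncio.CancelledError:
            state["cancelled"] = True
            raise

    agent = Module("agent", workers=[failing, forever])
    executor = Module("executor", children=[agent])
    engine = Module("engine", children=[executor])
    engine.start()
    try:
        await engine.wait_tasks()
    except RuntimeError as exc:
        await asyncio.sleep(0.01)  # give cancelled tasks time to unwind
        return str(exc), state["cancelled"]


if __name__ == "__main__":
    print(asyncio.run(demo()))
```

In this sketch the error raised in the innermost module surfaces from the outermost `wait_tasks`, and the sibling task sees a CancelledError, which mirrors the propagation order described above.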

Tip

Another solution is to merge them all into a single async def run function, which would start the tasks, wait for an exception, and clean up the tasks. The run function itself should be wrapped in an async task.
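
A rough sketch of that alternative, assuming the background work can be expressed as a list of coroutine functions. The `run` signature and the `demo` helper are illustrative only and are not part of this PR.

```python
import asyncio


async def run(workers):
    """Start the workers, wait for the first failure, then clean up."""
    tasks = [asyncio.create_task(worker()) for worker in workers]
    if not tasks:
        return
    try:
        done, _ = await asyncio.wait(tasks, return_when=asyncio.FIRST_EXCEPTION)
        for task in done:
            if not task.cancelled() and task.exception() is not None:
                raise task.exception()
    finally:
        for task in tasks:
            task.cancel()
        # Wait for the cancellations to finish; return_exceptions keeps this from raising.
        await asyncio.gather(*tasks, return_exceptions=True)


async def demo():
    events = []

    async def ok():
        try:
            await asyncio.sleep(3600)
        except asyncio.CancelledError:
            events.append("cancelled")
            raise

    async def bad():
        await asyncio.sleep(0.01)
        raise ValueError("boom")

    # run itself is wrapped in a task, so a parent can await or cancel it.
    runner = asyncio.create_task(run([ok, bad]))
    try:
        await runner
    except ValueError as exc:
        events.append(str(exc))
    return events


if __name__ == "__main__":
    print(asyncio.run(demo()))
```

The trade-off is that start, wait, and cleanup can no longer be called separately, but the cleanup is guaranteed by a single `finally` block.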

Copilot

Pull request overview

This PR implements a new exception handling and task lifecycle management system across the PyTorch engine stack. The changes introduce three key APIs (start, wait_tasks, and stop) that provide structured exception propagation from ModelAgent → Executor → EngineLoop → Engine, with proper task cancellation when any component fails.

Key changes include:

Added a centralized wait_for_async_tasks utility function for consistent task waiting and exception handling
Introduced wait_tasks methods throughout the component hierarchy to propagate exceptions upward
Refactored task lifecycle management to use a set-based tracking system with automatic cleanup callbacks
Enhanced signal handling in the ZMQ multiprocessing engine with async shutdown support
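
As an illustration of the set-based tracking with automatic cleanup callbacks mentioned in the list above, here is a minimal sketch. `TaskTracker` and its methods are hypothetical names chosen for this example; the PR's actual bookkeeping may differ.

```python
import asyncio


class TaskTracker:
    """Hypothetical helper: keeps live tasks in a set and drops them when done."""

    def __init__(self):
        self._tasks = set()

    def create_task(self, coro, name=None):
        task = asyncio.create_task(coro, name=name)
        self._tasks.add(task)
        # The callback receives the finished task, so set.discard fits directly.
        task.add_done_callback(self._tasks.discard)
        return task

    def cancel_all(self):
        # Iterate over a copy: callbacks mutate the set as tasks finish.
        for task in list(self._tasks):
            task.cancel()

    def __len__(self):
        return len(self._tasks)


async def demo():
    tracker = TaskTracker()
    quick = tracker.create_task(asyncio.sleep(0), name="quick")
    slow = tracker.create_task(asyncio.sleep(3600), name="slow")
    before = len(tracker)
    await quick
    await asyncio.sleep(0)  # let pending done callbacks run
    after_quick = len(tracker)
    tracker.cancel_all()
    await asyncio.gather(slow, return_exceptions=True)
    await asyncio.sleep(0)
    return before, after_quick, len(tracker)


if __name__ == "__main__":
    print(asyncio.run(demo()))
```

Holding the tasks in a set also keeps a strong reference to them, which the asyncio documentation recommends so that a running task is not garbage collected.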

Reviewed changes

Copilot reviewed 13 out of 13 changed files in this pull request and generated 13 comments.

Show a summary per file

File	Description
lmdeploy/pytorch/utils.py	Adds `wait_for_async_tasks` utility for centralized async task exception handling
lmdeploy/pytorch/engine/request.py	Sets ENGINE_STOP_ERROR response type when engine loop fails
lmdeploy/pytorch/engine/mp_engine/zmq_rpc.py	Adds task tracking and graceful error handling for ZMQ RPC send operations
lmdeploy/pytorch/engine/mp_engine/zmq_engine.py	Replaces signal handler with async-compatible version and adds proper cleanup in finally block
lmdeploy/pytorch/engine/model_agent.py	Refactors task management with set-based tracking and implements wait_tasks for exception propagation
lmdeploy/pytorch/engine/executor/base.py	Adds abstract wait_tasks method to executor interface
lmdeploy/pytorch/engine/executor/uni_executor.py	Implements wait_tasks by delegating to model_agent
lmdeploy/pytorch/engine/executor/ray_executor.py	Implements wait_tasks with Ray-specific worker monitoring and error handling
lmdeploy/pytorch/engine/executor/mp_executor.py	Implements wait_tasks using collective RPC calls to workers
lmdeploy/pytorch/engine/executor/base_worker.py	Implements wait_tasks to coordinate model_agent and output loop tasks
lmdeploy/pytorch/engine/executor/__init__.py	Adds deprecation warning for MPExecutor
lmdeploy/pytorch/engine/engine_loop.py	Refactors task creation and waiting with new exception propagation pattern
lmdeploy/pytorch/engine/engine.py	Adds stop and wait_tasks methods for top-level engine lifecycle management


lmdeploy/pytorch/engine/mp_engine/zmq_engine.py

lmdeploy/pytorch/engine/executor/ray_executor.py

lmdeploy/pytorch/engine/executor/__init__.py

lmdeploy/pytorch/engine/engine.py

Copilot · 2025-12-26T07:08:22Z

lmdeploy/pytorch/utils.py

+        for task in done:
+            if task.exception():
+                exc = task.exception()
+                if isinstance(exc, asyncio.CancelledError) and ignore_cancellederror:
+                    logger.debug(f'Task <{task.get_name()}> cancelled.')
+                    continue
+                raise exc from None


The function uses `if task.exception()` to detect failures, but task.exception() itself raises asyncio.CancelledError if the task was cancelled. Check task.cancelled() first, or wrap the call in a try-except block. As written, a cancelled task may raise an unexpected asyncio.CancelledError at this check, bypassing the ignore_cancellederror handling.
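
One possible shape of the guarded check, shown as a standalone sketch rather than as a patch to `wait_for_async_tasks`. The helper name `first_exception` and its `ignore_cancelled` parameter are assumptions made for this example.

```python
import asyncio


def first_exception(done, ignore_cancelled=True):
    """Return the first real exception among finished tasks, or None."""
    for task in done:
        # cancelled() never raises, so test it before calling exception().
        if task.cancelled():
            if ignore_cancelled:
                continue
            return asyncio.CancelledError()
        exc = task.exception()
        if exc is not None:
            return exc
    return None


async def demo():
    async def sleeper():
        await asyncio.sleep(3600)

    async def failing():
        raise KeyError("bad")

    cancelled = asyncio.create_task(sleeper())
    failed = asyncio.create_task(failing())
    await asyncio.sleep(0)  # let both tasks start
    cancelled.cancel()
    await asyncio.wait([cancelled, failed])

    # Calling exception() on the cancelled task raises instead of returning.
    try:
        cancelled.exception()
        raised = False
    except asyncio.CancelledError:
        raised = True

    exc = first_exception([cancelled, failed])
    return raised, type(exc).__name__, first_exception([cancelled])


if __name__ == "__main__":
    print(asyncio.run(demo()))
```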

lmdeploy/pytorch/engine/engine_loop.py

Copilot · 2025-12-26T07:08:23Z

lmdeploy/pytorch/engine/executor/ray_executor.py

+        except SystemExit:
+            logger.error('Ray worker exited.')


Catching SystemExit and logging it as an error without re-raising may hide critical failures. SystemExit typically indicates the process should terminate, and this exception should normally be allowed to propagate. If there's a specific reason to catch it here, that should be documented. Otherwise, consider removing SystemExit from the exception handling or re-raising it after cleanup.

Suggested change

-        except SystemExit:
-            logger.error('Ray worker exited.')
+        except SystemExit as e:
+            logger.error('Ray worker exited.')
+            raise

lmdeploy/pytorch/engine/executor/mp_executor.py

lmdeploy/pytorch/engine/mp_engine/zmq_engine.py

lmdeploy/pytorch/engine/engine.py

CUHKSZzxy

LGTM

grimoire added 5 commits December 24, 2025 22:07

chained wait tasks

3183b31

even better

8cf2071

start/wait_tasks/stop pattern

8256702

remove debug error

fa691ce

propagate cancel

bb27256

lvhan028 requested review from Copilot and lvhan028 and removed request for Copilot December 26, 2025 07:04

lvhan028 added the improvement label Dec 26, 2025

lvhan028 requested a review from Copilot December 26, 2025 07:04

Copilot started reviewing on behalf of lvhan028 December 26, 2025 07:04

Copilot AI reviewed Dec 26, 2025


grimoire added 9 commits December 26, 2025 20:22

unblock socket.recv

2cc5204

disable cancel for now

64a292a

add cancel

2ff24df

merge main

5cf028a

Idempotence

584800a

add engine.start

e3502da

fix throughput

e2cf4b8

fix model_agent/profile_throughput

1093d92

fix dump profile

a053817

lvhan028 requested a review from CUHKSZzxy December 31, 2025 06:17

CUHKSZzxy approved these changes Dec 31, 2025
