Add user-defined healthcheck #2652
base: main
Conversation
```python
    timeout=HEALTHCHECK_TIMEOUT,
)

if result is False or result is None:
```
I find it a bit confusing that `None` counts as unhealthy, since it's the default return value when the function has a bare `return` or no return statement at all. WDYT about treating only `result is False` as a failure?
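A minimal sketch of that suggestion, reusing the names from the diff above:

```python
# Only an explicit False marks the check as failed; None (a bare `return`
# or no return statement at all) is treated as healthy.
if result is False:
    done.error = True
    done.error_detail = "Healthcheck failed: user-defined healthcheck returned False"
```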
```python
except asyncio.TimeoutError:
    done.error = True
    done.error_detail = f"Healthcheck failed: user-defined healthcheck timed out after {HEALTHCHECK_TIMEOUT} seconds"
    print(f"Healthcheck timed out after {HEALTHCHECK_TIMEOUT} seconds")
```
Where is this output intended to go?
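If the intent is the server log rather than bare stdout, a standard-library alternative might look like this (a sketch; the module's actual logging setup may differ):

```python
import logging

logger = logging.getLogger(__name__)

# Goes through the configured handlers with a severity level, instead of
# writing straight to stdout via print().
logger.warning("Healthcheck timed out after %s seconds", HEALTHCHECK_TIMEOUT)
```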
```python
custom_health_error = healthcheck_result.error_detail

if not custom_health_ok:
    health = Health.SETUP_FAILED
```
Are you sure we shouldn't introduce a new value like `Health.UNHEALTHY`?
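A sketch of what that could look like; the actual `Health` enum lives in cog's server code, so the members listed here are assumptions:

```python
from enum import Enum, auto

class Health(Enum):
    UNKNOWN = auto()
    STARTING = auto()
    READY = auto()
    BUSY = auto()
    SETUP_FAILED = auto()
    UNHEALTHY = auto()  # proposed: user healthcheck failed after a successful setup

# The failure branch above would then become:
# if not custom_health_ok:
#     health = Health.UNHEALTHY
```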
We should catch up next week on this. There is a general movement away from the Python worker/runner in the works, so we'll want to ensure that this feature also lands in the Rust replacement and isn't lost in translation. Generally, we need to make sure the architecture works with the split-runner model we're moving towards, with its clear IPC transit.

I'll add that there is likely a better approach here with the new architecture. We have a concept of a poisoned prediction slot (think of a poisoned mutex). We can expose a way to mark that slot as poisoned via this callback, which will then cause the slot to stop accepting requests. We'll want to bubble up the cause of the poisoning so that control systems can take action externally. I'm happy to help implement this on the Rust side, and we should make sure we add the .txtar test for it.

---

Come to think of it, we should discuss whether we want to support a "burn the worker subprocess" trigger. Since we're managing the subprocess and interpreter in a way that gives us much more control over the process groups, we can consider things like "force worker recycle" behaviors without impacting the overall state of the pod. Recycling the subprocess worker should be really straightforward [needs a control-channel IPC message, a mechanism to recycle the worker and all of its subprocesses if any, and probably skip calling …].

This would most likely map better to an internal model-control system ("we know the worker is unhealthy, recycle it") rather than failing the health checks. Something like:

```python
try:
    import coglet
except ImportError:
    # Fallback stub so the symbol exists when we're not running under a
    # coglet subprocess.
    class coglet:
        active = False

def recycle_worker(*, reason: str, rerun_setup: bool = False) -> None:
    if coglet.active:
        coglet.recycle_worker(reason=reason, rerun_setup=rerun_setup)
    else:
        raise RuntimeError("cannot recycle workers when not running under a coglet subprocess")
```

The coglet `recycle_worker` would then be implemented in the orchestrator: https://github.com/replicate/cog/blob/main/crates/coglet/src/orchestrator.rs#L332-L373 (see also cog/crates/coglet-python/src/lib.rs, line 220 at 1d220a6).
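A hypothetical usage sketch built on the stub above; `gpu_is_wedged` is a made-up stand-in for whatever signal a user actually has:

```python
def gpu_is_wedged() -> bool:
    # Placeholder: a real check might probe CUDA, look for OOMs, etc.
    return False

def healthcheck() -> bool:
    # Escalate to a worker recycle instead of merely reporting unhealthy.
    if gpu_is_wedged():
        recycle_worker(reason="GPU appears wedged", rerun_setup=True)
    return True
```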
(This would require a slight inversion of flow, but nothing terribly hard. As long as the slots aren't poisoned, we can even connect the new worker to the same bridge domain-socket endpoints, assuming the old worker really does exit.)

A couple of key constraints: we will probably need to transition the health check back to BUSY rather than to STARTING during a worker recycle, or adjust our heuristic that treats "BUSY -> STARTING" as fatal. We could also introduce a new health-check state to help differentiate a container crash/restart from an explicit worker restart.

Additionally, one more thing is that we'll want to …

I know this is a giant wall of text, but I want to make sure the right shape is emerging here (cc @michaeldwan and @markphelps for visibility and planning).
Summary
We want to add a user-defined healthcheck function so that users can tell us whether their container is healthy, since they may have more information about what makes their system "healthy" than we do. This PR adds hooks for the user to pass a `healthcheck` function to us, which we run on their behalf whenever the `/health-check` endpoint is hit.
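For illustration, a user's healthcheck might look something like this (a sketch: the exact hook name and how it is registered are defined by this PR, and the torch probe is just one example of a user-side signal):

```python
import torch  # assumption: a GPU-backed predictor; substitute whatever the model uses

def healthcheck() -> bool:
    # Return True while the container can still serve predictions; returning
    # False (or timing out) marks the container unhealthy.
    return torch.cuda.is_available()
```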
Test Plan

Unit tests were added to verify the behavior when a user's healthcheck succeeds, fails, times out, or raises an error.