
Conversation

@NikhilSinha1
Contributor

Summary

We want to add a user-defined healthcheck function so users can tell us whether their container is healthy, since they may know more about what makes their system "healthy" than we do. This PR adds hooks for the user to pass a healthcheck function to us, which we run on their behalf whenever the /health-check endpoint is hit.
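
For illustration, a user-facing healthcheck could look something like this (a sketch only; the hook name and signature here are illustrative, not necessarily the final API):

from cog import BasePredictor


class Predictor(BasePredictor):
    def setup(self) -> None:
        self.model = load_model()  # hypothetical model-loading helper

    async def healthcheck(self) -> bool:
        # The user checks whatever signals "healthy" for their system,
        # e.g. that the model is loaded and still responds to a ping.
        return self.model is not None and self.model.responds_to_ping()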

Test Plan

Unit tests added to verify the behavior when a user's healthcheck succeeds, fails, times out, or errors.
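
For example, the timeout case boils down to something like this (sketch; the real tests exercise the worker's healthcheck plumbing rather than calling asyncio.wait_for directly):

import asyncio

import pytest


def test_healthcheck_timeout():
    async def slow_healthcheck() -> bool:
        await asyncio.sleep(60)  # never completes within the timeout
        return True

    async def run() -> None:
        await asyncio.wait_for(slow_healthcheck(), timeout=0.1)

    with pytest.raises(asyncio.TimeoutError):
        asyncio.run(run())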

@NikhilSinha1 NikhilSinha1 requested a review from a team as a code owner January 24, 2026 00:25
timeout=HEALTHCHECK_TIMEOUT,
)

if result is False or result is None:

I find it a bit confusing that None counts as unhealthy, since it's the default value when the function has a bare return or no return statement at all. WDYT about only treating result is False as a failure?
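
For example, this healthcheck looks healthy but would be reported as unhealthy under the current check:

async def healthcheck(self):
    verify_gpu()  # hypothetical check that raises if something is wrong
    # falls off the end without an explicit return, so the result is None
    # and the `result is None` branch treats it as a failure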

except asyncio.TimeoutError:
done.error = True
done.error_detail = f"Healthcheck failed: user-defined healthcheck timed out after {HEALTHCHECK_TIMEOUT} seconds"
print(f"Healthcheck timed out after {HEALTHCHECK_TIMEOUT} seconds")

Where is this output intended to go?
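
If the intent is the container logs, routing it through logging would at least respect whatever handlers are configured, e.g. (sketch):

import logging

log = logging.getLogger(__name__)

# in the except asyncio.TimeoutError handler, instead of print():
log.warning("Healthcheck timed out after %s seconds", HEALTHCHECK_TIMEOUT)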

custom_health_error = healthcheck_result.error_detail

if not custom_health_ok:
health = Health.SETUP_FAILED

You're sure we shouldn't introduce a new value like Health.UNHEALTHY?
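
i.e. something like this (sketch; the existing members are guesses based on the states mentioned elsewhere in this thread):

import enum


class Health(enum.Enum):
    STARTING = "STARTING"
    READY = "READY"
    BUSY = "BUSY"
    SETUP_FAILED = "SETUP_FAILED"
    UNHEALTHY = "UNHEALTHY"  # proposed: setup succeeded, but the user healthcheck is failing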

@tempusfrangit
Member

tempusfrangit commented Jan 24, 2026

We should catch up on this next week. There is a general movement in the works to move away from the python worker/runner, so we’ll want to ensure that this feature also lands in the rust replacement and isn’t lost in translation.

Generally, we will need to ensure that the architecture works with the split runner model we’re moving towards, with its clear IPC transit.

I’ll add that there is likely a better approach here with the new architecture. We have a concept of a poisoned prediction slot (think like a poisoned mutex). We can expose a way to mark that slot as poisoned via this callback which will then cause the slot to no longer be able to accept requests. We’ll want to bubble up the cause of the poisoned slot so that control systems can take action externally.
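
In python-shaped pseudocode, just to show the shape (the real slot lives on the rust side):

from dataclasses import dataclass
from typing import Optional


@dataclass
class PredictionSlot:
    # None means healthy; a string records why the slot was poisoned
    poisoned: Optional[str] = None

    def poison(self, reason: str) -> None:
        # invoked via the healthcheck callback; once set, the slot stops
        # accepting requests and the reason is bubbled up to external
        # control systems
        self.poisoned = reason

    def can_accept(self) -> bool:
        return self.poisoned is None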

I’m happy to help implement this on the rust side and we should ensure that we add the .txtar test for it.

---

Come to think of it, we should discuss whether we want to support a “burn the worker subprocess” trigger. Since we’re managing the subprocess and interpreter in a way that gives us much more control over process groups, we can consider things like “force worker recycle” behaviors without impacting the overall state of the pod. Recycling the subprocess worker should be really straightforward [needs a control channel IPC message and a mechanism to recycle the worker and all of its subprocesses, if any, and probably skip calling setup]. This would cause the model container to go unready for a short window of time.

This would most likely map better to an internal model control system “we know the worker is unhealthy, recycle it” instead of failing the health-checks.

Something like cog.recycle_worker(<reason>, rerun_setup=False), which could be wired up with something like:

try:
    import coglet
except ImportError:
    class coglet:
        active = False


def recycle_worker(*, reason: str, rerun_setup: bool = False) -> None:
    if coglet.active:
        coglet.recycle_worker(reason=reason, rerun_setup=rerun_setup)
    else:
        raise RuntimeError("cannot recycle workers when not running under a coglet subprocess")

The coglet recycle_worker would then be implemented in the orchestrator: https://github.com/replicate/cog/blob/main/crates/coglet/src/orchestrator.rs#L332-L373

tokio::spawn(async move { ... })

(this would require a slight inversion of flow, but nothing terribly hard. As long as the slots aren’t poisoned, we can even connect the new worker to the same bridge domain socket endpoints, assuming the old worker really does exit)

A couple of key constraints: during this worker recycle we will probably need to transition the health check back to BUSY rather than STARTING, or adjust our heuristic that treats “BUSY -> STARTING” as fatal. We could also introduce a new state in the health-check to help differentiate a container crash/restart from an explicit worker restart.

One more thing: we’ll want to setpgid when spawning the worker, unless that breaks python multiprocessing/subprocess semantics, so that we can identify all children spun off by the worker. The key is that this is a reset. Instead of rerun_setup, we could support a worker_recycle_setup function [like setup] that can spin up processes but skip weights loading?
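
Roughly like this on the spawning side (sketch; on python 3.11+ Popen can create the process group directly, older versions would use preexec_fn=os.setpgrp instead):

import os
import signal
import subprocess

worker_cmd = ["python", "-m", "worker"]  # placeholder command

# spawn the worker in its own process group so its children are traceable
worker = subprocess.Popen(worker_cmd, process_group=0)


def recycle(worker: subprocess.Popen) -> None:
    # SIGTERM the whole group so anything the worker spawned goes down with it
    os.killpg(worker.pid, signal.SIGTERM)
    worker.wait()
    # ...then respawn and reattach to the bridge sockets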

I know this is a giant wall of text, but I want to make sure the right shape is emerging here (cc @michaeldwan and @markphelps for visibility and planning)
