
Conversation

@NikhilSinha1
Contributor

Summary

We want to add a user-defined healthcheck function so users can tell us whether their container is healthy, since they may know more about what makes their system "healthy" than we do. This PR adds hooks for the user to pass a healthcheck function to us, which we run on their behalf whenever the /health-check endpoint is hit.
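
For illustration, a user-facing healthcheck could look something like this (a sketch only; the hook name and signature here are illustrative, not necessarily the final API):

from cog import BasePredictor


class Predictor(BasePredictor):
    def setup(self) -> None:
        self.model = load_model()  # hypothetical model-loading helper

    async def healthcheck(self) -> bool:
        # The user checks whatever signals "healthy" for their system,
        # e.g. that the model is loaded and still responds to a ping.
        return self.model is not None and self.model.responds_to_ping()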

Test Plan

Unit tests added to verify the behavior when a user's healthcheck succeeds, fails, times out, or errors.
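
For example, the timeout case boils down to something like this (sketch; the real tests exercise the worker's healthcheck plumbing rather than calling asyncio.wait_for directly):

import asyncio

import pytest


def test_healthcheck_timeout():
    async def slow_healthcheck() -> bool:
        await asyncio.sleep(60)  # never completes within the timeout
        return True

    async def run() -> None:
        await asyncio.wait_for(slow_healthcheck(), timeout=0.1)

    with pytest.raises(asyncio.TimeoutError):
        asyncio.run(run())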

@NikhilSinha1 NikhilSinha1 requested a review from a team as a code owner January 24, 2026 00:25
timeout=HEALTHCHECK_TIMEOUT,
)

if result is False or result is None:

I find it a bit confusing that None counts as unhealthy, since it's the default value when the function has a bare return or no return statement at all. WDYT about only treating result is False as a failure?
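
For example, this healthcheck looks healthy but would be reported as unhealthy under the current check:

async def healthcheck(self):
    verify_gpu()  # hypothetical check that raises if something is wrong
    # falls off the end without an explicit return, so the result is None
    # and the `result is None` branch treats it as a failure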

except asyncio.TimeoutError:
done.error = True
done.error_detail = f"Healthcheck failed: user-defined healthcheck timed out after {HEALTHCHECK_TIMEOUT} seconds"
print(f"Healthcheck timed out after {HEALTHCHECK_TIMEOUT} seconds")

Where is this output intended to go?
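
If the intent is the container logs, routing it through logging would at least respect whatever handlers are configured, e.g. (sketch):

import logging

log = logging.getLogger(__name__)

# in the except asyncio.TimeoutError handler, instead of print():
log.warning("Healthcheck timed out after %s seconds", HEALTHCHECK_TIMEOUT)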

custom_health_error = healthcheck_result.error_detail

if not custom_health_ok:
health = Health.SETUP_FAILED

You're sure we shouldn't introduce a new value like Health.UNHEALTHY?
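
i.e. something like this (sketch; the existing members are guesses based on the states mentioned elsewhere in this thread):

import enum


class Health(enum.Enum):
    STARTING = "STARTING"
    READY = "READY"
    BUSY = "BUSY"
    SETUP_FAILED = "SETUP_FAILED"
    UNHEALTHY = "UNHEALTHY"  # proposed: setup succeeded, but the user healthcheck is failing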

@tempusfrangit
Member

tempusfrangit commented Jan 24, 2026

We should catch up on this next week. There is a general movement in the works to move away from the python worker/runner, so we’ll want to ensure that this feature also lands in the rust replacement and isn’t lost in translation.

Generally, we will need to ensure that the architecture works with the split runner model we’re moving towards, with its clear IPC transit.

I’ll add that there is likely a better approach here with the new architecture. We have a concept of a poisoned prediction slot (think like a poisoned mutex). We can expose a way to mark that slot as poisoned via this callback which will then cause the slot to no longer be able to accept requests. We’ll want to bubble up the cause of the poisoned slot so that control systems can take action externally.
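
In python-shaped pseudocode, just to show the shape (the real slot lives on the rust side):

from dataclasses import dataclass
from typing import Optional


@dataclass
class PredictionSlot:
    # None means healthy; a string records why the slot was poisoned
    poisoned: Optional[str] = None

    def poison(self, reason: str) -> None:
        # invoked via the healthcheck callback; once set, the slot stops
        # accepting requests and the reason is bubbled up to external
        # control systems
        self.poisoned = reason

    def can_accept(self) -> bool:
        return self.poisoned is None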

I’m happy to help implement this on the rust side and we should ensure that we add the .txtar test for it.

---

Come to think of it, we should discuss whether we want to support a “burn the worker subprocess” trigger. Since we’re managing the subprocess and interpreter in a way that gives us much more control over process groups, we can consider things like “force worker recycle” behaviors without impacting the overall state of the pod. Recycling the subprocess worker should be really straightforward [needs a control channel IPC message and a mechanism to recycle the worker and all of its subprocesses, if any, and probably skip calling setup]. This would cause the model container to go unready for a short window of time.

This would most likely map better to an internal model control system “we know the worker is unhealthy, recycle it” instead of failing the health-checks.

Something like cog.recycle_worker(<reason>, rerun_setup=False), which could be wired up with something like:

try:
    import coglet
except ImportError:
    class coglet:
        active = False


def recycle_worker(*, reason: str, rerun_setup: bool = False) -> None:
    if coglet.active:
        coglet.recycle_worker(reason=reason, rerun_setup=rerun_setup)
    else:
        raise RuntimeError("cannot recycle workers when not running under a coglet subprocess")

The coglet recycle_worker would then be implemented in the orchestrator: https://github.com/replicate/cog/blob/main/crates/coglet/src/orchestrator.rs#L332-L373

tokio::spawn(async move { ... })

(this would require a slight inversion of flow, but nothing terribly hard. As long as the slots aren’t poisoned, we can even connect the new worker to the same bridge domain socket endpoints, assuming the old worker really does exit)

A couple of key constraints: during this worker recycle we will probably need to transition the health check back to BUSY rather than STARTING, or adjust our heuristic that treats “BUSY -> STARTING” as fatal. We could also introduce a new state in the health-check to help differentiate a container crash/restart from an explicit worker restart.

One more thing: we’ll want to setpgid when spawning the worker, unless that breaks python multiprocessing/subprocess semantics, so that we can identify all children spun off by the worker. The key is that this is a reset. Instead of rerun_setup, we could support a worker_recycle_setup function [like setup] that can spin up processes but skip weights loading?
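
Roughly like this on the spawning side (sketch; on python 3.11+ Popen can create the process group directly, older versions would use preexec_fn=os.setpgrp instead):

import os
import signal
import subprocess

worker_cmd = ["python", "-m", "worker"]  # placeholder command

# spawn the worker in its own process group so its children are traceable
worker = subprocess.Popen(worker_cmd, process_group=0)


def recycle(worker: subprocess.Popen) -> None:
    # SIGTERM the whole group so anything the worker spawned goes down with it
    os.killpg(worker.pid, signal.SIGTERM)
    worker.wait()
    # ...then respawn and reattach to the bridge sockets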

I know this is a giant wall of text, but I want to make sure the right shape is emerging here (cc @michaeldwan and @markphelps for visibility and planning)
