Rewriting the 'RPO and RTO' page to clear up common confusion #4091

Open

lukeknep wants to merge 7 commits into main from rpo-rewrite

+80 −44

Contributor

lukeknep commented Dec 23, 2025 •

What does this PR do?

Corrects errors and clears up common confusion points around our RPO, RTO, and SLA.

- Explain the difference between RTO and SLA, and why a 20-minute RTO can still meet a 99.99% SLA.
- Clarify that Temporal-initiated failovers must be enabled for the RTO to apply.
- Clarify that MRR and MCR are still protected against AZ failures and cell failures.
- Fix the "8-hour RPO / RTO" for non-HA workloads.

Internal Note on the previously-stated 8-hour RTO / RPO for non-HA Namespaces:

- An "8-hour RTO" doesn't make sense when we are entirely dependent on the underlying cloud infrastructure: if the infra is down for 24 hours, there's no way we can meet an 8-hour RTO.
- Conversely, if the infra is only down for 20 minutes, then an 8-hour RTO may be too long.
- Additionally, the 8-hour RPO needs to be carefully explained, as it is not relevant to most outages; historically, most outages have not caused data corruption. But a customer who just reads "8-hour RPO" might erroneously think, "oh no, if the region has an incident like the AWS us-east-1 incident, I may lose 8 hours of data."

lukeknep requested review from a team and bechols as code owners

December 23, 2025 21:37

vercel bot commented Dec 23, 2025 •

The latest updates on your projects. Learn more about Vercel for GitHub.

Project	Deployment	Review	Updated (UTC)
temporal-documentation	Ready	Preview, Comment	Jan 21, 2026 2:07pm

Contributor

github-actions bot commented Dec 23, 2025 •

📖 Docs PR preview links

Cloud
- RPO and RTO

lukeknep changed the title ~~[WIP / Feedback requested] Rewriting the 'RPO and RTO' page to clear up common confusion~~ [Feedback requested DO NOT MERGE] Rewriting the 'RPO and RTO' page to clear up common confusion

lukeknep commented

docs/cloud/rto-rpo.mdx

    
              Temporal Cloud strives to maintain a P95 [replication lag](/cloud/high-availability/monitoring#replication-lag-metric) of less than 1 minute.

Contributor Author

lukeknep Dec 23, 2025 •

I removed this bit because

I'm not sure p95 is good enough. This could be read as "up to 5% of Namespaces could be above the 1-minute RPO at any given moment."
We already say we have a 1-minute RPO. I don't think we need additional standards / goals to be publicly stated. They would only add confusion. Let's state our main goal (RPO) and stand by it.
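The P95 concern can be made concrete with a small sketch. It assumes a hypothetical set of replication-lag samples; the values and the `nearest_rank_percentile` helper are illustrative only and are not Temporal Cloud data or APIs. It shows that a P95 well under 1 minute is still compatible with 5% of samples sitting far above the 1-minute RPO.

```python
import math


def nearest_rank_percentile(samples, pct):
    """Return the nearest-rank percentile (0 < pct <= 100) of a non-empty list."""
    ordered = sorted(samples)
    # 1-based rank of the smallest value that covers pct percent of the samples.
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[rank - 1]


# Hypothetical replication-lag samples in seconds: 95 healthy, 5 far behind.
lags = [10] * 95 + [600] * 5

p95 = nearest_rank_percentile(lags, 95)
share_over_rpo = sum(lag > 60 for lag in lags) / len(lags)

print(f"P95 lag: {p95}s, share of samples above 60s: {share_over_rpo:.0%}")
```

In this made-up data set the P95 target is met even though one sample in twenty is ten minutes behind, which is the reading the comment above warns about.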

bechols approved these changes

docs/evaluate/temporal-cloud/sla.mdx Outdated (resolved)

docs/production-deployment/cloud/rto-rpo.mdx Outdated

    
              Recovery Point Objective (RPO) and Recovery Time Objective (RTO) define data durability and service restoration timelines, respectively.

              In Temporal Cloud, these objectives vary depending on your deployment configuration and the scope of any failure.

              Recovery Point Objective (RPO) and Recovery Time Objective (RTO) are the objectives that Temporal strives to meet for data durability (recovery point) and service restoration (recovery time) in the event of cloud outages.

Contributor

bechols Dec 23, 2025

RPO and RTO are how we measure, and low values for RPO and RTO are what we strive for. Could tighten this phrasing.

Contributor Author

lukeknep Dec 26, 2025 •

That's not accurate (or at least, if I understand your comment correctly, that's not how the terms are used in the industry)

Recovery Point/Time "Objective": this is the goal we have for all outages. That's why the term has "Objective" in its name.
recovery time / recovery point: these are the actual observed values in a given outage. I could say "observed recovery time" or "achieved recovery time," but that gets bloated.

I wanted to make the distinction between the two terms really clear in the doc. If it's not clear, then I need to reword it.

P.S. Confirmed the industry standard with GPT 5.1:

Contributor

bechols Dec 27, 2025

To me, the current wording boils down to "we strive for RPO", which doesn't really have any informational content without the actual numerical objective. "We strive for zero RPO" or "we strive for sub 20 minute RTO" is informative.

Trying this wording with a similar concept: "Uptime is the objective that Temporal strives to meet for availability (service accessibility)". I think it's clearer/more informative to say something like "Temporal Cloud measures availability in terms of service uptime, and has a 99.99% availability SLO and 99.% availability SLA"

All that said: happy to merge as-is.

docs/production-deployment/cloud/rto-rpo.mdx Outdated

    
              Recovery Point Objective (RPO) and Recovery Time Objective (RTO) define data durability and service restoration timelines, respectively.

              In Temporal Cloud, these objectives vary depending on your deployment configuration and the scope of any failure.

              Recovery Point Objective (RPO) and Recovery Time Objective (RTO) are the objectives that Temporal strives to meet for data durability (recovery point) and service restoration (recovery time) in the event of cloud outages.

              These objectives are high priority goals for Temporal Cloud, but are not a contractual commitment.

Contributor

bechols Dec 23, 2025

This sounds rough. Can we say RPO + RTO aren't part of the availability SLA instead (and link to the SLA page)?

Contributor

bechols Dec 27, 2025

Suggested change

      
            These objectives are high priority goals for Temporal Cloud, but are not a contractual commitment.
          
            Temporal Cloud's RPO and RTO are complementary to but separate from the [availability SLA](/cloud/sla).

docs/production-deployment/cloud/rto-rpo.mdx Outdated (resolved)

docs/production-deployment/cloud/rto-rpo.mdx Outdated

    
              In case of an outage in the active region or cell, Temporal Cloud will failover to the replica so that existing Workflow Executions will continue to run and new Executions can be started.

              ## High Availability, Regional Failure

              The Recovery Point Objective and Recovery Time Objective for Temporal Cloud depend on the type of outage and which [High Availability](/cloud/high-availability) feature your Namespace has enabled:

Contributor

bechols Dec 23, 2025

This breakdown is great!

docs/production-deployment/cloud/rto-rpo.mdx Outdated (resolved)

docs/production-deployment/cloud/rto-rpo.mdx Outdated

    
              Temporal has put extensive work into tools and processes that minimize the recovery point and achieve its RPO for Temporal-initiated failovers, including:

              **Recovery Time Objective (RTO) - 20 minutes**

              - Best-in-class data replication technology that keeps the replica up to date with the active.

Contributor

bechols Dec 23, 2025

Could link to https://www.youtube.com/watch?v=mULBvv83dYM where Liang gets into more specifics

docs/production-deployment/cloud/rto-rpo.mdx Outdated (resolved)

docs/cloud/rto-rpo.mdx

    
              **All writes to storage are synchronously replicated across AZs**, including our writes to ElasticSearch.

              ElasticSearch is eventually consistent, but this does not impact our RPO as there is no data loss.

              - You can detect outages that Temporal doesn't. In the cloud, regional outages never affect every service the same way. It's possible that Temporal--and the services it depends on--are unaffected by the outage, even while your Workers or other cloud infrastructure are disrupted. If you monitor each service in your critical path and alert on unusual

Contributor

bechols Dec 23, 2025

Can we suggest or link to specific guidance on how to do this?

docs/production-deployment/cloud/rto-rpo.mdx Outdated (resolved)

vercel bot deployed to Preview

December 26, 2025 19:40

vercel bot had a problem deploying to Preview

January 9, 2026 22:15

Failure

vercel bot had a problem deploying to Preview

January 19, 2026 22:03

Failure

vercel bot had a problem deploying to Preview

January 20, 2026 14:49

Failure

lukeknep changed the title ~~[Feedback requested DO NOT MERGE] Rewriting the 'RPO and RTO' page to clear up common confusion~~ Rewriting the 'RPO and RTO' page to clear up common confusion

sergeybykov approved these changes

Member

sergeybykov left a comment

A general good practice is to start each sentence on a new line, so that a later reviewer can add a comment to a specific sentence.

docs/cloud/rto-rpo.mdx

    
              Temporal Cloud delivers different RPO/RTOs based on these scenarios because of the way our platform performs writes to our data provider.

              As Workflows progress in the active region, history events are asynchronously replicated to the replica.

              Because replication is asynchronous, High Availability does not impact the latency or throughput of Workflow Executions in the active region.

              If an outage hits the active region or cell, Temporal Cloud will failover to the replica so that existing Workflow Executions will continue to run and new Workflow Executions can be started.

Member

sergeybykov Jan 20, 2026

"failover" -> "fail over" (verb vs. noun)

docs/cloud/rto-rpo.mdx

    
              This means there is _no_ logical corruption and restoration is done from a live replicated instance.

              This applies for both single region Namespaces and multi region Namespaces.

              - Regular drills where we failover our internal Namespaces to test our tooling.

Member

sergeybykov Jan 20, 2026

"failover" -> "fail over" (verb vs. noun)

docs/cloud/rto-rpo.mdx

    
              **Recovery Time Objective (RTO) - 0**

              - You can sequence your failovers in a particular order. Your cloud infrastructure probably contains more pieces than just your Temporal Namespace: Temporal Workers, compute pools, data stores, and other cloud services. If you manually failover, you can choose the order in which these pieces switch to the replica region. You can then test that ordering with failover drills and ensure it executes smoothly without data consistency issues or bottlenecks.

Member

sergeybykov Jan 20, 2026

"If you manually failover" -> "If you manually fail over" (verb vs. noun)

docs/cloud/rto-rpo.mdx

    
              Temporal is active-active across AZs.

              The RTO is stated to be zero, meaning there should be no downtime in such scenarios.

              - You can proactively failover more aggressively than Temporal. While the 20-minute RTO should be sufficient for most use cases, some may strive to hit an even lower RTO. For workloads like high frequency trading, auctions, or popular sporting events, an outage at the wrong time could cause tremendous lost revenue per minute. You can adopt a posture that fails over more eagerly than Temporal does. For example, you could trigger a manual failover at the first sign of a possible disruption, before knowing whether there's a true regional outage.

Member

sergeybykov Jan 20, 2026

"proactively failover" -> "proactively fail over" (verb vs. noun)

docs/cloud/rto-rpo.mdx

    
              The RTO is stated to be zero, meaning there should be no downtime in such scenarios.

              - You can proactively failover more aggressively than Temporal. While the 20-minute RTO should be sufficient for most use cases, some may strive to hit an even lower RTO. For workloads like high frequency trading, auctions, or popular sporting events, an outage at the wrong time could cause tremendous lost revenue per minute. You can adopt a posture that fails over more eagerly than Temporal does. For example, you could trigger a manual failover at the first sign of a possible disruption, before knowing whether there's a true regional outage. 

              - Even if you have robust tooling to detect an outage and trigger a failover, leaving Temporal-initiated failovers enabled provides a "safety net" in case your automation misses an outage. It also gives Temporal leeway to preemptively failover your Namespace if we detect that it may be disrupted soon, e.g., by a rolling failure that has impacted other Namespaces but not yours, yet.

Member

sergeybykov Jan 20, 2026

"preemptively failover" -> "preemptively fail over" (verb vs. noun)

docs/cloud/rto-rpo.mdx

    
              - Namespace 1_A was in the region and its cell experienced a partial degradation that caused 10% of requests to fail in the first 5 minutes, 25% in the second five minutes, and 50% in the third five minutes. Since it was significantly impacted from 10:00:00 to 10:15:00, its Recovery Time was 15 minutes. If it had no other service errors that month, then its service error rate for the month would be: ( (1 - 10%) + (1 - 25%) + (1 - 50%) + 8925 * 100% ) / 8928  = 99.990%. (Note: there are 8928 5-minute periods in a 31-day month.)

              - Namespace 1_B was in the same cell as Namespace 1_A, so it also experienced a partial degradation that caused 10% of requests to fail. However, its owner detected the outage via their own tooling and decided to manually failover at 10:05:00. This Namespace achieved a recovery time of 5 minutes and a service error rate of ( 1 * (1 - 10%) + 8927 * 100% ) / 8928 = 99.998%.

Member

sergeybykov Jan 20, 2026

"failover" -> "fail over" (verb vs. noun)

docs/cloud/rto-rpo.mdx

    
              - Namespace 2_A was in the region and its cell was fully network partitioned at the start of the outage, causing 100% of requests to fail. Since it was significantly impacted from 10:00:00 to 10:15:00, its Recovery Time was 15 minutes. If it had no other service errors that month, then its service error rate for the month would be: ( 3 * (1 - 100%) + 8637 * 100% ) / 8640 5-minute periods per month = 99.97%.

              - Namespace 2_B was in the region and was fully network partitioned, causing 100% of requests to fail. However, its owner detected the outage via their own tooling and decided to manually failover at 10:05:00. This Namespace achieved a recovery time of 5 minutes and a service error rate of ( 1 5-minute periods * (1 - 100%) + 8639 5-minute periods * 100% ) / 8640 5-minute periods per month = 99.99%.

Member

sergeybykov Jan 20, 2026

"failover" -> "fail over" (verb vs. noun)

docs/cloud/rto-rpo.mdx

    
              - Namespace 2_B was in the region and was fully network partitioned, causing 100% of requests to fail. However, its owner detected the outage via their own tooling and decided to manually failover at 10:05:00. This Namespace achieved a recovery time of 5 minutes and a service error rate of ( 1 5-minute periods * (1 - 100%) + 8639 5-minute periods * 100% ) / 8640 5-minute periods per month = 99.99%.

              All of the above Namespaces were in the affected region, but they achieved varying recovery times and service error rates.

Member

sergeybykov Jan 20, 2026

Can we say that in all these cases RPO was zero?
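The monthly figures quoted for Namespaces 1_A through 2_B can be re-derived with a short sketch. It follows the arithmetic in the hunks rather than their label: the text calls the result a "service error rate," but the value computed is the average per-period success fraction. The function name, the scenario table, and the choice of a 31-day month for the 1_x Namespaces and a 30-day month for the 2_x Namespaces are assumptions taken from the divisors that appear in the hunks (8928 and 8640).

```python
def monthly_success_rate(error_fractions, total_periods):
    """Average per-period success fraction over a month.

    error_fractions: failed-request fraction for each affected 5-minute period.
    total_periods: number of 5-minute periods in the month. Every period not
    listed in error_fractions is assumed to be error-free.
    """
    if len(error_fractions) > total_periods:
        raise ValueError("more affected periods than periods in the month")
    healthy_periods = total_periods - len(error_fractions)
    affected_success = sum(1 - e for e in error_fractions)
    return (affected_success + healthy_periods) / total_periods


PERIODS_31_DAYS = 31 * 24 * 12  # 5-minute periods in a 31-day month
PERIODS_30_DAYS = 30 * 24 * 12  # 5-minute periods in a 30-day month

# Scenario inputs as described in the hunks above (assumed month lengths).
scenarios = {
    "1_A": ([0.10, 0.25, 0.50], PERIODS_31_DAYS),
    "1_B": ([0.10], PERIODS_31_DAYS),
    "2_A": ([1.0, 1.0, 1.0], PERIODS_30_DAYS),
    "2_B": ([1.0], PERIODS_30_DAYS),
}

for name, (errors, periods) in scenarios.items():
    rate = monthly_success_rate(errors, periods)
    print(f"Namespace {name}: {rate:.4%}")
```

Under these assumptions the partially degraded Namespaces stay above 99.99% for the month, while a fully partitioned Namespace falls below it unless its disruption is cut short, which is the distinction between a per-outage RTO and a per-month SLA that the PR sets out to explain.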

lukeknep and others added 7 commits

January 21, 2026 08:05


          Rewriting the 'RPO and RTO' page to clear up common confusion

947ad20


          Updated RTO and RPO page based on feedback

049ac6c


          Update docs/production-deployment/cloud/rto-rpo.mdx

fd64ca2

Co-authored-by: Ben Echols <benjamin.echols@temporal.io>


          Update docs/production-deployment/cloud/rto-rpo.mdx

35c51f5

Co-authored-by: Ben Echols <benjamin.echols@temporal.io>


          Remove changes to SLA page

82601a4


          typo

cea6809

typo removed


          Rewrite the RPO and RTO goals to be more succinct

76593c7

flippedcoder force-pushed the rpo-rewrite branch from 900b2d5 to 76593c7 Compare

January 21, 2026 14:05

vercel bot deployed to Preview

January 21, 2026 14:07

flippedcoder reviewed

docs/cloud/rto-rpo.mdx

    
              1. **[High Availability](/cloud/high-availability) features enabled** (same-region, multi-region, or multi-cloud replication): Sub-1-minute RPO and 20 minutes or less RTO

              2. **Default (non-HA) namespace, regional failure**: 8-hour RPO and RTO

              3. **Default (non-HA) namespace, availability zone failure**: 0 RPO and RTO

              To achieve the lowest RPO and RTO, Temporal Cloud offers [High Availability](/cloud/high-availability) features that keep Workflows operational with minimal downtime. When High Availability is enabled on a Namespace, the user chooses a region to place a "replica" that will take over in the event of a failure. The location of the replica determines the type of replication used and the type of outages that can be handled. "Multi-region Replication" is when the active and replica are in different regions on the same cloud (e.g., AWS us-east-1 and AWS us-west-2). "Multi-cloud Replication" is when the active and replica are in different clouds (e.g., AWS and GCP). "Same-region Replication" is when the active and replica are in the same region. Temporal always places the active and replica in different [cells](/cloud/sla).

Contributor

flippedcoder Jan 21, 2026

Suggested change

      
            To achieve the lowest RPO and RTO, Temporal Cloud offers [High Availability](/cloud/high-availability) features that keep Workflows operational with minimal downtime. When High Availability is enabled on a Namespace, the user chooses a region to place a "replica" that will take over in the event of a failure. The location of the replica determines the type of replication used and the type of outages that can be handled. "Multi-region Replication" is when the active and replica are in different regions on the same cloud (e.g., AWS us-east-1 and AWS us-west-2). "Multi-cloud Replication" is when the active and replica are in different clouds (e.g., AWS and GCP). "Same-region Replication" is when the active and replica are in the same region. Temporal always places the active and replica in different [cells](/cloud/sla).
          
            To achieve the lowest RPO and RTO, Temporal Cloud offers [High Availability](/cloud/high-availability) features that keep Workflows operational with minimal downtime. When High Availability is enabled on a Namespace, the user chooses a region to place a "replica" that will take over in the event of a failure. The location of the replica determines the type of replication used and the type of outages that can be handled. Multi-region Replication is when the active and replica are in different regions on the same cloud (e.g., AWS us-east-1 and AWS us-west-2). Multi-cloud Replication is when the active and replica are in different clouds (e.g., AWS and GCP). Same-region Replication is when the active and replica are in the same region. Temporal always places the active and replica in different [cells](/cloud/sla).
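The three replication types described in this paragraph reduce to a comparison of where the active and the replica live. A minimal sketch of that rule follows; the `Location` type and `replication_type` function are hypothetical names used for illustration and are not part of any Temporal API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Location:
    cloud: str   # e.g. "aws" or "gcp"
    region: str  # e.g. "us-east-1"


def replication_type(active: Location, replica: Location) -> str:
    """Classify a replica placement using the definitions in the paragraph above."""
    if active.cloud != replica.cloud:
        return "multi-cloud"
    if active.region != replica.region:
        return "multi-region"
    # Same cloud and same region; per the text, the cells still differ.
    return "same-region"


active = Location("aws", "us-east-1")
print(replication_type(active, Location("aws", "us-west-2")))
print(replication_type(active, Location("gcp", "us-central1")))
print(replication_type(active, Location("aws", "us-east-1")))
```

The cloud comparison comes first so that two regions that happen to share a name on different clouds are still classified as multi-cloud.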

docs/cloud/rto-rpo.mdx

    
              If an outage hits the active region or cell, Temporal Cloud will failover to the replica so that existing Workflow Executions will continue to run and new Workflow Executions can be started.

              ## High Availability, Regional Failure

              The Recovery Point Objective and Recovery Time Objective for Temporal Cloud depend on the type of outage and which [High Availability](/cloud/high-availability) feature your Namespace has enabled. Temporal Cloud can only set an RPO and RTO for cases where it has the ability to mitigate the outage. Therefore, the below RPOs and RTOs apply to Namespaces that have the corresponding type of replication and have enabled Temporal-initiated failovers (which comes enabled by default).

Contributor

flippedcoder Jan 21, 2026

Suggested change

      
            The Recovery Point Objective and Recovery Time Objective for Temporal Cloud depend on the type of outage and which [High Availability](/cloud/high-availability) feature your Namespace has enabled. Temporal Cloud can only set an RPO and RTO for cases where it has the ability to mitigate the outage. Therefore, the below RPOs and RTOs apply to Namespaces that have the corresponding type of replication and have enabled Temporal-initiated failovers (which comes enabled by default). 
          
            The Recovery Point Objective and Recovery Time Objective for Temporal Cloud depend on the type of outage and which [High Availability](/cloud/high-availability) feature your Namespace has enabled. Temporal Cloud can only set an RPO and RTO for cases where it has the ability to mitigate the outage. Therefore, the below RPOs and RTOs apply to Namespaces that have the corresponding type of replication and have enabled Temporal-initiated failovers, which comes enabled by default.

docs/cloud/rto-rpo.mdx

    
              Notes:

              **Recovery Point Objective (RPO) - sub-1-minute**

              - The above goals are only applicable to Namespaces that have enabled "Temporal-initiated failovers" (which comes enabled by default). Temporal-initiated failovers are initiated by Temporal's tooling and/or on-call engineers without user action. Users can always initiate a failover on their Namespace, even when Temporal-initiated failovers are enabled. In an outage, a user-initiated failover will not "cancel out" or "accidentally reverse" a Temporal-initiated failover. **Temporal highly recommends keeping Temporal-initiated failovers enabled.** When Temporal-initiated failovers are _disabled,_ Temporal Cloud cannot set an RPO and RTO for that Namespace, because it cannot control when or if the user will trigger a failover.

Contributor

flippedcoder Jan 21, 2026

Suggested change

      
            - The above goals are only applicable to Namespaces that have enabled "Temporal-initiated failovers" (which comes enabled by default). Temporal-initiated failovers are initiated by Temporal's tooling and/or on-call engineers without user action. Users can always initiate a failover on their Namespace, even when Temporal-initiated failovers are enabled. In an outage, a user-initiated failover will not "cancel out" or "accidentally reverse" a Temporal-initiated failover. **Temporal highly recommends keeping Temporal-initiated failovers enabled.** When Temporal-initiated failovers are _disabled,_ Temporal Cloud cannot set an RPO and RTO for that Namespace, because it cannot control when or if the user will trigger a failover.
          
            - The above goals are only applicable to Namespaces that have enabled Temporal-initiated failovers, which comes enabled by default. Temporal-initiated failovers are initiated by Temporal's tooling and/or on-call engineers without user action. Users can always initiate a failover on their Namespace, even when Temporal-initiated failovers are enabled. In an outage, a user-initiated failover will not cancel out or accidentally reverse a Temporal-initiated failover.
          
            :::note
          
            Temporal highly recommends keeping Temporal-initiated failovers enabled. When Temporal-initiated failovers are _disabled,_ Temporal Cloud cannot set an RPO and RTO for that Namespace, because it cannot control when or if the user will trigger a failover.
          
            :::

docs/cloud/rto-rpo.mdx

    
              - Monitoring, alerting, and internal SLOs on the replication lag for every Temporal Cloud Namespace.

              **Recovery Point Objective (RPO) - 8 hours**

              However, user actions on a Namespace can affect the recovery point. For example, suddenly "bursting" into much higher throughput than a Namespace has seen before could create a period of replication lag where the replica falls behind the active.

Contributor

flippedcoder Jan 21, 2026

Suggested change

      
            However, user actions on a Namespace can affect the recovery point. For example, suddenly "bursting" into much higher throughput than a Namespace has seen before could create a period of replication lag where the replica falls behind the active. 
          
            However, user actions on a Namespace can affect the recovery point. For example, suddenly spiking into much higher throughput than a Namespace has seen before could create a period of replication lag where the replica falls behind the active.

docs/cloud/rto-rpo.mdx

    
              - Our data provider “snapshot” duration which is _4 hours_

              - The time window of _4 hours_ allocated to detection of corruption point before we mitigate.

              Temporal provides a [replication lag](/cloud/high-availability/monitoring#replication-lag-metric) metric for each Namespace. This metric approximates the recovery point the Namespace would achieve in a "worst case" failure at that given moment. **Temporal recommends monitoring the replication lag and alerting should it rise too high,** e.g., above 1 minute.

Contributor

flippedcoder Jan 21, 2026

Suggested change

      
            Temporal provides a [replication lag](/cloud/high-availability/monitoring#replication-lag-metric) metric for each Namespace. This metric approximates the recovery point the Namespace would achieve in a "worst case" failure at that given moment. **Temporal recommends monitoring the replication lag and alerting should it rise too high,** e.g., above 1 minute.
          
            Temporal provides a [replication lag](/cloud/high-availability/monitoring#replication-lag-metric) metric for each Namespace. This metric approximates the recovery point the Namespace would achieve in a worst case failure at that given moment.
          
            :::note
          
            Temporal recommends monitoring the replication lag and alerting should it rise too high, e.g., above 1 minute.
          
            :::
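The alerting recommendation could be sketched as a simple threshold check over recent lag samples. This is only an illustration: it does not query the real replication lag metric, the samples are passed in by the caller, and the requirement of several consecutive breaches (`min_consecutive`) is an assumption added here to avoid alerting on a single spike.

```python
def lag_alert_needed(lag_samples_s, threshold_s=60.0, min_consecutive=3):
    """Return True if the latest `min_consecutive` samples all exceed the threshold.

    lag_samples_s: replication-lag readings in seconds, oldest first.
    threshold_s: alert threshold; 60 seconds mirrors the "above 1 minute" example.
    min_consecutive: hypothetical debounce setting, not taken from the source.
    """
    if min_consecutive < 1:
        raise ValueError("min_consecutive must be at least 1")
    if len(lag_samples_s) < min_consecutive:
        return False
    recent = lag_samples_s[-min_consecutive:]
    return all(lag > threshold_s for lag in recent)


# Hypothetical readings: a brief spike, then a sustained rise.
print(lag_alert_needed([5, 90, 4, 6]))        # single spike, already recovered
print(lag_alert_needed([5, 70, 85, 120]))     # three consecutive breaches
```

A stricter posture would set `min_consecutive=1`, trading more noise for earlier warning.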

docs/cloud/rto-rpo.mdx

    
              To achieve the lowest possible recovery times, Temporal recommends that you 1. keep Temporal-initiated failovers enabled on your Namespace (the default), and 2. invest in a process to detect outages and trigger a manual failover. Users can trigger manual failovers on their Namespaces even if Temporal-initiated failovers are enabled. There are several benefits to combining a manual failover process with Temporal-initiated failovers:

              Anything that gets committed into the zone is protected by replication in another AZ.

              - You can detect outages that Temporal doesn't. In the cloud, regional outages never affect every service the same way. It's possible that Temporal--and the services it depends on--are unaffected by the outage, even while your Workers or other cloud infrastructure are disrupted. If you monitor each service in your critical path and alert on unusual

Contributor

flippedcoder Jan 21, 2026

I think the sentence was cut off here 😅 If you monitor each service in your critical path and alert on unusual ...

docs/cloud/rto-rpo.mdx

    
              | Aspect              | RTO | SLA |
              |---------------------|-----|-----|
              | What is it?         | An objective, or high-priority goal, for the total time that an outage disrupts a Namespace. | A contractual agreement that sets an upper bound on the service error rate, with financial repercussions. |
              | How is it measured? | The achieved "recovery time" is measured in terms of "minutes per outage." | The achieved "service error rate" is measured in terms of "error rate per month." |

Contributor

flippedcoder Jan 21, 2026

Suggested change

      
            | How is it measured?               | The achieved "recovery time" is measured in terms of "minutes per outage."                                                                                                                                                                                              | The achieved "service error rate" is measured in terms of "error rate per month."                                                                                                                                                                                                                                                                                                                                                                                              |
          
            | How is it measured?               | The achieved recovery time is measured in terms of minutes per outage.                                                                                                                                                                                              | The achieved service error rate is measured in terms of error rate per month.                                                                                                                                                                                                                                                                                                                                                                                              |

docs/cloud/rto-rpo.mdx

    
              |-----------------------------------|-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|

              | What is it?                       | An objective, or high-priority goal, for the total time that an outage disrupts a Namespace.                                                                                                                                                                            | A contractual agreement that sets an upper bound on the service error rate, with financial repercussions.                                                                                                                                                                                                                                                                                                                                                                      |

              | How is it measured?               | The achieved "recovery time" is measured in terms of "minutes per outage."                                                                                                                                                                                              | The achieved "service error rate" is measured in terms of "error rate per month."                                                                                                                                                                                                                                                                                                                                                                                              |

              | How is the calculation performed? | The achieved recovery time in a given outage is the total time between `<when a disruption to a Namespace began>` and `<when the Namespace was restored to full functionalilty>`, either after a `failover` to a healthy region or after the outage has been mitigated. | Temporal measures the percentage of requests to Temporal Cloud that fail, and applies a [formula](/cloud/sla) to get the final percentage for the month.                                                                                                                                                                                                                                                                                                                       |

Contributor

flippedcoder Jan 21, 2026

Suggested change

      
            | How is the calculation performed? | The achieved recovery time in a given outage is the total time between `<when a disruption to a Namespace began>` and `<when the Namespace was restored to full functionalilty>`, either after a `failover` to a healthy region or after the outage has been mitigated. | Temporal measures the percentage of requests to Temporal Cloud that fail, and applies a [formula](/cloud/sla) to get the final percentage for the month.                                                                                                                                                                                                                                                                                                                       |
          
            | How is the calculation performed? | The achieved recovery time in a given outage is the total time between when a disruption to a Namespace began and when the Namespace was restored to full functionality, either after a failover to a healthy region or after the outage has been mitigated. | Temporal measures the percentage of requests to Temporal Cloud that fail, and applies a [formula](/cloud/sla) to get the final percentage for the month.                                                                                                                                                                                                                                                                                                                       |

docs/cloud/rto-rpo.mdx

    
              | What is it?                       | An objective, or high-priority goal, for the total time that an outage disrupts a Namespace.                                                                                                                                                                            | A contractual agreement that sets an upper bound on the service error rate, with financial repercussions.                                                                                                                                                                                                                                                                                                                                                                      |

              | How is it measured?               | The achieved "recovery time" is measured in terms of "minutes per outage."                                                                                                                                                                                              | The achieved "service error rate" is measured in terms of "error rate per month."                                                                                                                                                                                                                                                                                                                                                                                              |

              | How is the calculation performed? | The achieved recovery time in a given outage is the total time between `<when a disruption to a Namespace began>` and `<when the Namespace was restored to full functionalilty>`, either after a `failover` to a healthy region or after the outage has been mitigated. | Temporal measures the percentage of requests to Temporal Cloud that fail, and applies a [formula](/cloud/sla) to get the final percentage for the month.                                                                                                                                                                                                                                                                                                                       |

              | Do partial degradations count?    | Most outages contain periods of __partial degradation__ where some % of Namespace operations fail while the rest complete as normal. When they disrupt a Namespace, periods of partial degradation count in the calculation of the recovery time.                       | Partial degradations only partially count for the service error rate calculation. A 5-minute window with a 10% error rate would count less than a 5-minute window with a 100% error rate.                                                                                                                                                                                                                                                                                      |

Contributor

flippedcoder Jan 21, 2026

Suggested change

      
            | Do partial degradations count?    | Most outages contain periods of __partial degradation__ where some % of Namespace operations fail while the rest complete as normal. When they disrupt a Namespace, periods of partial degradation count in the calculation of the recovery time.                       | Partial degradations only partially count for the service error rate calculation. A 5-minute window with a 10% error rate would count less than a 5-minute window with a 100% error rate.                                                                                                                                                                                                                                                                                      |
          
            | Do partial degradations count?    | Most outages contain periods of __partial degradation__ where some percentage of Namespace operations fail while the rest complete as normal. When they disrupt a Namespace, periods of partial degradation count in the calculation of the recovery time.                       | Partial degradations only partially count for the service error rate calculation. A 5-minute window with a 10% error rate would count less than a 5-minute window with a 100% error rate.                                                                                                                                                                                                                                                                                      |

docs/cloud/rto-rpo.mdx

    
              | How is it measured?               | The achieved "recovery time" is measured in terms of "minutes per outage."                                                                                                                                                                                              | The achieved "service error rate" is measured in terms of "error rate per month."                                                                                                                                                                                                                                                                                                                                                                                              |

              | How is the calculation performed? | The achieved recovery time in a given outage is the total time between `<when a disruption to a Namespace began>` and `<when the Namespace was restored to full functionalilty>`, either after a `failover` to a healthy region or after the outage has been mitigated. | Temporal measures the percentage of requests to Temporal Cloud that fail, and applies a [formula](/cloud/sla) to get the final percentage for the month.                                                                                                                                                                                                                                                                                                                       |

              | Do partial degradations count?    | Most outages contain periods of __partial degradation__ where some % of Namespace operations fail while the rest complete as normal. When they disrupt a Namespace, periods of partial degradation count in the calculation of the recovery time.                       | Partial degradations only partially count for the service error rate calculation. A 5-minute window with a 10% error rate would count less than a 5-minute window with a 100% error rate.                                                                                                                                                                                                                                                                                      |

              | What is excluded?                 | For partial degradations, what counts as a "disruption to a Namespace" is subject to Temporal's expert judgment, but a good rule of thumb is a service error rate >=10%.                                                                                                | We exclude outages that are out of Temporal's control to mitigate, e.g., a failure of the underlying cloud provider infrastructure that affects a Namespace without High Availability and Temporal-initiated failovers enabled. If a Namespace has the relevant High Availability feature and has Temporal-initiated failovers enabled, then Temporal can act to mitigate the outage and it does usually count against the SLA. Full exclusions on the [SLA page](/cloud/sla). |

Contributor

flippedcoder Jan 21, 2026

Suggested change

      
            | What is excluded?                 | For partial degradations, what counts as a "disruption to a Namespace" is subject to Temporal's expert judgment, but a good rule of thumb is a service error rate >=10%.                                                                                                | We exclude outages that are out of Temporal's control to mitigate, e.g., a failure of the underlying cloud provider infrastructure that affects a Namespace without High Availability and Temporal-initiated failovers enabled. If a Namespace has the relevant High Availability feature and has Temporal-initiated failovers enabled, then Temporal can act to mitigate the outage and it does usually count against the SLA. Full exclusions on the [SLA page](/cloud/sla). |
          
            | What is excluded?                 | For partial degradations, what counts as a disruption to a Namespace is subject to Temporal's expert judgment, but a good rule of thumb is a service error rate >=10%.                                                                                                | We exclude outages that are out of Temporal's control to mitigate, e.g., a failure of the underlying cloud provider infrastructure that affects a Namespace without High Availability and Temporal-initiated failovers enabled. If a Namespace has the relevant High Availability feature and has Temporal-initiated failovers enabled, then Temporal can act to mitigate the outage and it does usually count against the SLA. Full exclusions on the [SLA page](/cloud/sla). |
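
To illustrate the two measurements compared in the table rows under review, here is a minimal Python sketch. It assumes a hypothetical `Window` record holding per-window error rates, uses the 10% rule of thumb from the "What is excluded?" row as the disruption threshold, and uses a simple error-rate-weighted sum to show why a 10% window counts less than a 100% window. The names and the weighting are illustrative assumptions only; the actual SLA calculation is the formula on the SLA page and is not reproduced here.

```python
from dataclasses import dataclass

# Rule of thumb quoted in the table: a service error rate >= 10% is a disruption.
DISRUPTION_THRESHOLD = 0.10


@dataclass
class Window:
    """Hypothetical observation window; not a Temporal Cloud data structure."""
    start_minute: int   # minutes since the start of the observation period
    minutes: int        # length of the window in minutes
    error_rate: float   # fraction of requests that failed, 0.0 to 1.0


def recovery_time_minutes(windows, threshold=DISRUPTION_THRESHOLD):
    """Total time between when the disruption began and when it ended.

    Partial degradations at or above the threshold count in full.
    """
    disrupted = [w for w in windows if w.error_rate >= threshold]
    if not disrupted:
        return 0
    began = min(w.start_minute for w in disrupted)
    restored = max(w.start_minute + w.minutes for w in disrupted)
    return restored - began


def weighted_error_minutes(windows):
    """Illustrative weighting only (an assumption, not the SLA formula):
    each window contributes in proportion to its error rate."""
    return sum(w.error_rate * w.minutes for w in windows)


# Example outage: mild errors, then partial degradation, then a full outage.
outage = [
    Window(start_minute=0, minutes=5, error_rate=0.02),   # below threshold
    Window(start_minute=5, minutes=5, error_rate=0.10),   # partial degradation
    Window(start_minute=10, minutes=5, error_rate=1.00),  # full outage
    Window(start_minute=15, minutes=5, error_rate=0.00),  # recovered
]

if __name__ == "__main__":
    print("recovery time (minutes):", recovery_time_minutes(outage))
    print("weighted error-minutes:", round(weighted_error_minutes(outage), 2))
```

In this sketch the 10% window and the 100% window each add five minutes to the recovery time, while the 10% window contributes only a tenth as much to the weighted error figure. That is the distinction the "Do partial degradations count?" row draws.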
