Monitor Temporal Cloud

Temporal Cloud metrics help monitor production deployments. This documentation covers best practices for monitoring Temporal Cloud.

Monitor availability issues

When you see a sudden drop in Worker resource utilization, verify whether Temporal Cloud's API is showing increased latency and error rates.

Reference Metrics

temporal_cloud_v1_service_latency_p99

This metric measures latency for the SignalWithStartWorkflowExecution, SignalWorkflowExecution, and StartWorkflowExecution operations. These operations are mission-critical and are never throttled, so the metric is a good indicator of the lowest latency you can expect at the 99th percentile of requests.

Workflow execution latency

To monitor end-to-end Workflow execution time (not just the service API latency above), use the workflow schedule-to-close latency metrics:

These measure the time from when a Workflow is scheduled until it closes, including all Activity execution time. A sudden increase may indicate Worker capacity issues, downstream service degradation, or retry storms.

Monitor Temporal Service errors

Check for Temporal Service gRPC API errors. Note that Service API errors are not equivalent to guarantees mentioned in the Temporal Cloud SLA.

Reference Metrics

Prometheus Query for this Metric

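Measure your daily average success rate. The [1d:1m] subquery in the query below evaluates the success rate at 1-minute steps over the past day, and avg_over_time averages those values.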

OpenMetrics v1 metrics are pre-computed rates. Use sum() to aggregate across dimensions rather than increase() or rate().

avg_over_time((
    (
        (
            sum(temporal_cloud_v1_service_request_count{temporal_namespace=~"$namespace", operation=~"StartWorkflowExecution|SignalWorkflowExecution|SignalWithStartWorkflowExecution|RequestCancelWorkflowExecution|TerminateWorkflowExecution"})
            -
            sum(temporal_cloud_v1_service_error_count{temporal_namespace=~"$namespace", operation=~"StartWorkflowExecution|SignalWorkflowExecution|SignalWithStartWorkflowExecution|RequestCancelWorkflowExecution|TerminateWorkflowExecution"})
        )
        /
        sum(temporal_cloud_v1_service_request_count{temporal_namespace=~"$namespace", operation=~"StartWorkflowExecution|SignalWorkflowExecution|SignalWithStartWorkflowExecution|RequestCancelWorkflowExecution|TerminateWorkflowExecution"})
    )

    or vector(1)

    )[1d:1m])
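The arithmetic in the query above can be sketched in Python as a rough illustration. The function and type names are hypothetical, and the input is assumed to be one (request rate, error rate) pair per evaluation step, with None where no data exists. Returning 1.0 for a missing step mirrors `or vector(1)`; returning 1.0 for a zero request rate is a choice of this sketch to avoid dividing by zero, not something the query itself does.

```python
from typing import Optional, Sequence, Tuple

# One sample per evaluation step: (request_rate, error_rate), or None when
# no data exists at that step. These names are hypothetical.
Sample = Optional[Tuple[float, float]]


def step_success_rate(sample: Sample) -> float:
    """Success rate for one step, defaulting to 1.0 when there is no data."""
    if sample is None:
        return 1.0  # mirrors `or vector(1)` in the query
    requests, errors = sample
    if requests == 0:
        return 1.0  # sketch-only choice: no traffic counts as fully successful
    return (requests - errors) / requests


def average_success_rate(samples: Sequence[Sample]) -> float:
    """Average the per-step rates, as avg_over_time does over the subquery."""
    if not samples:
        raise ValueError("need at least one step")
    return sum(step_success_rate(s) for s in samples) / len(samples)
```

For example, `average_success_rate([(10.0, 1.0), None])` averages a 0.9 step with a defaulted 1.0 step.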

Detecting Activity and Workflow Failures

The metrics temporal_cloud_v1_activity_fail_count and temporal_cloud_v1_workflow_failed_count together provide failure detection for Temporal applications. These metrics work in tandem to give you both granular component-level visibility and high-level workflow health insights.

Activity failure cascade

Unless your Activities use infinite retry policies, Activity failures can lead to Workflow failures:

Activity Failure --> Retry Logic --> More Activity Failures --> Workflow Decision --> Potential Workflow Failure

Activity failures are often recoverable and expected. Workflow failures represent terminal states that require immediate attention. A spike in Activity failures may precede Workflow failures. Temporal generally recommends designing Workflows so that they always succeed. If an Activity fails more times than its retry policy allows, we suggest having the Workflow handle the failure and notify a human, who can then take corrective action or at least be aware of the error.

Ratio-based monitoring

Failure conversion rate

Monitor the ratio of workflow failures to activity failures:

workflow_failure_rate = temporal_cloud_v1_workflow_failed_count / temporal_cloud_v1_activity_fail_count

What to watch for:

High ratio (greater than 0.1): Poor error handling. Failing Activities are causing Workflow failures.
Low ratio (less than 0.01): Good resilience. Activities fail but Workflows recover.
Sudden spikes: May indicate systemic issues.
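As a minimal sketch of the ratio and thresholds above, the following Python computes the failure conversion ratio and labels it. The function names and the "moderate" label for values between 0.01 and 0.1 are assumptions of this sketch, as is the handling of a zero Activity failure count; the inputs are assumed to be values of the two metrics over the same time window.

```python
def failure_conversion_ratio(workflow_failed: float, activity_failed: float) -> float:
    """Ratio of Workflow failures to Activity failures for one time window."""
    if activity_failed == 0:
        # Sketch-only choice: no failures at all is 0.0, while Workflow
        # failures without any Activity failures is treated as infinite.
        return 0.0 if workflow_failed == 0 else float("inf")
    return workflow_failed / activity_failed


def classify_conversion(ratio: float) -> str:
    """Label a ratio using the 0.1 and 0.01 thresholds listed above."""
    if ratio > 0.1:
        return "high"
    if ratio < 0.01:
        return "low"
    return "moderate"  # hypothetical label for the range in between
```

Detecting sudden spikes requires comparing the ratio across windows, which this sketch leaves out.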

Activity success rate

activity_success_rate = temporal_cloud_v1_activity_success_count / (temporal_cloud_v1_activity_success_count + temporal_cloud_v1_activity_fail_count)
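The formula above can be sketched in Python as follows. The function names are hypothetical, the inputs are assumed to be the success and fail metric values for the same time window, and the 0.95 default corresponds to the 95% target this section gives for most applications.

```python
from typing import Optional


def activity_success_rate(success: float, fail: float) -> Optional[float]:
    """Share of Activity executions that succeeded, or None with no executions."""
    total = success + fail
    if total == 0:
        return None  # nothing ran in this window, so no rate is defined
    return success / total


def meets_target(success: float, fail: float, target: float = 0.95) -> bool:
    """True when the success rate is above the target (or nothing ran)."""
    rate = activity_success_rate(success, fail)
    return rate is None or rate > target
```

Treating an empty window as meeting the target is a choice of this sketch; an alerting rule may prefer to flag missing data separately.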

Target: greater than 95% for most applications. A lower success rate can be a sign of system trouble. See also:

Monitor replication lag for Namespaces with High Availability features

Replication lag is the delay in transmitting Workflow updates and history events from the primary Namespace to the replica. Always check the replication lag metric before initiating a failover. A forced failover during a period of large replication lag is more likely to roll back Workflow progress.

Who owns the replication lag? Temporal owns replication lag.

What guarantees are available? There is no SLA for replication lag. Temporal recommends that customers do not trigger failovers except for testing or in emergencies. The four-nines SLA for High Availability features means that Temporal handles failovers and ensures high availability. Temporal also monitors replication lag. Customers who decide to trigger a failover should check this metric before moving forward.

If the lag is high, what should you do? We don't expect users to fail over. Contact Temporal support if you feel you have a pressing need.

Where can you read more? See operations and metrics for Namespaces with High Availability features.

Reference Metrics

Detecting Resource Exhaustion

Resource exhaustion happens when a single resource (a Namespace, Task Queue, or Workflow ID) receives a burst of operations larger than that resource can absorb in the moment. The Cloud metric temporal_cloud_v1_resource_exhausted_error_count increments and ResourceExhausted gRPC errors are returned to the client. SDKs retry these errors gracefully, so workflow progress is rarely impacted.

Persistent non-zero values are unexpected and indicate a hot resource. Use the operation label to identify which RPC is hitting the burst limit. For example, StartWorkflowExecution increments here when the same Workflow ID is started more than once per second.
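One way to separate persistent non-zero values from brief bursts is to look for consecutive non-zero samples per operation label. The Python below is a minimal sketch under stated assumptions: the function name, the input shape (time-ordered pairs of operation label and metric value), and the default of five consecutive steps are all hypothetical, not values Temporal recommends.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def persistent_hot_operations(
    samples: Iterable[Tuple[str, float]],
    min_nonzero_steps: int = 5,  # hypothetical cutoff for "persistent"
) -> List[str]:
    """Return operations with a long enough run of consecutive non-zero values.

    `samples` holds (operation, value) pairs in time order, where value is the
    resource-exhausted error metric for that operation at one step.
    """
    streak: Dict[str, int] = defaultdict(int)
    longest: Dict[str, int] = defaultdict(int)
    for operation, value in samples:
        # Extend the current run on a non-zero value, otherwise reset it.
        streak[operation] = streak[operation] + 1 if value > 0 else 0
        longest[operation] = max(longest[operation], streak[operation])
    return sorted(op for op, run in longest.items() if run >= min_nonzero_steps)
```

An operation that appears in the result is a candidate for the hot-resource investigation described above.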

Resource exhaustion is distinct from rate limiting against your account limits. For workloads that are throttled because they exceed their provisioned capacity, see Monitoring Trends Against Limits. Limits-driven throttling slows or stalls a workload, so it is generally the more important signal to monitor.

Monitoring Trends Against Limits

Tracking trends against your account limits is the most important throttling signal to monitor. Unlike Resource Exhaustion, which usually self-heals through retries, hitting a limit slows or stalls progress until the workload backs off or your capacity is increased.

The set of limit metrics provides a time series of values for limits. Use these metrics with their corresponding count metrics to monitor general trends against limits and set alerts when limits are exceeded. Use the corresponding throttle metrics to determine the severity of any active rate limiting.

Limit Metric	Count Metric	Throttle Metric
`temporal_cloud_v1_action_limit`	`temporal_cloud_v1_total_action_count`	`temporal_cloud_v1_total_action_throttled_count`
`temporal_cloud_v1_service_request_limit`	`temporal_cloud_v1_service_request_count`	`temporal_cloud_v1_service_request_throttled_count`
`temporal_cloud_v1_operations_limit`	`temporal_cloud_v1_operations_count`	`temporal_cloud_v1_operations_throttled_count`

On-demand envelope limits

For Namespaces using provisioned capacity, the following metrics show what your limits would be under on-demand mode. Compare these against your current provisioned limits to evaluate capacity mode choices:

On-Demand Envelope Metric	Equivalent Limit Metric
`temporal_cloud_v1_action_on_demand_envelope_limit`	`temporal_cloud_v1_action_limit`
`temporal_cloud_v1_operations_on_demand_envelope_limit`	`temporal_cloud_v1_operations_limit`
`temporal_cloud_v1_service_request_on_demand_envelope_limit`	`temporal_cloud_v1_service_request_limit`

For Namespaces already in on-demand mode, these metrics track the same values as their equivalent limit metrics.

The Grafana dashboard example includes a Usage & Quotas section with demo charts for these limit and count metrics.

The limit, count, and throttle metrics are all per-second rates, so they are directly comparable. Keep in mind that each count metric is a per-second rate averaged over one minute. For example, to get the total count of Actions in a given minute, multiply the metric by 60.

When setting alerts against limits, consider whether your workload is spiky or sensitive to throttling (for example, does latency matter?). If your workload is sensitive, consider alerting when temporal_cloud_v1_total_action_count reaches 50% of temporal_cloud_v1_action_limit. If it is not sensitive, consider alerting at 90% of the limit, or directly when throttling is detected, that is, when temporal_cloud_v1_total_action_throttled_count is greater than zero.

The same logic can be used to automatically scale Temporal Resource Units up or down as needed. Some workloads choose to exceed limits and accept throttling because they are not latency sensitive.
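The alerting guidance above can be sketched in Python. The function and parameter names are hypothetical; the inputs are assumed to be the current values of the Action count, limit, and throttle metrics, all as per-second rates. The 50% and 90% thresholds come from the guidance above, and alerting on any throttling for both workload types is a choice of this sketch.

```python
def per_minute_total(per_second_rate: float) -> float:
    """Convert a per-second rate averaged over one minute to a per-minute total."""
    return per_second_rate * 60


def should_alert(
    action_rate: float,      # value of the Action count metric (per second)
    action_limit: float,     # value of the Action limit metric (per second)
    throttled_rate: float,   # value of the Action throttle metric (per second)
    latency_sensitive: bool,
) -> bool:
    """Decide whether usage against the Action limit warrants an alert."""
    if throttled_rate > 0:
        return True  # throttling is already happening
    if action_limit <= 0:
        raise ValueError("action_limit must be positive")
    # Sensitive workloads alert at 50% of the limit, others at 90%.
    threshold = 0.5 if latency_sensitive else 0.9
    return action_rate / action_limit >= threshold
```

The same comparison applies to the service request and operations metrics by substituting their count, limit, and throttle values.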
