Image Factory Documentation

Robot SRE Incident Taxonomy And Policy Matrix

Purpose

Define the operational incident classes, evidence rules, severity model, and remediation policy boundaries for the Robot SRE / Ops Persona.

This document is the safety contract for the system. The AI layer may explain, summarize, and help choose actions, but it must stay inside the policy boundaries defined here.

Design Intent

This taxonomy is optimized for:

  • small OKE clusters
  • demo and staging environments first
  • recent real failure modes in Image Factory
  • gradual automation with strong guardrails

It is not intended to be exhaustive on day one. It should start narrow and expand only when each class has good evidence, clear rollback posture, and proven low false-positive behavior.

Domain Categories

The taxonomy should be organized first by operational domain, then by incident class. This keeps the system easier to reason about, easier to extend, and friendlier in the admin UI.

infrastructure

Cluster and cloud substrate concerns:

  • nodes
  • capacity
  • storage
  • OCI worker lifecycle
  • cluster scheduling

runtime_services

Shared in-cluster dependencies and control-plane helpers:

  • Redis
  • NATS
  • MinIO
  • internal registry
  • GLAuth
  • background workers and watcher processes

application_services

Image Factory user-facing and business-critical services:

  • backend API
  • frontend UI
  • docs service
  • dispatcher
  • notification worker
  • email worker
  • external tenant service

network_ingress

Traffic routing and reachability:

  • ingress
  • DNS
  • TLS
  • load balancer
  • service-to-service resolution when customer-visible

golden_signals

Cross-cutting service health and capacity indicators:

  • latency
  • traffic
  • errors
  • saturation
  • queue backlog and throughput when they behave like service health signals

identity_security

Authentication, authorization, and trust path concerns:

  • LDAP / identity provider connectivity
  • auth-provider drift
  • secret and certificate mismatches
  • security control disablement or auth-path failures

release_configuration

Intended vs actual system shape:

  • Helm drift
  • stale config in system tables
  • image source drift
  • missing pull secrets

operator_channels

How the robot reaches humans:

  • Telegram delivery
  • WhatsApp delivery
  • in-app admin notifications
  • approval channel reachability

Incident Lifecycle

Each incident moves through these states; observed through resolved is the normal path, while suppressed and escalated are branch states rather than later steps:

  1. observed
  2. triaged
  3. contained
  4. recovering
  5. resolved
  6. suppressed
  7. escalated
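
A minimal sketch of the lifecycle in code. The states come from the list above; the transition map is an illustrative assumption, since this document does not pin down exact transitions:

```python
from enum import Enum

class IncidentState(str, Enum):
    OBSERVED = "observed"
    TRIAGED = "triaged"
    CONTAINED = "contained"
    RECOVERING = "recovering"
    RESOLVED = "resolved"
    SUPPRESSED = "suppressed"
    ESCALATED = "escalated"

# Illustrative transition map (an assumption, not specified by this document).
# Suppression and escalation branch off the normal observed -> resolved path.
ALLOWED_TRANSITIONS = {
    IncidentState.OBSERVED:   {IncidentState.TRIAGED, IncidentState.SUPPRESSED},
    IncidentState.TRIAGED:    {IncidentState.CONTAINED, IncidentState.SUPPRESSED,
                               IncidentState.ESCALATED},
    IncidentState.CONTAINED:  {IncidentState.RECOVERING, IncidentState.ESCALATED},
    IncidentState.RECOVERING: {IncidentState.RESOLVED, IncidentState.ESCALATED},
    IncidentState.RESOLVED:   set(),
    IncidentState.SUPPRESSED: {IncidentState.OBSERVED},  # cooldown expiry re-opens
    IncidentState.ESCALATED:  {IncidentState.RECOVERING, IncidentState.RESOLVED},
}

def transition(current: IncidentState, nxt: IncidentState) -> IncidentState:
    """Move an incident to the next state, rejecting illegal jumps."""
    if nxt not in ALLOWED_TRANSITIONS[current]:
        raise ValueError(f"illegal transition {current.value} -> {nxt.value}")
    return nxt
```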

Severity Model

info

  • no customer-visible impact
  • advisory only
  • no automated write action needed

warning

  • localized degradation
  • blast radius is narrow or slow-moving
  • low-risk containment may be automatic

critical

  • customer-visible or imminent outage
  • core dependency unavailable
  • data loss, auth failure, or control-plane instability risk
  • severe golden-signal exhaustion such as sustained saturation, a runaway error rate, or a dramatic latency spike

Environment Modes

Policy must vary by environment:

demo

  • more automation allowed
  • faster containment acceptable
  • human approval still required for destructive actions

staging

  • moderate automation allowed
  • prefer approval for anything beyond low-risk cleanup

production

  • conservative mode
  • notify and recommend first
  • auto-remediation restricted to clearly safe idempotent actions

Action Classes

observe

Read-only evidence collection:

  • query Kubernetes
  • query OCI
  • query app runtime health
  • query release state
  • query system config

notify

  • open incident
  • send chat alert
  • send in-app/admin notification
  • update incident thread

contain

Low-risk actions to reduce churn or blast radius:

  • delete completed/failed pods
  • suspend allowlisted CronJobs
  • scale allowlisted noncritical workloads to zero
  • mark incident suppressed for cooldown

recover

Actions that alter service topology or runtime behavior:

  • rollout restart deployment
  • patch targeted system config
  • reconcile Helm release
  • resume paused jobs/workloads
  • cordon one node

disruptive

Higher-risk recovery actions:

  • drain node
  • reboot worker
  • replace worker
  • scale shared controllers
  • disable subsystems

Approval Policy

Auto-Allowed

  • all observe
  • all notify
  • low-risk contain actions on explicitly allowlisted resources

Approval Required

  • any recover
  • any disruptive
  • any config mutation
  • any Helm reconciliation
  • any OCI instance or node-pool operation

Human Only

  • delete persistent data
  • rotate secrets without a runbook
  • remove namespaces with tenant data
  • disable auth or security controls
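
A minimal sketch of tier resolution, assuming the action classes and tiers above; the is_allowlisted flag stands in for the allowlist lookup described later in this document:

```python
from enum import Enum

class ActionClass(str, Enum):
    OBSERVE = "observe"
    NOTIFY = "notify"
    CONTAIN = "contain"
    RECOVER = "recover"
    DISRUPTIVE = "disruptive"

class ApprovalTier(str, Enum):
    AUTO_ALLOWED = "auto_allowed"
    APPROVAL_REQUIRED = "approval_required"
    HUMAN_ONLY = "human_only"

# Action names are illustrative identifiers for the human-only list above.
HUMAN_ONLY_ACTIONS = {
    "delete_persistent_data",
    "rotate_secrets_without_runbook",
    "remove_tenant_namespace",
    "disable_auth_or_security_controls",
}

def approval_tier(action: str, action_class: ActionClass,
                  is_allowlisted: bool) -> ApprovalTier:
    """Map an action to its approval tier per the policy above."""
    if action in HUMAN_ONLY_ACTIONS:
        return ApprovalTier.HUMAN_ONLY          # hard floor, checked first
    if action_class in (ActionClass.OBSERVE, ActionClass.NOTIFY):
        return ApprovalTier.AUTO_ALLOWED
    if action_class is ActionClass.CONTAIN and is_allowlisted:
        return ApprovalTier.AUTO_ALLOWED        # low-risk contain, allowlist only
    return ApprovalTier.APPROVAL_REQUIRED       # recover, disruptive, everything else
```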

Cooldown Policy

Unless overridden per incident class:

  • observation polling: 60s
  • duplicate alert suppression: 15m
  • same remediation action retry: 15m
  • disruptive action retry: 60m
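
A sketch of cooldown bookkeeping with the defaults above. The in-memory store and key shapes are assumptions; a real implementation would persist state across restarts:

```python
import time

# Default windows in seconds, from the policy above.
COOLDOWNS = {
    "observation_poll": 60,
    "duplicate_alert": 15 * 60,
    "remediation_retry": 15 * 60,
    "disruptive_retry": 60 * 60,
}

_last_fired: dict[tuple[str, str], float] = {}  # (kind, key) -> last timestamp

def cooldown_ok(kind: str, key: str, now: float | None = None) -> bool:
    """Return True if (kind, key) is outside its cooldown window, and record it."""
    now = time.time() if now is None else now
    last = _last_fired.get((kind, key))
    if last is not None and now - last < COOLDOWNS[kind]:
        return False
    _last_fired[(kind, key)] = now
    return True

# e.g. cooldown_ok("remediation_retry", "node_disk_pressure:10.0.10.202")
```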

Evidence Confidence Bands

high

  • multiple corroborating signals
  • direct error from failing component
  • repeated signal across two or more checks

medium

  • one strong signal plus one weak signal
  • inferred root cause but not directly proven

low

  • single ambiguous signal
  • no corroborating data

The robot may auto-act only on high-confidence incidents in auto-allowed categories.
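
A minimal banding sketch, assuming signals have already been classified as strong or weak upstream:

```python
def confidence_band(strong_signals: int, weak_signals: int,
                    corroborated: bool) -> str:
    """Map counted evidence signals to a confidence band per the rules above."""
    if strong_signals >= 2 or (strong_signals >= 1 and corroborated):
        return "high"      # multiple or repeated corroborating signals
    if strong_signals == 1 and weak_signals >= 1:
        return "medium"    # one strong plus one weak signal
    return "low"           # single ambiguous signal, no corroboration

def may_auto_act(band: str, auto_allowed_category: bool) -> bool:
    # Auto-action is gated on high confidence AND an auto-allowed category.
    return band == "high" and auto_allowed_category
```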

Incident Taxonomy

Taxonomy Structure

Each incident type should be represented as:

  • domain
  • incident_type
  • display_name
  • description
  • default_severity
  • evidence_rules
  • policy_binding

Example:

  • domain: runtime_services
  • incident_type: runtime_dependency_outage
  • display_name: Runtime Dependency Outage
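
A sketch of this structure as a dataclass, populated with the example above; the evidence_rules and policy_binding shapes are placeholders, since their exact schemas are not fixed here:

```python
from dataclasses import dataclass, field

@dataclass
class IncidentTypeDef:
    domain: str
    incident_type: str
    display_name: str
    description: str = ""
    default_severity: str = "warning"
    evidence_rules: list[dict] = field(default_factory=list)  # placeholder shape
    policy_binding: dict = field(default_factory=dict)        # placeholder shape

# The example from above:
RUNTIME_DEPENDENCY_OUTAGE = IncidentTypeDef(
    domain="runtime_services",
    incident_type="runtime_dependency_outage",
    display_name="Runtime Dependency Outage",
)
```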

Domain: infrastructure

1. node_disk_pressure

Description

Node ephemeral storage pressure causing evictions, scheduling failures, and runtime instability.

Primary Signals

  • Kubernetes node condition DiskPressure=True
  • node taint node.kubernetes.io/disk-pressure
  • events:
    • EvictionThresholdMet
    • FreeDiskSpaceFailed
  • burst of evicted pods
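
A detection sketch for the first two primary signals using the official kubernetes Python client; kubeconfig-based access is assumed, and event collection is omitted:

```python
from kubernetes import client, config

def nodes_under_disk_pressure() -> list[str]:
    """Return nodes reporting DiskPressure=True or carrying the disk-pressure taint."""
    config.load_kube_config()  # use config.load_incluster_config() when in-cluster
    v1 = client.CoreV1Api()
    pressured = []
    for node in v1.list_node().items:
        conditions = {c.type: c.status for c in (node.status.conditions or [])}
        taints = {t.key for t in (node.spec.taints or [])}
        if conditions.get("DiskPressure") == "True" \
                or "node.kubernetes.io/disk-pressure" in taints:
            pressured.append(node.metadata.name)
    return pressured
```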

Secondary Signals

  • repeated image pulls
  • large backlog of completed/failed pods
  • image-pull retries after churn

Severity Rules

  • warning
    • one node in pressure for less than 10m while the cluster remains functional
  • critical
    • multiple nodes in pressure
    • or one node in pressure with customer-facing pod failures
    • or pressure persisting for 10m or longer

Root Cause Hypotheses

  • image churn
  • failed runtime garbage collection
  • noisy CronJobs / Tekton backlog
  • pod log buildup
  • hostPath growth

Auto-Allowed Actions

  • notify operator
  • collect top evidence
  • delete Succeeded and Failed pods
  • suspend allowlisted noisy CronJobs
  • scale allowlisted demo workloads down

Approval-Required Actions

  • cordon node
  • reboot worker
  • replace worker
  • scale shared controllers down

Cooldown

  • same containment set once per 15m
  • reboot/replace once per 60m per node

Rollback / Exit Criteria

  • DiskPressure=False
  • no disk-pressure taint
  • no new evictions for 10m

2. node_unreachable_or_notready

Description

A worker node becomes unreachable, goes NotReady, or stops reporting heartbeats.

Primary Signals

  • node Ready=False or Ready=Unknown
  • node taint node.kubernetes.io/unreachable
  • OCI instance state mismatch with Kubernetes node state

Auto-Allowed Actions

  • notify
  • collect node, kubelet, and OCI evidence

Approval-Required Actions

  • cordon
  • reboot worker
  • replace worker

Domain: release_configuration

3. registry_pull_failure

Description

Pods fail to pull images due to registry auth issues, missing tags, or external registry rate limits.

Primary Signals

  • ErrImagePull
  • ImagePullBackOff
  • error text containing:
    • toomanyrequests
    • unauthorized
    • manifest unknown
    • 403 Forbidden

Secondary Signals

  • runtime dependency outage
  • rollout stuck
  • fresh replacement nodes unable to pull the images they need

Severity Rules

  • warning
    • one noncritical deployment impacted
  • critical
    • core backend or shared runtime dependency blocked

Auto-Allowed Actions

  • notify
  • collect image and pull-secret evidence

Approval-Required Actions

  • patch image refs
  • patch pull secrets
  • Helm reconcile or rollback

Domain: runtime_services

4. runtime_dependency_outage

Description

One of the runtime dependencies (Redis, NATS, MinIO, internal registry) is unreachable or failing health probes.

Primary Signals

  • runtime dependency health check failed
  • direct connection failure or auth error
  • service pods down

Severity Rules

  • warning
    • noncritical dependency degraded
  • critical
    • core dependency down or cascading impact

Auto-Allowed Actions

  • notify
  • capture runtime dependency logs and pod status

Approval-Required Actions

  • rollout restart dependency
  • patch config
  • scale dependency

Domain: network_ingress

5. ingress_configuration_drift

Description

Ingress routes, DNS, or TLS configuration drifts or is misconfigured in a way that causes customer-facing impact.

Primary Signals

  • ingress errors or 404s for expected routes
  • certificate errors
  • DNS mismatch

Severity Rules

  • warning
    • partial route failure
  • critical
    • core routes unavailable or TLS failures

Auto-Allowed Actions

  • notify
  • collect ingress, cert, and DNS evidence

Approval-Required Actions

  • patch ingress
  • patch TLS/cert
  • reconcile Helm release

Domain: identity_security

6. identity_provider_unreachable

Description

LDAP / identity provider is unavailable or degraded.

Primary Signals

  • auth errors from identity provider
  • LDAP ping failures

Severity Rules

  • warning
    • transient failures
  • critical
    • sustained auth failure

Auto-Allowed Actions

  • notify
  • capture auth logs and LDAP evidence

Approval-Required Actions

  • patch LDAP config
  • restart backend

Domain: golden_signals

7. http_error_rate_spike

Description

The application server error rate spikes above a configured threshold.

Primary Signals

  • HTTP 5xx rate above threshold
  • sustained errors in access logs

Severity Rules

  • warning
    • brief spike
  • critical
    • sustained or rising errors

Auto-Allowed Actions

  • notify
  • collect recent error logs and HTTP signal evidence

Approval-Required Actions

  • restart backend
  • rollback release

Domain: golden_signals

8. database_connectivity_degraded

Description

The application cannot reliably reach the configured database.

Primary Signals

  • DB ping failure from runtime dependency watcher
  • app startup/connect errors
  • migration/bootstrap failures

Severity Rules

  • always critical

Auto-Allowed Actions

  • notify
  • gather DB health evidence

Approval-Required Actions

  • switch configured DB target
  • run migration/reconcile jobs
  • patch system configs referencing DB

Human-Only Actions

  • data restore
  • schema reset
  • PVC deletion

Domain: release_configuration

9. release_drift_or_partial_apply

Description

Helm or runtime state partially diverges from the intended release state.

Primary Signals

  • Helm release in failed or pending-* state
  • live images differ from desired values
  • missing imagePullSecrets or stale field ownership

Severity Rules

  • warning
    • workloads healthy but metadata inconsistent
  • critical
    • rollout blocked and workloads unhealthy

Auto-Allowed Actions

  • detect diff
  • notify operator with exact drift

Approval-Required Actions

  • run Helm reconcile
  • force ownership of conflicting fields
  • restart impacted workloads

Domain: runtime_services

10. background_job_buildup

Description

Completed, failed, or noisy background workloads build up and degrade small-cluster stability.

Primary Signals

  • large counts of Succeeded/Failed pods
  • repeating CronJobs with no user value
  • Tekton history growth

Severity Rules

  • warning
    • backlog exceeds threshold but no node impact yet
  • critical
    • backlog contributing to disk pressure or control-plane churn

Auto-Allowed Actions

  • delete completed/failed pods
  • suspend allowlisted CronJobs
  • post cleanup summary

Approval-Required Actions

  • pause shared controllers
  • bulk cleanup outside allowlist

Domain: operator_channels

11. chatops_delivery_failure

Description

The robot cannot reliably reach operators through configured channels.

Primary Signals

  • Telegram/WhatsApp send failures
  • repeated delivery retries

Severity Rules

  • warning
    • one channel unavailable
  • critical
    • all operator channels unavailable during active incident

Auto-Allowed Actions

  • fail over to alternate channel
  • surface alert in admin UI

Extensible Rule Model

The taxonomy should support two rule types:

Built-In Rules

Shipped by engineering and versioned in code:

  • canonical incident types
  • default evidence rules
  • default policies
  • default remediations

Operator-Defined Rules

Created from the admin UI:

  • additional signal thresholds
  • custom environment-specific incident variants
  • routing and notification rules
  • approval requirements and overrides
  • suppression windows

Operator-defined rules should extend built-in rules, not replace core safety constraints.

Operator-Defined Rule Boundaries

Operators should be able to add or change:

  • thresholds
  • severity escalation conditions
  • channel routing
  • cooldown values
  • enable/disable per incident type per environment
  • allowlisted resources inside pre-approved action families

Operators must not be able to do the following from the UI:

  • promote human-only action classes to auto actions
  • promote destructive actions to auto-allowed
  • embed secret values directly in incident rules
  • define unrestricted shell commands

Suggested Rule Schema

Each rule should contain:

  • id
  • name
  • enabled
  • domain
  • incident_type
  • environment_scope
  • signal_selector
  • threshold_expression
  • severity
  • notification_policy
  • allowed_action_profile
  • cooldown_seconds
  • suppression_schedule
  • owner
  • version

Admin UI Requirements For Rules

Add an operator-facing rules interface under:

  • Operations > Robot SRE > Rules

Minimum capabilities:

  • list built-in and custom rules
  • clone built-in rule into custom override
  • enable/disable rule per environment
  • edit thresholds and routing
  • preview policy outcome
  • test rule against recent evidence
  • audit who changed what and when

Policy Resolution Order

When evaluating a potential incident:

  1. built-in taxonomy definition
  2. built-in policy defaults
  3. environment policy overlay
  4. operator-defined rule override
  5. hard safety constraints

Hard safety constraints must always win.
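
A resolution sketch in that order. Each layer is modeled as a partial dict of policy fields, with hard safety constraints applied last so nothing can override them; the layer contents and keys are illustrative:

```python
HARD_SAFETY_CONSTRAINTS = {
    # These keys can never be relaxed by any other layer.
    "human_only_actions_auto": False,
    "destructive_actions_auto": False,
}

def resolve_policy(builtin_taxonomy: dict, builtin_defaults: dict,
                   environment_overlay: dict, operator_override: dict) -> dict:
    """Merge policy layers in precedence order; safety constraints win last."""
    policy: dict = {}
    for layer in (builtin_taxonomy, builtin_defaults,
                  environment_overlay, operator_override):
        policy.update(layer)                # later layers override earlier ones
    policy.update(HARD_SAFETY_CONSTRAINTS)  # applied last, cannot be overridden
    return policy
```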

Expanded V1 Recommendation

V1 should still implement a narrow set of incident classes, but the taxonomy should be structured by domain from the start so we can add:

  • more infrastructure incidents
  • application-service incidents
  • security incidents
  • operator-defined rules

without redesigning the system later.

Policy Matrix

| Incident Class | Auto Observe | Auto Notify | Auto Contain | Approval Required | Human Only |
|---|---|---|---|---|---|
| node_disk_pressure | yes | yes | yes, allowlist only | cordon, reboot, replace, shared scale-down | persistent data deletion |
| registry_pull_failure | yes | yes | no | image patch, pull-secret patch, release reconcile | registry credential rotation outside runbook |
| runtime_dependency_outage | yes | yes | limited restart in demo/staging | shared dependency restart, config patch, release reconcile | none by default |
| ingress_configuration_drift | yes | yes | no | ingress enable/patch, TLS patch, release reconcile | certificate/key replacement outside runbook |
| identity_provider_unreachable | yes | yes | no | patch LDAP config, backend restart | auth disablement |
| database_connectivity_degraded | yes | yes | no | DB target patch, recovery workflow start | restore/reset/delete data |
| release_drift_or_partial_apply | yes | yes | no | Helm reconcile, restart workloads | uninstall/delete release resources |
| background_job_buildup | yes | yes | yes, allowlist only | shared controller pause, bulk cleanup | namespace/data deletion |
| chatops_delivery_failure | yes | yes | alternate-channel retry | channel credential/config changes | none |

Allowlist Proposal For V1

Auto-Contain Workloads

  • headlamp
  • tekton-pipelines deployments
  • tekton-pipelines-resolvers deployments
  • image-factory demo app deployments
  • trivy-db-warmup

Auto-Contain Actions

  • delete Succeeded and Failed pods cluster-wide
  • suspend trivy-db-warmup
  • scale image-factory app workloads to zero in demo mode
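
A sketch of the first auto-contain action using the kubernetes Python client. The dry_run default is an added safety assumption, not something this document mandates:

```python
from kubernetes import client, config

def delete_finished_pods(dry_run: bool = True) -> list[str]:
    """Delete Succeeded/Failed pods cluster-wide; report what was targeted."""
    config.load_kube_config()
    v1 = client.CoreV1Api()
    targeted = []
    for phase in ("Succeeded", "Failed"):
        pods = v1.list_pod_for_all_namespaces(
            field_selector=f"status.phase={phase}")
        for pod in pods.items:
            targeted.append(f"{pod.metadata.namespace}/{pod.metadata.name}")
            if not dry_run:
                v1.delete_namespaced_pod(pod.metadata.name,
                                         pod.metadata.namespace)
    return targeted
```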

Never Auto-Contain

  • Supabase config
  • ingress controller
  • cert-manager
  • namespace deletion
  • PVC deletion
  • database reset

Required Incident Evidence Schema

Each incident record should capture:

  • incident_type
  • severity
  • confidence
  • signal_sources
  • resource_scope
  • environment
  • evidence_summary
  • raw_evidence_refs
  • recommended_actions
  • executed_actions
  • approval_state
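
A populated example record under this schema, with illustrative values drawn from the node_disk_pressure class:

```python
# All values below are illustrative, including the evidence reference format.
example_incident = {
    "incident_type": "node_disk_pressure",
    "severity": "critical",
    "confidence": "high",
    "signal_sources": ["kubernetes_node_conditions", "kubernetes_events"],
    "resource_scope": {"kind": "Node", "name": "10.0.10.202"},
    "environment": "staging",
    "evidence_summary": "DiskPressure=True for 17m; burst of evicted pods",
    "raw_evidence_refs": ["evidence-store://inc-0042/node-describe"],
    "recommended_actions": ["delete_finished_pods", "suspend_noisy_cronjobs"],
    "executed_actions": ["delete_finished_pods"],
    "approval_state": "pending",
}
```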

Approval Prompt Requirements

When the robot asks for approval, the prompt must include:

  • action
  • target
  • why now
  • expected impact
  • rollback or next fallback
  • expiry time

Example:

Robot SRE requests approval: reboot worker 10.0.10.202. Reason: disk pressure persists after cleanup for 17m. Expected impact: pods on that node will reschedule. Rollback/fallback: replace worker from node pool if reboot does not clear pressure. Approval expires in 10m.
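
A sketch that renders the required fields into exactly that prompt shape; the field and function names are assumptions:

```python
from dataclasses import dataclass

@dataclass
class ApprovalRequest:
    action: str
    target: str
    why_now: str
    expected_impact: str
    rollback_or_fallback: str
    expires_in_minutes: int

def render_prompt(req: ApprovalRequest) -> str:
    """Render an approval prompt containing every required field."""
    return (f"Robot SRE requests approval: {req.action} {req.target}. "
            f"Reason: {req.why_now}. "
            f"Expected impact: {req.expected_impact}. "
            f"Rollback/fallback: {req.rollback_or_fallback}. "
            f"Approval expires in {req.expires_in_minutes}m.")
```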

Metrics To Add

  • ops_incidents_open_total
  • ops_incidents_resolved_total
  • ops_incidents_by_type
  • ops_auto_remediations_total
  • ops_auto_remediations_failed_total
  • ops_approvals_requested_total
  • ops_approvals_granted_total
  • ops_approvals_denied_total
  • ops_policy_suppressions_total
  • ops_chat_delivery_failures_total
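
A registration sketch using the standard prometheus_client library. Here ops_incidents_by_type is folded into an incident_type label rather than kept as a separate metric, and the label sets are assumptions:

```python
from prometheus_client import Counter

OPS_INCIDENTS_OPEN = Counter(
    "ops_incidents_open_total", "Incidents opened",
    ["incident_type", "severity"])  # incident_type label covers by-type counts
OPS_INCIDENTS_RESOLVED = Counter(
    "ops_incidents_resolved_total", "Incidents resolved", ["incident_type"])
OPS_AUTO_REMEDIATIONS = Counter(
    "ops_auto_remediations_total", "Auto-remediations executed", ["action"])
OPS_AUTO_REMEDIATIONS_FAILED = Counter(
    "ops_auto_remediations_failed_total", "Auto-remediations that failed",
    ["action"])
OPS_APPROVALS_REQUESTED = Counter(
    "ops_approvals_requested_total", "Approval prompts sent")
OPS_APPROVALS_GRANTED = Counter(
    "ops_approvals_granted_total", "Approvals granted")
OPS_APPROVALS_DENIED = Counter(
    "ops_approvals_denied_total", "Approvals denied")
OPS_POLICY_SUPPRESSIONS = Counter(
    "ops_policy_suppressions_total", "Alerts suppressed by policy")
OPS_CHAT_DELIVERY_FAILURES = Counter(
    "ops_chat_delivery_failures_total", "Operator channel delivery failures",
    ["channel"])
```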

MVP Recommendation

Implement first:

  1. node_disk_pressure
  2. runtime_dependency_outage
  3. registry_pull_failure
  4. identity_provider_unreachable
  5. release_drift_or_partial_apply

These cover most of the recent real operational pain while staying small enough to validate safely.

Next Design Step

Use this taxonomy to define:

  • data model and APIs
  • operator approval workflow
  • MCP tool contracts
  • Telegram conversation flows