Robot SRE Incident Taxonomy And Policy Matrix
Purpose
Define the operational incident classes, evidence rules, severity model, and remediation policy boundaries for the Robot SRE / Ops Persona.
This document is the safety contract for the system. The AI layer may explain, summarize, and help choose actions, but it must stay inside the policy boundaries defined here.
Design Intent
This taxonomy is optimized for:
- small OKE clusters
- demo and staging environments first
- recent real failure modes in Image Factory
- gradual automation with strong guardrails
It is not intended to be exhaustive on day one. It should start narrow and expand only when each class has good evidence, clear rollback posture, and proven low false-positive behavior.
Domain Categories
The taxonomy should be organized first by operational domain, then by incident class. This keeps the system easier to reason about, easier to extend, and friendlier in the admin UI.
infrastructure
Cluster and cloud substrate concerns:
- nodes
- capacity
- storage
- OCI worker lifecycle
- cluster scheduling
runtime_services
Shared in-cluster dependencies and control-plane helpers:
- Redis
- NATS
- MinIO
- internal registry
- GLAuth
- background workers and watcher processes
application_services
Image Factory user-facing and business-critical services:
- backend API
- frontend UI
- docs service
- dispatcher
- notification worker
- email worker
- external tenant service
network_ingress
Traffic routing and reachability:
- ingress
- DNS
- TLS
- load balancer
- service-to-service resolution when customer-visible
golden_signals
Cross-cutting service health and capacity indicators:
- latency
- traffic
- errors
- saturation
- queue backlog and throughput when they behave like service health signals
identity_security
Authentication, authorization, and trust path concerns:
- LDAP / identity provider connectivity
- auth-provider drift
- secret and certificate mismatches
- security control disablement or auth-path failures
release_configuration
Intended vs actual system shape:
- Helm drift
- stale config in system tables
- image source drift
- missing pull secrets
operator_channels
How the robot reaches humans:
- Telegram delivery
- WhatsApp delivery
- in-app admin notifications
- approval channel reachability
Incident Lifecycle
Each incident moves through these states:
- `observed`
- `triaged`
- `contained`
- `recovering`
- `resolved`
- `suppressed`
- `escalated`
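As a sketch only, the lifecycle could be encoded as an enum with an explicit transition map. This is a minimal Python illustration; the specific transitions shown (for example, that suppression is reachable from any active state) are assumptions, not settled design.

```python
from enum import Enum

class IncidentState(str, Enum):
    OBSERVED = "observed"
    TRIAGED = "triaged"
    CONTAINED = "contained"
    RECOVERING = "recovering"
    RESOLVED = "resolved"
    SUPPRESSED = "suppressed"
    ESCALATED = "escalated"

# Illustrative forward transitions; the real state machine may differ.
ALLOWED_TRANSITIONS = {
    IncidentState.OBSERVED: {IncidentState.TRIAGED, IncidentState.SUPPRESSED},
    IncidentState.TRIAGED: {IncidentState.CONTAINED, IncidentState.SUPPRESSED, IncidentState.ESCALATED},
    IncidentState.CONTAINED: {IncidentState.RECOVERING, IncidentState.ESCALATED},
    IncidentState.RECOVERING: {IncidentState.RESOLVED, IncidentState.ESCALATED},
}

def can_transition(src: IncidentState, dst: IncidentState) -> bool:
    """Check whether an incident may move from src to dst."""
    return dst in ALLOWED_TRANSITIONS.get(src, set())
```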
Severity Model
info
- no customer-visible impact
- advisory only
- no automated write action needed
warning
- localized degradation
- blast radius is narrow or slow-moving
- low-risk containment may be automatic
critical
- customer-visible or imminent outage
- core dependency unavailable
- data loss, auth failure, or control-plane instability risk
- severe golden-signal exhaustion such as sustained saturation, runaway error rate, or severe latency degradation
Environment Modes
Policy must vary by environment:
demo
- more automation allowed
- faster containment acceptable
- human approval still required for destructive actions
staging
- moderate automation allowed
- prefer approval for anything beyond low-risk cleanup
production
- conservative mode
- notify and recommend first
- auto-remediation restricted to clearly safe idempotent actions
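A compact sketch of how these modes might be represented, assuming a Python implementation; the keys and flag names are illustrative, not a confirmed configuration format.

```python
# Illustrative per-environment automation posture.
ENVIRONMENT_MODES = {
    "demo": {
        "auto_contain": True,
        "destructive_needs_human": True,  # approval still required for destructive actions
    },
    "staging": {
        "auto_contain": True,             # low-risk cleanup only
        "approval_beyond_cleanup": True,
    },
    "production": {
        "auto_contain": False,            # notify and recommend first
        "auto_remediation": "safe_idempotent_only",
    },
}
```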
Action Classes
observe
Read-only evidence collection:
- query Kubernetes
- query OCI
- query app runtime health
- query release state
- query system config
notify
- open incident
- send chat alert
- send in-app/admin notification
- update incident thread
contain
Low-risk actions to reduce churn or blast radius:
- delete completed/failed pods
- suspend allowlisted CronJobs
- scale allowlisted noncritical workloads to zero
- mark incident suppressed for cooldown
recover
Actions that alter service topology or runtime behavior:
- rollout restart deployment
- patch targeted system config
- reconcile Helm release
- resume paused jobs/workloads
- cordon one node
disruptive
Higher-risk recovery actions:
- drain node
- reboot worker
- replace worker
- scale shared controllers
- disable subsystems
Approval Policy
Auto-Allowed
- all `observe` actions
- all `notify` actions
- low-risk `contain` actions on explicitly allowlisted resources
Approval Required
- any `recover` action
- any `disruptive` action
- any config mutation
- any Helm reconciliation
- any OCI instance or node-pool operation
Human Only
- delete persistent data
- rotate secrets without runbook
- remove namespaces with tenant data
- disable auth or security controls
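A minimal sketch of how this policy could be enforced in code; the identifiers in the human-only set and the helper signature are assumptions for illustration.

```python
from enum import Enum

class ActionClass(str, Enum):
    OBSERVE = "observe"
    NOTIFY = "notify"
    CONTAIN = "contain"
    RECOVER = "recover"
    DISRUPTIVE = "disruptive"

class Approval(str, Enum):
    AUTO = "auto_allowed"
    REQUIRED = "approval_required"
    HUMAN_ONLY = "human_only"

# Hypothetical identifiers for the human-only actions listed above.
HUMAN_ONLY_ACTIONS = {
    "delete_persistent_data",
    "rotate_secrets_without_runbook",
    "remove_tenant_namespace",
    "disable_auth_controls",
}

def required_approval(action: str, action_class: ActionClass, allowlisted: bool) -> Approval:
    """Resolve the approval level for a proposed action."""
    if action in HUMAN_ONLY_ACTIONS:
        return Approval.HUMAN_ONLY
    if action_class in (ActionClass.OBSERVE, ActionClass.NOTIFY):
        return Approval.AUTO
    if action_class is ActionClass.CONTAIN and allowlisted:
        return Approval.AUTO
    # recover, disruptive, and non-allowlisted contain actions all need a human.
    return Approval.REQUIRED
```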
Cooldown Policy
Unless overridden per incident class:
- observation polling: 60s
- duplicate alert suppression: 15m
- same remediation action retry: 15m
- disruptive action retry: 60m
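One way to enforce these windows, sketched in Python; the keying scheme (a cooldown kind plus a target identifier) is an assumption.

```python
import time

# Default cooldown windows in seconds, matching the policy above.
COOLDOWNS = {
    "observation_poll": 60,
    "duplicate_alert": 15 * 60,
    "remediation_retry": 15 * 60,
    "disruptive_retry": 60 * 60,
}

_last_attempt: dict[tuple[str, str], float] = {}

def cooldown_ok(kind: str, target: str, now: float | None = None) -> bool:
    """Return True if (kind, target) is outside its cooldown window, recording the attempt."""
    now = time.time() if now is None else now
    last = _last_attempt.get((kind, target))
    if last is not None and now - last < COOLDOWNS[kind]:
        return False
    _last_attempt[(kind, target)] = now
    return True
```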
Evidence Confidence Bands
high
- multiple corroborating signals
- direct error from failing component
- repeated signal across two or more checks
medium
- one strong signal plus one weak signal
- inferred root cause but not directly proven
low
- single ambiguous signal
- no corroborating data
The robot may auto-act only on high-confidence incidents in auto-allowed categories.
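A sketch of the gate this implies, with an illustrative mapping from signal counts to bands; the exact scoring is an assumption.

```python
def confidence_band(strong_signals: int, weak_signals: int) -> str:
    """Map corroborating signal counts to the bands above (illustrative thresholds)."""
    if strong_signals >= 2:
        return "high"
    if strong_signals == 1 and weak_signals >= 1:
        return "medium"
    return "low"

def may_auto_act(band: str, approval_level: str) -> bool:
    # Auto-action requires BOTH high confidence and an auto-allowed action class.
    return band == "high" and approval_level == "auto_allowed"
```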
Incident Taxonomy
Taxonomy Structure
Each incident should be represented as:
- `domain`
- `incident_type`
- `display_name`
- `description`
- `default_severity`
- `evidence_rules`
- `policy_binding`
Example:
```yaml
domain: runtime_services
incident_type: runtime_dependency_outage
display_name: Runtime Dependency Outage
```
Domain: infrastructure
1. node_disk_pressure
Description
Node ephemeral storage pressure causing evictions, scheduling failures, and runtime instability.
Primary Signals
- Kubernetes node condition `DiskPressure=True`
- node taint `node.kubernetes.io/disk-pressure`
- events: `EvictionThresholdMet`, `FreeDiskSpaceFailed`
- burst of evicted pods
Secondary Signals
- repeated image pulls
- large backlog of completed/failed pods
- image-pull retries after churn
Severity Rules
- `warning`: one node in pressure for <10m and the cluster remains functional
- `critical`: multiple nodes in pressure, or one node in pressure with customer-facing pod failures, or pressure persisting >10m
Root Cause Hypotheses
- image churn
- failed runtime garbage collection
- noisy CronJobs / Tekton backlog
- pod log buildup
- hostPath growth
Auto-Allowed Actions
- notify operator
- collect top evidence
- delete `Succeeded` and `Failed` pods
- suspend allowlisted noisy CronJobs
- scale allowlisted demo workloads down
Approval-Required Actions
- cordon node
- reboot worker
- replace worker
- scale shared controllers down
Cooldown
- same containment set once per 15m
- reboot/replace once per 60m per node
Rollback / Exit Criteria
- `DiskPressure=False`
- no disk-pressure taint
- no new evictions for 10m
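For the observe side of this class, a minimal sketch using the official `kubernetes` Python client to list nodes reporting `DiskPressure=True`; the helper name is an invention.

```python
from kubernetes import client, config

def nodes_under_disk_pressure() -> list[str]:
    """Return names of nodes whose DiskPressure condition is True."""
    config.load_kube_config()  # use config.load_incluster_config() when running in-cluster
    v1 = client.CoreV1Api()
    pressured = []
    for node in v1.list_node().items:
        for cond in node.status.conditions or []:
            if cond.type == "DiskPressure" and cond.status == "True":
                pressured.append(node.metadata.name)
    return pressured
```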
1.1 node_unreachable_or_notready
Description
Worker becomes unreachable, NotReady, or stops reporting heartbeats.
Primary Signals
- node `Ready=False` or `Ready=Unknown`
- taint `node.kubernetes.io/unreachable`
- OCI instance state mismatch with Kubernetes node state
Auto-Allowed Actions
- notify
- collect node, kubelet, and OCI evidence
Approval-Required Actions
- cordon
- reboot worker
- replace worker
Domain: release_configuration
2. registry_pull_failure
Description
Pods fail to pull images due to registry auth issues, missing tags, or external registry rate limits.
Primary Signals
- `ErrImagePull` or `ImagePullBackOff` waiting reasons
- error text containing `toomanyrequests`, `unauthorized`, `manifest unknown`, or `403 Forbidden`
Secondary Signals
- runtime dependency outage
- rollout stuck
- fresh replacement nodes unable to hydrate
Severity Rules
- `warning`: one noncritical deployment impacted
- `critical`: core backend or shared runtime dependency blocked
Auto-Allowed Actions
- notify
- collect image and pull-secret evidence
Approval-Required Actions
- patch image refs
- patch pull secrets
- Helm reconcile or rollback
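A read-only evidence sketch for this class, again using the `kubernetes` Python client; the return shape is an assumption.

```python
from kubernetes import client, config

PULL_ERROR_REASONS = {"ErrImagePull", "ImagePullBackOff"}

def pods_with_pull_failures() -> list[tuple[str, str, str]]:
    """Return (namespace, pod, reason) for containers stuck pulling images."""
    config.load_kube_config()
    v1 = client.CoreV1Api()
    failing = []
    for pod in v1.list_pod_for_all_namespaces().items:
        for cs in pod.status.container_statuses or []:
            waiting = cs.state.waiting if cs.state else None
            if waiting and waiting.reason in PULL_ERROR_REASONS:
                failing.append((pod.metadata.namespace, pod.metadata.name, waiting.reason))
    return failing
```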
Domain: runtime_services
3. runtime_dependency_outage
Description
One of the runtime dependencies (Redis, NATS, MinIO, internal registry) is unreachable or failing health probes.
Primary Signals
- runtime dependency health check failed
- direct connection failure or auth error
- service pods down
Severity Rules
- `warning`: noncritical dependency degraded
- `critical`: core dependency down or cascading impact
Auto-Allowed Actions
- notify
- capture runtime dependency logs and pod status
Approval-Required Actions
- rollout restart dependency
- patch config
- scale dependency
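The health-check signal can start as a plain TCP reachability probe per dependency endpoint. A crude sketch follows; real probes would use each dependency's native check (e.g. Redis PING), and the endpoint map is hypothetical.

```python
import socket

def tcp_reachable(host: str, port: int, timeout: float = 2.0) -> bool:
    """Cheap reachability probe for a runtime dependency endpoint."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

# Hypothetical in-cluster service endpoints for the dependencies named above.
DEPENDENCIES = {"redis": ("redis", 6379), "nats": ("nats", 4222), "minio": ("minio", 9000)}
unhealthy = [name for name, (host, port) in DEPENDENCIES.items() if not tcp_reachable(host, port)]
```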
Domain: network_ingress
4. ingress_configuration_drift
Description
Ingress routes, DNS, or TLS configuration drift or misconfiguration causes customer-facing impact.
Primary Signals
- ingress errors or 404s for expected routes
- certificate errors
- DNS mismatch
Severity Rules
- `warning`: partial route failure
- `critical`: core routes unavailable or TLS failures
Auto-Allowed Actions
- notify
- collect ingress, cert, and DNS evidence
Approval-Required Actions
- patch ingress
- patch TLS/cert
- reconcile Helm release
Domain: identity_security
5. identity_provider_unreachable
Description
LDAP / identity provider is unavailable or degraded.
Primary Signals
- auth errors from identity provider
- LDAP ping failures
Severity Rules
- `warning`: transient failures
- `critical`: sustained auth failure
Auto-Allowed Actions
- notify
- capture auth logs and LDAP evidence
Approval-Required Actions
- patch LDAP config
- restart backend
Domain: golden_signals
6. http_error_rate_spike
Description
The application server error rate spikes above a configured threshold.
Primary Signals
- HTTP 5xx rate above threshold
- sustained errors in access logs
Severity Rules
- `warning`: brief spike
- `critical`: sustained or rising errors
Auto-Allowed Actions
- notify
- collect recent error logs and HTTP signal evidence
Approval-Required Actions
- restart backend
- rollback release
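A sliding-window sketch of the threshold check; the window length and 5% threshold are placeholders, not tuned values.

```python
import time
from collections import deque

class ErrorRateMonitor:
    """Track HTTP statuses over a sliding window and flag 5xx spikes."""

    def __init__(self, window_seconds: int = 300, threshold: float = 0.05):
        self.window = window_seconds
        self.threshold = threshold
        self.samples: deque[tuple[float, bool]] = deque()  # (timestamp, is_5xx)

    def record(self, status_code: int, now: float | None = None) -> None:
        now = time.time() if now is None else now
        self.samples.append((now, status_code >= 500))
        # Drop samples that have aged out of the window.
        while self.samples and now - self.samples[0][0] > self.window:
            self.samples.popleft()

    def spiking(self) -> bool:
        if not self.samples:
            return False
        errors = sum(1 for _, bad in self.samples if bad)
        return errors / len(self.samples) > self.threshold
```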
Domain: golden_signals
7. database_connectivity_degraded
Description
The application cannot reliably reach the configured database.
Primary Signals
- DB ping failure from runtime dependency watcher
- app startup/connect errors
- migration/bootstrap failures
Severity Rules
- always `critical`
Auto-Allowed Actions
- notify
- gather DB health evidence
Approval-Required Actions
- switch configured DB target
- run migration/reconcile jobs
- patch system configs referencing DB
Human-Only Actions
- data restore
- schema reset
- PVC deletion
Domain: release_configuration
8. release_drift_or_partial_apply
Description
Helm or runtime state partially diverges from intended release state.
Primary Signals
- Helm release status `failed` or `pending-*`
- live images differ from desired values
- missing `imagePullSecrets` or stale field ownership
Severity Rules
- `warning`: workloads healthy but metadata inconsistent
- `critical`: rollout blocked and workloads unhealthy
Auto-Allowed Actions
- detect diff
- notify operator with exact drift
Approval-Required Actions
- run Helm reconcile
- force conflict ownership
- restart impacted workloads
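The "live images differ from desired values" signal could be detected with a diff like the following sketch; the shape of the desired-state mapping is an assumption, not a confirmed contract.

```python
from kubernetes import client, config

def image_drift(namespace: str, desired: dict[str, str]) -> dict[str, tuple[str, str]]:
    """Return {"deployment/container": (desired_image, live_image)} for mismatches."""
    config.load_kube_config()
    apps = client.AppsV1Api()
    drift = {}
    for dep in apps.list_namespaced_deployment(namespace).items:
        for c in dep.spec.template.spec.containers:
            key = f"{dep.metadata.name}/{c.name}"
            want = desired.get(key)
            if want and want != c.image:
                drift[key] = (want, c.image)
    return drift
```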
Domain: runtime_services
9. background_job_buildup
Description
Completed, failed, or noisy background workloads build up and degrade small-cluster stability.
Primary Signals
- large counts of `Succeeded`/`Failed` pods
- repeating CronJobs with no user value
- Tekton history growth
- Tekton history growth
Severity Rules
- `warning`: backlog exceeds threshold but no node impact yet
- `critical`: backlog contributing to disk pressure or control-plane churn
Auto-Allowed Actions
- delete completed/failed pods
- suspend allowlisted CronJobs
- post cleanup summary
Approval-Required Actions
- pause shared controllers
- bulk cleanup outside allowlist
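A sketch of the auto-allowed cleanup, defaulting to a dry run so the containment stays observable before it acts.

```python
from kubernetes import client, config

def delete_finished_pods(dry_run: bool = True) -> list[str]:
    """Delete Succeeded/Failed pods cluster-wide; dry_run=True only reports them."""
    config.load_kube_config()
    v1 = client.CoreV1Api()
    deleted = []
    for phase in ("Succeeded", "Failed"):
        pods = v1.list_pod_for_all_namespaces(field_selector=f"status.phase={phase}")
        for pod in pods.items:
            deleted.append(f"{pod.metadata.namespace}/{pod.metadata.name}")
            if not dry_run:
                v1.delete_namespaced_pod(pod.metadata.name, pod.metadata.namespace)
    return deleted
```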
Domain: operator_channels
10. chatops_delivery_failure
Description
The robot cannot reliably reach operators through configured channels.
Primary Signals
- Telegram/WhatsApp send failures
- repeated delivery retries
Severity Rules
- `warning`: one channel unavailable
- `critical`: all operator channels unavailable during active incident
Auto-Allowed Actions
- fail over to alternate channel
- surface alert in admin UI
Extensible Rule Model
The taxonomy should support two rule types:
Built-In Rules
Shipped by engineering and versioned in code:
- canonical incident types
- default evidence rules
- default policies
- default remediations
Operator-Defined Rules
Created from the admin UI:
- additional signal thresholds
- custom environment-specific incident variants
- routing and notification rules
- approval requirements and overrides
- suppression windows
Operator-defined rules should extend built-in rules, not replace core safety constraints.
Operator-Defined Rule Boundaries
Operators should be able to add or change:
- thresholds
- severity escalation conditions
- channel routing
- cooldown values
- enable/disable per incident type per environment
- allowlisted resources inside pre-approved action families
Operators should not be able to change from the UI:
- human-only action classes into auto actions
- destructive actions into auto-allowed
- secret values directly in incident rules
- unrestricted shell command definitions
Suggested Rule Schema
Each rule should contain:
- `id`
- `name`
- `enabled`
- `domain`
- `incident_type`
- `environment_scope`
- `signal_selector`
- `threshold_expression`
- `severity`
- `notification_policy`
- `allowed_action_profile`
- `cooldown_seconds`
- `suppression_schedule`
- `owner`
- `version`
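A hypothetical instance of this schema, written as a Python literal; every value below is invented for illustration.

```python
example_rule = {
    "id": "rule-disk-pressure-demo-01",
    "name": "Demo disk pressure: faster containment",
    "enabled": True,
    "domain": "infrastructure",
    "incident_type": "node_disk_pressure",
    "environment_scope": ["demo"],
    "signal_selector": "node.condition.DiskPressure",
    "threshold_expression": "duration > 5m",
    "severity": "warning",
    "notification_policy": {"channels": ["telegram"], "thread": "ops"},
    "allowed_action_profile": "contain_allowlisted",
    "cooldown_seconds": 900,
    "suppression_schedule": None,
    "owner": "ops-team",
    "version": 1,
}
```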
Admin UI Requirements For Rules
Add an operator-facing rules interface under:
Operations > Robot SRE > Rules
Minimum capabilities:
- list built-in and custom rules
- clone built-in rule into custom override
- enable/disable rule per environment
- edit thresholds and routing
- preview policy outcome
- test rule against recent evidence
- audit who changed what and when
Policy Resolution Order
When evaluating a potential incident:
- built-in taxonomy definition
- built-in policy defaults
- environment policy overlay
- operator-defined rule override
- hard safety constraints
Hard safety constraints must always win.
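A minimal sketch of that resolution order as a layered merge, with hard constraints applied last so nothing can override them; the flat-dict policy shape is an assumption.

```python
def resolve_policy(*layers: dict, hard_constraints: dict) -> dict:
    """Merge policy layers lowest-to-highest precedence; hard constraints always win.

    Example: resolve_policy(builtin_taxonomy, builtin_defaults, env_overlay,
                            operator_rules, hard_constraints=HARD_CONSTRAINTS)
    """
    resolved: dict = {}
    for layer in layers:
        resolved.update(layer)
    resolved.update(hard_constraints)  # applied last: operator rules cannot relax these
    return resolved
```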
Expanded V1 Recommendation
V1 should still implement a narrow set of incident classes, but the taxonomy should be structured by domain from the start so we can add:
- more infrastructure incidents
- application-service incidents
- security incidents
- operator-defined rules
without redesigning the system later.
Policy Matrix
| Incident Class | Auto Observe | Auto Notify | Auto Contain | Approval Required | Human Only |
|---|---|---|---|---|---|
| `node_disk_pressure` | yes | yes | yes, allowlist only | cordon, reboot, replace, shared scale-down | persistent data deletion |
| `registry_pull_failure` | yes | yes | no | image patch, pull-secret patch, release reconcile | registry credential rotation outside runbook |
| `runtime_dependency_outage` | yes | yes | limited restart in demo/staging | shared dependency restart, config patch, release reconcile | none by default |
| `ingress_configuration_drift` | yes | yes | no | ingress enable/patch, TLS patch, release reconcile | certificate/key replacement outside runbook |
| `identity_provider_unreachable` | yes | yes | no | patch LDAP config, backend restart | auth disablement |
| `database_connectivity_degraded` | yes | yes | no | DB target patch, recovery workflow start | restore/reset/delete data |
| `release_drift_or_partial_apply` | yes | yes | no | Helm reconcile, restart workloads | uninstall/delete release resources |
| `background_job_buildup` | yes | yes | yes, allowlist only | shared controller pause, bulk cleanup | namespace/data deletion |
| `chatops_delivery_failure` | yes | yes | alternate-channel retry | channel credential/config changes | none |
Allowlist Proposal For V1
Auto-Contain Workloads
- `headlamp`
- `tekton-pipelines` deployments
- `tekton-pipelines-resolvers` deployments
- `image-factory` demo app deployments
- `trivy-db-warmup`
Auto-Contain Actions
- delete `Succeeded` and `Failed` pods cluster-wide
- suspend `trivy-db-warmup`
- scale `image-factory` app workloads to zero in demo mode
Never Auto-Contain
- Supabase config
- ingress controller
- cert-manager
- namespace deletion
- PVC deletion
- database reset
Required Incident Evidence Schema
Each incident record should capture:
- `incident_type`
- `severity`
- `confidence`
- `signal_sources`
- `resource_scope`
- `environment`
- `evidence_summary`
- `raw_evidence_refs`
- `recommended_actions`
- `executed_actions`
- `approval_state`
Approval Prompt Requirements
When the robot asks for approval, the prompt must include:
- action
- target
- why now
- expected impact
- rollback or next fallback
- expiry time
Example:
Robot SRE requests approval: reboot worker 10.0.10.202. Reason: disk pressure persists after cleanup for 17m. Expected impact: pods on that node will reschedule. Rollback/fallback: replace worker from node pool if reboot does not clear pressure. Approval expires in 10m.
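A sketch of a prompt builder that keeps every required field present; the signature is illustrative.

```python
def approval_prompt(action: str, target: str, reason: str,
                    impact: str, fallback: str, expiry_minutes: int) -> str:
    """Render an approval request containing every field the policy requires."""
    return (
        f"Robot SRE requests approval: {action} {target}. "
        f"Reason: {reason}. "
        f"Expected impact: {impact}. "
        f"Rollback/fallback: {fallback}. "
        f"Approval expires in {expiry_minutes}m."
    )
```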
Metrics To Add
- `ops_incidents_open_total`
- `ops_incidents_resolved_total`
- `ops_incidents_by_type`
- `ops_auto_remediations_total`
- `ops_auto_remediations_failed_total`
- `ops_approvals_requested_total`
- `ops_approvals_granted_total`
- `ops_approvals_denied_total`
- `ops_policy_suppressions_total`
- `ops_chat_delivery_failures_total`
MVP Recommendation
Implement first:
- `node_disk_pressure`
- `runtime_dependency_outage`
- `registry_pull_failure`
- `identity_provider_unreachable`
- `release_drift_or_partial_apply`
These cover most of the recent real operational pain while staying small enough to validate safely.
Next Design Step
Use this taxonomy to define:
- data model and APIs
- operator approval workflow
- MCP tool contracts
- Telegram conversation flows