SRE Smart Bot Implementation Plan

Purpose

Turn the SRE Smart Bot design into an incremental delivery plan that fits the current Image Factory backend, admin APIs, and small-cluster operating model.

Delivery Principle

Build the system in this order:

policy and storage
incidents and evidence
safe actions and approvals
operator channels
richer agent and MCP capabilities

That keeps the deterministic control plane ahead of the persona layer.

Architecture Guardrail

SRE Smart Bot must remain modular.

Specifically:

backend/cmd/server/main.go should only compose dependencies and start background workers
incident correlation, signal mapping, policy evaluation, channel delivery, and action execution should live in dedicated packages
each signal source should have a small adapter boundary rather than embedding business rules directly in startup code
new SRE Smart Bot slices should prefer service + repository + adapter structure over inline logic in main.go

This is important because the backend already has a large startup surface, and SRE Smart Bot will grow quickly if we do not enforce boundaries early.

Current Checkpoint

Status: 2026-03-14 checkpoint reached

Completed since kickoff:

persisted SRE Smart Bot policy config and admin API
incident ledger schema, repository, and read APIs
initial watcher signal wiring for runtime dependency and cluster metrics ingester incidents
modularization checkpoint:
- runtime dependency watcher extracted
- cluster metrics ingester extracted
- dispatcher runner extracted
- workflow runner extracted
- stale execution watchdog extracted
- provider readiness watcher extracted
- tenant asset drift watcher extracted
- quarantine release compliance watcher extracted
- build notification subscriber health reporter extracted
first product-facing admin incident page route added for Operations > SRE Smart Bot
normalized SRE ledger events now publish through the existing event bus for:
- sre.finding.observed
- sre.incident.resolved
- sre.evidence.added
- sre.action.proposed
backend ingestion path now exists for detector-published findings through:
- sre.detector.finding.observed
local observability bootstrap now exists for development:
- local Loki config
- local log shipper for logs/*.log
- local Grafana provisioning pointing to Loki

Current emphasis:

keep main.go as a composition root
continue product work on top of the extracted runner boundaries
prioritize operator visibility and actionability before richer persona/channel work
make the local observability workflow usable enough to iterate on detector rules quickly
start treating golden signals as first-class SRE inputs, beginning with low-risk saturation detection from existing cluster metrics snapshots

Phase 0: Policy Foundation

Status: done

Scope:

persisted robot_sre_policy config
environment posture (demo, staging, production)
configurable channel providers through API contract
operator-defined rule metadata and validation
implementation backlog and epic alignment

Current slice:

add backend config model and admin endpoints for SRE Smart Bot policy
keep rules declarative and bounded

Exit criteria:

admin can read and update SRE Smart Bot policy
invalid policy payloads are rejected deterministically
defaults are safe for demo environment

Phase 1: Incident Ledger

Status: in_progress

Scope:

findings table
incidents table
incident evidence records
action attempts
approval requests and decisions
correlation keys and incident lifecycle transitions

First incident classes:

infrastructure.node_disk_pressure
runtime_services.runtime_dependency_outage
release_configuration.registry_auth_or_mirror_failure
identity_security.identity_provider_unreachable
application_services.application_service_degraded

Exit criteria:

repeated watcher signals fold into a stable incident record
incidents expose state: observed, triaged, contained, recovering, resolved, suppressed, escalated
all automated actions leave an auditable trail

Progress now:

incident, finding, evidence, action-attempt, and approval tables exist
incident list/detail admin APIs exist
initial admin incident page exists
runtime dependency watcher and cluster metrics snapshot ingester already create/resolve incidents

Next for Phase 1:

wire provider readiness, tenant asset drift, and release compliance watcher results into incident findings
add evidence capture helpers for watcher-specific detail snapshots
extend incident UI with filters, counts, and approval/action timeline polish

Phase 2: Guarded Remediation Engine

Status: planned

Scope:

policy evaluator
cooldown enforcement
allowlisted containment actions
approval gate for recover/disruptive actions
action runner abstraction for Kubernetes, OCI, and config mutations

V1 allowed actions:

notify
delete succeeded/failed pods
suspend allowlisted CronJobs
scale allowlisted noncritical workloads to zero

Approval-required actions:

rollout restart deployment
patch config
Helm reconcile
cordon node
OCI reboot/replace node

Exit criteria:

no disruptive action can execute without approval state
repeated failures do not thrash the same remediation
runtime dependency and disk-pressure incidents can be contained automatically in demo mode

Phase 3: Admin UI And Operator Workflow

Status: in_progress

Scope:

admin page for incidents
admin page for SRE Smart Bot policy/rules
approval inbox
incident timeline with evidence and action history

Exit criteria:

operator can inspect incident evidence and approve/reject actions from UI
operator-defined rules can be added without code changes
built-in rules remain protected from unsafe mutation

Progress now:

incidents workspace, approvals inbox, settings page, and detector-rules review page all exist
the incident drawer now has tabbed summary / AI workspace / signals / actions views
the AI workspace renders structured HTTP golden-signal MCP output instead of raw JSON

Next for Phase 3:

add trend/history visuals driven by persisted http_signals.history
add direct queue/backlog signal summaries now that async-worker backlog sources are landing

Phase 4: Provider-Based Operator Channels

Status: planned

Scope:

provider-based channel integration
incident summaries and action prompts
approval/reject commands
thread-safe mapping between provider messages and incident/action IDs

Exit criteria:

operator receives incident updates through a configured provider
operator can approve a remediation through a provider that supports interaction
provider delivery failures fall back to in-app notifications cleanly

Phase 4.5: Loki And Alloy Ingestion Baseline

Status: planned

Scope:

Alloy collection for pod logs and Kubernetes events
Loki monolithic deployment profile
bounded retention and low-cardinality labeling
detector-friendly namespace selection and labels

Exit criteria:

logs are queryable for recent incident windows without overloading the cluster
Loki/Alloy footprint fits the current small-cluster budget
detector services can query Loki instead of tailing raw node logs directly

Phase 5: Log Intelligence And NATS Findings

Status: planned

Scope:

lightweight detector service
NATS subject for normalized findings
log signature detection
correlation with metrics snapshots and runtime health
reuse the new SRE event-bus contracts so detector and ledger flows speak the same language

Exit criteria:

log-derived findings enrich incidents instead of bypassing policy
findings are normalized and replayable
remediation remains gated by corroborating evidence

Phase 5.5: Golden Signals And Metric Correlation

Status: in_progress

Scope:

derive signal findings from cluster and service metrics
normalize the common golden signal categories:
- latency
- traffic
- errors
- saturation
correlate metric findings with logs, runtime health, and incident evidence
keep the first slice lightweight by building on the existing metrics snapshot ingester

Current first slice:

node CPU saturation detection from cluster snapshots
node memory saturation detection from cluster snapshots
pod restart pressure detection from pod status snapshots
pod eviction pressure detection from pod status snapshots
app-level 5xx burst detection from backend access logs
app-level panic detection from backend logs
app-level request volume windows from HTTP middleware
app-level server-error rate windows from HTTP middleware
app-level average latency windows from HTTP middleware
read-only MCP access to recent HTTP golden-signal windows
recommendation-only action proposal:
- review_cluster_capacity
- review_workload_stability

Immediate next slice:

expose those windows directly in the incident summary UX
add queue-depth / backlog golden signals for asynchronous workloads
compare persisted HTTP trends with recent logs when drafting hypotheses

Exit criteria:

metric-backed findings create or enrich incident threads without requiring log signatures
saturation findings carry evidence snapshots and bounded recommendation actions
thresholds are configurable through env and later promotable into policy/admin settings
the agent/MCP layer can eventually inspect metric trends the same way it inspects logs

Phase 6: MCP And Agent Runtime

Status: planned

Scope:

MCP tool interfaces for Kubernetes, OCI, database, release state, and chat
agent explanation layer
bounded investigation flows

Exit criteria:

agent can summarize incidents and evidence through MCP tools
policy layer remains the final action authority
human approvals remain explicit and auditable

Phase 7: AI Operator Experience

Status: planned

Scope:

AI-generated operator summaries
suggested next investigative steps
configurable provider delivery contracts
standalone SRE Smart Bot worker/service extraction

Exit criteria:

operators receive useful summaries grounded in stored findings/evidence
provider delivery remains configurable through API contracts
standalone extraction does not require changing the incident or approval model

Next Build Slice

Recommended next implementation sequence:

add explicit observability/intelligence epics to backlog and handover
publish normalized SRE ledger events on the existing event bus so NATS/detector consumers have a stable contract
define and land the first detector/NATS subject contract using those event types
add Loki/Alloy deployment manifests or chart values sized for the small OKE cluster
then move into MCP tool contracts and AI operator features

MCP And AI Feature Layer

In progress: robot_sre_policy now includes MCP server definitions and bounded agent-runtime controls.
In progress: incident workspace API/UI now exposes an MCP/AI-ready bundle with executive summaries, recommended questions, enabled MCP servers, and tooling guidance.
In progress: concrete read-only MCP adapters now exist for:
- observability
- kubernetes
In progress: read-only tool coverage now includes:
- logs.recent
- release_drift.summary
In progress: log intelligence now also covers notification-delivery failure signatures from worker logs so SRE Smart Bot can open incidents for downstream async action failures.
In progress: detector-rule learning loop now exists in the backend:
- observed incidents can generate persisted detector-rule suggestions
- admins can accept or reject suggestions
- accepted suggestions become active detector_rules in robot_sre_policy
- detector_learning_mode=training_auto_create can auto-activate learned rules
In progress: the first deterministic agent workflow now exists:
- build a draft hypothesis set
- build a bounded investigation plan
- use only read-only MCP tools
In progress: an optional local-model interpretation layer now exists for:
- provider=ollama
- local model evaluation on top of the deterministic draft
Current default local profile:
- provider=ollama
- base_url=http://127.0.0.1:11434
- model=llama3.2:3b
In progress: Helm/runtime default wiring now supports env-driven agent runtime defaults:
- IF_SRE_AGENT_RUNTIME_BASE_URL
- IF_SRE_AGENT_RUNTIME_MODEL
- in-cluster ollama.enabled=true deployments can default to the internal service URL automatically
Done: bootstrap/reset can now optionally persist those deployment-aware defaults into the saved global robot_sre_policy on first run through bootstrap.seedRobotSREPolicyDefaults=true, without overwriting later operator edits
Next: add richer tool coverage, move tool invocation behind a standalone agent runtime seam, and keep mutating actions approval-bound.
Next: add admin UI for detector-rule suggestions and training-mode controls.