SRE Smart Bot Requirements And Design
Purpose
Define a production-minded but demo-friendly "SRE Smart Bot" capability for Image Factory that can:
- watch cluster and application health continuously
- explain incidents in operator-friendly language
- propose or take bounded remediation actions
- notify and converse with operators over chat channels such as Telegram or WhatsApp
- use an AI agent runtime and MCP tools without giving up deterministic control over high-risk actions
This document is intentionally focused on a practical first version that fits the current Image Factory architecture and recent OKE operational failure modes.
Problem Statement
Recent cluster instability exposed several gaps:
- node ephemeral-storage pressure built up before operators had useful early warning
- Docker Hub fallback and image churn amplified recovery pain
- some remediation steps were repetitive and mechanical
- diagnosis required stitching together OCI, Kubernetes, and application state manually
- stale configuration persisted across infrastructure changes
We want an ops capability that behaves like a careful teammate:
- detects trouble early
- narrates what is happening
- recommends safe actions first
- can execute approved actions
- leaves an auditable trail
Naming
The product-facing name should be SRE Smart Bot.
For now, some internal code and document references may still use Robot SRE as a technical codename while the implementation is being rolled out.
Product Vision
SRE Smart Bot is not just a chatbot and not just a cleanup CronJob.
It is a hybrid system with two layers:
- Deterministic control plane
  - watches defined signals
  - evaluates explicit policies
  - executes allowlisted remediation actions
  - records evidence, decisions, cooldowns, and outcomes
- Conversational operator interface
  - translates system state into concise incident updates
  - answers operator questions
  - asks for approval when actions cross risk thresholds
  - uses chat channels and Image Factory admin surfaces as the human interface
The AI persona should improve usability and investigation speed. The deterministic policy layer should remain the source of truth for what the system is allowed to do.
Taxonomy Shape
The Robot SRE should organize incidents by operational domain first, then by incident type.
Recommended top-level domains:
- infrastructure
- runtime services
- application services
- network / ingress
- identity / security
- release / configuration
- operator channels
This gives us a cleaner way to expand beyond cluster-only issues and lets operators browse and reason about rules in a way that matches how they already think about incidents.
Goals
- Detect cluster, runtime dependency, and application degradation before it becomes customer-visible.
- Create a single incident narrative from Kubernetes, OCI, and Image Factory runtime signals.
- Remediate a bounded set of known-safe failure classes automatically.
- Escalate safely for actions with non-obvious blast radius.
- Give operators a chat-first interface for incident follow-up, approvals, and status.
- Reuse existing Image Factory runtime health watcher patterns where possible.
- Keep the design compatible with OKE, Supabase, ingress-nginx, Tekton, and mirrored GitLab images.
Non-Goals
- Fully autonomous root access with unrestricted shell/tool execution.
- Generic "AI runs the cluster" behavior.
- Replacing Prometheus, Grafana, or normal alerting entirely.
- Performing destructive actions without policy, cooldowns, or approval.
- Solving every SRE use case in v1.
Key Design Principles
Modular System Boundaries
SRE Smart Bot should be implemented as a modular subsystem, not as a long chain of special cases inside server startup.
That means:
- startup code composes dependencies and launches workers
- signal adapters translate watcher output into normalized findings
- incident services manage correlation and lifecycle
- policy services decide whether actions are allowed
- channel adapters deliver notifications and approvals through provider contracts
This keeps the system testable, easier to extend, and less likely to turn main.go into an unmaintainable control tower.
Deterministic Before Generative
The system should use explicit policy evaluation for:
- detection
- severity classification
- remediation eligibility
- approvals
- cooldown enforcement
- audit logging
The AI layer should summarize, explain, and help choose between approved actions.
Safe By Default
The robot should prefer:
- observe
- notify
- suggest
- ask approval
- take low-risk action
It should not jump straight to reboot, replace, delete, or suspend critical services.
Explainability
Every decision should produce:
- what triggered the action
- what evidence was used
- why this action was chosen
- what was changed
- what happens next
Persona With Boundaries
The Robot SRE can feel like a teammate, but it must be honest about uncertainty and clearly distinguish:
- observations
- inferences
- recommendations
- executed actions
- required human approvals
Current Platform Hooks We Can Reuse
The backend already has a useful foundation:
- runtime_dependency_watcher in backend/cmd/server/main.go
- process health store and runtime component status
- notification delivery and websocket updates
- tenant asset drift watcher
- provider readiness watcher
- quarantine/release compliance watcher
- system configuration storage in system_configs
This suggests we should not start with a separate standalone bot. The better path is to add a new remediation/orchestration capability that integrates with the backend runtime and admin APIs.
Proposed Capability Scope
V1: Guarded Runtime And Cluster Remediation
The first version should handle a small number of repeatable operational problems:
- runtime dependency unavailable
- ingress broken or DNS/cert mismatch
- stale LDAP/system config after service replacement
- node disk pressure
- repeated image pull failures
- failed or noisy background job buildup
- Tekton or controller churn overwhelming small clusters
V1 should also include a small set of application-service incidents, not only infrastructure incidents, for example:
- backend deployment degraded
- frontend deployment degraded
- dispatcher unavailable
- login error spike
- worker crash-loop with dependency correlation
V1.5: Operator Conversation
Expose incident updates and approvals over configurable operator channels:
- Image Factory admin notifications / websocket feed
- enterprise webhook or chat gateway integrations
- optionally Telegram for environments that allow it
Support commands like:
- status
- what happened
- why did you scale this down
- show evidence
- approve remediation <id>
- pause robot
- resume robot
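A minimal sketch of how these commands could be parsed, assuming plain-text input from a chat channel. The Command type, the ParseCommand function, and the rule that unknown input is rejected are illustrative assumptions, not existing Image Factory code.

```go
package main

import (
	"fmt"
	"strings"
)

// Command is a hypothetical parsed operator command.
type Command struct {
	Name string // for example "status" or "approve remediation"
	Arg  string // remediation id for "approve remediation", otherwise empty
}

// simpleCommands lists the argument-free phrases from the command list.
var simpleCommands = []string{
	"status",
	"what happened",
	"why did you scale this down",
	"show evidence",
	"pause robot",
	"resume robot",
}

// ParseCommand matches input against the known phrases, ignoring case and
// extra whitespace. The remediation id keeps its original case.
func ParseCommand(input string) (Command, error) {
	fields := strings.Fields(input)
	if len(fields) >= 2 &&
		strings.EqualFold(fields[0], "approve") &&
		strings.EqualFold(fields[1], "remediation") {
		if len(fields) != 3 {
			return Command{}, fmt.Errorf("usage: approve remediation <id>")
		}
		return Command{Name: "approve remediation", Arg: fields[2]}, nil
	}
	text := strings.ToLower(strings.Join(fields, " "))
	for _, name := range simpleCommands {
		if text == name {
			return Command{Name: name}, nil
		}
	}
	// Unknown input is rejected rather than interpreted loosely.
	return Command{}, fmt.Errorf("unrecognized command: %q", input)
}

func main() {
	inputs := []string{"status", "Approve  remediation INC-42", "reboot everything"}
	for _, in := range inputs {
		cmd, err := ParseCommand(in)
		if err != nil {
			fmt.Println("rejected:", err)
			continue
		}
		fmt.Printf("%s arg=%q\n", cmd.Name, cmd.Arg)
	}
}
```

Free-form questions such as "why did this happen?" would presumably go to the agent runtime instead; this sketch only covers the fixed command set.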
V2: Richer Tool Use Via MCP And Agent Runtime
Add:
- MCP-backed tool registry
- agent planning for investigations
- richer runbooks
- multi-step incidents with memory and threads
- provider-based channel integrations through API contract
Target Users
- system administrators
- demo environment owner/operator
- platform engineer
- on-call engineer
Functional Requirements
Operator Channel Contract
Many enterprise environments will not allow Telegram or WhatsApp directly.
Because of that, channels should not be hardcoded into the product design. Instead, SRE Smart Bot should treat channels as configurable providers exposed through an API contract.
Recommended shape:
- provider id
- provider kind
- display name
- enabled flag
- approval interaction support
- config reference or endpoint reference
Examples of provider kinds:
- in_app
- email
- webhook
- slack
- teams
- telegram
- whatsapp
- custom
This makes the product usable in locked-down enterprises where the integration path may be:
- an internal notification broker
- a Teams bot
- a ServiceNow incident workflow
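The provider contract can be sketched as a small Go type. The field names, JSON tags, and the Validate and ApprovalTargets helpers below are assumptions for illustration; only the list of fields and provider kinds comes from this section.

```go
package main

import (
	"errors"
	"fmt"
)

// ChannelProvider mirrors the recommended shape. Names and tags are
// hypothetical, not an existing Image Factory schema.
type ChannelProvider struct {
	ID               string `json:"id"`
	Kind             string `json:"kind"`
	DisplayName      string `json:"display_name"`
	Enabled          bool   `json:"enabled"`
	SupportsApproval bool   `json:"supports_approval"`
	ConfigRef        string `json:"config_ref"` // config or endpoint reference
}

// validKinds holds the provider kinds listed in this section.
var validKinds = map[string]bool{
	"in_app": true, "email": true, "webhook": true, "slack": true,
	"teams": true, "telegram": true, "whatsapp": true, "custom": true,
}

// Validate checks the minimum needed to register a provider.
func (p ChannelProvider) Validate() error {
	if p.ID == "" {
		return errors.New("provider id is required")
	}
	if !validKinds[p.Kind] {
		return fmt.Errorf("unknown provider kind %q", p.Kind)
	}
	return nil
}

// ApprovalTargets returns the enabled providers that can carry approval prompts.
func ApprovalTargets(providers []ChannelProvider) []ChannelProvider {
	var out []ChannelProvider
	for _, p := range providers {
		if p.Enabled && p.SupportsApproval {
			out = append(out, p)
		}
	}
	return out
}

func main() {
	providers := []ChannelProvider{
		{ID: "ops-inapp", Kind: "in_app", DisplayName: "Admin notifications", Enabled: true, SupportsApproval: true},
		{ID: "ops-mail", Kind: "email", DisplayName: "Ops mailbox", Enabled: true},
		{ID: "ops-tg", Kind: "telegram", DisplayName: "Ops Telegram", Enabled: false, SupportsApproval: true},
	}
	for _, p := range providers {
		if err := p.Validate(); err != nil {
			fmt.Println("invalid:", err)
		}
	}
	for _, p := range ApprovalTargets(providers) {
		fmt.Println("approval prompts can go to", p.ID)
	}
}
```

Keeping approval support as an explicit flag lets a notification-only provider (for example email) stay enabled without ever being offered an approval prompt.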
MCP Tool Catalog (Conceptual)
if-ops-k8s
Capabilities:
- list and inspect nodes
- list pods and watch restarts
- inspect events
- read node conditions
- check pending pods and scheduling failures
if-ops-oci
Capabilities:
- list instance pools
- inspect node lifecycle
- read instance health
if-ops-config
Capabilities:
- read system config
- update selected config keys through validated operations
- inspect health metadata
if-ops-release
Capabilities:
- inspect Helm release values/status
- apply approved release reconciliations
if-ops-chat
Capabilities:
- send messages through configured channel providers (for example Telegram or WhatsApp)
- create approval prompts
- thread replies to incidents
if-ops-observability
Capabilities:
- query app runtime health
- query watcher health
- query incident and remediation history
Agent SDK Role
The agent runtime should be used for:
- evidence gathering from MCP servers
- summarization
- hypothesis ranking
- operator conversation
- action plan drafting
The agent runtime should not decide action legality by itself. That must stay in policy code.
Policy And Safety Model
Action Classes
Auto-Allowed
- send notification
- open incident
- collect evidence
- mark component degraded
- delete completed/failed pods
- suspend specific noisy CronJobs
- scale down specific noncritical demo workloads
Approval Required
- scale down shared controllers
- restart backend/frontend
- patch system config
- reconcile Helm release
- cordon or drain a node
- reboot an OCI worker
Human Only
- database-destructive actions
- credential rotation without workflow
- deleting tenant data
- disabling security controls
- bulk namespace deletion
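One way to encode the three action classes is a lookup table whose zero value is the most restrictive class, so an action nobody has classified is never executed by the bot. This is a sketch under that assumption; the action identifiers are hypothetical names for entries in the lists above.

```go
package main

import "fmt"

// ActionClass is ordered so that the zero value is the safest class.
type ActionClass int

const (
	HumanOnly ActionClass = iota // default for anything not listed
	ApprovalRequired
	AutoAllowed
)

func (c ActionClass) String() string {
	switch c {
	case AutoAllowed:
		return "auto_allowed"
	case ApprovalRequired:
		return "approval_required"
	default:
		return "human_only"
	}
}

// actionClasses maps hypothetical action identifiers to their class.
var actionClasses = map[string]ActionClass{
	"send_notification":     AutoAllowed,
	"collect_evidence":      AutoAllowed,
	"delete_completed_pods": AutoAllowed,
	"suspend_noisy_cronjob": AutoAllowed,
	"restart_backend":       ApprovalRequired,
	"patch_system_config":   ApprovalRequired,
	"drain_node":            ApprovalRequired,
	"reboot_oci_worker":     ApprovalRequired,
	"delete_tenant_data":    HumanOnly,
}

// Classify returns the class for an action; unknown actions are HumanOnly
// because a missing map key yields the zero value.
func Classify(action string) ActionClass {
	return actionClasses[action]
}

// MayExecute reports whether the bot may run the action right now.
func MayExecute(action string, approved bool) bool {
	switch Classify(action) {
	case AutoAllowed:
		return true
	case ApprovalRequired:
		return approved
	default:
		return false
	}
}

func main() {
	for _, a := range []string{"send_notification", "reboot_oci_worker", "delete_tenant_data", "format_disk"} {
		fmt.Printf("%-22s class=%-17s unapproved=%v approved=%v\n",
			a, Classify(a), MayExecute(a, false), MayExecute(a, true))
	}
}
```

In this shape an agent-drafted plan can name any action it likes, but the decision to execute still passes through MayExecute in policy code.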
Operator-Defined Rules
Operators should be able to extend the robot from the admin UI without code changes.
Recommended supported customizations:
- incident threshold overrides
- severity escalation rules
- environment-specific enable/disable
- notification routing
- suppression windows
- cooldown tuning
- allowlisted resource selection within approved action families
Recommended restrictions:
- operators cannot make destructive actions auto-allowed
- operators cannot introduce arbitrary shell commands
- operators cannot bypass approval for disruptive classes
- operators cannot store raw secrets in rule definitions
This makes the system flexible without turning it into an unsafe "run anything" automation tool.
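The restrictions above could be enforced when a rule is saved. The sketch below assumes a hypothetical RuleOverride shape and hypothetical action identifiers; the secret check is a simple key-name heuristic chosen for illustration and would need something stronger in practice.

```go
package main

import (
	"errors"
	"fmt"
	"strings"
)

// RuleOverride is a hypothetical operator-defined rule.
type RuleOverride struct {
	Action       string            // action family the rule applies to
	AutoAllow    bool              // operator requests automatic execution
	ShellCommand string            // must stay empty
	Params       map[string]string // thresholds, routing, cooldowns
}

// Hypothetical identifiers for destructive and disruptive action classes.
var destructive = map[string]bool{
	"delete_tenant_data":      true,
	"bulk_namespace_deletion": true,
}

var disruptive = map[string]bool{
	"drain_node":        true,
	"reboot_oci_worker": true,
	"restart_backend":   true,
}

// ValidateRule rejects rules that cross the restrictions listed above.
func ValidateRule(r RuleOverride) error {
	if r.ShellCommand != "" {
		return errors.New("arbitrary shell commands are not allowed")
	}
	if r.AutoAllow && destructive[r.Action] {
		return fmt.Errorf("destructive action %q cannot be auto-allowed", r.Action)
	}
	if r.AutoAllow && disruptive[r.Action] {
		return fmt.Errorf("disruptive action %q cannot bypass approval", r.Action)
	}
	for key := range r.Params {
		k := strings.ToLower(key)
		if strings.Contains(k, "password") || strings.Contains(k, "secret") || strings.Contains(k, "token") {
			return fmt.Errorf("parameter %q looks like a raw secret; store a reference instead", key)
		}
	}
	return nil
}

func main() {
	rules := []RuleOverride{
		{Action: "delete_completed_pods", AutoAllow: true, Params: map[string]string{"cooldown": "10m"}},
		{Action: "reboot_oci_worker", AutoAllow: true},
		{Action: "send_notification", ShellCommand: "kubectl delete ns demo"},
	}
	for _, r := range rules {
		if err := ValidateRule(r); err != nil {
			fmt.Println("rejected:", err)
			continue
		}
		fmt.Println("accepted:", r.Action)
	}
}
```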
Required Safeguards
- allowlist of resources and actions
- per-incident cooldown
- per-action rate limits
- environment awareness (demo, staging, prod)
- approval tokens with expiry
- dry-run mode
- audit log with before/after evidence
- rollback or fallback instructions for every action class
Channel Strategy
Telegram For MVP
Telegram is the best first chat channel because:
- easier bot onboarding
- lower business/legal friction
- supports threaded-ish operator flows well enough
- good for approval and incident updates
WhatsApp For Phase 2
WhatsApp is attractive for operator reach, but it introduces more platform overhead:
- Meta business onboarding
- template/message policy constraints
- channel approval and delivery complexity
Recommendation:
- build a channel abstraction first
- ship Telegram first
- add WhatsApp once workflows and policy prompts are stable
Example Operator Experience
Example alert:
Robot SRE: Disk pressure detected on worker 10.0.10.202. Evidence: EvictionThresholdMet, FreeDiskSpaceFailed, 12 pod evictions in 9m. Suggested actions: 1) pause nonessential workloads 2) clean completed/failed pods 3) reboot node if pressure persists for 15m. Reply APPROVE 1 to proceed.
Example follow-up:
Why did this happen?
Example response:
The node is low on ephemeral storage. Primary contributors appear to be image churn, repeated pod restarts, and unreclaimed runtime artifacts. Confidence medium.
Incident Classes And V1 Remediations
1. Node Disk Pressure
Signals:
- node DiskPressure=True
- FreeDiskSpaceFailed
- evictions
Automations:
- notify
- delete completed/failed pods
- suspend noisy jobs
- scale down allowlisted demo workloads
- if still degraded beyond threshold, request approval to recycle node
2. Registry / Image Pull Degradation
Signals:
- ImagePullBackOff
- ErrImagePull
- auth failures
- rate limit signals
Automations:
- identify image source
- classify dockerhub_rate_limit vs gitlab_auth vs tag_missing
- suggest mirror or pull-secret remediation
- if live drift exists, reconcile release or image refs with approval
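A rough sketch of the first two automations: derive the registry from the image reference, then classify the failure from the event message. The substring patterns are assumptions based on commonly seen registry error text and would need checking against the messages the cluster actually emits; treating any host containing "gitlab" as the GitLab registry is also an assumption.

```go
package main

import (
	"fmt"
	"strings"
)

// RegistryHost returns the registry part of an image reference. Following
// the usual Docker convention, a first path component without a dot or
// colon (and not "localhost") means the image comes from docker.io.
func RegistryHost(image string) string {
	first, _, found := strings.Cut(image, "/")
	if found && (strings.ContainsAny(first, ".:") || first == "localhost") {
		return first
	}
	return "docker.io"
}

// ClassifyPullFailure maps an image and its pull error message to one of
// the classes named above, or "unclassified" when nothing matches.
func ClassifyPullFailure(image, message string) string {
	host := RegistryHost(image)
	msg := strings.ToLower(message)
	switch {
	case host == "docker.io" && strings.Contains(msg, "toomanyrequests"):
		return "dockerhub_rate_limit"
	case strings.Contains(host, "gitlab") &&
		(strings.Contains(msg, "unauthorized") || strings.Contains(msg, "denied")):
		return "gitlab_auth"
	case strings.Contains(msg, "manifest unknown") || strings.Contains(msg, "not found"):
		return "tag_missing"
	default:
		return "unclassified"
	}
}

func main() {
	// Image names and messages below are illustrative samples.
	cases := [][2]string{
		{"nginx:1.25", "toomanyrequests: You have reached your pull rate limit"},
		{"registry.gitlab.com/acme/app:v1", "unauthorized: HTTP Basic: Access denied"},
		{"registry.gitlab.com/acme/app:v9", "manifest unknown"},
		{"quay.io/acme/tool:latest", "i/o timeout"},
	}
	for _, c := range cases {
		fmt.Printf("%-34s %-15s %s\n", c[0], RegistryHost(c[0]), ClassifyPullFailure(c[0], c[1]))
	}
}
```

An "unclassified" result is a useful outcome in its own right: the incident can still be opened with the raw evidence attached, without the bot guessing at a remediation.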
3. Runtime Dependency Failure
Signals:
- NATS/Redis/MinIO/registry/glauth health failures
- backend logs showing dependency timeout
Automations:
- verify service, endpoints, and health
- restart dependency if allowed
- verify dependent services recover
4. Config Drift
Signals:
- ingress missing expected hosts
- TLS secret mismatch
- LDAP host points to old service IP
- runtime config disagrees with Helm values
Automations:
- detect mismatch
- suggest or apply targeted config correction
- reopen incident if drift reappears after release change
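Mismatch detection can be reduced to comparing an expected key/value set (for example rendered Helm values) with what the runtime reports. The Drift type, the flat map representation, and the keys and values in the usage are hypothetical; real config would likely need flattening into this form first.

```go
package main

import (
	"fmt"
	"sort"
)

// Drift describes one key whose live value differs from the expected one.
// Actual is empty when the key is missing at runtime.
type Drift struct {
	Key      string
	Expected string
	Actual   string
}

// DetectDrift compares expected settings with actual runtime settings and
// returns the differences sorted by key for stable reporting. Keys that
// exist only at runtime are ignored in this sketch.
func DetectDrift(expected, actual map[string]string) []Drift {
	var drifts []Drift
	for key, want := range expected {
		got, ok := actual[key]
		if !ok || got != want {
			drifts = append(drifts, Drift{Key: key, Expected: want, Actual: got})
		}
	}
	sort.Slice(drifts, func(i, j int) bool { return drifts[i].Key < drifts[j].Key })
	return drifts
}

func main() {
	// Hypothetical keys and values for illustration only.
	expected := map[string]string{
		"ingress.host": "factory.example.com",
		"ldap.host":    "glauth.image-factory.svc",
		"tls.secret":   "factory-tls",
	}
	actual := map[string]string{
		"ingress.host": "factory.example.com",
		"ldap.host":    "10.96.12.7", // stale service IP
	}
	for _, d := range DetectDrift(expected, actual) {
		fmt.Printf("drift on %s: expected %q, found %q\n", d.Key, d.Expected, d.Actual)
	}
}
```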
Data Model Proposal
Add new persisted entities:
- ops_incidents
- ops_incident_events
- ops_remediation_actions
- ops_approvals
- ops_channel_threads
- ops_policy_bindings
Suggested incident fields:
- id
- incident_type
- status
- severity
- summary
- evidence_json
- current_signature
- opened_at
- resolved_at
- environment
- tenant_scope (if relevant)
- resource_scope
Admin UX Proposal
Add an admin area such as:
Operations > Robot SRE
Core views:
- active incidents
- incident detail with evidence timeline
- pending approvals
- action history
- policy configuration
- channel configuration
- simulation / dry-run console
Suggested Rollout Plan
Phase 0: Design And Safety
- define incident taxonomy
- define action allowlist
- define approval rules
- create data model and APIs
- define operator-defined rule boundaries and validation
Phase 1: Deterministic Watcher
- implement incident engine
- implement first remediations for disk pressure and runtime dependency failure
- add audit log
- add admin UI for incident visibility
- add rules UI for threshold/routing overrides
Phase 2: Telegram Channel
- add outbound incident notifications
- add approval workflow
- add simple command handlers
Phase 3: Agent + MCP Integration
- expose MCP servers for approved tools
- add agent summarization and investigation mode
- add evidence-aware operator Q&A
Phase 4: Expanded Remediation
- OCI node recycle
- Helm drift reconciliation
- config drift correction
- richer release recovery workflows
Open Questions
- Should Robot SRE live in the backend process or its own deployable from day one?
- Should approvals be required in demo environments for node recycle, or only in staging/prod?
- Should Telegram be the only MVP chat channel?
- How much authority should the agent have to propose Helm changes vs execute them?
- Do we want the robot to reason over historical incidents for pattern detection in v1, or wait for v2?
- How much of the application-service taxonomy should ship in v1 vs v1.5?
- Should operator-defined rules be stored in system_configs first, or in dedicated ops tables from day one?
Recommended MVP Decisions
- Start with a deterministic remediation engine inside the backend runtime.
- Reuse the existing watcher/process health patterns already present in backend/cmd/server/main.go.
- Use MCP for tool integration boundaries, not as the policy engine.
- Use an agent runtime for operator conversation and evidence summarization only.
- Ship Telegram first.
- Treat WhatsApp as a later channel adapter.
- Restrict v1 automatic actions to low-risk containment and cleanup.
- Require explicit operator approval for node recycle, release reconciliation, and config mutation.
Immediate Next Deliverables
- Incident taxonomy and policy matrix.
- Data model and API design for incidents, actions, approvals, and operator-defined rules.
- MCP server interface definitions for Kubernetes, OCI, database, and chat.
- Telegram bot interaction design.
- V1 implementation plan for node_disk_pressure, runtime_dependency_failure, and first application-service incidents.