SRE Smart Bot Requirements And Design

Purpose

Define a production-minded but demo-friendly "SRE Smart Bot" capability for Image Factory that can:

watch cluster and application health continuously
explain incidents in operator-friendly language
propose or take bounded remediation actions
notify and converse with operators over chat channels such as Telegram or WhatsApp
use an AI agent runtime and MCP tools without giving up deterministic control over high-risk actions

This document is intentionally focused on a practical first version that fits the current Image Factory architecture and recent OKE operational failure modes.

Problem Statement

Recent cluster instability exposed several gaps:

node ephemeral-storage pressure built up before operators had useful early warning
Docker Hub fallback and image churn amplified recovery pain
some remediation steps were repetitive and mechanical
diagnosis required stitching together OCI, Kubernetes, and application state manually
stale configuration persisted across infrastructure changes

We want an ops capability that behaves like a careful teammate:

detects trouble early
narrates what is happening
recommends safe actions first
can execute approved actions
leaves an auditable trail

Naming

The product-facing name should be SRE Smart Bot.

For now, some internal code and document references may still use Robot SRE as a technical codename while the implementation is being rolled out.

Product Vision

SRE Smart Bot is not just a chatbot and not just a cleanup CronJob.

It is a hybrid system with two layers:

Deterministic control plane

watches defined signals
evaluates explicit policies
executes allowlisted remediation actions
records evidence, decisions, cooldowns, and outcomes

Conversational operator interface

translates system state into concise incident updates
answers operator questions
asks for approval when actions cross risk thresholds
uses chat channels and Image Factory admin surfaces as the human interface

The AI persona should improve usability and investigation speed. The deterministic policy layer should remain the source of truth for what the system is allowed to do.

Taxonomy Shape

The Robot SRE should organize incidents by operational domain first, then by incident type.

Recommended top-level domains:

infrastructure
runtime services
application services
network / ingress
identity / security
release / configuration
operator channels

This gives us a cleaner way to expand beyond cluster-only issues and lets operators browse and reason about rules in a way that matches how they already think about incidents.
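
The domain-first taxonomy could be represented as a small lookup from incident type to domain. This is a minimal sketch: the `Domain` constants mirror the list above, while the incident type keys (other than `node_disk_pressure` and `runtime_dependency_failure`, which this document names later) and the `DomainFor` helper are illustrative assumptions.

```go
package main

// Domain is a top-level operational domain from the taxonomy above.
type Domain string

const (
	DomainInfrastructure      Domain = "infrastructure"
	DomainRuntimeServices     Domain = "runtime_services"
	DomainApplicationServices Domain = "application_services"
	DomainNetworkIngress      Domain = "network_ingress"
	DomainIdentitySecurity    Domain = "identity_security"
	DomainReleaseConfig       Domain = "release_configuration"
	DomainOperatorChannels    Domain = "operator_channels"
)

// incidentDomains maps incident types to their domain.
// The incident type names are hypothetical examples, not a confirmed list.
var incidentDomains = map[string]Domain{
	"node_disk_pressure":          DomainInfrastructure,
	"runtime_dependency_failure":  DomainRuntimeServices,
	"backend_deployment_degraded": DomainApplicationServices,
	"ingress_host_missing":        DomainNetworkIngress,
	"config_drift":                DomainReleaseConfig,
}

// DomainFor returns the domain for an incident type.
// The boolean is false when the incident type is not registered.
func DomainFor(incidentType string) (Domain, bool) {
	d, ok := incidentDomains[incidentType]
	return d, ok
}
```

Keeping the mapping in one table means a new incident type is added in one place and the admin UI can group rules by domain without special cases.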

Goals

Detect cluster, runtime dependency, and application degradation before it becomes customer-visible.
Create a single incident narrative from Kubernetes, OCI, and Image Factory runtime signals.
Remediate a bounded set of known-safe failure classes automatically.
Escalate safely for actions with non-obvious blast radius.
Give operators a chat-first interface for incident follow-up, approvals, and status.
Reuse existing Image Factory runtime health watcher patterns where possible.
Keep the design compatible with OKE, Supabase, ingress-nginx, Tekton, and mirrored GitLab images.

Non-Goals

Fully autonomous root access with unrestricted shell/tool execution.
Generic "AI runs the cluster" behavior.
Replacing Prometheus, Grafana, or normal alerting entirely.
Performing destructive actions without policy, cooldowns, or approval.
Solving every SRE use case in v1.

Key Design Principles

Modular System Boundaries

SRE Smart Bot should be implemented as a modular subsystem, not as a long chain of special cases inside server startup.

That means:

startup code composes dependencies and launches workers
signal adapters translate watcher output into normalized findings
incident services manage correlation and lifecycle
policy services decide whether actions are allowed
channel adapters deliver notifications and approvals through provider contracts

This keeps the system testable, easier to extend, and less likely to turn main.go into an unmaintainable control tower.

Deterministic Before Generative

The system should use explicit policy evaluation for:

detection
severity classification
remediation eligibility
approvals
cooldown enforcement
audit logging

The AI layer should summarize, explain, and help choose between approved actions.

Safe By Default

The robot should prefer the following steps, in this order:

observe
notify
suggest
ask approval
take low-risk action

It should not jump straight to reboot, replace, delete, or suspend critical services.
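
The preference order above can be modelled as an ordered ladder that only ever advances one rung at a time. This is a sketch under that assumption; the `Step` type and `Next` function are hypothetical names.

```go
package main

// Step is one rung of the safe-by-default ladder, in preference order.
type Step int

const (
	Observe Step = iota
	Notify
	Suggest
	AskApproval
	TakeLowRiskAction
)

var stepNames = [...]string{
	"observe", "notify", "suggest", "ask approval", "take low-risk action",
}

// String returns the human-readable name of the step.
func (s Step) String() string {
	if s < Observe || s > TakeLowRiskAction {
		return "unknown"
	}
	return stepNames[s]
}

// Next returns the following rung. It never skips a rung and never goes
// past the last one, so reboot/replace/delete are simply not reachable here.
func Next(s Step) Step {
	if s < Observe {
		return Observe
	}
	if s >= TakeLowRiskAction {
		return TakeLowRiskAction
	}
	return s + 1
}
```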

Explainability

Every decision should produce:

what triggered the action
what evidence was used
why this action was chosen
what was changed
what happens next

Persona With Boundaries

The Robot SRE can feel like a teammate, but it must be honest about uncertainty and clearly distinguish:

observations
inferences
recommendations
executed actions
required human approvals

Current Platform Hooks We Can Reuse

The backend already has a useful foundation:

runtime_dependency_watcher in backend/cmd/server/main.go
process health store and runtime component status
notification delivery and websocket updates
tenant asset drift watcher
provider readiness watcher
quarantine/release compliance watcher
system configuration storage in system_configs

This suggests we should not start with a separate standalone bot. The better path is to add a new remediation/orchestration capability that integrates with the backend runtime and admin APIs.

Proposed Capability Scope

V1: Guarded Runtime And Cluster Remediation

The first version should handle a small number of repeatable operational problems:

runtime dependency unavailable
ingress broken or DNS/cert mismatch
stale LDAP/system config after service replacement
node disk pressure
repeated image pull failures
failed or noisy background job buildup
Tekton or controller churn overwhelming small clusters

V1 should also include a small set of application-service incidents, not only infrastructure incidents, for example:

backend deployment degraded
frontend deployment degraded
dispatcher unavailable
login error spike
worker crash-loop with dependency correlation

V1.5: Operator Conversation

Expose incident updates and approvals over configurable operator channels:

Image Factory admin notifications / websocket feed
enterprise webhook or chat gateway integrations
optionally Telegram for environments that allow it

Support commands like:

status
what happened
why did you scale this down
show evidence
approve remediation <id>
pause robot
resume robot

V2: Richer Tool Use Via MCP And Agent Runtime

Add:

MCP-backed tool registry
agent planning for investigations
richer runbooks
multi-step incidents with memory and threads
provider-based channel integrations through an API contract

Target Users

system administrators
demo environment owner/operator
platform engineer
on-call engineer

Functional Requirements

Operator Channel Contract

Many enterprise environments will not allow Telegram or WhatsApp directly.

Because of that, channels should not be hardcoded into the product design. Instead, SRE Smart Bot should treat channels as configurable providers exposed through an API contract.

Recommended shape:

provider id
provider kind
display name
enabled flag
approval interaction support
config reference or endpoint reference

Examples of provider kinds:

in_app
email
webhook
slack
teams
telegram
whatsapp
custom

This makes the product usable in locked-down enterprises where the integration path may be:

an internal notification broker
a Teams bot
a ServiceNow incident workflow
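
The recommended shape could be captured in a provider record plus a helper that selects channels able to carry approvals. This is a sketch only: the `ChannelProvider` struct, its field names, and `ApprovalTargets` are assumptions layered on the field list and provider kinds above.

```go
package main

import "fmt"

// ProviderKind is one of the provider kinds listed above.
type ProviderKind string

var knownKinds = map[ProviderKind]bool{
	"in_app": true, "email": true, "webhook": true, "slack": true,
	"teams": true, "telegram": true, "whatsapp": true, "custom": true,
}

// ChannelProvider mirrors the recommended provider shape.
type ChannelProvider struct {
	ID               string
	Kind             ProviderKind
	DisplayName      string
	Enabled          bool
	SupportsApproval bool
	ConfigRef        string // reference to config or endpoint, never a raw secret
}

// Validate checks the fields the contract cannot work without.
func (p ChannelProvider) Validate() error {
	if p.ID == "" {
		return fmt.Errorf("provider id is required")
	}
	if !knownKinds[p.Kind] {
		return fmt.Errorf("provider %q: unknown kind %q", p.ID, p.Kind)
	}
	return nil
}

// ApprovalTargets returns enabled, valid providers that support approval
// interactions, so approval prompts are never sent to one-way channels.
func ApprovalTargets(providers []ChannelProvider) []ChannelProvider {
	var out []ChannelProvider
	for _, p := range providers {
		if p.Enabled && p.SupportsApproval && p.Validate() == nil {
			out = append(out, p)
		}
	}
	return out
}
```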

MCP Tool Catalog (Conceptual)

`if-ops-k8s`

Capabilities:

list and inspect nodes
list pods and watch restarts
inspect events
read node conditions
check pending pods and scheduling failures

`if-ops-oci`

Capabilities:

list instance pools
inspect node lifecycle
read instance health

`if-ops-config`

Capabilities:

read system config
update selected config keys through validated operations
inspect health metadata

`if-ops-release`

Capabilities:

inspect Helm release values/status
apply approved release reconciliations

`if-ops-chat`

Capabilities:

send messages through configured channel providers (for example Telegram or WhatsApp)
create approval prompts
thread replies to incidents

`if-ops-observability`

Capabilities:

query app runtime health
query watcher health
query incident and remediation history

Agent SDK Role

The agent runtime should be used for:

evidence gathering from MCP servers
summarization
hypothesis ranking
operator conversation
action plan drafting

The agent runtime should not decide action legality by itself. That must stay in policy code.
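
In practice this means agent output is treated as a list of proposals that policy code filters before anything is shown as executable. A minimal sketch, assuming a hypothetical `Proposal` type and a `PermitFunc` callback supplied by the policy layer:

```go
package main

// Proposal is an action drafted by the agent runtime.
type Proposal struct {
	Action    string
	Rationale string
}

// PermitFunc is implemented by policy code, never by the agent.
type PermitFunc func(action string) bool

// Gate splits agent proposals into those policy permits and those it rejects.
// A nil permit function rejects everything, so a wiring mistake fails closed.
func Gate(permit PermitFunc, proposals []Proposal) (allowed, rejected []Proposal) {
	for _, p := range proposals {
		if permit != nil && permit(p.Action) {
			allowed = append(allowed, p)
		} else {
			rejected = append(rejected, p)
		}
	}
	return allowed, rejected
}
```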

Policy And Safety Model

Action Classes

Auto-Allowed

send notification
open incident
collect evidence
mark component degraded
delete completed/failed pods
suspend specific noisy CronJobs
scale down specific noncritical demo workloads

Approval Required

scale down shared controllers
restart backend/frontend
patch system config
reconcile Helm release
cordon or drain a node
reboot an OCI worker

Human Only

database-destructive actions
credential rotation without workflow
deleting tenant data
disabling security controls
bulk namespace deletion
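
The three classes can be encoded so that any action not explicitly registered falls into the most restrictive class. In this sketch the action identifiers are hypothetical snake_case renderings of a few items from the lists above, and `Classify` and `MayExecute` are assumed helper names.

```go
package main

// ActionClass is the safety class of a remediation action.
// HumanOnly is the zero value, so unknown actions default to the safest class.
type ActionClass int

const (
	HumanOnly ActionClass = iota
	ApprovalRequired
	AutoAllowed
)

// actionClasses lists a subset of the actions above (identifiers are illustrative).
var actionClasses = map[string]ActionClass{
	"send_notification":        AutoAllowed,
	"collect_evidence":         AutoAllowed,
	"delete_completed_pods":    AutoAllowed,
	"suspend_noisy_cronjob":    AutoAllowed,
	"scale_down_demo_workload": AutoAllowed,
	"restart_backend":          ApprovalRequired,
	"patch_system_config":      ApprovalRequired,
	"reconcile_helm_release":   ApprovalRequired,
	"drain_node":               ApprovalRequired,
	"reboot_oci_worker":        ApprovalRequired,
	"delete_tenant_data":       HumanOnly,
	"bulk_namespace_deletion":  HumanOnly,
}

// Classify returns the class for an action; unregistered actions are HumanOnly.
func Classify(action string) ActionClass {
	return actionClasses[action]
}

// MayExecute reports whether the robot itself may run the action,
// given whether a valid operator approval exists.
func MayExecute(action string, approved bool) bool {
	switch Classify(action) {
	case AutoAllowed:
		return true
	case ApprovalRequired:
		return approved
	default:
		return false
	}
}
```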

Operator-Defined Rules

Operators should be able to extend the robot from the admin UI without code changes.

Recommended supported customizations:

incident threshold overrides
severity escalation rules
environment-specific enable/disable
notification routing
suppression windows
cooldown tuning
allowlisted resource selection within approved action families

Recommended restrictions:

operators cannot make destructive actions auto-allowed
operators cannot introduce arbitrary shell commands
operators cannot bypass approval for disruptive classes
operators cannot store raw secrets in rule definitions

This makes the system flexible without turning it into an unsafe "run anything" automation tool.
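
The restrictions above could be enforced when a rule is saved from the admin UI. The sketch below is an assumption throughout: the `OperatorRule` shape, the `autoAllowable` set, and the key-name heuristic for spotting raw secrets are all hypothetical, and a real implementation would likely need a stricter secret check.

```go
package main

import (
	"errors"
	"strings"
)

// OperatorRule is a hypothetical operator-defined customization.
type OperatorRule struct {
	IncidentType string
	Action       string
	AutoAllow    bool
	ShellCommand string            // present only so it can be rejected
	Params       map[string]string // thresholds, routing, references
}

// autoAllowable lists actions an operator may mark as automatic.
var autoAllowable = map[string]bool{
	"delete_completed_pods":    true,
	"suspend_noisy_cronjob":    true,
	"scale_down_demo_workload": true,
}

// ValidateRule enforces the restrictions on operator-defined rules.
func ValidateRule(r OperatorRule) error {
	if r.ShellCommand != "" {
		return errors.New("arbitrary shell commands are not allowed")
	}
	if r.AutoAllow && !autoAllowable[r.Action] {
		return errors.New("action " + r.Action + " cannot be auto-allowed by an operator rule")
	}
	for key := range r.Params {
		k := strings.ToLower(key)
		if strings.HasSuffix(k, "_ref") {
			continue // references to stored secrets are fine
		}
		// Rough heuristic: key names that suggest a raw secret value.
		if strings.Contains(k, "password") || strings.Contains(k, "token") || strings.Contains(k, "secret") {
			return errors.New("parameter " + key + " looks like a raw secret; use a reference instead")
		}
	}
	return nil
}
```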

Required Safeguards

allowlist of resources and actions
per-incident cooldown
per-action rate limits
environment awareness (demo, staging, prod)
approval tokens with expiry
dry-run mode
audit log with before/after evidence
rollback or fallback instructions for every action class

Channel Strategy

Telegram For MVP

Telegram is the recommended first chat channel, for environments that allow it, because it offers:

simple bot onboarding
lower business/legal friction
reply threading that is good enough for operator flows
a good fit for approvals and incident updates

WhatsApp For Phase 2

WhatsApp is attractive for operator reach, but it introduces more platform overhead:

Meta business onboarding
template/message policy constraints
channel approval and delivery complexity

Recommendation:

build a channel abstraction first
ship Telegram first
add WhatsApp once workflows and policy prompts are stable

Example Operator Experience

Example alert:

Robot SRE: Disk pressure detected on worker 10.0.10.202. Evidence: EvictionThresholdMet, FreeDiskSpaceFailed, 12 pod evictions in 9m. Suggested actions: 1) pause nonessential workloads 2) clean completed/failed pods 3) reboot node if pressure persists for 15m. Reply APPROVE 1 to proceed.

Example follow-up:

Why did this happen?

Example response:

The node is low on ephemeral storage. Primary contributors appear to be image churn, repeated pod restarts, and unreclaimed runtime artifacts. Confidence: medium.

Incident Classes And V1 Remediations

1. Node Disk Pressure

Signals:

node DiskPressure=True
FreeDiskSpaceFailed
evictions

Automations:

notify
delete completed/failed pods
suspend noisy jobs
scale down allowlisted demo workloads
if still degraded beyond threshold, request approval to recycle node

2. Registry / Image Pull Degradation

Signals:

ImagePullBackOff
ErrImagePull
auth failures
rate limit signals

Automations:

identify image source
classify dockerhub_rate_limit vs gitlab_auth vs tag_missing
suggest mirror or pull-secret remediation
if live drift exists, reconcile release or image refs with approval
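
The first two automations, identifying the image source and classifying the failure, could look roughly like the sketch below. The registry-host rule follows the usual container image reference convention (a first path component containing a dot or colon, or equal to `localhost`, is a registry host; otherwise Docker Hub is implied). The error substrings and the "host contains gitlab" check are assumptions about typical messages and naming, not a verified list.

```go
package main

import "strings"

// RegistryHost extracts the registry from an image reference, defaulting
// to docker.io when no registry host is present.
func RegistryHost(image string) string {
	first, _, found := strings.Cut(image, "/")
	if found && (strings.ContainsAny(first, ".:") || first == "localhost") {
		return first
	}
	return "docker.io"
}

// ClassifyPullFailure maps an image reference and a pull error message to
// one of the classes named above, or "unknown".
func ClassifyPullFailure(image, message string) string {
	host := RegistryHost(image)
	m := strings.ToLower(message)

	rateLimited := strings.Contains(m, "toomanyrequests") || strings.Contains(m, "rate limit")
	authFailed := strings.Contains(m, "unauthorized") ||
		strings.Contains(m, "authentication required") ||
		strings.Contains(m, "denied")
	missing := strings.Contains(m, "manifest unknown") || strings.Contains(m, "not found")

	switch {
	case rateLimited && host == "docker.io":
		return "dockerhub_rate_limit"
	case authFailed && strings.Contains(host, "gitlab"):
		return "gitlab_auth"
	case missing:
		return "tag_missing"
	default:
		return "unknown"
	}
}
```

Returning "unknown" instead of guessing keeps the classifier honest: an unclassified failure can still open an incident and be handed to an operator with the raw evidence.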

3. Runtime Dependency Failure

Signals:

NATS/Redis/MinIO/registry/glauth health failures
backend logs showing dependency timeout

Automations:

verify service, endpoints, and health
restart dependency if allowed
verify dependent services recover

4. Config Drift

Signals:

ingress missing expected hosts
TLS secret mismatch
LDAP host points to old service IP
runtime config disagrees with Helm values

Automations:

detect mismatch
suggest or apply targeted config correction
reopen incident if drift reappears after release change
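
At its core, drift detection is a comparison between expected values (for example from Helm values) and observed runtime values. A minimal sketch, assuming both sides have already been flattened into string maps; the `Drift` type, `DetectDrift`, and any config keys used with it are hypothetical.

```go
package main

import "sort"

// Drift describes one key whose runtime value differs from the expected value.
type Drift struct {
	Key      string
	Expected string
	Actual   string // empty when the key is missing at runtime
}

// DetectDrift compares expected and actual config and returns mismatches
// sorted by key, so repeated runs produce a stable incident signature.
func DetectDrift(expected, actual map[string]string) []Drift {
	var out []Drift
	for key, want := range expected {
		if got := actual[key]; got != want {
			out = append(out, Drift{Key: key, Expected: want, Actual: got})
		}
	}
	sort.Slice(out, func(i, j int) bool { return out[i].Key < out[j].Key })
	return out
}
```

Only keys present in `expected` are checked, which keeps the comparison targeted at values the release is supposed to own.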

Data Model Proposal

Add new persisted entities:

ops_incidents
ops_incident_events
ops_remediation_actions
ops_approvals
ops_channel_threads
ops_policy_bindings

Suggested incident fields:

id
incident_type
status
severity
summary
evidence_json
current_signature
opened_at
resolved_at
environment
tenant_scope (if relevant)
resource_scope

Admin UX Proposal

Add an admin area such as:

Operations > Robot SRE

Core views:

active incidents
incident detail with evidence timeline
pending approvals
action history
policy configuration
channel configuration
simulation / dry-run console

Suggested Rollout Plan

Phase 0: Design And Safety

define incident taxonomy
define action allowlist
define approval rules
create data model and APIs
define operator-defined rule boundaries and validation

Phase 1: Deterministic Watcher

implement incident engine
implement first remediations for disk pressure and runtime dependency failure
add audit log
add admin UI for incident visibility
add rules UI for threshold/routing overrides

Phase 2: Telegram Channel

add outbound incident notifications
add approval workflow
add simple command handlers

Phase 3: Agent + MCP Integration

expose MCP servers for approved tools
add agent summarization and investigation mode
add evidence-aware operator Q&A

Phase 4: Expanded Remediation

OCI node recycle
Helm drift reconciliation
config drift correction
richer release recovery workflows

Open Questions

Should Robot SRE live in the backend process or its own deployable from day one?
Should approvals be required in demo environments for node recycle, or only in staging/prod?
Should Telegram be the only MVP chat channel?
How much authority should the agent have to propose Helm changes vs execute them?
Do we want the robot to reason over historical incidents for pattern detection in v1, or wait for v2?
How much of the application-service taxonomy should ship in v1 vs v1.5?
Should operator-defined rules be stored in system_configs first, or in dedicated ops tables from day one?

Recommended MVP Decisions

Start with a deterministic remediation engine inside the backend runtime.
Reuse the existing watcher/process health patterns already present in backend/cmd/server/main.go.
Use MCP for tool integration boundaries, not as the policy engine.
Use an agent runtime for operator conversation and evidence summarization only.
Ship Telegram first.
Treat WhatsApp as a later channel adapter.
Restrict v1 automatic actions to low-risk containment and cleanup.
Require explicit operator approval for node recycle, release reconciliation, and config mutation.

Immediate Next Deliverables

Incident taxonomy and policy matrix.
Data model and API design for incidents, actions, approvals, and operator-defined rules.
MCP server interface definitions for Kubernetes, OCI, config, release, observability, and chat.
Telegram bot interaction design.
V1 implementation plan for node_disk_pressure, runtime_dependency_failure, and first application-service incidents.