Robot SRE Log Intelligence And Incident Detection
Purpose
Define how logs should be ingested, analyzed, and turned into structured incident signals for the Robot SRE / Ops Persona.
This document answers three questions:
- How should Image Factory ingest logs on a very small OKE cluster?
- How should NATS fit into the design?
- How should log analytics feed Robot SRE without making remediation unsafe?
Executive Recommendation
For the current cluster footprint, the best starting design is:
- Grafana Loki in monolithic mode
- Grafana Alloy as the log collector
- NATS JetStream for structured findings, incident events, approvals, and remediation workflow messages
- Robot SRE consuming both:
  - direct cluster/runtime signals
  - structured log findings
Recommended v1 architecture
- `Alloy` collects pod logs and Kubernetes events
- `Loki` stores and indexes logs
- a log detector service evaluates rules and anomalies
- NATS JetStream carries normalized findings and workflow events
- Robot SRE consumes findings and decides whether to:
  - notify
  - correlate
  - request approval
  - execute bounded remediation
What not to do in v1
- do not stream raw logs through NATS
- do not deploy distributed Loki on this cluster
- do not let an LLM consume raw log firehoses directly and autonomously execute actions
Current Cluster Constraints
The current remote OKE worker capacity is small:
- 2 worker nodes
- allocatable CPU per node: about 1830m
- allocatable memory per node: about 9.6 Gi
- allocatable ephemeral storage per node: about 34.2 GB
This is enough for a modest observability footprint, but not enough for a heavy self-hosted logging platform.
Design implications:
- keep ingestion lightweight
- keep retention short
- avoid running distributed read/write/backend observability stacks
- prefer low-cardinality labels
- prefer structured findings over shipping raw logs into workflow systems
Why Loki Fits Best Here
Loki is the best fit for this cluster because:
- it is designed for logs
- it can run in monolithic mode for small deployments
- it works well with Kubernetes log collection
- it avoids the heavier storage and indexing footprint of Elasticsearch/OpenSearch style stacks
- Grafana Alloy integrates cleanly with Kubernetes and Loki
Why Monolithic Loki
Monolithic Loki is the right fit for this environment because:
- it is specifically recommended by Grafana for small deployments / meta-monitoring
- it is much simpler to operate than scalable or microservices mode
- it keeps the footprint manageable for your free-tier cluster
Simple scalable Loki is a bad fit here because:
- it is heavier
- the Helm chart defaults assume a much larger footprint
- it introduces more moving parts than this cluster can comfortably absorb
Why NATS Still Matters
NATS is still a great fit, but for control-plane events, not bulk log transport.
Use NATS JetStream for:
- normalized incident findings
- detection events
- incident lifecycle events
- approval requests and responses
- remediation action requests
- remediation action outcomes
- operator conversation state pointers if useful
Do not use NATS JetStream in v1 for:
- raw pod logs
- high-volume full-text log fanout
Reason:
- raw logs are high-volume and bursty
- JetStream retention and consumers are excellent for workflow/event streams, but using them as the primary raw log store will create unnecessary storage and operational pressure
Proposed Architecture
1. Log Collection Layer
Collector: Grafana Alloy
Use Alloy to collect:
- Kubernetes pod logs
- Kubernetes events
- optionally selected system logs later
Recommended collection mode for v1
Start with Kubernetes API-based pod log collection:
- `loki.source.kubernetes` for pod logs
- `loki.source.kubernetes_events` for Kubernetes events
Why:
- no privileged container required
- no host filesystem mount required
- no root requirement
- no DaemonSet requirement
Tradeoff:
- more API and kubelet traffic than file tailing
For this small cluster, that tradeoff is acceptable and simpler operationally.
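To make the tradeoff concrete, the sketch below shows the mechanism `loki.source.kubernetes` relies on: reading pod logs through the Kubernetes API with client-go. This is a minimal illustration, assuming in-cluster credentials and a hypothetical namespace and pod name, not the collector itself.

```go
// Minimal sketch of API-based pod log collection, the same mechanism
// loki.source.kubernetes uses: logs are read through the Kubernetes API,
// so no host mount, no root, and no DaemonSet are required.
package main

import (
	"context"
	"io"
	"os"

	corev1 "k8s.io/api/core/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
)

func main() {
	cfg, err := rest.InClusterConfig() // API credentials only, no host paths
	if err != nil {
		panic(err)
	}
	clientset := kubernetes.NewForConfigOrDie(cfg)

	// Stream logs for one pod; a real collector discovers pods dynamically.
	// Namespace and pod name here are hypothetical.
	req := clientset.CoreV1().Pods("image-factory").
		GetLogs("backend-0", &corev1.PodLogOptions{Follow: true})
	stream, err := req.Stream(context.Background())
	if err != nil {
		panic(err)
	}
	defer stream.Close()

	io.Copy(os.Stdout, stream) // a collector would push these lines to Loki
}
```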
Optional v2 collection mode
If later you need:
- node logs
- lower kubelet/API overhead
- more complete infrastructure log coverage
then add a DaemonSet-based Alloy profile using file tails on node log paths.
2. Log Storage Layer
Store: Loki monolithic
Recommended v1 properties:
- single replica
- conservative resource requests/limits
- short retention
- object storage only if already available and justified
Storage recommendation
For this environment, start with small local persistence or hostPath, but only if you accept that log history is noncritical.
Better medium-term option:
- back Loki with MinIO only if the extra footprint is acceptable
Because this cluster is resource-constrained, I would keep the first version simple:
- short retention
- low-cost local persistence
- logs treated as operational telemetry, not compliance evidence
3. Log Intelligence Layer
Introduce a log-detector component that consumes Loki query results rather than raw container streams.
Responsibilities:
- periodic rule-based scans
- burst / rate detection
- known-pattern matching
- anomaly grouping
- emit structured findings
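As an illustration of a rule-based scan, the sketch below runs a single LogQL instant query against Loki's HTTP API and reads back the match count; the Loki URL, selector, and pattern are illustrative assumptions.

```go
// Illustrative sketch: one rule-based scan using Loki's instant-query API.
// The Loki URL, selector, and pattern are assumptions for illustration.
package main

import (
	"encoding/json"
	"fmt"
	"net/http"
	"net/url"
)

func main() {
	logql := `sum(count_over_time({namespace="image-factory"} |= "ImagePullBackOff" [10m]))`
	resp, err := http.Get("http://loki:3100/loki/api/v1/query?query=" + url.QueryEscape(logql))
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()

	// Metric queries return a vector; each value is [unix_seconds, "count"].
	var out struct {
		Data struct {
			Result []struct {
				Value [2]any `json:"value"`
			} `json:"result"`
		} `json:"data"`
	}
	if err := json.NewDecoder(resp.Body).Decode(&out); err != nil {
		panic(err)
	}
	if len(out.Data.Result) > 0 {
		fmt.Println("matches in window:", out.Data.Result[0].Value[1])
		// a real detector compares this count to the rule threshold
		// and emits a structured finding (see the model below)
	}
}
```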
Detector output model
Each finding should look like:
- `finding_id`
- `source` (here `logs`)
- `domain`
- `incident_type`
- `severity`
- `confidence`
- `summary`
- `evidence`
- `resource_scope`
- `dedupe_key`
- `observed_at`
For the current backend integration, detectors should publish normalized findings using the event types:
- `sre.detector.finding.observed`
- `sre.detector.finding.recovered`
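A minimal sketch of that contract in Go, assuming JSON payloads and a subject layout that nests the event type under the `ops.findings` stream (the exact subject scheme is an assumption):

```go
// Illustrative sketch: the finding model above as a Go struct, plus its
// publication to JetStream. The subject layout is an assumption.
package detector

import (
	"encoding/json"
	"time"

	"github.com/nats-io/nats.go"
)

type Finding struct {
	FindingID     string    `json:"finding_id"`
	Source        string    `json:"source"` // always "logs" for this detector
	Domain        string    `json:"domain"`
	IncidentType  string    `json:"incident_type"`
	Severity      string    `json:"severity"`
	Confidence    float64   `json:"confidence"`
	Summary       string    `json:"summary"`
	Evidence      []string  `json:"evidence"`
	ResourceScope string    `json:"resource_scope"`
	DedupeKey     string    `json:"dedupe_key"`
	ObservedAt    time.Time `json:"observed_at"`
}

func publishFinding(js nats.JetStreamContext, f Finding) error {
	payload, err := json.Marshal(f)
	if err != nil {
		return err
	}
	// e.g. subject: ops.findings.sre.detector.finding.observed
	_, err = js.Publish("ops.findings.sre.detector.finding.observed", payload)
	return err
}
```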
4. Event Backbone
Backbone: NATS JetStream
Recommended streams:
- `ops.findings`
- `ops.incidents`
- `ops.approvals`
- `ops.remediations`
- `ops.chatops`
Recommended retention usage
- `ops.findings`: limits retention, short age, bounded size
- `ops.incidents`: limits retention, longer age
- `ops.approvals`: limits retention
- `ops.remediations`: limits retention, longer age for audit convenience
NATS should be the message bus for structured control-plane events, not the long-term store of raw log lines.
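As a sketch of what that looks like with the nats.go client (stream name, age, and size bounds are illustrative, not tuned values):

```go
// Illustrative sketch: declaring the findings stream with limits-based
// retention. The URL, stream name, age, and size are untuned assumptions.
package main

import (
	"time"

	"github.com/nats-io/nats.go"
)

func main() {
	nc, err := nats.Connect("nats://nats:4222")
	if err != nil {
		panic(err)
	}
	defer nc.Drain()

	js, err := nc.JetStream()
	if err != nil {
		panic(err)
	}

	// limits retention with a short age and bounded size for high-churn findings
	_, err = js.AddStream(&nats.StreamConfig{
		Name:      "ops_findings",
		Subjects:  []string{"ops.findings.>"},
		Retention: nats.LimitsPolicy,
		MaxAge:    48 * time.Hour,
		MaxBytes:  256 << 20, // 256 MiB cap to protect the small cluster
	})
	if err != nil {
		panic(err)
	}
	// ops.incidents and ops.remediations would use a longer MaxAge for audit.
}
```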
5. Robot SRE Consumption Layer
Robot SRE should consume:
- Kubernetes and OCI signals directly
- app runtime health directly
- `ops.findings` from the log detector
Then it should:
- correlate findings with other signals
- open or update incidents
- consult policy
- notify operators
- request approval
- execute bounded remediation
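A minimal consumption sketch, assuming the nats.go client and a durable consumer named `robot-sre` (both the durable name and subject are assumptions):

```go
// Illustrative sketch: consuming findings with a durable consumer and
// manual acks, so unprocessed findings survive Robot SRE restarts.
package main

import (
	"encoding/json"
	"log"

	"github.com/nats-io/nats.go"
)

func main() {
	nc, err := nats.Connect("nats://nats:4222")
	if err != nil {
		log.Fatal(err)
	}
	defer nc.Drain()

	js, err := nc.JetStream()
	if err != nil {
		log.Fatal(err)
	}

	_, err = js.Subscribe("ops.findings.>", func(msg *nats.Msg) {
		var finding map[string]any
		if err := json.Unmarshal(msg.Data, &finding); err != nil {
			msg.Term() // malformed finding: do not redeliver
			return
		}
		// correlate, open or update an incident, consult policy, notify...
		msg.Ack()
	}, nats.Durable("robot-sre"), nats.ManualAck())
	if err != nil {
		log.Fatal(err)
	}

	select {} // keep consuming
}
```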
Log Analytics Model
Rule Classes
A. Signature Rules
Known error patterns with high operational value.
Examples:
- `toomanyrequests`
- `ImagePullBackOff`
- `FreeDiskSpaceFailed`
- `EvictionThresholdMet`
- `nats: no servers available for connection`
- `LDAP Result Code 200`
- `dial tcp ... i/o timeout`
- `manifest unknown`
- `x509`
These should be the MVP.
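As a sketch, the MVP can encode these as literal substring patterns mapped to incident types; the patterns come from the list above, while the incident type names are illustrative assumptions.

```go
package detector

// Illustrative sketch: signature rules as literal substring patterns mapped
// to incident types. The patterns come from the list above; the incident
// type names are assumptions.
var signatures = map[string]string{
	"toomanyrequests":                           "registry_rate_limited",
	"ImagePullBackOff":                          "image_pull_failure",
	"FreeDiskSpaceFailed":                       "node_disk_pressure",
	"EvictionThresholdMet":                      "node_disk_pressure",
	"nats: no servers available for connection": "nats_unreachable",
	"LDAP Result Code 200":                      "identity_provider_unreachable",
	"i/o timeout":                               "network_timeout",
	"manifest unknown":                          "image_manifest_missing",
	"x509":                                      "tls_certificate_error",
}
```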
B. Rate / Spike Rules
Pattern counts or changes over time.
Examples:
- login failures > threshold in 5m
- backend 5xx spike
- repeated crash-loop stack traces
- sudden increase in image pull errors
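These can be encoded the same way as the signature rules, as LogQL range aggregations plus thresholds; selectors and threshold values below are illustrative assumptions.

```go
package detector

// Illustrative sketch: rate rules as LogQL range aggregations plus
// thresholds, evaluated through the same Loki query loop as above.
// Selectors and threshold values are assumptions.
type RateRule struct {
	Name      string
	LogQL     string  // evaluated via /loki/api/v1/query
	Threshold float64 // fire when the query result >= Threshold
}

var rateRules = []RateRule{
	{"login-failure-spike",
		`sum(count_over_time({namespace="image-factory"} |= "login failed" [5m]))`, 20},
	{"backend-5xx-spike",
		`sum(count_over_time({namespace="image-factory", component="backend"} |= "status=5" [5m]))`, 50},
	{"image-pull-error-increase",
		`sum(count_over_time({namespace="image-factory"} |= "ErrImagePull" [10m]))`, 10},
}
```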
C. Correlation Rules
Combine multiple signal classes.
Examples:
- LDAP timeout logs + reachable GLAuth service + stale stored LDAP host
- NATS connection failures + NATS pod unavailable
- disk pressure + completed pod buildup + image pull retries
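In code, a correlation rule reduces to a predicate over a finding plus live state; the sketch below models the first example, with all type and field names assumed for illustration.

```go
package detector

// Illustrative sketch: the first correlation example as a predicate over a
// log finding plus live cluster and config state. All names are assumptions.
type ClusterView struct {
	GLAuthReachable bool   // service answers health checks
	StoredLDAPHost  string // host recorded in system config
	LiveLDAPHost    string // host the service currently resolves to
}

func staleLDAPConfig(incidentType string, view ClusterView) bool {
	return incidentType == "identity_provider_unreachable" &&
		view.GLAuthReachable &&
		view.StoredLDAPHost != view.LiveLDAPHost
}
```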
D. LLM-Assisted Summaries
Use an LLM after detection to:
- summarize clustered log evidence
- explain likely root cause
- draft operator messages
Do not use LLM inference alone as the trigger for disruptive remediation.
Recommended MVP Detection Rules
Infrastructure
- disk pressure signature detection
- kubelet storage eviction signature detection
Runtime Services
- Redis/NATS/MinIO/registry/GLAuth connection failures
- dependency health endpoint failures
Application Services
- backend error spike
- frontend/API login failure spike
- dispatcher crash-loop with repeated same root cause
Identity / Security
- LDAP bind/search timeout spike
- stale auth-provider host mismatch
Release / Configuration
- image pull forbidden / manifest unknown
- release reconcile conflicts
Labeling And Cardinality Guidance
Keep Loki labels intentionally small.
Good labels:
- `namespace`
- `app`
- `component`
- `container`
- `pod` (only if needed for short-retention troubleshooting)
- `cluster`
- `environment`
Avoid high-cardinality labels for:
- request id
- user id
- incident id
- stack trace fragments
Those should stay in log payload, not labels.
Retention Guidance
For this cluster, keep retention modest.
Suggested starting point:
- 3 to 7 days in-cluster
That is enough for:
- active troubleshooting
- incident correlation
- rule tuning
If you later need long-term retention:
- archive structured incidents and remediation records in app storage
- optionally push logs to an external Loki/Grafana Cloud later
Resource Guidance
For this cluster size, prefer:
- Loki monolithic single replica
- Alloy single deployment initially
- small CPU/memory requests
- bounded retention
Avoid in v1:
- distributed Loki
- full-text heavy analytics stack
- multi-replica observability control plane
- shipping every raw log twice
Recommended Ingestion Strategy
Best v1 choice
Alloy -> Loki
This should be the default ingestion path.
Why:
- simplest
- low moving-part count
- native Kubernetes support
- easy path to later Grafana dashboards
Best use of NATS
Detector -> NATS JetStream -> Robot SRE
This should be the control/event path.
Why:
- durable workflow events
- replayable findings
- decouples detection from remediation
- fits your existing platform direction
Not recommended in v1
Apps -> NATS -> log processor -> store raw logs
Why not:
- too much event volume
- more custom code
- more retention complexity
- less value than using Loki directly
Operator-Defined Log Rules
The admin UI should eventually let operators add log-driven rules.
Allowed customizations
- match patterns
- threshold windows
- severity mapping
- routing
- suppression windows
- correlation hints
Disallowed customizations
- arbitrary code execution
- direct raw shell actions
- changing destructive policies into auto-remediation
Example custom rule
- name: LDAP timeout spike
- source: logs
- selector: namespace=image-factory, component=backend
- match: `LDAP Result Code 200` OR `failed to connect to LDAP`
- threshold: >= 5 in 10m
- severity: warning
- emit incident type: identity_provider_unreachable
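On the backend, a rule like this can be validated into a closed, declarative schema with no executable fields; a sketch with assumed field names mirroring the example above:

```go
package rules

// Illustrative sketch: an operator-defined rule as a closed, declarative
// schema. Field names mirror the example above and are assumptions; there
// is deliberately no field that can carry code or shell commands.
type OperatorRule struct {
	Name         string   `json:"name"`
	Source       string   `json:"source"`             // "logs" only in v1
	Selector     string   `json:"selector"`           // e.g. namespace=image-factory, component=backend
	Match        []string `json:"match"`              // OR-ed literal patterns
	Threshold    string   `json:"threshold"`          // e.g. ">= 5 in 10m"
	Severity     string   `json:"severity"`           // maps to severity levels above
	IncidentType string   `json:"emit_incident_type"` // e.g. identity_provider_unreachable
	SuppressFor  string   `json:"suppression_window,omitempty"`
}
```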
Data Flow
- Pod emits log line.
- Alloy collects log.
- Alloy pushes to Loki.
- Detector queries recent Loki windows.
- Detector emits structured finding to NATS JetStream.
- Robot SRE consumes finding.
- Robot correlates with runtime/Kubernetes/OCI state.
- Policy engine determines:
- ignore
- notify
- ask approval
- execute bounded action
- Outcome is published back to NATS and persisted in incident/action tables.
Suggested MVP Components
In-Cluster
- `loki`
- `alloy`
- `robot-sre-detector`
- existing `nats`
- existing backend-based Robot SRE incident engine
Existing Systems Reused
- Image Factory notifications
- process health store
- system config
- admin APIs and UI
Suggested Rollout Plan
Phase 1
- deploy Loki monolithic
- deploy Alloy for pod logs + k8s events
- create 10-15 high-value signature rules
- emit findings into NATS
- display findings in admin UI
Phase 2
- correlate findings into incidents
- add Telegram operator alerts
- add approval workflow
Phase 3
- add anomaly/spike detection
- add operator-defined log rules
- add LLM-generated incident explanations
Phase 4
- consider external Loki/Grafana Cloud if retention or query needs outgrow cluster
- consider DaemonSet Alloy for node/system logs
Recommendation Summary
For your current cluster:
- use `Loki` for raw log ingestion and storage
- use `Alloy` to collect logs
- use `NATS JetStream` for structured findings and workflow events
- use the Robot SRE policy engine as the only authority for remediation
This gives you:
- lightweight ingestion
- searchable logs
- event-driven automation
- a clean path to agent-assisted incident reasoning
without overloading the small free-tier cluster.