Deploying Azure SRE Agent for AKS and Drasi: A Hands-On Operational Blueprint

Date: 2026-05-09

Explore a practical walkthrough of deploying Azure SRE Agent for AKS and Drasi with AZD—custom agents, response plans, fault injection, and how it untangles complex reliability issues.

Tags: ["Azure SRE Agent", "AKS", "Drasi", "Azure Developer CLI", "Reliability", "Incident Response"]

Managing reliability across complex cloud-native architectures like Azure Kubernetes Service (AKS) and application workloads such as Drasi requires not only monitoring but also structured, automated incident response. Traditional portal-based troubleshooting can be fragmented and reactive. What if you could deploy an operations platform that autonomously triages issues, collects targeted evidence, reasons through root causes, and carefully proposes or enacts remediation—all governed by scoped permissions and rigorous review gates?

In this post, we dive deep into the deployment and operation of an Azure SRE Agent tailored for AKS and Drasi workloads, driven entirely by the Azure Developer CLI (azd). This blueprint integrates custom agents, specialized skills, response plans, scheduled health probes, and synthetic failure injections. It showcases how to operationalize SRE principles with automation, yet maintain safety and contextual awareness to avoid the classic trap of "restart everything" automation.

You’ll learn how this setup tackles the ambiguous boundary between platform-level Kubernetes issues and application runtime faults in Drasi, the lessons learned from real-world incident testing, and how to iterate your SRE Agents toward deeper operational excellence.

Architecture Overview

┌─────────────────────────────────────────────┐
│             Azure Kubernetes Service (AKS) │
├─────────────────────────────────────────────┤
│  • Cluster Nodes & Workloads                 │
│  • Metrics & Logs Collection (DCR, DCRA)    │
│  • Admission Webhooks & Autoscaler           │
└─────────────────────────────────────────────┘
                  ↓
┌─────────────────────────────────────────────┐
│                Drasi Application             │
├─────────────────────────────────────────────┤
│  • Change-Driven Architecture Runtime        │
│  • Sources, Continuous Queries, Reactions    │
│  • Dapr sidecar, Redis, Mongo connectivity   │
└─────────────────────────────────────────────┘
                  ↓
┌─────────────────────────────────────────────┐
│                Azure SRE Agent Platform      │
├─────────────────────────────────────────────┤
│  • Custom Agents (Triage, AKS, Drasi, Review)│
│  • Skills & Evidence Bundles                  │
│  • Azure Monitor Integration & Response Plans│
│  • Scheduled Tasks & Fault Injection         │
│  • Managed Identities & Scoped RBAC           │
└─────────────────────────────────────────────┘
                  ↓
┌─────────────────────────────────────────────┐
│                Operators & Incident Response │
├─────────────────────────────────────────────┤
│  • Portal Session Insights & Telemetry       │
│  • Autonomous Actions & Manual Approvals      │
│  • Runbook Version Control & Continuous Tune │
└─────────────────────────────────────────────┘

Azure SRE Agent Operations Hub
Operations Hub showing the integrated AZD workflow and agent interactions. Source: luke.geek.nz

Key Technical Observations

Phase-Based Incident Routing — Incidents route first by failure phase before product domain, isolating platform issues (e.g., pod scheduling) from application runtime faults (e.g., Drasi query staleness). This reduces noise and increases accuracy of diagnostics.
Custom Multi-Agent Design — Four distinct custom agents specialize in triage, AKS platform diagnostics, Drasi runtime diagnostics, and remediation review. This separation ensures no agent overreaches and context-specific expertise guides remediation decisions.
Safe Autonomy with Scoped Permissions — Autonomous remediation is selectively enabled only for low-risk operations like starting a stopped AKS cluster, preserving guardrails and manual review for complex or risky changes (e.g., cluster upgrades or network changes).
Evidence-Driven Skills and Runbooks — Custom skills have narrowly scoped evidence bundles, improving diagnostic precision and reducing tooling overhead. For example, the Drasi runtime skill always validates source connection health before evaluating continuous query status.
Connector Configuration Nuance — MCP connectors can show as healthy but remain disabled for tool assignment. Explicitly enabling agent tools for Microsoft Learn and Drasi docs connectors is necessary for live documentation lookup during investigations.
Fault Injection and Synthetic Alerting — Deploying synthetic metric alerts allows validation of response plans and routing logic without harming production workloads, but they must be carefully controlled to avoid excessive run costs and alert fatigue.

How It Works

1. Deployment with Azure Developer CLI (`azd`)

The entire infrastructure plus SRE Agent configuration is deployed using an AVM-style Bicep module and PowerShell scripts triggered by azd up. The project repo structure is modular:

drasi-aks-sre-agent/
├── infra/
│   ├── main.bicep
│   ├── drasi-sre-agent.bicep
│   └── drasi-sre-agent-rbac.bicep
├── avm/
│   └── res/app/agent/main.bicep
├── scripts/
│   └── setup-sre-agent.ps1
├── sre-config/
│   ├── agents/
│   ├── skills/
│   ├── response-plans/
│   ├── scheduled-tasks/
│   └── testing/
└── azure.yaml

The deployment flow:

git clone https://github.com/lukemurraynz/drasi-aks-sre-agent.git
cd drasi-aks-sre-agent
azd auth login
az login
azd env new drasi-sre-dev
azd env set DRASI_RESOURCE_GROUP_NAME <drasi-resource-group>
azd env set DRASI_AKS_CLUSTER_NAME <aks-cluster-name>
azd env set DRASI_LOG_ANALYTICS_WORKSPACE_NAME <workspace-name>
azd env set AZURE_RESOURCE_GROUP <agent-resource-group>
azd env set AZURE_SRE_AGENT_NAME <agent-name>
azd up

After provisioning core resources such as managed identities, Application Insights, Log Analytics, and the Azure SRE Agent infrastructure, the setup script applies the data-plane configuration via the SRE Agent API. This step provision custom agents, attach skills, enable connectors, and apply response plans.

This two-step approach copes with current platform API portability limitations and keeps infrastructure and operational content versioned and repeatable.

2. Incident Handling and Routing

On Azure Monitor incident detection, alerts with embedded route IDs trigger the Azure SRE Agent. The first responder agent (drasi-incident-triage) classifies the alert by failure phase and reroutes it to the appropriate specialist agent:

Failure Phase	Preferred Route
Pod creation fails	Admission webhook, workload identity, policy, API server
Pod is pending	Scheduler, node capacity, autoscaler, subnet, resource quota
HPA/KEDA metrics blindness	Metrics API or external metrics API
Broad `kubectl` or controller timeouts	API server, konnectivity tunnel, node/network health
Only Drasi resources unhealthy after changes	Drasi lifecycle diagnostics

Example response plan snippet enabling autonomy for cluster start:

{
  "id": "aks-cluster-stopped",
  "handlingAgent": "aks-platform-diagnostics",
  "agentMode": "autonomous"
}

Autonomous remediation limits prevent risky actions such as scaling node pools or upgrading clusters without human review.

3. Skills and Evidence Gathering

Skills codify runbook logic. Examples include:

Skill	Attached To	Evidence Bundle
aks-platform-diagnostics	aks-platform-diagnostics agent	Node status, pod events, admission webhook, metrics API health, konnectivity, SNAT stats
drasi-runtime-diagnostics	drasi-runtime-diagnostics agent	Drasi source and query status, Dapr sidecar health, Redis/Mongo connectivity, logs
drasi-remediation-review	drasi-remediation-review agent	Checklist of evidence completeness, risk classification, rollback path, validations

Narrow evidence bundles allow quick reasoning and focused troubleshooting steps. For instance, the Drasi runtime skill always verifies source connectivity prior to analyzing query anomalies, optimizing response time and accuracy.

4. Observability & Session Insights

Every investigation is recorded as a session in the Azure portal. The session details include:

Alert metadata and triggering conditions
Response plan and agent sequence
Detailed tool calls (kubectl, Log Analytics queries, REST API calls)
Collected evidence and reasoning trace
Proposed or enacted remediation steps
Structured Kepner-Tregoe IS/IS NOT problem tables forcing clarity on what is not faulty

Azure SRE Agent - Session Insights
Session Insights help improve agent accuracy over time by uncovering gaps and redundant calls.

Querying the agent's own telemetry via Application Insights helps spot slow or failed skill invocations not obvious in session views:

dependencies
| where cloud_RoleName == "sre-agent"
| where timestamp > ago(1h)
| project timestamp, name, duration, success
| order by timestamp desc

5. Scheduled Health Checks and Fault Injection

Proactive monitoring through scheduled natural-language tasks automatically produces health summaries and resilience reports:

Task	Purpose
drasi-health-probe-15m	Recurring AKS and Drasi health probe
drasi-daily-resilience-report	Daily operational risk and resilience summary

Synthetic alerting validates response plan routes without causing real harm:

az monitor metrics alert create \
  --resource-group <resource-group> \
  --name sre-e2e-aks-admission-webhook-failure \
  --scopes <aks-cluster-resource-id> \
  --description "Synthetic route validation. Expected route: aks-admission-webhook-failure" \
  --severity 3 \
  --evaluation-frequency 1m \
  --window-size 5m \
  --condition "avg kube_node_status_allocatable_cpu_cores > 0" \
  --action <sre-agent-action-group-resource-id> \
  --auto-mitigate false

⚠️ Always disable synthetic alerts after validation to avoid burning tokens and increasing alert fatigue.

Synthetic Incidents Validation
Simulated failure injections validate end-to-end routing and agent response.

6. Real Incident Discovery: Container Insights Pipeline Failure

Testing revealed a broken AKS telemetry pipeline: ama-logs pods ran, but no recent data arrived in crucial logs like KubePodInventory and ContainerLogV2. The root cause was missing Data Collection Rules (DCR) or DCR associations, resulting in blind monitoring.

Alert query detecting it:

KubePodInventory
| where TimeGenerated > ago(30m)
| summarize CurrentRows=count()
| where CurrentRows == 0

This routes to aks-monitoring-agent-fault, triggering human review before rerunning onboarding steps. It highlighted how platform-level diagnostics must precede application troubleshooting to avoid costly misdiagnosis.

Container Insights Incident
Agent diagnosing and proposing fix for missing Container Insights telemetry.

7. Drasi-Specific Failure Handling

Drasi has domain-specific failure modes, such as the drasi-source-bootstrap-race, where a Continuous Query is created before its Source connects properly. The remediation is narrowly scoped:

Confirm Source health
Inspect Continuous Query status and provider logs
Delete and recreate the problematic Continuous Query only

No cluster restarts or broad changes are triggered.

Drasi Source Fix
Targeted remediation for Drasi source-bootstrap race condition.

Quick Tips & Tricks

Route Incidents by Failure Phase First
Establish routing logic that classifies alerts by lifecycle phase (creation, pending, metrics blind) before product-specific diagnosis to triage efficiently.
Enable Agent Tools Explicitly for Connectors
Connector health alone does not enable tools. Use Enable-AgentTools to assign tools to agents and ensure live documentation and code search is accessible during investigations.
Limit Autonomous Actions to Low-Risk Remediations
Restrict autonomy to operations like cluster start/stop. Escalate complex or unsafe actions to human reviewers.
Use Sessions to Continuously Improve Skills
Review session traces after incidents to identify wasted tool calls, unexpected routing, or missing evidence and update skills accordingly.
Deploy Synthetic Alerts Temporarily for Validation
Use synthetic alerting to confirm routing and skill invocation without impacting production, then delete to prevent alert fatigue.
Monitor Observability Pipelines Separately
Add explicit alerts for telemetry pipeline health to avoid blind spots that hamper incident resolution efforts.

Conclusion

This Azure SRE Agent blueprint for AKS and Drasi demonstrates how tightly integrated automation, structured runbooks, and scoped autonomy can improve reliability without compromising safety. By separating concerns between platform and application workloads, routing incidents by failure phase, and layering domain-specific diagnostics, the agent streamlines incident triage and response across complex containerized ecosystems.

Lessons from synthetic testing and real incident discovery reinforce the importance of observability pipeline health and careful management of connector tooling. The blueprint’s modular, versionable deployment with Azure Developer CLI provides a repeatable foundation for operational excellence.

The future of SRE Agents lies in embracing them as operational platforms with deliberate guardrails, continuous tuning via session insights, and controlled autonomous actions. This approach ensures they remain boring—in the best possible way—and reliable aides rather than black-box chatbots.

References

Running Azure SRE Agent for AKS and Drasi Operations — Original detailed walkthrough by Luke Murray
Azure SRE Agent Overview — Official Microsoft Documentation
Azure Kubernetes Service (AKS) Monitoring — Monitoring and Diagnostics
Azure Developer CLI (azd) azd up Workflow — Deploying cloud resources with azd
Drasi Documentation — Change-Driven Architecture runtime reference
Troubleshoot Container Log Collection — Diagnosing telemetry ingestion problems