Deploying LiteLLM on AKS with azd and Bicep for a Secure, Scalable LLM Gateway

Date: 2026-06-13

Discover how to self-host LiteLLM on Azure Kubernetes Service using azd and Bicep — complete with private networking, Redis caching, spend tracking, and automatic TLS.

Tags: ["Azure", "AKS", "LiteLLM", "Bicep", "azd", "Kubernetes", "OpenAI", "Redis", "PostgreSQL"]

Self-hosting large language model (LLM) proxies can quickly become complex when you factor in operational security, cost management, and scalability. APIs from multiple LLM providers introduce demanding requirements for authentication, caching, and usage tracking — all while maintaining high availability.

LiteLLM is an open-source LLM proxy that simplifies this by routing requests through a single OpenAI-compatible endpoint to a wide range of supported backends, including Azure OpenAI, Anthropic, and over 100 providers. Having one unified gateway dramatically improves control over API keys, usage budgets, and caching.

In this post, we'll explore how Luke Murray successfully deployed LiteLLM on Azure Kubernetes Service (AKS) using Azure Developer CLI (azd) coupled with Bicep Infrastructure-as-Code. This approach provides a production-ready, self-hosted LLM gateway featuring private networking, Redis caching for cross-pod scalability, PostgreSQL-based spend tracking, and automated TLS certificate management — all wrapped in a single command deployment experience.

We'll cover the overall architecture, the detailed network design, key configuration points, how multi-replica caching behaves, and important AKS production best practices derived from the implementation.

Architecture Overview

┌────────────────────────────────────────────┐
│Architecture                                │
├────────────────────────────────────────────┤
│• Enterprise data sources                   │
│• Foundry platform                          │
│• AI applications                           │
└────────────────────────────────────────────┘

Key Technical Observations

Comprehensive Private Networking: Every critical data service—including PostgreSQL, Redis, ACR, and Key Vault—is isolated via Azure Private Endpoint within dedicated subnets. This design eliminates public IP exposure and strengthens security posture.
Azure CNI Overlay with Pod-Level Network Policy: The AKS cluster uses Azure CNI Overlay networking and granular Azure Network Policies to segment pods, achieving isolation and compliance at network layers.
Split-DNS CoreDNS Patch: CoreDNS is patched post-provisioning to resolve Azure private DNS zones for private endpoints internally while forwarding all other DNS queries to a public resolver (8.8.8.8). This hybrid DNS ensures cert-manager's Let's Encrypt HTTP-01 challenges function reliably alongside private DNS resolution.
Redis-Backed Distributed Cache Across Replicas: Azure Managed Redis supports caching in a multi-replica environment. Cache misses populate Redis from one pod, and subsequent requests served by any other pod get the cached response, achieving cross-pod cache consistency.
Production Best Practices from LiteLLM Documentation: Configurations such as batching spend writes every 60 seconds, controlling DB connection pool to 10 per pod, and serving requests during database unavailability improve operational stability and reduce PostgreSQL load.
Robust Rolling Updates with Graceful Pod Shutdown: Using Kubernetes deployment strategies with maxUnavailable: 0 and long termination grace periods ensures zero downtime during upgrades, respecting in-flight request timeouts.

How It Works

Infrastructure Provisioning

The deployment leverages an azd lifecycle which automates the entire provisioning and deployment process:

Preprovision Hooks: Generate random secrets for PostgreSQL credentials, LiteLLM master keys, and salt keys, ensuring strong security defaults. Also installs necessary tooling like kustomize if missing.
Provisioning Bicep Template: Deploys all resources in a clean resource group, including AKS cluster with system and user node pools, Azure Container Registry, PostgreSQL Flexible Server, Managed Redis, Key Vault, Azure OpenAI, NAT Gateway, and private DNS zones linked to the VNet.
Postprovision Configuration: Retrieves AKS cluster credentials, applies a CoreDNS split-DNS patch to handle mixed public and private DNS resolution, and deploys Kubernetes manifests with kustomize, including the LiteLLM proxy and ingress controller.
Postdeploy Operations: Refreshes Kubernetes secrets with current connection strings, triggers proxy rollout, and synchronizes DNS A records for the public ingress IP.

This pipeline enables a "single-command" deployment experience with azd up, taking around 10-15 minutes for the full environment.

Network Design Details

The virtual network is partitioned into:

snet-aks (10.30.0.0/23) for AKS nodes (both system and user pools).
snet-pe (10.30.2.0/24) hosting all private endpoints for PostgreSQL, Redis, ACR, and Key Vault.
snet-ingress (10.30.3.0/24) reserved for the NGINX ingress controller.

Outbound connectivity uses a NAT Gateway attached to the AKS subnet with a dedicated public IP to avoid SNAT exhaustion, critical at scale.

DNS zones like privatelink.postgres.database.azure.com, privatelink.redis.azure.net, privatelink.azurecr.io, and privatelink.vaultcore.azure.net are linked to the VNet, enabling pods to resolve private endpoint IPs transparently.

LiteLLM Proxy Configuration

The proxy config (managed via Kubernetes ConfigMap) specifies:

The list of routed models (Azure OpenAI GPT-4o, OpenCode Zen & Go models).
Authentication via a master key with virtual API keys scoped to individual consumers.
Redis caching parameters with TLS-enabled connections.
Spend tracking backed by PostgreSQL with batching to limit DB write load.
Enabling traffic through even if DB connection is temporarily unavailable.

model_list:
  - model_name: azure-gpt-4o
    litellm_params:
      model: azure/gpt-4o
      api_base: os.environ/AZURE_OPENAI_ENDPOINT
      api_key: os.environ/AZURE_OPENAI_KEY
      api_version: "2024-10-21"
general_settings:
  master_key: os.environ/LITELLM_MASTER_KEY
  database_url: os.environ/DATABASE_URL
  database_connection_pool_limit: 10
  proxy_batch_write_at: 60
  allow_requests_on_db_unavailable: true
litellm_settings:
  cache: true
  cache_params:
    type: redis
    host: os.environ/REDIS_HOST
    port: os.environ/REDIS_PORT
    password: os.environ/REDIS_PASSWORD
    ssl: true

Adding non-Azure models, like OpenCode Zen and Go, is straightforward, exposing their OpenAI-compatible endpoints under the same proxy umbrella — centralized authentication, routing, and caching.

Multi-Replica and Caching Behavior

Running multiple LiteLLM pods behind the NGINX ingress controller enables high availability and load balancing.

The Redis cache is shared across pods, ensuring that if one pod misses a cache entry, it is written to Redis so that subsequent requests handled by any pod get a cache hit.

A basic test showed:

Pod A cache miss: 1.040s
Pod B cache hit: 0.636s

This behavior confirms correct cross-pod cache behavior using Azure Managed Redis.

AKS Pod Configuration for Production

Rolling Updates: maxUnavailable: 0 and maxSurge: 1 ensure pods never drop below the desired count during updates.
Readiness & Liveness Probes: Readiness probes delay 30 seconds to allow Prisma migrations before accepting traffic, and liveness probes restart pods if unhealthy after three failures.
Graceful Shutdown: A 620-second termination grace period with a 5-second preStop delay allows in-flight requests to finish cleanly before pod termination.
Security Hardening: Containers run with readOnlyRootFilesystem: true, drop all capabilities, and run as non-root. Writable directories are mounted via emptyDir to satisfy Prisma and UI requirements.

Quick Tips & Tricks

Use Azure CNI Overlay with Network Policies — This enables pod-level network segmentation and private IP assignments conforming to enterprise security standards.
Patch CoreDNS for Split DNS Resolution — Prevent failures in challenge validations by routing private zone queries internally and public zone queries to external DNS like 8.8.8.8.
Batch PostgreSQL Writes to Reduce Load — Group spend tracking updates into fixed intervals (proxy_batch_write_at) to avoid excessive frequent writes.
Leverage Managed Redis for Cache Consistency Across Replicas — A centralized Redis cache backing multiple pods guarantees cache hits regardless of request routing.
Set allow_requests_on_db_unavailable: true in Production — Enables high availability by allowing LiteLLM to continue serving requests even if the DB spikes or briefly disconnects.
Use Virtual Keys for Fine-Grained Access Control — Instead of exposing multiple upstream API keys, distribute scoped virtual keys via LiteLLM to enforce budgets and permitted models per user/team.

Conclusion

Deploying LiteLLM on AKS with azd and Bicep delivers a powerful, self-hosted, and production-grade LLM gateway. This approach tightly integrates Azure-managed infrastructure with Kubernetes best practices to meet requirements for security, scalability, cost control, and operational visibility.

Private endpoints ensure zero public exposure to backend data services while the NGINX ingress with cert-manager automates TLS certificates for client-facing access. Redis caching combined with multi-pod replicas enables responsiveness and high availability. Spend tracking via PostgreSQL adds crucial cost management and governance.

This fully automated, IaC-driven deployment path lowers the barrier to adopting LiteLLM within enterprises or teams needing centralized control of heterogeneous LLM providers. As LiteLLM continues evolving with features like MCP Gateway integration, this foundation opens paths for tight integrations with Microsoft Learn and GitHub servers — turning the proxy into a comprehensive AI operations hub.

References

Running LiteLLM on AKS with azd and Bicep | luke.geek.nz — Primary source blog post by Luke Murray
LiteLLM Documentation — Official docs covering configuration and best practices
LiteLLM Production Best Practices — Performance and operational configuration insights
Azure Developer CLI — Azure CLI tooling used for deployment
AKS Network Concepts and Security — Deep dive into AKS networking and policies
HTTPS Ingress on AKS with cert-manager — Setup of TLS with ingress controllers
Azure Private Link and Private Endpoints — Overview of private connectivity
Azure Cache for Redis with Private Endpoints — Securing managed Redis in VNet environments
PostgreSQL Flexible Server Networking — Private connectivity patterns for PostgreSQL

LiteLLM request flow showing client to LiteLLM proxy through ingress, auth, routing, cache, and provider
Request flow diagram courtesy of Luke Murray

LiteLLM UI walkthrough showing the configured models, MCP servers, and virtual keys
LiteLLM UI configuration demo

Terminal demo showing two AKS LiteLLM pods and a Redis cache hit across replicas
Demonstration of cross-pod Redis cache hit

Terminal demo showing azd deploy completing, LiteLLM pods reaching Ready, and the HPA status
Deployment rollout with azd and HPA scaling

Terminal demo showing LiteLLM virtual key creation, model access, chat completion, and a follow-up note about delete behaviour
Virtual key lifecycle operations

Terminal demo showing Azure GPT-4o, Big Pickle, and a Redis cache miss then hit from the live proxy
Cache miss and hit timing demonstration

Terminal demo showing LiteLLM readiness and 19 configured models returned by the live proxy
Health check and model catalogue retrieval

Article by Luke Murray from luke.geek.nz

azd down --purge

Command to tear down and clean up the entire deployment