Running Quality, Cost, and Latency Evaluations for Microsoft Foundry’s Model Router

Date: 2026-05-19

Discover how to run comprehensive evaluations on Microsoft Foundry’s Model Router with an open-source pipeline that balances quality, cost, and latency for smarter model selection.

Tags: ["Microsoft Foundry", "Model Router", "AI Evaluation", "Open Source", "LLM Benchmarking"]

One of the biggest challenges when integrating multiple large language models (LLMs) in real-time applications is deciding which model to use for each prompt—balancing cost, latency, and quality effectively. Microsoft Foundry’s Model Router addresses this by dynamically selecting the optimal LLM based on task complexity, reasoning needs, and prompt type. This removes much of the manual overhead developers usually face when managing model selection.

But how can you confidently verify that the Model Router truly outperforms using a single model alone? This post walks through a guided approach for running rigorous evaluations (covering quality, cost, latency, and compliance impacts) using an open-source GitHub repo built specifically for routing-aware evaluation of the Model Router.

You’ll learn how the pipeline works under the hood, how to prepare datasets and configurations, and how to interpret evaluation results, enabling you to make data-driven decisions about whether the Model Router fits your production stack.

Architecture Overview

┌─────────────────────────────────────────────┐
│           Enterprise Prompt Data             │
├─────────────────────────────────────────────┤
│  • Developer-provided prompt sets            │
│  • JSONL, CSV, or SQL-based inputs           │
└─────────────────────────────────────────────┘
               ↓
┌─────────────────────────────────────────────┐
│           Foundry Model Router Eval          │
├─────────────────────────────────────────────┤
│  • Router queries deployed models            │
│  • Measures quality (judged by LLM-as-judge)│
│  • Calculates latency and cost per response  │
│  • Supports model subset compliance testing  │
└─────────────────────────────────────────────┘
               ↓
┌─────────────────────────────────────────────┐
│            Reporting & Integration           │
├─────────────────────────────────────────────┤
│  • Generates interactive evaluation dashboard│
│  • Provides diff tools for run comparisons   │
│  • Optional submission to Foundry enterprise │
│    tooling for cloud-graded insights         │
└─────────────────────────────────────────────┘

The pipeline accepts your domain-specific prompts, exercises the Model Router's runtime decisions across the underlying LLMs, and reports quality, latency, and cost for each response, so you can weigh cost, response time, and answer quality against each other objectively.


Key Technical Observations

Router-Aware Cost Calculations — Unlike naïve benchmarking, this pipeline includes router input-prompt markup plus underlying model pricing resolved dynamically per selected model. This ensures real cost-impact transparency.
LLM-as-Judge with Bias Control — Evaluations use a dual-ordered pairwise scoring system where a dedicated judge LLM scores results in a way that cancels out position bias in comparisons, improving scoring fidelity.
Flexible Prompt Input Formats — Support for JSONL, CSV, or SQL-based prompt datasets allows seamless integration with diverse developer workflows and data sources.
Resumable and Scalable Runs — Built-in dry-run validations, incremental resume capabilities, and concurrency settings optimize large evaluation jobs without overrunning rate limits, essential for enterprise-scale testing.
Subset Compliance Testing — The pipeline can simulate locked-down model subsets (e.g., for regulatory compliance), quantifying the impact on quality and costs—a crucial feature for governed AI use cases.
Optional Cloud Hand-Off — Results can be pushed back into Foundry’s cloud portal for enhanced governance, quality grading, and operational visibility, merging local experimentation with enterprise tooling.
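
The dual-ordered pairwise idea from the list above can be illustrated with a short sketch. This is not the repo's implementation: the judge signature, the "first"/"second"/"tie" verdict labels, and the function name debiased_preference are all assumptions made for illustration. The judge is asked twice with the answers swapped, and a preference only counts if it survives the swap.

```python
from typing import Callable

# Hypothetical judge signature: (prompt, first_answer, second_answer) -> verdict,
# where the verdict is "first", "second", or "tie". A real judge would call an LLM.
Judge = Callable[[str, str, str], str]


def debiased_preference(judge: Judge, prompt: str, answer_a: str, answer_b: str) -> str:
    """Return "a", "b", or "tie" after judging the pair in both orders."""
    forward = judge(prompt, answer_a, answer_b)   # answer A shown first
    backward = judge(prompt, answer_b, answer_a)  # answer B shown first

    # Map positional verdicts back to the answers they refer to.
    forward_pick = {"first": "a", "second": "b"}.get(forward, "tie")
    backward_pick = {"first": "b", "second": "a"}.get(backward, "tie")

    # A judge that simply favours one position contradicts itself across
    # the two orderings, so the pair collapses to a tie.
    return forward_pick if forward_pick == backward_pick else "tie"


def always_first(prompt: str, first: str, second: str) -> str:
    """Toy judge with pure position bias: it always prefers the first slot."""
    return "first"


def prefers_longer(prompt: str, first: str, second: str) -> str:
    """Toy judge with a real (if crude) preference: the longer answer wins."""
    if len(first) == len(second):
        return "tie"
    return "first" if len(first) > len(second) else "second"
```

With the toy judges, always_first yields "tie" for any pair, while prefers_longer consistently picks the longer answer regardless of which slot it appears in.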

How It Works

Quick Preview Without API Keys

To explore the pipeline outputs before full setup, developers can run a no-credential demo script providing mock data and a pre-rendered interactive dashboard.

# macOS / Linux
bash scripts/demo.sh

# Windows
.\scripts\demo.ps1

Alternatively, opening the WALKTHROUGH.ipynb Jupyter notebook allows interactive exploration.

Step 1 — Installation

Clone the repo and install dependencies using Python 3.9+:

git clone https://github.com/microsoft/foundry-model-router-autoeval.git
cd foundry-model-router-autoeval
pip install -e ".[dev]"

Note that Model Router deployments are currently supported only in the East US 2 and Sweden Central regions.

Step 2 — Credential Setup

Copy the example environment file and fill in these required credentials:

cp .env.example .env

AZURE_MODEL_ROUTER_* for the router under test
AZURE_OPENAI_* and AZURE_BASELINE_DEPLOYMENT for the baseline model used in comparison
AZURE_JUDGE_* for the model that will act as scorer/judge

Make sure you deploy Claude models separately if you intend to route to them, since the router does not deploy those automatically.
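
Before a long run, it can help to confirm that each credential group is present. The sketch below is a hypothetical helper, not part of the repo. It assumes the variables from .env have already been loaded into the process environment, and it only checks that at least one non-empty variable exists per prefix, because the exact variable names are defined in .env.example rather than here.

```python
import os
from typing import List, Mapping

# Prefix groups and the exact name taken from the list above.
REQUIRED_PREFIXES = ("AZURE_MODEL_ROUTER_", "AZURE_OPENAI_", "AZURE_JUDGE_")
REQUIRED_EXACT = ("AZURE_BASELINE_DEPLOYMENT",)


def missing_credentials(env: Mapping[str, str]) -> List[str]:
    """Return the credential groups that have no non-empty value in env."""
    missing = []
    for prefix in REQUIRED_PREFIXES:
        has_value = any(
            key.startswith(prefix) and value.strip() for key, value in env.items()
        )
        if not has_value:
            missing.append(prefix + "*")
    for name in REQUIRED_EXACT:
        if not env.get(name, "").strip():
            missing.append(name)
    return missing


# Typical use: inspect the current process environment.
# problems = missing_credentials(os.environ)
```

An empty list means every group has at least one value set. It does not prove the values are valid, which is what the dry run in Step 5 is for.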

Step 3 — Configuration Tuning

Edit configs/default.yaml or pick one of the presets like quick_test.yaml or foundry.yaml. Adjust baseline deployment names, judge concurrency, and pricing.

This flexibility allows adapting the evaluation for quick assessments or large-scale enterprise runs.
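
To see why the pricing values matter, here is a rough sketch of a router-aware cost calculation in the spirit of the one described under Key Technical Observations. Everything specific in it is an assumption: the model names, the prices, the markup value, and the idea that the markup is charged per input token are placeholders, not real Azure pricing or the repo's code.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelPrice:
    input_per_million: float   # USD per 1M input tokens
    output_per_million: float  # USD per 1M output tokens


# Placeholder prices for illustration only; substitute your own values.
PRICES = {
    "small-model": ModelPrice(input_per_million=0.50, output_per_million=2.00),
    "large-model": ModelPrice(input_per_million=5.00, output_per_million=20.00),
}

# Assumed router surcharge, applied to input tokens only (placeholder value).
ROUTER_MARKUP_PER_MILLION_INPUT = 0.10


def response_cost(selected_model: str, input_tokens: int, output_tokens: int,
                  via_router: bool) -> float:
    """Cost in USD of one response, priced by whichever model produced it."""
    price = PRICES[selected_model]
    cost = (
        input_tokens * price.input_per_million
        + output_tokens * price.output_per_million
    ) / 1_000_000
    if via_router:
        # The router's markup is added on top of the selected model's price.
        cost += input_tokens * ROUTER_MARKUP_PER_MILLION_INPUT / 1_000_000
    return cost
```

The point of the sketch is that a routed response is priced by the model the router actually selected plus the markup, whereas a baseline response is priced by the fixed baseline model with no markup.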

Step 4 — Prepare Your Prompts

Format prompts as JSONL, CSV, or SQL with at least id and prompt fields. For example:

{"id": "001", "prompt": "Explain quantum entanglement in simple terms."}

To avoid context-exceeded errors, keep prompts within the smallest underlying model’s context window, unless you are explicitly testing model subsets.
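
A quick pre-flight check on a JSONL file can catch malformed records before any API calls are made. The helper below is a minimal sketch, not part of the repo. The name load_prompts is hypothetical, and the uniqueness check on id is an extra assumption rather than a documented requirement.

```python
import json
from typing import Dict, Iterable, List


def load_prompts(lines: Iterable[str]) -> List[Dict]:
    """Parse JSONL lines and check that each record has id and prompt fields."""
    records = []
    seen_ids = set()
    for line_number, line in enumerate(lines, start=1):
        line = line.strip()
        if not line:
            continue  # tolerate blank lines
        record = json.loads(line)  # raises ValueError on invalid JSON
        if not isinstance(record, dict):
            raise ValueError(f"line {line_number}: expected a JSON object")
        for field in ("id", "prompt"):
            value = record.get(field)
            if value is None or not str(value).strip():
                raise ValueError(f"line {line_number}: missing or empty '{field}'")
        if record["id"] in seen_ids:
            # Assumption: ids should be unique so results can be matched up later.
            raise ValueError(f"line {line_number}: duplicate id {record['id']!r}")
        seen_ids.add(record["id"])
        records.append(record)
    return records


# Typical use with a file on disk:
# with open("my_prompts.jsonl", encoding="utf-8") as handle:
#     prompts = load_prompts(handle)
```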

Step 5 — Run the Evaluation

Validate configuration without API calls:

python scripts/run_eval.py --dry-run

Run with your dataset:

python scripts/run_eval.py --dataset my_prompts.jsonl --sample-size 100

If a run is interrupted, resume it from the existing output directory:

python scripts/run_eval.py --resume --output-dir results/my-run
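
The repo's resume mechanism is not described in this post. A common way to implement incremental resume, sketched here purely as an illustration with hypothetical names, is to skip every prompt whose id already has a saved result.

```python
from typing import Dict, Iterable, List


def pending_prompts(prompts: List[Dict], completed_ids: Iterable[str]) -> List[Dict]:
    """Return the prompts that still need to run, preserving their order."""
    done = set(completed_ids)
    return [prompt for prompt in prompts if prompt["id"] not in done]
```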

Be mindful of rate limits: Global Standard deployments default to 250 requests per minute and 250k tokens per minute. Concurrency settings in your YAML config should keep the run within both limits.
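
One rough way to pick a concurrency value is to work backwards from the limits. The sketch below is a back-of-the-envelope estimate, not a repo feature. It assumes you know your average tokens per request and average latency, takes the tighter of the request and token limits, and applies Little's law (requests in flight ≈ throughput × latency) with some headroom. The function name and the 0.8 headroom default are arbitrary choices.

```python
import math


def safe_concurrency(rpm_limit: float, tpm_limit: float,
                     avg_tokens_per_request: float,
                     avg_latency_seconds: float,
                     headroom: float = 0.8) -> int:
    """Estimate how many requests can be in flight without exceeding limits."""
    # The token limit implies its own ceiling on requests per minute.
    requests_per_minute = min(rpm_limit, tpm_limit / avg_tokens_per_request)
    # Leave a safety margin, then convert to requests per second.
    requests_per_second = requests_per_minute * headroom / 60
    # Little's law: concurrency ~= throughput * time each request takes.
    return max(1, math.floor(requests_per_second * avg_latency_seconds))
```

For example, with the default limits above, prompts averaging 2,000 tokens are bounded by the token limit (125 requests per minute) rather than the request limit, so the sustainable concurrency is lower than the 250 RPM figure alone would suggest.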

Step 6 — Analyze Results

In results/<run-name>/, the key artifact is dashboard.html — a self-contained interactive report with charts covering:

Model-selection distribution
Cost vs quality trade-offs
Latency analysis
Composite metrics: quality-per-dollar and quality-per-second

Additional exports include report.md (markdown summary), results.json (machine-readable), and detailed_results.csv with per-prompt routing details, helping you drill down on model behavior.
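
If you want to recompute the headline numbers yourself, something along these lines would work on per-prompt records. The field names (model, quality, cost_usd, latency_s) are hypothetical stand-ins rather than the actual column names in detailed_results.csv, and defining the composite metrics as mean quality divided by mean cost or mean latency is an assumption about how they are computed.

```python
from collections import Counter
from statistics import mean
from typing import Dict, List


def summarize(rows: List[Dict]) -> Dict:
    """Summarize per-prompt results; assumes non-zero mean cost and latency."""
    mean_quality = mean(row["quality"] for row in rows)
    mean_cost = mean(row["cost_usd"] for row in rows)
    mean_latency = mean(row["latency_s"] for row in rows)
    return {
        # How often the router picked each underlying model.
        "model_distribution": dict(Counter(row["model"] for row in rows)),
        "quality_per_dollar": mean_quality / mean_cost,
        "quality_per_second": mean_quality / mean_latency,
    }


# Made-up example records, not real evaluation output.
example_rows = [
    {"model": "small-model", "quality": 4.0, "cost_usd": 0.002, "latency_s": 2.0},
    {"model": "large-model", "quality": 2.0, "cost_usd": 0.002, "latency_s": 1.0},
]
summary = summarize(example_rows)
```

Running the same summary over the router's rows and the baseline's rows gives two sets of numbers that can be compared directly.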

Step 7 (Optional) — Comparing Runs and Submission

Compare two evaluation runs side-by-side:

python scripts/compare_results.py results/run-a results/run-b

Submit results to the Foundry enterprise portal for cloud grading and governance:

pip install -e ".[foundry]"
az login
python scripts/run_foundry_eval.py --input-dir results/full-eval

This optional step connects local testing to cloud-grade tooling for broader operational insights.

Quick Tips & Tricks

Start with the Demo Script — Preview the evaluation dashboard with no API keys to understand output format and metrics before investing in setup.
Mind the Smallest Context Window — Make sure prompts fit the smallest model in your router’s catalog to avoid unexpected context overflows during routing.
Use Concurrency Wisely — Adjust your YAML concurrency settings to remain within Azure rate limits, avoiding throttling and incomplete runs.
Leverage Preset Configs — Use the quick_test.yaml for fast iteration or large_scale.yaml for production-grade stress testing, saving you configuration time.
JSONL for Flexible Prompting — JSONL is a simple, extensible format allowing easy integration with various data pipelines and prompt structures.
Push Results for Governance — Submit your runs to Foundry’s cloud portal to utilize enterprise-grade governance and visualization features.

Conclusion

The Microsoft Foundry Model Router opens up dynamic, context-driven model selection across a growing catalog of frontier LLMs, significantly easing developer overhead. This open-source evaluation repo complements Foundry’s enterprise benchmarks by enabling quick, local, and configurable assessments of quality, cost, and latency trade-offs — critical for confident operational adoption.

By following the step-by-step guide, developers can gain nuanced insights into how the router performs on their specific prompts and compliance constraints. This empowers data-driven decision making and ensures smarter spend without sacrificing response quality. The ability to submit results back to Foundry enterprise tooling further amplifies operational visibility and governance.

As model ecosystems continue to diversify, such rigorous, router-aware evaluation pipelines will become essential tooling for AI-centric product teams striving for both efficiency and performance at scale.

References

How to run evals for the model router | Microsoft Foundry Blog — Original article and detailed walkthrough
Foundry Model Router Autoeval GitHub repository — Source code and evaluation pipeline
Microsoft Foundry Model Router documentation on Microsoft Learn — Deployment and configuration guide
Azure OpenAI Service Pricing — Cost reference for underlying models
Microsoft Foundry Blog — Additional posts on Foundry AI platform updates