Foundry Local 1.1: Real-Time Transcription, Embeddings, and Agentic Responses On-Device

Date: 2026-05-12

Discover how Foundry Local 1.1 powers local AI with real-time transcription, semantic embeddings, and a flexible Responses API—no cloud needed.

Tags: ["AI", "Foundry", "On-Device AI", "Speech Recognition", "Semantic Search"]

Microsoft’s Foundry Local 1.1 release marks a significant step forward in local AI development, letting developers embed AI capabilities directly inside their applications without relying on cloud services. Voice-driven user interfaces, live captioning, semantic search, and multimodal vision-language applications can run entirely on-device, which avoids network round trips and cloud inference costs and keeps user data on the machine.

This release introduces three standout features: a Live Transcription API optimized for real-time streaming speech recognition, a Text Embeddings API for semantic search and retrieval-augmented generation, and a new Responses API that facilitates structured, agentic conversations including tool calls and image understanding. Alongside these, Foundry Local 1.1 enhances hardware compatibility and reduces SDK package size for a smoother developer experience.

In this post, we’ll explore the updated architecture of Foundry Local, and dive into each major feature with practical code examples illustrating how you can harness these on-device AI capabilities today.

Architecture Overview

┌───────────────────────────────────────────────┐
│  Enterprise Data                              │
├───────────────────────────────────────────────┤
│  • Streaming Audio Input                      │
│  • Document & Knowledge Base Text             │
│  • Images & Multimodal Data                   │
└───────────────────────────────────────────────┘
               ↓
┌───────────────────────────────────────────────┐
│  Microsoft Foundry Local AI Platform          │
├───────────────────────────────────────────────┤
│  • Live Transcription API                     │
│  • Embeddings Generation                      │
│  • Responses API (agentic, multimodal)        │
│  • ONNX Runtime Optimized Models              │
│  • Plugin-Based Execution Providers           │
└───────────────────────────────────────────────┘
               ↓
┌───────────────────────────────────────────────┐
│  Developer Applications                       │
├───────────────────────────────────────────────┤
│  • Voice UIs & Captioning                     │
│  • Semantic Search & RAG                      │
│  • Vision-Language Agents                     │
│  • On-Device AI SDKs (.NET, Python, JS, Rust) │
└───────────────────────────────────────────────┘

This architecture traces the flow from raw input (audio, text documents, and images) through Foundry Local’s locally run models, exposed via the APIs listed above, and into developer applications. Inference runs on-device through ONNX Runtime with post-training quantization, and optional components such as the WebGPU execution provider ship as separate plugins to keep the base package small.

Key Technical Observations

Empirical Optimization for On-Device Streaming ASR: Foundry Local 1.1’s live transcription capability is powered by NVIDIA’s Nemotron Speech Streaming model re-engineered for efficient ONNX Runtime execution with advanced quantization. The int4 k-quant variant compresses a 2.47 GB PyTorch model to 0.67 GB with near-baseline accuracy (8.2% WER) and latency of 0.56s on CPU.
Unified Multi-SDK Support with Language Bindings: The APIs are consistent across Python, JavaScript, C#, and Rust, enabling broad adoption. Notably, the C# SDK targets netstandard2.0, a lower baseline that improves compatibility with legacy .NET and Xamarin applications.
Responses API Enables Agentic, Multimodal Interactions: Going beyond simple chat, this API supports streaming tokens, multi-turn dialogue referencing previous responses, tool/function calls, and vision-text co-processing via new multimodal models like Qwen3.5 VLM—all running fully offline.
Plugin-Based Execution Providers Reduce Footprint: The WebGPU execution provider is decoupled into an on-demand plugin. This shrinks the default SDK, and GPU acceleration is loaded only when it is needed, with no changes to application code.
JavaScript SDK Package Slimmed with Custom Node-API Addon: Replacing the koffi FFI runtime with a purpose-built native addon cuts ~27MB from install size, speeds initialization, stabilizes ABI compatibility, and removes native build requirements for cross-platform usage.
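
To put the quantization figures above in perspective, the short calculation below derives the compression ratio and size reduction from the two model sizes quoted for the int4 k-quant variant. It uses only the numbers stated in this post; the function name is illustrative.

```python
# Model sizes quoted above for the Nemotron streaming model.
ORIGINAL_GB = 2.47   # PyTorch checkpoint
QUANTIZED_GB = 0.67  # int4 k-quant ONNX variant


def compression_summary(original_gb: float, quantized_gb: float) -> tuple[float, float]:
    """Return (compression ratio, percent size reduction)."""
    ratio = original_gb / quantized_gb
    reduction_pct = (1 - quantized_gb / original_gb) * 100
    return ratio, reduction_pct


ratio, reduction_pct = compression_summary(ORIGINAL_GB, QUANTIZED_GB)
print(f"{ratio:.1f}x smaller, {reduction_pct:.0f}% size reduction")
# prints: 3.7x smaller, 73% size reduction
```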

How It Works

Live Transcription API: Real-Time Streaming Speech Recognition

Foundry Local’s Live Transcription API uses a session-based approach:

Load Streaming Model: Developers retrieve the lightweight nemotron-speech-streaming-en-0.6b model from Foundry’s catalog. If absent, it downloads automatically.
Create a Transcription Session: Configure audio parameters such as sample rate (16kHz), mono channel, and language.
Stream Raw Audio Chunks: Audio captured from a microphone is pushed as PCM frames to the session.
Consume Transcripts Asynchronously: The API returns interim and final transcribed text via an async iterator, marking finalized text clearly.

The following Python snippet illustrates this flow:

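# audio_client is assumed to come from the loaded streaming model (setup not shown).
session = audio_client.create_live_transcription_session()
session.settings.sample_rate = 16000  # 16 kHz
session.settings.channels = 1         # mono
session.settings.language = "en"
session.start()

# Feed raw microphone PCM to session.append() in real time as it is captured,
# typically from a capture callback or background thread, while this loop consumes results.
for result in session.get_stream():
    if result.is_final:
        # Finalized text is labeled so the UI can commit it.
        print(f"[FINAL] {result.content[0].text}")
    else:
        # Interim text is printed without a newline for live feedback.
        print(result.content[0].text, end="", flush=True)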

This streaming design enables low-latency captioning and voice UI scenarios compatible with resource-constrained devices.
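
As a rough illustration of the "stream raw audio chunks" step, the sketch below splits a PCM buffer into fixed-duration frames of the kind one might pass to session.append(). It assumes 16-bit signed samples, which the post does not state, and the 100 ms frame length and helper names are arbitrary choices for the example rather than SDK requirements.

```python
from typing import Iterator

SAMPLE_RATE_HZ = 16_000  # matches the session settings shown above
CHANNELS = 1             # mono
BYTES_PER_SAMPLE = 2     # assumption: 16-bit signed PCM


def frame_size_bytes(frame_ms: int) -> int:
    """Bytes needed to hold frame_ms milliseconds of audio."""
    samples = SAMPLE_RATE_HZ * frame_ms // 1000
    return samples * CHANNELS * BYTES_PER_SAMPLE


def iter_pcm_frames(pcm: bytes, frame_ms: int = 100) -> Iterator[bytes]:
    """Yield fixed-duration frames; the last frame may be shorter."""
    size = frame_size_bytes(frame_ms)
    for start in range(0, len(pcm), size):
        yield pcm[start:start + size]


# One second of silence stands in for captured microphone audio.
one_second = bytes(SAMPLE_RATE_HZ * CHANNELS * BYTES_PER_SAMPLE)
frames = list(iter_pcm_frames(one_second))
print(len(frames), len(frames[0]))  # prints: 10 3200
```

Smaller frames generally mean results can arrive sooner at the cost of more calls; the right size depends on the capture library and the model's own chunking.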

Embeddings API: Semantic Search and Retrieval

The Embeddings API generates fixed-dimension vector representations of text and supports both single inputs and batches. These embeddings are foundational for semantic search, clustering, and retrieval-augmented generation (RAG).

One powerful use case is local semantic search: documents are embedded and indexed in-memory, then user queries generate embeddings to find closest matches via cosine similarity.
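
The ranking step described above can be sketched without any vector database. The example below is a minimal illustration using toy 3-dimensional vectors in place of real embeddings; the document IDs and helper names are made up for the example.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query: list[float], index: dict[str, list[float]], k: int = 3) -> list[tuple[str, float]]:
    """Return the k most similar (doc_id, score) pairs, best first."""
    scored = [(doc_id, cosine_similarity(query, vec)) for doc_id, vec in index.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Toy vectors standing in for real document embeddings.
index = {
    "python-intro": [0.9, 0.1, 0.0],
    "rust-ownership": [0.1, 0.9, 0.1],
    "baking-bread": [0.0, 0.1, 0.9],
}
query = [0.8, 0.2, 0.0]

best_id, best_score = top_k(query, index, k=1)[0]
print(best_id)  # prints: python-intro
```

A brute-force scan like this is linear in the number of documents, which is usually fine for small local collections; a vector database becomes worthwhile as the collection grows.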

Example using Foundry embeddings combined with ChromaDB for indexing and querying:

batch_response = client.generate_embeddings(documents)
embeddings = [item.embedding for item in batch_response.data]
collection.add(ids=[...], embeddings=embeddings, documents=documents)

query_embedding = client.generate_embedding("What programming language is good for beginners?").data[0].embedding
results = collection.query(query_embeddings=[query_embedding], n_results=3)

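The response follows OpenAI’s embeddings format (a data list whose items each carry an embedding field), so code written against a cloud endpoint can consume locally generated vectors without changes.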
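
For readers unfamiliar with that format, the sketch below parses a hand-written payload shaped like an OpenAI embeddings response and returns the vectors in input order. The payload is illustrative only: the model name and the numbers are made up, and whether Foundry Local populates every field the same way is an assumption.

```python
import json

# Hand-written example payload in the OpenAI embeddings response shape.
payload = json.loads("""
{
  "object": "list",
  "model": "example-embedding-model",
  "data": [
    {"object": "embedding", "index": 1, "embedding": [0.4, 0.5, 0.6]},
    {"object": "embedding", "index": 0, "embedding": [0.1, 0.2, 0.3]}
  ]
}
""")


def vectors_in_input_order(response: dict) -> list[list[float]]:
    """Sort items by their index field so vectors line up with the inputs."""
    items = sorted(response["data"], key=lambda item: item["index"])
    return [item["embedding"] for item in items]


vectors = vectors_in_input_order(payload)
print(len(vectors), len(vectors[0]))  # prints: 2 3
```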

Responses API: Agentic, Streaming, and Multimodal

This high-level API abstracts chat completions into multi-turn conversations supporting:

Streaming tokens as server-sent events for real-time UI updates
Tool calling: define and invoke custom functions with input/output round-trips
Vision-language input for models like Qwen3.5 VLM, which process images and text together
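
The tool-calling round trip can be sketched as follows. This is a minimal illustration that assumes Foundry Local mirrors the OpenAI Responses item shapes (a function_call item with name, arguments, and call_id, answered by a function_call_output item); the get_weather tool and the sample call are hypothetical.

```python
import json


def get_weather(city: str) -> dict:
    """Hypothetical local tool; a real one might read a sensor or a cache."""
    return {"city": city, "forecast": "sunny"}


# Registry mapping tool names the model may request to Python callables.
TOOLS = {"get_weather": get_weather}


def run_tool_call(call: dict) -> dict:
    """Execute one function_call item and build the matching output item."""
    func = TOOLS[call["name"]]
    arguments = json.loads(call["arguments"])  # arguments arrive as a JSON string
    result = func(**arguments)
    return {
        "type": "function_call_output",
        "call_id": call["call_id"],  # ties the result back to the request
        "output": json.dumps(result),
    }


# A function_call item as the model might emit it (hand-written here).
model_call = {
    "type": "function_call",
    "call_id": "call_1",
    "name": "get_weather",
    "arguments": '{"city": "Oslo"}',
}

reply_item = run_tool_call(model_call)
print(reply_item["output"])
```

The returned item would be sent back as part of the next request's input so the model can use the tool result in its answer.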

An example streams an image description from a local vision-language model:

vision_input = [{
    "type": "message",
    "role": "user",
    "content": [
        {"type": "input_text", "text": "Describe the image."},
        {"type": "input_image", "image_data": image_b64, "media_type": "image/jpeg"}
    ],
}]

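# Assuming client is the OpenAI Python SDK: extra_body is merged into the
# request JSON, so its "input" replaces the placeholder string.
stream = client.responses.create(
    model=model.id,
    input="placeholder",
    extra_body={"input": vision_input},
    stream=True,
)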
for event in stream:
    if getattr(event, "type", None) == "response.output_text.delta":
        print(event.delta, end="", flush=True)

This end-to-end local vision-language inference removes cloud dependencies from sensitive image-processing workflows.
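
The snippet above uses an image_b64 value without showing where it comes from. One plausible way to build it, sketched below, is to base64-encode the raw image bytes with the standard library. The field names follow the example above; the helper name and the stand-in bytes are illustrative, and in practice the bytes would be read from an image file.

```python
import base64


def build_vision_input(image_bytes: bytes, prompt: str,
                       media_type: str = "image/jpeg") -> list[dict]:
    """Build a single user message carrying a text prompt and one image."""
    image_b64 = base64.b64encode(image_bytes).decode("ascii")
    return [{
        "type": "message",
        "role": "user",
        "content": [
            {"type": "input_text", "text": prompt},
            {"type": "input_image", "image_data": image_b64, "media_type": media_type},
        ],
    }]


# Stand-in bytes; a real caller would read them from an image file on disk.
fake_jpeg = b"\xff\xd8\xff\xe0 not a real image"
vision_input = build_vision_input(fake_jpeg, "Describe the image.")
print(vision_input[0]["content"][1]["media_type"])  # prints: image/jpeg
```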

Quick Tips & Tricks

Leverage Model Catalog for Auto Download: Always use Foundry Local’s catalog abstraction; it transparently caches models locally and handles updates.
Choose Quantized Models for On-Device Efficiency: Prefer int4 k-quant models like Nemotron for best CPU performance and minimal memory footprint with near-native accuracy.
Use Plugin Execution Providers Sparingly: Enable WebGPU or other acceleration plugins only if your app requires GPU processing to keep package size lean.
Stream Transcription Results for Responsiveness: Consume the transcription async stream token-by-token or line-by-line for timely UI feedback.
Pair Embeddings API with Vector DBs: For semantic search over larger collections, store Foundry embeddings in a vector database such as ChromaDB, which can run in-memory, so nearest-neighbor queries stay local and fast.
Start the On-Device Web Service for Responses API: Use the built-in Foundry service to expose Responses API over a local HTTP endpoint, simplifying integration with OpenAI-compatible clients.
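
To make the streaming-transcription tip concrete, here is one possible way to turn a mix of interim and final results into a caption line. It is a sketch under a stated assumption: each interim result carries the full current hypothesis for the utterance, so a newer interim replaces the older one. If the SDK emits incremental fragments instead, the interim text should be appended rather than replaced. The CaptionBuffer class is illustrative and not part of the SDK.

```python
from dataclasses import dataclass, field


@dataclass
class CaptionBuffer:
    """Keeps committed text separate from the still-changing interim text."""
    finalized: list[str] = field(default_factory=list)
    interim: str = ""

    def update(self, text: str, is_final: bool) -> None:
        if is_final:
            self.finalized.append(text.strip())
            self.interim = ""    # the utterance is committed
        else:
            self.interim = text  # assumption: newer interim replaces older

    def render(self) -> str:
        """Committed text followed by the current interim hypothesis, if any."""
        parts = list(self.finalized)
        if self.interim.strip():
            parts.append(self.interim.strip())
        return " ".join(parts)


captions = CaptionBuffer()
events = [("hello", False), ("hello wor", False), ("Hello world.", True), ("how are", False)]
for text, is_final in events:
    captions.update(text, is_final)

print(captions.render())  # prints: Hello world. how are
```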

Conclusion

Foundry Local 1.1 makes local AI development more accessible and performant, providing live streaming transcription, semantic embeddings, and agentic multimodal conversational APIs, all running entirely on-device with no cloud dependency.

By advancing model quantization, refining multi-language SDKs, and modularizing plugins, Foundry Local empowers developers to build privacy-preserving, scalable AI applications across diverse platforms and runtime environments. This release lays strong groundwork for future expansion into C++ binding support and further package size reductions.

Local AI is poised to transform how AI interacts with end users and data, and Foundry Local 1.1 is a compelling milestone on that journey.

References

Foundry Local 1.1: Live Transcription, Embeddings, and Responses API | Microsoft Foundry Blog — Official announcement and detailed examples
Pushing the Limits of On-Device Streaming ASR (arXiv) — Benchmark and methodology for Nemotron streaming speech model
Foundry Local GitHub Repository — Source code, samples, and early C++ binding access
ChromaDB Documentation — In-memory vector database for semantic search integration
OpenAI API Documentation — Background on Responses API conventions leveraged locally
