April 1, 2026

Building an Offline AI Interview Coach with Foundry Local, RAG, and SQLite


Discover how to create an AI-powered interview coach that runs 100% offline using Microsoft Foundry Local, Retrieval-Augmented Generation, and SQLite.

Tags: ["Azure AI Foundry", "RAG", "Foundry Local", "SQLite", "JavaScript"]

Preparing for job interviews can be daunting, especially when you want targeted, personalized coaching that respects your privacy and can run without a stable internet connection. What if you could leverage the power of AI — a model that understands your CV and the job description intimately — yet keep all your data local, avoiding cloud latency, rate limits, and exposing sensitive information?

Meet Interview Doctor, an offline AI interview coach that runs entirely on your machine, built with Microsoft Foundry Local, Retrieval-Augmented Generation (RAG), and SQLite for vector storage. This project demonstrates how to combine on-device AI inference with a retrieval system grounded in your documents to generate context-aware interview questions and feedback.

In this post, we’ll explore the architecture behind this innovative offline tool, dissect its core technical components, and guide you through the building blocks that make offline, grounded AI applications not only possible but practical for real-world usage.

Architecture Overview

At a high level, the architecture has three layers:

  • Enterprise data sources
  • Foundry platform
  • AI applications

Key Technical Observations

  • Local RAG Enables Privacy and Reliability
    By embedding document chunks locally in SQLite and integrating retrieval with on-device AI, the system ensures users’ data never leaves their machines, avoiding privacy issues and unreliable network dependencies.

  • Use of TF-IDF Vectors in SQLite
    Instead of costly neural embeddings, the system leverages classic TF-IDF vectorization stored as JSON in SQLite. This aligns with the offline-first goal and allows fast cosine similarity searches without specialized vector DB infrastructure.

  • Foundry Local’s OpenAI-Compatible Runtime
    Microsoft Foundry Local exposes an OpenAI-compatible API that allows seamless integration of local LLMs like phi-3.5-mini, abstracting away model management complexities while providing solid on-device inference performance.

  • Chunking with Overlap Improves Retrieval Quality
    The chunking strategy splits documents into overlapping token windows to preserve context between chunks, which is essential for grounding responses with coherent, relevant document fragments.

  • Streaming Responses via Server-Sent Events (SSE)
    The chat engine uses SSE to push partial model completions to the UI in real-time, creating a responsive interactive experience despite running entirely offline.

  • Dual Interfaces Cater to Different User Preferences
    Both a polished web UI and a CLI are provided, with an “Edge Mode” for extremely constrained environments—maximizing access and flexibility without sacrificing functionality.

How It Works

Step 1: Setting Up Foundry Local

The core of the system’s AI is Microsoft Foundry Local, a runtime designed to run OpenAI-compatible models on-device. Installation is straightforward across platforms:

# Windows
winget install Microsoft.FoundryLocal

# macOS
brew install microsoft/foundrylocal/foundrylocal

The JavaScript SDK abstracts model lifecycle, downloading, and endpoint discovery:

import { FoundryLocalManager } from "foundry-local-sdk";
import OpenAI from "openai";

const manager = new FoundryLocalManager();
const modelInfo = await manager.init("phi-3.5-mini");

// Using OpenAI-compatible API to interact with local model
const openai = new OpenAI({
  baseURL: manager.endpoint,
  apiKey: manager.apiKey,
});

This approach offers a clean developer experience akin to the cloud OpenAI API, but with all calls handled locally.

Step 2: Building the RAG Pipeline

Document Chunking

Text documents are split into manageable, overlapping chunks of ~200 tokens (approximated here as whitespace-separated words) with a 25-token overlap to maintain contextual continuity between pieces:

export function chunkText(text, maxTokens = 200, overlapTokens = 25) {
  const words = text.split(/\s+/).filter(Boolean);
  if (words.length <= maxTokens) return [text.trim()];

  const chunks = [];
  let start = 0;
  while (start < words.length) {
    const end = Math.min(start + maxTokens, words.length);
    chunks.push(words.slice(start, end).join(" "));
    if (end >= words.length) break;
    start = end - overlapTokens;
  }
  return chunks;
}
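To see how the two parameters interact: every chunk after the first starts overlapTokens words before the previous one ended, so the effective stride is maxTokens - overlapTokens words. A quick sketch of the resulting chunk count (expectedChunks is an illustrative helper for reasoning about the parameters, not part of the project):

```javascript
// Sketch: how many chunks the sliding window above produces for a
// given word count. After the first chunk, each new chunk advances by
// (maxTokens - overlapTokens) words.
function expectedChunks(nWords, maxTokens = 200, overlapTokens = 25) {
  if (nWords <= maxTokens) return 1; // short texts stay in one chunk
  const stride = maxTokens - overlapTokens; // 175 new words per chunk
  return Math.ceil((nWords - overlapTokens) / stride);
}
```

With the defaults, a 1,000-word CV would land in roughly six chunks, each sharing 25 words with its neighbor.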

TF-IDF Vectorization

Each chunk is converted into a term-frequency vector, ignoring single-character tokens and punctuation. Cosine similarity is used to score relevance during search:

export function termFrequency(text) {
  const tf = new Map();
  const tokens = text
    .toLowerCase()
    .replace(/[^a-z0-9\-']/g, " ")
    .split(/\s+/)
    .filter((t) => t.length > 1);
  for (const t of tokens) {
    tf.set(t, (tf.get(t) || 0) + 1);
  }
  return tf;
}

export function cosineSimilarity(a, b) {
  let dot = 0, normA = 0, normB = 0;
  for (const [term, freq] of a) {
    normA += freq * freq;
    if (b.has(term)) dot += freq * b.get(term);
  }
  for (const [, freq] of b) normB += freq * freq;
  if (normA === 0 || normB === 0) return 0;
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

SQLite Vector Store

Document chunks and their TF-IDF vectors are persisted in SQLite, enabling fast retrieval via a local database without extra vector infrastructure:

export class VectorStore {
  insert(docId, title, category, chunkIndex, content) {
    const tf = termFrequency(content);
    const tfJson = JSON.stringify([...tf]);
    this.db.run(
      "INSERT INTO chunks (...) VALUES (?, ?, ?, ?, ?, ?)",
      [docId, title, category, chunkIndex, content, tfJson]
    );
    this.save();
  }

  search(query, topK = 5) {
    const queryTf = termFrequency(query);
    // Cosine similarity scoring against stored chunks, returning top-K results
  }
}
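The body of search() is left as a comment above. A minimal sketch of the ranking step it describes might look like the following, reusing the cosineSimilarity logic shown earlier; rankChunks and the row shape are illustrative, not the project’s actual code:

```javascript
// Compact copy of the cosine similarity shown earlier in the post.
function cosineSimilarity(a, b) {
  let dot = 0, normA = 0, normB = 0;
  for (const [term, freq] of a) {
    normA += freq * freq;
    if (b.has(term)) dot += freq * b.get(term);
  }
  for (const [, freq] of b) normB += freq * freq;
  if (normA === 0 || normB === 0) return 0;
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Illustrative sketch of search()'s scoring: deserialize each stored
// term-frequency vector, score it against the query vector, keep top-K.
function rankChunks(queryTf, rows, topK = 5) {
  return rows
    .map((row) => ({
      ...row,
      // tfJson was written as JSON.stringify([...tf]), so new Map() restores it
      score: cosineSimilarity(queryTf, new Map(JSON.parse(row.tfJson))),
    }))
    .sort((a, b) => b.score - a.score)
    .slice(0, topK);
}
```

Because the vectors round-trip through JSON, no vector-database extension is needed; a plain table scan plus this scoring loop is enough at the document scale of a CV and a handful of job descriptions.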

Step 3: The RAG Chat Engine

When a user poses a question, the system:

  1. Retrieves the most relevant chunks from SQLite based on cosine similarity to the query.
  2. Builds a prompt by injecting these chunks as context alongside a system-level prompt.
  3. Sends the prompt to Foundry Local’s LLM endpoint, streaming response tokens via an async generator.

Example streaming query code snippet:

async *queryStream(userMessage, history = []) {
  const chunks = this.retrieve(userMessage);
  const context = this._buildContext(chunks);

  const messages = [
    { role: "system", content: SYSTEM_PROMPT },
    { role: "system", content: `Retrieved context:\n\n${context}` },
    ...history,
    { role: "user", content: userMessage },
  ];

  const stream = await this.openai.chat.completions.create({
    model: this.modelId,
    messages,
    temperature: 0.3,
    stream: true,
  });

  for await (const chunk of stream) {
    const content = chunk.choices[0]?.delta?.content;
    if (content) yield { type: "text", data: content };
  }
}

This pattern allows real-time, grounded AI-assisted interview coaching with context injection that respects the user’s own documents.
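On the wire, each yielded event becomes one Server-Sent Events frame: a `data:` line carrying a JSON payload, terminated by a blank line. A minimal sketch of that framing, with a hypothetical handler that pipes the async generator into an HTTP response (toSseFrame and streamToResponse are illustrative helpers, not the project’s actual server code):

```javascript
// Serialize one streamed event as an SSE frame.
// The SSE format is "data: <payload>\n\n"; the browser's EventSource
// API splits the stream on the blank line between frames.
function toSseFrame(event) {
  return `data: ${JSON.stringify(event)}\n\n`;
}

// Hypothetical usage inside an HTTP handler: forward events from
// queryStream() to the client as they arrive.
async function streamToResponse(generator, res) {
  res.setHeader?.("Content-Type", "text/event-stream");
  for await (const event of generator) {
    res.write(toSseFrame(event));
  }
  res.end?.();
}
```

Since everything runs on localhost, latency is dominated by token generation, so streaming frames as they arrive is what keeps the UI feeling live.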

Step 4: Dual Interfaces — Web & CLI

The project ships with two main interfaces for interacting with the model:

  • Web UI: A sleek, dark-themed React web app serving at http://127.0.0.1:3000 by default, with chat, file upload, and quick action buttons.

Interview Doctor’s polished web interface. Source: GitHub repository.

  • CLI: An interactive command-line interface for fast, keyboard-driven AI interaction ideal for power users or low-resource environments.

  • Edge Mode: Designed for strictly offline or low-connectivity use cases, ensuring all necessary assets and models are locally available.

Step 5: Testing

Testing is built-in using Node.js’s native test runner to verify core functionality like chunking and vector search without adding dependencies:

import { describe, it } from "node:test";
import assert from "node:assert/strict";
import { chunkText } from "../src/chunking.js"; // path is illustrative; adjust to your layout

describe("chunkText", () => {
  it("returns single chunk for short text", () => {
    const chunks = chunkText("short text", 200, 25);
    assert.equal(chunks.length, 1);
  });

  it("maintains overlap between chunks", () => {
    const text = Array.from({ length: 450 }, (_, i) => `w${i}`).join(" ");
    const chunks = chunkText(text, 200, 25);
    const tail = chunks[0].split(" ").slice(-25).join(" ");
    assert.ok(chunks[1].startsWith(tail));
  });
});

Running npm test validates the system’s internal correctness before use.

Quick Tips & Tricks

  1. Customize Document Ingestion Easily — Replace or add your own .pdf or .md files in the docs/ folder to adapt the AI to your domain.

  2. Tune Chunk Size and Overlap — Adjust config.chunkSize and config.chunkOverlap to balance retrieval granularity and context scope.

  3. Switch Models via Foundry CLI — Use foundry model list and update config.model to experiment with different local LLMs.

  4. Leverage Streaming Responses for UX — SSE-based chat streaming improves responsiveness, especially in offline environments.

  5. Reuse the VectorStore Pattern — Its simple TF-IDF vectorization stored in SQLite can be adapted for other offline knowledge base applications.

  6. Extend with Custom Prompts — Modify src/prompts.js to tailor system instructions and system/user prompts for your interview coaching style.
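Putting tips 2 and 3 together, the tunables might live in a single config object along these lines (the exact file layout and property names are an assumption based on the names mentioned above):

```javascript
// Hypothetical shape of the tunables referenced in the tips above.
const config = {
  model: "phi-3.5-mini", // swap after checking `foundry model list`
  chunkSize: 200,        // words per chunk
  chunkOverlap: 25,      // words shared between consecutive chunks
};
```

Smaller chunks sharpen retrieval precision but shrink the context each chunk carries; larger overlap smooths boundaries at the cost of more stored vectors.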

Conclusion

This project elegantly demonstrates how local AI runtimes like Microsoft Foundry Local combined with clever retrieval mechanisms can enable powerful, privacy-conscious applications that work offline. By blending TF-IDF vector search inside SQLite with on-device generation, users get fast, grounded answers without cloud latency or exposure.

Interview Doctor’s architecture and codebase provide a solid blueprint for anyone interested in building similar offline AI assistants — whether for interview prep, customer support, code review, or compliance checks. As on-device AI runtimes mature and hardware accessibility grows, expect to see many more such applications that empower users with AI capabilities that run entirely in their control.

References

  1. Building an Offline AI Interview Coach with Foundry Local, RAG, and SQLite | Microsoft Community Hub — Original source blog post and detailed walkthrough.

  2. Interview Doctor GitHub Repository — Full source code, demos, and installation instructions.

  3. Microsoft Foundry Local GitHub — Official repo for on-device AI runtime.

  4. Foundry Local SDK (npm) — SDK for interacting with Foundry Local models.

  5. Local RAG Reference — Implementation details on the RAG pipeline used.

  6. Microsoft Tech Community AI Tag — Explore related AI projects and discussions.