Local-First Web Apps with TypeScript: Building Offline AI & Privacy-Respecting Features
Build privacy-first, offline AI in the browser: TypeScript patterns with WebAssembly, IndexedDB, Web Workers, and framework examples inspired by Puma's local AI.
You want AI features in your web app, but you don't want users' data or models shipped to the cloud. In 2026, users expect privacy by default and offline-first behavior, and modern browsers plus TypeScript make it possible to run real ML locally. This guide shows how to design and implement local-first web apps that run client-side models via WebAssembly, store artifacts safely in IndexedDB, and stay fast and maintainable with familiar frameworks (React, Vue, Next.js, Node).
Why local-first matters now (2025–2026 context)
Local AI is more than a buzzword now that Puma and other browsers ship local agents; it's an architectural shift. By late 2025 and into 2026:
- Browser runtimes improved with WebAssembly SIMD & threads, and WebGPU landed stable support across major engines, enabling much faster on-device inference.
- Wasm-ported LLM inference projects (ggml/llama.cpp ports, ONNX Runtime Web, etc.) matured, allowing smaller quantized models to run in browser sandboxes.
- Users and regulators demand privacy-first defaults — keeping data and models local reduces compliance surface area.
High-level architecture: what a local-first web app looks like
Design local-first apps with these components in mind:
- Model & runtime: WebAssembly or WebGPU-backed inference engine (WASM binary or JS wrapper).
- Persistent store: IndexedDB (with optional client-side encryption) for model shards, caches, and user artifacts.
- Background loader: Service Worker + streams to fetch and store model pieces progressively for offline use.
- Compute boundary: Web Worker or Worklet running the WASM inference to avoid blocking the UI thread.
- UI: Framework layer (React/Vue/Next) that orchestrates model loading and inference calls.
Core building blocks — practical choices in 2026
- Runtimes: ONNX Runtime Web (WASM/WebGPU), Wasm ports of ggml/llama.cpp, TensorFlow.js (WebGPU backend), or custom WASM modules compiled from Rust/C++ using wasm-bindgen or wasm-pack.
- Model formats: ONNX, quantized GGML, or tiny TFLite / custom quantized tensors — choose size over quality for mobile/offline use.
- Persistence: IndexedDB (via the lightweight idb wrapper) plus optional AES-GCM encryption using Web Crypto.
- Bundling: Vite/esbuild for dev speed, with SW-enabled build steps for Service Worker precaching. Next.js apps use the App Router with client/server split for hybrid flows.
Step-by-step: Deliver and store models to the browser
Rather than shipping gigabytes in your build, use progressive delivery and store models in IndexedDB. This allows offline-first behavior and incremental updates.
1) Model packaging & hosting
Host quantized model shards on a CDN. Split large models into small files (chunks) so the app can download only what it needs. Serve with proper HTTP caching headers and range support for resumable downloads.
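As a sketch of the chunk planning, the client can compute resumable `Range` requests up front. The names here (`ShardChunk`, `planChunks`) are illustrative, not from any particular library:

```typescript
// Plan HTTP Range requests for a shard of `totalBytes`, split into
// fixed-size chunks, so downloads can be resumed per-chunk.
interface ShardChunk {
  index: number;
  range: string; // value for the HTTP Range header
}

function planChunks(totalBytes: number, chunkSize: number): ShardChunk[] {
  const chunks: ShardChunk[] = [];
  for (let start = 0, i = 0; start < totalBytes; start += chunkSize, i++) {
    const end = Math.min(start + chunkSize, totalBytes) - 1; // Range end is inclusive
    chunks.push({ index: i, range: `bytes=${start}-${end}` });
  }
  return chunks;
}
```

Each chunk's `range` string goes straight into a `fetch(url, { headers: { Range: chunk.range } })`, and completed chunk indices can be tracked in IndexedDB so a restart resumes where it left off.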
2) Service Worker + streaming download into IndexedDB
Use a Service Worker to intercept network requests and stream chunks into IndexedDB. This improves resilience and enables offline use.
```typescript
// simplified: stream a model shard into IndexedDB using the idb wrapper
import { openDB } from 'idb';

async function storeShard(key: string, url: string): Promise<void> {
  const db = await openDB('local-ai', 1, {
    upgrade(db) { db.createObjectStore('shards'); },
  });
  const res = await fetch(url);
  if (!res.ok || !res.body) throw new Error(`Shard fetch failed: ${res.status}`);
  const reader = res.body.getReader();
  const chunks: Uint8Array[] = [];
  while (true) {
    const { done, value } = await reader.read();
    if (done) break;
    chunks.push(value);
  }
  // Assemble the chunks and persist the raw bytes under `key`
  const blob = new Blob(chunks as BlobPart[]);
  await db.put('shards', await blob.arrayBuffer(), key);
}
```
3) Optional encryption at rest
For extra privacy, encrypt model shards in IndexedDB using the Web Crypto API. Derive a key from a user secret or device-bound key (user password or platform KMS where available).
```typescript
// AES-GCM encrypt helper (simplified)
async function deriveKey(password: string, salt: Uint8Array): Promise<CryptoKey> {
  const enc = new TextEncoder();
  const baseKey = await crypto.subtle.importKey(
    'raw', enc.encode(password), 'PBKDF2', false, ['deriveKey'],
  );
  return crypto.subtle.deriveKey(
    { name: 'PBKDF2', salt, iterations: 200_000, hash: 'SHA-256' },
    baseKey,
    { name: 'AES-GCM', length: 256 },
    false,
    ['encrypt', 'decrypt'],
  );
}

async function encrypt(buffer: ArrayBuffer, key: CryptoKey) {
  const iv = crypto.getRandomValues(new Uint8Array(12));
  const cipher = await crypto.subtle.encrypt({ name: 'AES-GCM', iv }, key, buffer);
  return { iv, cipher };
}
```
Run inference in WebAssembly inside a Web Worker
Run heavy compute outside the main thread. Use Workers to host WASM engines, provide a simple RPC protocol for requests, and stream logits/results back to the UI.
```typescript
// worker.ts — outline of worker message handling
// initRuntime and runInference are placeholders for your WASM engine's API
self.addEventListener('message', async (ev: MessageEvent) => {
  const { id, type, payload } = ev.data;
  if (type === 'init') {
    // load wasm runtime from IndexedDB or URL
    await initRuntime(payload.wasmBuffer);
    postMessage({ id, type: 'ready' });
  }
  if (type === 'infer') {
    const result = await runInference(payload.input);
    postMessage({ id, type: 'result', result });
  }
});
```
WASM runtimes often provide C/JS bindings; initialize them in the worker. If you want threads/SIMD, prefer WASM with threads and SIMD and ensure your bundler and server enable cross-origin isolation (COOP/COEP) when using SharedArrayBuffer.
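The threaded-vs-fallback decision can be isolated into a small helper. This is pure logic under the assumption that you ship two hypothetical WASM builds (a threads+SIMD variant and a single-threaded one):

```typescript
type WasmVariant = 'threads-simd' | 'single-thread';

// Decide which WASM build to load. SharedArrayBuffer is only usable when
// the page is cross-origin isolated (COOP/COEP headers set), so fall back
// to a single-threaded build otherwise.
function pickWasmVariant(isolated: boolean, hasSharedArrayBuffer: boolean): WasmVariant {
  return isolated && hasSharedArrayBuffer ? 'threads-simd' : 'single-thread';
}

// In the browser you would call it as:
// pickWasmVariant(self.crossOriginIsolated === true, typeof SharedArrayBuffer !== 'undefined');
```

Keeping this as a pure function makes the fallback path trivially unit-testable without a browser.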
Framework examples: React, Vue, Next.js, and Node
React + TypeScript: a hook that manages lifecycle
Keep the UI reactive and simple — a hook to manage model load state and inference requests.
```typescript
// useLocalModel.tsx
import { useEffect, useRef, useState } from 'react';

export function useLocalModel(modelKey: string) {
  const workerRef = useRef<Worker | null>(null);
  const [ready, setReady] = useState(false);

  useEffect(() => {
    const w = new Worker(new URL('./model.worker.ts', import.meta.url));
    workerRef.current = w;
    w.postMessage({ type: 'init', payload: { modelKey } });
    w.onmessage = (e) => { if (e.data.type === 'ready') setReady(true); };
    return () => { w.terminate(); };
  }, [modelKey]);

  async function infer(input: string) {
    return new Promise((resolve) => {
      const id = Math.random().toString(36).slice(2);
      const onMessage = (e: MessageEvent) => {
        if (e.data.id === id && e.data.type === 'result') {
          workerRef.current?.removeEventListener('message', onMessage);
          resolve(e.data.result);
        }
      };
      workerRef.current?.addEventListener('message', onMessage);
      workerRef.current?.postMessage({ id, type: 'infer', payload: { input } });
    });
  }

  return { ready, infer };
}
```
Vue 3 + TypeScript: composable pattern
Vue's composition API fits the same lifecycle pattern.
```typescript
// useLocalModel.ts
import { ref, onMounted, onBeforeUnmount } from 'vue';

export function useLocalModel(modelKey: string) {
  const ready = ref(false);
  let worker: Worker | null = null;

  onMounted(() => {
    worker = new Worker(new URL('./model.worker.ts', import.meta.url));
    worker.postMessage({ type: 'init', payload: { modelKey } });
    worker.onmessage = (e) => { if (e.data.type === 'ready') ready.value = true; };
  });

  onBeforeUnmount(() => worker?.terminate());

  const infer = async (input: string) =>
    new Promise((resolve) => {
      const id = Math.random().toString(36).slice(2);
      const handler = (e: MessageEvent) => {
        if (e.data.id === id && e.data.type === 'result') {
          worker?.removeEventListener('message', handler);
          resolve(e.data.result);
        }
      };
      worker?.addEventListener('message', handler);
      worker?.postMessage({ id, type: 'infer', payload: { input } });
    });

  return { ready, infer };
}
```
Next.js (App Router) with TypeScript: hybrid local/edge strategy
Next.js apps can keep inference local for interactive features while offering a server fallback for heavy jobs or when the user opts into cloud sync. Use client components for UI and a serverless route for optional offload.
- Client: UI + local worker + IndexedDB for models.
- Serverless API route: accepts encrypted artifacts, runs cloud-only heavy ops, and returns results.
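A minimal sketch of the opt-in offload endpoint as an App Router route handler. The route path and payload shape here are illustrative assumptions, not a prescribed API:

```typescript
// app/api/offload/route.ts — hypothetical opt-in cloud offload endpoint.
// The client only calls this after the user explicitly consents.
export async function POST(req: Request): Promise<Response> {
  const body = await req.json().catch(() => null);
  if (!body || typeof body.task !== 'string') {
    return Response.json({ error: 'invalid payload' }, { status: 400 });
  }
  // In a real app: validate/decrypt the artifact, run the heavy job,
  // and never persist the payload longer than needed.
  return Response.json({ accepted: true, task: body.task });
}
```

Because route handlers use the standard `Request`/`Response` types, the handler can be unit-tested without spinning up a Next.js server.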
Node/Server considerations
Sometimes you still need a server fallback for larger models or batched processing. Use Node.js with WASI runtimes (Wasmtime or Wasmer) when running portable WASM inference servers. Keep these endpoints opt-in and clearly audit what data leaves the user's device.
Privacy & security best practices
- Data never leaves by default: Make cloud sync opt-in. Default to storing models and logs locally.
- Encrypt at rest: Use Web Crypto AES-GCM with a user-derived key or platform attestation where available.
- Least privilege: Request only the permissions you need (avoid extraneous feature-policy permissions).
- Audit telemetry: If you ship telemetry or crash logs, make them explicit and anonymized.
- COOP/COEP: When you use SharedArrayBuffer or cross-origin isolation for threads, document why and fall back gracefully if unavailable.
Performance tuning and cost tradeoffs
Local-first improves privacy and offline capabilities but has tradeoffs:
- Model size vs latency: Favor aggressively quantized or distilled models for mobile targets.
- Progressive load: Lazy-load model components — e.g., tokenizer first, then a small decoder, then larger context windows.
- Hardware acceleration: Use WebGPU or WebNN when available; fall back to WASM for compatibility.
- Battery & thermal: Provide settings for reduced-power inference or cloud-offload to protect user devices.
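The acceleration fallback order above can be captured in a small helper. The backend names follow common runtime conventions (e.g. ONNX Runtime Web's execution providers); verify the exact strings against your runtime's docs:

```typescript
type Backend = 'webgpu' | 'webnn' | 'wasm';

// Pick the fastest available execution backend, falling back to plain WASM.
function pickBackend(caps: { webgpu: boolean; webnn: boolean }): Backend {
  if (caps.webgpu) return 'webgpu';
  if (caps.webnn) return 'webnn';
  return 'wasm';
}

// Browser feature detection would look roughly like:
// pickBackend({ webgpu: 'gpu' in navigator, webnn: 'ml' in navigator });
```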
Testing, observability, and UX
- Write deterministic unit tests for tokenization and model IO (you can mock the worker RPC layer).
- Measure time-to-first-response and time-to-converge; show progressive UI placeholders during model load.
- Provide clear model provenance: display model name, size, and whether it is quantized locally so users can make informed privacy decisions.
Real-world patterns and case study ideas (practical patterns)
Here are patterns that worked for teams building local-first features in 2025–2026:
- Model sharding + delta updates: Only download model deltas to reduce bandwidth on updates.
- Hybrid compute: Default to local inference; provide a "cloud boost" toggle for heavier tasks where users explicitly opt in.
- Model marketplace: Let users choose among models of different sizes and capabilities, and show the tradeoffs in-app.
- Privacy-first defaults: Do not send prompt history off-device unless the user consents to sync to cloud storage.
Common pitfalls and how to avoid them
- Large initial downloads — avoid bundling model binaries in the app package; use CDN shards and progressive streaming.
- Blocked threads — always run WASM inference in a Worker to keep the UI responsive.
- Cross-origin isolation surprises — provide a non-threaded fallback path if COOP/COEP isn't possible on a platform.
- Model poisoning — verify model checksums and sign models to avoid tampering when served from CDNs.
Example: end-to-end flow (React + ONNX Runtime Web + IndexedDB)
- User installs/visits app & Service Worker registers to enable offline caching.
- App checks IndexedDB: if model missing, it downloads shards from CDN and stores them encrypted with Web Crypto.
- App spins up a Web Worker that loads ONNX Runtime Web (WASM or WebGPU backend), initializes the model from the decrypted shard(s), and signals readiness.
- User requests inference; the UI posts the prompt to the worker. The worker runs inference and sends tokenized partial responses streaming back to the UI.
- User chooses to share results to a cloud service: the app requests explicit consent, encrypts data in transit, and uploads if approved.
What to watch in 2026 and beyond — trends & predictions
- Smaller, more efficient quantized models will continue to enable richer local experiences on phones and laptops.
- Browser APIs (WebGPU, WebNN) will keep improving performance and easing developer friction for hardware-accelerated inference.
- Standardization around model metadata (provenance, capabilities, size) will help users pick appropriate models for privacy and performance.
- More browsers (following Puma's example) will ship local AI affordances as first-class features, increasing user expectation for on-device intelligence.
Actionable checklist to get started (quick wins)
- Choose a small quantized model or convert a distilled model to ONNX/ggml for local use.
- Prototype inference in Node or a desktop browser with ORT Web or a wasm-ported runtime.
- Implement IndexedDB storage + Service Worker streaming for model shards.
- Run inference in a Web Worker, and build a minimal React hook or Vue composable for orchestration.
- Add client-side encryption with Web Crypto for any sensitive data at rest.
- Perf-test on target devices (low-end phones matter most) and iterate on quantization and progressive loading.
Quick tip: prioritize the UX. Show download progress and estimated time remaining, and let users cancel or restrict downloads to Wi‑Fi so they aren't surprised by mobile data bills.
Final thoughts
Local-first web apps in TypeScript are practical in 2026. With WebAssembly, WebGPU, and mature client runtimes, you can ship privacy-respecting AI features that work offline and protect user data. Puma's local AI approach showed the market that users value on-device intelligence — now it's our job as engineers to build it responsibly and efficiently.
Next steps
If you want a jumpstart, scaffold a small demo using Vite + React + ONNX Runtime Web, wire a Service Worker to stream a tiny model into IndexedDB, and run inference in a Worker. Measure latency and battery on real devices, then iterate with quantization and WebGPU acceleration.
Call to action: Start your local-first prototype today — drop your use case (chat, summarization, search, or image analysis) and I’ll suggest a concrete runtime + model combo and a minimal repo to get you running.