Why local coding copilots matter
Most coding copilots — Cursor's cloud mode, GitHub Copilot, Cody — ship your editor context to a hosted model. For personal projects that's fine. For client work under NDA, regulated industries, source code you don't have permission to send anywhere, or just personal preference, it isn't. A local coding copilot keeps the source where it started.
The 2026 state of the art
The frontier models for code (the GPT-5 / Claude 4.5 / Gemini 3 tier) are still cloud-only. But open-weights models have closed an enormous amount of the gap. As of mid-2026:
- Qwen, DeepSeek-Coder, and Llama-3-Code variants in the 32–70B range run on M-series Macs with enough unified memory and produce genuinely useful code completions.
- Smaller 7–14B models are fast enough for inline completion on any modern Mac.
- Tooling — Ollama, LM Studio, MLX — makes setup a 10-minute task instead of an afternoon.
The honest summary: local is "good enough for a lot of daily work, behind the frontier for the hardest tasks."
Where Cloak fits in
Cloak is not a code completion engine. It's a conversational overlay. But it pairs well with a local model server for "talk to your code" use cases:
- You're pair-programming with yourself out loud. Cloak transcribes; the model responds in the overlay.
- You took a screenshot of a stack trace. Cloak ships it to the local vision-capable model and gets back a fix proposal.
- You're in a code review and want a second opinion. The overlay summarizes the diff and flags risks without anything leaving your machine.
Wiring Cloak to a local model
Any local server that exposes an OpenAI-compatible /v1/chat/completions endpoint
works. The most common setup:
# 1. install ollama: https://ollama.com
ollama pull qwen2.5-coder:32b
ollama serve
# 2. in Cloak: Settings → Models → Custom Provider
# Base URL: http://localhost:11434/v1
# Model: qwen2.5-coder:32b
# API key: (any non-empty string)
With local Whisper enabled in Settings → STT, Cloak now runs end-to-end on your Mac. No network traffic during the interaction.
Performance reality check
On an M4 Max with 64 GB unified memory, a 32B coder model streams tokens at roughly 25–40 tok/sec. That feels like a fast cloud model. On an M2 with 16 GB, expect 8–15 tok/sec with a 7B model and significant fan noise on long generations. Plan accordingly.
What you give up
- Frontier reasoning on hard problems. Multi-file refactors, novel algorithms, and large-scale bug isolation are still better on the hosted frontier models.
- Massive context windows. Local models top out at 32–128k context; frontier hosted are 200k–2M.
- Tool use polish. Local models can call tools, but the orchestration story isn't as clean as the hosted SDK ecosystems.
The hybrid that works
The setup most Cloak users land on:
- Local Whisper for transcription.
- Local 7–14B coder model for chat, completion, and "what does this stack trace mean".
- Cloud frontier model for the few-times-a-week heavy lifts — flagged explicitly in Settings so you know when source leaves the machine.
That gives you a real local coding copilot for 90% of work and a sharp tool when you need it.
Get Cloak
Download from the home page. Configure the custom local provider in Settings; everything else just works.