Four jobs, four hard problems
Strip away the marketing and an interview copilot is four pipelines that have to run in lockstep with sub-second latency: capture, transcribe, route, and render. Each one has a non-obvious failure mode.
1. Capture
You need both sides of the call. On macOS that means:
- System audio: what the interviewer is saying, coming out of your speakers or headphones. Captured via a CoreAudio tap (macOS 14.4+) or ScreenCaptureKit audio.
- Microphone: your own voice, for context and for marking when the interviewer is speaking vs you.
- Screen: the editor / question / slides currently visible.
Failure mode: Electron-based overlays often go through web audio APIs that fight with the meeting tool. Cloak uses the Rust cpal crate for the microphone and a native CoreAudio tap for system audio, so neither competes with Zoom for the device.
2. Transcribe
Latency budget for live transcription is ~250 ms per segment. Anything slower and the answer arrives after you've started talking.
Real-world choices:
- OpenAI-hosted Whisper: accurate, ~400–600 ms latency on short clips. Default in Cloak.
- Groq-hosted Whisper: same accuracy, ~150 ms latency. Best option when available.
- ElevenLabs Scribe: strong diarization, ~250 ms.
- Local whisper.cpp: slower (1–2 s) but offline.
Cloak's STT layer is pluggable so you can pick per-task. The transcript pipeline also runs a rolling de-duplication pass (the kind of thing where "I think I think" becomes "I think") because real-world STT stutters on word boundaries.
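The de-duplication idea can be sketched in a few lines. This is a minimal illustration, not Cloak's actual pass: the function name collapseStutter and the maxPhraseWords limit are assumptions made for the example. It drops a word or short phrase when it immediately repeats the one before it.

```typescript
// Hypothetical helper, not taken from Cloak's source.
// Collapses immediately repeated words or short phrases,
// e.g. "I think I think" -> "I think".
function collapseStutter(text: string, maxPhraseWords: number = 4): string {
  const words = text.split(/\s+/).filter((w) => w.length > 0);
  const out: string[] = [];

  for (const word of words) {
    out.push(word);
    // Check the longest candidate phrase first, then shorter ones.
    const longest = Math.min(maxPhraseWords, Math.floor(out.length / 2));
    for (let n = longest; n >= 1; n--) {
      const tail = out.slice(out.length - n);
      const prev = out.slice(out.length - 2 * n, out.length - n);
      const repeated = tail.every(
        (w, i) => w.toLowerCase() === prev[i]?.toLowerCase(),
      );
      if (repeated) {
        out.length -= n; // drop the duplicated tail
        break;
      }
    }
  }
  return out.join(" ");
}
```

A pass this naive also collapses deliberate repetition ("very very good") and misses repeats separated by punctuation, so a production version would likely want timestamps or confidence scores from the STT provider to decide what counts as a stutter.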
3. Route
"Route" is the unsexy heart of an interview copilot. The user pressed a hotkey. What does the system send to the model?
- The last N seconds of transcript (typically 30–60).
- An optional screenshot encoded as base64.
- A system prompt corresponding to the active persona.
- Resume and job description (JD) as a tool-injected context block (Pro feature).
- An intent classifier output: was that a question, a follow-up, or thinking aloud?
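As a rough sketch of how the first four items might be assembled into one request: the types and function names below (TranscriptSegment, RouteRequest, recentTranscript, buildRouteRequest) are hypothetical and chosen for illustration, not taken from Cloak's source. The intent classifier output is left out of this sketch.

```typescript
// All names here are illustrative assumptions, not Cloak's API.
interface TranscriptSegment {
  text: string;
  endMs: number; // when the segment finished, in ms
  speaker: "interviewer" | "user";
}

interface RouteRequest {
  systemPrompt: string; // persona prompt
  transcript: string; // last N seconds, one line per segment
  screenshotBase64?: string;
  contextBlock?: string; // e.g. resume + job description
}

// Keep only segments that ended inside the trailing window.
function recentTranscript(
  segments: TranscriptSegment[],
  nowMs: number,
  windowSeconds: number,
): string {
  const cutoff = nowMs - windowSeconds * 1000;
  return segments
    .filter((s) => s.endMs >= cutoff)
    .map((s) => `${s.speaker}: ${s.text}`)
    .join("\n");
}

function buildRouteRequest(input: {
  segments: TranscriptSegment[];
  nowMs: number;
  windowSeconds: number;
  systemPrompt: string;
  screenshotBase64?: string;
  contextBlock?: string;
}): RouteRequest {
  const request: RouteRequest = {
    systemPrompt: input.systemPrompt,
    transcript: recentTranscript(
      input.segments,
      input.nowMs,
      input.windowSeconds,
    ),
  };
  // Optional parts are attached only when present.
  if (input.screenshotBase64 !== undefined) {
    request.screenshotBase64 = input.screenshotBase64;
  }
  if (input.contextBlock !== undefined) {
    request.contextBlock = input.contextBlock;
  }
  return request;
}
```

Filtering on segment end time rather than start time means a long question that began before the window opened still makes it into the prompt.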
Get the routing wrong and the model answers the wrong question. Cloak's intent classifier is a fast small-model call that runs in parallel with the main model call. If intent is "thinking aloud" the main call gets cancelled.
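One way to express "classify in parallel, cancel on thinking aloud" is with an AbortController. The sketch below is an assumption about the shape of such code, not Cloak's implementation: classify and generate are hypothetical callbacks standing in for the small-model and main-model calls, and the choice to treat a failed classifier as a question is mine.

```typescript
type Intent = "question" | "follow-up" | "thinking-aloud";

// Hypothetical orchestration helper. Returns null when the main
// call was cancelled because the user was only thinking aloud.
async function answerUnlessThinkingAloud(
  classify: () => Promise<Intent>,
  generate: (signal: AbortSignal) => Promise<string>,
): Promise<string | null> {
  const controller = new AbortController();

  // Start the main call immediately; swallow the error only if
  // it was caused by our own abort.
  const answer: Promise<string | null> = generate(controller.signal).catch(
    (err: unknown): null => {
      if (controller.signal.aborted) return null;
      throw err;
    },
  );

  // Run the classifier alongside it. Assumption: if the classifier
  // fails, fall back to "question" so the user still gets an answer.
  const intent: Promise<Intent> = classify()
    .catch((): Intent => "question")
    .then((result) => {
      if (result === "thinking-aloud") controller.abort();
      return result;
    });

  const [resolvedIntent, resolvedAnswer] = await Promise.all([intent, answer]);
  return resolvedIntent === "thinking-aloud" ? null : resolvedAnswer;
}
```

Because both calls start at once, the classifier adds no latency to the common case; it only saves tokens and screen noise when the answer turns out to be unwanted.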
4. Render
The overlay window has to be:
- Always on top.
- Non-capturable by screen sharing (this is the whole point).
- Streaming tokens at ~30 fps so reading feels live.
- Resizable based on content without flicker or scroll-jump.
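The ~30 fps requirement usually means tokens are batched per frame rather than rendered one by one. Here is a minimal sketch of that idea, assuming a hypothetical createTokenBatcher helper (not a Cloak API) and a caller that supplies the current time, which keeps the logic easy to test.

```typescript
// Hypothetical helper: buffers streamed tokens and releases them
// at most once per frame (default ~33 ms, i.e. ~30 fps).
function createTokenBatcher(frameMs: number = 1000 / 30) {
  let buffer = "";
  let lastFlushMs = -Infinity;

  // Emit whatever is buffered, or null if there is nothing to render.
  function flush(nowMs: number): string | null {
    if (buffer.length === 0) return null;
    const chunk = buffer;
    buffer = "";
    lastFlushMs = nowMs;
    return chunk;
  }

  // Add a token; returns a chunk to render only when a full frame
  // interval has passed since the previous flush.
  function push(token: string, nowMs: number): string | null {
    buffer += token;
    if (nowMs - lastFlushMs < frameMs) return null;
    return flush(nowMs);
  }

  return { push, flush };
}
```

The caller would append each non-null chunk to the rendered text and call flush once more when the stream ends, otherwise the last few tokens stay in the buffer.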
On macOS the non-capturable property is a real OS-level guarantee. You set sharingType: .none on an NSPanel via tauri-nspanel and the window server itself filters the surface out of every capture API. There is no CSS / JavaScript way to fake this on Windows or Linux, which is why Cloak is macOS-only on purpose.
The streaming token render is harder than it looks. Markdown reflow during streaming causes layout thrash; framer-motion exit animations during the same period cause the window to "remember" a stale measured size. Cloak's useWindow hook measures the live content root with a ResizeObserver and dispatches Tauri IPC resize calls clamped to OVERLAY_MAX_HEIGHT so the overlay never grows past the user's screen.
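The clamping step on its own is small enough to sketch. The values below are assumptions: the source names OVERLAY_MAX_HEIGHT but not its value, and OVERLAY_MIN_HEIGHT and nextOverlayHeight are hypothetical names added for the example. In a real hook this function would be called from the ResizeObserver callback, and a non-null result would trigger the resize call.

```typescript
// Assumed values for illustration; the real constants may differ.
const OVERLAY_MAX_HEIGHT = 720;
const OVERLAY_MIN_HEIGHT = 80; // hypothetical lower bound

// Given the measured content height and the window's current height,
// return the height to request, or null if no resize is needed.
function nextOverlayHeight(
  measuredPx: number,
  currentPx: number,
  maxPx: number = OVERLAY_MAX_HEIGHT,
): number | null {
  // Round up so fractional layout heights never clip the last line.
  const wanted = Math.ceil(measuredPx);
  const clamped = Math.min(Math.max(wanted, OVERLAY_MIN_HEIGHT), maxPx);
  return clamped === currentPx ? null : clamped;
}
```

Returning null when the height is unchanged keeps the hook from sending a resize request on every streamed token, which is one plausible way to reduce the flicker described above.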
The thing you can't engineer around
All this delivers you an answer. It cannot deliver you composure. Interview copilots that promise "auto-answer" are lying — there is no way for an external tool to inhabit your voice, your body language, or your willingness to admit you don't know something. The best overlays keep you sharp and let your real ability come through faster, not pretend to replace it.
Built into Cloak
Every architectural choice above is in Cloak's source on GitHub. If you want to see exactly how an interview copilot is built — read the source. If you want to use one — download Cloak.