Step 1 of 3 — How voice capture works
Record
You're walking to the shed. An idea hits. You pull out your phone, hit record, ramble for 20 seconds, and put it away. By the time you reach the shed, it's already a tagged task in your project board.
What happens
- Phone records voice — any length, any quality, background noise is fine
- Telegram bot receives the audio file instantly
- MLX Whisper transcribes locally on Apple Silicon (no cloud, no per-minute cost, no network round trip)
- Transcript saved with YAML frontmatter (timestamp, duration, source)
- Raw audio deleted immediately — only text persists
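The last three steps above (transcribe, save with frontmatter, delete the audio) can be sketched in a few lines. This is a minimal sketch, not the actual implementation: the function names `render_note` and `capture`, the note filename pattern, and the exact frontmatter keys are assumptions made for illustration. The transcriber is passed in as a plain callable so the sketch runs without Telegram or MLX Whisper installed.

```python
from datetime import datetime, timezone
from pathlib import Path
from typing import Callable


def render_note(text: str, duration_s: float, source: str, timestamp: datetime) -> str:
    """Return the transcript as a text note with YAML frontmatter on top."""
    return (
        "---\n"
        f"timestamp: {timestamp.isoformat()}\n"
        f"duration: {duration_s:.1f}\n"
        f"source: {source}\n"
        "---\n\n"
        f"{text.strip()}\n"
    )


def capture(audio_path: Path, notes_dir: Path, duration_s: float,
            transcribe: Callable[[str], str]) -> Path:
    """Transcribe one audio file, save the note, and always delete the audio."""
    now = datetime.now(timezone.utc)
    try:
        text = transcribe(str(audio_path))
    finally:
        # Assumption: the raw audio is removed even if transcription fails,
        # so no recording is ever left on disk.
        audio_path.unlink(missing_ok=True)
    notes_dir.mkdir(parents=True, exist_ok=True)
    # Hypothetical naming scheme: one note per capture, named by UTC time.
    note_path = notes_dir / f"{now:%Y%m%dT%H%M%S}.md"
    note_path.write_text(render_note(text, duration_s, "telegram", now), encoding="utf-8")
    return note_path
```

With mlx-whisper installed, the `transcribe` argument could be something like `lambda p: mlx_whisper.transcribe(p, path_or_hf_repo="mlx-community/whisper-large-v3-mlx")["text"]`. Treat that call as an assumption and check it against the version you install.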
Real example
VOICE NOTE
"Need to update the pricing page, draft the newsletter intro, and check if the new product is live in the shop"
WHAT HAPPENS
3 separate intents detected. Each becomes its own task, tagged to the correct project. Pricing → Website project. Newsletter intro → Marketing. Product check → Shop Management.
RESULT
20 seconds of rambling → 3 structured tasks with project links, priority, and category. Zero typing.
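This page does not say how the intents are detected, so the following is only a deliberately naive stand-in to make the idea concrete: it splits the transcript on commas and tags each clause using a keyword table. The `Task` class, the `PROJECT_KEYWORDS` table, and the `"Inbox"` fallback are all hypothetical, and priority and category are left out.

```python
import re
from dataclasses import dataclass

# Hypothetical keyword-to-project table; a real setup would use its own projects.
PROJECT_KEYWORDS = {
    "pricing": "Website",
    "newsletter": "Marketing",
    "product": "Shop Management",
}


@dataclass
class Task:
    title: str
    project: str


def split_intents(transcript: str) -> list[Task]:
    """Split on commas (with an optional following 'and') and tag each clause
    with the first keyword in the table that it contains."""
    clauses = re.split(r",\s*(?:and\s+)?", transcript.strip().rstrip("."))
    tasks = []
    for clause in clauses:
        clause = clause.strip()
        if not clause:
            continue
        lowered = clause.lower()
        project = next(
            (name for key, name in PROJECT_KEYWORDS.items() if key in lowered),
            "Inbox",  # fallback when no keyword matches
        )
        tasks.append(Task(title=clause, project=project))
    return tasks


note = ("Need to update the pricing page, draft the newsletter intro, "
        "and check if the new product is live in the shop")
for task in split_intents(note):
    print(f"{task.project}: {task.title}")
```

Keyword matching like this breaks quickly on real speech (no commas in a raw transcript, ambiguous wording), which is why it should be read as an illustration of the input and output shapes rather than a detection method.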
Try it yourself
**Role**
Act as a senior Python developer with experience building Telegram bots and local ML inference pipelines on Apple Silicon.

**Task**
Build a minimal voice transcription bot that receives audio via Telegram and returns text.
1. Set up a python-telegram-bot handler for voice messages
2. Download the incoming .ogg file to a temp directory
3. Transcribe using MLX Whisper (mlx-community/whisper-large-v3-mlx)
4. Reply with the transcript text
5. Delete the audio file immediately after transcription
6. Add basic error handling (file too large, transcription failure)

**Context**
- Stack: Python 3.11+, python-telegram-bot, mlx-whisper
- Bot token stored in .env as TELEGRAM_BOT_TOKEN
- Target machine: Apple Silicon Mac (M1/M2/M3)
- No cloud APIs; everything runs locally
- Keep it under 50 lines for v0.1

**Output Format**
A single Python file (voice_bot.py) with:
- Imports at top
- Handler function for voice messages
- Main block that starts polling
- Inline comments explaining each step

**Stop Conditions**
Done when: bot receives a voice note, transcribes it, and replies with text. No queue, no task extraction, just transcription.