Google's 10.28 GB Gemma 4 E2B model can transcribe audio on macOS using MLX and mlx-vlm, runnable in a single uv command with no environment setup.
The recipe, credited to Rahim Nathwani, invokes uv run with Python 3.13, pulling in mlx_vlm, torchvision, and gradio, then passes a .wav file directly to mlx_vlm.generate with a plain-text prompt. On a 14-second test clip, Gemma returned a largely accurate transcript: it misread 'This right here' as 'This front here' and dropped one word, but the structure held.
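The shape of such a command looks roughly like the sketch below. This is an illustration of the pattern, not the exact one-liner from the post: the model identifier, prompt wording, token limit, and audio flag here are all assumptions, and the original post has the verified command.

```shell
# Sketch of the recipe described above, under stated assumptions:
# uv provisions Python 3.13 and the listed packages in a throwaway
# environment, then runs mlx_vlm's generate entry point against a .wav file.
# <huggingface-model-id>, the prompt text, and --audio are placeholders /
# assumptions -- consult the original post for the exact invocation.
uv run --python 3.13 \
  --with mlx-vlm --with torchvision --with gradio \
  python -m mlx_vlm.generate \
    --model <huggingface-model-id> \
    --max-tokens 256 \
    --prompt "Transcribe this audio." \
    --audio speech.wav
```

Note that the first run will download the model weights (over 10 GB here), so the command is fast to type but not fast to start cold.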
The original post includes the full one-liner command, the test audio file, and the exact output with annotated errors. Worth reading if you want to understand where the model stumbles and whether the accuracy is acceptable for your workload.
[READ ORIGINAL →]