Google's 10.28 GB Gemma 4 E2B model can transcribe audio on macOS using MLX and mlx-vlm, runnable in a single uv command with no environment setup.
The recipe, credited to Rahim Nathwani, invokes uv run with Python 3.13, pulling in mlx_vlm, torchvision, and gradio, then passes a .wav file directly to mlx_vlm.generate with a plain-text prompt. On a 14-second test clip, Gemma returned a largely accurate transcript: it misread 'This right here' as 'This front here' and dropped one word, but the structure held.
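The shape of such a command looks roughly like the sketch below. This is an illustration of the pattern, not the exact one-liner from the post: the model identifier, prompt wording, token limit, and audio flag here are all assumptions, and the original post has the verified command.

```shell
# Sketch of the recipe described above, under stated assumptions:
# uv provisions Python 3.13 and the listed packages in a throwaway
# environment, then runs mlx_vlm's generate entry point against a .wav file.
# <huggingface-model-id>, the prompt text, and --audio are placeholders /
# assumptions -- consult the original post for the exact invocation.
uv run --python 3.13 \
  --with mlx-vlm --with torchvision --with gradio \
  python -m mlx_vlm.generate \
    --model <huggingface-model-id> \
    --max-tokens 256 \
    --prompt "Transcribe this audio." \
    --audio speech.wav
```

Note that the first run will download the model weights (over 10 GB here), so the command is fast to type but not fast to start cold.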
The original post includes the full one-liner command, the test audio file, and the exact output with annotated errors. Worth reading if you want to understand where the model stumbles and whether the accuracy is acceptable for your workload.
[READ ORIGINAL →]