<?xml version="1.0" encoding="utf-8" standalone="yes"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:content="http://purl.org/rss/1.0/modules/content/">
  <channel>
    <title>Qwen-Local on Mini Fish</title>
    <link>https://blog.minifish.org/tags/qwen-local/</link>
    <description>Recent content in Qwen-Local on Mini Fish</description>
    <image>
      <title>Mini Fish</title>
      <url>https://blog.minifish.org/android-chrome-512x512.png</url>
      <link>https://blog.minifish.org/android-chrome-512x512.png</link>
    </image>
    <generator>Hugo -- 0.161.1</generator>
    <language>en-US</language>
    <copyright>Mini Fish 2014-present. Licensed under CC-BY-NC</copyright>
    <lastBuildDate>Sun, 24 May 2026 20:15:00 +0800</lastBuildDate>
    <atom:link href="https://blog.minifish.org/tags/qwen-local/index.xml" rel="self" type="application/rss+xml" />
    <item>
      <title>qwen-local: Running an OpenAI-Compatible Model Service on Apple Silicon</title>
      <link>https://blog.minifish.org/posts/qwen-local-on-apple-silicon/</link>
      <pubDate>Sun, 24 May 2026 20:15:00 +0800</pubDate>
      <guid>https://blog.minifish.org/posts/qwen-local-on-apple-silicon/</guid>
      <description>A project note on qwen-local, a local OpenAI-compatible AI service for Apple Silicon using MLX, Qwen, Kokoro, and Whisper.</description>
      <content:encoded><![CDATA[<p><code>qwen-local</code> is an OpenAI-compatible local model service for a 16 GB Apple Silicon Mac. It wraps local MLX models behind a FastAPI service and exposes chat, embeddings, text-to-speech, and speech-to-text through one local endpoint.</p>
<p>The idea is simple: keep local inference usable by normal OpenAI SDK clients.</p>
<h2 id="why-this-exists">Why this exists</h2>
<p>Local models are most useful when they can plug into existing tools. A model that only works through a special command is interesting, but a model that looks like an OpenAI-compatible service can be used by editors, agents, scripts, and gateways.</p>
<p>That is the purpose of <code>qwen-local</code>. It is not a model research project. It is an adapter and runtime boundary.</p>
<p>The default shape is:</p>
<ul>
<li>MLX for Apple Silicon inference</li>
<li>Qwen for chat</li>
<li>Qwen embeddings</li>
<li>Kokoro for text-to-speech</li>
<li>MLX Whisper for speech-to-text</li>
<li>one <code>/v1</code> API surface</li>
</ul>
<p>Once models are cached, inference should not require external API calls.</p>
<h2 id="the-useful-constraint-16-gb">The useful constraint: 16 GB</h2>
<p>The project targets a realistic personal machine rather than a workstation with huge memory. That constraint forces decisions:</p>
<ul>
<li>prefer quantized models</li>
<li>keep concurrency conservative</li>
<li>avoid loading every capability eagerly if it hurts responsiveness</li>
<li>make stuck inference and runtime locks visible</li>
</ul>
<p>Local AI services fail in different ways from hosted APIs. A hosted provider returns rate-limit errors or provider errors. A local process can get memory pressure, model load stalls, file cache problems, or long single-user queues.</p>
<p>That makes operational behavior part of the product.</p>
<h2 id="relationship-to-tailgate">Relationship to tailgate</h2>
<p><code>qwen-local</code> is the local model service. Tailgate is the gateway that decides when to use it.</p>
<p>Keeping those roles separate matters. The local service should focus on model loading, request compatibility, and media endpoints. The gateway can handle policy, provider selection, fallback, and external clients.</p>
<p>That split keeps <code>qwen-local</code> from becoming a general AI router.</p>
<h2 id="what-i-learned">What I learned</h2>
<p>OpenAI-compatible does not mean full OpenAI clone. The useful target is compatibility for the clients I actually use:</p>
<ul>
<li>chat completions</li>
<li>embeddings</li>
<li>speech generation</li>
<li>transcription</li>
<li>predictable model IDs</li>
<li>normal error shapes where possible</li>
</ul>
<p>The second lesson is that local inference needs a health model. It is not enough to expose an endpoint. I need to know whether the service is loaded, busy, stuck, or unavailable, especially when another tool is routing requests into it.</p>
<h2 id="open-source-status">Open source status</h2>
<p>This project is private because it includes local operational assumptions and is tuned for my own machine. The general pattern is public enough to discuss: make local models boring by putting them behind familiar API contracts.</p>
]]></content:encoded>
    </item>
  </channel>
</rss>
