<?xml version="1.0" encoding="utf-8" standalone="yes"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:content="http://purl.org/rss/1.0/modules/content/">
  <channel>
    <title>Openai-Compatible on Mini Fish</title>
    <link>https://blog.minifish.org/tags/openai-compatible/</link>
    <description>Recent content in Openai-Compatible on Mini Fish</description>
    <image>
      <title>Mini Fish</title>
      <url>https://blog.minifish.org/android-chrome-512x512.png</url>
      <link>https://blog.minifish.org/android-chrome-512x512.png</link>
    </image>
    <generator>Hugo -- 0.161.1</generator>
    <language>en-US</language>
    <copyright>Mini Fish 2014-present. Licensed under CC-BY-NC</copyright>
    <lastBuildDate>Sun, 24 May 2026 20:15:00 +0800</lastBuildDate>
    <atom:link href="https://blog.minifish.org/tags/openai-compatible/index.xml" rel="self" type="application/rss+xml" />
    <item>
      <title>qwen-local: Running an OpenAI-Compatible Model Service on Apple Silicon</title>
      <link>https://blog.minifish.org/posts/qwen-local-on-apple-silicon/</link>
      <pubDate>Sun, 24 May 2026 20:15:00 +0800</pubDate>
      <guid>https://blog.minifish.org/posts/qwen-local-on-apple-silicon/</guid>
      <description>A project note on qwen-local, a local OpenAI-compatible AI service for Apple Silicon using MLX, Qwen, Kokoro, and Whisper.</description>
      <content:encoded><![CDATA[<p><code>qwen-local</code> is an OpenAI-compatible local model service for a 16 GB Apple Silicon Mac. It wraps local MLX models behind a FastAPI service and exposes chat, embeddings, text-to-speech, and speech-to-text through one local endpoint.</p>
<p>The idea is simple: keep local inference usable by normal OpenAI SDK clients.</p>
<h2 id="why-this-exists">Why this exists</h2>
<p>Local models are most useful when they can plug into existing tools. A model that only works through a special command is interesting, but a model that looks like an OpenAI-compatible service can be used by editors, agents, scripts, and gateways.</p>
<p>That is the purpose of <code>qwen-local</code>. It is not a model research project. It is an adapter and runtime boundary.</p>
<p>The default shape is:</p>
<ul>
<li>MLX for Apple Silicon inference</li>
<li>Qwen for chat</li>
<li>Qwen embeddings</li>
<li>Kokoro for text-to-speech</li>
<li>MLX Whisper for speech-to-text</li>
<li>one <code>/v1</code> API surface</li>
</ul>
<p>Once models are cached, inference should not require external API calls.</p>
<h2 id="the-useful-constraint-16-gb">The useful constraint: 16 GB</h2>
<p>The project targets a realistic personal machine rather than a workstation with huge memory. That constraint forces decisions:</p>
<ul>
<li>prefer quantized models</li>
<li>keep concurrency conservative</li>
<li>avoid loading every capability eagerly if it hurts responsiveness</li>
<li>make stuck inference and runtime locks visible</li>
</ul>
<p>Local AI services fail in different ways from hosted APIs. A hosted provider returns rate-limit errors or provider errors. A local process can get memory pressure, model load stalls, file cache problems, or long single-user queues.</p>
<p>That makes operational behavior part of the product.</p>
<h2 id="relationship-to-tailgate">Relationship to tailgate</h2>
<p><code>qwen-local</code> is the local model service. Tailgate is the gateway that decides when to use it.</p>
<p>Keeping those roles separate matters. The local service should focus on model loading, request compatibility, and media endpoints. The gateway can handle policy, provider selection, fallback, and external clients.</p>
<p>That split keeps <code>qwen-local</code> from becoming a general AI router.</p>
<h2 id="what-i-learned">What I learned</h2>
<p>OpenAI-compatible does not mean full OpenAI clone. The useful target is compatibility for the clients I actually use:</p>
<ul>
<li>chat completions</li>
<li>embeddings</li>
<li>speech generation</li>
<li>transcription</li>
<li>predictable model IDs</li>
<li>normal error shapes where possible</li>
</ul>
<p>The second lesson is that local inference needs a health model. It is not enough to expose an endpoint. I need to know whether the service is loaded, busy, stuck, or unavailable, especially when another tool is routing requests into it.</p>
<h2 id="open-source-status">Open source status</h2>
<p>This project is private because it includes local operational assumptions and is tuned for my own machine. The general pattern is public enough to discuss: make local models boring by putting them behind familiar API contracts.</p>
]]></content:encoded>
    </item>
    <item>
      <title>Tailgate: A Private AI Gateway for Local and Remote Models</title>
      <link>https://blog.minifish.org/posts/tailgate-private-ai-gateway/</link>
      <pubDate>Sun, 24 May 2026 20:10:00 +0800</pubDate>
      <guid>https://blog.minifish.org/posts/tailgate-private-ai-gateway/</guid>
      <description>A project note on tailgate, a private AI gateway that centralizes model routing, secrets, provider selection, and local model integration.</description>
      <content:encoded><![CDATA[<p>Tailgate is a personal OpenAI-compatible AI gateway. It gives tools like Codex, Cursor, SDK clients, and local agents one private <code>base_url</code>, while provider keys and routing rules stay on a server I control.</p>
<p>It is not meant to be a public model marketplace. The point is not to replace OpenRouter or any other provider. The point is to make my own AI workflow less scattered.</p>
<h2 id="the-problem">The problem</h2>
<p>Once you use multiple model providers, the configuration spreads quickly:</p>
<ul>
<li>local model endpoint</li>
<li>hosted model provider keys</li>
<li>fallback behavior</li>
<li>model names</li>
<li>pricing assumptions</li>
<li>tool-specific environment variables</li>
<li>different capabilities for chat, embeddings, speech, and transcription</li>
</ul>
<p>Every client wants a slightly different setup. That is annoying for normal use and worse for agents, because agent configuration should be boring and repeatable.</p>
<p>Tailgate puts that complexity behind one OpenAI-compatible surface.</p>
<h2 id="design-shape">Design shape</h2>
<p>The core API follows familiar endpoints:</p>
<ul>
<li><code>GET /v1/models</code></li>
<li><code>POST /v1/chat/completions</code></li>
<li><code>POST /v1/embeddings</code></li>
<li><code>POST /v1/audio/speech</code></li>
<li><code>POST /v1/audio/transcriptions</code></li>
</ul>
<p>Behind that surface, the gateway can route requests to local <code>qwen-local</code>, DeepSeek, OpenRouter, or future compatible providers. It tracks provider health, supports streaming passthrough, and can apply simple route selection rules.</p>
<p>The most useful rule is not fancy AI logic. It is policy:</p>
<ul>
<li>prefer local when the task fits</li>
<li>keep secrets off client machines</li>
<li>avoid sending private work to external providers accidentally</li>
<li>fall back only when the route explicitly allows it</li>
</ul>
<h2 id="why-private">Why private</h2>
<p>Tailgate contains too many assumptions about my own environment to be a clean open source project. It is shaped around private networking, provider credentials, model preferences, and operational defaults.</p>
<p>The public lesson is still useful: an AI gateway does not need to start as a large platform. For one person, it can simply be a policy boundary.</p>
<h2 id="what-i-learned">What I learned</h2>
<p>The biggest value of a gateway is not only key management. It is reducing mental overhead.</p>
<p>Before the gateway, every tool needed to know too much. After the gateway, tools only need:</p>
<ul>
<li>one base URL</li>
<li>one API key or private network policy</li>
<li>normal OpenAI-compatible request shapes</li>
</ul>
<p>That makes experiments cheaper. I can change the provider map without editing every client.</p>
<p>The second lesson is that local models need protection. A small local model service may only handle one heavy inference at a time. A gateway can enforce concurrency and fallback rules so clients do not accidentally overload the local runtime.</p>
<h2 id="current-status">Current status</h2>
<p>Tailgate is active and private. I expect it to stay private unless the configuration model becomes generic enough to be useful outside my own setup.</p>
]]></content:encoded>
    </item>
  </channel>
</rss>
