Openai-Compatible on Mini Fish

qwen-local: Running an OpenAI-Compatible Model Service on Apple Silicon

Sun, 24 May 2026 20:15:00 +0800

qwen-local is an OpenAI-compatible local model service for a 16 GB Apple Silicon Mac. It wraps local MLX models behind a FastAPI service and exposes chat, embeddings, text-to-speech, and speech-to-text through one local endpoint.

The idea is simple: keep local inference usable by normal OpenAI SDK clients.

Why this exists

Local models are most useful when they can plug into existing tools. A model that only works through a special command is interesting, but a model that looks like an OpenAI-compatible service can be used by editors, agents, scripts, and gateways.

That is the purpose of qwen-local. It is not a model research project. It is an adapter and runtime boundary.

The default shape is:

MLX for Apple Silicon inference
Qwen for chat
Qwen embeddings
Kokoro for text-to-speech
MLX Whisper for speech-to-text
one /v1 API surface

Once models are cached, inference should not require external API calls.

The useful constraint: 16 GB

The project targets a realistic personal machine rather than a workstation with huge memory. That constraint forces decisions:

prefer quantized models
keep concurrency conservative
avoid loading every capability eagerly if it hurts responsiveness
make stuck inference and runtime locks visible

Local AI services fail in different ways from hosted APIs. A hosted provider returns rate-limit errors or provider errors. A local process can get memory pressure, model load stalls, file cache problems, or long single-user queues.

That makes operational behavior part of the product.

Relationship to tailgate

qwen-local is the local model service. Tailgate is the gateway that decides when to use it.

Keeping those roles separate matters. The local service should focus on model loading, request compatibility, and media endpoints. The gateway can handle policy, provider selection, fallback, and external clients.

That split keeps qwen-local from becoming a general AI router.

What I learned

OpenAI-compatible does not mean full OpenAI clone. The useful target is compatibility for the clients I actually use:

chat completions
embeddings
speech generation
transcription
predictable model IDs
normal error shapes where possible

The second lesson is that local inference needs a health model. It is not enough to expose an endpoint. I need to know whether the service is loaded, busy, stuck, or unavailable, especially when another tool is routing requests into it.

Open source status

This project is private because it includes local operational assumptions and is tuned for my own machine. The general pattern is public enough to discuss: make local models boring by putting them behind familiar API contracts.

Tailgate: A Private AI Gateway for Local and Remote Models

Sun, 24 May 2026 20:10:00 +0800

Tailgate is a personal OpenAI-compatible AI gateway. It gives tools like Codex, Cursor, SDK clients, and local agents one private base_url, while provider keys and routing rules stay on a server I control.

It is not meant to be a public model marketplace. The point is not to replace OpenRouter or any other provider. The point is to make my own AI workflow less scattered.

The problem

Once you use multiple model providers, the configuration spreads quickly:

local model endpoint
hosted model provider keys
fallback behavior
model names
pricing assumptions
tool-specific environment variables
different capabilities for chat, embeddings, speech, and transcription

Every client wants a slightly different setup. That is annoying for normal use and worse for agents, because agent configuration should be boring and repeatable.

Tailgate puts that complexity behind one OpenAI-compatible surface.

Design shape

The core API follows familiar endpoints:

GET /v1/models
POST /v1/chat/completions
POST /v1/embeddings
POST /v1/audio/speech
POST /v1/audio/transcriptions

Behind that surface, the gateway can route requests to local qwen-local, DeepSeek, OpenRouter, or future compatible providers. It tracks provider health, supports streaming passthrough, and can apply simple route selection rules.

The most useful rule is not fancy AI logic. It is policy:

prefer local when the task fits
keep secrets off client machines
avoid sending private work to external providers accidentally
fall back only when the route explicitly allows it

Why private

Tailgate contains too many assumptions about my own environment to be a clean open source project. It is shaped around private networking, provider credentials, model preferences, and operational defaults.

The public lesson is still useful: an AI gateway does not need to start as a large platform. For one person, it can simply be a policy boundary.

What I learned

The biggest value of a gateway is not only key management. It is reducing mental overhead.

Before the gateway, every tool needed to know too much. After the gateway, tools only need:

one base URL
one API key or private network policy
normal OpenAI-compatible request shapes

That makes experiments cheaper. I can change the provider map without editing every client.

The second lesson is that local models need protection. A small local model service may only handle one heavy inference at a time. A gateway can enforce concurrency and fallback rules so clients do not accidentally overload the local runtime.

Current status

Tailgate is active and private. I expect it to stay private unless the configuration model becomes generic enough to be useful outside my own setup.