Flock is the self-hosted control plane for LLMs. One Go binary turns your Macs and Linux boxes into a private inference cluster, exposes one endpoint that speaks both the OpenAI and Anthropic APIs, and gives your team the org layer they need — per-user API keys, daily quotas, audit log, and a built-in admin dashboard. Engine-agnostic: pick Ollama, vLLM, MLX, or llama.cpp-RPC as the backend.

How is Flock different from running an inference engine directly?

Engines like Ollama, vLLM, and MLX-LM are single-machine, single-user inference servers. Flock is the layer above them: it routes requests across a fleet of machines, exposes one consistent API (OpenAI + Anthropic) so your tools don't care which engine is underneath, adds per-user keys + quotas + audit log, orchestrates llama.cpp-RPC sharding so a single model can run split across multiple machines, and can transparently fall back to paid Claude/GPT when you choose.

Can Claude Code use a local model via Flock?

Yes. Set ANTHROPIC_BASE_URL=http://localhost:8080, ANTHROPIC_AUTH_TOKEN=your-flock-key, ANTHROPIC_MODEL=llama-3.2-1b (or any local catalog id). Claude Code now talks to your local model instead of paying for the API.

Does Flock work across multiple machines?

Yes. The leader machine runs flock up. Every other machine runs flock join. The leader's router automatically dispatches requests to whichever worker has the model loaded, and llama.cpp-RPC sharding lets one large model run split across many machines.

Flock — Self-Hosted Control Plane for LLMs · Multi-machine, team-ready, OpenAI + Anthropic compatible

The stack

Tools above. Flock in the middle. Engines + models below.

Every client speaks OpenAI or Anthropic. Every engine speaks its native API. Flock is the one URL + one key in between.

Layer 1 Clients · what your team uses

Claude Code

Cursor

Aider

Continue

Zed

Cline

Qwen-Code

Hermes

OpenClaw

OpenCode

Open WebUI

Open Notebook

Goose

Plandex

OpenHands

Codex CLI

OpenAI SDK

Anthropic SDK

curl / HTTP

one URL · one API key

Layer 2 Flock · the gateway you self-host

⬢ Flock ⬢

single Go binary · embedded UI · no telemetry

API surface

/v1/chat/completions
/v1/messages
/v1/embeddings

Routing

local-first → least-loaded worker · sharded · vendor fallback · latency-aware

Controls

per-user keys · daily quotas · audit · usage · Prometheus + OTLP

native APIs · your hardware

Layer 3a Engines · drive the models

Ollama

Mac · Linux · Windows

vLLM

NVIDIA · throughput

MLX-LM

Apple Silicon

llama.cpp

CPU · GGUF · RPC sharding

Layer 3b Models · 37 curated families + any HF or Ollama tag

Llama 3.x / 4

Qwen 2.5 / 3 / 3.6

Gemma 4

GPT-OSS 20B / 120B

DeepSeek R1 / V4

Mistral Nemo

Phi 4

Nemotron 3

Kimi K2.6

GLM 5.1

Step 3.7

LFM 2.5

Mellum 2

Nomic Embed

+ any HF GGUF

+ any Ollama tag

Mix and match across all three layers. Any client above + any engine below + any model on the bottom row — Flock is the only piece that needs to know about the rest.

Where Flock sits

One layer between your tools and the LLMs

Your tools point at Flock with one URL and one API key. Flock decides — per request — whether to serve from your hardware, fan out across machines, or proxy to a paid vendor. Switching the underlying model is a config change, not a re-wire.

           ┌──────────────────────────────────────────────────────────────┐
           │                       YOUR USE CASES                         │
           │             (the tools your team already uses)               │
           └──────────────────────────────────────────────────────────────┘
                  │           │          │             │            │
                  ▼           ▼          ▼             ▼            ▼
            ┌──────────┐ ┌──────────┐ ┌──────────┐ ┌──────────┐ ┌──────────┐
            │  Cursor  │ │  Claude  │ │  Aider   │ │  Custom  │ │   curl   │
            │          │ │   Code   │ │          │ │ Python   │ │  scripts │
            │          │ │          │ │          │ │   SDK    │ │          │
            └────┬─────┘ └────┬─────┘ └────┬─────┘ └────┬─────┘ └────┬─────┘
                 │  OpenAI    │ Anthropic  │  OpenAI    │  Either    │  HTTP
                 └────────────┴────────────┴────────────┴────────────┘
                                          │
                                          │   ONE URL · ONE API KEY
                                          ▼
      ┌──────────────────────────────────────────────────────────────────────┐
      │                  ⬢ ⬢ ⬢   FLOCK   ⬢ ⬢ ⬢                              │
      │                  (this is what we built)                             │
      │  ────────────────────────────────────────────────────────────────    │
      │  Gateway     OpenAI + Anthropic + /v1/rerank + /v1/audio/*           │
      │              keys: allowlist · RPM/TPM · $ budgets · TTL expiry      │
      │              guardrails · response cache · callbacks · admin UI      │
      │                                                                      │
      │  Router      Same model on N nodes  → load-balance + sticky session  │
      │              Flaky worker          → placement cooldown (skip)        │
      │              Different models      → route by placement              │
      │              Model bigger than node→ split via llama.cpp-RPC         │
      │              Latency-sensitive     → hedge to top-N workers          │
      │              Claude / GPT / 7 more → proxy to vendor                 │
      │              Engine error/timeout  → typed fallback chain + retries  │
      └─────────────────────────────┬────────────────────────────────────────┘
                                    │
              ┌─────────────────────┼─────────────────────┐
              ▼                     ▼                     ▼
       ┌─────────────┐       ┌─────────────┐       ┌─────────────┐
       │   Engines   │       │   Engines   │       │   Egress    │
       │  (any mix)  │       │  (any mix)  │       │   proxy     │
       │  • Ollama   │       │  • Ollama   │       │ Anthropic   │
       │  • vLLM     │       │  • vLLM     │       │ OpenAI      │
       │  • MLX-LM   │       │  • MLX-LM   │       │ Bedrock     │
       │  • llama.cpp│       │  • llama.cpp│       │ OpenRouter  │
       │  • whisper  │       │  • piper    │       │ Groq, Mist- │
       │  • piper    │       │  • whisper  │       │ ral, Cohere │
       └──────┬──────┘       └──────┬──────┘       │ Perplexity… │
              │                     │              └──────┬──────┘
              ▼                     ▼                     ▼
      ┌──────────────────────────────────────────────────────────────────────┐
      │                    UNDERLYING LLMs / WEIGHTS                         │
      │                                                                      │
      │   YOUR HARDWARE                              VENDOR APIs             │
      │   • Mac Studio · Mac Mini                    • Claude (Anthropic)    │
      │   • Linux + RTX GPU                          • GPT, o3, o4 (OpenAI)  │
      │                                                                      │
      │   37 curated catalog models (Qwen 3.6,        Each request routed    │
      │   gpt-oss, Llama 4, Gemma 4, DeepSeek V4,     to EITHER your hard-   │
      │   Kimi K2.6, Nemotron 3 Ultra, vision +       ware OR a vendor —     │
      │   embedding models)                           you pay vendors only   │
      │   + any HuggingFace or Ollama model.          when YOU chose to.     │
      └──────────────────────────────────────────────────────────────────────┘

Without Flock you'd lock into one provider, share one API key, trust the vendor with your prompts, and pay per token. With Flock you change qwen3.6-27b → claude-opus-4-7 in one place — the dev's editor doesn't know or care.

What Flock does, in plain English

Flock is the missing control plane for self-hosted LLMs — multi-machine routing, per-user keys, quotas, audit, and a built-in dashboard, all behind one API your existing tools already speak. Your team's AI tools talk to one endpoint; Flock decides whether to serve from your own machines (free + private), shard a giant model across several of them, or transparently fall back to real Claude / GPT (paid, logged) — your call.

🧠

Local models, one API

Pick the engine that fits your hardware — Ollama, vLLM, MLX-LM, or llama.cpp-RPC. Flock exposes whichever one you run through /v1/chat/completions (OpenAI) and /v1/messages (Anthropic) — so Claude Code works against your local Llama.

🔑

Team-ready out of the box

Per-user API keys with scopes, TTL expiry, model allowlists, RPM/TPM rate limits, $ budgets, audit log, Prometheus + OTLP, embedded admin UI. No nginx, no LiteLLM-plus-Python — just one binary.

🌐

Scales from 1 to N machines

Start on a laptop. Add more machines with one flock join command. Router load-balances replicas. For models too big for any single box, built-in llama.cpp-RPC sharding splits one model across many.

🔁

Switching models is one action

flock model add <id> for catalog models. flock model add hf:owner/repo for anything else on HuggingFace. No hand-written YAML, no manual GGUF downloads, no per-worker setup. Engine, quant, and shard count are picked for you from your hardware (M4-T16 → M4-T20 — shipped).

⚙️

CLI is the source of truth

Every action in the dashboard is one flock command underneath. Anything you can do with the UI, you can do with curl, cron, or an SSH session — same audit log, same validation, same outcome. No web-only knobs.

The honest one-line pitch

The only OSS tool that ships, in one Go binary, all of: OpenAI + Anthropic APIs + 9 vendor passthroughs + per-key allowlists / rate limits / $ budgets / TTL + multi-node routing (sticky / cooldown / hedging) + sharding + response cache + guardrails + webhook / Langfuse callbacks + embedded admin UI — designed for self-hosting on a Mac + Linux team fleet.

Interface

An admin UI that ships with the binary

Embedded via //go:embed. No separate frontend to deploy. Sign in by pasting the admin key Flock prints on first run. Every action also works from the CLI.

http://localhost:8080

Flock

orchestrate open LLMs · your hardware

key: sk-orc-xK9p…

Nodes

4

3 ready · 1 draining

Models

5

1 sharded

Recent requests

2,847

last 200

Tokens served

1.2M

saved ~$340 vs API

# Quick start: paste your admin key into your tools or use curl:
$ curl http://localhost:8080/v1/chat/completions \
    -H 'Authorization: Bearer sk-orc-xK9p…' \
    -d '{"model":"auto","messages":[{"role":"user","content":"hi"}]}'

Nodes

ID	Hostname	OS / Arch	RAM	Address	State	Last heartbeat
local	mbp-hadi	darwin/arm64	24 GB	127.0.0.1:8080	ready	just now	drain · remove
n_abc123	mac-mini-office	darwin/arm64	64 GB	192.168.1.42:8081	ready	3 sec ago	drain · remove
n_def456	gpu-tower	linux/amd64	128 GB	192.168.1.50:8081	ready	2 sec ago	drain · remove
n_ghi789	lab-mac	darwin/arm64	32 GB	192.168.1.60:8081	draining	12 sec ago	drain · remove

Installed models

ID	Status	Source	Size	Installed
llama-3.2-1b	ready	ollama:llama3.2:1b	1.3 GB	2 days ago	remove
qwen-coder-7b	ready	ollama:qwen2.5-coder:7b	4.7 GB	1 day ago	remove
qwen-coder-14b	ready	ollama:qwen2.5-coder:14b	9.0 GB	3 hours ago	remove
qwen3-30b	ready	vllm:Qwen/Qwen3-30B-A3B	19 GB	just now	remove
llama-3.3-70b-sharded	sharded	llamacpp:/var/lib/flock/…q4_k_m.gguf	42 GB	5 min ago	remove

Add a model from the catalog

Catalog entry

Sharded models auto-delegate to the shard orchestrator.

Sharded models

One model split across multiple nodes via llama.cpp RPC. The coordinator runs on the leader; rpc-server runs on each shard host.

Create new sharded model

Catalog model id

Shards

llama-3.3-70b-sharded

3 shards

Role	Node	Address	Status	Last seen
coordinator	local	127.0.0.1:9001	ready	just now
rpc	n_abc123	192.168.1.42:50052	ready	just now
rpc	n_def456	192.168.1.50:50052	ready	just now

Prereqs: The leader needs llama-server (brew install llama.cpp); each worker needs rpc-server on PATH; the catalog entry needs sharding.required: true and a local GGUF path.

API keys

Heads-up: "node" scope tokens are the shared secret between leader and worker. They are stored plaintext on the leader. Only issue on a trusted network (LAN or Tailscale). See the Settings tab for the full security model.

ID	Name	Scope	User	Daily quota	Status	Created
k_initial	initial-admin	admin	admin	∞	active	3 days ago	revoke
k_xY3vP	alice	user	alice	100,000	active	1 day ago	revoke
k_n9LqR	bob	user	bob	200,000	active	1 day ago	revoke
k_join01	mac-mini-join	node	—	∞	active	2 hours ago	revoke
k_old	eve-old	user	eve	100,000	revoked	last week

Create a new key

Name

Scope

Daily quota (0=∞)

New keys are shown once in a modal — save them immediately.

Recent requests

Time	User	Model	Protocol	Prompt	Completion	Latency	Outcome
14:32:08	alice	qwen-coder-14b	openai	412	128	1,832 ms	ok
14:31:55	bob	llama-3.2-1b	anthropic	28	45	312 ms	ok
14:31:42	alice	claude-opus-4-7	anthropic	2,840	1,205	8,541 ms	ok
14:31:21	bob	qwen-coder-14b	openai	198	82	1,021 ms	ok
14:30:58	alice	qwen-coder-32b	openai	0	0	0 ms	rate_limited
14:30:31	bob	llama-3.3-70b-sharded	anthropic	512	312	3,712 ms	ok

Audit log

Time	Actor	Action	Target
14:32:18	initial-admin	POST /admin/v1/shards/create	192.168.1.10:54231
14:31:02	initial-admin	POST /admin/v1/tokens	192.168.1.10:54201
14:28:44	initial-admin	POST /admin/v1/nodes/n_ghi789/drain	192.168.1.10:54180
14:15:21	eve	egress.anthropic	claude-opus-4-7
12:02:18	initial-admin	DELETE /admin/v1/tokens/k_old	192.168.1.10:53811
10:48:09	initial-admin	POST /admin/v1/models	192.168.1.10:53120

Mocked preview. The real UI ships embedded in the Flock binary.

Get started

Install & first chat in 3 minutes

Pick your platform — 4 commands each.

macOS (Apple Silicon)

# 1. install Flock (one Go binary, ~23 MB)
curl -fsSL https://raw.githubusercontent.com/hadihonarvar/flock/main/installer/install.sh | sh
export PATH="$HOME/.local/bin:$PATH"

# 2. install an engine — Ollama is the simplest default on Apple Silicon
#    alternatives: pip install mlx-lm  ·  llama.cpp's llama-server  ·  vLLM in Docker
brew install --cask ollama && open -a Ollama

# 3. start Flock with a small model (~1 GB, fast download)
FLOCK_DEFAULT_MODEL=llama-3.2-1b flock up

Linux

# 1. install Flock (one Go binary, ~23 MB)
curl -fsSL https://raw.githubusercontent.com/hadihonarvar/flock/main/installer/install.sh | sh
echo 'export PATH="$HOME/.local/bin:$PATH"' >> ~/.bashrc && source ~/.bashrc

# 2. install an engine — Ollama is the simplest default
#    alternatives: vLLM (NVIDIA)  ·  llama.cpp's llama-server  ·  MLX-LM (Apple Silicon only)
curl -fsSL https://ollama.com/install.sh | sh && sudo systemctl enable --now ollama

# 3. start Flock with a small model (~1 GB, fast download)
FLOCK_DEFAULT_MODEL=llama-3.2-1b flock up

💡 Not sure which engine to install? Run flock doctor after step 1 — it inspects your hardware and tells you the single command to run.

After it boots you'll see

✔ default model: llama-3.2-1b
✔ engine: ollama at http://127.0.0.1:11434

  Flock is ready.

  API:    http://localhost:8080/v1
  Health: http://localhost:8080/healthz

  Admin API key (shown once — store it now):
    sk-orc-xK9p…

Copy the admin key. You'll need it next.

Test it (pick one)

curl

curl http://localhost:8080/v1/chat/completions -H "Authorization: Bearer sk-orc-..." -d '{"model":"auto","messages":[…]}'

Web UI

http://localhost:8080 → paste the admin key

Claude Code

export ANTHROPIC_BASE_URL=http://localhost:8080
export ANTHROPIC_AUTH_TOKEN=sk-orc-...
export ANTHROPIC_MODEL=llama-3.2-1b
claude
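For scripted checks, the curl test in this section can also be expressed with Python's standard library. The sketch below only builds the request and does not send it. The helper name `build_chat_request` is illustrative, and the `Content-Type` header is an assumption, since the curl example does not set one explicitly.

```python
import json
import urllib.request


def build_chat_request(base_url: str, api_key: str, model: str, prompt: str) -> urllib.request.Request:
    """Build (but do not send) an OpenAI-style chat completion request."""
    body = {"model": model, "messages": [{"role": "user", "content": prompt}]}
    return urllib.request.Request(
        url=f"{base_url}/v1/chat/completions",
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",  # assumption: not shown in the curl example
        },
        method="POST",
    )


# Usage (needs a running Flock instance and a real key):
#   req = build_chat_request("http://localhost:8080", "sk-orc-...", "auto", "hi")
#   with urllib.request.urlopen(req) as resp:
#       print(json.load(resp))
```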

Team rollout

Wire up Claude Code, Cursor, your whole team

Flock works with every tool that speaks OpenAI or Anthropic. Three ways: flock connect <tool> on the CLI, the Connect tab in the dashboard, or copy a snippet manually. All three use the same code path.

Shipped

CLI

One command per tool

flock connect claude-code
flock connect cursor
flock connect hermes
flock connect open-webui    # ChatGPT-style web UI
flock connect goose         # Block's terminal agent
flock connect plandex       # agentic planner
flock connect openhands     # autonomous coding agent
flock connect codex-cli     # OpenAI's official CLI
flock connect opencode
flock connect --list

Prints config with your base URL + token already substituted. Read from $FLOCK_TOKEN or ~/.flock/admin.key.

Invite a teammate

flock invite hadi \
  --quota 100000

Creates user-scope token + share card with snippets for all 10 clients (paste-into-Slack markdown).

Shipped

Dashboard

Connect tab

Dropdown of 10 tools, pre-filled snippet, one-click Copy, Test-connection button that proves the gateway works end-to-end.

Playground tab

In-browser chat — pick a model, send a message, see streaming output. 10-second sanity check before wiring up Cursor.

Invite teammate (in Tokens tab)

Modal with name + quota + clients form → returns the share card with one-click Copy-as-markdown.

Always works

Manual

Claude Code

export ANTHROPIC_BASE_URL=...
export ANTHROPIC_AUTH_TOKEN=sk-orc-...
claude

Cursor

Settings → Models → Override OpenAI Base URL: .../v1

Full per-client snippets in README · Connecting clients.

The team rollout flow

1.

Admin sets up once

flock up on the leader. Add a worker or two if needed.

2.

Invite each teammate

flock invite <name> → paste the output card into Slack.

3.

Teammates paste & go

They copy the snippet for their tool of choice. Done — Claude Code / Cursor now run against your hardware.

Architecture

What runs where

One machine is enough for most teams. Add more when you want throughput, redundancy, or a model that doesn't fit on a single box.

Single machine

Solo dev or small team sharing one box.

Your computer (Mac or Linux)
┌─────────────────────────────────────────────────┐
│                                                 │
│       Cursor / Claude Code / curl / SDKs        │
│                        │                        │
│                        ▼                        │
│                   FLOCK :8080                   │
│          (gateway · auth · UI · audit)          │
│                        │                        │
│                        ▼                        │
│                  Ollama :11434                  │
│                (the actual LLM)                 │
│                                                 │
└─────────────────────────────────────────────────┘

Multiple machines

Leader + workers. Router decides per request.

     LEADER                           WORKER
┌──────────────┐                 ┌──────────────┐
│  flock up    │                 │  flock join  │
│  ─────────   │                 │  ─────────   │
│  Router      │  ───routes──▶   │  agent       │
│  + UI        │      :8081      │  + Ollama    │
│  + auth      │ ◀──heartbeat──  │              │
│  + Ollama    │    every 5s     │  loaded      │
│              │                 │  models      │
└──────────────┘                 └──────────────┘
                 LAN / Tailscale

Sharded model (one big model across many machines)

For models too large for any single box — e.g. Llama 70B Q4 across 2× Mac Mini via llama.cpp RPC. flock shard create <model> <N> does all of this automatically:

Client request for "llama-3.3-70b-sharded"
           │
           ▼
┌──────────────────────┐
│  LEADER (coordinator)│         ┌──────────────────────┐
│  llama-server        │   RPC   │  WORKER A            │
│    --rpc A:50052,    │  ────▶  │  rpc-server :50052   │  (layers 1-40)
│          B:50052     │  ◀────  │  (auto-launched      │
│                      │         │   by Flock)          │
│  serves OpenAI API   │         └──────────────────────┘
│  to clients          │         ┌──────────────────────┐
│                      │  ────▶  │  WORKER B            │
│                      │  ◀────  │  rpc-server :50052   │  (layers 41-80)
└──────────────────────┘         └──────────────────────┘
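The layer ranges in the diagram (1-40 and 41-80) correspond to an even split of an 80-layer model across two workers. A minimal sketch of that arithmetic follows; `split_layers` is a hypothetical helper, and the real split performed by llama.cpp may weight workers by available memory rather than dividing evenly.

```python
def split_layers(n_layers: int, n_workers: int) -> list[tuple[int, int]]:
    """Return inclusive 1-based (first, last) layer ranges, one per worker."""
    if n_workers < 1 or n_layers < n_workers:
        raise ValueError("need at least one layer per worker")
    base, extra = divmod(n_layers, n_workers)
    ranges, start = [], 1
    for i in range(n_workers):
        size = base + (1 if i < extra else 0)  # spread the remainder over the first workers
        ranges.append((start, start + size - 1))
        start += size
    return ranges


print(split_layers(80, 2))  # [(1, 40), (41, 80)]
```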

Features

What's in the box

Everything below is in the Go binary you download. No add-ons, no separate services.

🔌 OpenAI + Anthropic APIs

/v1/chat/completions + /v1/messages. SSE streaming. Tool calls. The whole shape both SDKs expect.

🖼️ Vision (image input)

Send image_url content blocks on the same chat endpoint. Works with Gemma 4, Llama 4 Scout, Qwen3-VL, Step-3.7 (Ollama path).

🧮 Embeddings

/v1/embeddings with nomic-embed-text or any Ollama embedding model. OpenAI-compatible response — drops into any RAG stack.
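As an illustration of what "drops into any RAG stack" means in practice, the sketch below ranks documents by cosine similarity using a response in the OpenAI embeddings shape (`data[i].embedding`, `data[i].index`). The toy 3-dimensional vectors and the `rank_documents` helper are assumptions for the example; real nomic-embed-text vectors have 768 dimensions.

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def rank_documents(query_vec: list[float], response: dict) -> list[int]:
    """Return document indices, most similar first, from an OpenAI-shaped response."""
    scored = [(cosine(query_vec, item["embedding"]), item["index"]) for item in response["data"]]
    return [idx for _, idx in sorted(scored, reverse=True)]


# Toy vectors standing in for a real /v1/embeddings response body.
fake_response = {
    "object": "list",
    "data": [
        {"object": "embedding", "index": 0, "embedding": [1.0, 0.0, 0.0]},
        {"object": "embedding", "index": 1, "embedding": [0.0, 1.0, 0.0]},
        {"object": "embedding", "index": 2, "embedding": [0.7, 0.7, 0.0]},
    ],
}
print(rank_documents([0.9, 0.1, 0.0], fake_response))  # [0, 2, 1]
```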

⚙️ Multi-backend engines

Ollama, vLLM, MLX-LM, llama.cpp (single-node + RPC sharding). Hot-swappable via config. Flock auto-launches llama-server when you pick the llama.cpp engine — no second process to manage.

🔁 Hybrid vendor fallback

Set ANTHROPIC_API_KEY / OPENAI_API_KEY. Requests for claude-* / gpt-* transparently proxy upstream, logged the same as local.
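The prefix-based proxying described above can be sketched as a small lookup. This is not Flock's router: the table, the `resolve_upstream` name, and the "installed models win" rule are assumptions. That last rule matters because a local catalog id such as gpt-oss-20b also matches the `gpt-*` pattern.

```python
from fnmatch import fnmatchcase

# Hypothetical routing table: glob pattern -> (vendor name, env var holding the team key).
VENDOR_ROUTES = [
    ("claude-*", "anthropic", "ANTHROPIC_API_KEY"),
    ("gpt-*", "openai", "OPENAI_API_KEY"),
]


def resolve_upstream(model: str, installed: set, env: dict) -> str:
    """Return 'local', a vendor name, or 'unroutable' for a requested model id."""
    if model in installed:  # assumption: locally installed models take precedence
        return "local"
    for pattern, vendor, key_var in VENDOR_ROUTES:
        if fnmatchcase(model, pattern) and env.get(key_var):
            return vendor
    return "unroutable"


env = {"ANTHROPIC_API_KEY": "sk-ant-..."}
print(resolve_upstream("claude-opus-4-7", {"gpt-oss-20b"}, env))  # anthropic
print(resolve_upstream("gpt-oss-20b", {"gpt-oss-20b"}, env))      # local
```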

☁️ Bedrock + Vertex egress

Set FLOCK_BEDROCK_REGION and anthropic.* model IDs are signed via real SigV4 (aws-sdk-go-v2). FLOCK_VERTEX_PROJECT wires the ADC auth probe for gemini-*. Body translation for the remaining model families is planned.

🔍 OTLP traces (end-to-end)

Set FLOCK_OTLP_ENDPOINT and get spans for every request: http.request → router.Chat → per-fallback-attempt → ollama.Chat with prompt + completion token counts. W3C traceparent propagation always on, even when export is off.
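For readers unfamiliar with W3C traceparent propagation, the header has the form `00-<32 hex trace id>-<16 hex span id>-<2 hex flags>`. The sketch below shows how a gateway hop could keep the trace id while minting a new span id; the function names are illustrative and say nothing about how Flock implements it.

```python
import re
import secrets

TRACEPARENT_RE = re.compile(r"^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")


def new_traceparent(sampled: bool = True) -> str:
    """Start a new trace: random 16-byte trace id and 8-byte span id."""
    flags = "01" if sampled else "00"
    return f"00-{secrets.token_hex(16)}-{secrets.token_hex(8)}-{flags}"


def child_traceparent(parent: str) -> str:
    """Keep the trace id and flags, mint a new span id for the next hop."""
    match = TRACEPARENT_RE.match(parent)
    if not match:
        return new_traceparent()  # malformed header: start a fresh trace
    trace_id, _parent_span, flags = match.groups()
    return f"00-{trace_id}-{secrets.token_hex(8)}-{flags}"
```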

♻️ Catalog fallback chains

Declare fallback: [next-id, …] in catalog YAML. Router walks the chain in order on engine error / 5xx / timeout / model-not-loaded. Transparent to clients; visible in audit log.
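Walking a fallback chain can be sketched as follows. The catalog is modelled as a plain dict with a `fallback` list per id, and `EngineError` stands in for the failure classes listed above. Whether Flock follows fallbacks transitively is not stated, so this sketch only uses the first model's own list.

```python
class EngineError(Exception):
    """Stand-in for engine error / 5xx / timeout / model-not-loaded."""


def chat_with_fallback(model_id, catalog, call_engine):
    """Try model_id, then each id in its fallback list, in order.

    Returns (served_by, response, attempts); attempts lists every id tried,
    the kind of detail an audit log could record.
    """
    chain = [model_id] + list(catalog.get(model_id, {}).get("fallback", []))
    attempts = []
    for candidate in chain:
        attempts.append(candidate)
        try:
            return candidate, call_engine(candidate), attempts
        except EngineError:
            continue  # move on to the next id in the chain
    raise EngineError(f"all candidates failed: {attempts}")


catalog = {"qwen3.6-27b": {"fallback": ["qwen3-14b", "llama-3.2-3b"]}}


def fake_engine(model):
    # Simulated engine: only the last model in the chain is healthy.
    if model != "llama-3.2-3b":
        raise EngineError("timeout")
    return f"reply from {model}"


print(chat_with_fallback("qwen3.6-27b", catalog, fake_engine))
```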

🛡️ Hardware-floor refusal

flock model add checks min_ram_gb / min_vram_gb from the catalog and refuses installs that would oversubscribe. --force overrides when you know better.
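A hardware-floor check of this kind reduces to a pair of comparisons. The sketch below reuses the `min_ram_gb` and `min_vram_gb` keys named above; the function name, the return shape, and the choice to require both floors are assumptions.

```python
def check_hardware_floor(entry: dict, ram_gb: float, vram_gb: float = 0.0, force: bool = False):
    """Return (ok, reason) for installing a catalog entry on this host."""
    if force:
        return True, "forced"  # mirrors --force: the operator takes responsibility
    need_ram = entry.get("min_ram_gb", 0)
    need_vram = entry.get("min_vram_gb", 0)
    if ram_gb < need_ram:
        return False, f"needs {need_ram} GB RAM, host has {ram_gb} GB"
    if vram_gb < need_vram:
        return False, f"needs {need_vram} GB VRAM, host has {vram_gb} GB"
    return True, "ok"


print(check_hardware_floor({"min_ram_gb": 24}, ram_gb=16))
# (False, 'needs 24 GB RAM, host has 16 GB')
```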

🧠 Memory-aware model switching

flock model load --swap releases the least-recently-used model (draining in-flight requests first), then loads the new one — your machine is never overcommitted. --pin protects a model; loaded models come back after restart; flock down frees engine RAM by default.
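The swap behaviour can be pictured as least-recently-used eviction under a RAM budget. The class below is a simplified model, not Flock's code: sizes are in GB, draining in-flight requests is reduced to a comment, and the victims are chosen before anything is evicted so a failed load changes nothing.

```python
from collections import OrderedDict


class LoadedModels:
    """Tracks loaded models in least-recently-used order within a RAM budget (GB)."""

    def __init__(self, budget_gb: float):
        self.budget_gb = budget_gb
        self.models = OrderedDict()  # model id -> size in GB, oldest first
        self.pinned = set()

    def touch(self, model_id: str):
        self.models.move_to_end(model_id)  # mark as most recently used

    def load(self, model_id: str, size_gb: float, swap: bool = False, pin: bool = False):
        """Load a model; with swap=True, evict unpinned LRU models until it fits."""
        used = sum(self.models.values())
        victims = []
        for candidate, cand_size in self.models.items():  # oldest first
            if used + size_gb <= self.budget_gb:
                break
            if swap and candidate not in self.pinned:
                victims.append(candidate)
                used -= cand_size
        if used + size_gb > self.budget_gb:
            raise RuntimeError(f"{model_id} does not fit; nothing was evicted")
        for victim in victims:  # a real implementation would drain in-flight requests first
            del self.models[victim]
        self.models[model_id] = size_gb
        if pin:
            self.pinned.add(model_id)
        return victims
```

With a 24 GB budget and two 9 GB models loaded, loading a 14 GB model with `swap=True` evicts only the least recently used one.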

🔑 Multi-tenant auth

Per-user API keys (sha256-hashed). Scopes: admin / user / node. Daily token quotas. Revocation immediate.
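Storing only a SHA-256 digest of each key, and checking a daily quota, can be sketched with the standard library. The helper names are hypothetical; the `sk-orc-` prefix and the "0 means unlimited" quota convention are taken from this page, while the key length and storage layout are assumptions.

```python
import hashlib
import hmac
import secrets


def new_api_key(prefix: str = "sk-orc-") -> tuple[str, str]:
    """Return (plaintext, sha256 hex digest). Only the digest would be stored."""
    plaintext = prefix + secrets.token_urlsafe(24)
    return plaintext, hashlib.sha256(plaintext.encode()).hexdigest()


def verify_api_key(presented: str, stored_digest: str) -> bool:
    """Hash the presented key and compare digests in constant time."""
    digest = hashlib.sha256(presented.encode()).hexdigest()
    return hmac.compare_digest(digest, stored_digest)


def within_daily_quota(used_today: int, requested: int, daily_quota: int) -> bool:
    """A quota of 0 means unlimited, matching the dashboard's key form."""
    return daily_quota == 0 or used_today + requested <= daily_quota
```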

📊 Usage + audit

Every request recorded (user, model, tokens, latency, outcome). Admin actions audited. flock usage and flock audit read it back.

📈 Prometheus + Grafana

/metrics exposes RPS, latency, tokens, model-loaded gauges. Three importable Grafana dashboards ship in dashboards/ — cluster overview, per-model, per-node.

🌐 Multi-node routing

flock join. Router picks local-first, then least-loaded worker. Heartbeats reconcile placements every 5s.
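The "local-first, then least-loaded" rule can be written down in a few lines. The `Node` fields and the `in_flight` load metric below are assumptions made for the sketch; the node ids and states mirror the mocked dashboard.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    id: str
    state: str = "ready"      # "ready" or "draining", as in the Nodes table
    in_flight: int = 0        # hypothetical load metric
    models: set = field(default_factory=set)


def pick_node(model: str, nodes: list, local_id: str = "local"):
    """Pick the local node if it can serve the model, else the least-loaded ready worker."""
    candidates = [n for n in nodes if n.state == "ready" and model in n.models]
    if not candidates:
        return None
    for node in candidates:
        if node.id == local_id:
            return node  # local-first
    return min(candidates, key=lambda n: n.in_flight)  # least-loaded worker


nodes = [
    Node("local", models={"llama-3.2-1b"}),
    Node("n_abc123", in_flight=3, models={"qwen-coder-7b"}),
    Node("n_def456", in_flight=1, models={"qwen-coder-7b", "llama-3.2-1b"}),
    Node("n_ghi789", state="draining", models={"qwen-coder-7b"}),
]
print(pick_node("qwen-coder-7b", nodes).id)  # n_def456
print(pick_node("llama-3.2-1b", nodes).id)   # local
```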

🪓 Auto-sharding

flock shard create launches rpc-server on workers + the coordinator llama-server on the leader. One command, full orchestration.

🖥️ Embedded web UI

Tailwind via CDN, vanilla JS, served from /. 7 tabs covering every admin action.

📦 One-line install

Single Go binary. SHA-256 verified. Detects Ollama. Tries user-dirs before sudo.

📖 CLI ↔ UI parity

Every admin action works both ways. Every command has --help with examples.

🆓 Apache-2.0

No open-core gotchas. Commercial use, modification, embedding all OK. Patent grant included.

Multi-machine

Add a second machine

Same install command on every machine. The first becomes the leader; the rest become workers. That's the whole protocol.

1

On the leader

Issue a one-time worker join token.

$ flock token create --node
✔ sk-orc-NodeJoin-AbCd1234…

2

On the new machine

Install Flock + Ollama the same way as before, then:

$ flock join 'http://leader.local:8080?token=sk-orc-NodeJoin-AbCd1234…'

3
Install a model on the worker

So it has something to serve.
```
$ flock model add qwen-coder-7b
```
✓
Back on the leader, verify
```
$ flock node ls
ID         HOSTNAME    OS/ARCH       STATE
local      mbp-hadi    darwin/arm64  ready
n_abc123   mac-mini    darwin/arm64  ready
```
From now on, any request for qwen-coder-7b automatically routes to the worker. Install the same model on multiple workers → automatic load balancing.

37 curated models

Qwen 3.6, gpt-oss, Llama 4, Gemma 4, DeepSeek V4, Kimi K2.6, Nemotron 3 Ultra…

Flock ships a curated catalog of 37 open-weight models — chat, code, reasoning, vision, and embeddings — spanning 1 GB edge MoEs to 550 B hybrid Mamba-Transformers and 1 T-parameter sharded frontier. Use any of them, install any other Ollama model, or wire up vLLM / MLX-LM for higher throughput.

Catalog id	What it's for	Size	Min RAM
Embedding — for RAG / retrieval
nomic-embed-text	768-dim, 8K ctx — drop-in for OpenAI `text-embedding-*`	0.27 GB	2 GB
Edge — laptop
llama-3.2-1b	smoke test, fastest	1.3 GB	2 GB
llama-3.2-3b	small fast chat	2.0 GB	4 GB
Small — 8–16 GB box
qwen-coder-7b	code completion + chat	4.7 GB	8 GB
deepseek-r1-8b	distilled reasoning ("thinking")	4.9 GB	12 GB
lfm2.5-8b-a1b ⭐	best on-device MoE (1 B active)	5.0 GB	8 GB
qwen3-8b	general chat, balanced	5.2 GB	12 GB
mellum2-12b	JetBrains coder MoE (2.5 B active, Apache-2.0)	7.0 GB	12 GB
mistral-nemo-12b	128 K context, multilingual	7.1 GB	12 GB
gemma4-12b	multimodal (text + image; audio declared, route pending)	7.6 GB	12 GB
qwen3-14b	more capable Qwen 3 chat	9.0 GB	16 GB
qwen-coder-14b	code + agent (proven)	9.0 GB	16 GB
phi-4-14b	strong reasoning per byte	9.1 GB	12 GB
Mid — 24–32 GB box
gpt-oss-20b ⭐	OpenAI open-weight; adjustable thinking	14 GB	16 GB
qwen3.6-27b ⭐	77 % SWE-bench; top consumer pick	17 GB	24 GB
gemma4-26b	MoE 4 B active; multimodal vision	18 GB	24 GB
qwen3-30b	MoE 3 B active; very fast	19 GB	24 GB
qwen3-coder-30b	MoE 3.3 B active code agent	19 GB	24 GB
qwen-coder-32b	dense code agent (older, proven)	20 GB	32 GB
Power user — single 80 GB GPU / 2-node sharded
llama-3.3-70b-sharded	frontier-ish, ≥ 2 nodes	43 GB	48 GB
gpt-oss-120b	≈ o4-mini reasoning, single H100	65 GB	80 GB
llama-4-scout	10 M context, multimodal (109 B MoE)	67 GB	80 GB
Frontier — multi-machine sharded
step-3.7-flash-sharded ⭐	198 B MoE / 11 B active VLM — Apache-2.0, ~400 tok/s	100 GB	128 GB
deepseek-v4-flash-sharded ⭐	284 B MoE / 13 B active — cost-efficient frontier	150 GB	160 GB
nemotron-3-ultra-sharded	550 B hybrid Mamba-MoE / 55 B active — 1 M context, MMLU 89.1	280 GB	320 GB
glm-5.1-sharded	754 B MoE / 40 B active — best agentic coder	400 GB	416 GB
kimi-k2.6-sharded	1 T MoE / 32 B active — # 1 open coding	500 GB	512 GB

# Install a catalog model
$ flock model add qwen3.6-27b

# Use it via the API
$ curl http://localhost:8080/v1/chat/completions \
    -H 'Authorization: Bearer sk-orc-...' \
    -d '{"model":"qwen3.6-27b","messages":[…]}'

# Or in Claude Code
$ export ANTHROPIC_MODEL=qwen3.6-27b
$ claude

# Use any Ollama model (no catalog entry needed)
$ ollama pull qwen3:0.6b
$ curl http://localhost:8080/v1/chat/completions \
    -H 'Authorization: Bearer sk-orc-...' \
    -d '{"model":"qwen3:0.6b","messages":[…]}'

# Or swap engines entirely
$ export FLOCK_ENGINE=vllm
$ export FLOCK_VLLM_ENDPOINT=http://gpu:8000
$ flock up

For the complete per-model walkthrough — picker table with code/chat/reasoning/vision ratings, install + use snippets for every client (curl / Cursor / Claude Code / SDKs) — see MODELS.md.

Start here

qwen3.6-27b ⭐

The single best default if you have ≥ 24 GB RAM. 77 % SWE-bench, Apache-2.0, strong code + agent. Works great with Claude Code and Cursor.

flock model add qwen3.6-27b

Tight on RAM

gpt-oss-20b

OpenAI's open-weight model, Apache-2.0, adjustable reasoning effort, fits a 16 GB box. ≈ o3-mini quality on reasoning benchmarks.

flock model add gpt-oss-20b

Frontier tier

deepseek-v4-flash-sharded

Frontier reasoning quality at consumer cost — 284 B MoE / 13 B active means fast inference. Splits cleanly across 2 nodes via llama.cpp RPC.

flock shard create \
  deepseek-v4-flash-sharded 2

Reversible

Try Flock without commitment

Pointing Claude Code at Flock is just three env vars. Going back to api.anthropic.com is unsetting them.

Switch to Flock

export ANTHROPIC_BASE_URL=http://localhost:8080
export ANTHROPIC_AUTH_TOKEN=sk-orc-...
export ANTHROPIC_MODEL=llama-3.2-1b

claude

Switch back to real Anthropic

flock disconnect claude-code

# prints the exact unset + export
# commands — same for every
# supported client.

Manually: unset ANTHROPIC_BASE_URL ANTHROPIC_AUTH_TOKEN ANTHROPIC_MODEL, then export ANTHROPIC_API_KEY=sk-ant-.... Or just open a fresh terminal — Claude Code defaults to api.anthropic.com when the BASE_URL var isn't set.

Hybrid (recommended)

# Keep Flock vars set,
# add real Anthropic key:
export ANTHROPIC_API_KEY=sk-ant-...

flock up  # restart

Now --model claude-opus-4-7 transparently proxies to real Anthropic. Local models stay free. Same claude, you pick per-prompt.

Compare

vs the alternatives

Flock sits at the intersection of three categories that mostly don't overlap.

Feature	Flock	Ollama	LiteLLM	exo	LocalAI
OpenAI-compatible API	✓	✓	✓	✓	✓
Anthropic-compatible API (Claude Code)	✓	✗	✓	✗	✗
Per-user API keys + quotas	✓	✗	✓	✗	✗
Audit log	✓	✗	✓	✗	✗
Multi-machine routing	✓	✗	✗	✓	✗
Auto-sharding (one model across N machines)	✓	✗	✗	✓	✗
Hybrid local + vendor fallback	✓	✗	✓	✗	✗
Embedded admin UI	✓	✗	✗	✗	partial
Single binary (no Python/Docker/k8s)	✓	✓	✗	✗	partial
Apache-2.0	✓	✓	✓	✓	✓

Honest framing: any single feature above is available in one of the alternatives. The combination — OpenAI + Anthropic + multi-tenant + multi-node + sharding + UI, all in one Go binary — is what Flock uniquely offers.

Docs

CLI reference

Every command supports --help with examples.

Lifecycle

flock up: Start local node (leader on first run)
flock down: Stop the local node
flock status: Cluster status summary
flock join <url>?token=…: Join as a worker
flock doctor: Diagnose common problems
flock update: In-place upgrade to latest release
flock version: Print version

Nodes

flock node ls: List nodes
flock node show <id>: Inspect a node
flock node drain <id>: Stop routing to it
flock node remove <id>: Forget a node

Models

flock model search [q]: Browse catalog
flock model add <id>: Install (auto-delegates if sharded)
flock model ls: List installed
flock model remove <id>: Uninstall

Sharded models

flock shard create <m> [N]: Orchestrate sharded model across N workers
flock shard ls: List shards
flock shard remove <m>: Tear down

Tokens / users

flock token create [name]: Issue API key (--admin, --node)
flock token ls: List API keys
flock token revoke <id>: Revoke a key

Observability + config

flock usage [--limit N]: Recent inference records
flock audit [--limit N]: Recent admin actions
flock config show: Effective config (secrets redacted)

Connect your tools

Cursor / Continue / Aider

Set OpenAI base URL:

http://localhost:8080/v1

API key: sk-orc-…

Claude Code

ANTHROPIC_BASE_URL=http://localhost:8080
ANTHROPIC_AUTH_TOKEN=sk-orc-…
ANTHROPIC_MODEL=llama-3.2-1b

OpenAI / Anthropic SDK

OpenAI(
  base_url="http://localhost:8080/v1",
  api_key="sk-orc-…"
)

QUICKSTART.md →

3-min new-user landing page with diagrams

README.md →

Full reference: API, config, troubleshooting

ARCHITECTURE.md →

Contributor deep-dive into internals

Security

Security model

Flock assumes a trusted network (LAN or Tailscale) for cluster traffic. Honest about what's protected and what isn't.

What's strongly protected

User API keys stored as sha256 hashes — plaintext shown only at creation
Worker HTTP servers bind to the mesh address (LAN / tailnet IP), never 0.0.0.0
Web UI auth by pasted admin key (in browser localStorage)
Quotas + audit log limit damage from a leaked key
Vendor fallback uses team-scoped vendor keys, never the user's

What requires LAN trust

Worker tokens are stored plaintext in nodes.worker_token on the leader's SQLite
Anyone with read access to the leader's DB can impersonate a worker
HMAC-SHA256 mutual auth between leader and workers is shipped — signatures travel instead of tokens; set FLOCK_REJECT_BEARER=1 on workers to require it
For hostile networks run the cluster behind Tailscale or a zero-trust overlay
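To show why "signatures travel instead of tokens" helps, here is a generic HMAC-SHA256 request-signing sketch. It is not Flock's wire format: the signed fields (method, path, timestamp, body hash), the 30-second skew window, and both function names are assumptions chosen for illustration.

```python
import hashlib
import hmac
import time


def sign_request(secret: bytes, method: str, path: str, body: bytes, timestamp: int) -> str:
    """Return a hex HMAC-SHA256 over method, path, timestamp, and a body hash."""
    body_hash = hashlib.sha256(body).hexdigest()
    message = f"{method}\n{path}\n{timestamp}\n{body_hash}".encode()
    return hmac.new(secret, message, hashlib.sha256).hexdigest()


def verify_request(secret, method, path, body, timestamp, signature, now=None, max_skew=30) -> bool:
    """Recompute the signature and compare in constant time; reject stale timestamps."""
    now = int(time.time()) if now is None else now
    if abs(now - timestamp) > max_skew:  # limits replay of captured requests
        return False
    expected = sign_request(secret, method, path, body, timestamp)
    return hmac.compare_digest(expected, signature)
```

The shared secret never appears on the wire; an observer sees only a signature that is valid for one specific request and a short time window.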

$0

Free. Open source. No telemetry.

Flock is released under the Apache License 2.0. You can use it commercially, modify it, embed it in your own products, redistribute it. No "open core" gotchas. No "free for personal use only" clauses. No SaaS plan to upgrade to.

$0

No license fee. Ever.

The binary you download from GitHub Releases is the same binary a Fortune-500 would use. There is no Pro / Enterprise / Cloud tier hiding the features you actually want.

⚖️

Apache-2.0 — actually permissive

✓ Commercial use · ✓ Modification · ✓ Distribution · ✓ Patent grant included · ✓ Private use. The only requirements: keep the license + notice, state significant changes you made.

🔒

Your data stays yours

No phone-home. No analytics. No "anonymized" telemetry that's actually fingerprinted. The binary doesn't open outbound connections except to engines you configure (Ollama / vendor APIs you opt in to).

What it could save you

A team of 10 devs running modern AI tools heavily can burn $200–500 per dev per month in API tokens. That's $24–60k/year, scaling linearly with usage. Flock moves the 80% of "easy" calls to your own hardware, at no per-token cost, and keeps the optional escape hatch to real Claude / GPT for the 20% that actually need it.

Download free · View LICENSE

Rough monthly cost (10 devs · heavy use)

Claude API (Sonnet)	~$3,000
OpenAI API (gpt-4o)	~$2,500
OpenRouter / vendor proxy	$2,500 + markup
Flock + your own hardware	~$50 (electricity)

Hardware (~$16k for the team-of-10 build) pays back in ~5 months. Stack works for years after.
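
The arithmetic behind these figures is easy to rerun with your own numbers. The sketch below uses only the inputs quoted on this page (10 devs, $200–500 per dev per month, ~$3,000/month API spend, ~$50/month electricity, ~$16k hardware); the function names and the `moved_share` parameter are illustrative.

```python
def annual_api_spend(devs: int, per_dev_month: float) -> float:
    """Yearly API bill for a team at a given per-developer monthly spend."""
    return devs * per_dev_month * 12


def payback_months(hardware_cost: float, api_monthly: float,
                   local_monthly: float, moved_share: float = 1.0) -> float:
    """Months until hardware is paid off by the API spend it replaces."""
    monthly_savings = api_monthly * moved_share - local_monthly
    return hardware_cost / monthly_savings


low = annual_api_spend(10, 200)    # 24000
high = annual_api_spend(10, 500)   # 60000
full = payback_months(16_000, 3_000, 50)                       # about 5.4 months
partial = payback_months(16_000, 3_000, 50, moved_share=0.8)   # about 6.8 months
```

The ~5 month figure corresponds to replacing the whole ~$3,000 bill. If only 80% of calls move to local hardware, the same inputs give closer to 7 months.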

Roadmap

What's shipped & what's next

✓ Shipped

Core gateway
• OpenAI + Anthropic API surface, streaming, tool use, vision (image_url), embeddings (/v1/embeddings)
• /v1/rerank (Cohere shape, llama-server passthrough) + /v1/audio/transcriptions + /v1/audio/speech shells (whisper / piper endpoints)
• Ollama / vLLM / MLX / llama.cpp drivers (single-node + RPC)
• Multi-node routing with heartbeat + placements (LAN mesh)
• Sharding auto-orchestration + shard crash auto-restart + auto-distribution of GGUF weights (sha256-verified)
• 37-model catalog + license metadata + flock model add hf:/ollama:/file: + --from <my.yaml> for non-catalog installs
Multi-tenancy & auth
• Per-user API keys, scopes (admin / user / node), TTL expiry (--ttl 7d, renew, expire)
• Per-key model allowlist (literal + claude-* glob) — 403 model_not_allowed + audit row
• Per-key RPM + TPM rate limits (leaky-bucket) with reconciliation on actual usage
• Daily token quotas + dollar budgets (day/week/month windows; multiple budgets compose with AND)
• Per-call $ cost tracking (vendor pricing table + catalog override; cost_usd snapshotted on every usage row)
• Standard X-RateLimit-* response headers + X-Flock-Request-Id correlation
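
As a rough illustration of the RPM / TPM bullet, here is a minimal leaky-bucket meter with reconciliation. It is a sketch of the general technique, not Flock's limiter: the class, its fields, and the reserve-then-reconcile flow are assumptions.

```python
class LeakyBucket:
    """Leaky-bucket meter: the level drains at a fixed rate, overflow is rejected."""

    def __init__(self, rate_per_min: float, capacity: float, now: float = 0.0):
        self.rate_per_s = rate_per_min / 60.0
        self.capacity = capacity
        self.level = 0.0
        self.last = now

    def _drain(self, now: float) -> None:
        elapsed = max(0.0, now - self.last)
        self.level = max(0.0, self.level - elapsed * self.rate_per_s)
        self.last = now

    def try_reserve(self, amount: float, now: float) -> bool:
        """Admit the request if its estimated cost still fits in the bucket."""
        self._drain(now)
        if self.level + amount > self.capacity:
            return False
        self.level += amount
        return True

    def reconcile(self, reserved: float, actual: float) -> None:
        """Replace the up-front estimate with the real token count."""
        self.level = max(0.0, self.level + (actual - reserved))
```

For a TPM limit, the gateway would reserve an estimate (say, prompt tokens plus `max_tokens`) before dispatch and call `reconcile` once the real usage is known, so a short answer hands the unused budget back.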
Router intelligence
• Failure-based catalog fallback + typed chains (fallback_on_context_length, fallback_on_content_policy) with error classification
• Per-request overrides (flock.fallbacks, num_retries, retry_backoff_ms, hedge) via body or X-Flock-* headers
• Sticky sessions (KV-cache locality on multi-turn chats) + placement cooldown (circuit breaker for flaky workers)
• Request hedging — fire to top-N least-loaded workers, return whichever responds first
• Latency-aware fallback (p95 trigger)
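
Request hedging is simple to picture in code. The sketch below shows the general pattern with asyncio; the worker dicts, the `load` field, and the `send` callback are hypothetical stand-ins, not Flock's router types.

```python
import asyncio


async def hedged_call(workers, send, top_n=2):
    """Send the same request to the top_n least-loaded workers; first success wins."""
    chosen = sorted(workers, key=lambda w: w["load"])[:top_n]
    tasks = [asyncio.create_task(send(w)) for w in chosen]
    try:
        for finished in asyncio.as_completed(tasks):
            try:
                return await finished
            except Exception:
                continue  # that worker failed; wait for the next one
        raise RuntimeError("all hedged requests failed")
    finally:
        for task in tasks:
            task.cancel()  # stop the losers so they do not keep burning compute
```

The trade-off is deliberate: hedging spends extra compute on duplicate requests to cut tail latency, which is why it makes sense as a per-request override rather than a default.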
Hybrid local + cloud
• Anthropic + OpenAI passthrough (claude-*, gpt-*)
• 7 new vendor passthroughs: openrouter/, groq/, together/, fireworks/, cohere/, mistral/, perplexity/ — slash-prefix stripped before forwarding
• Bedrock SigV4 (anthropic.* family) + Vertex ADC probe
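
The model-id routing described in this group can be sketched in a few lines. The prefixes come from the list above; the function name, the `local_catalog` parameter, and the rule that installed local ids win over prefix matches are assumptions for illustration.

```python
VENDOR_PREFIXES = frozenset(
    {"openrouter", "groq", "together", "fireworks", "cohere", "mistral", "perplexity"}
)


def route_model(model: str, local_catalog=frozenset()) -> tuple[str, str]:
    """Return (target, model id to forward). Vendor slash-prefixes are stripped."""
    if model in local_catalog:  # assumption: an installed local id always wins
        return "local", model
    vendor, sep, rest = model.partition("/")
    if sep and rest and vendor in VENDOR_PREFIXES:
        return vendor, rest  # e.g. "groq/x" is forwarded to groq as "x"
    if model.startswith("claude-"):
        return "anthropic", model
    if model.startswith("gpt-"):
        return "openai", model
    return "local", model
```

Only the first path segment is treated as the vendor, so an id that contains further slashes after the prefix is forwarded intact.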
Observability & policy
• Prometheus metrics, OTLP traces end-to-end across all four engine drivers, reference Grafana dashboards
• Webhook + Langfuse callbacks — usage / audit fan-out with HMAC-signed payloads, bounded queues
• Guardrails framework — pre-call webhook hook that can block / rewrite / flag (PII redaction, prompt-injection checks)
• Response cache for embeddings (memory or SQLite-backed; canonical key; Cache-Control opt-out)
• Time-bucketed usage breakdown (/admin/v1/usage/breakdown?group_by=user,model&bucket=day)
• HMAC mutual auth for worker token (no plaintext on wire)
• Typed engine_unreachable + guardrail_blocked + budget_exceeded errors with actionable hints
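
A "canonical key" for the embeddings cache usually means that two requests with the same meaning hash to the same value regardless of JSON key order. The sketch below shows one way to do that; the field names and the choice of `no-store` / `no-cache` as the opt-out directives are assumptions, not Flock's documented behavior.

```python
import hashlib
import json
from typing import Any, Mapping, Optional


def embedding_cache_key(model: str, inputs: Any,
                        params: Optional[Mapping[str, Any]] = None) -> str:
    """Hash a canonical JSON form so key order and whitespace do not matter."""
    canonical = json.dumps(
        {"model": model, "input": inputs, "params": dict(params or {})},
        sort_keys=True,
        separators=(",", ":"),
        ensure_ascii=False,
    )
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def cache_allowed(headers: Mapping[str, str]) -> bool:
    """Assumed opt-out: Cache-Control: no-store or no-cache skips the cache."""
    value = ""
    for name, header_value in headers.items():
        if name.lower() == "cache-control":
            value = header_value.lower()
    directives = {d.strip() for d in value.split(",")}
    return not ({"no-store", "no-cache"} & directives)
```

Embeddings are a natural first target for caching because the same input and model always produce the same vector, so a hit is safe to replay verbatim.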
DX
• Embedded web UI: live SSE event stream, modal-based confirm/prompt, audit filter, per-row "models / rates / budgets / expiry" editors, $ today KPI, breakdown panel
• 19-client flock connect roster (golden-tested), interactive picker, shell completion, --json / --summary / --dry-run everywhere
• Install one-liner + signed binaries + .deb / .rpm packages + 2-node smoke + nightly single-node e2e

→ Next

• Semantic cache — chromem-go embedded vector store, per-namespace threshold
• OIDC login for the web UI (Google / GitHub / Okta)
• Chat completion caching with streaming replay (today: embeddings only)
• Post-call guardrails on streamed responses (today: pre-call only)
• Vertex body translation (OpenAI / Anthropic → generateContent) — ADC probe wired, translation queued
• Bedrock streaming + non-Anthropic body shapes (amazon, meta, mistral)
• Whisper / Piper engine drivers (today: endpoint proxies; auto-launch + catalog entries queued)
• Router fallback callback events (today: usage + audit sinks ship; fallback is in the audit log only)
• Tailscale (tsnet) mesh for NAT traversal + mTLS
• LoRA adapter loading + live model migration
• Postgres backend for HA control plane
• AMD ROCm path · NAS .spk packages (Synology DSM)
