Self-hosted AI
for your team.
One endpoint. Your hardware.
Flock is the self-hosted control plane for LLMs. One Go binary turns your Macs and Linux boxes into a private inference cluster — multi-machine routing, per-user keys, daily quotas, full audit log, and a built-in admin dashboard, behind one endpoint that speaks both the OpenAI and Anthropic APIs. Engine-agnostic: bring Ollama, vLLM, MLX, or llama.cpp-RPC. Fall back to paid Claude/GPT only when you choose.
curl -fsSL https://raw.githubusercontent.com/hadihonarvar/flock/main/installer/install.sh | sh flock doctor # tells you the one command to install an engine, if you don't have one flock up # starts your private LLM gateway, prints your admin API key
Tools above. Flock in the middle. Engines + models below.
Every client speaks OpenAI or Anthropic. Every engine speaks its native API. Flock is the one URL + one key in between.
/v1/messages
/v1/embeddings
Mix and match across all three layers. Any client above + any engine below + any model on the bottom row — Flock is the only piece that needs to know about the rest.
One layer between your tools and the LLMs
Your tools point at Flock with one URL and one API key. Flock decides — per request — whether to serve from your hardware, fan out across machines, or proxy to a paid vendor. Switching the underlying model is a config change, not a re-wire.
┌──────────────────────────────────────────────────────────────┐
│ YOUR USE CASES │
│ (the tools your team already uses) │
└──────────────────────────────────────────────────────────────┘
│ │ │ │ │
▼ ▼ ▼ ▼ ▼
┌──────────┐ ┌──────────┐ ┌──────────┐ ┌──────────┐ ┌──────────┐
│ Cursor │ │ Claude │ │ Aider │ │ Custom │ │ curl │
│ │ │ Code │ │ │ │ Python │ │ scripts │
│ │ │ │ │ │ │ SDK │ │ │
└────┬─────┘ └────┬─────┘ └────┬─────┘ └────┬─────┘ └────┬─────┘
│ OpenAI │ Anthropic │ OpenAI │ Either │ HTTP
└────────────┴────────────┴────────────┴────────────┘
│
│ ONE URL · ONE API KEY
▼
┌──────────────────────────────────────────────────────────────────────┐
│ ⬢ ⬢ ⬢ FLOCK ⬢ ⬢ ⬢ │
│ (this is what we built) │
│ ──────────────────────────────────────────────────────────────── │
│ Gateway OpenAI + Anthropic + /v1/rerank + /v1/audio/* │
│ keys: allowlist · RPM/TPM · $ budgets · TTL expiry │
│ guardrails · response cache · callbacks · admin UI │
│ │
│ Router Same model on N nodes → load-balance + sticky session │
│ Flaky worker → placement cooldown (skip) │
│ Different models → route by placement │
│ Model bigger than node→ split via llama.cpp-RPC │
│ Latency-sensitive → hedge to top-N workers │
│ Claude / GPT / 7 more → proxy to vendor │
│ Engine error/timeout → typed fallback chain + retries │
└─────────────────────────────┬────────────────────────────────────────┘
│
┌─────────────────────┼─────────────────────┐
▼ ▼ ▼
┌─────────────┐ ┌─────────────┐ ┌─────────────┐
│ Engines │ │ Engines │ │ Egress │
│ (any mix) │ │ (any mix) │ │ proxy │
│ • Ollama │ │ • Ollama │ │ Anthropic │
│ • vLLM │ │ • vLLM │ │ OpenAI │
│ • MLX-LM │ │ • MLX-LM │ │ Bedrock │
│ • llama.cpp│ │ • llama.cpp│ │ OpenRouter │
│ • whisper │ │ • piper │ │ Groq, Mist- │
│ • piper │ │ • whisper │ │ ral, Cohere │
└──────┬──────┘ └──────┬──────┘ │ Perplexity… │
│ │ └──────┬──────┘
▼ ▼ ▼
┌──────────────────────────────────────────────────────────────────────┐
│ UNDERLYING LLMs / WEIGHTS │
│ │
│ YOUR HARDWARE VENDOR APIs │
│ • Mac Studio · Mac Mini • Claude (Anthropic) │
│ • Linux + RTX GPU • GPT, o3, o4 (OpenAI) │
│ │
│ 37 curated catalog models (Qwen 3.6, Each request routed │
│ gpt-oss, Llama 4, Gemma 4, DeepSeek V4, to EITHER your hard- │
│ Kimi K2.6, Nemotron 3 Ultra, vision + ware OR a vendor — │
│ embedding models) you pay vendors only │
│ + any HuggingFace or Ollama model. when YOU chose to. │
└──────────────────────────────────────────────────────────────────────┘
Without Flock you'd lock into one provider, share one API key, trust the vendor with your prompts, and pay per token. With Flock you change qwen3.6-27b → claude-opus-4-7 in one place — the dev's editor doesn't know or care.
Plus any model on HuggingFace via flock model add hf:owner/repo
or any Ollama tag via flock model add ollama:<tag>.
Vision (image_url content blocks on /v1/chat/completions) and embeddings (/v1/embeddings) ship via the Ollama engine path. Speech endpoints are on the roadmap (see
ROADMAP.md).
What Flock does, in plain English
Flock is the missing control plane for self-hosted LLMs — multi-machine routing, per-user keys, quotas, audit, and a built-in dashboard, all behind one API your existing tools already speak. Your team's AI tools talk to one endpoint; Flock decides whether to serve from your own machines (free + private), shard a giant model across several of them, or transparently fall back to real Claude / GPT (paid, logged) — your call.
Local models, one API
Pick the engine that fits your hardware — Ollama, vLLM, MLX-LM, or llama.cpp-RPC. Flock exposes whichever one you run through /v1/chat/completions (OpenAI) and /v1/messages (Anthropic) — so Claude Code works against your local Llama.
Team-ready out of the box
Per-user API keys with scopes, TTL expiry, model allowlists, RPM/TPM rate limits, $ budgets, audit log, Prometheus + OTLP, embedded admin UI. No nginx, no LiteLLM-plus-Python — just one binary.
Scales from 1 to N machines
Start on a laptop. Add more machines with one flock join command. Router load-balances replicas. For models too big for any single box, built-in llama.cpp-RPC sharding splits one model across many.
Switching models is one action
flock model add <id> for catalog models. flock model add hf:owner/repo for anything else on HuggingFace. No hand-written YAML, no manual GGUF downloads, no per-worker setup. Engine, quant, and shard count are picked for you from your hardware (M4-T16 → M4-T20 — shipped).
CLI is the source of truth
Every action in the dashboard is one flock command underneath. Anything you can do with the UI, you can do with curl, cron, or an SSH session — same audit log, same validation, same outcome. No web-only knobs.
The honest one-line pitch
The only OSS tool that ships, in one Go binary, all of: OpenAI + Anthropic APIs + 9 vendor passthroughs + per-key allowlists / rate limits / $ budgets / TTL + multi-node routing (sticky / cooldown / hedging) + sharding + response cache + guardrails + webhook / Langfuse callbacks + embedded admin UI — designed for self-hosting on a Mac + Linux team fleet.
An admin UI that ships with the binary
Embedded via //go:embed. No separate frontend to deploy.
Sign in by pasting the admin key Flock prints on first run. Every action also works from the CLI.
Mocked preview — click any tab above to navigate. The real UI ships embedded in the Flock binary. View the real source on GitHub →
Install & first chat in 3 minutes
Pick your platform — 4 commands each.
# 1. install Flock (one Go binary, ~23 MB) curl -fsSL https://raw.githubusercontent.com/hadihonarvar/flock/main/installer/install.sh | sh export PATH="$HOME/.local/bin:$PATH" # 2. install an engine — Ollama is the simplest default on Apple Silicon # alternatives: pip install mlx-lm · llama.cpp's llama-server · vLLM in Docker brew install --cask ollama && open -a Ollama # 3. start Flock with a small model (~1 GB, fast download) FLOCK_DEFAULT_MODEL=llama-3.2-1b flock up
# 1. install Flock (one Go binary, ~23 MB) curl -fsSL https://raw.githubusercontent.com/hadihonarvar/flock/main/installer/install.sh | sh echo 'export PATH="$HOME/.local/bin:$PATH"' >> ~/.bashrc && source ~/.bashrc # 2. install an engine — Ollama is the simplest default # alternatives: vLLM (NVIDIA) · llama.cpp's llama-server · MLX-LM (Apple Silicon only) curl -fsSL https://ollama.com/install.sh | sh && sudo systemctl enable --now ollama # 3. start Flock with a small model (~1 GB, fast download) FLOCK_DEFAULT_MODEL=llama-3.2-1b flock up
💡 Not sure which engine to install? Run flock doctor after step 1 — it inspects your hardware and tells you the single command to run.
After it boots you'll see
✔ default model: llama-3.2-1b
✔ engine: ollama at http://127.0.0.1:11434
Flock is ready.
API: http://localhost:8080/v1
Health: http://localhost:8080/healthz
Admin API key (shown once — store it now):
sk-orc-xK9p…
Copy the admin key. You'll need it next.
Test it (pick one)
curl :8080/v1/chat/completions -H "Authorization: Bearer sk-orc-..." -d '{"model":"auto","messages":[…]}'
http://localhost:8080
→ paste the admin key
export ANTHROPIC_BASE_URL=http://localhost:8080
export ANTHROPIC_AUTH_TOKEN=sk-orc-...
export ANTHROPIC_MODEL=llama-3.2-1b
claude
Wire up Claude Code, Cursor, your whole team
Flock works with every tool that speaks OpenAI or Anthropic. Three ways: flock connect <tool> on the CLI, the Connect tab in the dashboard, or copy a snippet manually. All three use the same code path.
CLI
flock connect claude-code flock connect cursor flock connect hermes flock connect open-webui # ChatGPT-style web UI flock connect goose # Block's terminal agent flock connect plandex # agentic planner flock connect openhands # autonomous coding agent flock connect codex-cli # OpenAI's official CLI flock connect opencode flock connect --list
Prints config with your base URL + token already substituted. Read from $FLOCK_TOKEN or ~/.flock/admin.key.
flock invite hadi \ --quota 100000
Creates user-scope token + share card with snippets for all 10 clients (paste-into-Slack markdown).
Dashboard
Dropdown of 10 tools, pre-filled snippet, one-click Copy, Test-connection button that proves the gateway works end-to-end.
In-browser chat — pick a model, send a message, see streaming output. 10-second sanity check before wiring up Cursor.
Modal with name + quota + clients form → returns the share card with one-click Copy-as-markdown.
Manual
export ANTHROPIC_BASE_URL=... export ANTHROPIC_AUTH_TOKEN=sk-orc-... claude
Settings → Models → Override OpenAI Base URL: .../v1
The team rollout flow
flock up on the leader. Add a worker or two if needed.flock invite <name> → paste the output card into Slack.What runs where
One machine is enough for most teams. Add more when you want throughput, redundancy, or a model that doesn't fit on a single box.
Single machine
Solo dev or small team sharing one box.
Multiple machines
Leader + workers. Router decides per request.
Sharded model (one big model across many machines)
For models too large for any single box — e.g. Llama 70B Q4 across 2× Mac Mini via llama.cpp RPC. flock shard create <model> <N> does all of this automatically:
What's in the box
Everything below is in the Go binary you download. No add-ons, no separate services.
/v1/chat/completions + /v1/messages. SSE streaming. Tool calls. The whole shape both SDKs expect.
Send image_url content blocks on the same chat endpoint. Works with Gemma 4, Llama 4 Scout, Qwen3-VL, Step-3.7 (Ollama path).
/v1/embeddings with nomic-embed-text or any Ollama embedding model. OpenAI-compatible response — drops into any RAG stack.
Ollama, vLLM, MLX-LM, llama.cpp (single-node + RPC sharding). Hot-swappable via config. Flock auto-launches llama-server when you pick the llama.cpp engine — no second process to manage.
Set ANTHROPIC_API_KEY / OPENAI_API_KEY. Requests for claude-* / gpt-* transparently proxy upstream, logged the same as local.
Set FLOCK_BEDROCK_REGION and anthropic.* model IDs are signed via real SigV4 (aws-sdk-go-v2). FLOCK_VERTEX_PROJECT wires the ADC auth probe for gemini-*. Body translation for the remaining model families is planned.
Set FLOCK_OTLP_ENDPOINT and get spans for every request: http.request → router.Chat → per-fallback-attempt → ollama.Chat with prompt + completion token counts. W3C traceparent propagation always on, even when export is off.
Declare fallback: [next-id, …] in catalog YAML. Router walks the chain in order on engine error / 5xx / timeout / model-not-loaded. Transparent to clients; visible in audit log.
flock model add checks min_ram_gb / min_vram_gb from the catalog and refuses installs that would oversubscribe. --force overrides when you know better.
flock model load --swap releases the least-recently-used model (draining in-flight requests first), then loads the new one — your machine is never overcommitted. --pin protects a model; loaded models come back after restart; flock down frees engine RAM by default.
Per-user API keys (sha256-hashed). Scopes: admin / user / node. Daily token quotas. Revocation immediate.
Every request recorded (user, model, tokens, latency, outcome). Admin actions audited. flock usage and flock audit read it back.
/metrics exposes RPS, latency, tokens, model-loaded gauges. Three importable Grafana dashboards ship in dashboards/ — cluster overview, per-model, per-node.
flock join. Router picks local-first, then least-loaded worker. Heartbeats reconcile placements every 5s.
flock shard create launches rpc-server on workers + the coordinator llama-server on the leader. One command, full orchestration.
Tailwind via CDN, vanilla JS, served from /. 7 tabs covering every admin action.
Single Go binary. SHA-256 verified. Detects Ollama. Tries user-dirs before sudo.
Every admin action works both ways. Every command has --help with examples.
No open-core gotchas. Commercial use, modification, embedding all OK. Patent grant included.
Add a second machine
Same install command on every machine. The first becomes the leader; the rest become workers. That's the whole protocol.
-
1
On the leader
Issue a one-time worker join token.
$ flock token create --node ✔ sk-orc-NodeJoin-AbCd1234…
-
2
On the new machine
Install Flock + Ollama the same way as before, then:
$ flock join http://leader.local:8080?token=sk-orc-NodeJoin-AbCd1234…
-
3
Install a model on the worker
So it has something to serve.
$ flock model add qwen-coder-7b
-
✓
Back on the leader, verify
$ flock node ls ID HOSTNAME OS/ARCH STATE local mbp-hadi darwin/arm64 ready n_abc123 mac-mini darwin/arm64 ready
From now on, any request for
qwen-coder-7bautomatically routes to the worker. Install the same model on multiple workers → automatic load balancing.
Qwen 3.6, gpt-oss, Llama 4, Gemma 4, DeepSeek V4, Kimi K2.6, Nemotron 3 Ultra…
Flock ships a curated catalog of 37 open-weight models — chat, code, reasoning, vision, and embeddings — spanning 1 GB edge MoEs to 550 B hybrid Mamba-Transformers and 1 T-parameter sharded frontier. Use any of them, install any other Ollama model, or wire up vLLM / MLX-LM for higher throughput.
| Catalog id | What it's for | Size | Min RAM |
|---|---|---|---|
| Embedding — for RAG / retrieval | |||
| nomic-embed-text | 768-dim, 8K ctx — drop-in for OpenAI text-embedding-* | 0.27 GB | 2 GB |
| Edge — laptop | |||
| llama-3.2-1b | smoke test, fastest | 1.3 GB | 2 GB |
| llama-3.2-3b | small fast chat | 2.0 GB | 4 GB |
| Small — 8–16 GB box | |||
| qwen-coder-7b | code completion + chat | 4.7 GB | 8 GB |
| deepseek-r1-8b | distilled reasoning ("thinking") | 4.9 GB | 12 GB |
| lfm2.5-8b-a1b ⭐ | best on-device MoE (1 B active) | 5.0 GB | 8 GB |
| qwen3-8b | general chat, balanced | 5.2 GB | 12 GB |
| mellum2-12b | JetBrains coder MoE (2.5 B active, Apache-2.0) | 7.0 GB | 12 GB |
| mistral-nemo-12b | 128 K context, multilingual | 7.1 GB | 12 GB |
| gemma4-12b | multimodal (text + image; audio declared, route pending) | 7.6 GB | 12 GB |
| qwen3-14b | more capable Qwen 3 chat | 9.0 GB | 16 GB |
| qwen-coder-14b | code + agent (proven) | 9.0 GB | 16 GB |
| phi-4-14b | strong reasoning per byte | 9.1 GB | 12 GB |
| Mid — 24–32 GB box | |||
| gpt-oss-20b ⭐ | OpenAI open-weight; adjustable thinking | 14 GB | 16 GB |
| qwen3.6-27b ⭐ | 77 % SWE-bench; top consumer pick | 17 GB | 24 GB |
| gemma4-26b | MoE 4 B active; multimodal vision | 18 GB | 24 GB |
| qwen3-30b | MoE 3 B active; very fast | 19 GB | 24 GB |
| qwen3-coder-30b | MoE 3.3 B active code agent | 19 GB | 24 GB |
| qwen-coder-32b | dense code agent (older, proven) | 20 GB | 32 GB |
| Power user — single 80 GB GPU / 2-node sharded | |||
| llama-3.3-70b-sharded | frontier-ish, ≥ 2 nodes | 43 GB | 48 GB |
| gpt-oss-120b | ≈ o4-mini reasoning, single H100 | 65 GB | 80 GB |
| llama-4-scout | 10 M context, multimodal (109 B MoE) | 67 GB | 80 GB |
| Frontier — multi-machine sharded | |||
| step-3.7-flash-sharded ⭐ | 198 B MoE / 11 B active VLM — Apache-2.0, ~400 tok/s | 100 GB | 128 GB |
| deepseek-v4-flash-sharded ⭐ | 284 B MoE / 13 B active — cost-efficient frontier | 150 GB | 160 GB |
| nemotron-3-ultra-sharded | 550 B hybrid Mamba-MoE / 55 B active — 1 M context, MMLU 89.1 | 280 GB | 320 GB |
| glm-5.1-sharded | 754 B MoE / 40 B active — best agentic coder | 400 GB | 416 GB |
| kimi-k2.6-sharded | 1 T MoE / 32 B active — # 1 open coding | 500 GB | 512 GB |
For the complete per-model walkthrough — picker table with code/chat/reasoning/vision ratings, install + use snippets for every client (curl / Cursor / Claude Code / SDKs) — see MODELS.md.
The single best default if you have ≥ 24 GB RAM. 77 % SWE-bench, Apache-2.0, strong code + agent. Works great with Claude Code and Cursor.
flock model add qwen3.6-27b
OpenAI's open-weight model, Apache-2.0, adjustable reasoning effort, fits a 16 GB box. ≈ o3-mini quality on reasoning benchmarks.
flock model add gpt-oss-20b
Frontier reasoning quality at consumer cost — 284 B MoE / 13 B active means fast inference. Splits cleanly across 2 nodes via llama.cpp RPC.
flock shard create \ deepseek-v4-flash-sharded 2
Try Flock without commitment
Pointing Claude Code at Flock is just three env vars. Going back to api.anthropic.com is unsetting them.
export ANTHROPIC_BASE_URL=\ http://localhost:8080 export ANTHROPIC_AUTH_TOKEN=\ sk-orc-... export ANTHROPIC_MODEL=\ llama-3.2-1b claude
flock disconnect claude-code # prints the exact unset + export # commands — same for every # supported client.
Manually: unset ANTHROPIC_BASE_URL ANTHROPIC_AUTH_TOKEN ANTHROPIC_MODEL, then export ANTHROPIC_API_KEY=sk-ant-.... Or just open a fresh terminal — Claude Code defaults to api.anthropic.com when the BASE_URL var isn't set.
# Keep Flock vars set, # add real Anthropic key: export ANTHROPIC_API_KEY=\ sk-ant-... flock up # restart
Now --model claude-opus-4-7 transparently proxies to real Anthropic. Local models stay free. Same claude, you pick per-prompt.
vs the alternatives
Flock sits at the intersection of three categories that mostly don't overlap.
| Feature | Flock | Ollama | LiteLLM | exo | LocalAI |
|---|---|---|---|---|---|
| OpenAI-compatible API | ✓ | ✓ | ✓ | ✓ | ✓ |
| Anthropic-compatible API (Claude Code) | ✓ | ✗ | ✓ | ✗ | ✗ |
| Per-user API keys + quotas | ✓ | ✗ | ✓ | ✗ | ✗ |
| Audit log | ✓ | ✗ | ✓ | ✗ | ✗ |
| Multi-machine routing | ✓ | ✗ | ✗ | ✓ | ✗ |
| Auto-sharding (one model across N machines) | ✓ | ✗ | ✗ | ✓ | ✗ |
| Hybrid local + vendor fallback | ✓ | ✗ | ✓ | ✗ | ✗ |
| Embedded admin UI | ✓ | ✗ | ✗ | ✗ | partial |
| Single binary (no Python/Docker/k8s) | ✓ | ✓ | ✗ | ✗ | partial |
| Apache-2.0 | ✓ | ✓ | ✓ | ✓ | ✓ |
CLI reference
Every command supports --help with examples.
Lifecycle
- flock up
- Start local node (leader on first run)
- flock down
- Stop the local node
- flock status
- Cluster status summary
- flock join <url>?token=…
- Join as a worker
- flock doctor
- Diagnose common problems
- flock update
- In-place upgrade to latest release
- flock version
- Print version
Nodes
- flock node ls
- List nodes
- flock node show <id>
- Inspect a node
- flock node drain <id>
- Stop routing to it
- flock node remove <id>
- Forget a node
Models
- flock model search [q]
- Browse catalog
- flock model add <id>
- Install (auto-delegates if sharded)
- flock model ls
- List installed
- flock model remove <id>
- Uninstall
Sharded models
- flock shard create <m> [N]
- Orchestrate sharded model across N workers
- flock shard ls
- List shards
- flock shard remove <m>
- Tear down
Tokens / users
- flock token create [name]
- Issue API key (--admin, --node)
- flock token ls
- List API keys
- flock token revoke <id>
- Revoke a key
Observability + config
- flock usage [--limit N]
- Recent inference records
- flock audit [--limit N]
- Recent admin actions
- flock config show
- Effective config (secrets redacted)
Connect your tools
Set OpenAI base URL:
http://localhost:8080/v1
API key: sk-orc-…
ANTHROPIC_BASE_URL=http://localhost:8080 ANTHROPIC_AUTH_TOKEN=sk-orc-… ANTHROPIC_MODEL=llama-3.2-1b
OpenAI( base_url="http://localhost:8080/v1", api_key="sk-orc-…" )
Security model
Flock assumes a trusted network (LAN or Tailscale) for cluster traffic. Honest about what's protected and what isn't.
What's strongly protected
- User API keys stored as sha256 hashes — plaintext shown only at creation
- Worker HTTP servers bind to the mesh address (LAN / tailnet IP), never
0.0.0.0 - Web UI auth by pasted admin key (in browser
localStorage) - Quotas + audit log limit damage from a leaked key
- Vendor fallback uses team-scoped vendor keys, never the user's
What requires LAN trust
- Worker tokens are stored plaintext in
nodes.worker_tokenon the leader's SQLite - Anyone with read access to the leader's DB can impersonate a worker
- HMAC-SHA256 mutual auth between leader and workers is shipped — signatures travel instead of tokens; set
FLOCK_REJECT_BEARER=1on workers to require it - For hostile networks run the cluster behind Tailscale or a zero-trust overlay
Free. Open source. No telemetry.
Flock is released under the Apache License 2.0. You can use it commercially, modify it, embed it in your own products, redistribute it. No "open core" gotchas. No "free for personal use only" clauses. No SaaS plan to upgrade to.
No license fee. Ever.
The binary you download from GitHub Releases is the same binary a Fortune-500 would use. There is no Pro / Enterprise / Cloud tier hiding the features you actually want.
Apache-2.0 — actually permissive
✓ Commercial use · ✓ Modification · ✓ Distribution · ✓ Patent grant included · ✓ Private use. The only requirements: keep the license + notice, state significant changes you made.
Your data stays yours
No phone-home. No analytics. No "anonymized" telemetry that's actually fingerprinted. The binary doesn't open outbound connections except to engines you configure (Ollama / vendor APIs you opt in to).
What it could save you
A team of 10 devs running modern AI tools heavily can burn $200–500 per dev per month in API tokens. That's $30–60k/year scaling linearly with usage. Flock moves the 80% of "easy" calls to your own hardware — for free — and keeps the optional escape hatch to real Claude / GPT for the 20% that actually need it.
| Claude API (Sonnet) | ~$3,000 |
| OpenAI API (gpt-4o) | ~$2,500 |
| OpenRouter / vendor proxy | $2,500 + markup |
| Flock + your own hardware | ~$50 (electricity) |
Hardware (~$16k for the team-of-10 build) pays back in ~5 months. Stack works for years after.
What's shipped & what's next
✓ Shipped
- Core gateway
- • OpenAI + Anthropic API surface, streaming, tool use, vision (
image_url), embeddings (/v1/embeddings) - •
/v1/rerank(Cohere shape, llama-server passthrough) +/v1/audio/transcriptions+/v1/audio/speechshells (whisper / piper endpoints) - • Ollama / vLLM / MLX / llama.cpp drivers (single-node + RPC)
- • Multi-node routing with heartbeat + placements (LAN mesh)
- • Sharding auto-orchestration + shard crash auto-restart + auto-distribution of GGUF weights (sha256-verified)
- • 37-model catalog + license metadata +
flock model add hf:/ollama:/file:+--from <my.yaml>for non-catalog installs - Multi-tenancy & auth
- • Per-user API keys, scopes (admin / user / node), TTL expiry (
--ttl 7d,renew,expire) - • Per-key model allowlist (literal +
claude-*glob) — 403model_not_allowed+ audit row - • Per-key RPM + TPM rate limits (leaky-bucket) with reconciliation on actual usage
- • Daily token quotas + dollar budgets (day/week/month windows; multiple budgets compose with AND)
- • Per-call $ cost tracking (vendor pricing table + catalog override;
cost_usdsnapshotted on every usage row) - • Standard
X-RateLimit-*response headers +X-Flock-Request-Idcorrelation - Router intelligence
- • Failure-based catalog fallback + typed chains (
fallback_on_context_length,fallback_on_content_policy) with error classification - • Per-request overrides (
flock.fallbacks,num_retries,retry_backoff_ms,hedge) via body orX-Flock-*headers - • Sticky sessions (KV-cache locality on multi-turn chats) + placement cooldown (circuit breaker for flaky workers)
- • Request hedging — fire to top-N least-loaded workers, return whichever responds first
- • Latency-aware fallback (p95 trigger)
- Hybrid local + cloud
- • Anthropic + OpenAI passthrough (
claude-*,gpt-*) - • 7 new vendor passthroughs:
openrouter/,groq/,together/,fireworks/,cohere/,mistral/,perplexity/— slash-prefix stripped before forwarding - • Bedrock SigV4 (anthropic.* family) + Vertex ADC probe
- Observability & policy
- • Prometheus metrics, OTLP traces end-to-end across all four engine drivers, reference Grafana dashboards
- • Webhook + Langfuse callbacks — usage / audit fan-out with HMAC-signed payloads, bounded queues
- • Guardrails framework — pre-call webhook hook that can block / rewrite / flag (PII redaction, prompt-injection checks)
- • Response cache for embeddings (memory or SQLite-backed; canonical key;
Cache-Controlopt-out) - • Time-bucketed usage breakdown (
/admin/v1/usage/breakdown?group_by=user,model&bucket=day) - • HMAC mutual auth for worker token (no plaintext on wire)
- • Typed
engine_unreachable+guardrail_blocked+budget_exceedederrors with actionable hints - DX
- • Embedded web UI: live SSE event stream, modal-based confirm/prompt, audit filter, per-row "models / rates / budgets / expiry" editors, $ today KPI, breakdown panel
- • 19-client
flock connectroster (golden-tested), interactive picker, shell completion,--json/--summary/--dry-runeverywhere - • Install one-liner + signed binaries + .deb / .rpm packages + 2-node smoke + nightly single-node e2e
→ Next
- • Semantic cache — chromem-go embedded vector store, per-namespace threshold
- • OIDC login for the web UI (Google / GitHub / Okta)
- • Chat completion caching with streaming replay (today: embeddings only)
- • Post-call guardrails on streamed responses (today: pre-call only)
- • Vertex body translation (OpenAI / Anthropic →
generateContent) — ADC probe wired, translation queued - • Bedrock streaming + non-Anthropic body shapes (amazon, meta, mistral)
- • Whisper / Piper engine drivers (today: endpoint proxies; auto-launch + catalog entries queued)
- • Router fallback callback events (today: usage + audit sinks ship; fallback is in the audit log only)
- • Tailscale (
tsnet) mesh for NAT traversal + mTLS - • LoRA adapter loading + live model migration
- • Postgres backend for HA control plane
- • AMD ROCm path · NAS .spk packages (Synology DSM)
Get started in 3 minutes
No signup. No SaaS. Just a binary.
$ curl -fsSL https://raw.githubusercontent.com/hadihonarvar/flock/main/installer/install.sh | sh