Self-Hosting Your AI Stack: A Practical Guide
Commissioned, Curated and Published by Russ. Researched and written with AI. This is the living version of this post. View versioned snapshots in the changelog below.
Disclaimer: This reflects my own setup and opinions as of early March 2026. Hardware prices, model capabilities, and software maturity change fast. Verify specs before you buy anything.
This is a living document. As the stack evolves, I’ll publish versioned snapshots and update this index.
What’s New This Week (5 March 2026)
Two things worth noting this week, both directly relevant to the stack.
Voice inference just got a lot more local. NVIDIA PersonaPlex 7B is trending on HN today (328 points, 109 comments) for a reason: it runs full-duplex speech-to-speech on Apple Silicon via MLX, quantized to 5.3 GB, faster than real-time (RTF 0.87, ~68ms/step). The traditional voice pipeline is ASR + LLM + TTS – three models, three handoffs, cumulative latency. PersonaPlex collapses that into a single model that processes audio tokens directly and produces audio out. Native Swift, no Python, no server. If you’re on Apple Silicon and have been treating voice as “use an API” territory, that’s no longer the default answer. The “When to Still Use an API” section below has been updated to reflect this.
AMD Ryzen AI 400 Series for AM5 desktop is official – launched at MWC 2026. First desktop chips to hit Microsoft Copilot+ certification, with up to 50 TOPS from an integrated NPU alongside Zen 5 cores and RDNA 3.5 graphics on a standard AM5 socket. This doesn’t replace a discrete GPU for running Qwen3.5-35B-A3B – NPU handles specific workloads, VRAM still governs LLM inference – but it moves the floor for on-device AI workloads significantly lower on mainstream hardware.
Changelog
| Date | Summary |
|---|---|
| 5 Mar 2026 | PersonaPlex 7B enables local full-duplex voice on Apple Silicon; AMD Ryzen AI 400 desktop (50 TOPS NPU, AM5). |
| 4 Mar 2026 | Apple MacBook Neo ($599, A18 Pro, ships Mar 11). |
| 2 Mar 2026 | Initial publication |
Why Self-Host in 2026
The arguments for self-hosting AI shifted significantly this year. It’s no longer just ideology – the economics and reliability case is now concrete.
Cost. Qwen3.5-35B-A3B runs on 32GB VRAM (a single consumer GPU, or two older cards via NVLink) and benchmarks ahead of GPT-5-mini on most coding and reasoning tasks. The model is free. The GPU, amortised over three years, costs less than two months of a mid-tier API plan if you’re running any serious workload. At scale the numbers are obvious. Even for a single developer, the crossover point hits faster than most people expect.
No rate limits. API rate limits are a tax on momentum. You hit a good flow, your agent starts doing real work, and then you’re throttled. Running local means the limit is your hardware – which you can reason about and upgrade.
Data sovereignty. When your agent runs against an API, every prompt, every document, every query you send is processed on infrastructure you don’t control. Most terms of service reserve the right to use that data for model improvement. “Opt-out” settings exist, but they depend on the vendor honoring them, and you have no audit trail.
Vendor lock-in is real. The Google OAuth crackdown in early 2026 demonstrated this clearly. Accounts restricted without warning, with appeals processes that take weeks and often go nowhere. Developers who had built auth flows, data pipelines, or automation on Google infrastructure had no recourse. If your AI stack depends on an API key from a company that can revoke your access unilaterally, you have a single point of failure you don’t control.
The surveillance architecture problem. The OpenAI Persona system – and similar approaches from other frontier labs – is designed to build persistent profiles across users and sessions. This is not speculation. It’s a business model. If you’re running an agent that handles personal data, business logic, or anything sensitive, routing it through a commercial API means feeding that data into a surveillance architecture optimised for product telemetry. Self-hosting removes that layer entirely.
None of this means you never use an API. There are still tasks where frontier models are genuinely better. But the default should be local, with APIs reserved for specific cases.
The Inference Layer
This is the foundation. Get it wrong and nothing else matters.
Model Selection: The Current Sweet Spot
Qwen3.5-35B-A3B is the model I’d recommend for most self-hosters right now. Released by Alibaba in February 2026, it’s a 35B parameter Mixture-of-Experts model with a 1 million token context window. Quantised to Q4_K_M it fits in 32GB of VRAM – two RTX 3090s or 4090s, a single 32GB card, or one 24GB card if you accept more aggressive quantisation. On standard benchmarks it beats GPT-5-mini across coding, reasoning, and instruction following. This is the crossover point people were waiting for – a model that’s genuinely competitive with commercial offerings, running entirely on your own hardware.
One caveat worth flagging: the core Qwen team has fractured. Junyang Lin – the tech lead who built Qwen – stepped down this week. Two other key colleagues departed in the same window. Alibaba has formed a task force to continue development. The models exist, perform well, and the hardware requirements are unchanged. But for engineers making long-term infrastructure bets on Qwen, watching the task force’s first major release will be informative. The Mistral, Gemma, and Phi families deserve a place in your local evaluation pipeline as a hedge.
For lighter tasks, smaller models (Qwen3.5-7B, Phi-4, Gemma-3-12B) are fast and cheap to run. They’re good for agent tool-use, classification, summarisation, and anything that doesn’t need deep reasoning. Keep a small model running as your default and only route hard tasks to the 35B.
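The routing itself is a few lines of code. A minimal sketch – the task categories and Ollama model tags here are illustrative, not a standard; tune them to your own workload:

```python
def pick_model(task_type: str) -> str:
    """Route a task to a model tag: fast small model by default,
    the big model only when the task actually needs deep reasoning.
    Categories and tags are illustrative."""
    hard_tasks = {"coding", "reasoning", "long-context"}
    if task_type in hard_tasks:
        return "qwen3.5:35b-a3b"  # the 35B, only when it's needed
    return "qwen3.5:7b"           # default for tool use, classification, summaries
```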
Inference Servers
Ollama is the easiest entry point. Single binary, pulls models from its registry, exposes an OpenAI-compatible API on localhost:11434. For most people this is the right choice. The overhead is minimal, the model management is clean, and it handles quantisation automatically. If you already know Docker, there’s an official image. ollama run qwen3.5:35b-a3b and you’re running.
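Once it’s up, anything that speaks the OpenAI chat format can talk to it. A stdlib-only sketch – the port and model tag match Ollama’s defaults above, but the helper names are mine:

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/v1/chat/completions"

def build_request(prompt: str, model: str = "qwen3.5:35b-a3b") -> urllib.request.Request:
    """Build an OpenAI-compatible chat request for the local Ollama server."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "stream": False,
    }
    return urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )

def ask(prompt: str) -> str:
    """Send the prompt and return the model's reply text."""
    with urllib.request.urlopen(build_request(prompt)) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]

if __name__ == "__main__":
    print(ask("One sentence: why run inference locally?"))
```

No SDK required – which is the point of the OpenAI-compatible surface.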
The limitations: Ollama doesn’t give you much control over quantisation format, batching strategy, or GPU memory allocation. For a single-user setup it doesn’t matter. For anything with concurrent requests, you’ll want more control.
llama.cpp (now under the HuggingFace umbrella after ggml.ai’s acquisition) is the lower-level option. You compile it yourself, choose your quantisation explicitly, and get full control over inference parameters. The llama-server binary exposes an OpenAI-compatible HTTP API. It’s more work to set up but it’s the right choice if you’re running multiple models, need precise VRAM management, or want to squeeze performance.
Worth knowing: ggml.ai joining HuggingFace is a consolidation signal. The local inference toolchain is maturing. Expect llama.cpp to become more integrated with HuggingFace Hub over the next year.
vLLM is the production option – continuous batching, PagedAttention, proper throughput for concurrent users. Overkill for personal use, but if you’re running inference for a team or building something others will hit, look at it.
Hardware
The GPU question is the most common one. Here’s the honest breakdown:
- RTX 4090 (24GB): Fits most 7-13B models comfortably at high quality. Qwen3.5-35B-A3B needs two of these or aggressive quantisation. Fast. Expensive (~$1600-1800 used).
- RTX 3090 (24GB): Same VRAM as the 4090, meaningfully slower, but much cheaper used (~$600-800). Excellent value if you’re patient.
- Two RTX 3090s or 4090s: 48GB combined via NVLink or tensor parallelism in llama.cpp. Gets you the 35B model at good quality. This is the sweet spot for power users right now.
- AMD RX 7900 XTX (24GB): ROCm support is better than it was. Realistically Linux-only. Cheaper than NVIDIA equivalents. Worth considering if you’re comfortable with the driver situation.
- CPU fallback: llama.cpp runs on CPU. A modern server CPU (Threadripper, EPYC) with fast RAM can do 3-8 tokens/sec on a 7B model. Usable for non-interactive tasks. Not usable for conversation. Don’t buy CPU hardware for inference – it’s just there if you already have it.
- Apple Silicon: Surprisingly good. The M-series unified memory architecture means a MacBook Pro M4 Max (128GB) can run Qwen3.5-35B-A3B well. Ollama on macOS uses the Metal backend and achieves decent throughput. If you’re already on Apple Silicon and have the RAM, this is a legitimate inference setup. The Mac Mini M4 Pro at 64GB is the cheapest way into the 35B model. Apple Silicon is also now a viable voice inference platform: NVIDIA PersonaPlex 7B runs full-duplex speech-to-speech at 5.3 GB quantized, faster than real-time, via MLX – no Python, no server required.
Taalas and similar purpose-built inference hardware are showing that 17k tokens/sec is achievable. That’s the direction the market is going. Consumer GPU hardware will start looking slow by comparison within 18 months. If you’re planning a major hardware investment, factor that in.
When to Still Use an API
Be honest with yourself. Local models are good. Frontier models are better at the frontier.
Use an API for:
- Multi-step reasoning tasks where you need Claude Opus 4.6 or Gemini 3.1 Pro level capability
- Tasks requiring very recent knowledge (local models are frozen at training cutoff)
- Image generation if you don’t want to self-host Flux or SDXL
- Voice (TTS/STT) – unless you’re on Apple Silicon, where PersonaPlex 7B now makes fully local full-duplex voice practical at 5.3 GB and faster than real-time
Use local for everything else – drafting, coding assistance, tool use in agents, classification, summarisation, embeddings.
Structure your agent so switching the underlying model is a config change, not a code change. Use the OpenAI-compatible API format everywhere so you can swap between Ollama locally and a frontier API without touching application logic.
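That pattern can be as small as reading a few environment variables. A sketch – the variable names are my own convention, not a standard:

```python
import os

def inference_config() -> dict:
    """Resolve the inference endpoint from the environment, defaulting to
    local Ollama. Swapping to a frontier API is then just exporting
    different values -- no application code changes."""
    return {
        "base_url": os.environ.get("LLM_BASE_URL", "http://localhost:11434/v1"),
        "api_key": os.environ.get("LLM_API_KEY", "ollama"),  # Ollama ignores the key
        "model": os.environ.get("LLM_MODEL", "qwen3.5:35b-a3b"),
    }
```

Point any OpenAI-compatible client at `base_url` and `model`, and the local/remote decision lives entirely in config.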
The Agent Layer
Inference is the brain. The agent layer is what makes it useful.
Runtime Options
OpenClaw is a full-featured personal agent runtime. Multi-channel (Telegram, Discord, SMS), skill ecosystem, persistent memory, tool use, scheduling. It’s designed to run as a daemon – on a VPS, home server, or Pi – and be your persistent AI interface across whatever channels you use. The skill system means you can extend it without touching the core. If you want something that works out of the box and covers the full range of what a personal AI agent should do, this is the current best option.
NanoClaw is the minimal version – around 4,000 lines of readable code. It does the core loop: receive message, run inference, use tools, respond. No magic, no abstraction layers. If you want to understand exactly what your agent is doing, or you’re building something custom and want a starting point you can actually read in a sitting, NanoClaw is the right choice. It’s also the right choice for constrained hardware where you want predictable resource usage.
The tradeoff is real: OpenClaw gives you more features with less work; NanoClaw gives you more understanding with more work. Neither is wrong. I’d start with OpenClaw and drop to NanoClaw if you find yourself fighting its abstractions.
Where to Run It
Raspberry Pi 5 (8GB): Legitimate agent node. The Pi 5 with 8GB RAM runs OpenClaw or NanoClaw comfortably. It won’t run local inference worth anything (no GPU), but it can route to your inference server or a remote API. Use it as the always-on agent node – handles channels, scheduling, tool use – while your inference box sits elsewhere. At current prices you can put a Pi 5 in a small colocation rack for very little ongoing cost. The recent Pi stock surge reflects exactly this use case being discovered at scale.
Hetzner CAX11 (~3.29 EUR/month): ARM VPS, 2 vCPUs, 4GB RAM. Runs an agent runtime without breaking a sweat. No GPU, so same pattern as the Pi – route inference elsewhere. This is my recommendation for people who don’t want to think about home networking, port forwarding, or power outages. Hetzner’s reliability is good and the price is hard to argue with.
Home server: If you already have one, obvious choice. Put the GPU in it, run inference and the agent runtime on the same box, avoid network latency between them.
Colo: If you have access to colocation, this is the premium option. Your hardware, your control, better connectivity than home, no cloud dependency. The cost is only justifiable if you’re already paying for it or sharing it with others.
Memory and Persistence
Don’t overthink this.
SQLite for Most Things
SQLite handles more than people give it credit for. Agent conversation history, skill state, tool outputs, user preferences – all of this works fine in SQLite. It’s a single file, it’s fast for single-writer workloads, it supports WAL mode for concurrent reads, and it’s trivial to back up (rsync a file). OpenClaw uses SQLite by default. NanoClaw uses SQLite. There’s a reason for this.
Use SQLite until you have a concrete reason not to.
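Everything you need is in the standard library. A minimal sketch of an agent message store with WAL enabled – the schema is illustrative:

```python
import sqlite3

def open_store(path: str = "agent.db") -> sqlite3.Connection:
    """Open the agent state database with WAL mode so readers
    don't block the single writer."""
    conn = sqlite3.connect(path)
    conn.execute("PRAGMA journal_mode=WAL")
    conn.execute(
        """CREATE TABLE IF NOT EXISTS messages (
               id INTEGER PRIMARY KEY,
               role TEXT NOT NULL,
               content TEXT NOT NULL,
               created_at TEXT DEFAULT CURRENT_TIMESTAMP
           )"""
    )
    return conn

def log_message(conn: sqlite3.Connection, role: str, content: str) -> None:
    """Append one conversation turn; the context manager commits for us."""
    with conn:
        conn.execute(
            "INSERT INTO messages (role, content) VALUES (?, ?)", (role, content)
        )
```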
PostgreSQL When You Need It
Listmonk (self-hosted newsletter, covered below) requires Postgres. pgvector (vector search extension) runs on Postgres. If you need either of those, you’re running Postgres anyway. Run it in Docker, use a named volume for the data, done.
Don’t run Postgres just because it feels more “production.” SQLite is production. Write speed and concurrency at the scale a personal AI stack operates at are not your bottleneck.
Vector Storage
For semantic search and embeddings:
pgvector is the pragmatic choice if you’re already running Postgres. One extension, no additional service, integrates with anything that speaks SQL.
Chroma is the dedicated option – easier to get started with if you’re not already on Postgres, has a clean Python API, runs as a server or embedded. Docker image is small.
File-based for small scale: if you have fewer than 100k vectors, a flat file of numpy arrays with a cosine similarity search loop is fine. Really. Don’t add infrastructure until the simple thing stops working.
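At that scale the whole search engine is a few lines of numpy. A sketch, assuming rows are your stored embeddings:

```python
import numpy as np

def top_k(query: np.ndarray, vectors: np.ndarray, k: int = 5) -> np.ndarray:
    """Indices of the k rows of `vectors` most cosine-similar to `query`."""
    q = query / np.linalg.norm(query)
    v = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
    sims = v @ q                     # cosine similarity against every row at once
    return np.argsort(-sims)[:k]     # best first
```

Persist with `np.save`/`np.load` and you have a "vector database" that rsyncs like everything else.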
Backup Strategy
rsync to an offsite target on a cron job. That’s it.
```
# /etc/cron.d/ai-stack-backup
0 3 * * * root rsync -az --delete /opt/ai-stack/ user@backup-host:/backups/ai-stack/
```
Back up: SQLite files, Postgres dumps (pg_dump), your config files, your agent skill directories, your Hugo content directory. Exclude model weights – those are re-downloadable. If you need point-in-time recovery, enable SQLite WAL and snapshot the WAL file. For Postgres, pg_basebackup or WAL archiving if you’re serious about it. For a personal stack, daily dumps are probably fine.
Test your restores. Untested backups are not backups.
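One thing worth knowing for the SQLite files specifically: copying a live database with `cp` or rsync mid-write can capture an inconsistent snapshot. The stdlib `sqlite3` backup API takes a consistent copy while the database is in use – a sketch:

```python
import sqlite3

def backup_sqlite(src_path: str, dest_path: str) -> None:
    """Take a consistent copy of a live SQLite database
    (safe mid-write, unlike a plain file copy)."""
    src = sqlite3.connect(src_path)
    dest = sqlite3.connect(dest_path)
    with dest:
        src.backup(dest)  # pages copied under SQLite's own locking
    dest.close()
    src.close()
```

Run it from cron before the rsync, and make "open the copy and query it" part of your restore test.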
Publishing: The Blog Stack
This is the meta-layer – how you publish what you build and learn.
Hugo
Static site generator, markdown source, git-native. The build output is a folder of HTML and static assets. You can host it on any CDN, object storage, or basic web server. There’s no database to manage, no runtime to secure, no CMS to update.
The content workflow: write markdown in your editor, commit to git, push, deploy. With a post-receive hook or a simple CI pipeline (Woodpecker CI, Gitea Actions, even a bare shell script), the site rebuilds on push automatically.
Hugo’s build speed is its defining characteristic. A site with thousands of pages builds in under a second. This matters when you’re iterating on templates or running automated content pipelines.
Listmonk
Self-hosted newsletter. Single Go binary, Postgres backend, clean web UI. Handles subscriber management, campaigns, transactional emails, list segmentation. It’s what this newsletter runs on.
docker-compose.yml with a Postgres service and a Listmonk service. Map port 9000, put Nginx or Caddy in front, done. The Docker image is small, the resource usage is low, and it does everything a newsletter needs without subscription fees.
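The shape of that compose file, roughly – image tags and credentials are placeholders, and Listmonk’s own DB connection settings (its config file) are elided here; take the exact values from the Listmonk docs:

```yaml
services:
  db:
    image: postgres:16-alpine
    environment:
      POSTGRES_USER: listmonk
      POSTGRES_PASSWORD: change-me   # load from an env file in practice
      POSTGRES_DB: listmonk
    volumes:
      - listmonk-db:/var/lib/postgresql/data
  app:
    image: listmonk/listmonk:latest
    depends_on:
      - db
    ports:
      - "127.0.0.1:9000:9000"   # localhost only; Nginx/Caddy terminates TLS in front
volumes:
  listmonk-db:
```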
Email Delivery: SES
Amazon SES for SMTP relay. ~$0.10 per 1,000 emails. Verify your domain, set up DKIM and SPF, configure Listmonk to use SES as the SMTP backend. For a newsletter under 100k subscribers, this is the cheapest reliable option by a significant margin.
The alternatives worth considering are Mailgun and Postmark, if you want better deliverability analytics out of the box. SES is cheapest; the others have better tooling.
DNS and TLS
Cloudflare for DNS. Free, fast, good API for automation, works with Certbot for DNS-01 challenges. Set your nameservers to Cloudflare, manage records there.
Certbot with Let’s Encrypt for TLS. Standard. Use the DNS-01 challenge via the Cloudflare plugin if you’re running services that aren’t publicly accessible on port 80:
```shell
certbot certonly \
  --dns-cloudflare \
  --dns-cloudflare-credentials ~/.secrets/cloudflare.ini \
  -d yourdomain.com \
  -d "*.yourdomain.com"
```
Wildcard cert covers everything. Renews automatically. Set up a cron job or systemd timer for renewal and don’t think about it again.
Deployment
For Hugo, a bare git repository on your server with a post-receive hook:
```bash
#!/bin/bash
# /opt/site.git/hooks/post-receive
GIT_WORK_TREE=/var/www/site git checkout -f
cd /var/www/site && hugo --minify
```
Push to the remote, the hook fires, Hugo builds, the new site is live. No CI server required.
For Listmonk and other services: Docker Compose files in a git repo. git pull && docker-compose up -d is a deploy. Simple enough to do by hand, simple enough to automate.
Monitoring and Security
You don’t need a full observability stack. You need to know when things break.
What to Actually Monitor
Inference latency: If your model server starts returning responses slowly, you want to know before your agent times out. Expose the /metrics endpoint from Ollama or llama.cpp and scrape it with Prometheus. Or just write a cron job that hits the inference endpoint every 5 minutes and logs the response time. Whatever is proportionate to how much you care.
Memory and disk: Local inference eats RAM. Disk fills up with model weights, logs, and SQLite files. Set alerts at 85% for both. Grafana + Prometheus is one option; a simple shell script that sends you a Telegram message when disk hits a threshold is another. The latter takes 10 minutes to write.
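The disk half of that script is genuinely tiny. A stdlib sketch – pipe the message into whatever notifier you already use (Telegram bot, email, ntfy):

```python
import shutil

def disk_usage_percent(path: str = "/") -> float:
    """Percentage of the filesystem at `path` currently in use."""
    usage = shutil.disk_usage(path)
    return 100.0 * usage.used / usage.total

def check(threshold: float = 85.0, path: str = "/"):
    """Return an alert string if usage crossed the threshold, else None."""
    pct = disk_usage_percent(path)
    if pct >= threshold:
        return f"disk at {pct:.0f}% on {path}"
    return None

if __name__ == "__main__":
    msg = check()
    if msg:
        print(msg)  # cron mails stdout, or pipe this into your notifier
```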
Failed agent runs: Your agent should log errors. Parse the logs. Alert on repeated failures. If your agent is silently failing to complete tasks, you want to know.
Service health: Simple HTTP health checks. A cron job that curls your services every minute and sends an alert if it gets a non-200 response is enough for a personal stack.
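The curl loop translates directly to a few lines of Python if you want structured output instead of parsing curl exit codes – a sketch, with service names and URLs as placeholders:

```python
import urllib.error
import urllib.request

def probe(url: str, timeout: float = 5.0) -> int:
    """Return the HTTP status for url, or 0 if the service is unreachable."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as exc:
        return exc.code          # server answered, but with an error status
    except (urllib.error.URLError, OSError):
        return 0                 # connection refused, DNS failure, timeout

def unhealthy(statuses: dict) -> list:
    """Names of services that did not answer 200."""
    return [name for name, code in statuses.items() if code != 200]

if __name__ == "__main__":
    # Illustrative endpoints -- substitute your own services.
    results = {"blog": probe("http://localhost:80"),
               "listmonk": probe("http://localhost:9000")}
    for name in unhealthy(results):
        print(f"ALERT: {name} unhealthy")
```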
Security Basics
UFW. Default deny inbound. Open only what you need:
```shell
ufw default deny incoming
ufw default allow outgoing
ufw allow ssh
ufw allow 80/tcp
ufw allow 443/tcp
ufw enable
```
If your inference server is on the same host as the agent runtime, bind it to localhost only. Don’t expose Ollama to the internet.
Fail2ban. Install it, configure it for SSH (the default config is fine), let it run. It will ban IPs that fail authentication repeatedly. Not a substitute for key-only auth, but an additional layer.
SSH keys only:
```
# /etc/ssh/sshd_config
PasswordAuthentication no
PubkeyAuthentication yes
PermitRootLogin no
```
No exceptions. Password auth over SSH is not acceptable in 2026.
Secrets management. This one catches people. Your AGENTS.md, your skill configs, your environment files – these should never contain API keys, database passwords, or anything sensitive. Use environment variables loaded from a file that is not in your git repository. Use .gitignore aggressively. Audit your repos periodically with git log -p | grep -i "api_key\|password\|secret". Better: use a .env file pattern with dotenv, or use a secrets manager if you’re running on a VPS that has one.
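The .env pattern is small enough to see whole. This is a deliberately minimal loader so you can see what python-dotenv is doing for you (the real library handles quoting, export prefixes, and multiline values properly):

```python
import os

def load_env(path: str = ".env") -> None:
    """Minimal .env loader: KEY=VALUE lines, '#' comments, no quoting rules.
    Existing environment variables win over file values."""
    try:
        with open(path) as fh:
            lines = fh.readlines()
    except FileNotFoundError:
        return  # no .env is fine; env vars may be set some other way
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        os.environ.setdefault(key.strip(), value.strip())
```

The file itself stays in `.gitignore`; only a `.env.example` with empty values gets committed.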
Keep model weights, SQLite databases, and any files containing personal data out of git entirely. They don’t belong there.
The Honest Tradeoffs
Self-hosting is the right call for a lot of people. It’s not the right call for everyone. Here’s what you’re actually signing up for.
Local inference is slower than frontier APIs for hard tasks. Qwen3.5-35B-A3B at 25-35 tokens/sec (aggressively quantised on a single 4090) is slower than GPT-5-mini via API. For interactive use, that’s fine – you won’t notice the difference. For tasks where you need Claude Opus 4.6 level capability, your local model will underperform. That gap is closing – Taalas at 17k tokens/sec shows where hardware is going – but it exists today.
You own the ops burden. Services need updating. Disks fill up. Models get superseded and you need to pull new weights. A VPS goes down during a kernel update. Your GPU driver breaks after a system update. None of this is hard, but it requires attention. If you want zero ops, use APIs.
Updates require care. A new version of Ollama or OpenClaw might change an API or config format. Postgres major version upgrades require migration. Hugo themes drift. Staying current is a job, even if it’s a small one.
But. Costs collapse at scale. Your data stays on your hardware. No surprise account restrictions, no rate limits during a critical workflow, no terms-of-service change that makes your use case newly prohibited. The Google OAuth crackdown and OpenAI’s surveillance architecture are not edge cases – they’re predictable consequences of depending on services you don’t control. Every month you run your own stack, the counter-argument to “just use the API” gets weaker.
The configuration that makes sense for most competent engineers right now: a mid-range GPU (RTX 3090 or 4090) for local inference running Ollama with Qwen3.5-35B-A3B, a Pi 5 or cheap VPS running an agent runtime, SQLite for state, Cloudflare for DNS, Certbot for TLS, and a fallback API key for frontier tasks. The whole thing can be running in a weekend.
Start there. Tune it to your needs from a working baseline.
Sources
- Alibaba Qwen3.5-35B-A3B release notes and benchmark results (February 2026) – huggingface.co/Qwen
- ggml.ai acquisition by HuggingFace announcement (2025/2026) – huggingface.co
- Taalas inference hardware benchmarks – taalas.ai
- Raspberry Pi 5 (8GB) product page and stock availability – raspberrypi.com
- OpenClaw documentation – openclaw.ai
- Listmonk documentation – listmonk.app
- Amazon SES pricing – aws.amazon.com/ses/pricing
- Hetzner Cloud pricing – hetzner.com/cloud
- llama.cpp project – github.com/ggerganov/llama.cpp
- Ollama documentation – ollama.ai
- NVIDIA PersonaPlex 7B on Apple Silicon (February 2026) – blog.ivan.digital
- AMD Ryzen AI 400 Series AM5 announcement, MWC 2026
Published 2 March 2026. This post will be updated as the stack evolves. Versioned snapshots are linked in the changelog above.