Commissioned, Curated and Published by Russ. Researched and written with AI. This is a versioned snapshot of this post, archived on 5 March 2026. View the living version

Disclaimer: This reflects my own setup and opinions as of early March 2026. Hardware prices, model capabilities, and software maturity change fast. Verify specs before you buy anything.

This is a living document. As the stack evolves, I’ll publish versioned snapshots and update this index.


What’s New This Week (4 March 2026)

Apple announced the MacBook Neo today – $599 starting price, powered by the A18 Pro chip with a 16-core Neural Engine. Apple’s benchmarks claim 3x the on-device AI performance of the bestselling Intel Core Ultra 5 competitor, and battery life comes in at 16 hours. It ships March 11. At $499 for education, this is Apple Silicon at a price point that previously did not exist – the most affordable Apple laptop ever, and the first that makes on-device AI a realistic proposition for people who wouldn’t previously have considered it.

For the self-hosting calculus this matters in a specific way. The Mac mini M4 Pro at 64GB remains the right recommendation for serious local inference – nothing in this announcement changes that. The MacBook Neo has a 16GB unified memory ceiling and runs an A18 Pro rather than an M-series chip, so it won’t run Qwen3.5-35B-A3B. But it runs 7B models well via Ollama’s Metal backend, and it does it in a fanless $599 laptop. The entry point for “I want to try running a model locally without buying dedicated hardware” just dropped significantly. That matters for adoption, and adoption matters for the ecosystem.

Unsloth published a full Qwen3.5 fine-tuning guide today, covering the entire model family from 0.8B to 122B-A10B. The numbers are worth noting: 50% less VRAM than standard FA2 setups, 1.5x faster training. Practical VRAM requirements come in at 3GB for the 0.8B, 5GB for the 2B, 10GB for the 4B, and 56GB for the 27B. Vision fine-tuning is supported. Export targets include GGUF (Ollama, llama.cpp, LM Studio) and vLLM – meaning you fine-tune once and deploy to whatever inference backend you’re already running. The local fine-tuning pipeline now has real documentation and accessible hardware requirements. Customising a model to your domain is no longer a research-grade operation.

Also worth flagging: Qwen3.5-35B-A3B’s 1 million token context window is proving out in practice. On coding benchmarks it outperforms GPT-5-mini, and the 32GB VRAM requirement puts it on a single consumer GPU. The combination of the Unsloth fine-tuning guide, the extended context, and the competitive benchmark performance makes this the model to watch for serious self-hosted workloads right now.


Changelog

| Date | Version | Notes |
| --- | --- | --- |
| 2 Mar 2026 | 20260302 | Initial publication |
| 4 Mar 2026 | 20260304 | Apple MacBook Neo ($599, A18 Pro, ships Mar 11); Unsloth Qwen3.5 fine-tuning guide (50% less VRAM, GGUF export); Qwen3.5-35B-A3B beats GPT-5-mini on coding, 1M token context |

Why Self-Host in 2026

The arguments for self-hosting AI shifted significantly this year. It’s no longer just ideology – the economics and reliability case is now concrete.

Cost. Qwen3.5-35B-A3B runs on 32GB VRAM (a single consumer GPU, or two older cards via NVLink) and benchmarks ahead of GPT-5-mini on most coding and reasoning tasks. The model is free. The GPU, amortised over three years, costs less than two months of a mid-tier API plan if you’re running any serious workload. At scale the numbers are obvious. Even for a single developer, the crossover point hits faster than most people expect.

No rate limits. API rate limits are a tax on momentum. You hit a good flow, your agent starts doing real work, and then you’re throttled. Running local means the limit is your hardware – which you can reason about and upgrade.

Data sovereignty. When your agent runs against an API, every prompt, every document, every query you send is processed on infrastructure you don’t control. Most terms of service reserve the right to use that data for model improvement. “Opt-out” settings exist, but they depend on the vendor honouring them, and you have no audit trail.

Vendor lock-in is real. The Google OAuth crackdown in early 2026 demonstrated this clearly. Accounts restricted without warning, with appeals processes that take weeks and often go nowhere. Developers who had built auth flows, data pipelines, or automation on Google infrastructure had no recourse. If your AI stack depends on an API key from a company that can revoke your access unilaterally, you have a single point of failure you don’t control.

The surveillance architecture problem. The OpenAI Persona system – and similar approaches from other frontier labs – is designed to build persistent profiles across users and sessions. This is not speculation. It’s a business model. If you’re running an agent that handles personal data, business logic, or anything sensitive, routing it through a commercial API means feeding that data into a surveillance architecture optimised for product telemetry. Self-hosting removes that layer entirely.

None of this means you never use an API. There are still tasks where frontier models are genuinely better. But the default should be local, with APIs reserved for specific cases.


The Inference Layer

This is the foundation. Get it wrong and nothing else matters.

Model Selection: The Current Sweet Spot

Qwen3.5-35B-A3B is the model I’d recommend for most self-hosters right now. Released by Alibaba in February 2026, it’s a 35B parameter Mixture-of-Experts model with a 1 million token context window. Quantised to Q4_K_M it fits in 32GB VRAM; on a 24GB card like the RTX 3090 or RTX 4090 you’ll need a more aggressive quant, or a second card. On standard benchmarks it beats GPT-5-mini across coding, reasoning, and instruction following. This is the crossover point people were waiting for – a model that’s genuinely competitive with commercial offerings, running entirely on your own hardware.

For lighter tasks, smaller models (Qwen3.5-7B, Phi-4, Gemma-3-12B) are fast and cheap to run. They’re good for agent tool-use, classification, summarisation, and anything that doesn’t need deep reasoning. Keep a small model running as your default and only route hard tasks to the 35B.
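That routing policy can be as simple as a length-and-keyword heuristic in front of your inference call. A minimal sketch – the model tags and hint words here are illustrative, not any runtime's actual API:

```python
# Naive task router: default to the small model, escalate to the 35B
# only when the prompt looks hard. Heuristics and tags are illustrative.
SMALL_MODEL = "qwen3.5:7b"
LARGE_MODEL = "qwen3.5:35b-a3b"

HARD_HINTS = ("refactor", "debug", "prove", "architecture", "plan")

def pick_model(prompt: str) -> str:
    text = prompt.lower()
    # Long prompts or prompts that smell like deep reasoning go to the 35B.
    if len(prompt) > 4000 or any(hint in text for hint in HARD_HINTS):
        return LARGE_MODEL
    return SMALL_MODEL
```

In practice you’d layer this behind whatever dispatch your agent runtime already has; the point is that the default path stays cheap.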

Inference Servers

Ollama is the easiest entry point. Single binary, pulls models from its registry, exposes an OpenAI-compatible API on localhost:11434. For most people this is the right choice. The overhead is minimal, the model management is clean, and it handles quantisation automatically. If you already know Docker, there’s an official image. ollama run qwen3.5:35b-a3b and you’re running.

The limitations: Ollama doesn’t give you much control over quantisation format, batching strategy, or GPU memory allocation. For a single-user setup it doesn’t matter. For anything with concurrent requests, you’ll want more control.

llama.cpp (now under the HuggingFace umbrella after ggml.ai’s acquisition) is the lower-level option. You compile it yourself, choose your quantisation explicitly, and get full control over inference parameters. The llama-server binary exposes an OpenAI-compatible HTTP API. It’s more work to set up but it’s the right choice if you’re running multiple models, need precise VRAM management, or want to squeeze performance.

Worth knowing: ggml.ai joining HuggingFace is a consolidation signal. The local inference toolchain is maturing. Expect llama.cpp to become more integrated with HuggingFace Hub over the next year.

vLLM is the production option – continuous batching, PagedAttention, proper throughput for concurrent users. Overkill for personal use, but if you’re running inference for a team or building something others will hit, look at it.

Hardware

The GPU question is the most common one. Here’s the honest breakdown:

  • RTX 4090 (24GB): Fits most 7-13B models comfortably at high quality. Qwen3.5-35B-A3B needs two of these or aggressive quantisation. Fast. Expensive (~$1600-1800 used).
  • RTX 3090 (24GB): Same VRAM as 4090, meaningfully slower, but much cheaper used (~$600-800). Excellent value if you’re patient.
  • Two RTX 3090s or 4090s: 48GB combined, via NVLink on the 3090s or tensor parallelism in llama.cpp (the 4090 dropped NVLink). Gets you the 35B model at good quality. This is the sweet spot for power users right now.
  • AMD RX 7900 XTX (24GB): ROCm support is better than it was. Linux only really. Cheaper than NVIDIA equivalents. Worth considering if you’re comfortable with the driver situation.
  • CPU fallback: llama.cpp runs on CPU. A modern server CPU (Threadripper, EPYC) with fast RAM can do 3-8 tokens/sec on a 7B model. Usable for non-interactive tasks. Not usable for conversation. Don’t buy CPU hardware for inference – it’s just there if you already have it.
  • Apple Silicon: Surprisingly good. The M-series unified memory architecture means a MacBook Pro M4 Max (128GB) can run Qwen3.5-35B-A3B well. Ollama on macOS uses the Metal backend and achieves decent throughput. If you’re already on Apple Silicon and have the RAM, this is a legitimate inference setup. The Mac mini M4 Pro at 64GB is the cheapest way into the 35B model.

Taalas and similar purpose-built inference hardware are showing that 17k tokens/sec is achievable. That’s the direction the market is going. Consumer GPU hardware will start looking slow by comparison within 18 months. If you’re planning a major hardware investment, factor that in.

When to Still Use an API

Be honest with yourself. Local models are good. Frontier models are better at the frontier.

Use an API for:

  • Multi-step reasoning tasks where you need Claude Opus 4.6 or Gemini 3.1 Pro level capability
  • Tasks requiring very recent knowledge (local models are frozen at training cutoff)
  • Image generation if you don’t want to self-host Flux or SDXL
  • Voice (TTS/STT) unless you want to run Whisper + a TTS model yourself

Use local for everything else – drafting, coding assistance, tool use in agents, classification, summarisation, embeddings.

Structure your agent so switching the underlying model is a config change, not a code change. Use the OpenAI-compatible API format everywhere so you can swap between Ollama locally and a frontier API without touching application logic.
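What that looks like in practice: keep every endpoint detail in the environment and hand the rest of the application a single config object. A sketch – the env var names are my own convention, not a standard:

```python
import os

# All endpoint details live in the environment, so swapping Ollama for a
# frontier API is a config change, not a code change. Ollama serves an
# OpenAI-compatible API under /v1 and ignores the API key.
def inference_config() -> dict:
    return {
        "base_url": os.environ.get("LLM_BASE_URL", "http://localhost:11434/v1"),
        "api_key": os.environ.get("LLM_API_KEY", "ollama"),
        "model": os.environ.get("LLM_MODEL", "qwen3.5:35b-a3b"),
    }

# With the openai package this plugs straight in:
#   client = OpenAI(base_url=cfg["base_url"], api_key=cfg["api_key"])
```

Point LLM_BASE_URL at a frontier provider and set a real key, and nothing else in the agent changes.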


The Agent Layer

Inference is the brain. The agent layer is what makes it useful.

Runtime Options

OpenClaw is a full-featured personal agent runtime. Multi-channel (Telegram, Discord, SMS), skill ecosystem, persistent memory, tool use, scheduling. It’s designed to run as a daemon – on a VPS, home server, or Pi – and be your persistent AI interface across whatever channels you use. The skill system means you can extend it without touching the core. If you want something that works out of the box and covers the full range of what a personal AI agent should do, this is the current best option.

NanoClaw is the minimal version – around 4,000 lines of readable code. It does the core loop: receive message, run inference, use tools, respond. No magic, no abstraction layers. If you want to understand exactly what your agent is doing, or you’re building something custom and want a starting point you can actually read in a sitting, NanoClaw is the right choice. It’s also the right choice for constrained hardware where you want predictable resource usage.

The tradeoff is real: OpenClaw gives you more features with less work; NanoClaw gives you more understanding with more work. Neither is wrong. I’d start with OpenClaw and drop to NanoClaw if you find yourself fighting its abstractions.

Where to Run It

Raspberry Pi 5 (8GB): Legitimate agent node. The Pi 5 with 8GB RAM runs OpenClaw or NanoClaw comfortably. It won’t run local inference worth anything (no GPU), but it can route to your inference server or a remote API. Use it as the always-on agent node – handles channels, scheduling, tool use – while your inference box sits elsewhere. At current prices you can put a Pi 5 in a small colocation rack for very little ongoing cost. The recent Pi stock surge reflects exactly this use case being discovered at scale.

Hetzner CAX11 (~3.29 EUR/month): ARM VPS, 2 vCPUs, 4GB RAM. Runs an agent runtime without breaking a sweat. No GPU, so same pattern as the Pi – route inference elsewhere. This is my recommendation for people who don’t want to think about home networking, port forwarding, or power outages. Hetzner’s reliability is good and the price is hard to argue with.

Home server: If you already have one, obvious choice. Put the GPU in it, run inference and the agent runtime on the same box, avoid network latency between them.

Colo: If you have access to colocation, this is the premium option. Your hardware, your control, better connectivity than home, no cloud dependency. The cost is only justifiable if you’re already paying for it or sharing it with others.


Memory and Persistence

Don’t overthink this.

SQLite for Most Things

SQLite handles more than people give it credit for. Agent conversation history, skill state, tool outputs, user preferences – all of this works fine in SQLite. It’s a single file, it’s fast for single-writer workloads, it supports WAL mode for concurrent reads, and it’s trivial to back up (rsync a file). OpenClaw uses SQLite by default. NanoClaw uses SQLite. There’s a reason for this.

Use SQLite until you have a concrete reason not to.
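For illustration, enabling WAL and persisting conversation turns takes a handful of lines. The schema here is a sketch, not what OpenClaw or NanoClaw actually use:

```python
import sqlite3

# WAL mode lets readers run concurrently with the single writer, which is
# exactly the shape of an agent appending conversation history.
conn = sqlite3.connect("agent.db")
conn.execute("PRAGMA journal_mode=WAL")
conn.execute("""CREATE TABLE IF NOT EXISTS messages (
    id INTEGER PRIMARY KEY,
    role TEXT NOT NULL,
    content TEXT NOT NULL,
    created_at TEXT DEFAULT CURRENT_TIMESTAMP
)""")
conn.execute("INSERT INTO messages (role, content) VALUES (?, ?)",
             ("user", "hello"))
conn.commit()
```

Backing this up is an rsync of agent.db (plus its WAL file); there is no service to coordinate with.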

PostgreSQL When You Need It

Listmonk (self-hosted newsletter, covered below) requires Postgres. pgvector (vector search extension) runs on Postgres. If you need either of those, you’re running Postgres anyway. Run it in Docker, use a named volume for the data, done.

Don’t run Postgres just because it feels more “production.” SQLite is production. At the scale a personal AI stack operates, write speed and concurrency are not your bottleneck.

Vector Storage

For semantic search and embeddings:

pgvector is the pragmatic choice if you’re already running Postgres. One extension, no additional service, integrates with anything that speaks SQL.

Chroma is the dedicated option – easier to get started with if you’re not already on Postgres, has a clean Python API, runs as a server or embedded. Docker image is small.

File-based for small scale: if you have fewer than 100k vectors, a flat file of numpy arrays with a cosine similarity search loop is fine. Really. Don’t add infrastructure until the simple thing stops working.
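To make the point that no vector database is involved, here is that loop in pure Python (swap the lists for numpy arrays when speed starts to matter):

```python
import math

# Brute-force nearest-neighbour over a small embedding store. Under
# ~100k vectors this is fast enough; reach for numpy or pgvector only
# when it isn't. Vectors are plain lists for illustration.
def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def top_k(query, store, k=3):
    # store: list of (doc_id, vector) pairs
    scored = sorted(store, key=lambda item: cosine(query, item[1]), reverse=True)
    return [doc_id for doc_id, _ in scored[:k]]
```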

Backup Strategy

rsync to an offsite target on a cron job. That’s it.

# /etc/cron.d/ai-stack-backup
0 3 * * * root rsync -az --delete /opt/ai-stack/ user@backup-host:/backups/ai-stack/

Back up: SQLite files, Postgres dumps (pg_dump), your config files, your agent skill directories, your Hugo content directory. Exclude model weights – those are re-downloadable. If you need point-in-time recovery, enable SQLite WAL and snapshot the WAL file. For Postgres, pg_basebackup or WAL archiving if you’re serious about it. For a personal stack, daily dumps are probably fine.
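One caveat worth a sketch: rsync-ing a SQLite file while the agent is mid-write can capture a torn copy. The sqlite3 online backup API takes a consistent snapshot instead:

```python
import sqlite3

# Consistent snapshot of a live SQLite database; safe to run while the
# agent is writing, unlike a raw file copy.
def backup_sqlite(src_path: str, dest_path: str) -> None:
    src = sqlite3.connect(src_path)
    dest = sqlite3.connect(dest_path)
    with dest:
        src.backup(dest)
    dest.close()
    src.close()
```

Run it from the same cron job just before the rsync, then sync the snapshot rather than the live file.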

Test your restores. Untested backups are not backups.


Publishing: The Blog Stack

This is the meta-layer – how you publish what you build and learn.

Hugo

Static site generator, markdown source, git-native. The build output is a folder of HTML and static assets. You can host it on any CDN, object storage, or basic web server. There’s no database to manage, no runtime to secure, no CMS to update.

The content workflow: write markdown in your editor, commit to git, push, deploy. With a post-receive hook or a simple CI pipeline (Woodpecker CI, Gitea Actions, even a bare shell script), the site rebuilds on push automatically.

Hugo’s build speed is its defining characteristic. A site with thousands of pages builds in under a second. This matters when you’re iterating on templates or running automated content pipelines.

Listmonk

Self-hosted newsletter. Single Go binary, Postgres backend, clean web UI. Handles subscriber management, campaigns, transactional emails, list segmentation. It’s what this newsletter runs on.

docker-compose.yml with a Postgres service and a Listmonk service. Map port 9000, put Nginx or Caddy in front, done. The Docker image is small, the resource usage is low, and it does everything a newsletter needs without subscription fees.
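A minimal sketch of that compose file. Treat the Listmonk env var names as a pattern to verify against the Listmonk docs (they follow a LISTMONK_section__key convention), not gospel:

```yaml
services:
  db:
    image: postgres:16-alpine
    environment:
      POSTGRES_USER: listmonk
      POSTGRES_PASSWORD: listmonk   # replace; load from an env file in practice
      POSTGRES_DB: listmonk
    volumes:
      - listmonk-data:/var/lib/postgresql/data

  app:
    image: listmonk/listmonk:latest
    ports:
      - "127.0.0.1:9000:9000"   # bind to localhost; Nginx/Caddy terminates TLS
    depends_on:
      - db
    environment:
      LISTMONK_db__host: db
      LISTMONK_db__user: listmonk
      LISTMONK_db__password: listmonk
      LISTMONK_db__database: listmonk

volumes:
  listmonk-data:
```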

Email Delivery: SES

Amazon SES for SMTP relay. ~$0.10 per 1,000 emails. Verify your domain, set up DKIM and SPF, configure Listmonk to use SES as the SMTP backend. For a newsletter under 100k subscribers, this is the cheapest reliable option by a significant margin.

The only alternatives worth considering are Mailgun or Postmark, if you want better deliverability analytics out of the box. SES is cheapest; the others have better tooling.

DNS and TLS

Cloudflare for DNS. Free, fast, good API for automation, works with Certbot for DNS-01 challenges. Set your nameservers to Cloudflare, manage records there.

Certbot with Let’s Encrypt for TLS. Standard. Use the DNS-01 challenge via the Cloudflare plugin if you’re running services that aren’t publicly accessible on port 80:

certbot certonly \
  --dns-cloudflare \
  --dns-cloudflare-credentials ~/.secrets/cloudflare.ini \
  -d yourdomain.com \
  -d '*.yourdomain.com'

Wildcard cert covers everything. Renews automatically. Set up a cron job or systemd timer for renewal and don’t think about it again.

Deployment

For Hugo, a bare git repository on your server with a post-receive hook:

#!/bin/bash
# /opt/site.git/hooks/post-receive
GIT_WORK_TREE=/var/www/site git checkout -f
cd /var/www/site && hugo --minify  # output lands in /var/www/site/public; point your web root there

Push to the remote, the hook fires, Hugo builds, the new site is live. No CI server required.

For Listmonk and other services: Docker Compose files in a git repo. git pull && docker-compose up -d is a deploy. Simple enough to do by hand, simple enough to automate.


Monitoring and Security

You don’t need a full observability stack. You need to know when things break.

What to Actually Monitor

Inference latency: If your model server starts returning responses slowly, you want to know before your agent times out. llama.cpp’s llama-server can expose Prometheus metrics you can scrape; for Ollama, put a small probe in front. Or just write a cron job that hits the inference endpoint every 5 minutes and logs the response time. Whatever is proportionate to how much you care.
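The cron-probe version is a few lines. The URL in the comment is Ollama’s default tag-listing endpoint, used here only as a cheap liveness target:

```python
import time
import urllib.request

def probe(url: str, timeout: float = 10.0):
    """Return (reachable, seconds) for a single HTTP GET against the endpoint."""
    start = time.monotonic()
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            ok = resp.status == 200
    except OSError:
        # Connection refused, DNS failure, timeout: all count as down.
        ok = False
    return ok, time.monotonic() - start

# From cron: log probe("http://localhost:11434/api/tags") every 5 minutes
# and alert when latency drifts or reachability drops.
```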

Memory and disk: Local inference eats RAM. Disk fills up with model weights, logs, and SQLite files. Set alerts at 85% for both. Grafana + Prometheus is one option; a simple shell script that sends you a Telegram message when disk hits a threshold is another. The latter takes 10 minutes to write.
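The 10-minute version of that script, using the standard Telegram Bot API sendMessage endpoint (the bot token and chat id env var names are mine, for illustration):

```python
import os
import shutil
import urllib.parse
import urllib.request

THRESHOLD = 0.85  # alert past 85% used

def disk_fraction(path: str = "/") -> float:
    # Fraction of the filesystem at `path` currently in use.
    usage = shutil.disk_usage(path)
    return usage.used / usage.total

def send_telegram(text: str) -> None:
    # Standard Bot API call; credentials come from the environment.
    token = os.environ["TELEGRAM_BOT_TOKEN"]
    chat_id = os.environ["TELEGRAM_CHAT_ID"]
    url = f"https://api.telegram.org/bot{token}/sendMessage"
    data = urllib.parse.urlencode({"chat_id": chat_id, "text": text}).encode()
    urllib.request.urlopen(url, data=data, timeout=10)

def check() -> None:
    frac = disk_fraction()
    if frac > THRESHOLD:
        send_telegram(f"Disk at {frac:.0%} on the AI stack host")
```

Call check() from a daily cron entry and forget about it.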

Failed agent runs: Your agent should log errors. Parse the logs. Alert on repeated failures. If your agent is silently failing to complete tasks, you want to know.

Service health: Simple HTTP health checks. A cron job that curls your services every minute and sends an alert if it gets a non-200 response is enough for a personal stack.

Security Basics

UFW. Default deny inbound. Open only what you need:

ufw default deny incoming
ufw default allow outgoing
ufw allow ssh
ufw allow 80/tcp
ufw allow 443/tcp
ufw enable

If your inference server is on the same host as the agent runtime, bind it to localhost only. Don’t expose Ollama to the internet.

Fail2ban. Install it, configure it for SSH (the default config is fine), let it run. It will ban IPs that fail authentication repeatedly. Not a substitute for key-only auth, but an additional layer.

SSH keys only:

# /etc/ssh/sshd_config
PasswordAuthentication no
PubkeyAuthentication yes
PermitRootLogin no

No exceptions. Password auth over SSH is not acceptable in 2026.

Secrets management. This one catches people. Your AGENTS.md, your skill configs, your environment files – these should never contain API keys, database passwords, or anything sensitive. Use environment variables loaded from a file that is not in your git repository. Use .gitignore aggressively. Audit your repos periodically with git log -p | grep -i "api_key\|password\|secret". Better: use a .env file pattern with dotenv, or use a secrets manager if you’re running on a VPS that has one.
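If you don’t want the python-dotenv dependency, the pattern is small enough to write yourself. A sketch, not a full parser – no quoting or variable interpolation:

```python
import os

def load_env(path: str = ".env") -> dict:
    """Load KEY=VALUE lines into os.environ without overriding existing vars."""
    loaded = {}
    with open(path) as fh:
        for line in fh:
            line = line.strip()
            # Skip blanks, comments, and anything that isn't KEY=VALUE.
            if not line or line.startswith("#") or "=" not in line:
                continue
            key, _, value = line.partition("=")
            key, value = key.strip(), value.strip()
            loaded[key] = value
            os.environ.setdefault(key, value)
    return loaded
```

The .env file itself goes in .gitignore; only a .env.example with placeholder values gets committed.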

Keep model weights, SQLite databases, and any files containing personal data out of git entirely. They don’t belong there.


The Honest Tradeoffs

Self-hosting is the right call for a lot of people. It’s not the right call for everyone. Here’s what you’re actually signing up for.

Local inference is slower than frontier APIs for hard tasks. Qwen3.5-35B-A3B at 25-35 tokens/sec (on a single 4090) is slower than GPT-5-mini via API. For interactive use, that’s fine – you won’t notice the difference. For tasks where you need Claude Opus 4.6 level capability, your local model will underperform. That gap is closing – Taalas at 17k tokens/sec shows where hardware is going – but it exists today.

You own the ops burden. Services need updating. Disks fill up. Models get superseded and you need to pull new weights. A VPS goes down during a kernel update. Your GPU driver breaks after a system update. None of this is hard, but it requires attention. If you want zero ops, use APIs.

Updates require care. A new version of Ollama or OpenClaw might change an API or config format. Postgres major version upgrades require migration. Hugo themes drift. Staying current is a job, even if it’s a small one.

But. Costs collapse at scale. Your data stays on your hardware. No surprise account restrictions, no rate limits during a critical workflow, no terms-of-service change that makes your use case newly prohibited. The Google OAuth crackdown and OpenAI’s surveillance architecture are not edge cases – they’re predictable consequences of depending on services you don’t control. Every month you run your own stack, the counter-argument to “just use the API” gets weaker.

The configuration that makes sense for most competent engineers right now: a mid-range GPU (RTX 3090 or 4090) for local inference running Ollama with Qwen3.5-35B-A3B, a Pi 5 or cheap VPS running an agent runtime, SQLite for state, Cloudflare for DNS, Certbot for TLS, and a fallback API key for frontier tasks. The whole thing can be running in a weekend.

Start there. Tune it to your needs from a working baseline.


Sources

  • Alibaba Qwen3.5-35B-A3B release notes and benchmark results (February 2026) – huggingface.co/Qwen
  • ggml.ai acquisition by HuggingFace announcement (2025/2026) – huggingface.co
  • Taalas inference hardware benchmarks – taalas.ai
  • Raspberry Pi 5 (8GB) product page and stock availability – raspberrypi.com
  • OpenClaw documentation – openclaw.ai
  • Listmonk documentation – listmonk.app
  • Amazon SES pricing – aws.amazon.com/ses/pricing
  • Hetzner Cloud pricing – hetzner.com/cloud
  • llama.cpp project – github.com/ggerganov/llama.cpp
  • Ollama documentation – ollama.ai

Published 2 March 2026. This post will be updated as the stack evolves. Versioned snapshots are linked in the changelog above.

