Self-Hosting Your AI Stack: A Practical Guide
Commissioned, Curated and Published by Russ. Researched and written with AI.
Disclaimer: This reflects my own setup and opinions as of 24 March 2026. Hardware prices, model capabilities, and software maturity change fast. Verify specs before you buy anything.
This is a versioned snapshot. The living version is at /ai-self-hosting/.
What’s New This Week (24 March 2026)
LiteLLM versions 1.82.7 and 1.82.8, published to PyPI today (24 March 2026), were confirmed compromised in a supply-chain attack attributed to the threat actor TeamPCP, likely via a poisoned Trivy CI/CD workflow. The malicious packages contain a credential-stealing .pth file (litellm_init.pth) that harvests SSH keys, cloud credentials, Kubernetes configs, crypto wallets, and API keys, encrypts them, and exfiltrates them to an attacker-controlled domain. LiteLLM is one of the most widely used Python libraries for routing API calls across LLM providers, and a common OpenAI-compatible proxy layer in self-hosted AI stacks. Users of the proxy Docker image were not affected, as that build pins its requirements separately. Coverage: Hacker News (182 points), The Hacker News (thehackernews.com), XDA Developers, futuresearch.ai; confirmed by Endor Labs and JFrog.

A second story reinforces the Apple Silicon inference thread: Hypura (Show HN, 148 points) is a new storage-tier-aware LLM inference scheduler for Apple Silicon that profiles the hardware and places model tensors across GPU, RAM, and NVMe tiers automatically, letting models too large for physical RAM run without crashing. It achieves 2.2 tok/s on Mixtral 8x7B (31GB) and 0.3 tok/s on Llama 70B (40GB) on a 32GB M1 Max that would otherwise be OOM-killed under naive mmap.
Changelog
| Date | Summary |
|---|---|
| 24 Mar 2026 | LiteLLM 1.82.7/1.82.8 on PyPI compromised with credential-stealing payload (182 HN points); supply-chain risk hits AI middleware directly; Hypura adds storage-tier-aware Apple Silicon inference scheduler for models exceeding RAM. |
| 23 Mar 2026 | iPhone 17 Pro demonstrated running 400B MoE via SSD streaming at 0.6 tok/s on A18 Pro, extending the Flash-MoE SSD streaming story from MacBook Pro to mobile Apple Silicon. |
| 22 Mar 2026 | Flash-MoE runs Qwen3.5-397B on a MacBook Pro M3 Max 48GB via SSD streaming at 4.4 tok/s; Apple Silicon inference ceiling updated. |
| 21 Mar 2026 | Quiet day, thesis holds. |
| 20 Mar 2026 | Supermicro co-founder charged in $2.5B smuggling plot adds supply-chain risk signal; MacBook M5 Pro running Qwen3.5 extends the Apple Silicon inference case. |
| 19 Mar 2026 | KittenTTS (<25MB) extends local voice inference beyond Apple Silicon to any hardware. |
| 18 Mar 2026 | Quiet day, thesis holds. |
| 17 Mar 2026 | Quiet day, thesis holds. |
| 16 Mar 2026 | Quiet day, thesis holds. |
| 15 Mar 2026 | Quiet day, thesis holds. |
| 14 Mar 2026 | Quiet day, thesis holds. |
| 13 Mar 2026 | Qatar helium shutdown adds near-term GPU supply risk to the hardware calculus; CanIRun.ai (463 HN points) is a useful model-compatibility checker for self-hosters. |
| 12 Mar 2026 | RTX 5090 (32GB VRAM) adds a single-card path to Qwen3.5-35B-A3B; hardware section updated. |
| 11 Mar 2026 | Microsoft BitNet shows 100B 1-bit models running at readable speed on a single CPU, adding nuance to the CPU inference section. |
| 10 Mar 2026 | Quiet day, thesis holds. |
| 9 Mar 2026 | US Appeals Court rules TOS updates by email bind users via implied consent, strengthening the vendor lock-in argument for self-hosting. |
| 8 Mar 2026 | Quiet day, thesis holds. |
| 7 Mar 2026 | Quiet day, thesis holds. |
| 6 Mar 2026 | A quieter day – nothing today that shifts the thesis. |
| 5 Mar 2026 | PersonaPlex 7B enables local full-duplex voice on Apple Silicon; AMD Ryzen AI 400 desktop (50 TOPS NPU, AM5). |
| 4 Mar 2026 | Apple MacBook Neo ($599, A18 Pro, ships Mar 11). |
| 2 Mar 2026 | Initial publication |
Why Self-Host in 2026
The arguments for self-hosting AI shifted significantly this year. It’s no longer just ideology – the economics and reliability case is now concrete.
Cost. Qwen3.5-35B-A3B runs on 32GB VRAM (a single consumer GPU, or two older cards via NVLink) and benchmarks ahead of GPT-5-mini on most coding and reasoning tasks. The model is free. The GPU, amortised over three years, costs less than two months of a mid-tier API plan if you’re running any serious workload. At scale the numbers are obvious. Even for a single developer, the crossover point hits faster than most people expect.
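The crossover claim is easy to sanity-check. A back-of-envelope amortisation, using the used RTX 3090 price quoted in the hardware section below (~$600-800) and an assumed $100/month mid-tier API plan (my placeholder figure, not any vendor's actual price):

```python
# Hardware amortisation vs. an API subscription. The $700 GPU price is
# from this post's hardware section; the $100/month API plan is an
# assumed figure for illustration.

def monthly_cost(hardware_price: float, months: int = 36) -> float:
    """Hardware cost spread over an amortisation window, in $/month."""
    return hardware_price / months

gpu_per_month = monthly_cost(700.0)   # ~$19/month over three years
api_plan = 100.0                      # assumed mid-tier plan
crossover_months = 700.0 / api_plan   # hardware pays for itself in ~7 months
```

Even if you halve the API plan price, the GPU pays for itself inside the first year of sustained use.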
No rate limits. API rate limits are a tax on momentum. You hit a good flow, your agent starts doing real work, and then you’re throttled. Running local means the limit is your hardware – which you can reason about and upgrade.
Data sovereignty. When your agent runs against an API, every prompt, every document, every query you send is processed on infrastructure you don’t control. Most terms of service reserve the right to use that data for model improvement. “Opt-out” settings exist, but they depend on the vendor honoring them, and you have no audit trail.
Vendor lock-in is real. The Google OAuth crackdown in early 2026 demonstrated this clearly. Accounts restricted without warning, with appeals processes that take weeks and often go nowhere. Developers who had built auth flows, data pipelines, or automation on Google infrastructure had no recourse. If your AI stack depends on an API key from a company that can revoke your access unilaterally, you have a single point of failure you don’t control.
A March 2026 US Court of Appeals ruling (9th Circuit) adds a further dimension: the court found that Terms of Service may be updated by email and that continued use of a service implies consent to those changes. For developers running agent workloads against cloud AI APIs, this means the terms governing your data, your usage rights, and your recourse options can shift without any explicit agreement on your part. Self-hosting removes this exposure entirely.
The surveillance architecture problem. The OpenAI Persona system – and similar approaches from other frontier labs – is designed to build persistent profiles across users and sessions. This is not speculation. It’s a business model. If you’re running an agent that handles personal data, business logic, or anything sensitive, routing it through a commercial API means feeding that data into a surveillance architecture optimised for product telemetry. Self-hosting removes that layer entirely.
None of this means you never use an API. There are still tasks where frontier models are genuinely better. But the default should be local, with APIs reserved for specific cases.
The Inference Layer
This is the foundation. Get it wrong and nothing else matters.
Model Selection: The Current Sweet Spot
Qwen3.5-35B-A3B is the model I’d recommend for most self-hosters right now. Released by Alibaba in February 2026, it’s a 35B parameter Mixture-of-Experts model with a 1 million token context window. It fits in 32GB VRAM quantised to Q4_K_M, which puts it within reach of a single RTX 3090, RTX 4090, or equivalent AMD card. On standard benchmarks it beats GPT-5-mini across coding, reasoning, and instruction following. This is the crossover point people were waiting for – a model that’s genuinely competitive with commercial offerings, running entirely on your own hardware.
One caveat worth flagging: the core Qwen team has fractured. Junyang Lin – the tech lead who built Qwen – stepped down this week. Two other key colleagues departed in the same window. Alibaba has formed a task force to continue development. The models exist, perform well, and the hardware requirements are unchanged. But for engineers making long-term infrastructure bets on Qwen, watching the task force’s first major release will be informative. The Mistral, Gemma, and Phi families deserve a place in your local evaluation pipeline as a hedge.
For lighter tasks, smaller models (Qwen3.5-7B, Phi-4, Gemma-3-12B) are fast and cheap to run. They’re good for agent tool-use, classification, summarisation, and anything that doesn’t need deep reasoning. Keep a small model running as your default and only route hard tasks to the 35B.
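The small-by-default routing policy is a few lines of code. A minimal sketch, with model tags and task labels that are my own illustrative convention (adjust to whatever you actually have pulled):

```python
# Routing sketch: default to the small model, escalate only task types
# that need deep reasoning to the 35B. Names are illustrative
# Ollama-style tags, not a fixed API.

SMALL_MODEL = "qwen3.5:7b"        # fast default for tool use, classification
LARGE_MODEL = "qwen3.5:35b-a3b"   # reasoning-heavy tasks only

HARD_TASKS = {"multi_step_reasoning", "complex_coding", "long_context_analysis"}

def pick_model(task_type: str) -> str:
    """Return the model tag to use for a given task type."""
    return LARGE_MODEL if task_type in HARD_TASKS else SMALL_MODEL
```

The point is that the routing decision lives in one place, so promoting a task type to the large model is a one-line change.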
Inference Servers
Ollama is the easiest entry point. Single binary, pulls models from its registry, exposes an OpenAI-compatible API on localhost:11434. For most people this is the right choice. The overhead is minimal, the model management is clean, and it handles quantisation automatically. If you already know Docker, there’s an official image. ollama run qwen3.5:35b-a3b and you’re running.
The limitations: Ollama doesn’t give you much control over quantisation format, batching strategy, or GPU memory allocation. For a single-user setup it doesn’t matter. For anything with concurrent requests, you’ll want more control.
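Because Ollama speaks the standard chat-completions format, calling it needs nothing beyond the standard library. A minimal sketch against the default localhost:11434 endpoint (the same payload shape works against llama.cpp's llama-server or a hosted API by changing the base URL):

```python
import json
import urllib.request

# Minimal chat call against Ollama's OpenAI-compatible endpoint.
# BASE_URL is Ollama's default; swap it for llama-server or a hosted
# API without touching the payload.

BASE_URL = "http://localhost:11434/v1"

def build_request(prompt: str, model: str = "qwen3.5:35b-a3b") -> urllib.request.Request:
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }
    return urllib.request.Request(
        f"{BASE_URL}/chat/completions",
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )

def chat(prompt: str) -> str:
    with urllib.request.urlopen(build_request(prompt), timeout=120) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]
```

No client library, no SDK version churn, and nothing to pin except Python itself.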
llama.cpp (now under the HuggingFace umbrella after ggml.ai’s acquisition) is the lower-level option. You compile it yourself, choose your quantisation explicitly, and get full control over inference parameters. The llama-server binary exposes an OpenAI-compatible HTTP API. It’s more work to set up but it’s the right choice if you’re running multiple models, need precise VRAM management, or want to squeeze performance.
Worth knowing: ggml.ai joining HuggingFace is a consolidation signal. The local inference toolchain is maturing. Expect llama.cpp to become more integrated with HuggingFace Hub over the next year.
vLLM is the production option – continuous batching, PagedAttention, proper throughput for concurrent users. Overkill for personal use, but if you’re running inference for a team or building something others will hit, look at it.
Hardware
The GPU question is the most common one. Here’s the honest breakdown:
RTX 5090 (32GB): The first consumer single GPU with enough VRAM to run Qwen3.5-35B-A3B at Q4_K_M without NVLink. Faster than the 4090 and now the cleanest single-card path to the 35B model. Pricing ~$1999-2200 new. If you are buying new hardware and want one card, this is the current answer.
RTX 4090 (24GB): Fits most 7-13B models comfortably at high quality. Qwen3.5-35B-A3B needs two of these or aggressive quantisation. Fast. Expensive (~$1600-1800 used).
RTX 3090 (24GB): Same VRAM as 4090, meaningfully slower, but much cheaper used (~$600-800). Excellent value if you’re patient.
Two RTX 3090s or 4090s: 48GB combined via NVLink or tensor parallelism in llama.cpp. Gets you the 35B model at good quality. This is the sweet spot for power users right now.
AMD RX 7900 XTX (24GB): ROCm support is better than it was. Realistically Linux-only. Cheaper than NVIDIA equivalents. Worth considering if you’re comfortable with the driver situation.
CPU fallback: llama.cpp runs on CPU. A modern server CPU (Threadripper, EPYC) with fast RAM can do 3-8 tokens/sec on a 7B model. Usable for non-interactive tasks. Not usable for conversation. Don’t buy CPU hardware for inference – it’s just there if you already have it.
One caveat worth tracking: Microsoft’s BitNet b1.58 framework (250+ HN points, 11 March 2026) demonstrates a 100B parameter 1-bit model running at 5-7 tokens/sec on a single CPU – borderline conversational speed. The energy savings are significant too: 55-70% lower consumption on ARM, up to 82% on x86. The limitation is that BitNet models must be trained natively in 1-bit; you cannot quantise a standard model down to this format. None of the current recommended models (Qwen3.5, Phi-4, Gemma-3) are BitNet-native. But this is the direction CPU inference is heading, and the “not usable for conversation” ceiling may shift materially within the next model generation.
One supply-side risk to watch as of mid-March 2026: Qatar has shut down helium production, removing approximately 30% of global supply. Helium is used in semiconductor fabrication cooling and lithography – a sustained shortage could tighten GPU supply and push prices up from the ranges quoted above. Chipmakers including SK Hynix are already responding. If you are planning a GPU purchase, the next 4-6 weeks carry more price uncertainty than usual. Source: Tom’s Hardware, March 2026.
A second supply-chain signal arrived the same week: Supermicro, one of the primary vendors of GPU server hardware used for AI inference deployments, saw its stock drop 25% after its co-founder was charged in a $2.5 billion chip smuggling plot. The direct hardware impact is unclear at time of writing, but Supermicro is a major part of the server GPU supply chain. Combined with the Qatar helium situation, anyone pricing up a serious self-hosted inference build in late March 2026 is navigating unusual supply-side uncertainty. If your purchase can wait 4-6 weeks, it may be worth doing so.
Practical tool: CanIRun.ai lets you check which models your hardware can actually run, covering VRAM requirements and quantisation options for most major open-weight models. Worth bookmarking before you buy.
Apple Silicon: Surprisingly good. The M-series unified memory architecture means a MacBook Pro M4 Max (128GB) can run Qwen3.5-35B-A3B well. Ollama on macOS uses the Metal backend and achieves decent throughput. If you’re already on Apple Silicon and have the RAM, this is a legitimate inference setup. The Mac Mini M4 Pro at 64GB is the cheapest way into the 35B model. Apple Silicon is also now a viable voice inference platform: NVIDIA PersonaPlex 7B runs full-duplex speech-to-speech at 5.3 GB quantized, faster than real-time, via MLX – no Python, no server required.
The M5 Pro generation extends this further: a MacBook M5 Pro running Qwen3.5 as a local AI security system was demonstrated on HN this week (113 points), with reported performance meaningfully ahead of M4. If you are evaluating Apple Silicon as an inference platform, M5 Pro is now the reference data point to benchmark against.
A new proof-of-concept pushes this further still: Flash-MoE (Show HN, 236 points, 22 March 2026) is a pure C and Metal inference engine that streams Qwen3.5-397B-A17B from SSD on a MacBook Pro M3 Max with 48GB unified RAM, achieving 4.4 tokens/sec at 4-bit quantisation with full tool-calling support. Only the 4 active experts per transformer layer (~6.75MB each) are loaded per token; the OS page cache handles the rest of the 209GB model naturally. This is not a toy demo – it produces production-quality output including reliable JSON and tool use. The practical implication for self-hosters: Apple Silicon’s ceiling is no longer VRAM-limited in the traditional sense. With fast NVMe and the unified memory architecture, the 397B model is now accessible on 48GB hardware that costs less than a mid-range NVIDIA GPU build.
The same SSD streaming technique was demonstrated the following day on an iPhone 17 Pro (A18 Pro chip, 24GB unified memory) running the same 400B MoE model at 0.6 tokens per second. Coverage: wccftech.com, gizmochina.com; 325 HN points. At 0.6 tok/s it is not practical for interactive use, but it confirms the technique generalises across Apple Silicon form factors beyond the MacBook Pro and further supports the point that the inference ceiling is no longer strictly VRAM-limited on Apple hardware.
A new tool worth knowing: Hypura (Show HN, 148 HN points, 24 March 2026) is a storage-tier-aware LLM inference scheduler for Apple Silicon that profiles your hardware and places model tensors across GPU, RAM, and NVMe tiers based on access patterns and bandwidth costs. It makes models too large for physical RAM runnable without the OS swap-thrashing that kills vanilla llama.cpp: a 31GB Mixtral 8x7B runs at 2.2 tok/s on a 32GB M1 Max; a 40GB Llama 70B runs at 0.3 tok/s on the same hardware. The approach complements Flash-MoE’s SSD expert streaming but generalises beyond MoE architectures to dense models. It selects the optimal inference mode automatically. For self-hosters on constrained Apple Silicon, this expands the practical model ceiling further.
Taalas and similar purpose-built inference hardware are showing that 17k tokens/sec is achievable. That’s the direction the market is going. Consumer GPU hardware will start looking slow by comparison within 18 months. If you’re planning a major hardware investment, factor that in.
When to Still Use an API
Be honest with yourself. Local models are good. Frontier models are better at the frontier.
Use an API for:
- Multi-step reasoning tasks where you need Claude Opus 4.6 or Gemini 3.1 Pro level capability
- Tasks requiring very recent knowledge (local models are frozen at training cutoff)
- Image generation if you don’t want to self-host Flux or SDXL
- Voice (TTS/STT) – unless you’re on Apple Silicon, where PersonaPlex 7B makes fully local full-duplex voice practical at 5.3 GB and faster than real-time. A lighter option is now viable on any hardware: KittenTTS (March 2026) ships three models with the smallest under 25MB, making local TTS practical on Raspberry Pi, low-end VPS, and budget laptops with no GPU required.
Use local for everything else – drafting, coding assistance, tool use in agents, classification, summarisation, embeddings.
Structure your agent so switching the underlying model is a config change, not a code change. Use the OpenAI-compatible API format everywhere so you can swap between Ollama locally and a frontier API without touching application logic.
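One way to make that concrete: read the backend from the environment with local defaults. The variable names here are my own convention, not a standard:

```python
import os

# Config-driven model selection: swapping between local Ollama and a
# hosted API is an environment change, not a code change. The "ollama"
# placeholder key exists because the OpenAI format requires one; Ollama
# ignores it.

def load_llm_config() -> dict:
    return {
        "base_url": os.environ.get("LLM_BASE_URL", "http://localhost:11434/v1"),
        "api_key": os.environ.get("LLM_API_KEY", "ollama"),
        "model": os.environ.get("LLM_MODEL", "qwen3.5:35b-a3b"),
    }
```

Point your HTTP client (or any OpenAI-compatible SDK) at `base_url` and `model`, and switching to a frontier API is three environment variables in a `.env` file.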
The Agent Layer
Inference is the brain. The agent layer is what makes it useful.
Runtime Options
OpenClaw is a full-featured personal agent runtime. Multi-channel (Telegram, Discord, SMS), skill ecosystem, persistent memory, tool use, scheduling. It’s designed to run as a daemon – on a VPS, home server, or Pi – and be your persistent AI interface across whatever channels you use. The skill system means you can extend it without touching the core. If you want something that works out of the box and covers the full range of what a personal AI agent should do, this is the current best option.
NanoClaw is the minimal version – around 4,000 lines of readable code. It does the core loop: receive message, run inference, use tools, respond. No magic, no abstraction layers. If you want to understand exactly what your agent is doing, or you’re building something custom and want a starting point you can actually read in a sitting, NanoClaw is the right choice. It’s also the right choice for constrained hardware where you want predictable resource usage.
The tradeoff is real: OpenClaw gives you more features with less work; NanoClaw gives you more understanding with more work. Neither is wrong. I’d start with OpenClaw and drop to NanoClaw if you find yourself fighting its abstractions.
Where to Run It
Raspberry Pi 5 (8GB): Legitimate agent node. The Pi 5 with 8GB RAM runs OpenClaw or NanoClaw comfortably. It won’t run local inference worth anything (no GPU), but it can route to your inference server or a remote API. Use it as the always-on agent node – handles channels, scheduling, tool use – while your inference box sits elsewhere. At current prices you can put a Pi 5 in a small colocation rack for very little ongoing cost. The recent Pi stock surge reflects exactly this use case being discovered at scale.
Hetzner CAX11 (~3.29 EUR/month): ARM VPS, 2 vCPUs, 4GB RAM. Runs an agent runtime without breaking a sweat. No GPU, so same pattern as the Pi – route inference elsewhere. This is my recommendation for people who don’t want to think about home networking, port forwarding, or power outages. Hetzner’s reliability is good and the price is hard to argue with.
Home server: If you already have one, obvious choice. Put the GPU in it, run inference and the agent runtime on the same box, avoid network latency between them.
Colo: If you have access to colocation, this is the premium option. Your hardware, your control, better connectivity than home, no cloud dependency. The cost is only justifiable if you’re already paying for it or sharing it with others.
Memory and Persistence
Don’t overthink this.
SQLite for Most Things
SQLite handles more than people give it credit for. Agent conversation history, skill state, tool outputs, user preferences – all of this works fine in SQLite. It’s a single file, it’s fast for single-writer workloads, it supports WAL mode for concurrent reads, and it’s trivial to back up (rsync a file). OpenClaw uses SQLite by default. NanoClaw uses SQLite. There’s a reason for this.
Use SQLite until you have a concrete reason not to.
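Enabling WAL is one pragma at connection time. A sketch with an illustrative schema (the table layout is mine, not OpenClaw's or NanoClaw's):

```python
import sqlite3

# Open the agent's state database with WAL enabled so readers don't
# block the single writer. Schema is illustrative.

def open_state_db(path: str) -> sqlite3.Connection:
    conn = sqlite3.connect(path)
    conn.execute("PRAGMA journal_mode=WAL")    # concurrent readers + one writer
    conn.execute("PRAGMA synchronous=NORMAL")  # common durability/speed tradeoff in WAL mode
    conn.execute(
        "CREATE TABLE IF NOT EXISTS messages ("
        "id INTEGER PRIMARY KEY, role TEXT, content TEXT, "
        "ts DEFAULT CURRENT_TIMESTAMP)"
    )
    return conn
```

WAL persists in the database file, so this only needs to run once, but setting it on every connect is harmless and self-documenting.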
PostgreSQL When You Need It
Listmonk (self-hosted newsletter, covered below) requires Postgres. pgvector (vector search extension) runs on Postgres. If you need either of those, you’re running Postgres anyway. Run it in Docker, use a named volume for the data, done.
Don’t run Postgres just because it feels more “production.” SQLite is production. Write speed and concurrency at the scale a personal AI stack operates at are not your bottleneck.
Vector Storage
For semantic search and embeddings:
pgvector is the pragmatic choice if you’re already running Postgres. One extension, no additional service, integrates with anything that speaks SQL.
Chroma is the dedicated option – easier to get started with if you’re not already on Postgres, has a clean Python API, runs as a server or embedded. Docker image is small.
File-based for small scale: if you have fewer than 100k vectors, a flat file of numpy arrays with a cosine similarity search loop is fine. Really. Don’t add infrastructure until the simple thing stops working.
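The "simple thing" is short enough to show in full. A brute-force cosine top-k over an in-memory list, pure standard library (swap in numpy if profiling ever says you need to):

```python
import math

# Brute-force cosine similarity search. At <100k small vectors this is
# fast enough for a personal stack; the store is just (id, vector)
# pairs loaded from a flat file.

def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

def top_k(query, store, k=5):
    """store: iterable of (doc_id, embedding); returns best-first (doc_id, score)."""
    scored = [(doc_id, cosine(query, vec)) for doc_id, vec in store]
    return sorted(scored, key=lambda t: t[1], reverse=True)[:k]
```

When this loop becomes the bottleneck, that is your concrete reason to move to pgvector or Chroma, not before.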
Backup Strategy
rsync to an offsite target on a cron job. That’s it.
```shell
# /etc/cron.d/ai-stack-backup
0 3 * * * root rsync -az --delete /opt/ai-stack/ user@backup-host:/backups/ai-stack/
```
Back up: SQLite files, Postgres dumps (pg_dump), your config files, your agent skill directories, your Hugo content directory. Exclude model weights – those are re-downloadable. If you need point-in-time recovery, enable SQLite WAL and snapshot the WAL file. For Postgres, pg_basebackup or WAL archiving if you’re serious about it. For a personal stack, daily dumps are probably fine.
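One detail worth getting right: copying a live SQLite file mid-write can produce a corrupt backup. The online backup API gives you a consistent snapshot even with WAL active. A sketch you can call from the same cron job that does the rsync:

```python
import sqlite3

# Snapshot a live SQLite database via the online backup API, rather
# than copying the file while it may be mid-write. Safe with WAL.

def snapshot_sqlite(src_path: str, dest_path: str) -> None:
    src = sqlite3.connect(src_path)
    dest = sqlite3.connect(dest_path)
    with dest:
        src.backup(dest)  # consistent page-by-page copy
    dest.close()
    src.close()
```

Run this first, then rsync the snapshot rather than the live file.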
Test your restores. Untested backups are not backups.
Publishing: The Blog Stack
This is the meta-layer – how you publish what you build and learn.
Hugo
Static site generator, markdown source, git-native. The build output is a folder of HTML and static assets. You can host it on any CDN, object storage, or basic web server. There’s no database to manage, no runtime to secure, no CMS to update.
The content workflow: write markdown in your editor, commit to git, push, deploy. With a post-receive hook or a simple CI pipeline (Woodpecker CI, Gitea Actions, even a bare shell script), the site rebuilds on push automatically.
Hugo’s build speed is its defining characteristic. A site with thousands of pages builds in under a second. This matters when you’re iterating on templates or running automated content pipelines.
Listmonk
Self-hosted newsletter. Single Go binary, Postgres backend, clean web UI. Handles subscriber management, campaigns, transactional emails, list segmentation. It’s what this newsletter runs on.
docker-compose.yml with a Postgres service and a Listmonk service. Map port 9000, put Nginx or Caddy in front, done. The Docker image is small, the resource usage is low, and it does everything a newsletter needs without subscription fees.
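A sketch of what that compose file looks like. Treat the image tag, port mapping, and environment keys as assumptions to verify against the current Listmonk docs before use:

```yaml
# docker-compose.yml sketch for Listmonk + Postgres. Verify image tags
# and environment variable names against the current Listmonk docs.
services:
  db:
    image: postgres:16
    environment:
      POSTGRES_USER: listmonk
      POSTGRES_PASSWORD: listmonk   # change this
      POSTGRES_DB: listmonk
    volumes:
      - listmonk-db:/var/lib/postgresql/data
  app:
    image: listmonk/listmonk:latest
    ports:
      - "127.0.0.1:9000:9000"   # bind to loopback; Nginx/Caddy terminates TLS in front
    depends_on:
      - db
volumes:
  listmonk-db:
```

Binding the app port to 127.0.0.1 keeps the admin UI off the public internet even if your firewall rules drift.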
Email Delivery: SES
Amazon SES for SMTP relay. ~$0.10 per 1,000 emails. Verify your domain, set up DKIM and SPF, configure Listmonk to use SES as the SMTP backend. For a newsletter under 100k subscribers, this is the cheapest reliable option by a significant margin.
The only alternatives worth considering are Mailgun and Postmark, if you want better deliverability analytics out of the box. SES is cheapest; the others have better tooling.
DNS and TLS
Cloudflare for DNS. Free, fast, good API for automation, works with Certbot for DNS-01 challenges. Set your nameservers to Cloudflare, manage records there.
Certbot with Let’s Encrypt for TLS. Standard. Use the DNS-01 challenge via the Cloudflare plugin if you’re running services that aren’t publicly accessible on port 80:
```shell
certbot certonly \
  --dns-cloudflare \
  --dns-cloudflare-credentials ~/.secrets/cloudflare.ini \
  -d yourdomain.com \
  -d *.yourdomain.com
```
Wildcard cert covers everything. Renews automatically. Set up a cron job or systemd timer for renewal and don’t think about it again.
Deployment
For Hugo, a bare git repository on your server with a post-receive hook:
```shell
#!/bin/bash
# /opt/site.git/hooks/post-receive
GIT_WORK_TREE=/var/www/site git checkout -f
cd /var/www/site && hugo --minify
```
Push to the remote, the hook fires, Hugo builds, the new site is live. No CI server required.
For Listmonk and other services: Docker Compose files in a git repo. git pull && docker-compose up -d is a deploy. Simple enough to do by hand, simple enough to automate.
Monitoring and Security
You don’t need a full observability stack. You need to know when things break.
What to Actually Monitor
Inference latency: If your model server starts returning responses slowly, you want to know before your agent times out. Expose the /metrics endpoint from Ollama or llama.cpp and scrape it with Prometheus. Or just write a cron job that hits the inference endpoint every 5 minutes and logs the response time. Whatever is proportionate to how much you care.
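The cron-job version of the latency check is a few lines. A sketch: time one request against the endpoint and flag it if it blows a budget (`/api/tags` is Ollama's model-list endpoint, which makes a cheap probe; adjust the URL for llama.cpp):

```python
import time
import urllib.request

# Poor-man's latency probe: time one request, flag if over budget.
# Wire the boolean into whatever alerting you use.

def probe_latency(fetch, budget_s: float = 5.0):
    """fetch: zero-arg callable performing one request. Returns (elapsed, over_budget)."""
    start = time.monotonic()
    fetch()
    elapsed = time.monotonic() - start
    return elapsed, elapsed > budget_s

def fetch_ollama(url: str = "http://localhost:11434/api/tags") -> None:
    urllib.request.urlopen(url, timeout=10).read()
```

Run it from cron every five minutes, append `elapsed` to a log file, and you have a latency history for free.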
Memory and disk: Local inference eats RAM. Disk fills up with model weights, logs, and SQLite files. Set alerts at 85% for both. Grafana + Prometheus is one option; a simple shell script that sends you a Telegram message when disk hits a threshold is another. The latter takes 10 minutes to write.
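The 10-minute version of the disk alert, sketched in Python (the Telegram send is left out; it is one POST to the Bot API's sendMessage endpoint if you go that route):

```python
import shutil

# Disk threshold check for a cron job: measure percent used, decide
# whether to alert. Alert delivery (Telegram, email) is up to you.

def disk_pct_used(path: str = "/") -> float:
    usage = shutil.disk_usage(path)
    return 100.0 * usage.used / usage.total

def should_alert(pct_used: float, threshold: float = 85.0) -> bool:
    return pct_used >= threshold
```

Separating the measurement from the decision makes the threshold trivially testable, which matters more than it sounds for a script you will never look at again once it works.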
Failed agent runs: Your agent should log errors. Parse the logs. Alert on repeated failures. If your agent is silently failing to complete tasks, you want to know.
Service health: Simple HTTP health checks. A cron job that curls your services every minute and sends an alert if it gets a non-200 response is enough for a personal stack.
Security Basics
UFW. Default deny inbound. Open only what you need:
```shell
ufw default deny incoming
ufw default allow outgoing
ufw allow ssh
ufw allow 80/tcp
ufw allow 443/tcp
ufw enable
```
If your inference server is on the same host as the agent runtime, bind it to localhost only. Don’t expose Ollama to the internet.
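If Ollama runs under systemd, the binding is controlled by the OLLAMA_HOST environment variable (loopback is the default; the point of a drop-in is to make it explicit and survive config drift). A sketch, assuming the stock ollama.service unit name:

```ini
# /etc/systemd/system/ollama.service.d/override.conf
# Pin Ollama to loopback explicitly. This is the default when
# OLLAMA_HOST is unset, but stating it prevents accidental 0.0.0.0.
[Service]
Environment="OLLAMA_HOST=127.0.0.1:11434"
```

Apply with `systemctl daemon-reload && systemctl restart ollama`, then confirm with `ss -tlnp | grep 11434` that it is listening on 127.0.0.1 only.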
Fail2ban. Install it, configure it for SSH (the default config is fine), let it run. It will ban IPs that fail authentication repeatedly. Not a substitute for key-only auth, but an additional layer.
SSH keys only:
```shell
# /etc/ssh/sshd_config
PasswordAuthentication no
PubkeyAuthentication yes
PermitRootLogin no
No exceptions. Password auth over SSH is not acceptable in 2026.
Secrets management. This one catches people. Your AGENTS.md, your skill configs, your environment files – these should never contain API keys, database passwords, or anything sensitive. Use environment variables loaded from a file that is not in your git repository. Use .gitignore aggressively. Audit your repos periodically with git log -p | grep -i "api_key\|password\|secret". Better: use a .env file pattern with dotenv, or use a secrets manager if you’re running on a VPS that has one.
Keep model weights, SQLite databases, and any files containing personal data out of git entirely. They don’t belong there.
Python package supply chain. Today’s LiteLLM compromise (versions 1.82.7 and 1.82.8 on PyPI, 24 March 2026) is a concrete example of the risk. A credential-stealing payload was injected via a poisoned CI/CD dependency (Trivy), not a direct repo compromise, and harvested SSH keys, cloud credentials, API keys, and crypto wallets from any machine that installed those versions. If you use Python-based AI tooling, pin your dependencies explicitly (pip freeze > requirements.txt), use a lockfile, audit pip list against known-good versions before deploying to any machine with credentials, and watch the advisory feeds for packages in your stack. The Docker proxy image was unaffected because it pins requirements – a concrete reason to prefer containerised deployments for anything touching credentials.
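A pre-deploy check against known-bad versions takes a few lines with the standard library. A sketch: the advisory data here is just the LiteLLM example from this post; in practice you would extend the dict (or generate it from an advisory feed) as new compromises land:

```python
from importlib.metadata import PackageNotFoundError, version

# Fail-fast check: does this environment contain a known-compromised
# package version? Advisory data is the LiteLLM example from the text;
# extend as needed.

KNOWN_BAD = {
    "litellm": {"1.82.7", "1.82.8"},
}

def compromised_packages(known_bad=KNOWN_BAD):
    """Return [(package, version), ...] for installed known-bad versions."""
    hits = []
    for pkg, bad_versions in known_bad.items():
        try:
            v = version(pkg)
        except PackageNotFoundError:
            continue  # not installed, nothing to check
        if v in bad_versions:
            hits.append((pkg, v))
    return hits
```

Call it at the top of your deploy script and abort on a non-empty result; combined with a pinned requirements file, it closes most of the window this attack exploited.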
The Honest Tradeoffs
Self-hosting is the right call for a lot of people. It’s not the right call for everyone. Here’s what you’re actually signing up for.
Local inference is slower than frontier APIs for hard tasks. Qwen3.5-35B-A3B at 25-35 tokens/sec (on a single 4090) is slower than GPT-5-mini via API. For interactive use, that’s fine – you won’t notice the difference. For tasks where you need Claude Opus 4.6 level capability, your local model will underperform. That gap is closing – Taalas at 17k tokens/sec shows where hardware is going – but it exists today.
You own the ops burden. Services need updating. Disks fill up. Models get superseded and you need to pull new weights. A VPS goes down during a kernel update. Your GPU driver breaks after a system update. None of this is hard, but it requires attention. If you want zero ops, use APIs.
Updates require care. A new version of Ollama or OpenClaw might change an API or config format. Postgres major version upgrades require migration. Hugo themes drift. Staying current is a job, even if it’s a small one.
But. Costs collapse at scale. Your data stays on your hardware. No surprise account restrictions, no rate limits during a critical workflow, no terms-of-service change that makes your use case newly prohibited. The Google OAuth crackdown and OpenAI’s surveillance architecture are not edge cases – they’re predictable consequences of depending on services you don’t control. Every month you run your own stack, the counter-argument to “just use the API” gets weaker.
The configuration that makes sense for most competent engineers right now: a mid-range GPU (RTX 3090 or 4090) for local inference running Ollama with Qwen3.5-35B-A3B, a Pi 5 or cheap VPS running an agent runtime, SQLite for state, Cloudflare for DNS, Certbot for TLS, and a fallback API key for frontier tasks. The whole thing can be running in a weekend.
Start there. Tune it to your needs from a working baseline.
Sources
- Alibaba Qwen3.5-35B-A3B release notes and benchmark results (February 2026) – huggingface.co/Qwen
- ggml.ai acquisition by HuggingFace announcement (2025/2026) – huggingface.co
- Taalas inference hardware benchmarks – taalas.ai
- Raspberry Pi 5 (8GB) product page and stock availability – raspberrypi.com
- OpenClaw documentation – openclaw.ai
- Listmonk documentation – listmonk.app
- Amazon SES pricing – aws.amazon.com/ses/pricing
- Hetzner Cloud pricing – hetzner.com/cloud
- llama.cpp project – github.com/ggerganov/llama.cpp
- Ollama documentation – ollama.ai
- NVIDIA PersonaPlex 7B on Apple Silicon (February 2026) – blog.ivan.digital
- AMD Ryzen AI 400 Series AM5 announcement, MWC 2026
- US Court of Appeals (9th Circuit), No. 25-403, decided 3 March 2026 – TOS may be updated by email; continued use implies consent. cdn.ca9.uscourts.gov/datastore/memoranda/2026/03/03/25-403.pdf
- Microsoft BitNet – official inference framework for 1-bit LLMs, bitnet.cpp. github.com/microsoft/BitNet
- RTX 5090 (32GB GDDR7) self-hosted LLM benchmarks and hardware requirements, createaiagent.net, March 2026 – createaiagent.net/self-hosted-llm/
- Qatar helium shutdown puts chip supply chain on a two-week clock – SK Hynix forced to diversify after 30% of global supply removed from the market. Tom’s Hardware, March 2026. tomshardware.com/tech-industry/qatar-helium-shutdown-puts-chip-supply-chain-on-a-two-week-clock
- CanIRun.ai – Can your machine run AI models? Hardware compatibility checker for local LLM inference. canirun.ai
- KittenML/KittenTTS – Three new local TTS models, smallest under 25MB. github.com/KittenML/KittenTTS (Show HN, March 2026)
- Super Micro Shares Plunge 25% After Co-Founder Charged in $2.5B Smuggling Plot. Forbes, 20 March 2026. forbes.com/sites/tylerroush/2026/03/20/super-micro-shares-plunge-25-after-co-founder-charged-in-25-billion-ai-chip-smuggling-plot/
- MacBook M5 Pro and Qwen3.5 = Local AI Security System. SharpAI, March 2026. sharpai.org/benchmark/ (Show HN, 113 points)
- Flash-MoE – Pure C/Metal inference engine running Qwen3.5-397B-A17B on MacBook Pro M3 Max 48GB at 4.4 tok/s via SSD expert streaming. github.com/danveloper/flash-moe (Show HN, 236 points, 22 March 2026)
- iPhone 17 Pro demonstrated running 400B LLM on-device via SSD streaming (@anemll, Twitter/X, 23 March 2026). Coverage: wccftech.com/iphone-17-pro-successfully-runs-400b-llm-locally/ and gizmochina.com/2026/03/23/iphone-17-pro-runs-a-400b-ai-model-locally-which-needs-over-200gb-of-ram/
- TeamPCP Backdoors LiteLLM Versions 1.82.7-1.82.8 Likely via Trivy CI/CD Compromise. The Hacker News, 24 March 2026. thehackernews.com/2026/03/teampcp-backdoors-litellm-versions.html
- A popular Python library just became a backdoor to your entire machine. XDA Developers, 24 March 2026. xda-developers.com/popular-python-library-backdoor-machine/
- Hypura – Storage-tier-aware LLM inference scheduler for Apple Silicon. github.com/t8/hypura (Show HN, 148 points, 24 March 2026)
Published 2 March 2026. Versioned snapshot: 24 March 2026.