Commissioned, Curated and Published by Russ. Researched and written with AI.
What’s New This Week
Initial publication. Nvidia’s RTX 50-series (consumer Blackwell) is rolling out, but pricing has not stabilised – not yet recommended over the established 40-series. The RX 7900 XTX remains the VRAM-per-pound leader at the high end. Hold off on Blackwell unless you specifically need the latest architecture.
Changelog
| Date | Summary |
|---|---|
| 23 Mar 2026 | Initial publication. |
Cloud API costs compound. Rate limits are a real constraint when you’re running agent loops. And some workloads shouldn’t leave your infrastructure. If any of those reasons apply to you, a local inference machine makes sense – and in 2026 you can build something genuinely capable for £800.
This is not a gaming guide. The priorities are different: VRAM over clock speed, model compatibility over benchmark scores, sustained inference throughput over peak frame rates. If you want to run local AI agents, this is what you actually need.
The One Thing That Matters: VRAM
VRAM is the primary constraint. Everything else follows from it.
The rule of thumb: roughly 0.5-0.7GB of VRAM per billion parameters at standard quantisation (Q4/Q5); at full FP16 precision it’s 2GB per billion. A 13B model needs around 8-10GB. A 70B model needs around 40GB. These are floors, not ceilings – context window size and batch size push the number up.
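The arithmetic is simple enough to sanity-check yourself. A minimal Python sketch – the bits-per-weight and flat overhead figures are rough assumptions, not exact values:

```python
def estimate_vram_gb(params_billion: float, bits_per_weight: float = 4.5,
                     overhead_gb: float = 1.5) -> float:
    """Rough VRAM estimate for running an LLM at a given quantisation.

    bits_per_weight: ~4.5 for Q4, ~5.5 for Q5, 16 for FP16.
    overhead_gb: flat allowance for KV cache and runtime buffers --
    in reality this grows with context window and batch size.
    """
    weights_gb = params_billion * bits_per_weight / 8  # bits -> bytes
    return weights_gb + overhead_gb

# Sanity-check the thresholds below:
for size in (7, 13, 34, 70):
    print(f"{size}B at Q4: ~{estimate_vram_gb(size):.0f}GB")
# 7B ~5GB, 13B ~9GB, 34B ~21GB, 70B ~41GB -- matching the rule of thumb
```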
The practical thresholds in 2026:
- 8GB: 3-7B models comfortably. Fine for experimentation. Not recommended for agent work.
- 12GB: 7-8B models well, 13B in a squeeze. You’ll feel the limit quickly.
- 16GB: 13-14B models smoothly, 30B with aggressive quantisation. The practical entry point for agent work.
- 24GB: 30-34B comfortably, quantised 70B possible. Where serious inference lives.
- 48GB+: 70B cleanly, 120B with quantisation. Research or production pipeline territory.
Any GPU you buy in 2026 should have at least 12GB VRAM. Ideally 16GB. Models are not getting smaller.
£500 “The Enabler” – The APU Build
| Component | Choice | Approx. Cost |
|---|---|---|
| CPU | AMD Ryzen 5 8600G | £190 |
| Motherboard | B650 budget | £80 |
| RAM | 64GB DDR5 | £90 |
| Storage | 1TB NVMe SSD | £55 |
| Case + PSU | 550W 80+ Gold | £70 |
| Total | | ~£500 |
No discrete GPU. The AMD Ryzen 5 8600G has an integrated Radeon 760M that can allocate up to 8GB of shared VRAM from system RAM.
The 64GB RAM is not optional here – it feeds the iGPU with headroom and also enables CPU inference on larger models via llama.cpp. With CPU inference, a 13B Q4 model is usable. Slow (a few tokens per second), but usable. The iGPU gets you 7B at something approaching interactive speed.
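As a concrete example of what CPU inference looks like, here’s a minimal sketch using the llama-cpp-python bindings – the model path is a placeholder for whatever GGUF file you’ve downloaded:

```python
# pip install llama-cpp-python  (the default build is CPU-only)
from llama_cpp import Llama

llm = Llama(
    model_path="models/qwen2.5-7b-instruct-q4_k_m.gguf",  # placeholder path
    n_gpu_layers=0,   # CPU-only: no layers offloaded
    n_ctx=4096,       # context window; more context = more RAM
    n_threads=8,      # match your physical core count
)

out = llm.create_chat_completion(
    messages=[{"role": "user", "content": "Summarise what VRAM does."}],
    max_tokens=256,
)
print(out["choices"][0]["message"]["content"])
```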
What it runs: 7B-13B models via CPU inference, 7B reasonably fast via iGPU.
What it doesn’t: Fast inference on anything over 13B. Real agent loops.
Who it’s for: First local AI machine. Experimenting with models before committing more budget. Tight constraints.
Be honest with yourself about this tier: it’s for learning the stack, not running production agent work. If you’re planning to run a coding assistant or an agent loop all day, start at the next tier.
£800 “The Agent Rig” – The Recommendation
| Component | Choice | Approx. Cost |
|---|---|---|
| CPU | AMD Ryzen 7 7700X | £185 |
| GPU | RTX 4060 Ti 16GB | £300 |
| Motherboard | B650 mid-range | £100 |
| RAM | 32GB DDR5 | £60 |
| Storage | 2TB NVMe Gen4 | £80 |
| Case + PSU | 750W 80+ Gold | £95 |
| Total | | ~£800 |
This is the sweet spot for 2026. The RTX 4060 Ti 16GB is the specific recommendation: 16GB VRAM at the lowest price point of any 16GB discrete card, full CUDA support, 13-14B models at 20-30 tokens/sec. That’s fast enough for real agent work – coding assistants, document processing, automated pipelines running continuously.
One note on the GPU: the RTX 4060 Ti also exists in an 8GB variant at a similar price. Do not buy the 8GB version. The 16GB variant is the specific card.
The Intel i7-14700F at around £200 is a reasonable CPU alternative if you find a better deal. The GPU matters far more than the CPU for this workload.
What it runs: 13-14B models smoothly, 30B quantised, full agent loops, ComfyUI for image work.
What it doesn’t: 70B at useful speed.
Who it’s for: Engineers running daily agent work, local coding assistants, anyone who wants a capable inference machine without pushing into the premium tier.
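If you want to verify throughput claims like these on your own hardware, Ollama (covered below) reports token counts and timings in its API responses. A rough benchmark sketch – the model tag is illustrative:

```python
# pip install requests -- assumes Ollama is running locally and the model
# has already been pulled, e.g. `ollama pull qwen2.5:14b`.
import requests

resp = requests.post(
    "http://localhost:11434/api/generate",
    json={
        "model": "qwen2.5:14b",
        "prompt": "Write a 200-word summary of how GPUs run neural networks.",
        "stream": False,
    },
    timeout=300,
).json()

# The final response includes generation stats:
# eval_count = tokens generated, eval_duration = nanoseconds spent generating.
tps = resp["eval_count"] / (resp["eval_duration"] / 1e9)
print(f"{tps:.1f} tokens/sec")
```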
£1500 “The Serious Setup” – 34B and Beyond
| Component | Choice | Approx. Cost |
|---|---|---|
| CPU | AMD Ryzen 9 7950X | £380 |
| GPU | RX 7900 XTX 24GB | £625 |
| Motherboard | X670E | £180 |
| RAM | 64GB DDR5 | £110 |
| Storage | 2TB NVMe Gen4 + 2TB data | £140 |
| Case + PSU | 1000W 80+ Gold | £140 |
| Total | | ~£1500 |
The GPU choice at this tier is the interesting decision. The RX 7900 XTX at £625 gives you 24GB GDDR6 at roughly half the price of an RTX 4090. ROCm support with llama.cpp is now production-ready, which wasn’t confidently true a year ago.
The Ryzen 9 7950X earns its place here. Sixteen cores handle parallel inference requests, heavy compilation, and fine-tuning runs. When the GPU is waiting on CPU-side preprocessing, core count matters.
If you prefer the CUDA ecosystem: the RTX 4080 Super at around £750 is the alternative. You get 16GB instead of 24GB, but better Tensor cores, deeper framework support, and no ROCm dependency. The right choice depends on your tooling. If you’re doing pure inference with Ollama or llama.cpp, AMD gives you more VRAM for less money. If you’re fine-tuning, or your stack assumes CUDA, take the 4080 Super.
What it runs: 34B models cleanly, 70B quantised at ~15-20 tokens/sec, small model fine-tuning.
Who it’s for: Running multiple models simultaneously, production agent pipelines, teams sharing a local inference server, anyone who wants 34B to feel fast.
GPU Quick Reference
| Budget | Pick | VRAM | Notes |
|---|---|---|---|
| Under £250 | Intel Arc B580 | 12GB | Surprise pick. llama.cpp runs via its SYCL and Vulkan backends. Best VRAM:price at this tier. |
| £300-400 | RTX 4060 Ti 16GB | 16GB | Recommended. RTX 4070 12GB if CUDA perf matters more than VRAM. |
| £450-600 | RX 7900 GRE 16GB | 16GB | Solid AMD option. ROCm production-ready. |
| £550-650 | RTX 4070 Ti Super 16GB | 16GB | Good CUDA card, competitive at this range. |
| £650-750 | RX 7900 XTX 24GB | 24GB | Best VRAM:price at the high end for AI workloads specifically. |
| £1200+ | RTX 4090 24GB | 24GB | Fastest consumer card short of the 50-series. RX 7900 XTX closes the gap on LLM inference specifically. |
AMD vs Nvidia in 2026
The CUDA ecosystem remains deeper for AI tooling – fine-tuning, Triton kernels, some research code, and many commercial tools are CUDA-first. For pure inference with llama.cpp or Ollama, AMD ROCm is now production-ready and the gap is small. If you’re doing inference only, buy AMD for the VRAM. If you’re fine-tuning or your stack assumes CUDA, buy Nvidia and accept the cost per GB of VRAM.
What to Avoid
Any GPU with less than 12GB VRAM for new purchases in 2026. Models are not getting smaller.
The RTX 4060 Ti 8GB specifically. The 16GB version exists at a similar price. There is no reason to buy the 8GB variant.
Pre-built “AI PCs” from OEMs. They ship mediocre GPUs at premium prices with inadequate cooling for sustained inference load. The “AI PC” label is marketing.
Mining GPUs on the secondary market. VRAM runs hot under sustained load. Mining rigs run sustained load continuously. Degraded VRAM, no warranty, and a seller with every incentive not to mention any of it.
What to Run On It
Ollama is the easiest starting point. Install it, pull a model, start querying. It handles model management, serves an OpenAI-compatible API on localhost, and gets out of your way.
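Because the API is OpenAI-compatible, the standard openai Python client works against it unchanged – a minimal sketch, assuming you’ve already pulled the model named here:

```python
# pip install openai -- Ollama exposes an OpenAI-compatible endpoint at /v1.
from openai import OpenAI

# The api_key is unused by Ollama but the client requires a value.
client = OpenAI(base_url="http://localhost:11434/v1", api_key="ollama")

reply = client.chat.completions.create(
    model="qwen2.5:14b",  # any model you've pulled with `ollama pull`
    messages=[{"role": "user", "content": "Explain quantisation in two sentences."}],
)
print(reply.choices[0].message.content)
```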
llama.cpp is the underlying engine most things are built on. Worth knowing if you want control over quantisation or GPU layer counts, or if you’re running on AMD and need ROCm directly.
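The control worth knowing about is partial GPU offload, for models bigger than your VRAM. A sketch via the llama-cpp-python bindings, with the path and layer count as placeholders (and assuming a build compiled with CUDA or ROCm support):

```python
from llama_cpp import Llama

# Partial offload: fit what you can in VRAM, spill the rest to system RAM.
# The layer count is illustrative -- tune it until VRAM is nearly full.
llm = Llama(
    model_path="models/qwen2.5-32b-instruct-q4_k_m.gguf",  # placeholder path
    n_gpu_layers=40,  # offload 40 layers to the GPU; -1 offloads everything
    n_ctx=8192,
)
```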
Suggested models by tier:
- £500 APU build: Qwen2.5-7B, Llama 3.1 8B. Keep to 7-8B Q4 models.
- £800 Agent Rig: Qwen2.5-14B, Mistral-Nemo, DeepSeek-Coder-V2-Lite. The 16GB VRAM handles 13-14B cleanly.
- £1500 Serious Setup: Qwen2.5-72B quantised, Llama 4 Scout for multimodal (even quantised, its weights need heavy offloading to system RAM). At 24GB you can run 34B at Q4/Q5 entirely in VRAM, or push a quantised 70B by splitting layers between GPU and system RAM.
For agent frameworks: most people start with a simple Ollama + Python setup, then graduate to something like LangChain, LlamaIndex, or a purpose-built harness depending on their use case.
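To give a sense of what “simple Ollama + Python” means in practice, here’s a toy agent loop. The TOOL: convention, model tag, and single file-reading tool are all illustrative, not any framework’s API:

```python
from pathlib import Path
from openai import OpenAI

client = OpenAI(base_url="http://localhost:11434/v1", api_key="ollama")
SYSTEM = ("You may answer directly, or reply with exactly "
          "'TOOL: read_file <path>' to read a local file first.")

messages = [{"role": "system", "content": SYSTEM},
            {"role": "user", "content": "What does pyproject.toml declare?"}]

for _ in range(5):  # hard cap so a confused model can't loop forever
    reply = client.chat.completions.create(model="qwen2.5:14b", messages=messages)
    text = reply.choices[0].message.content.strip()
    messages.append({"role": "assistant", "content": text})
    if text.startswith("TOOL: read_file"):
        path = text.split(maxsplit=2)[2]
        result = Path(path).read_text()[:2000]  # truncate; context is finite
        messages.append({"role": "user", "content": f"TOOL RESULT:\n{result}"})
    else:
        print(text)
        break
```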
The bottleneck on your first build will not be framework choice. It will be VRAM. Start there.