The Z420 is a 2012–2013 HP professional workstation. These machines are everywhere on eBay and the second-hand market for next to nothing — corporations cycled them out, and nobody thought to run 120B-parameter AI models on them. That was a mistake on their part.
// Why this machine
The Z420's secret weapon isn't the CPU — it's the RAM ceiling. With the right Xeon (E5-16xx or E5-26xx series), it takes up to 64GB of ECC DDR3. Large language models run in RAM when there's no GPU to speak of. More RAM means larger models, less swapping, faster inference. The Z420 gives you 64GB on a budget that would embarrass a new gaming PC.
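The RAM arithmetic is worth making explicit. A minimal sketch that converts parameter count and quantisation width into raw weight size; it ignores KV cache and runtime overhead, so treat these numbers as floors:

```python
def weights_gb(params_billion: float, bits_per_weight: float) -> float:
    """Raw weight storage in GB: params * bits / 8, with 1B params = 1e9 bytes per GB."""
    return params_billion * bits_per_weight / 8

print(weights_gb(120, 4))   # 60.0  -> the ~60GB Q4 figure for a 120B model
print(weights_gb(120, 8))   # 120.0 -> Q8 already blows past the 64GB ceiling
print(weights_gb(120, 16))  # 240.0 -> full precision is out of the question
```

This is why 64GB is the build: Q4 at roughly 60GB is the only quantisation of a 120B model that fits in the Z420's memory ceiling.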
CPU
Xeon E5-1650 v2
6 cores, 12 threads. 3.5GHz base, 3.9GHz boost. Ivy Bridge-EP architecture. More than enough for inference orchestration.
↗ v2 suffix matters — LGA2011 socket; early Z420s need the 2013 boot-block BIOS for v2 CPUs, so check compatibility
RAM — the important part
64GB DDR3 ECC 1866MHz
8× 8GB sticks in quad-channel configuration. ECC is a bonus — the capacity is the point. This is your VRAM substitute.
↗ Don't go below 32GB. 64GB is the build.
Storage
SSD for OS + HDD for weights
Any modern SSD for Ubuntu. A large spinning disk or second SSD for model files — Nemotron-3-Super 120B in Q4 quantisation is ~60GB.
↗ Weights don't need speed. Capacity matters.
GPU
Optional — CPU inference is the play
The Z420's proprietary 625W PSU limits GPU options. Avoid high-TDP cards. A GTX 1060 6GB or a Quadro P2000 is a safe choice. But honestly — not required.
↗ CPU-only inference works. Don't block on GPU.
Power Supply
625W HP Proprietary PSU
Non-standard connector. Don't try to swap it out unless you know what you're doing. Factor this into any GPU plans.
↗ This is the Z420's one real constraint.
Operating System
Ubuntu 24.04 LTS
Ollama has excellent Linux support. Ubuntu 24.04 is stable and well-documented. Avoid bleeding-edge kernels — driver conflicts multiply fast.
↗ Stick to LTS. Seriously.
// What to buy and where
Search eBay, Facebook Marketplace, Gumtree, or local government surplus auctions. Keywords: "HP Z420 workstation", "Xeon workstation", "E5-1650". Budget AUD $150–300 for the base unit. Add $80–150 for RAM if it needs topping up. Total outlay for the Fortress: under $400.
Nvidia Nemotron-3-Super 120B is a Mixture-of-Experts model with 120 billion total parameters and 12 billion active per forward pass — meaning it reasons like a large model but runs cheaper than its parameter count implies. It's the engine behind both Arktron and Arbor, accessed two different ways.
Arktron — Cloud path
openrouter/nvidia/nemotron-3-super-120b-a12b:free
Routed through OpenRouter's free tier. No local resources consumed. Arktron uses this for tasks where latency matters less than throughput, or when the local model is under load.
↗ Free tier — rate limited but zero cost
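The cloud path can be sketched against OpenRouter's OpenAI-compatible chat completions endpoint. A minimal request builder, assuming an `OPENROUTER_API_KEY` environment variable; note that the API itself takes the bare `nvidia/...` id, while the `openrouter/` prefix above is gateway routing notation (an assumption worth verifying against OpenRouter's model list):

```python
import json
import os

def build_openrouter_request(prompt: str):
    """Assemble the POST for OpenRouter's chat completions endpoint (OpenAI-compatible)."""
    url = "https://openrouter.ai/api/v1/chat/completions"
    headers = {
        "Authorization": f"Bearer {os.environ.get('OPENROUTER_API_KEY', '')}",
        "Content-Type": "application/json",
    }
    body = json.dumps({
        # Bare model id; the "openrouter/" prefix in the doc is routing notation
        "model": "nvidia/nemotron-3-super-120b-a12b:free",
        "messages": [{"role": "user", "content": prompt}],
    })
    return url, headers, body
```

Send it with any HTTP client; the `:free` suffix selects the rate-limited free tier.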
Arbor — Local path
nemotron-3-super:120b (Ollama)
Running directly on the Fortress via Ollama. Private, offline-capable, no rate limits. Arbor owns this instance — it's the model that lives in the machine.
↗ CPU inference — 1–3 tok/s on the Z420
Architecture
Mixture of Experts — 12B active
120B total parameters, but only ~12B active per token. This is why it's runnable locally — the actual compute per forward pass is far less than the total parameter count suggests.
↗ MoE is the reason this works at all
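A back-of-envelope check on the tok/s figure quoted above: CPU inference is memory-bandwidth-bound, and per token a MoE only has to stream its active weights. Rough numbers (quad-channel DDR3-1866 peaks near 60GB/s theoretical; sustained bandwidth and efficiency are assumptions, not measurements):

```python
active_params = 12e9            # only the active experts are streamed per token
bytes_per_weight = 0.5          # Q4 ~ 4 bits per weight
bandwidth_gbs = 60              # quad-channel DDR3-1866, theoretical peak

gb_per_token = active_params * bytes_per_weight / 1e9  # GB streamed per token
ceiling = bandwidth_gbs / gb_per_token                 # tok/s upper bound
print(gb_per_token, ceiling)
# Real-world efficiency of 10-30% of peak lands in the 1-3 tok/s range quoted above.
# A dense 120B would stream 10x the bytes per token and be unusable on this box.
```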
Ollama is the runtime layer — it handles model downloads and quantised-weight loading, and it exposes a local REST API that everything else talks to. It's the part that makes this actually work without pain.
1. Install Ollama
One-line installer on Ubuntu. This handles the binary, systemd service, and default paths.
2. Pull the model
Downloads directly from the Ollama registry. ~60GB — make sure you have the space and a decent connection. It resumes if interrupted.
3. Configure for CPU inference
Set OLLAMA_NUM_GPU=0 if you want pure CPU. On the Z420 with a low-end GPU, Ollama will usually figure this out — but being explicit avoids surprises.
4. Configure RAM allocation
Set OLLAMA_MAX_LOADED_MODELS=1 so Ollama never loads multiple models simultaneously. With 64GB you could technically fit two smaller models, but for the 120B, single-model is essential.
5. Test the endpoint
Ollama serves on http://localhost:11434 by default. Hit it with a quick curl to confirm it's alive before building anything on top.
6. Enable as a systemd service
Ollama's installer registers a systemd unit. Enable it so the model server starts on boot — important if this machine runs headless or via RDP.
$ curl -fsSL https://ollama.com/install.sh | sh
$ ollama pull nemotron-3-super:120b
$ curl http://localhost:11434/api/generate \
    -d '{"model":"nemotron-3-super:120b","prompt":"Hello."}'
$ sudo systemctl enable --now ollama
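The `/api/generate` endpoint streams newline-delimited JSON, one object per chunk, with the final object carrying `"done": true`. A minimal parser sketch (offline here; the sample lines stand in for a live response):

```python
import json

def collect_response(lines):
    """Concatenate the 'response' fields of Ollama's streamed JSON lines."""
    out = []
    for line in lines:
        chunk = json.loads(line)
        out.append(chunk.get("response", ""))
        if chunk.get("done"):
            break
    return "".join(out)

# Stand-in for what the curl above streams back
sample = [
    '{"model":"nemotron-3-super:120b","response":"Hel","done":false}',
    '{"model":"nemotron-3-super:120b","response":"lo.","done":true}',
]
print(collect_response(sample))  # Hello.
```

Anything built on top of the Fortress talks to this endpoint, so getting the streaming shape right once pays off everywhere.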
// Environment config — /etc/systemd/system/ollama.service.d/override.conf
[Service]
Environment="OLLAMA_NUM_GPU=0"
Environment="OLLAMA_MAX_LOADED_MODELS=1"
Environment="OLLAMA_HOST=0.0.0.0:11434"
// On swap and RAM pressure
With 64GB, the ~60GB Q4 model fits: Ollama's llama.cpp backend memory-maps the weights by default, so the OS pages them in as needed, but headroom is thin and anything else memory-hungry should be shut down. On 32GB the model spills into swap. This works, but throughput drops to roughly 0.2–0.5 tok/s. Still technically functional for offline batch work, but painful for interactive use. The 64GB target isn't perfectionism — it's the usability threshold.
OpenClaw is the API gateway layer — it sits between your agents and the model backends, handling multi-provider routing, prompt caching, model fallbacks, and unified API access. Nemotron via Ollama becomes one provider among several. Claude, Gemini, Qwen — all addressable through a single interface.
◌ Chapter In Progress
AGENT PERFORMANCE TESTING UNDERWAY
The OpenClaw chapter is being written as the system is put through its paces. Real numbers from real workloads — not benchmarks. This section will cover installation, provider configuration, routing logic, and prompt cache setup.
Multi-provider routing
Prompt caching config
Model fallback logic
Ollama integration
openclaw.json structure
// Why OpenClaw matters for the Fortress
Without a gateway, you're choosing one model for every task. With OpenClaw, the Fortress runs Nemotron locally for private or heavy reasoning tasks, routes to Claude or Gemini for tasks that benefit from cloud scale, and caches repeated context so you're not re-uploading the same 50k-token prompt for every request. The local model stops being a limitation and starts being a deliberate choice.
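The routing logic described above can be sketched as a simple policy function. Every name here is hypothetical (OpenClaw's actual config schema isn't covered until the chapter lands); the point is the shape of the decision, not the API:

```python
def pick_provider(task: dict) -> str:
    """Hypothetical routing policy: private or offline work stays local,
    scale-hungry work goes to the cloud, everything else rides the free tier."""
    if task.get("private") or task.get("offline"):
        return "ollama/nemotron-3-super:120b"       # Arbor: local, no rate limits
    if task.get("needs_scale"):
        return "anthropic/claude"                    # cloud scale when it pays
    return "openrouter/nvidia/nemotron-3-super-120b-a12b:free"  # Arktron: free tier

print(pick_provider({"private": True}))      # ollama/nemotron-3-super:120b
print(pick_provider({"needs_scale": True}))  # anthropic/claude
print(pick_provider({}))                     # openrouter/nvidia/nemotron-3-super-120b-a12b:free
```

In a real gateway this policy lives in config rather than code, but the local-first default is the design choice that makes the Fortress worth building.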