NVIDIA Tesla P4, 8GB VRAM: a simple relic running qwen3.5:9b

So I’ve started messing around with running modern LLMs on old, borderline e-waste hardware just to see how far you can push it. One setup that’s quietly been holding up better than expected is an NVIDIA Tesla P4.

8GB VRAM, no display output, and I had to custom print a cooling solution, but it costs less than $100 to get your hands on.

Setup is pretty standard:

Pop!_OS host, Docker with NVIDIA runtime, nothing fancy. Container is just:

    docker run --rm -it \
      --gpus all \
      -v ollama:/root/.ollama \
      -p 11434:11434 \
      ollama/ollama

Then:

    ollama run qwen3.5:9b

Docker and the NVIDIA runtime are behaving and nvidia-smi is clean. The drivers were not 100% there by default, but it didn't take much effort to pull the missing parts. For cooling, I grabbed a model from Thingiverse and wrapped some electrical tape around a small server fan to create a seal with the print.

It’s a little workhorse. It’s been running for several months without issue, and it’s actually stable.

I can run prompts and get responses, it’s on 24/7, and I added it to the brain network for my orchestrator.
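For anyone wanting to hook a box like this into their own tooling, here is a minimal sketch of sending one prompt to the container over Ollama's HTTP API. It assumes the server is reachable on localhost at the port published in the docker command (11434) and uses the non-streaming form of the `/api/generate` endpoint. The helper names `build_request` and `ask` are made up for illustration, and this is not necessarily how the orchestrator above talks to it.

```python
import json
import urllib.request

# Port comes from the docker run command; "localhost" is an assumption.
OLLAMA_URL = "http://localhost:11434/api/generate"


def build_request(prompt, model="qwen3.5:9b"):
    """Build a non-streaming POST request for Ollama's /api/generate endpoint."""
    payload = {"model": model, "prompt": prompt, "stream": False}
    return urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def ask(prompt, model="qwen3.5:9b", timeout=600):
    """Send one prompt and return the generated text.

    The generous timeout is deliberate: an old card can take a while.
    """
    request = build_request(prompt, model)
    with urllib.request.urlopen(request, timeout=timeout) as resp:
        body = json.loads(resp.read().decode("utf-8"))
    return body["response"]
```

Calling `ask("Say hello in five words.")` from another machine on the network should work the same way once `localhost` is swapped for the host's address.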

VRAM sits at around 7.5–7.6GB, so yeah, it’s close to the ceiling. There is probably a bit of overflow into system RAM, but I haven't measured. I use it as a background worker running a queue of tasks with no parallelism, and it won't respond to new requests while it's already crunching.
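Since the overflow into system RAM is unmeasured, here is a rough sketch of how the VRAM side could be checked from the host. It shells out to `nvidia-smi` with its CSV query flags and parses the result. The function names are illustrative, and the sample numbers in the comments are made up, not readings from this card.

```python
import subprocess

# nvidia-smi reports these values in MiB when "nounits" is set.
QUERY = [
    "nvidia-smi",
    "--query-gpu=memory.used,memory.total",
    "--format=csv,noheader,nounits",
]


def parse_memory(csv_text):
    """Parse 'used, total' lines (one per GPU) into (used_mib, total_mib) tuples."""
    rows = []
    for line in csv_text.strip().splitlines():
        used, total = (int(field.strip()) for field in line.split(","))
        rows.append((used, total))
    return rows


def headroom_mib(used, total):
    """Free VRAM left on the card, in MiB."""
    return total - used


def read_gpu_memory():
    """Run nvidia-smi and return parsed memory usage for every GPU."""
    out = subprocess.run(QUERY, capture_output=True, text=True, check=True).stdout
    return parse_memory(out)


# Example with invented numbers: parse_memory("7680, 8192") gives [(7680, 8192)],
# and headroom_mib(7680, 8192) gives 512.
```

This only shows how full the card is. To see whether part of the model actually landed in system RAM, `ollama ps` is probably the quicker check, since its PROCESSOR column reports the CPU/GPU split for a loaded model.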

But for single-user, moderate context, local agent stuff, it’s actually fantastic.
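The one-job-at-a-time background worker described above can be sketched with nothing but the standard library. This is an assumed shape, not the actual orchestrator code: `handle` stands in for whatever function sends a prompt to the model, and `run_worker` and `process_all` are hypothetical names.

```python
import queue
import threading


def run_worker(jobs, handle, results):
    """Process jobs strictly one at a time until a None sentinel arrives."""
    while True:
        job = jobs.get()
        if job is None:
            jobs.task_done()
            break
        try:
            results.append((job, handle(job)))
        except Exception as exc:
            # Record the failure and keep going so one bad job
            # does not stop the whole queue.
            results.append((job, exc))
        finally:
            jobs.task_done()


def process_all(prompts, handle):
    """Feed prompts through a single worker thread and collect results in order."""
    jobs = queue.Queue()
    results = []
    worker = threading.Thread(
        target=run_worker, args=(jobs, handle, results), daemon=True
    )
    worker.start()
    for prompt in prompts:
        jobs.put(prompt)
    jobs.put(None)  # sentinel: tells the worker to exit
    worker.join()
    return results
```

A single worker thread means the GPU only ever sees one request at a time, which matches the "no parallel" constraint. Anything submitted while a job is running just waits in the queue.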

Also worth noting this is Pascal-era hardware, so:

no tensor cores, so none of the modern tensor core advantages

lower memory bandwidth than newer cards

Not really an issue in my use case as it works mostly passively with no time pressure.
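To put a rough number on the bandwidth point, a common rule of thumb is that single-stream decoding of a dense model is memory-bound: each generated token has to read roughly all of the weights once, so tokens per second cannot exceed bandwidth divided by weight size. The sketch below applies that rule. Both inputs in the example comment are assumptions: 192 GB/s is the commonly quoted spec for the P4, and 6 GB is a placeholder for the quantized weights, not a measured size for this model.

```python
def decode_ceiling_tokens_per_s(bandwidth_gb_s, weights_gb):
    """Upper bound on decode speed if every token reads all weights once.

    Real throughput will be lower: this ignores compute, KV-cache reads,
    and any layers that spill into system RAM.
    """
    if bandwidth_gb_s <= 0 or weights_gb <= 0:
        raise ValueError("bandwidth_gb_s and weights_gb must be positive")
    return bandwidth_gb_s / weights_gb


# Example with assumed inputs:
# decode_ceiling_tokens_per_s(192.0, 6.0) evaluates to 32.0 tokens/s at best.
```

That is a ceiling rather than a prediction, but it explains why an old card is fine for a patient background queue and painful for anything interactive.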

The interesting takeaway for me is that this setup lands in a weird middle ground.

It’s not fast, scalable, or future-proof, but it is reliable enough to actually use daily if your expectations are realistic, and it was a fun experiment that turned out to be useful.

If anything, it makes me think the “minimum viable hardware” for useful local LLMs is lower than people assume if you’re willing to trade speed for independence.

If anyone else is running models on older datacenter cards or random scrap GPUs, I’d be keen to compare notes.

Feels like there’s a whole layer of practical setups out there that don’t get talked about much.


Source: reddit, domain: tech, retention score: 0.59