深度指南 · 2026-05-18 · by @zayuerweb-dev
The Complete Guide to Running Open-Source LLMs Locally: GPU, VRAM, Frameworks, Models
"How big a model can my GPU run?" "Ollama or vLLM?" "Does DeepSeek 70B really need a server?" These come up constantly. Running models locally stopped being an enthusiast-only thing well before 2026: DeepSeek pushed distilled small models close to mid-tier cloud quality, quantization fits a 7B into 4GB of VRAM, and Ollama gets you running with one command. This guide uses real VRAM numbers, framework trade-offs, GPU tiers, and an electricity-cost breakdown to settle "how do I run a model on my own machine" in one read.
30-second verdict
- Entry (8-16GB VRAM): RTX 4060 / 4060 Ti, runs 7B-14B INT4. DeepSeek-R1-Distill-Qwen-7B is the sweet spot.
- Main (24-32GB): RTX 4090 / 5090, runs 32B INT4 or 14B FP16, gets real work done.
- Professional (48GB+): RTX 6000 Ada / dual cards, runs 70B INT4.
- Mac users: M3/M4 Max with 64-128GB unified memory + MLX runs 70B quantized, quiet and power-efficient.
- No discrete GPU: llama.cpp + CPU + lots of RAM, runs but slowly, only for low frequency.
- Framework: Ollama for personal use, vLLM for serving, LM Studio / llama.cpp on Mac.
- Model: DeepSeek-R1 distilled all-round, Qwen3 for Chinese, GLM-4-9B for agents, Qwen3 Coder for code.
- When in doubt: get Ollama + DeepSeek-R1-Distill-Qwen-7B running first, then upgrade hardware when it can't keep up.
Compare open-source models' specs and capabilities on Check.AI →
When to run locally (and when not to)
A cold splash first: most people don't need local deployment. The DeepSeek API costs a few cents per million tokens, light use won't run up much in a year, and you skip buying a GPU, setting up the environment, and tuning a framework.
Three cases where local is genuinely worth it:
- Data can't leave the premises: healthcare, legal, internal corporate data, personal privacy. This is the hardest reason; however expensive, it has to be local.
- High-frequency, high-volume calls: tens of thousands a day, where the API bill snowballs and local marginal cost approaches electricity.
- Offline / weak network / full control: no dependence on any vendor, no fear of price hikes, throttling, shutdowns, or censorship.
Conversely, if you just want to play with AI or call it a few dozen times a day, skip local, use the API. The money for a GPU buys years of API credit.
How to size VRAM: quantization is the key
VRAM is the first hurdle in local deployment. The formula is simple: parameter count × bytes per parameter + 30% headroom (activations, KV cache). Bytes per parameter is set by the quantization precision:
| Precision | Per param | Quality | 7B VRAM | Best for |
|---|---|---|---|---|
| FP16 | 2 bytes | Original | ~16GB | Production / benchmarking |
| INT8 | 1 byte | Minimal loss | ~8GB | Quality-sensitive |
| INT4 (q4_k_m) | 0.5 bytes | Single-digit % loss | ~4-5GB | Personal sweet spot |
Sources: llama.cpp GGUF quantization benchmarks, hands-on LLM inference testing (2026-02), and the models' HuggingFace cards.
INT4 VRAM cheat sheet by size
| Model size | INT4 VRAM | FP16 VRAM | Minimum GPU |
|---|---|---|---|
| 7B | ~5GB | ~16GB | RTX 3060 12GB |
| 13-14B | ~9GB | ~28GB | RTX 4060 Ti 16GB |
| 32B | ~20GB | ~64GB | RTX 4090/5090 32GB |
| 70B | ~40GB | ~140GB | RTX 6000 Ada 48GB / dual card |
Remember one thing: for personal local use, always start at INT4. The quality loss is usually imperceptible, you save 75% of the VRAM, and it's faster too. Original precision is only needed for production and benchmarking.
GPU tiers: how big a model your card can run
Entry tier (8-16GB)
RTX 3060 12GB / 4060 8GB / 4060 Ti 16GB. Comfortably runs 7B-14B INT4. This tier is enough for a personal assistant, local RAG, and code completion. The 4060 Ti 16GB is the value king: 16GB fits a 14B INT4 with headroom to spare.
Main tier (24-32GB)
RTX 4090 24GB / 5090 32GB. Runs 32B INT4, or 14B FP16 if you want the quality. This tier does real work: local agents, batch processing, serious reasoning. Two RTX 5060 Ti 16G cards (32G combined) on vLLM can also run 32B AWQ, a budget option.
Professional tier (48GB+)
RTX 6000 Ada 48GB, or dual RTX 4090/5090 with NVLink memory pooling. The threshold for 70B INT4. Beyond that, 70B FP16 needs dual 80GB cards (A100/H100), which is a server room, not a desktop, and basically off-limits for individuals.
No discrete GPU / integrated graphics
llama.cpp on CPU + system RAM. 32GB of RAM can run 7B INT4, but the speed may be a few tokens per second (GPUs do tens to hundreds). Only for very low frequency where you don't mind waiting. Don't expect to do real-time conversation with it.
Picking a framework: Ollama / vLLM / llama.cpp / LM Studio
Ollama: the personal pick
One command, ollama run deepseek-r1:7b, and you're running. Version 0.5+ supports mixed CPU/GPU inference and dynamic model offloading. The model library is ready-made and switching is easy. For personal, single-machine, desktop, experimental use, pick Ollama with your eyes closed. The downside is high-concurrency throughput trails vLLM.
vLLM: production serving
To turn a model into an API for many people or many requests, vLLM is the standard answer. The Chunked Prefill + Prefix Caching in v0.7+ manages VRAM fragmentation efficiently, with throughput far beyond Ollama. It needs CUDA 12.4+ and supports multi-card parallelism natively. Pick vLLM for a product backend. Configuration is a notch more complex than Ollama.
llama.cpp: the most portable
Written in C++, it runs almost anywhere: CPU, Mac Metal, all kinds of edge devices. The birthplace of the GGUF quantization format. The top pick for those with no Nvidia card, who want maximum portability, or who use the Mac command line. Lots of room for performance tuning, but it takes fiddling.
LM Studio: GUI, beginner-friendly
It has a GUI: click to download a model and start chatting, and it works well on Mac (with MLX acceleration) and Windows. For people who don't want to touch the command line at all, start here. It's less deep than the other three, but the barrier is lowest.
The one-line path: try LM Studio as a beginner → switch to Ollama for daily use once comfortable → move to vLLM for a product → heavy Mac users go straight to llama.cpp / MLX.
Which models to run locally in 2026
- All-round reasoning leader: DeepSeek-R1-Distill-Qwen-7B / 32B. The distilled versions are small but strong; the 7B's reasoning quality is close to mid-tier cloud, and it runs in 4-5GB of VRAM. It should be the first thing an indie developer installs.
- Chinese + multilingual: the Qwen3 series (7B/14B/32B). Native Chinese training, strong on classical Chinese, policy text, and Southeast Asian languages, the top pick for local Chinese use.
- General + ecosystem: the Llama 4 series. The biggest community, the most complete toolchain, the most fine-tuning material.
- Agent / tool calling: GLM-4-9B. The most reliable structured output and function calling among small models, pick it for local agents.
- Code: Qwen3 Coder / DeepSeek Coder. Use these for local code completion and when wiring Cline/Aider to a local model.
The selection logic is the same as in the cloud: there's no do-everything king, so match the model to the use case. The steadiest personal starter combo is DeepSeek-R1-Distill-Qwen-7B (reasoning) + Qwen3-7B (Chinese): together they cover 90% of needs in 10GB of VRAM.
Apple Silicon: the underrated option
Many people don't realize it: the Mac's unified memory is a hidden advantage for local models. On a regular PC, VRAM and RAM are separate, so a 16GB card is 16GB. Apple Silicon's memory is shared between CPU and GPU, so an M4 Max 128GB can hand 100GB+ over as "VRAM."
- M3/M4 Max 64GB: runs 32B quantized smoothly, 70B just barely.
- M4 Max 128GB: runs 70B quantized at an experience close to a professional card, while staying quiet, power-efficient, and case-free.
Use LM Studio (GUI, MLX built in) or llama.cpp (Metal acceleration). MLX is Apple's own ML framework, optimized natively for unified memory and a notch faster than generic options.
Bottom line: if you already have a high-memory Mac, stop researching an Nvidia card. The one in your hands may already be enough, and the experience is quieter.
Real cost: local vs API
Let's run the numbers on a concrete scenario: 5,000 calls a day, 800 tokens in and 200 out each, on a 7B-class model.
| Option | Upfront | Monthly cost | Payback period |
|---|---|---|---|
| DeepSeek API (cloud) | ¥0 | ~¥150-300 | None (ongoing) |
| Local RTX 4090 | ~¥13,000 | ~¥80 electricity | About 8-14 months |
| Local Mac M4 Max (already owned) | ¥0 (reuse) | ~¥30 electricity | Immediate |
Electricity estimated at 0.4 kWh/hour under full load, 8 hours a day, ¥0.6/kWh. Upfront cost at China retail prices as of May 2026.
In plain terms: light use, stick with the API and don't buy a card. High-frequency plus long-term is when local pays off, and payback runs 8-14 months, during which you have to genuinely use it heavily every day. What's truly irreplaceable about local isn't saving money, it's "data never leaves the premises + unlimited calls + offline availability + no vendor holding you hostage." The people paying for those four things are the ones who should deploy locally.
Related reading
FAQ
What's the minimum GPU? 7B INT4 needs only 4-5GB, so an RTX 3060 is fine. With no discrete GPU, llama.cpp on CPU works but is slow.
Ollama or vLLM? Ollama for personal use (one command), vLLM for serving (high throughput). LM Studio / llama.cpp on Mac.
Which quantization? For personal use, always start at INT4: 75% VRAM saved, single-digit % quality loss.
Which model to run? DeepSeek-R1 distilled all-round, Qwen3 for Chinese, GLM-4-9B for agents, Qwen3 Coder for code.
Is it cheaper than the API? Not for light use; only high-frequency plus long-term pays off (8-14 months). The core value is data privacy + unlimited calls.
Can a Mac run it? Yes, unified memory is the advantage. An M4 Max 128GB runs 70B quantized, using MLX.
→ Compare every open-source model's specs, context, and capabilities on Check.AI