What is the minimum GPU to run an LLM locally?

It depends on model size and quantization. A 7B model at INT4 needs only 4-5GB of VRAM, so an RTX 3060 12GB / 4060 8GB can run it. Full FP16 precision for a 7B needs 16GB (RTX 4060 Ti 16GB / 4080). A 32B at INT4 needs about 20GB (RTX 4090/5090 32GB). A 70B at INT4 needs about 40GB, so an RTX 6000 Ada 48GB or dual cards. With no discrete GPU you can still use llama.cpp on CPU + RAM, but it is slow.

What is the difference between Ollama and vLLM, and which should I use?

Ollama: one command to install and run, supports mixed CPU/GPU and dynamic offloading, good for personal, single-machine, desktop, experimental use. vLLM: production-grade high throughput, supports Chunked Prefill, Prefix Caching, and multi-card parallelism, needs CUDA 12.4+, good for serving to many users. For personal local use pick Ollama; for a product API pick vLLM. Mac users use LM Studio or llama.cpp (Metal/MLX acceleration).

How do I choose among INT4, INT8, and FP16 quantization?

FP16 = original precision, 2 bytes per parameter, best quality but the most VRAM (about 16GB for 7B). INT8 halves the VRAM (about 8GB for 7B) with very small quality loss. INT4 (q4_k_m) drops VRAM to about 1/4 (about 4-5GB for 7B), is the fastest, and loses only a single-digit percentage of quality, the sweet spot for personal deployment. Unless you are doing serious production or benchmarking, start at INT4 for personal local use.

Which open-source model is best to run locally in 2026?

All-round reasoning: DeepSeek-R1-Distill-Qwen-7B/32B (distilled, small but strong). Chinese + multilingual: the Qwen3 series. General + ecosystem: Llama 4. Agent / tool calling: GLM-4-9B. Code: Qwen3 Coder / DeepSeek Coder. The top pick for indie developers is DeepSeek-R1-Distill-Qwen-7B (runs in 4GB of VRAM, with reasoning quality close to mid-tier cloud models).

Is local deployment cheaper than using an API?

It depends on volume. Low frequency (a few dozen calls a day): the API is cheaper, no GPU needed. High frequency + long-term + sensitive data: local pays off. One RTX 4090 (about ¥13,000) plus electricity (about 0.4 kWh/hour under load) running DeepSeek distilled 7B for a few thousand calls a day pays back in six months to a year, after which the marginal cost is close to electricity. The biggest value is not saving money, it is data staying local + unlimited calls + offline availability.

Can a Mac run LLMs locally?

Yes, and Apple Silicon's unified memory is a hidden advantage. On an M3/M4 Max with 64-128GB, most of the memory can serve as VRAM, so a quantized 70B will run. Use LM Studio (GUI) or llama.cpp (command line, Metal acceleration); the newer MLX framework is Apple-native and faster than generic options. A 128GB M4 Max running a quantized 70B is close to the experience of a professional card, while staying quiet and power-efficient.

深度指南 · 2026-05-18 · by @zayuerweb-dev

The Complete Guide to Running Open-Source LLMs Locally: GPU, VRAM, Frameworks, Models

"How big a model can my GPU run?" "Ollama or vLLM?" "Does DeepSeek 70B really need a server?" These come up constantly. Running models locally stopped being an enthusiast-only thing well before 2026: DeepSeek pushed distilled small models close to mid-tier cloud quality, quantization fits a 7B into 4GB of VRAM, and Ollama gets you running with one command. This guide uses real VRAM numbers, framework trade-offs, GPU tiers, and an electricity-cost breakdown to settle "how do I run a model on my own machine" in one read.

30-second verdict

Entry (8-16GB VRAM): RTX 4060 / 4060 Ti, runs 7B-14B INT4. DeepSeek-R1-Distill-Qwen-7B is the sweet spot.
Main (24-32GB): RTX 4090 / 5090, runs 32B INT4 or 14B FP16, gets real work done.
Professional (48GB+): RTX 6000 Ada / dual cards, runs 70B INT4.
Mac users: M3/M4 Max with 64-128GB unified memory + MLX runs 70B quantized, quiet and power-efficient.
No discrete GPU: llama.cpp + CPU + lots of RAM, runs but slowly, only for low frequency.
Framework: Ollama for personal use, vLLM for serving, LM Studio / llama.cpp on Mac.
Model: DeepSeek-R1 distilled all-round, Qwen3 for Chinese, GLM-4-9B for agents, Qwen3 Coder for code.
When in doubt: get Ollama + DeepSeek-R1-Distill-Qwen-7B running first, then upgrade hardware when it can't keep up.

Compare open-source models' specs and capabilities on Check.AI →

When to run locally (and when not to)

A cold splash first: most people don't need local deployment. The DeepSeek API costs a few cents per million tokens, light use won't run up much in a year, and you skip buying a GPU, setting up the environment, and tuning a framework.

Three cases where local is genuinely worth it:

Data can't leave the premises: healthcare, legal, internal corporate data, personal privacy. This is the hardest reason; however expensive, it has to be local.
High-frequency, high-volume calls: tens of thousands a day, where the API bill snowballs and local marginal cost approaches electricity.
Offline / weak network / full control: no dependence on any vendor, no fear of price hikes, throttling, shutdowns, or censorship.

Conversely, if you just want to play with AI or call it a few dozen times a day, skip local, use the API. The money for a GPU buys years of API credit.

How to size VRAM: quantization is the key

VRAM is the first hurdle in local deployment. The formula is simple: parameter count × bytes per parameter + 30% headroom (activations, KV cache). Bytes per parameter is set by the quantization precision:

Precision	Per param	Quality	7B VRAM	Best for
FP16	2 bytes	Original	~16GB	Production / benchmarking
INT8	1 byte	Minimal loss	~8GB	Quality-sensitive
INT4 (q4_k_m)	0.5 bytes	Single-digit % loss	~4-5GB	Personal sweet spot

Sources: llama.cpp GGUF quantization benchmarks, hands-on LLM inference testing (2026-02), and the models' HuggingFace cards.

INT4 VRAM cheat sheet by size

Model size	INT4 VRAM	FP16 VRAM	Minimum GPU
7B	~5GB	~16GB	RTX 3060 12GB
13-14B	~9GB	~28GB	RTX 4060 Ti 16GB
32B	~20GB	~64GB	RTX 4090/5090 32GB
70B	~40GB	~140GB	RTX 6000 Ada 48GB / dual card

Remember one thing: for personal local use, always start at INT4. The quality loss is usually imperceptible, you save 75% of the VRAM, and it's faster too. Original precision is only needed for production and benchmarking.

GPU tiers: how big a model your card can run

Entry tier (8-16GB)

RTX 3060 12GB / 4060 8GB / 4060 Ti 16GB. Comfortably runs 7B-14B INT4. This tier is enough for a personal assistant, local RAG, and code completion. The 4060 Ti 16GB is the value king: 16GB fits a 14B INT4 with headroom to spare.

Main tier (24-32GB)

RTX 4090 24GB / 5090 32GB. Runs 32B INT4, or 14B FP16 if you want the quality. This tier does real work: local agents, batch processing, serious reasoning. Two RTX 5060 Ti 16G cards (32G combined) on vLLM can also run 32B AWQ, a budget option.

Professional tier (48GB+)

RTX 6000 Ada 48GB, or dual RTX 4090/5090 with NVLink memory pooling. The threshold for 70B INT4. Beyond that, 70B FP16 needs dual 80GB cards (A100/H100), which is a server room, not a desktop, and basically off-limits for individuals.

No discrete GPU / integrated graphics

llama.cpp on CPU + system RAM. 32GB of RAM can run 7B INT4, but the speed may be a few tokens per second (GPUs do tens to hundreds). Only for very low frequency where you don't mind waiting. Don't expect to do real-time conversation with it.

Picking a framework: Ollama / vLLM / llama.cpp / LM Studio

Ollama: the personal pick

One command, ollama run deepseek-r1:7b, and you're running. Version 0.5+ supports mixed CPU/GPU inference and dynamic model offloading. The model library is ready-made and switching is easy. For personal, single-machine, desktop, experimental use, pick Ollama with your eyes closed. The downside is high-concurrency throughput trails vLLM.

vLLM: production serving

To turn a model into an API for many people or many requests, vLLM is the standard answer. The Chunked Prefill + Prefix Caching in v0.7+ manages VRAM fragmentation efficiently, with throughput far beyond Ollama. It needs CUDA 12.4+ and supports multi-card parallelism natively. Pick vLLM for a product backend. Configuration is a notch more complex than Ollama.

llama.cpp: the most portable

Written in C++, it runs almost anywhere: CPU, Mac Metal, all kinds of edge devices. The birthplace of the GGUF quantization format. The top pick for those with no Nvidia card, who want maximum portability, or who use the Mac command line. Lots of room for performance tuning, but it takes fiddling.

LM Studio: GUI, beginner-friendly

It has a GUI: click to download a model and start chatting, and it works well on Mac (with MLX acceleration) and Windows. For people who don't want to touch the command line at all, start here. It's less deep than the other three, but the barrier is lowest.

The one-line path: try LM Studio as a beginner → switch to Ollama for daily use once comfortable → move to vLLM for a product → heavy Mac users go straight to llama.cpp / MLX.

Which models to run locally in 2026

All-round reasoning leader: DeepSeek-R1-Distill-Qwen-7B / 32B. The distilled versions are small but strong; the 7B's reasoning quality is close to mid-tier cloud, and it runs in 4-5GB of VRAM. It should be the first thing an indie developer installs.
Chinese + multilingual: the Qwen3 series (7B/14B/32B). Native Chinese training, strong on classical Chinese, policy text, and Southeast Asian languages, the top pick for local Chinese use.
General + ecosystem: the Llama 4 series. The biggest community, the most complete toolchain, the most fine-tuning material.
Agent / tool calling: GLM-4-9B. The most reliable structured output and function calling among small models, pick it for local agents.
Code: Qwen3 Coder / DeepSeek Coder. Use these for local code completion and when wiring Cline/Aider to a local model.

The selection logic is the same as in the cloud: there's no do-everything king, so match the model to the use case. The steadiest personal starter combo is DeepSeek-R1-Distill-Qwen-7B (reasoning) + Qwen3-7B (Chinese): together they cover 90% of needs in 10GB of VRAM.

Apple Silicon: the underrated option

Many people don't realize it: the Mac's unified memory is a hidden advantage for local models. On a regular PC, VRAM and RAM are separate, so a 16GB card is 16GB. Apple Silicon's memory is shared between CPU and GPU, so an M4 Max 128GB can hand 100GB+ over as "VRAM."

M3/M4 Max 64GB: runs 32B quantized smoothly, 70B just barely.
M4 Max 128GB: runs 70B quantized at an experience close to a professional card, while staying quiet, power-efficient, and case-free.

Use LM Studio (GUI, MLX built in) or llama.cpp (Metal acceleration). MLX is Apple's own ML framework, optimized natively for unified memory and a notch faster than generic options.

Bottom line: if you already have a high-memory Mac, stop researching an Nvidia card. The one in your hands may already be enough, and the experience is quieter.

Real cost: local vs API

Let's run the numbers on a concrete scenario: 5,000 calls a day, 800 tokens in and 200 out each, on a 7B-class model.

Option	Upfront	Monthly cost	Payback period
DeepSeek API (cloud)	¥0	~¥150-300	None (ongoing)
Local RTX 4090	~¥13,000	~¥80 electricity	About 8-14 months
Local Mac M4 Max (already owned)	¥0 (reuse)	~¥30 electricity	Immediate

Electricity estimated at 0.4 kWh/hour under full load, 8 hours a day, ¥0.6/kWh. Upfront cost at China retail prices as of May 2026.

In plain terms: light use, stick with the API and don't buy a card. High-frequency plus long-term is when local pays off, and payback runs 8-14 months, during which you have to genuinely use it heavily every day. What's truly irreplaceable about local isn't saving money, it's "data never leaves the premises + unlimited calls + offline availability + no vendor holding you hostage." The people paying for those four things are the ones who should deploy locally.

FAQ

What's the minimum GPU? 7B INT4 needs only 4-5GB, so an RTX 3060 is fine. With no discrete GPU, llama.cpp on CPU works but is slow.

Ollama or vLLM? Ollama for personal use (one command), vLLM for serving (high throughput). LM Studio / llama.cpp on Mac.

Which quantization? For personal use, always start at INT4: 75% VRAM saved, single-digit % quality loss.

Which model to run? DeepSeek-R1 distilled all-round, Qwen3 for Chinese, GLM-4-9B for agents, Qwen3 Coder for code.

Is it cheaper than the API? Not for light use; only high-frequency plus long-term pays off (8-14 months). The core value is data privacy + unlimited calls.

Can a Mac run it? Yes, unified memory is the advantage. An M4 Max 128GB runs 70B quantized, using MLX.

→ Compare every open-source model's specs, context, and capabilities on Check.AI