Check.AI

深度指南 · 2026-05-18 · by

The Complete Guide to Running Open-Source LLMs Locally: GPU, VRAM, Frameworks, Models

"How big a model can my GPU run?" "Ollama or vLLM?" "Does DeepSeek 70B really need a server?" These come up constantly. Running models locally stopped being an enthusiast-only thing well before 2026: DeepSeek pushed distilled small models close to mid-tier cloud quality, quantization fits a 7B into 4GB of VRAM, and Ollama gets you running with one command. This guide uses real VRAM numbers, framework trade-offs, GPU tiers, and an electricity-cost breakdown to settle "how do I run a model on my own machine" in one read.

30-second verdict

Compare open-source models' specs and capabilities on Check.AI →

When to run locally (and when not to)

A cold splash first: most people don't need local deployment. The DeepSeek API costs a few cents per million tokens, light use won't run up much in a year, and you skip buying a GPU, setting up the environment, and tuning a framework.

Three cases where local is genuinely worth it:

Conversely, if you just want to play with AI or call it a few dozen times a day, skip local, use the API. The money for a GPU buys years of API credit.

How to size VRAM: quantization is the key

VRAM is the first hurdle in local deployment. The formula is simple: parameter count × bytes per parameter + 30% headroom (activations, KV cache). Bytes per parameter is set by the quantization precision:

Precision Per param Quality 7B VRAM Best for
FP162 bytesOriginal~16GBProduction / benchmarking
INT81 byteMinimal loss~8GBQuality-sensitive
INT4 (q4_k_m)0.5 bytesSingle-digit % loss~4-5GBPersonal sweet spot

Sources: llama.cpp GGUF quantization benchmarks, hands-on LLM inference testing (2026-02), and the models' HuggingFace cards.

INT4 VRAM cheat sheet by size

Model size INT4 VRAM FP16 VRAM Minimum GPU
7B~5GB~16GBRTX 3060 12GB
13-14B~9GB~28GBRTX 4060 Ti 16GB
32B~20GB~64GBRTX 4090/5090 32GB
70B~40GB~140GBRTX 6000 Ada 48GB / dual card

Remember one thing: for personal local use, always start at INT4. The quality loss is usually imperceptible, you save 75% of the VRAM, and it's faster too. Original precision is only needed for production and benchmarking.

GPU tiers: how big a model your card can run

Entry tier (8-16GB)

RTX 3060 12GB / 4060 8GB / 4060 Ti 16GB. Comfortably runs 7B-14B INT4. This tier is enough for a personal assistant, local RAG, and code completion. The 4060 Ti 16GB is the value king: 16GB fits a 14B INT4 with headroom to spare.

Main tier (24-32GB)

RTX 4090 24GB / 5090 32GB. Runs 32B INT4, or 14B FP16 if you want the quality. This tier does real work: local agents, batch processing, serious reasoning. Two RTX 5060 Ti 16G cards (32G combined) on vLLM can also run 32B AWQ, a budget option.

Professional tier (48GB+)

RTX 6000 Ada 48GB, or dual RTX 4090/5090 with NVLink memory pooling. The threshold for 70B INT4. Beyond that, 70B FP16 needs dual 80GB cards (A100/H100), which is a server room, not a desktop, and basically off-limits for individuals.

No discrete GPU / integrated graphics

llama.cpp on CPU + system RAM. 32GB of RAM can run 7B INT4, but the speed may be a few tokens per second (GPUs do tens to hundreds). Only for very low frequency where you don't mind waiting. Don't expect to do real-time conversation with it.

Picking a framework: Ollama / vLLM / llama.cpp / LM Studio

Ollama: the personal pick

One command, ollama run deepseek-r1:7b, and you're running. Version 0.5+ supports mixed CPU/GPU inference and dynamic model offloading. The model library is ready-made and switching is easy. For personal, single-machine, desktop, experimental use, pick Ollama with your eyes closed. The downside is high-concurrency throughput trails vLLM.

vLLM: production serving

To turn a model into an API for many people or many requests, vLLM is the standard answer. The Chunked Prefill + Prefix Caching in v0.7+ manages VRAM fragmentation efficiently, with throughput far beyond Ollama. It needs CUDA 12.4+ and supports multi-card parallelism natively. Pick vLLM for a product backend. Configuration is a notch more complex than Ollama.

llama.cpp: the most portable

Written in C++, it runs almost anywhere: CPU, Mac Metal, all kinds of edge devices. The birthplace of the GGUF quantization format. The top pick for those with no Nvidia card, who want maximum portability, or who use the Mac command line. Lots of room for performance tuning, but it takes fiddling.

LM Studio: GUI, beginner-friendly

It has a GUI: click to download a model and start chatting, and it works well on Mac (with MLX acceleration) and Windows. For people who don't want to touch the command line at all, start here. It's less deep than the other three, but the barrier is lowest.

The one-line path: try LM Studio as a beginner → switch to Ollama for daily use once comfortable → move to vLLM for a product → heavy Mac users go straight to llama.cpp / MLX.

Which models to run locally in 2026

The selection logic is the same as in the cloud: there's no do-everything king, so match the model to the use case. The steadiest personal starter combo is DeepSeek-R1-Distill-Qwen-7B (reasoning) + Qwen3-7B (Chinese): together they cover 90% of needs in 10GB of VRAM.

Apple Silicon: the underrated option

Many people don't realize it: the Mac's unified memory is a hidden advantage for local models. On a regular PC, VRAM and RAM are separate, so a 16GB card is 16GB. Apple Silicon's memory is shared between CPU and GPU, so an M4 Max 128GB can hand 100GB+ over as "VRAM."

Use LM Studio (GUI, MLX built in) or llama.cpp (Metal acceleration). MLX is Apple's own ML framework, optimized natively for unified memory and a notch faster than generic options.

Bottom line: if you already have a high-memory Mac, stop researching an Nvidia card. The one in your hands may already be enough, and the experience is quieter.

Real cost: local vs API

Let's run the numbers on a concrete scenario: 5,000 calls a day, 800 tokens in and 200 out each, on a 7B-class model.

Option Upfront Monthly cost Payback period
DeepSeek API (cloud)¥0~¥150-300None (ongoing)
Local RTX 4090~¥13,000~¥80 electricityAbout 8-14 months
Local Mac M4 Max (already owned)¥0 (reuse)~¥30 electricityImmediate

Electricity estimated at 0.4 kWh/hour under full load, 8 hours a day, ¥0.6/kWh. Upfront cost at China retail prices as of May 2026.

In plain terms: light use, stick with the API and don't buy a card. High-frequency plus long-term is when local pays off, and payback runs 8-14 months, during which you have to genuinely use it heavily every day. What's truly irreplaceable about local isn't saving money, it's "data never leaves the premises + unlimited calls + offline availability + no vendor holding you hostage." The people paying for those four things are the ones who should deploy locally.

FAQ

What's the minimum GPU? 7B INT4 needs only 4-5GB, so an RTX 3060 is fine. With no discrete GPU, llama.cpp on CPU works but is slow.

Ollama or vLLM? Ollama for personal use (one command), vLLM for serving (high throughput). LM Studio / llama.cpp on Mac.

Which quantization? For personal use, always start at INT4: 75% VRAM saved, single-digit % quality loss.

Which model to run? DeepSeek-R1 distilled all-round, Qwen3 for Chinese, GLM-4-9B for agents, Qwen3 Coder for code.

Is it cheaper than the API? Not for light use; only high-frequency plus long-term pays off (8-14 months). The core value is data privacy + unlimited calls.

Can a Mac run it? Yes, unified memory is the advantage. An M4 Max 128GB runs 70B quantized, using MLX.

→ Compare every open-source model's specs, context, and capabilities on Check.AI