A practical guide to running LLMs on your own hardware - covering the tools (Ollama, LM Studio, Jan), hardware requirements by VRAM tier, model selection, quantization formats, and how to integrate local inference into your dev workflow.
Running an LLM locally means executing a large language model entirely on your own hardware - no API calls, no cloud dependency, no data leaving your machine. For developers tired of mounting API bills, or teams that need private inference in air-gapped environments, local LLMs have become a real option. The tooling has matured fast. As of early 2026, you can go from zero to a working local ChatGPT-like setup in under five minutes.
This guide covers the hardware you need, the tools worth using, how to pick a model, and where local inference still falls short. No "what is an LLM" preamble - if you're here, you already know.
Why Bother Running LLMs Locally?
Five reasons keep coming up in practice:
- Privacy and data sovereignty. Your prompts and data never leave your machine. For regulated industries, legal teams, or anyone handling sensitive code, this alone justifies the effort.
- Cost. A single developer burning through GPT-4-class API calls can easily spend $50-200/month. A one-time $300 GPU runs a capable 8B model indefinitely.
- Latency. No network round-trip. Token generation starts immediately. For interactive coding assistants or local RAG pipelines, this matters.
- Offline access. Airplanes, classified networks, remote sites - local models work everywhere.
- Experimentation freedom. Run uncensored models, test fine-tuned variants, swap architectures without waiting for an API provider to support them.
The trade-off is straightforward: you give up frontier-model quality (GPT-4o, Claude Opus) in exchange for control, privacy, and zero marginal cost per token.
What Hardware Do You Actually Need?
VRAM is the bottleneck. The rough formula: parameters (in billions) x bits per weight, divided by 8, gives the gigabytes of VRAM needed for the weights alone - the KV cache and runtime overhead add a bit on top. A 7B parameter model at 4-bit quantization needs about 3.5 GB of VRAM; a 70B model at 4-bit needs roughly 35 GB.
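The formula is easy to sanity-check with a throwaway helper. The 10% overhead factor here is an assumption for illustration - actual overhead depends on the runtime and context length:

```python
def vram_gb(params_billion: float, bits_per_weight: float, overhead: float = 1.1) -> float:
    """Rough VRAM estimate in GB: weight memory plus an assumed ~10% runtime overhead."""
    return params_billion * bits_per_weight / 8 * overhead

# Weights alone (overhead=1.0) reproduce the figures above:
print(vram_gb(7, 4, overhead=1.0))   # 3.5
print(vram_gb(70, 4, overhead=1.0))  # 35.0
```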
Here is what fits at each tier:
| VRAM | What You Can Run | Expected Speed |
|---|---|---|
| 8 GB (RTX 4060, M1/M2) | 7-8B models at Q4_K_M | 30-50 tok/s |
| 16 GB (RTX 4070 Ti, M2 Pro) | 14B models at Q4, or 8B at Q8 | 20-40 tok/s |
| 24 GB (RTX 4090, RTX 3090) | 27-32B models at Q4_K_M | 15-30 tok/s |
| 48 GB+ (dual GPU, M4 Max, A6000) | 70B models at Q4 | 8-15 tok/s |
A model that fits entirely in VRAM runs roughly 10x faster than one that spills over into system RAM. Capacity beats raw GPU speed here - an RTX 3090 with 24 GB often outperforms an RTX 4080 with 16 GB on larger models simply because it avoids offloading.
CPU-only inference works for quantized 7B models, but expect 3-8 tokens per second depending on your CPU. Usable for batch processing. Painful for interactive chat.
Apple Silicon deserves its own mention. The M1 through M4 chips share a unified memory pool between CPU and GPU, which means a MacBook Pro with 36 GB of unified memory can load models that would need a 36 GB discrete GPU on a PC. The MLX framework from Apple is 20-30% faster than llama.cpp on Apple Silicon for most model sizes. An M4 Pro with 48 GB of RAM comfortably runs Qwen 3 32B (Q4) at 15-22 tokens per second - fast enough for interactive use.
The Tools: Ollama, LM Studio, Jan, and Beyond
The local LLM ecosystem has consolidated around a few tools. Here is how they compare:
| | Ollama | LM Studio | Jan | llama.cpp | vLLM |
|---|---|---|---|---|---|
| Interface | CLI + REST API | GUI + API server | GUI desktop app | CLI | CLI + API server |
| Best for | Developers, automation | Beginners, exploration | Privacy-focused desktop use | Power users, custom builds | Multi-user serving |
| Model format | GGUF (auto-converted) | GGUF | GGUF | GGUF | GPTQ, AWQ, FP16 |
| OpenAI-compatible API | Yes | Yes | Yes | Via server mode | Yes |
| Multi-GPU | Basic layer splitting | Limited | No | Layer offloading | Tensor parallelism |
| GitHub stars | 100K+ | Closed source | 30K+ | 80K+ | 45K+ |
Ollama: The Docker of Local LLMs
Ollama mirrors Docker's UX: you pull models by name and run them with a single command. It wraps llama.cpp under the hood and exposes an OpenAI-compatible API on port 11434.
```shell
# Install (macOS/Linux)
curl -fsSL https://ollama.com/install.sh | sh

# Pull and run a model
ollama run llama3.2

# Run a specific quantized variant
ollama run qwen3:14b-q4_K_M

# List downloaded models
ollama list
```
Ollama handles quantization format conversion automatically, manages model storage, and supports GPU acceleration out of the box on NVIDIA, AMD, and Apple Silicon. For most developers, this is the right starting point.
One thing to know: Ollama processes requests sequentially by default. Under concurrent load - say, five developers hitting the same Ollama server - latency spikes from 2 seconds to 45+ seconds. For multi-user scenarios, vLLM with its continuous batching architecture is a better fit, achieving up to 16x higher throughput under concurrency.
LM Studio: Best GUI Experience
LM Studio is a desktop application with a polished interface for browsing, downloading, and chatting with models. It searches Hugging Face directly, lets you adjust inference parameters with sliders, and spins up a local API server with one click. If you want to experiment with different models without touching a terminal, start here.
Jan: Privacy-First Desktop App
Jan is an open-source desktop app that runs completely offline. It has added agentic workflow features with project workspaces and browser-based MCP tool integration. Good for non-technical users who want a private ChatGPT replacement.
Choosing a Model and Understanding Quantization
Where to Find Models
Two main sources: the Ollama model library (curated, ready to pull) and Hugging Face (the full ecosystem, including GGUF-quantized variants from community contributors like TheBloke and bartowski).
Quantization: What Q4_K_M Actually Means
Quantization compresses model weights from 16-bit floating point down to 4, 5, 6, or 8 bits. The GGUF format (used by llama.cpp, Ollama, LM Studio, and Jan) is the standard for local inference. The naming convention breaks down like this:
- Q = quantized, the number is bits per weight (Q4 = 4-bit, Q8 = 8-bit)
- K = k-quant method (grouped quantization with per-group scaling) - better quality than older methods
- S/M/L = Small, Medium, Large variant - controls how many layers get higher precision
Quality ranking from best to most compressed: Q8_0 > Q6_K > Q5_K_M > Q4_K_M > Q4_K_S > Q3_K_S > Q2_K
Q4_K_M is the sweet spot for most users - it cuts VRAM usage by roughly 75% compared to FP16 while keeping perplexity degradation under 1% for most models. Q5_K_M is worth the extra memory if you have room. Q8_0 is near-lossless but needs twice the VRAM of Q4.
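The size trade-offs follow directly from the bits per weight. A small sketch, using nominal bit widths (real GGUF files run slightly larger because group scales and some tensors are kept at higher precision):

```python
# Nominal bits per weight by quant level; actual GGUF files are somewhat
# larger because k-quants store per-group scales at higher precision.
NOMINAL_BPW = {"Q2_K": 2, "Q3_K_S": 3, "Q4_K_M": 4, "Q5_K_M": 5,
               "Q6_K": 6, "Q8_0": 8, "FP16": 16}

def approx_size_gb(params_billion: float, quant: str) -> float:
    """Approximate on-disk/in-VRAM size of the quantized weights in GB."""
    return params_billion * NOMINAL_BPW[quant] / 8

for q in ("FP16", "Q8_0", "Q4_K_M"):
    print(f"14B at {q}: ~{approx_size_gb(14, q):.1f} GB")
```

For a 14B model this works out to roughly 28 GB at FP16, 14 GB at Q8_0, and 7 GB at Q4_K_M - the 75% saving mentioned above.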
Recommended Models (Early 2026)
| Use Case | Model | Parameters | Min VRAM (Q4) |
|---|---|---|---|
| Fast general chat | Llama 3.1 8B | 8B | 5 GB |
| Coding assistant | Qwen 3 14B | 14B | 8 GB |
| Strong reasoning | Gemma 3 27B | 27B | 16 GB |
| Near-frontier quality | Qwen 3 32B | 32B | 20 GB |
| Maximum local capability | Llama 3.3 70B | 70B | 40 GB |
Qwen 3 models have been particularly strong in 2026, with the 32B variant matching or beating GPT-4o on several public benchmarks while running on a single RTX 4090. The Gemma 3 27B from Google is another standout - its 128K context window and multimodal support make it versatile for RAG applications.
Adding a UI and Integrating Into Your Workflow
Open WebUI: A ChatGPT-Like Interface for Local Models
Open WebUI is a self-hosted web interface that connects to Ollama or any OpenAI-compatible API. It gives you multi-model chat, document upload, built-in RAG, conversation history, and MCP tool integration - all running on your own hardware.
Getting it running takes three commands:
```shell
# 1. Make sure Ollama is running
ollama serve

# 2. Pull a model
ollama pull qwen3:14b

# 3. Start Open WebUI via Docker
docker run -d -p 3000:8080 \
  --add-host=host.docker.internal:host-gateway \
  -v open-webui:/app/backend/data \
  --name open-webui \
  ghcr.io/open-webui/open-webui:main
```
Open your browser to http://localhost:3000 and you have a private, multi-model chat interface with RAG capabilities.
Local Models as an OpenAI API Drop-In
Both Ollama and LM Studio expose OpenAI-compatible endpoints. Any code that uses the OpenAI SDK works with a two-line change:
```python
from openai import OpenAI

# Point the standard OpenAI client at the local server
client = OpenAI(
    base_url="http://localhost:11434/v1",  # Ollama; LM Studio defaults to port 1234
    api_key="not-needed"  # required by the SDK, ignored by local servers
)

response = client.chat.completions.create(
    model="qwen3:14b",
    messages=[{"role": "user", "content": "Explain MVCC in PostgreSQL"}],
)
print(response.choices[0].message.content)
```
This makes local models a drop-in replacement for development and testing. Tools like Continue use this same approach to provide local Copilot-style autocomplete in VS Code and JetBrains IDEs.
Limitations and When to Use Cloud Instead
Local LLMs have real constraints:
- No frontier models. GPT-4o, Claude Opus, Gemini Ultra - these are not available for local deployment. The gap has narrowed, but it still exists for complex reasoning and very long context tasks.
- Context window limits. Most local setups top out at 32K-128K tokens in practice. Cloud models offer 200K+ with better performance at those lengths.
- Multi-user throughput. Serving more than a handful of concurrent users locally requires vLLM and serious GPU hardware. API providers handle scaling for you.
- No automatic updates. You manage model versions, security patches, and infrastructure yourself.
The pragmatic approach: use local models for development, testing, privacy-sensitive workloads, and experimentation. Use cloud APIs for production serving at scale and tasks requiring frontier-level quality. Many teams run both - local Ollama for dev, cloud API for production - with the same code thanks to OpenAI-compatible endpoints.
Key Takeaways
- Start with Ollama for CLI-driven workflows, or LM Studio if you prefer a GUI. Both get you running in under 5 minutes.
- VRAM determines what you can run. 8 GB handles 7-8B models. 24 GB handles 27-32B models. Capacity matters more than GPU generation.
- Q4_K_M quantization is the default choice - 75% smaller than FP16 with minimal quality loss.
- Qwen 3 and Gemma 3 are the strongest open model families for local use right now.
- Open WebUI + Ollama gives you a private ChatGPT-like setup with RAG in three commands.
- Local for dev, cloud for prod is a practical hybrid that gives you the best of both worlds.