Best Local LLM for 8GB VRAM: Optimal Settings for AI Roleplay & ERP
A blunt 2026 guide to making 8GB cards work for local roleplay: what fits, what slows down, and which settings actually earn their place.
Every 8GB guide eventually turns into the same smug advice: buy more VRAM.
Useless when the card is already inside your machine.
The good news is that 8GB still works. The bad news is that 8GB only works when you stop lying to yourself about what the number means.
1. The VRAM Tax: OS and Browser Overhead
On a normal desktop, 8GB does not mean 8GB of free memory for inference:
- Windows (DWM): Can reserve 1.5GB – 2.5GB of VRAM to render the desktop, with the high end showing up on high-resolution or multi-monitor setups.
- macOS (WindowServer): Consumes 1.5GB – 2.0GB. On Apple Silicon this comes out of unified memory shared with the CPU rather than dedicated VRAM.
- Linux (GNOME/KDE): Consumes 1.0GB – 2.0GB depending on the compositor.
- Browsers (Chrome/Firefox with hardware accel): Consume 1.0GB – 2.0GB for a few active tabs.
- Discord / Spotify / Electron apps: Consume another 0.5GB – 1.0GB combined.
If you run a standard desktop setup, you start with roughly 4.5GB – 5.5GB of usable VRAM. If your model weights plus Key-Value (KV) cache exceed that, the overflow lands in system RAM (DDR4/DDR5), and the data has to cross the PCIe bus on every token. This "spilling" can drop generation speed from a fluid 30+ tokens per second to 1–2 tokens per second.
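The ranges above are estimates, so it can help to measure your own machine before picking a model. The sketch below assumes an NVIDIA card with `nvidia-smi` on the PATH; the helper names `parse_memory_line` and `free_vram_mib` are made up for illustration and are not part of any tool.

```python
import subprocess


def parse_memory_line(line: str) -> tuple[int, int]:
    """Parse one 'used, total' CSV line (values in MiB) from nvidia-smi."""
    used, total = (int(field.strip()) for field in line.split(","))
    return used, total


def free_vram_mib(gpu_index: int = 0) -> int:
    """Return free VRAM in MiB for one GPU.

    Assumption: an NVIDIA driver is installed and nvidia-smi is on the PATH.
    """
    result = subprocess.run(
        [
            "nvidia-smi",
            f"--id={gpu_index}",
            "--query-gpu=memory.used,memory.total",
            "--format=csv,noheader,nounits",
        ],
        capture_output=True,
        text=True,
        check=True,
    )
    used, total = parse_memory_line(result.stdout.strip().splitlines()[0])
    return total - used
```

Call `free_vram_mib()` with your usual desktop, browser, and chat apps open. The number it returns is the real budget for weights plus KV cache, not the 8GB printed on the box.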
2. VRAM Allocation Dynamics Table
To prevent memory spilling, you must balance model parameters against context size. The table below outlines memory footprints for GGUF Q4_K_M (4-bit) models:
| Model Size | Quantization | Context Length | Estimated VRAM | 8GB Hardware Viability |
|---|---|---|---|---|
| 3B | Q4_K_M | 8,000 tokens | ~3.2 GB | Excellent. High headroom for background apps. |
| 7B – 8B | Q4_K_M | 4,000 tokens | ~5.8 – 6.0 GB | Optimal. Fits comfortably with basic OS tasks. |
| 7B – 8B | Q4_K_M | 8,000 tokens | ~6.2 GB | Viable. Requires closing heavy browser tabs. |
| 7B – 8B | Q4_K_M | 16,000 tokens | ~7.2 GB | Borderline. Requires strict OS resource pruning. |
| 12B – 14B | Q4_K_M | 4,000 tokens | ~8.6 GB | Exceeds limits. Spills to CPU/RAM immediately. |
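The context-length rows grow because the KV cache scales linearly with token count. A rough way to estimate the raw cache size is sketched below. It assumes an unquantized FP16 cache and the published Llama 3.1 8B shape (32 layers, 8 KV heads, head dimension 128); the function name `kv_cache_gib` is illustrative. The table's totals also include weights and runtime buffers, so treat this as one component of them rather than a replacement.

```python
def kv_cache_gib(n_layers: int, n_kv_heads: int, head_dim: int,
                 n_tokens: int, bytes_per_element: float = 2.0) -> float:
    """Raw KV cache size in GiB.

    Each layer stores one K and one V vector (hence the factor 2)
    per KV head per token. bytes_per_element=2.0 assumes FP16.
    """
    total_bytes = 2 * n_layers * n_kv_heads * head_dim * n_tokens * bytes_per_element
    return total_bytes / 1024**3


# Assumed shape: Llama 3.1 8B (32 layers, 8 KV heads, head dim 128).
for ctx in (4096, 8192, 16384):
    print(f"{ctx} tokens: {kv_cache_gib(32, 8, 128, ctx):.2f} GiB")
```

Under those assumptions the cache alone comes to 0.50, 1.00, and 2.00 GiB, which is why doubling the context on an 8B model can be the difference between fitting and spilling.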
3. Quantization Choices: GGUF vs. AWQ
On 8GB cards, your quantization format dictates stability:
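- GGUF (GGUF-iMat): The gold standard for consumer desktop deployment. It uses block-wise quantization (the K-quant family) and supports Importance Matrix (iMatrix) calibration, preserving most of the model's FP16 quality (around 92% is a rough figure that varies by model and benchmark). Because llama.cpp can split GGUF layers between GPU and CPU, running short on VRAM slows generation down instead of crashing it.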
- AWQ / GPTQ: Built for GPU-only inference stacks. The runtimes that typically serve them expect the whole model and its cache to live in VRAM and do not fall back to the CPU. If your context window grows past your physical VRAM limit, expect a fatal Out of Memory (OOM) error rather than a slowdown.
The 1-Bit Exception: PrismML Bonsai
If you need an ultra-low footprint, look at PrismML Bonsai 8B, released in 2026. Running on a custom 1-bit GGUF fork, it packs an 8B-parameter model into a 1.15GB VRAM footprint. It requires specialized inference kernels, but it lets you run a full 16K context window in less than 2GB of VRAM.
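The footprint figures in this section come down to parameters times bits per weight. The sketch below shows that arithmetic. The helper names are illustrative, and the 4.85 bits per weight used for Q4_K_M is an approximate assumption (K-quants mix several bit widths, so the real figure varies by model).

```python
def weights_gb(n_params_billion: float, bits_per_weight: float) -> float:
    """Approximate weight footprint in GB (10^9 bytes)."""
    return n_params_billion * bits_per_weight / 8


def bits_per_weight(footprint_gb: float, n_params_billion: float) -> float:
    """Effective bits per weight implied by a known footprint."""
    return footprint_gb * 8 / n_params_billion


# FP16 baseline for an 8B model: 16 bits per weight.
print(f"FP16:   {weights_gb(8, 16):.2f} GB")
# ASSUMPTION: Q4_K_M averages roughly 4.85 bits per weight.
print(f"Q4_K_M: {weights_gb(8, 4.85):.2f} GB")
# The 1.15GB figure quoted for Bonsai 8B implies about 1.15 bits per weight.
print(f"Bonsai: {bits_per_weight(1.15, 8):.2f} bits/weight")
```

The same two functions work for any model size, so you can sanity-check a download before it lands on disk.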
4. Practical CLI Execution Snippets
Here are copy-pasteable configurations to execute local models on 8GB hardware via llama-server (llama.cpp) or Ollama:
Config A: Fast Coding / Autocomplete Assistant
- Model: Qwen 2.5 Coder 3B GGUF (Q4_K_M)
- Command (Ollama):
    # Enable Flash Attention and restrict parallel requests to save VRAM.
    # The Ollama server reads these variables, so set them where `ollama serve` runs.
    export OLLAMA_FLASH_ATTENTION=1
    export OLLAMA_NUM_PARALLEL=1
    ollama run qwen2.5-coder:3b
Config B: Core Desktop Assistant (Daily Generalist)
- Model: Llama 3.1 8B Instruct GGUF (Q4_K_M)
- Command (llama-server):
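    # -c 8192 limits context, -fa on enables Flash Attention, -ngl 99 forces all layers onto the GPU.
    # Recent llama.cpp builds expect a value after -fa (on, off, or auto); older builds take a bare -fa.
    ./llama-server -m llama-3.1-8b-instruct.Q4_K_M.gguf -c 8192 -fa on -ngl 99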
Config C: 12B Model Roleplay (Advanced Context Shifting)
- Model: Mistral NeMo 12B Instruct GGUF (iMat Q4_K_M)
- Command (llama-server):
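    # --cache-type-k / --cache-type-v q8_0 store the Key-Value cache in 8-bit (the quantized V cache needs Flash Attention).
    # --n-gpu-layers 41 places 41 layers on the GPU and leaves any remaining layers on the CPU.
    # Check the layer count llama-server prints at load time and lower the value if VRAM overflows.
    ./llama-server \
      --model Mistral-Nemo-Instruct-12B-iMat-GGUF-Q4_K_M.gguf \
      --flash-attn on \
      --cache-type-k q8_0 --cache-type-v q8_0 \
      --ctx-size 16384 \
      --n-gpu-layers 41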
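To see what the q8_0 cache buys at 16K context, the sketch below compares it against an FP16 cache. It relies on the q8_0 block layout (32 int8 values plus one FP16 scale, so 34 bytes per 32 elements) and assumes the published Mistral NeMo 12B shape of 40 layers, 8 KV heads, and head dimension 128. The names are illustrative.

```python
# Bytes stored per cached element for each cache type.
# q8_0 packs 32 int8 values plus one fp16 scale into a 34-byte block.
BYTES_PER_ELEMENT = {"f16": 2.0, "q8_0": 34 / 32}


def kv_cache_gib(cache_type: str, n_layers: int, n_kv_heads: int,
                 head_dim: int, n_tokens: int) -> float:
    """Raw KV cache size in GiB for a given cache type."""
    elements = 2 * n_layers * n_kv_heads * head_dim * n_tokens  # K and V
    return elements * BYTES_PER_ELEMENT[cache_type] / 1024**3


# ASSUMPTION: Mistral NeMo 12B uses 40 layers, 8 KV heads, head dim 128.
f16 = kv_cache_gib("f16", 40, 8, 128, 16384)
q8 = kv_cache_gib("q8_0", 40, 8, 128, 16384)
print(f"f16 cache:  {f16:.2f} GiB")
print(f"q8_0 cache: {q8:.2f} GiB")
print(f"saved:      {f16 - q8:.2f} GiB")
```

Under those assumptions the cache shrinks from 2.50 GiB to about 1.33 GiB, which on an 8GB card is several extra layers that can stay on the GPU.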
Config D: 35B MoE Model (Extreme Hybrid Execution)
- Model: Qwen 3.6 35B MoE GGUF (Q4_K_M — Active parameters: 3B)
- Command (llama-server):
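    # -ngl 99 offloads every layer to the GPU, then --n-cpu-moe 38 keeps the expert weights of the first 38 layers in system RAM.
    # Attention and the KV cache stay in VRAM while the bulky experts run on the CPU, which keeps token throughput high.
    ./llama-server -m Qwen3.6-35B-A3B-Q4_K_M.gguf \
      -ngl 99 --n-cpu-moe 38 \
      -c 8192 -fa on \
      --cache-type-k q8_0 \
      --reasoning-budget-message "\n\nStop thinking and answer now."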
Stop treating your machine like an unlimited cloud server. Keep your context lengths capped, enable Flash Attention, use quantized KV caches, and close background tabs. That is how you turn an 8GB card into a fast, disciplined local stack.