Coding with Local Agents on an RTX 3090
TL;DR: Running coding agents on local machines has never been easier. This article gives step-by-step instructions for running Qwen 3.6 27B on an RTX 3090 under Linux, then shows how to use the model in VS Code via the LLM Gateway extension. By the end of this guide, you’ll be free of service providers and able to run a variety of OSS models.
Overview
There are roughly two steps to running a local coding agent:
- Get the model up and running, serving the standard chat API.
- Connect the model to your coding environment (e.g., VS Code).
There are hundreds of different OSS models, and hundreds of different model servers to choose from. You have, frankly, an overwhelming number of options to fulfill step #1. That said, if you’re looking to run these models on consumer-grade hardware, you will be looking at models in the 7B-31B parameter range. Here is one site, of many, that tries to rank these beasts: Artificial Analysis
For this guide, I will focus on Qwen 3.6 27B from Alibaba since it works well enough. But Gemma 4 31B from Google is a champ and also worth a look.
There is a wonderful arms race happening with model servers right now too. A model server is a giant math library, optimized into oblivion, that deigns to run an HTTP server so it can service requests. But it also has one more crucial component: the KV cache, a caching layer that keeps as much of each chat conversation in GPU memory as possible to minimize latency and recomputation.
For this guide, I will focus on llama.cpp since it is pretty popular, easy to use, and has good GPU support. But there are a number of other servers that are worth looking at, including vLLM, Ollama, MLX-LM, MTPLX, and on and on.
Download the Model
This is both the easy part and the hard part. Easy, because all you have to do is go to Hugging Face and download any of the thousands of models available. Hard, because there are so many models! There are model families, model sizes, model fine-tunes, model quantizations, model formats. Oh my!
Most inference engines (like llama.cpp) support a specific set of model formats, which narrows down your options. For llama.cpp, the supported format is GGUF, so you’ll want to look for models in that format. For MLX models (to run on Apple Silicon), you’ll look in the mlx-community organization on Hugging Face.
You’ll now need to pick a quantization size. Quantization is a compression method for model weights. If we took a 27 billion parameter model with 32-bit floating point weights, it would be 27B * 32 bits = 108 GB in size. Unless you have a datacenter handy, you won’t be running that. Instead, you’ll choose, say, a 4-bit quantized model. This compresses the weights down to 27B * 4 bits = 13.5 GB, which is much more manageable for consumer hardware. The tradeoff is that quantization can reduce the model’s performance and accuracy, but it’s often a necessary compromise.
Now, the RTX 3090 has 24 GB of VRAM, so you might be tempted to pick a higher-bit quantization, but keep in mind that the context and the output also have to fit in GPU memory. If you want long contexts and long outputs, you might have to go with a lower-bit quantization to ensure everything fits.
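To make that arithmetic concrete, here’s a tiny back-of-the-envelope calculator. Real quantization formats like Q4_K_M mix bit widths and store per-block scale metadata, so treat these numbers as rough lower bounds:

```python
# Back-of-the-envelope model size at different quantization levels.
PARAMS = 27e9  # a 27B-parameter model

for bits in (32, 16, 8, 4):
    gb = PARAMS * bits / 8 / 1e9  # bits -> bytes -> GB
    print(f"{bits:2d}-bit: {gb:6.1f} GB")
```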
The Q4_K_M quantization format is a good compromise for a 27B model and a 24 GB GPU. So I’m going to download the Qwen 3.6 27B Q4_K_M model from Hugging Face:
wget "https://huggingface.co/unsloth/Qwen3.6-27B-GGUF/resolve/main/Qwen3.6-27B-Q4_K_M.gguf?download=true"
(wget is a little dumb, so you’ll need to rename the file after downloading it since it doesn’t handle the ?download=true part of the URL very well.)
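If you’d rather skip wget entirely, the huggingface_hub Python package can fetch the same file; here’s a minimal sketch using the repo and filename from the URL above:

```python
# pip install huggingface_hub
from huggingface_hub import hf_hub_download

# Downloads into the local Hugging Face cache and returns the file's path.
path = hf_hub_download(
    repo_id="unsloth/Qwen3.6-27B-GGUF",
    filename="Qwen3.6-27B-Q4_K_M.gguf",
)
print(path)
```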
Build llama.cpp
You can download prebuilt binaries of llama.cpp, but if you want to ensure it’s optimized for your machine and hardware, you’ll want to build it yourself. Thankfully, that’s pretty easy to do:
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
cmake -B build -DGGML_CUDA=ON
cmake --build build --config Release -j 8
Aside from the nastiness of having to use CMake, building software doesn’t get much easier than this.
I passed the -DGGML_CUDA=ON flag to ensure that I get NVIDIA CUDA support, which is crucial for running these large models on consumer-grade hardware. If you’re on an M-series Mac, the equivalent is the Metal backend: recent llama.cpp builds typically enable it by default on Apple Silicon, but you can pass -DGGML_METAL=ON to be explicit.
If all goes well, you will have a nice, shiny build/bin/llama-server executable that you can use to serve your model.
Run the Server
You will want to run the server with a delicious soup of command line arguments. Something like this:
./build/bin/llama-server -m ~/Downloads/Qwen3.6-27B-Q4_K_M.gguf --host 0.0.0.0 -ngl 99 -c 262144 -fa on --cache-type-k q4_0 --cache-type-v q4_0
Let’s deconstruct that soup:
| Argument | Description |
|---|---|
| `-m` | The path to the model file you downloaded. |
| `--host 0.0.0.0` | Tells the server to listen on all network interfaces, which is necessary if you want to connect to it from another machine (e.g., your dev machine). |
| `-ngl 99` | The number of model layers to offload to the GPU. 99 is more layers than the model has, so this effectively says "offload everything," which maximizes performance. |
| `-c 262144` | Sets the context size to 262,144 tokens, the maximum for this model. You can adjust this based on your needs and GPU memory constraints. |
| `-fa on` | Enables flash attention, which reduces memory use and can improve performance. |
| `--cache-type-k q4_0 --cache-type-v q4_0` | Quantizes the KV cache to q4_0, a good tradeoff between performance and memory efficiency. |
Notice how we are quantizing the KV cache (context and outputs) as well. This is a crucial step for ensuring that the model runs efficiently on consumer-grade hardware, as the KV cache can consume a significant amount of GPU memory.
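To get a feel for why, here is a rough KV cache size estimate. The model dimensions below are hypothetical placeholders (picked so the totals land near the memory breakdown in the logs that follow), not Qwen’s published architecture; the point is the scaling:

```python
# Rough KV cache size: 2 (K and V) * layers * context * KV heads * head dim * bytes/elem.
# The model dimensions here are HYPOTHETICAL placeholders, not Qwen's real config.
N_LAYERS, N_KV_HEADS, HEAD_DIM = 36, 4, 128
CTX = 262_144  # the -c value from above

# f16 is 2 bytes/element; q4_0 works out to roughly 4.5 bits (~0.5625 bytes)
# per element once the per-block scale factors are included.
for name, bytes_per_elem in (("f16", 2.0), ("q4_0", 0.5625)):
    gib = 2 * N_LAYERS * CTX * N_KV_HEADS * HEAD_DIM * bytes_per_elem / 2**30
    print(f"{name:5s}: {gib:4.1f} GiB at {CTX:,}-token context")
```

With numbers in that ballpark, an f16 cache alone would swallow most of the card, while q4_0 leaves room for the weights.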
You’ll be greeted with typical programmer excretions:
ggml_cuda_init: found 1 CUDA devices (Total VRAM: 24159 MiB):
Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes, VRAM: 24159 MiB
main: n_parallel is set to auto, using n_parallel = 4 and kv_unified = true
build_info: b9026-a817a22bc
system_info: n_threads = 6 (n_threads_batch = 6) / 12 | CUDA : ARCHS = 860 | USE_GRAPHS = 1 | PEER_MAX_BATCH_SIZE = 128 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | LLAMAFILE = 1 | OPENMP = 1 | REPACK = 1 |
Running without SSL
init: using 11 threads for HTTP server
start: binding port with default address family
main: loading model
srv load_model: loading model '/home/fak/Downloads/Qwen3.6-27B-Q4_K_M.gguf'
common_init_result: fitting params to device memory, for bugs during this step try to reproduce them with -fit off, or provide --verbose logs if the bug only occurs with -fit on
common_params_fit_impl: getting device memory data for initial parameters:
common_memory_breakdown_print: | memory breakdown [MiB] | total free self model context compute unaccounted |
common_memory_breakdown_print: | - CUDA0 (RTX 3090) | 24159 = 23257 + (21388 = 15345 + 5206 + 836) + -20486 |
common_memory_breakdown_print: | - Host | 1214 = 682 + 0 + 532 |
common_params_fit_impl: projected to use 21388 MiB of device memory vs. 23257 MiB of free device memory
common_params_fit_impl: will leave 1868 >= 1024 MiB of free device memory, no changes needed
common_fit_params: successfully fit params to free device memory
common_fit_params: fitting params to free memory took 0.66 seconds
llama_model_loader: loaded meta data with 51 key-value pairs and 851 tensors from /home/fak/Downloads/Qwen3.6-27B-Q4_K_M.gguf (version GGUF V3 (latest))
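Once the model is loaded, it’s worth a quick smoke test before wiring up an editor. llama-server speaks the standard OpenAI-style chat API, so a few lines of Python (standard library only) can confirm everything works; adjust the host and port if yours differ:

```python
import json, urllib.request

# Assumes llama-server is running on its default port (8080) on this machine.
req = urllib.request.Request(
    "http://localhost:8080/v1/chat/completions",
    data=json.dumps({
        "messages": [{"role": "user", "content": "Write hello world in C."}],
    }).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as resp:
    reply = json.load(resp)

print(reply["choices"][0]["message"]["content"])
```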
Congratulations. You’re now an AI service provider. I recommend raising some seed capital and selling access to your model to the highest bidder.
But before you do that…
Install LLM Gateway in VS Code
I rock VS Code for all my coding needs, and I want to be able to use my local model in its AI agent chat window thingy. To do that, I need to install an extension that connects VS Code to the standard chat API. (Why VS Code doesn’t support the API standard that literally every LLM server provides is beyond me.)
ANYWAY, I like the LLM Gateway extension by Andrew Butson.
- Install that extension.
- Open the “GitHub Copilot LLM Gateway: Configure Server” UI from the command palette and enter the URL for your server (e.g., http://my-awesome-server.local:8080).
- Test the connection with the “GitHub Copilot LLM Gateway: Test Server Connection” command. It should say “Found 1 model(s)” if everything is working. (If it’s not working, email James Montemagno and ask him for help.)
- Open the “Chat: Manage Language Models” UI from the command palette. You should see your model listed, but it will probably be grayed out for some reason. Click it, click the eyeball (gross!), and it should now be active and ready to use in the chat window.
- Open the chat window, and click the model selector. Choose “Other Models”, scroll, and scroll, looking for your model. It’s there somewhere. I promise. You might doubt it, but have faith. When in doubt, keep scrolling. You can do it. You found it! Click it, and now you can use your local model in the chat window!
Is it Worth It?
What does an RTX machine cost these days?
| Component | Price (USD) |
|---|---|
| RTX 3090 | $1,500 |
| CPU | $300 |
| 64 GB RAM | $700 (what has the world come to?) |
| HDD | $200 |
| PSU | $150 |
| Case | $100 |
| Total | $2,950 |
So for about $3,000 you can have your very own local coding agent. That’s a pretty hefty price tag, but it’s also a one-time cost.
In a typical day, I burn through about 50,000,000 tokens: 500,000 output tokens, 1,750,000 input tokens, and the rest cache hits. At 40 tokens/second (typical for my RTX), my compute day is about (500,000 + 1,750,000) / 40 = 56,250 seconds, or roughly 15.6 hours of compute time per day. Ugh.
Right now, you can use DeepSeek for $3.48 per 1,000,000 output tokens, $1.74 for inputs, and $0.0145 for cache hits. So my daily cost would be (500,000 / 1,000,000) * 3.48 + (1,750,000 / 1,000,000) * 1.74 + (47,750,000 / 1,000,000) * 0.0145 = $5.48 per day. That’s about $1,400 per year (five-day work weeks). So in about two years, I would recoup the hardware cost of my own local agent. Hmmm…
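If you want to plug in your own numbers, here’s that break-even arithmetic as a small script (all the figures are the ones from this article; swap in your own token mix and prices):

```python
# Break-even estimate: one-time hardware cost vs. daily API spend.
HARDWARE_USD = 2950

# Daily token mix (from the usage described above).
out_tok, in_tok, cache_tok = 500_000, 1_750_000, 47_750_000

# Example API prices in USD per million tokens.
PRICE_OUT, PRICE_IN, PRICE_CACHE = 3.48, 1.74, 0.0145

daily = (out_tok * PRICE_OUT + in_tok * PRICE_IN + cache_tok * PRICE_CACHE) / 1e6
yearly = daily * 5 * 52  # five-day work weeks
print(f"daily: ${daily:.2f}, yearly: ${yearly:.0f}, "
      f"break-even: {HARDWARE_USD / yearly:.1f} years")
```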
So you might not want to run out and buy your own server. But, if you do have an over-provisioned gaming rig, well you might as well put it to use doing something useful. ;-)
Conclusion
Since 2017 I have been advocating for running local models. I’m amazed that it’s now possible to run 27B-parameter models on consumer hardware. (In my mind, 7B is still tremendous.) These are real models, able to write good code, in a fully agentic harness. Amazing.
While the up-front hardware cost, the fan noise, and the slower response rates don’t make this an easy win, I have a different perspective. AI coding has changed how I work. Permanently. I do not want to go back to writing every line of code by hand; it seems absurd now. But I also don’t like being at the mercy of large cloud providers. Having the ability to run my own local agent, even with its limitations, is a huge win for me. I know that, even with no internet connection, I can still do what I love: code.
Colophon: Written by hand. Proofread and edited by Qwen 3.6 27B running on an RTX 3090.