DeepSeek On Old RTX GPUs
The world is fascinated by DeepSeek. I am just annoyed at how hard it hit my portfolio, but I also believe that is temporary. If their model training techniques can be replicated, the whole industry will benefit. This article is not about training but about running models on a GPU that the documentation says is not sufficient. The docs are wrong. I am going to save you money by stopping you from buying a new system you do not need.
First, this is about running quantized and distilled models. The regular models are too big to run on a fairly typical desktop computer. What is the difference? Quantized and/or distilled models are smaller and hence use much less memory. This also means they run faster (at the end of this article I provide a little more detail).
Sources all over the web show tables of requirements. Here is a typical one from GPU System Requirements for Running DeepSeek-R1, with the inaccurate section clearly marked.
My Setup
- RTX 2060 OC edition with 12GB VRAM (not overclocked)
- 48GB of main memory
- AMD Ryzen 5 5600X (not overclocked)
- NVMe drive (1 TB M.2)
- Serving the models from Ollama
- Windows 11 but running the model in WSL using Ubuntu
This is nowhere near the power of the systems in the above table, but I can run every model in the red circle perfectly. You can buy my GPU on eBay for as little as $150. There is also a critical system RAM requirement that the table completely neglects: you need 20–24 GB of RAM.
High System RAM Need Explained
NVIDIA has a very useful feature called unified memory: if a model is too big for the GPU's VRAM, it spills over into main RAM. This is why main RAM usage climbs so high when the model runs. See Unified Memory for CUDA Beginners | NVIDIA Technical Blog.
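If you want to watch this happen on your own machine, here is a minimal sketch (my own, not part of Ollama or NVIDIA's tooling) that polls system RAM with psutil and VRAM with nvidia-smi while a model loads:

```python
# monitor_mem.py -- rough sketch: poll system RAM and GPU VRAM usage while a model loads.
# Assumes nvidia-smi is on your PATH (it ships with the NVIDIA driver) and `pip install psutil`.
import subprocess
import time

import psutil


def gpu_mem_used_mib() -> int:
    """Ask nvidia-smi for the VRAM currently in use, in MiB."""
    out = subprocess.check_output(
        ["nvidia-smi", "--query-gpu=memory.used", "--format=csv,noheader,nounits"],
        text=True,
    )
    # One line per GPU; this setup has a single RTX 2060.
    return int(out.strip().splitlines()[0])


if __name__ == "__main__":
    # Stop with Ctrl+C.
    while True:
        ram = psutil.virtual_memory()
        print(
            f"system RAM used: {ram.used / 2**30:5.1f} GiB | "
            f"GPU VRAM used: {gpu_mem_used_mib() / 1024:5.1f} GiB"
        )
        time.sleep(2)  # print a sample every couple of seconds
```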
Ollama Model Runner
Why Ollama? I tried a few approaches, and this was the simplest by far. Here are the instructions I followed: Running DeepSeek-R1 Model on Your Local Machine. The only thing to be aware of is that Ollama starts automatically, so don't launch it with “ollama serve” or you will get an “address already in use” error. Instead, just run the model with:
ollama run deepseek-r1:14b
replacing 14b with the model you want (1.5b, 7b, 8b, or 14b).
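You can also script prompts instead of using the interactive console. The sketch below assumes the default Ollama HTTP endpoint on port 11434 and a model you have already pulled; the prompt is just an example:

```python
# ask_deepseek.py -- send one prompt to a locally running Ollama server.
# Assumes Ollama is running on its default port (11434) and the model has been pulled.
# Requires: pip install requests
import requests

resp = requests.post(
    "http://localhost:11434/api/generate",
    json={
        "model": "deepseek-r1:14b",   # swap in 1.5b, 7b, or 8b as needed
        "prompt": "Explain unified memory in one paragraph.",
        "stream": False,              # return the full answer as one JSON object
    },
    timeout=600,  # the 14b model can take a while on an older GPU
)
resp.raise_for_status()
print(resp.json()["response"])
```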
Model Sizes
What is the “b” or “B” in model names? It stands for billion and refers to the approximate number of model parameters. For example, 14b has 14 billion parameters. This is NOT the same as model size: to calculate size and memory usage you also have to know how big each parameter is. Parameter sizes are measured in bits, and models typically use 32, 16, 8, or 4. Multiply the number of parameters by the size of each parameter in bits, then divide by 8 to get the number of bytes. Note that models contain more than just parameters, but this will get you close.
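As a rough worked example of that arithmetic (ignoring the non-parameter overhead just mentioned):

```python
def approx_model_size_gb(params_billion: float, bits_per_param: int) -> float:
    """Parameters x bits per parameter / 8 -> bytes, then convert to gigabytes."""
    bytes_total = params_billion * 1e9 * bits_per_param / 8
    return bytes_total / 1e9


# The 14b model at common precisions:
for bits in (32, 16, 8, 4):
    print(f"14B parameters at {bits:>2}-bit: ~{approx_model_size_gb(14, bits):.0f} GB")
# 32-bit: ~56 GB, 16-bit: ~28 GB, 8-bit: ~14 GB, 4-bit: ~7 GB
```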
Show Me
The video shows how much main RAM and video RAM usage go up as I start WSL and then start serving DeepSeek models. I am running the 14b model that supposedly requires an RTX 4080 and 16GB of VRAM. That is clearly not the case, as you can see. I did not tune anything. Don’t know how ;)
Conclusion
I hope this helps you run DeepSeek and experiment with it. I found it interesting to run the “dumb” 1.5b model and compare its answers to the 14b model. 1.5b does not have the same level of knowledge (it simply does not know some things at all) or the ability to provide answers as detailed when it does know something. 14b, though, is easy to run and performs well, even when compared to other models like GPT-4, Claude, and OpenAI Mini (see DeepSeek R1 Distilled Models in Ollama: Not What You Think | by Kshitij Darwhekar | Jan, 2025 | Towards AI).
Quantization and Distillation
There are many detailed references on this, so I will just present the gist. A distilled model is a “student” model that learns from a more complex “teacher” model. It is a type of transfer learning, but with the dual goals of making the model smaller while keeping its performance high. The student is trained on prompts and the teacher model’s outputs for those prompts; I have seen some cases where the regular training data and process are also used. The overall process is faster and uses less money and power, while producing a smaller model that is easier to deploy and run on less powerful systems. An analogy is taking a class that teaches you just how to pass an exam rather than teaching you the subject thoroughly.
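For the curious, here is a toy PyTorch sketch of the classic logit-matching form of distillation. It is only meant to illustrate the student/teacher idea; it is not DeepSeek's actual training recipe, which distills from the teacher's generated outputs.

```python
# distill_step.py -- minimal sketch of one knowledge-distillation loss computation.
# Classic logit-matching formulation; not DeepSeek's exact procedure.
# Requires: pip install torch
import torch
import torch.nn.functional as F


def distillation_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
    """Blend a soft loss (match the teacher) with a hard loss (match the labels)."""
    # Soft targets: compare temperature-softened distributions; T*T keeps gradients scaled.
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)
    # Hard targets: ordinary cross-entropy against the true labels.
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1 - alpha) * hard


# Toy usage with random "logits" over a 10-token vocabulary:
student = torch.randn(4, 10, requires_grad=True)
teacher = torch.randn(4, 10)
labels = torch.randint(0, 10, (4,))
loss = distillation_loss(student, teacher, labels)
loss.backward()
print(f"distillation loss: {loss.item():.3f}")
```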
Quantization is a much simpler concept. My company did data compression work, and since model quantization is just a type of compression, it was an area of interest that I worked on. A highly precise model uses parameters with high precision, such as 32-bit floating point numbers. Is this precision really needed? It depends. It could help with some problems, but the gains are minimal, and for most models the extra precision does not matter. It is a question of how many significant digits you need and how changing that affects the error rate, which is studied with numerical analysis techniques.
Quantization can also let a model run on different architectures. You cannot run a 32-bit model on a 16-bit microcontroller, for example, but you could quantize it to 16-bit floating point (FP16) and then run it. You could further quantize the model and run it on an 8-bit microcontroller using 8-bit integers (INT8) to represent parameters. To quantize, you convert the parameters to a lower-precision numerical representation, rewriting the model while discarding the higher-precision bits in each parameter and cutting the size down dramatically. 32-bit parameters can be converted to 16-bit, 8-bit, or even 4-bit (INT4). Quantization lets you run AI models on IoT devices and wearables.
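To make that concrete, here is a toy sketch of symmetric INT8 quantization of some FP32 weights using NumPy. Real quantizers are more sophisticated (per-channel scales, calibration, 4-bit packing), but the core idea is the same:

```python
# quantize_sketch.py -- symmetric per-tensor INT8 quantization of FP32 weights.
# A toy illustration of the idea, not a production quantizer.
import numpy as np

rng = np.random.default_rng(0)
weights_fp32 = rng.normal(0, 0.02, size=4096).astype(np.float32)

# One scale maps the largest magnitude onto the INT8 range [-127, 127].
scale = np.abs(weights_fp32).max() / 127.0
weights_int8 = np.clip(np.round(weights_fp32 / scale), -127, 127).astype(np.int8)

# Dequantize to see how much precision was thrown away.
weights_restored = weights_int8.astype(np.float32) * scale
max_err = np.abs(weights_fp32 - weights_restored).max()

print(f"storage: {weights_fp32.nbytes} bytes -> {weights_int8.nbytes} bytes")
print(f"worst-case round-trip error: {max_err:.6f} (scale = {scale:.6f})")
```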
Performing distillation followed by quantization results in dramatically smaller models. DeepSeek-R1 has 671 billion 8-bit parameters. DeepSeek-R1-Distill-Qwen-14B has 14 billion 4-bit parameters.
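Plugging those numbers into the parameter-size formula from the Model Sizes section (and again ignoring non-parameter overhead) shows the scale of the reduction:

```python
# Approximate model size: parameters x bits per parameter / 8 bytes.
full_r1   = 671e9 * 8 / 8 / 1e9   # ~671 GB for DeepSeek-R1 at 8-bit
distilled = 14e9  * 4 / 8 / 1e9   # ~7 GB for the 14B distill at 4-bit
print(f"~{full_r1:.0f} GB vs ~{distilled:.0f} GB -> roughly {full_r1 / distilled:.0f}x smaller")
```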