Follow me on X: https://x.com/thomasunise

DeepSeek R1 is a 671 billion parameter mixture-of-experts language model released by a Chinese AI lab that, in January 2025, caused a minor panic in Silicon Valley when it matched or outperformed models that cost fifty times as much to train.

The panic wasn’t entirely about the benchmarks. It was about what the benchmarks implied: that the expensive moat everyone assumed existed around frontier AI might be more navigable than anyone wanted to admit.

That’s interesting context. But it’s not why you’re reading this.

You’re reading this because 671 billion parameters sounds like a number that lives in a data center, and someone told you it doesn’t have to. They were right. The architecture that makes DeepSeek R1 so capable is the same architecture that makes it runnable on hardware you can buy used on eBay, assemble in an afternoon, and stick in your home office next to your router and your cold brew setup.

This article is going to tell you exactly how to do that. Not at a surface level. All the way down.

What Mixture of Experts Actually Means for Your Hardware

Before we talk about servers and RAM, you need to understand why DeepSeek R1’s architecture is different from a dense model like GPT, and why that difference is the entire reason this is buildable at home.

A dense model activates every parameter on every token it generates. If a dense model has 70 billion parameters, every single one of those parameters participates in generating each word of output. This is computationally expensive. It also means if you want to run a 70B dense model, you need enough memory to hold all 70 billion parameters in fast storage simultaneously.

A Mixture of Experts model does something different. It has a large total parameter count, but only a fraction of those parameters activate for any given token. DeepSeek R1 has 671 billion total parameters, but only 37 billion are active at any one time during inference. The model maintains a routing layer that decides, token by token, which subset of “expert” subnetworks to involve in the computation. The rest sit dormant.

The implication for hardware is significant. What you need to hold in fast GPU memory isn’t 671 billion parameters worth of data. It’s 37 billion parameters worth. The remaining 634 billion parameters can live in slower system RAM and get pulled into VRAM on demand when specific experts are called.

This is called CPU offloading, and it is the technical mechanism that makes a $1,500 home build possible. The GPU handles the hot layers that activate constantly. System RAM holds the cold experts that get called occasionally. The model runs slower than it would on a rack of H100s, but it runs. And for a home user, 15 to 20 tokens per second is absolutely fine. That’s fast enough that you’re reading the output as it streams, not staring at a loading bar.

Understanding this architecture matters because it’s going to inform every hardware decision you make. You’re not trying to fit 671 billion parameters into VRAM. You’re trying to fit 37 billion active parameters into VRAM and 634 billion cold parameters into the cheapest fast RAM you can find.
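If you want to see the routing idea in miniature, here’s a toy sketch in plain Python. The dimensions are made up for illustration (R1’s real MoE layers route each token to 8 of 256 experts, and the gate is a learned projection, not random weights), but the mechanism is the same: score every expert, run only the top-k, leave the rest dormant.

```python
# Toy sketch of mixture-of-experts routing. Sizes are illustrative,
# not DeepSeek R1's real dimensions.
import math
import random

NUM_EXPERTS = 8   # R1's MoE layers have 256 routed experts
TOP_K = 2         # R1 activates 8 per token

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def route(token_vec, gate_weights):
    """Score each expert for this token, keep only the top-k."""
    scores = [sum(t * w for t, w in zip(token_vec, row)) for row in gate_weights]
    probs = softmax(scores)
    top = sorted(range(NUM_EXPERTS), key=lambda i: probs[i], reverse=True)[:TOP_K]
    # Renormalize the selected experts' weights so they sum to 1.
    total = sum(probs[i] for i in top)
    return [(i, probs[i] / total) for i in top]

random.seed(0)
dim = 4
gate = [[random.gauss(0, 1) for _ in range(dim)] for _ in range(NUM_EXPERTS)]
token = [random.gauss(0, 1) for _ in range(dim)]

chosen = route(token, gate)
print(chosen)  # only TOP_K of NUM_EXPERTS experts compute anything for this token
```

Every expert the router skips is a block of weights that never needs to be in VRAM for that token, which is the entire basis of the offloading strategy below.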

The Hardware Stack, Component by Component

The Server: Why a Used Dell PowerEdge Makes More Sense Than Anything Else

You could build this on consumer hardware. People do. But there are a handful of reasons why a used enterprise server is the right call for this specific use case, and they compound on each other.

The first reason is the DIMM slots. Running DeepSeek R1 requires 512GB of RAM. No consumer motherboard supports 512GB of RAM. The chipsets don’t allow it. The DIMM slots aren’t there. Consumer platforms top out at 192GB on high-end desktop boards, and even hitting that requires expensive high-density DDR5 modules that erase the cost advantage.

Enterprise servers are designed from the ground up to support massive RAM configurations. A Dell PowerEdge R730xd has 24 DIMM slots across two processor sockets. At 32GB per DIMM, that’s 768GB total capacity. You populate 16 of those slots and you have your 512GB with room to expand.

The second reason is that used enterprise servers are absurdly cheap right now. The R730xd launched around 2015 at a list price north of $15,000 depending on configuration. Enterprise customers on 3 to 5 year refresh cycles retired these machines in massive quantities as they migrated workloads to cloud infrastructure. The secondary market is flooded. You can buy a dual-processor R730xd with rails and a power supply for $300 to $500 on eBay today. The hardware is excellent. It’s just old enough that it has no resale value to enterprises anymore, which is your gain.

The third reason is that enterprise servers are designed to be serviced. The R730xd doesn’t require a screwdriver to open. Everything is tool-free accessible. The documentation exists. The error codes are standardized. iDRAC (Dell’s remote management interface) gives you out-of-band access to the machine even when the OS is down. These aren’t nice-to-haves. When something goes wrong at 11pm, they matter.

The specific configuration you want:

  • Dell PowerEdge R730xd
  • Dual Intel Xeon E5-2699 v4 (22 cores each, 44 total cores, 88 threads)
  • 2U form factor
  • Verify the seller includes the 2.5″ backplane (you want NVMe-capable)

The E5-2699 v4 specifically is worth targeting because the v4 generation supports DDR4-2400 memory speeds (v3 processors cap out at DDR4-2133) and tops the product line at 22 cores per socket, which matters for a memory-bandwidth-bound workload like this. A quick check of the listing: look for “v4” in the CPU description. If the seller doesn’t know, ask or skip it.

Budget: $300 to $500 for the server.

The RAM: 512GB of DDR4 ECC Registered Memory

This is the single most important component for running DeepSeek R1. The model in Q4 quantization occupies roughly 400GB of memory. You need 512GB total to give the model its space and have overhead for the OS, inference software, and any other processes running.
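The 400GB figure isn’t arbitrary, and a back-of-envelope check shows why 512GB is the target. The ~4.8 bits per weight used here is an approximation of Q4_K_M’s average (it mixes 4-bit and higher-precision blocks), not an exact spec:

```python
# Back-of-envelope: why Q4 lands near 400GB and 512GB of RAM is the target.
# 4.8 bits/weight is an approximate average for Q4_K_M quantization.
TOTAL_PARAMS = 671e9
BITS_PER_WEIGHT_Q4_K_M = 4.8

model_gb = TOTAL_PARAMS * BITS_PER_WEIGHT_Q4_K_M / 8 / 1e9
print(f"model in RAM: ~{model_gb:.0f} GB")          # ~403 GB

installed_gb = 512
headroom_gb = installed_gb - model_gb
print(f"headroom for OS + stack: ~{headroom_gb:.0f} GB")  # ~109 GB
```

That ~100GB of headroom is what keeps the OS, the inference server, and the KV cache from fighting the model for memory.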

You’re looking for DDR4-2400 or DDR4-2666 ECC Registered (RDIMM) 32GB modules. ECC is not optional for this use case. ECC (Error-Correcting Code) memory detects and corrects single-bit memory errors in real time. When you have 512GB of RAM running a sustained inference workload for hours, the probability of a memory error that corrupts output or crashes the process without ECC is non-trivial. Enterprise servers use ECC for a reason. Use ECC.

Registered (buffered) is the DIMM type required by the server’s memory controller. Consumer UDIMM modules will not work. The server will not POST with UDIMMs. When you’re buying, look for the letters RDIMM or LRDIMM in the part description.

32GB modules in DDR4 are selling for $20 to $30 each on eBay right now. You need 16 of them. That’s $320 to $480 for 512GB of ECC registered DDR4. Let that sink in for a moment. You can buy half a terabyte of enterprise RAM for less than one month of heavy Claude API usage.

The population order matters.

Dual-processor servers have a specific sequence for installing DIMMs to maximize memory channel bandwidth. Installing them wrong doesn’t break anything, but you’ll leave performance on the table. Dell’s documentation for the R730xd specifies the exact population order, and it’s worth following. For 16 DIMMs across 24 slots on a dual-socket board, you’re filling channels A through D on both processors in a specific interleaved pattern. Pull the spec sheet from Dell’s support site before you start.

Budget: $320 to $480 for 16x 32GB DDR4 RDIMM.

The GPU: RTX 3090 and Why 24GB VRAM Is the Floor

The GPU is not where the model lives. The model lives in RAM. The GPU is where the active computation happens, and specifically, it’s where the model’s hot layers live permanently during inference.

Without a GPU, inference on DeepSeek R1 via CPU and RAM alone gets you somewhere between 3 and 8 tokens per second depending on memory bandwidth. That’s not unusable, but it’s not good either. With an RTX 3090, you’re looking at 15 to 25 tokens per second for single-user inference. The difference is the GPU’s memory bandwidth and compute throughput handling the attention layers and the constantly-active shared trunk of the model.

Why the RTX 3090 specifically? 24GB of VRAM. This is the floor for a useful GPU on this task. With 24GB, the inference stack can hold the model’s shared experts, embedding layers, and attention mechanisms in VRAM permanently while system RAM handles the cold experts. Below 24GB, you’re offloading too much to RAM and the speedup isn’t worth the cost of the card.

The RTX 3090 sits at $500 to $600 on eBay. The RTX 3090 Ti gets you slightly more performance at $700 to $800 and is worth considering. The RTX 4090 at 24GB VRAM offers significantly higher compute throughput and would push you closer to 30 tokens per second, but you’re looking at $1,200 to $1,500 used and the cost curve starts making less sense for a home build.

One important note about the R730xd and consumer GPUs: the server’s baseboard management controller (BMC) will detect a non-Dell GPU and respond by ramping the chassis fans to full speed. Full speed on enterprise server fans is genuinely loud. Not “slightly annoying” loud. Loud enough that you will close the door to whatever room it’s in. This is fixable with two ipmitool commands after you get Linux running, which we’ll cover in the software section.

You also need to verify PCIe slot clearance. The R730xd has riser cards for its PCIe slots, and a full-length triple-slot GPU like the RTX 3090 requires the correct riser bracket. Some sellers include risers configured for half-height cards. Check what the listing shows or ask. You want a full-height full-length riser. Dell part number for the correct riser is publicly documented.

Budget: $500 to $600 for RTX 3090.

The Storage: NVMe for OS and Model Staging

The model runs from RAM during inference. Once it’s loaded, disk isn’t doing anything. But getting the model loaded is a different story.

DeepSeek R1 in Q4 quantization is approximately 400GB. Loading 400GB from a spinning hard drive into RAM at startup takes the better part of an hour at typical HDD transfer rates. Loading from a fast NVMe SSD takes 5 to 10 minutes.

You want a 2TB NVMe SSD. 2TB gives you room for Ubuntu, the CUDA toolkit and all its dependencies, the inference software stack, the model itself with some breathing room, and whatever else ends up living on the machine. Samsung 980 Pro and WD Black SN850 are reliable choices. Generic off-brand NVMe drives have higher failure rates and it’s not worth gambling on storage.

The R730xd’s NVMe support depends on the riser configuration you have. Some configurations support M.2 NVMe directly, others need a PCIe adapter card. A $15 PCIe to M.2 adapter handles it if native M.2 isn’t available. Verify before you buy the drive.

Budget: $100 to $150 for 2TB NVMe.

Total Budget

Component                  What You’re Buying       Estimated Cost
Dell PowerEdge R730xd      Dual Xeon E5-2699 v4     $300-500
16x 32GB DDR4 ECC RDIMM    512GB total              $320-480
RTX 3090 24GB              Used, eBay               $500-600
2TB NVMe SSD               Samsung or WD Black      $100-150
Total                                               $1,220-1,730

Call it $1,500 as a realistic midpoint. The variance is mostly about when you buy and how patient you are on eBay.

Assembly

Assembly on an R730xd is not difficult. It is designed to be serviced by data center technicians working quickly. Nothing about it is precious or fragile. Two to three hours with no prior server experience is realistic.

The basic sequence:

Remove the top cover (slide release, lift off). The interior is well-organized and labeled. Memory slots are along both sides of the board, clearly marked. Follow Dell’s population guide for dual-processor boards. Seat each DIMM firmly until both retention clips engage. A DIMM that isn’t fully seated will cause boot errors that are annoying to diagnose.

The PCIe riser holds your GPU. The riser slides out, you seat the GPU, slide it back. Make sure the GPU’s power connectors are accessible. The R730xd’s power supply does not have PCIe power cables (it’s a server, not a gaming rig), so you need a PCIe power adapter that converts from the server’s internal power connectors. These are available on Amazon for $10 to $15 and are a standard item for this exact use case.

NVMe goes in via your adapter or native slot. Straightforward.

Close it up, plug in power and ethernet, and you’re ready for software.

Software: From Bare Metal to Running Inference

Operating System

Ubuntu Server 22.04 LTS. Not 24.04. The reason is CUDA compatibility. NVIDIA’s CUDA toolkit has the longest and most stable support history on 22.04, and the inference software you’re going to run (llama.cpp) has the most tested compatibility surface on this version. Choosing a newer OS to be cutting-edge here is not the right call. Choose the boring option that works.

Download the server ISO, flash it to a USB drive with Balena Etcher or Rufus, boot from it on the R730xd. The installer is straightforward. Allocate your NVMe for the OS, set up SSH access during installation, skip the optional snap packages.

Fan Control

Before you do anything else after booting Linux, fix the fans. The BMC has detected a non-Dell GPU and it is not happy about it. Run this:

sudo apt install ipmitool
sudo ipmitool raw 0x30 0x30 0x01 0x00
sudo ipmitool raw 0x30 0x30 0x02 0xff 0x1e

The first command disables automatic fan speed management. The second sets fan speed to 30% manually. The 0x1e at the end is hexadecimal for 30. If 30% is still too loud or your GPU runs hot, adjust the value. 0x23 is 35%, 0x28 is 40%.
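If you’d rather not do hex arithmetic in your head, a two-line helper generates the raw arguments for any percentage. This assumes the same raw command codes shown above, which apply to this iDRAC generation:

```python
# Turn a fan percentage into the raw ipmitool arguments. Assumes the
# iDRAC raw command codes shown above for this server generation.
def fan_raw_command(percent):
    """Return the ipmitool raw arguments for a manual fan speed."""
    assert 0 <= percent <= 100
    return f"raw 0x30 0x30 0x02 0xff 0x{percent:02x}"

print(fan_raw_command(30))  # raw 0x30 0x30 0x02 0xff 0x1e
print(fan_raw_command(40))  # raw 0x30 0x30 0x02 0xff 0x28
```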

Important: add these commands to a cron job or systemd service that runs on boot. The iDRAC will periodically try to reassert control and ramp fans back up unless you keep telling it not to.

# /etc/systemd/system/fan-control.service
[Unit]
Description=IPMI Fan Control
After=network.target

[Service]
Type=oneshot
ExecStart=/usr/bin/ipmitool raw 0x30 0x30 0x01 0x00
ExecStart=/usr/bin/ipmitool raw 0x30 0x30 0x02 0xff 0x1e
RemainAfterExit=yes

[Install]
WantedBy=multi-user.target

Then enable it so it runs on every boot:

sudo systemctl enable fan-control
sudo systemctl start fan-control

Now the machine should be quiet enough to live in a room with.

NVIDIA Drivers and CUDA

sudo apt update && sudo apt upgrade -y
sudo apt install build-essential dkms
sudo add-apt-repository ppa:graphics-drivers/ppa
sudo apt install nvidia-driver-545
sudo reboot

After reboot, verify with nvidia-smi. You should see your RTX 3090 listed with its VRAM reported, along with the CUDA version the driver supports (12.x) in the header of the output. The driver carries the CUDA runtime; the toolkit is a separate install.

Install the CUDA toolkit separately for the development headers that llama.cpp needs:

wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64/cuda-keyring_1.1-1_all.deb
sudo dpkg -i cuda-keyring_1.1-1_all.deb
sudo apt update
sudo apt install cuda-toolkit-12-4

Add CUDA to your path:

echo 'export PATH=/usr/local/cuda-12.4/bin:$PATH' >> ~/.bashrc
echo 'export LD_LIBRARY_PATH=/usr/local/cuda-12.4/lib64:$LD_LIBRARY_PATH' >> ~/.bashrc
source ~/.bashrc

Verify with nvcc --version.

Downloading the Model

DeepSeek R1 in its full 671B form is available on Hugging Face. You want the GGUF quantized version, specifically Q4_K_M or Q4_K_S. GGUF is the format that llama.cpp uses for inference, and Q4_K_M is the quantization level that gives you the best tradeoff between model quality and memory requirements.

The unsloth team maintains high-quality GGUF quantizations of major models. Search Hugging Face for unsloth/DeepSeek-R1-GGUF. The Q4_K_M split files together are approximately 390-410GB.

Install the Hugging Face CLI with the high-speed transfer library:

pip install huggingface_hub hf_transfer
export HF_HUB_ENABLE_HF_TRANSFER=1
huggingface-cli download unsloth/DeepSeek-R1-GGUF \
  --include "*Q4_K_M*" \
  --local-dir /data/models/deepseek-r1 \
  --local-dir-use-symlinks False

On a standard home internet connection (500Mbps down), this download runs 4 to 10 hours. Start it before you go to sleep.

The model will download as multiple split GGUF files (typically named something like DeepSeek-R1-Q4_K_M-00001-of-00009.gguf etc.). llama.cpp handles split files natively; you point it at the first file and it loads the rest automatically.
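A 400GB multi-part download is exactly the kind of thing that fails silently at shard 7 of 9, so it’s worth checking completeness before pointing the server at the first file. This sketch assumes the “-NNNNN-of-NNNNN.gguf” naming convention shown above; adjust the directory to wherever your files landed:

```python
# Check that every shard of a split GGUF download is present before
# launching the server. Assumes the "-NNNNN-of-NNNNN.gguf" convention.
import os
import re
import sys

def missing_shards(model_dir):
    pat = re.compile(r"(.+)-(\d{5})-of-(\d{5})\.gguf$")
    groups = {}
    for name in os.listdir(model_dir):
        m = pat.match(name)
        if m:
            prefix, idx, total = m.group(1), int(m.group(2)), int(m.group(3))
            groups.setdefault((prefix, total), set()).add(idx)
    missing = []
    for (prefix, total), present in groups.items():
        for i in range(1, total + 1):
            if i not in present:
                missing.append(f"{prefix}-{i:05d}-of-{total:05d}.gguf")
    return missing

if __name__ == "__main__":
    gaps = missing_shards(sys.argv[1] if len(sys.argv) > 1 else ".")
    print("all shards present" if not gaps else f"missing: {gaps}")
```

If a shard is missing, rerun the same huggingface-cli command; it resumes rather than restarting.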

llama.cpp

llama.cpp is the inference engine. It’s a C++ implementation that runs quantized models with excellent CPU offloading support, CUDA acceleration for the GPU-resident layers, and an OpenAI-compatible server mode. It’s the right tool for this use case specifically because of how well it handles the GPU plus RAM split that DeepSeek R1’s architecture requires.

Build from source:

sudo apt install cmake
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
cmake -B build -DGGML_CUDA=ON
cmake --build build --config Release -j$(nproc)

The -DGGML_CUDA=ON flag compiles CUDA support. This takes 10 to 20 minutes. When it finishes, you have the llama-server binary in the build/bin/ directory.

Running the Server

The key parameter is --n-gpu-layers. This number controls how many of the model’s transformer layers are loaded into VRAM versus staying in system RAM. Your RTX 3090 has 24GB of VRAM, which is enough to hold roughly 12 to 18 layers of DeepSeek R1 depending on layer size.

The right number to start with is experiment-dependent, but 10 to 14 layers is a reasonable starting point for a 24GB card. More layers in VRAM means faster inference until you run out of VRAM and the system starts swapping. Start at 10 and increment up until you see VRAM utilization near 90% in nvidia-smi without going over.

./build/bin/llama-server \
  --model /data/models/deepseek-r1/DeepSeek-R1-Q4_K_M-00001-of-00009.gguf \
  --host 0.0.0.0 \
  --port 8080 \
  --n-gpu-layers 12 \
  --ctx-size 16384 \
  --threads 44 \
  --parallel 1 \
  --flash-attn

--threads 44 uses all physical cores on your dual E5-2699 v4 setup. --flash-attn enables flash attention which reduces VRAM usage for the attention mechanism and meaningfully improves throughput. --ctx-size 16384 gives you a 16,384 token context window. DeepSeek R1 supports up to 128K context but pushing large contexts with this hardware configuration will impact speed.

The server starts and binds to port 8080. It serves an OpenAI-compatible API at http://[server-ip]:8080/v1. When it’s ready, you’ll see a log line indicating it’s listening for connections.

The model name for API calls is whatever you set with --alias. If you don’t set an alias, it defaults to the filename. Set --alias deepseek-r1 for a cleaner experience.

Connecting to Your Tools

The endpoint is OpenAI-compatible. Anything that accepts a custom OpenAI base URL works with zero additional configuration. You’re changing two values: the URL and the API key (which can be anything, since there’s no authentication by default on a local server).

Cursor

Settings > Models > Model Name and Base URL. Set base URL to http://[server-ip]:8080/v1 and add deepseek-r1 as a custom model. You can now select it from the model dropdown in any Cursor chat or composer session.

Continue.dev

In ~/.continue/config.json:

{
  "models": [
    {
      "title": "DeepSeek R1 Local",
      "provider": "openai",
      "model": "deepseek-r1",
      "apiBase": "http://[server-ip]:8080/v1",
      "apiKey": "local"
    }
  ]
}

Continue.dev uses this for chat, edit, and autocomplete. The model selection appears in the Continue sidebar in VS Code.

Aider

aider \
  --openai-api-base http://[server-ip]:8080/v1 \
  --openai-api-key local \
  --model openai/deepseek-r1

Aider handles multi-file editing and automatically manages context. Pair it with your local server and you have an unlimited agentic coding loop that runs until the task is done.

Python

from openai import OpenAI

client = OpenAI(
    base_url="http://[server-ip]:8080/v1",
    api_key="local"
)

response = client.chat.completions.create(
    model="deepseek-r1",
    messages=[
        {"role": "user", "content": "Here is my codebase..."}
    ],
    stream=True
)

for chunk in response:
    print(chunk.choices[0].delta.content or "", end="", flush=True)

Streaming is supported and you want to use it. DeepSeek R1 thinks before it answers (it outputs its reasoning in a <think> block before the actual response), and streaming lets you watch that reasoning process in real time rather than waiting for the full response.
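When you’re consuming R1’s output programmatically rather than watching it stream, you usually want the reasoning and the final answer separated. A minimal split, assuming the standard single <think> block R1 emits at the start of its response:

```python
# Separate R1's <think> reasoning from its final answer. Assumes the
# standard single-think-block format at the start of the response.
import re

def split_reasoning(text):
    """Return (reasoning, answer); reasoning is "" if no think block found."""
    m = re.search(r"<think>(.*?)</think>", text, re.DOTALL)
    if not m:
        return "", text.strip()
    answer = text[m.end():].strip()
    return m.group(1).strip(), answer

raw = "<think>The user wants X, so I should...</think>Here is the fix."
reasoning, answer = split_reasoning(raw)
print(answer)  # Here is the fix.
```

Logging the reasoning separately is worth doing in agentic loops; it’s often the fastest way to see why a fix went sideways.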

Remote Access with Tailscale

Your server is on your home network. You want to reach it from anywhere.

The naive approach is port forwarding on your router. This works but it exposes an unauthenticated API endpoint to the open internet, and the configuration is fragile when your home IP changes.

Tailscale is the right answer. It creates an encrypted mesh network between devices. Install it on the server, install it on your laptop, install it on whatever else you want. The server gets a permanent private IP on your Tailscale network (something like 100.x.x.x) that stays consistent even as your home ISP changes your public IP.

curl -fsSL https://tailscale.com/install.sh | sh
sudo tailscale up

After authenticating, your server has a Tailscale IP. Your API calls from your laptop use that IP regardless of where you are. The coffee shop, a hotel, your client’s office. The inference server is always reachable at the same address and the traffic is encrypted end to end.

The free Tailscale tier supports up to 100 devices. You are not hitting that limit.

Open WebUI (Optional)

If you want a ChatGPT-style interface for casual use alongside the API access, Open WebUI is the standard choice. It runs as a Docker container, points at your llama.cpp server, and gives you a full chat UI with conversation history, model selection, and system prompt configuration.

sudo apt install docker.io
sudo docker run -d \
  -p 3000:8080 \
  --add-host=host.docker.internal:host-gateway \
  -e OPENAI_API_BASE_URL=http://host.docker.internal:8080/v1 \
  -v open-webui:/app/backend/data \
  --name open-webui \
  --restart always \
  ghcr.io/open-webui/open-webui:main

Access it at http://[server-ip]:3000. Point it at your llama.cpp endpoint in the settings. This is optional but useful when you want to have a conversation rather than a coding session.

What 15-20 Tokens Per Second Actually Means for Real Work

The token speed number sounds slow if you’re comparing it to API latency on a cloud provider with GPU clusters. In practice it doesn’t feel slow, because the benchmark that matters is your reading speed, not datacenter throughput.

Average adult reading speed is around 200 to 300 words per minute, or roughly 3 to 5 words per second. At 15 tokens per second (a token is roughly 0.75 words), the model is producing output at about 11 words per second. You cannot read that fast. The model is writing faster than you’re processing it.

Where the speed matters is in agentic workflows where the model is running loops. A code debugging agent running 50 iterations of “analyze error, propose fix, apply fix, run tests, repeat” is making 50 sequential inference calls. At a hosted API with rate limits, you hit a wall. On your local server with no rate limits, those 50 iterations run back to back until the task is complete.

This is the actual unlock. Not the token speed. The absence of limits on how long or how many steps a task can run.

DeepSeek R1 is specifically good at this. Its training included heavy emphasis on reasoning and multi-step problem solving. It thinks through problems explicitly before answering, which means for complex agentic tasks it’s producing more reliable outputs than models that jump straight to an answer. The thinking tokens cost you time (R1 can output hundreds of tokens of reasoning before giving the actual response), but the output quality on hard problems justifies it.

A practical example: you run Aider against a React codebase with a complex state management bug. You tell it to find and fix the bug. It reads the relevant files (using your 16K context window), reasons through the problem, proposes a fix, applies it, runs the tests, sees the failure, re-reads the error, reasons again, proposes a revised fix. It runs until it either solves the problem or exhausts the context. On a hosted API, you’re watching your token meter and your rate limit counter simultaneously. On your local server, you just wait for it to finish.

Monitoring and Maintenance

This is a real server running a real workload. It needs basic monitoring.

Check GPU temperature during inference. The RTX 3090 can run hot under sustained load. Target under 80°C. The R730xd has good airflow if you haven’t blocked it, but monitor the first few sessions. nvidia-smi dmon gives you real-time GPU temperature and utilization.

Watch RAM usage. htop or free -h shows you available system memory. The model takes roughly 400GB. With 512GB installed, you have ~100GB of headroom. If something is eating into that headroom (Docker containers, other processes), you’ll see inference quality degrade before you see an outright crash.

The R730xd has iDRAC built in. Plug the dedicated iDRAC port into your network and you have a web UI and remote console access to the machine independent of the OS. If the server hangs or you need to power cycle it remotely, this is how you do it. Worth configuring.

Model updates happen. DeepSeek has released updated versions and will likely continue to. When a better version of R1 or a successor releases in GGUF format, downloading and swapping the model file is the whole process. The server command updates to point at the new file. Everything else stays the same.

The Math on Owning vs. Renting

Claude API (Anthropic’s pricing) at heavy agentic use for a developer genuinely using it as infrastructure runs at hundreds of dollars per month if not more. Not occasional use. Real daily use on coding tasks, document work, automation scripts, agentic workflows.

This build costs $1,500 all in. At $300 per month in API costs, payback is five months. At $500 per month, it’s three months. After that, your only recurring cost is electricity.

Electricity at full inference load on this hardware is $35 to $55 per month. That’s your ongoing cost. One dinner out.
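A sketch of the payback math, with assumed (not measured) power draw and an assumed utility rate; plug in your own API spend and local electricity price:

```python
# Payback sketch. Power draw and utility rate are assumptions, not
# measurements from this build -- substitute your own numbers.
BUILD_COST = 1500
API_SPEND_PER_MONTH = 300        # what you'd otherwise pay hosted APIs
AVG_POWER_WATTS = 400            # assumed blended idle/inference draw
RATE_PER_KWH = 0.15              # assumed utility rate

electricity = AVG_POWER_WATTS / 1000 * 24 * 30 * RATE_PER_KWH
net_saving = API_SPEND_PER_MONTH - electricity
print(f"electricity: ~${electricity:.0f}/mo, payback in ~{BUILD_COST / net_saving:.1f} months")
```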

The server will run for years. You can load different models as better ones come out. You can give your whole team access to the same endpoint. You can run multiple models simultaneously if you add a second GPU. The hardware you buy is infrastructure you own, not a subscription you’re serviced by.

The comparison isn’t “this server versus ChatGPT Plus.” It’s “this server versus treating frontier AI as a utility you have a meter on.”

What This Is Not

It’s not a production inference server for thousands of users. If you’re running a SaaS product with real traffic, you need proper GPU infrastructure. This is for your own use, a small team, or a home lab context.

It’s not zero-config. You’re setting up a server. That involves things going wrong in ways this guide can’t anticipate, reading error messages, Googling things, and occasionally asking for help on Reddit or Discord or Claude or ChatGPT.

If that’s not something you’re up for, the end of this article has an alternative.

It’s also not perfect on every benchmark. DeepSeek R1 is excellent at reasoning and code, but it’s text-only, so multimodal tasks (images, audio) are out of scope, and for pure creative writing some people prefer other models.

This is a reasoning and coding-focused setup.

If You Want This Done for You

Sourcing, assembling, benchmarking, and configuring this takes focused work. Some people will just want the result without the process.

We build these end to end. Hardware arrives tested and verified. OS, CUDA, inference stack, and model are installed and configured. Your tools are pointed at the right endpoint before it ships. You plug it in, connect to your Tailscale network, and it works.

If you want to talk through whether this makes sense for your situation, reach out here: https://thomasunise.com/request-a-call/

Or through my AI agency, eeko systems: https://eeko.systems

 

Thomas Unise
