
A case study of self-hosted, locally running coding LLM chatbots

The exploration of feasible Claude Code alternatives

Introduction

Shuru Labs has been continuously experimenting with ways to enhance developer productivity by developing data-driven, actionable frameworks that bring AI tools deeper into developer workflows.

This article captures my exercise in evaluating different avenues for running large-language-model (LLM) chatbots locally, without relying on proprietary services such as Anthropic's Claude Code. While Sonnet 4 and Claude Code have proved invaluable, Claude Code's variable usage limits aren't very dependable (roughly 10-40 message requests every 5 hours, further subject to spot pricing), and the usage pressure on Anthropic's infrastructure also brings the risk of service outages.

Hence, as a possible alternative (or backup), I explored and compared several GPU-as-a-Service (GaaS) providers, free research notebooks, and token-based inference platforms, and then experimented with self-hosting an LLM on my local single-GPU workstation.

Platforms Explored

Platform   | Category          | Key Point
Runpod     | GPU Cloud         | Per-second billing
Paperspace | GPU Cloud         | VS Code plug-in for Paperspace machines
TensorDock | GPU Marketplace   | Spot-style pricing
Kaggle     | Research Notebook | Free but quota-limited
Groq       | Token API         | Ultra-low latency

1. Runpod – GPU Cloud

2. Paperspace (DigitalOcean)

3. TensorDock – Marketplace Style GPU Cloud

4. Kaggle Notebooks

5. GroqCloud – Tokens‑as‑a‑Service

Glossary

A brief explanation of some technical terms that appear throughout the rest of this article:

Term             | Definition                                                                       | Implication for LLM Hosting
Quantization     | Compressing model weights to lower precision (for example, 4-bit quantization). | Cuts the VRAM requirement by 40-60%, allowing smaller GPUs to host bigger models at a manageable accuracy cost.
FP16             | Half-precision 16-bit floating-point format; the baseline on NVIDIA GPUs.       | Halves memory compared to FP32, but is still heavy for large models.
VRAM Footprint   | The total GPU memory used by weights, activations, and the KV-cache.            | A hard ceiling on model size: when exceeded, the process goes out-of-memory (OOM) and crashes or is killed by the OS.
Context Window   | The maximum number of tokens the model can attend to in a single pass.          | Longer windows need more memory and time; context windows therefore limit prompt lengths.
LoRA             | Low-Rank Adaptation, a fine-tuning technique.                                    | Allows cheap fine-tuning with tiny adapter weights that can be swapped in/out.
Token API        | Cloud services that charge per input/output token (e.g., Groq, OpenRouter).     | Often faster and free of scaling concerns, but raw GPU access and custom weights aren't available.
Spot Pricing     | A dynamic rental rate that fluctuates with supply/demand.                        | Cheaper per hour, but with a risk of instance pre-emption or sudden price spikes.
GPU Marketplace  | A platform aggregating third-party GPU hosts (for example, TensorDock).          | Wider inventory and lower prices, but host quality can vary.
GGUF / EXL2      | Efficient file formats for quantized checkpoints.                                | Loads quickly in llama.cpp / text-generation-webui and saves disk space.
Inference Tokens | Tokens processed while the model receives or generates text.                     | Determines compute cost and overall latency; batching reduces per-token cost.

The Self‑Hosting Experiment

I repurposed my gaming rig with an Nvidia RTX 4080 16 GB (≈ 48 TFLOPS FP16, 716 GB/s bandwidth) to see if I could serve a lightweight coding assistant locally.

The Model Selection Journey

[Figure: Qwen 2.5 Coder benchmarks and comparison]
  1. While I originally wanted to test Qwen 2.5 Coder 32B Instruct, this open-weight, flagship, code-oriented model would be well out of reach without around 64 GB of GPU VRAM, even with an SLI setup. This follows a general rule of thumb:

Rule of thumb: FP16 models need roughly 2 GB of GPU VRAM per billion parameters, i.e., about 2x the parameter count (in billions) in GB. (A quick back-of-the-envelope sketch of this estimate follows the figure below.)

  2. So, as an alternative, I decided to try Qwen 2.5 Coder 7B instead, as its requirements were (theoretically) a much better match for the RTX 4080.
  3. However, the ~15 GB FP16 footprint of Qwen 2.5 Coder 7B left no headroom, and my notebook would crash on the model-loading cell before proceeding to the next one.
  4. Hence, I decided to switch to the safer Qwen 2.5 Coder 3B, and if that wouldn't work either, the even safer DeepSeek-Coder 1.3B Instruct. At approximately 6 GB FP16 (about 3 GB for DeepSeek-Coder), both are stable with plenty of VRAM to spare for context windows, and are replicable on consumer GPUs and shared-usage platforms (such as Kaggle).
  5. Future plan: quantize Qwen 2.5 Coder 7B to 4 bits to squeeze in a larger context window or multi-query attention.
[Figure: Qwen 2.5 Coder models comparison by parameter counts/model sizes]
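To make the rule of thumb concrete, here is a minimal back-of-the-envelope sketch. The helper name and the 10% overhead factor are my own illustrative assumptions, not part of the original notebook:

```python
# Rough VRAM estimate for loading model weights at a given precision.
# This only covers the weights; activations and KV-cache need extra headroom.
def estimate_weight_vram_gb(params_billion: float, bits_per_weight: int = 16,
                            overhead: float = 1.1) -> float:
    bytes_per_weight = bits_per_weight / 8           # FP16 -> 2 bytes, 4-bit -> 0.5 bytes
    weights_gb = params_billion * 1e9 * bytes_per_weight / 1e9
    return weights_gb * overhead                     # ~10% slack for buffers/fragmentation

for name, size_b in [("Qwen 2.5 Coder 32B", 32), ("Qwen 2.5 Coder 7B", 7),
                     ("Qwen 2.5 Coder 3B", 3), ("DeepSeek-Coder 1.3B", 1.3)]:
    print(f"{name}: ~{estimate_weight_vram_gb(size_b):.1f} GB at FP16, "
          f"~{estimate_weight_vram_gb(size_b, bits_per_weight=4):.1f} GB at 4-bit")
```

The estimates line up with what I observed: roughly 15 GB for the 7B model and roughly 6 GB for the 3B model at FP16.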

Making an OpenAI-compatible chatbot API using FastAPI and ngrok

As an extra addition, to make my LLM setup usable across the network (the LLM running on my workstation, and my work laptop using it over the LAN) or even across the internet, I decided to write an OpenAI-compatible chatbot API, so that agents compatible with the OpenAI V1 standard (https://platform.openai.com/docs/api-reference/introduction), such as Roo Code/Cline, running on other computers could also utilize this LLM setup, OpenRouter-style. The exposed endpoint URL was reachable via an ngrok tunnel, which I could also share with my teammates.

Link to my python notebook: https://github.com/shurutech/reboot/blob/main/selfhosted-coding-agent/llm_server.ipynb
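For readers who don't want to open the notebook, here is a heavily condensed sketch of the server side: a FastAPI route mimicking the OpenAI /v1/chat/completions shape. The `generate_reply` helper is a hypothetical stand-in for the actual inference code, and the response fields are simplified relative to the notebook:

```python
# Minimal sketch of an OpenAI-compatible chat endpoint (non-streaming).
# generate_reply() is a placeholder for the actual model inference call.
import time, uuid
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()

class Message(BaseModel):
    role: str
    content: str

class ChatRequest(BaseModel):
    model: str
    messages: list[Message]
    max_tokens: int = 512

def generate_reply(messages: list[Message], max_tokens: int) -> str:
    # Replace with a call into your loaded model (e.g., transformers generate()).
    return "Hello from the self-hosted model!"

@app.post("/v1/chat/completions")
def chat_completions(req: ChatRequest):
    reply = generate_reply(req.messages, req.max_tokens)
    return {
        "id": f"chatcmpl-{uuid.uuid4().hex}",
        "object": "chat.completion",
        "created": int(time.time()),
        "model": req.model,
        "choices": [{
            "index": 0,
            "message": {"role": "assistant", "content": reply},
            "finish_reason": "stop",
        }],
    }
```

Serve it with uvicorn, then run ngrok against the same port to get a public URL that Roo Code/Cline can be pointed at as an OpenAI-compatible base URL.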
Running via IDE agents (Roo Code):
  1. Setup:
     [Roo Code agentic setup]
  2. Demo:
     [Roo Code agentic AI demo]
Running via the command line (PowerShell):
  [Shell script + demo on Windows]
Running via the command line (Bash/WSL):
  [Shell script for Linux]
  [Shell script demo for WSL/Bash]
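Whichever client you use, the call pattern is the same. As a sketch, here is how another machine could hit the endpoint from Python using the official openai client; the base URL is a placeholder for whatever ngrok assigns, and the API key is a dummy since the sketch above doesn't check one:

```python
# Point any OpenAI-compatible client at the ngrok tunnel URL.
from openai import OpenAI

client = OpenAI(
    base_url="https://<your-ngrok-subdomain>.ngrok-free.app/v1",  # placeholder URL
    api_key="not-needed-locally",                                 # dummy key
)

response = client.chat.completions.create(
    model="qwen2.5-coder-3b-instruct",  # model name is echoed back by the sketch server
    messages=[{"role": "user", "content": "Write a Python function that reverses a string."}],
    max_tokens=256,
)
print(response.choices[0].message.content)
```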

Key Takeaways

  1. A single-GPU setup with 16 GB VRAM can realistically host a 3B-parameter model in FP16; however, the throughput from consumer GPUs will not match that of data-center GPUs.
  2. Marketplace clouds like TensorDock often undercut traditional GaaS players by more than 3× on mid‑range GPUs.
  3. Token APIs (Groq, OpenRouter) shine in terms of latency, but without custom/LoRA weights and without token-agnostic (flat) pricing, their costs can work against you when model usage is very high (for example, agentic AI).
    See: https://shurutech.com/openrouter-experiment-phase-1-power-users-how-our-top-ai-developers-really-work/#model-usage.
  4. By choosing either a more powerful GPU or a lighter-weight (pun unintended) model, one can stick to the good practice of always leaving a few GB of VRAM (say 12.5%) as a margin to safeguard against OOM process kills. This is especially relevant if self-hosted, locally running AI models are pursued beyond the enthusiast/amateur scale; a small sketch of such a headroom check follows this list.
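As a sketch of that headroom check (assuming PyTorch is installed; the estimator is the same hypothetical weight-only approximation used earlier, reproduced inline):

```python
# Check whether a model is likely to fit with a 12.5% VRAM safety margin.
import torch

def weight_footprint_gb(params_billion: float, bits_per_weight: int = 16) -> float:
    return params_billion * 1e9 * (bits_per_weight / 8) / 1e9

def fits_with_margin(params_billion: float, bits_per_weight: int = 16,
                     margin: float = 0.125) -> bool:
    total_gb = torch.cuda.get_device_properties(0).total_memory / 1e9
    usable_gb = total_gb * (1 - margin)   # keep ~12.5% headroom for KV-cache, buffers, spikes
    return weight_footprint_gb(params_billion, bits_per_weight) <= usable_gb

if torch.cuda.is_available():
    print("Qwen 2.5 Coder 3B @ FP16 fits:", fits_with_margin(3))
    print("Qwen 2.5 Coder 7B @ FP16 fits:", fits_with_margin(7))
    print("Qwen 2.5 Coder 7B @ 4-bit fits:", fits_with_margin(7, bits_per_weight=4))
```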

What’s Next?

With this experiment successfully showcasing the local-LLM approach for powering coding agents, Shuru Labs will continue to explore the feasibility of "bare metal"-like or self-managed LLM provisioning options. Though a complete pivot in this direction would leave us more reliant on open-weight models (and their continued development in the open-source AI community), it would also help us navigate the ever-changing field of AI tools, since our AI tooling costs and SLA availability would become more predictable.

We would also gain better hands-on data with respect to each model's usefulness and its infrastructure caveats (as opposed to being locked into one provider/model only).

Setup Reference Specs

Model Spec: Qwen 2.5 Coder 3B Instruct
Hardware Spec: Nvidia RTX 4080 16 GB
Software Spec:

Setup: Detailed setup instructions are available at https://github.com/shurutech/reboot/blob/main/selfhosted-coding-agent/setup_instructions.md
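The full instructions are in that file; as a rough sketch of the core loading step under the above specs (the exact arguments in my notebook may differ slightly), loading the model with Hugging Face transformers looks roughly like this:

```python
# Load Qwen 2.5 Coder 3B Instruct in FP16 on a single GPU.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "Qwen/Qwen2.5-Coder-3B-Instruct"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.float16,   # ~6 GB of weights, comfortably within 16 GB of VRAM
    device_map="auto",           # place the model on the available GPU
)

# Quick smoke test using the chat template.
messages = [{"role": "user", "content": "Write a Python function that checks for palindromes."}]
inputs = tokenizer.apply_chat_template(messages, add_generation_prompt=True,
                                       return_tensors="pt").to(model.device)
outputs = model.generate(inputs, max_new_tokens=256)
print(tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True))
```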

Quantization and 8GB GPUs

For readers on smaller cards (for example, an RTX 3060 12 GB or an 8 GB MacBook), quantizing to 4-bit (GGUF/EXL2) cuts memory requirements dramatically (roughly a quarter of the FP16 footprint) and lets you load a 13B model in about 8 GB with a minimal/manageable quality drop. Tools like llama.cpp's quantize utility and bitsandbytes make this essentially a one-command process – stay tuned with Shuru Labs for our future quantization experiments.
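As a sketch of the bitsandbytes route (via transformers' BitsAndBytesConfig), here is how the 7B variant from my earlier plan could be loaded in 4-bit; the exact quantization settings are illustrative assumptions, not from my notebook:

```python
# Load a larger model in 4-bit with bitsandbytes to fit a smaller VRAM budget.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "Qwen/Qwen2.5-Coder-7B-Instruct"

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",               # NormalFloat4, a common default for inference/QLoRA
    bnb_4bit_compute_dtype=torch.float16,    # do the matmuls in FP16
)

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    device_map="auto",
)
print(f"Approx. weight memory: {model.get_memory_footprint() / 1e9:.1f} GB")
```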

Bonus: A 24 GB graphics card option worth considering

The RTX 4090 24 GB offers nearly double the tensor throughput of an RTX 4080, while costing as little as $0.35/hour on TensorDock Community hosts – often the best price per teraFLOP among single-GPU options.

Disclaimer:

Shuru is not affiliated with any of the third-party GPU-as-a-Service/cloud AI solution providers mentioned in this article. This article neither endorses any single mentioned service over the others (nor does it endorse any mentioned service in general), nor was it sponsored or paid for by any of the proprietors mentioned. This experiment and its exploration of feasible solution providers was for educational purposes only, and any commendation or criticism of any service/entity/provider mentioned is purely analytical and subjective in nature.
