Saiga LLM

Production-ready multi-user LLM service running Saiga Nemo 12B (Mistral NeMo base, GGUF Q4_K_M) locally on 2× NVIDIA P104-100, with the model's layers split across both GPUs via llama.cpp. Distributed deployment with a web UI, an OpenAI-compatible API, a Telegram bot, and a full observability stack.

saiga.vaibkod.ru @saiga_ai_bot github.com/KondrashovDenis/saiga


12B

Parameters
Saiga Nemo

Context
window

15-18

Tokens/sec
at 8K context

~7.5GB

Total VRAM
2× P104-100

About the Project

Self-hosted Russian-language LLM with full ML pipeline

What is Saiga?

Saiga is one of the most popular open-source Russian-language LLMs, created by Ilya Gusev, Senior ML Engineer at Booking.com. Most open language models are optimised for English and tokenise Russian inefficiently; Saiga is built to close that gap.

This project runs Saiga Nemo 12B locally on two NVIDIA P104-100 mining GPUs (8 GB VRAM each), with the model's layers split across them via llama.cpp. A third P104-100 in the same host runs Stable Diffusion + ControlNet for a sibling RAG project, so one machine serves several ML workloads. The result is usable real-world performance on commodity hardware at a fraction of the cost of cloud inference.

Full-stack ML pipeline, from GPU inference to observability. The deployment spans two physical machines: the GPU host and a separate VDS running the Telegram bot, which connects back over an SSH tunnel with a restricted key. Hosting the bot on the VDS works around Russian ISP blocking of the Telegram API.

Hardware Setup

# Hardware configuration

GPU_0: NVIDIA P104-100 · 8 GB VRAM (LLM layers)

GPU_1: NVIDIA P104-100 · 8 GB VRAM (LLM layers)

GPU_2: NVIDIA P104-100 · 8 GB VRAM (Stable Diffusion · sibling project)

# Layer split via llama.cpp

USED: ~7.5 GB total (model layers across 2 GPUs)

SHM: 64 GB

MODEL: Saiga-nemo-12b-Q4_K_M.gguf

ENGINE: llama.cpp + CUDA 11.8

CONTEXT: 8 192 tokens

SPEED: 15-18 tok/s · 99%+ uptime
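As a rough check on the memory budget above, the KV cache for an 8 192-token context can be estimated from the model shape. This is a sketch under stated assumptions: the layer count, KV head count and head dimension are the published Mistral NeMo 12B shape and the cache is assumed to be fp16. None of these figures come from this page.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, n_ctx, bytes_per_elem=2):
    """Size of the key + value cache for one sequence of n_ctx tokens."""
    # Factor 2: each layer stores one tensor for keys and one for values.
    return 2 * n_layers * n_kv_heads * head_dim * n_ctx * bytes_per_elem


# Assumed Mistral NeMo 12B shape (not stated in this document).
size = kv_cache_bytes(n_layers=40, n_kv_heads=8, head_dim=128, n_ctx=8192)
kv_gib = size / 2**30
print(f"KV cache at 8K context: {kv_gib:.2f} GiB")
```

Under these assumptions the cache adds about 1.25 GiB on top of the model layers, which is why the context length matters as much as the quantisation level on 8 GB cards.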

Service Features

GPU Cluster

2× P104-100 mining GPUs (Pascal, no tensor cores) with a layer split via llama.cpp. CUDA 11.8 in the container, host driver 535.x, and the nvidia-container-toolkit runtime for proper library forwarding. A third P104 in the same host runs Stable Diffusion for the sibling RAG project.

Distributed Deployment

GPU server with the model, plus a Telegram bot on an isolated VDS. The bot reaches Postgres through an SSH tunnel with a restricted key (permitopen=127.0.0.1:5432) and reaches the LLM over HTTPS + Bearer. This works around the blocking of api.telegram.org. Web and bot share a single account through custom Telegram deep-link auth (no Widget), so conversations and settings are common to both.
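A deep-link login of this kind can be sketched as a one-time token carried in a t.me start link. The project's own TelegramLinkToken model is not shown here, so everything below is illustrative: the function names, the 10-minute lifetime and the in-memory dictionary (standing in for a database table) are assumptions.

```python
import secrets
import time

BOT_USERNAME = "saiga_ai_bot"  # bot handle listed on this page
TOKEN_TTL_SECONDS = 600        # assumed lifetime, not from the source

# token -> (web_user_id, expires_at); hypothetical stand-in for a DB table
_pending = {}


def make_deep_link(web_user_id, now=None):
    """Create a one-time token and the t.me link that carries it."""
    now = time.time() if now is None else now
    # 32 random bytes -> 43 URL-safe characters (A-Z, a-z, 0-9, _ and -),
    # which fits Telegram's 64-character limit for the start parameter.
    token = secrets.token_urlsafe(32)
    _pending[token] = (web_user_id, now + TOKEN_TTL_SECONDS)
    return f"https://t.me/{BOT_USERNAME}?start={token}"


def redeem(token, now=None):
    """Called by the bot on '/start <token>'; returns the web user id or None."""
    now = time.time() if now is None else now
    entry = _pending.pop(token, None)  # pop makes the token single-use
    if entry is None:
        return None
    user_id, expires_at = entry
    return user_id if now <= expires_at else None
```

The web side shows the link to a logged-in user; when the bot receives the token it can attach the Telegram account to that web account, which is what lets both front ends share one history.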

Production Observability

Sentry on the web app and the bot; Prometheus + Grafana with custom dashboards: GPU memory/utilisation/temperature/power via NVIDIA DCGM-exporter, containers via cAdvisor v0.51, host metrics via node-exporter. Access is through metrics.vaibkod.ru with basic-auth.
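To illustrate what the GPU dashboards consume, here is a minimal sketch that reads per-GPU gauges from Prometheus text-format output such as DCGM-exporter serves. The sample payload and its values are made up for the example, and the parser is deliberately simplified (no timestamps, no escaped label values).

```python
import re

# Illustrative scrape output; the values are invented for this example.
SAMPLE = '''\
# HELP DCGM_FI_DEV_GPU_TEMP GPU temperature (in C).
# TYPE DCGM_FI_DEV_GPU_TEMP gauge
DCGM_FI_DEV_GPU_TEMP{gpu="0",modelName="P104-100"} 61
DCGM_FI_DEV_GPU_TEMP{gpu="1",modelName="P104-100"} 58
DCGM_FI_DEV_FB_USED{gpu="0",modelName="P104-100"} 3840
'''

LINE = re.compile(r'^(?P<name>[A-Za-z_:][A-Za-z0-9_:]*)\{(?P<labels>[^}]*)\}\s+(?P<value>\S+)$')
LABEL = re.compile(r'(\w+)="([^"]*)"')


def gauge_by_gpu(text, metric):
    """Return {gpu_index: value} for one metric in Prometheus text format."""
    out = {}
    for line in text.splitlines():
        if line.startswith("#"):
            continue  # HELP / TYPE comments
        m = LINE.match(line)
        if m and m.group("name") == metric:
            labels = dict(LABEL.findall(m.group("labels")))
            out[labels["gpu"]] = float(m.group("value"))
    return out


temps = gauge_by_gpu(SAMPLE, "DCGM_FI_DEV_GPU_TEMP")
hot = [gpu for gpu, t in temps.items() if t >= 60]  # 60 C is an arbitrary threshold
```

In a real deployment Prometheus does the scraping and Grafana the thresholds; the snippet only shows the shape of the data behind those panels.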

CI/CD + Tests

GitHub Actions with ruff (strict) + pytest. 62 unit tests with 99% coverage on the tested modules (markdown converter, database URL converter, env validators, TelegramLinkToken model). The coverage report is uploaded as an artifact.
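A database URL converter is a typical small module to unit-test. The project's implementation is not shown, so the following is only a guess at what such a helper might look like: it rewrites a synchronous Postgres URL so that SQLAlchemy selects the asyncpg driver. The function name and the sample credentials are hypothetical.

```python
def to_async_url(url):
    """Rewrite a sync Postgres URL so SQLAlchemy uses the asyncpg driver."""
    if url.startswith("postgresql+asyncpg://"):
        return url  # already async, leave untouched
    for prefix in ("postgresql+psycopg2://", "postgresql://", "postgres://"):
        if url.startswith(prefix):
            return "postgresql+asyncpg://" + url[len(prefix):]
    raise ValueError(f"unsupported database URL scheme: {url.split('://')[0]}")


def test_to_async_url():
    # pytest-style test; credentials and database name are placeholders
    assert (
        to_async_url("postgresql://user:pw@127.0.0.1:5432/appdb")
        == "postgresql+asyncpg://user:pw@127.0.0.1:5432/appdb"
    )
    assert to_async_url("postgres://h/db") == "postgresql+asyncpg://h/db"
    already = "postgresql+asyncpg://h/db"
    assert to_async_url(already) == already
```

Pure functions like this one need no database or network, which is how a suite can reach high coverage on the tested modules while staying fast in CI.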

Architecture

Distributed across 2 physical hosts · 5 subdomains with different auth policies

// GPU host

Debian 12 · 2× P104-100 (LLM)
Web UI (Flask + Gunicorn)
LLM inference (llama.cpp)
PostgreSQL 16 · Redis 7
Caddy reverse-proxy
Prometheus + Grafana stack

// VDS (telegram bot)

python-telegram-bot 20.7 async
aiohttp (LLM client + retry)
Async SQLAlchemy 2.0 + asyncpg
SSH-tunnel to Postgres
autossh systemd service
Bearer-auth to llm.vaibkod.ru
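The SSH tunnel to Postgres listed above depends on two pieces: a restricted key on the GPU host and a persistent forward on the VDS. The sketch below builds both as strings. Only the permitopen target (127.0.0.1:5432) comes from this page; the `restrict` and `port-forwarding` options, the autossh flags, the function names and the host name are assumptions.

```python
def restricted_key_entry(public_key, host="127.0.0.1", port=5432):
    """Build an authorized_keys line that only allows forwarding to host:port."""
    options = [
        "restrict",                     # disable pty, agent and X11 forwarding, etc.
        "port-forwarding",              # re-enable port forwarding only
        f'permitopen="{host}:{port}"',  # and only to this destination
    ]
    return ",".join(options) + " " + public_key.strip()


def tunnel_command(remote, local_port=5432, target="127.0.0.1:5432"):
    """Argument list for an autossh-managed local forward (no remote shell)."""
    return [
        "autossh", "-M", "0",             # rely on SSH keepalives, not a monitor port
        "-N",                             # forward only, run no remote command
        "-o", "ServerAliveInterval=30",
        "-o", "ExitOnForwardFailure=yes",
        "-L", f"{local_port}:{target}",
        remote,
    ]


entry = restricted_key_entry("ssh-ed25519 AAAAexample bot@vds")
cmd = tunnel_command("tunnel@gpu-host.example")  # hypothetical user and host
```

With such a key, a compromised VDS can open a connection to Postgres on the GPU host and nothing else: no shell, no other ports.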

// Caddy subdomains

saiga.vaibkod.ru — public web
llm.vaibkod.ru — Bearer-auth API
saigaui.vaibkod.ru — basic-auth
metrics.vaibkod.ru — basic-auth
10+ services across 2 hosts
auto Let's Encrypt

Tech Stack

// LLM SERVING

text-generation-webui (oobabooga fork), llama-cpp-python with GGML_CUDA=on, PyTorch 2.7 + cu126, GGUF Q4_K_M quantisation

// GPU INFRASTRUCTURE

2× NVIDIA P104-100 (Pascal, no tensor cores), CUDA 11.8 in container, host driver 535.x, nvidia-container-toolkit runtime

// BACKEND (AI SCENARIOS)

Python 3.11, Flask + Gunicorn (web with stream chat), python-telegram-bot 20.7 async, aiohttp (LLM API client with retry + Bearer), async SQLAlchemy 2.0 + asyncpg
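The "LLM API client with retry + Bearer" can be sketched as follows. To stay dependency-free, this version uses the standard library's urllib instead of aiohttp, so it is synchronous and not the bot's actual client. The `/v1/chat/completions` path is the usual OpenAI-compatible route and is an assumption here, as are the function names, retry counts and delays.

```python
import json
import time
import urllib.error
import urllib.request

API_URL = "https://llm.vaibkod.ru/v1/chat/completions"  # path is an assumption


def build_request(token, messages, max_tokens=256):
    """POST request carrying an OpenAI-style chat payload and a Bearer token."""
    body = json.dumps({"messages": messages, "max_tokens": max_tokens}).encode("utf-8")
    headers = {"Authorization": f"Bearer {token}", "Content-Type": "application/json"}
    return urllib.request.Request(API_URL, data=body, headers=headers, method="POST")


def backoff_delays(retries=3, base=0.5, cap=8.0):
    """Exponential backoff: base, 2*base, 4*base, ... capped at `cap` seconds."""
    return [min(cap, base * 2**i) for i in range(retries)]


def chat(token, messages, retries=3, opener=urllib.request.urlopen, sleep=time.sleep):
    """Send one chat request, retrying on network errors and 5xx responses."""
    last_error = None
    for delay in [0.0] + backoff_delays(retries):
        if delay:
            sleep(delay)
        try:
            with opener(build_request(token, messages), timeout=60) as resp:
                return json.load(resp)["choices"][0]["message"]["content"]
        except urllib.error.HTTPError as exc:
            if exc.code < 500:
                raise  # 4xx (e.g. a bad token) will not improve on retry
            last_error = exc
        except OSError as exc:  # URLError and timeouts are OSError subclasses
            last_error = exc
    raise last_error
```

Passing `opener` and `sleep` as parameters keeps the retry logic testable without a network, the same way the real client can be tested with a stubbed session.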

// STORAGE

PostgreSQL 16 (shared between web and bot via unified saiga_shared SQLAlchemy package), Alembic for migrations with DDL/DML role split (_migrator for Alembic, _app for runtime — runtime cannot DROP/ALTER), Redis 7 for bot state
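The DDL/DML role split comes down to a handful of PostgreSQL grants. The sketch below generates them for a given role prefix. The page names only the `_migrator` and `_app` suffixes, so the prefix, the `public` schema and the helper name are assumptions.

```python
def role_split_sql(prefix, schema="public"):
    """GRANT statements for a migrator role (DDL) and an app role (DML only)."""
    migrator, app = f"{prefix}_migrator", f"{prefix}_app"
    dml = "SELECT, INSERT, UPDATE, DELETE"
    return [
        # The migrator may create objects; it owns every table Alembic creates.
        f"GRANT USAGE, CREATE ON SCHEMA {schema} TO {migrator};",
        # The app may use the schema but not create anything in it.
        f"GRANT USAGE ON SCHEMA {schema} TO {app};",
        f"GRANT {dml} ON ALL TABLES IN SCHEMA {schema} TO {app};",
        f"GRANT USAGE, SELECT ON ALL SEQUENCES IN SCHEMA {schema} TO {app};",
        # Tables created by later migrations get the same DML grants.
        f"ALTER DEFAULT PRIVILEGES FOR ROLE {migrator} IN SCHEMA {schema} "
        f"GRANT {dml} ON TABLES TO {app};",
        f"ALTER DEFAULT PRIVILEGES FOR ROLE {migrator} IN SCHEMA {schema} "
        f"GRANT USAGE, SELECT ON SEQUENCES TO {app};",
    ]


statements = role_split_sql("example")  # placeholder prefix
```

In PostgreSQL only a table's owner (or a superuser) can DROP or ALTER it, so a runtime role that never owns tables cannot change the schema even if the application is compromised.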

// OBSERVABILITY

Sentry SDK (Flask integration + manual init for bot), Prometheus 2.55, Grafana 11.3, DCGM-exporter 3.3.7, cAdvisor v0.51, node-exporter 1.8

// CI/CD + INFRA

GitHub Actions (ruff strict + pytest with coverage), Caddy 2 reverse-proxy with auto Let's Encrypt, Docker Compose (10+ services), autossh systemd units for tunnelling

Saiga Model Family

Choose the model that fits your hardware

Model	Parameters	Context	Base	VRAM (Q4)
Saiga LLaMA 2 7B	7B	4 096	Meta LLaMA 2	~6 GB
Saiga Mistral 7B	7B	8 192	Mistral 7B	~6 GB
Saiga LLaMA 3 8B	8B	8 192	Meta LLaMA 3	~7 GB
Saiga LLaMA 2 13B	13B	4 096	Meta LLaMA 2	~10 GB
★ Saiga Nemo 12B	12.2B	128 000	Mistral NeMo / NVIDIA	~9 GB (split)
Saiga LLaMA 2 70B	70B	4 096	Meta LLaMA 2	~48 GB

Training Methods

SFT

Supervised fine-tuning on Russian instruction-response pairs

DPO / SimPO

Preference optimisation using paired comparison data

SLERP

Model merging via spherical linear interpolation

LoRA

Parameter-efficient adaptation for targeted fine-tuning
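Of the methods above, SLERP is the easiest to show in a few lines. The sketch interpolates two flat weight vectors along the arc between them; merge tools typically apply the same formula tensor by tensor. It illustrates the formula only and does not describe how any Saiga checkpoint was actually merged.

```python
import math


def slerp(a, b, t, eps=1e-8):
    """Spherical linear interpolation between two non-zero vectors, 0 <= t <= 1."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    # Clamp to guard against rounding pushing the cosine outside [-1, 1].
    cos_omega = max(-1.0, min(1.0, dot / (norm_a * norm_b)))
    omega = math.acos(cos_omega)  # angle between the two vectors
    if math.sin(omega) < eps:
        # (Anti)parallel vectors: the arc is undefined, fall back to linear mixing.
        return [(1 - t) * x + t * y for x, y in zip(a, b)]
    wa = math.sin((1 - t) * omega) / math.sin(omega)
    wb = math.sin(t * omega) / math.sin(omega)
    return [wa * x + wb * y for x, y in zip(a, b)]


mid = slerp([1.0, 0.0], [0.0, 1.0], 0.5)
```

Unlike a plain average, which would give (0.5, 0.5) here, the halfway point stays on the unit circle, so the merged weights keep the scale of the inputs.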

Quantisation Levels

GGUF format via llama.cpp. Choose quality vs VRAM trade-off:

Q2_K · 2-bit
Q3_K_M · 3-bit
Q4_K_M · 4-bit ★
Q5_K_M · 5-bit
Q6_K · 6-bit
Q8_0 · 8-bit

★ Q4_K_M recommended — best quality/VRAM ratio
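The trade-off can be made concrete with a little arithmetic: weight size is roughly parameter count times bits per weight. The bits-per-weight figures below are rounded approximations for llama.cpp quant types, not values from this page; they are slightly higher than the nominal bit width because of per-block scales, and they vary somewhat by architecture.

```python
# Approximate effective bits per weight (assumed, rounded values).
BITS_PER_WEIGHT = {"Q4_K_M": 4.85, "Q5_K_M": 5.7, "Q6_K": 6.56, "Q8_0": 8.5}


def weights_gb(n_params, quant):
    """Estimated size of the quantised weights in GB (10^9 bytes)."""
    return n_params * BITS_PER_WEIGHT[quant] / 8 / 1e9


q4 = weights_gb(12.2e9, "Q4_K_M")
for quant in BITS_PER_WEIGHT:
    print(f"{quant}: ~{weights_gb(12.2e9, quant):.1f} GB")
```

For a 12.2B-parameter model the Q4_K_M estimate comes out at about 7.4 GB, in line with the ~7.5 GB quoted for this deployment, while Q8_0 would need roughly 13 GB for the weights alone.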

Deployment Frameworks

llama.cpp

CPU + GPU inference, GGUF native — used here

Ollama

Easy local deployment, REST API

vLLM

High-throughput serving; needs compute capability 7.0+, so Pascal cards such as the P104-100 are not supported

Transformers

HuggingFace native, full precision

Want to integrate local AI?

Local LLM deployment for your business — data security, full control, no cloud costs. Let's discuss your use case.
