<section id="saiga">
Saiga LLM
Production-ready multi-user LLM service running Saiga Nemo 12B (Llama family, GGUF Q4_K_M) locally on 2× NVIDIA P104-100 with layer offloading via llama.cpp. Distributed deployment with web UI, OpenAI-compatible API, Telegram bot, and full observability stack.
</section>
Saiga Nemo
window
at 8K context
2× P104-100
About the Project
Self-hosted Russian-language LLM with full ML pipeline
What is Saiga?
Saiga is one of the most popular open-source Russian-language LLMs, created by Ilya Gusev — Senior ML Engineer at Booking.com. Most open language models are optimised for English, causing inefficient Russian tokenisation; Saiga fixes that.
This project runs Saiga Nemo 12B locally on two NVIDIA P104-100 mining GPUs (8 GB VRAM each) with layer offloading via llama.cpp. The third P104 in the same host is repurposed for Stable Diffusion + ControlNet in a sibling RAG project — rational hardware sharing between ML tasks. Real-world performance on commodity hardware that costs a fraction of cloud inference.
Full-stack ML pipeline: from GPU inference to observability. Distributed deployment across two physical machines with Telegram bot on a separate VDS connecting through SSH-tunnel with restricted-key — solves the Russian ISP blocking of Telegram API.
Hardware Setup
Service Features
GPU Cluster
2× P104-100 mining GPUs (Pascal, без tensor cores) с layer-split через llama.cpp. CUDA 11.8 в контейнере, host driver 535.x, nvidia-container-toolkit runtime для proper library forwarding. Третья P104 в том же хосте — под Stable Diffusion для соседнего RAG-проекта.
Distributed Deployment
GPU-сервер с моделью + Telegram-бот на изолированном VDS. Бот ходит к Postgres через SSH-туннель с restricted-key (permitopen=127.0.0.1:5432), к LLM — через HTTPS+Bearer. Решает блокировку api.telegram.org. Single account web↔bot через custom Telegram deep-link auth (без Widget) — диалоги и настройки общие.
Production Observability
Sentry на web и боте, Prometheus + Grafana с custom dashboards: GPU memory/util/temp/power через NVIDIA DCGM-exporter, контейнеры через cAdvisor v0.51, host metrics через node-exporter. Доступ через metrics.vaibkod.ru с basic-auth.
CI/CD + Tests
GitHub Actions с ruff (strict) + pytest. 62 unit-теста с 99% coverage на тестируемых модулях (markdown converter, database URL converter, env validators, TelegramLinkToken model). Coverage artifact upload.
Architecture
Distributed across 2 physical hosts · 5 subdomains with different auth policies
// GPU host
- Debian 12 · 2× P104-100 (LLM)
- Web UI (Flask + Gunicorn)
- LLM inference (llama.cpp)
- PostgreSQL 16 · Redis 7
- Caddy reverse-proxy
- Prometheus + Grafana stack
// VDS (telegram bot)
- python-telegram-bot 20.7 async
- aiohttp (LLM client + retry)
- Async SQLAlchemy 2.0 + asyncpg
- SSH-tunnel to Postgres
- autossh systemd service
- Bearer-auth to llm.vaibkod.ru
// Caddy subdomains
- saiga.vaibkod.ru — public web
- llm.vaibkod.ru — Bearer-auth API
- saigaui.vaibkod.ru — basic-auth
- metrics.vaibkod.ru — basic-auth
- 10+ services across 2 hosts
- auto Let's Encrypt
Tech Stack
// LLM SERVING
text-generation-webui (oobabooga fork), llama-cpp-python with GGML_CUDA=on, PyTorch 2.7 + cu126, GGUF Q4_K_M quantisation
// GPU INFRASTRUCTURE
2× NVIDIA P104-100 (Pascal, no tensor cores), CUDA 11.8 in container, host driver 535.x, nvidia-container-toolkit runtime
// BACKEND (AI SCENARIOS)
Python 3.11, Flask + Gunicorn (web with stream chat), python-telegram-bot 20.7 async, aiohttp (LLM API client with retry + Bearer), async SQLAlchemy 2.0 + asyncpg
// STORAGE
PostgreSQL 16 (shared between web and bot via unified saiga_shared SQLAlchemy package), Alembic for migrations with DDL/DML role split (_migrator for Alembic, _app for runtime — runtime cannot DROP/ALTER), Redis 7 for bot state
// OBSERVABILITY
Sentry SDK (Flask integration + manual init for bot), Prometheus 2.55, Grafana 11.3, DCGM-exporter 3.3.7, cAdvisor v0.51, node-exporter 1.8
// CI/CD + INFRA
GitHub Actions (ruff strict + pytest with coverage), Caddy 2 reverse-proxy with auto Let's Encrypt, Docker Compose (10+ services), autossh systemd units for tunnelling
Saiga Model Family
Choose the model that fits your hardware
Training Methods
SFT
Supervised fine-tuning on Russian instruction-response pairs
DPO / SimPO
Preference optimisation using paired comparison data
SLERP
Model merging via spherical linear interpolation
LoRA
Parameter-efficient adaptation for targeted fine-tuning
Quantisation Levels
GGUF format via llama.cpp. Choose quality vs VRAM trade-off:
★ Q4_K_M recommended — best quality/VRAM ratio
Deployment Frameworks
Want to integrate local AI?
Local LLM deployment for your business — data security, full control, no cloud costs. Let's discuss your use case.
Send Request