
<section id="saiga">

Saiga LLM

Production-ready multi-user LLM service running Saiga Nemo 12B (Llama family, GGUF Q4_K_M) locally on 2× NVIDIA P104-100 with layer offloading via llama.cpp. Distributed deployment with web UI, OpenAI-compatible API, Telegram bot, and full observability stack.

saiga.vaibkod.ru @saiga_ai_bot github.com/KondrashovDenis/saiga

</section>

12B parameters · Saiga Nemo
8K context window
15-18 tokens/sec at 8K context
~7.5 GB total VRAM · 2× P104-100

About the Project

Self-hosted Russian-language LLM with full ML pipeline

What is Saiga?

Saiga is one of the most popular open-source Russian-language LLMs, created by Ilya Gusev, Senior ML Engineer at Booking.com. Most open language models are optimised for English and tokenise Russian inefficiently; the Saiga models are fine-tuned for Russian to address that.

This project runs Saiga Nemo 12B locally on two NVIDIA P104-100 mining GPUs (8 GB VRAM each) with layer offloading via llama.cpp. The third P104 in the same host is repurposed for Stable Diffusion + ControlNet in a sibling RAG project, so the hardware is shared between ML tasks. The result is real-world performance on commodity hardware that costs a fraction of cloud inference.

Full-stack ML pipeline, from GPU inference to observability. The deployment is distributed across two physical machines: the Telegram bot runs on a separate VDS and connects through an SSH tunnel with a restricted key, which works around Russian ISP blocking of the Telegram API.

Hardware Setup

# Hardware configuration
GPU_0: NVIDIA P104-100 · 8 GB VRAM (LLM layers)
GPU_1: NVIDIA P104-100 · 8 GB VRAM (LLM layers)
GPU_2: NVIDIA P104-100 · 8 GB VRAM (Stable Diffusion · sibling project)
# Layer split via llama.cpp
USED: ~7.5 GB total (model layers across 2 GPUs)
SHM: 64 GB
MODEL: Saiga-nemo-12b-Q4_K_M.gguf
ENGINE: llama.cpp + CUDA 11.8
CONTEXT: 8 192 tokens
SPEED: 15-18 tok/s · 99%+ uptime
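As a rough sanity check on the VRAM figure above, the weight footprint of a quantised model is about parameters × bits per weight / 8. The sketch below assumes an average of roughly 4.85 bits per weight for Q4_K_M and the published Mistral NeMo shape (40 layers, 8 KV heads, head dimension 128) for an fp16 KV cache. Both are assumptions for illustration, not values measured on this deployment.

```python
def weights_gb(params_billion: float, bits_per_weight: float) -> float:
    """Approximate size of the quantised weights in GB (10^9 bytes)."""
    return params_billion * 1e9 * bits_per_weight / 8 / 1e9


def kv_cache_gb(n_layers: int, n_kv_heads: int, head_dim: int,
                n_ctx: int, bytes_per_elem: int = 2) -> float:
    """KV cache size: one key and one value vector per layer per position."""
    return 2 * n_layers * n_kv_heads * head_dim * n_ctx * bytes_per_elem / 1e9


if __name__ == "__main__":
    # Assumed values: 12.2B parameters, ~4.85 bits/weight for Q4_K_M,
    # Mistral NeMo shape (40 layers, 8 KV heads, head_dim 128), 8 192 context.
    w = weights_gb(12.2, 4.85)
    kv = kv_cache_gb(40, 8, 128, 8192)
    print(f"weights ~{w:.1f} GB, KV cache ~{kv:.1f} GB")
    # prints: weights ~7.4 GB, KV cache ~1.3 GB
```

Under these assumptions the weights alone land close to the ~7.5 GB quoted for the model layers, and the KV cache accounts for most of the remaining headroom on the two 8 GB cards.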

Service Features

GPU Cluster

2× P104-100 mining GPUs (Pascal, no tensor cores) with layer split via llama.cpp. CUDA 11.8 in the container, host driver 535.x, and the nvidia-container-toolkit runtime for proper library forwarding. The third P104 in the same host serves Stable Diffusion for the sibling RAG project.
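To illustrate what a layer split means in practice, here is a minimal sketch that divides a model's layers between GPUs in proportion to a list of ratios. It is a simplified largest-remainder allocation, not llama.cpp's exact assignment logic. The 40-layer count and the even 50/50 ratio are assumptions, as is the keyword dictionary, which only mirrors parameter names that exist on the `llama_cpp.Llama` constructor.

```python
def split_layers(n_layers: int, ratios: list[float]) -> list[int]:
    """Distribute n_layers across devices proportionally to ratios."""
    total = sum(ratios)
    exact = [n_layers * r / total for r in ratios]
    counts = [int(x) for x in exact]
    # Hand leftover layers to the devices with the largest fractional share.
    leftover = n_layers - sum(counts)
    order = sorted(range(len(ratios)),
                   key=lambda i: exact[i] - counts[i], reverse=True)
    for i in order[:leftover]:
        counts[i] += 1
    return counts


# Hypothetical settings for two equal 8 GB cards (values are assumptions).
llama_kwargs = {
    "model_path": "Saiga-nemo-12b-Q4_K_M.gguf",
    "n_gpu_layers": -1,          # offload every layer to GPU
    "tensor_split": [0.5, 0.5],  # share of the model per visible GPU
    "n_ctx": 8192,
}

if __name__ == "__main__":
    print(split_layers(40, llama_kwargs["tensor_split"]))  # prints [20, 20]
```

With a third card in the same host reserved for another workload, one common approach is to expose only the two LLM cards to the inference container, for example through `CUDA_VISIBLE_DEVICES`; whether this project does so is not stated here.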

Distributed Deployment

GPU server with the model plus a Telegram bot on an isolated VDS. The bot reaches Postgres through an SSH tunnel with a restricted key (permitopen=127.0.0.1:5432) and reaches the LLM over HTTPS + Bearer. This works around the blocking of api.telegram.org. A single account spans web and bot through custom Telegram deep-link auth (no Widget), so dialogues and settings are shared.
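The deep-link account linking can be pictured roughly as follows: the web app issues a short-lived one-time token, embeds it in a `t.me` link, and the bot redeems the token when it receives `/start <token>`. This is a hypothetical in-memory sketch; the 10-minute lifetime, the function names, and the dictionary store are assumptions, and the project itself persists tokens through its TelegramLinkToken model.

```python
import secrets
import time
from typing import Optional

BOT_USERNAME = "saiga_ai_bot"
TOKEN_TTL_SECONDS = 600  # assumed lifetime, not taken from the project

# token -> (web_user_id, expires_at); stands in for a database table
_pending: dict = {}


def create_link(web_user_id: int, now: Optional[float] = None) -> str:
    """Issue a one-time token and return the Telegram deep link for it."""
    now = time.time() if now is None else now
    # 24 random bytes -> 32 URL-safe characters, within Telegram's
    # 64-character limit for the start parameter.
    token = secrets.token_urlsafe(24)
    _pending[token] = (web_user_id, now + TOKEN_TTL_SECONDS)
    return f"https://t.me/{BOT_USERNAME}?start={token}"


def redeem(token: str, now: Optional[float] = None) -> Optional[int]:
    """Return the linked web user id, or None if unknown, used, or expired."""
    now = time.time() if now is None else now
    entry = _pending.pop(token, None)  # pop makes the token single-use
    if entry is None:
        return None
    user_id, expires_at = entry
    return user_id if now <= expires_at else None
```

On the bot side, the handler for `/start` would pass its argument to `redeem` and attach the Telegram user to the returned web account.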

Production Observability

Sentry on the web app and the bot; Prometheus + Grafana with custom dashboards: GPU memory/utilisation/temperature/power via NVIDIA DCGM-exporter, containers via cAdvisor v0.51, host metrics via node-exporter. Access is through metrics.vaibkod.ru with basic-auth.

CI/CD + Tests

GitHub Actions with ruff (strict) + pytest. 62 unit tests with 99% coverage on the tested modules (markdown converter, database URL converter, env validators, TelegramLinkToken model). Coverage artifact upload.
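One of the tested modules is a database URL converter. The repository's implementation is not shown on this page, so the following is only a guess at what such a helper might do: rewrite a synchronous Postgres URL into the `postgresql+asyncpg://` form that async SQLAlchemy expects. The function name and accepted prefixes are hypothetical.

```python
ASYNC_PREFIX = "postgresql+asyncpg://"
SYNC_PREFIXES = ("postgresql+psycopg2://", "postgresql://", "postgres://")


def to_async_url(url: str) -> str:
    """Convert a sync Postgres URL to the asyncpg dialect for SQLAlchemy."""
    if url.startswith(ASYNC_PREFIX):
        return url  # already in the async form
    for prefix in SYNC_PREFIXES:
        if url.startswith(prefix):
            return ASYNC_PREFIX + url[len(prefix):]
    # Do not echo the URL in the error: it may contain a password.
    raise ValueError("unsupported database URL scheme")
```

Small pure functions like this are easy to cover with unit tests, which fits the high coverage reported for the tested modules.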

Architecture

Distributed across 2 physical hosts · 5 subdomains with different auth policies

// GPU host

  • Debian 12 · 2× P104-100 (LLM)
  • Web UI (Flask + Gunicorn)
  • LLM inference (llama.cpp)
  • PostgreSQL 16 · Redis 7
  • Caddy reverse-proxy
  • Prometheus + Grafana stack

// VDS (telegram bot)

  • python-telegram-bot 20.7 async
  • aiohttp (LLM client + retry)
  • Async SQLAlchemy 2.0 + asyncpg
  • SSH-tunnel to Postgres
  • autossh systemd service
  • Bearer-auth to llm.vaibkod.ru
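The tunnel setup listed above can be sketched as two pieces: an `authorized_keys` entry on the GPU host that only permits forwarding to the Postgres port, and the `autossh` command the VDS runs to keep a local forward alive. The helper names, the remote address, and the keep-alive values are hypothetical; the OpenSSH options (`restrict`, `port-forwarding`, `permitopen`) and the `autossh`/`ssh` flags are standard.

```python
def restricted_key_entry(public_key: str, host: str = "127.0.0.1",
                         port: int = 5432) -> str:
    """authorized_keys line: no shell, forwarding allowed to one target only."""
    # "restrict" disables everything; "port-forwarding" re-enables forwarding,
    # and "permitopen" narrows it to a single host:port.
    options = f'restrict,port-forwarding,permitopen="{host}:{port}"'
    return f"{options} {public_key}"


def autossh_argv(remote: str, local_port: int = 5432,
                 target_host: str = "127.0.0.1",
                 target_port: int = 5432) -> list:
    """Command line for a persistent local forward to the remote Postgres."""
    return [
        "autossh", "-M", "0",             # rely on SSH keep-alives, no monitor port
        "-N",                             # no remote command, forwarding only
        "-o", "ServerAliveInterval=30",
        "-o", "ServerAliveCountMax=3",
        "-o", "ExitOnForwardFailure=yes",
        "-L", f"127.0.0.1:{local_port}:{target_host}:{target_port}",
        remote,
    ]


if __name__ == "__main__":
    # Placeholder key and host, for illustration only.
    print(restricted_key_entry("ssh-ed25519 AAAA... bot@vds"))
    print(" ".join(autossh_argv("tunnel@gpu-host.example")))
```

The resulting argument list could serve as the `ExecStart` line of a systemd unit, so the tunnel restarts with the machine; the bot then connects to Postgres at `127.0.0.1:5432` on the VDS.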

// Caddy subdomains

  • saiga.vaibkod.ru — public web
  • llm.vaibkod.ru — Bearer-auth API
  • saigaui.vaibkod.ru — basic-auth
  • metrics.vaibkod.ru — basic-auth
  • 10+ services across 2 hosts
  • auto Let's Encrypt

Tech Stack

// LLM SERVING

text-generation-webui (oobabooga fork), llama-cpp-python with GGML_CUDA=on, PyTorch 2.7 + cu126, GGUF Q4_K_M quantisation

// GPU INFRASTRUCTURE

2× NVIDIA P104-100 (Pascal, no tensor cores), CUDA 11.8 in container, host driver 535.x, nvidia-container-toolkit runtime

// BACKEND (AI SCENARIOS)

Python 3.11, Flask + Gunicorn (web with stream chat), python-telegram-bot 20.7 async, aiohttp (LLM API client with retry + Bearer), async SQLAlchemy 2.0 + asyncpg
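The LLM client pattern named above (Bearer auth plus retry) might look roughly like this. The project uses aiohttp; this sketch uses only the standard library so it stays self-contained. The `/v1/chat/completions` path follows the usual OpenAI-compatible convention and is an assumption about this service, as are the function names and retry parameters.

```python
import json
import time
import urllib.request


def build_chat_request(base_url: str, token: str, messages: list,
                       max_tokens: int = 256) -> urllib.request.Request:
    """Build a POST request for an OpenAI-compatible chat endpoint."""
    body = json.dumps({"messages": messages, "max_tokens": max_tokens})
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/v1/chat/completions",  # assumed path
        data=body.encode("utf-8"),
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


def with_retry(call, attempts: int = 3, base_delay: float = 0.5,
               sleep=time.sleep):
    """Run call(), retrying network errors with exponential backoff."""
    for attempt in range(attempts):
        try:
            return call()
        except OSError:  # urllib.error.URLError is a subclass of OSError
            if attempt == attempts - 1:
                raise
            sleep(base_delay * 2 ** attempt)
```

A call would then read `with_retry(lambda: urllib.request.urlopen(req, timeout=60))`. A production client would likely retry only on timeouts and 5xx responses rather than on every error, since `HTTPError` for a 4xx response is also caught here.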

// STORAGE

PostgreSQL 16 (shared between web and bot via unified saiga_shared SQLAlchemy package), Alembic for migrations with DDL/DML role split (_migrator for Alembic, _app for runtime — runtime cannot DROP/ALTER), Redis 7 for bot state

// OBSERVABILITY

Sentry SDK (Flask integration + manual init for bot), Prometheus 2.55, Grafana 11.3, DCGM-exporter 3.3.7, cAdvisor v0.51, node-exporter 1.8

// CI/CD + INFRA

GitHub Actions (ruff strict + pytest with coverage), Caddy 2 reverse-proxy with auto Let's Encrypt, Docker Compose (10+ services), autossh systemd units for tunnelling

Saiga Model Family

Choose the model that fits your hardware

Model | Parameters | Context | Base | VRAM (Q4)
Saiga LLaMA 2 7B | 7B | 4 096 | Meta LLaMA 2 | ~6 GB
Saiga Mistral 7B | 7B | 8 192 | Mistral 7B | ~6 GB
Saiga LLaMA 3 8B | 8B | 8 192 | Meta LLaMA 3 | ~7 GB
Saiga LLaMA 2 13B | 13B | 4 096 | Meta LLaMA 2 | ~10 GB
★ Saiga Nemo 12B | 12.2B | 128 000 | Mistral NeMo / NVIDIA | ~9 GB (split)
Saiga LLaMA 2 70B | 70B | 4 096 | Meta LLaMA 2 | ~48 GB

Training Methods

SFT

Supervised fine-tuning on Russian instruction-response pairs

DPO / SimPO

Preference optimisation using paired comparison data

SLERP

Model merging via spherical linear interpolation

LoRA

Parameter-efficient adaptation for targeted fine-tuning

Quantisation Levels

GGUF format via llama.cpp. Choose quality vs VRAM trade-off:

Q2_K: 2-bit · Q3_K_M: 3-bit · Q4_K_M: 4-bit ★ · Q5_K_M: 5-bit · Q6_K: 6-bit · Q8_0: 8-bit

★ Q4_K_M recommended — best quality/VRAM ratio

Deployment Frameworks

llama.cpp: CPU + GPU inference, GGUF native (used here)
Ollama: easy local deployment, REST API
vLLM: high-throughput serving; requires GPUs with compute capability 7.0+, so Pascal cards such as the P104-100 are not supported
Transformers: HuggingFace native, full precision

Want to integrate local AI?

Local LLM deployment for your business — data security, full control, no cloud costs. Let's discuss your use case.

Send Request