llm: add FastAPI shim, gateway LLM endpoints, tests, and docs

2026-04-12 09:41:21 +02:00
parent baf497b015
commit 59c9584250
15 changed files with 1779 additions and 11 deletions

README.md

@@ -1,7 +1,7 @@
# Skinbase Vision Stack (CLIP + BLIP + YOLO + Qdrant + Card Renderer + Maturity) Dockerized FastAPI
# Skinbase Vision Stack (CLIP + BLIP + YOLO + Qdrant + Card Renderer + Maturity + LLM) Dockerized FastAPI
This repository provides **six standalone vision services** (CLIP / BLIP / YOLO / Qdrant / Card Renderer / Maturity)
and a **Gateway API** that can call them individually or together.
This repository provides internal AI services for image analysis, vector search, card rendering, moderation,
and text generation behind a single **Gateway API**.
## Services & Ports
@@ -13,6 +13,7 @@ and a **Gateway API** that can call them individually or together.
- `qdrant-svc`: internal Qdrant API wrapper
- `card-renderer`: internal card rendering service
- `maturity`: internal NSFW/maturity classifier service
- `llm`: internal text-generation service, a thin FastAPI shim over `llama-server` (enabled via the `llm` compose profile; not exposed publicly)
## Run
@@ -20,6 +21,16 @@ and a **Gateway API** that can call them individually or together.
docker compose up -d --build
```
That starts the default vision stack only. The LLM service is disabled by default so operators are not forced to run Qwen3 on the same host.
To also start the local llama.cpp service:
```bash
docker compose --profile llm up -d --build
```
Before enabling the `llm` profile locally, place the GGUF model file described in [models/qwen3/README.md](models/qwen3/README.md) and set `LLM_ENABLED=true` in `.env`.
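For example, assuming the host directory `models/qwen3/` is what the compose file mounts at `/models` and the filename matches `MODEL_PATH` in the optional LLM configuration below (both are assumptions; [models/qwen3/README.md](models/qwen3/README.md) is authoritative), the layout would look like:
```bash
# Hypothetical layout check before enabling the llm profile
ls models/qwen3/
# Qwen3-1.7B-Instruct-Q4_K_M.gguf
```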
If you use BLIP, create a `.env` file first.
Required variables:
@@ -40,6 +51,26 @@ MATURITY_THRESHOLD_REVIEW=0.60
MATURITY_ENABLED=true
```
Optional LLM configuration:
```bash
LLM_ENABLED=false
LLM_URL=http://llm:8080
LLM_DEFAULT_MODEL=qwen3-1.7b-instruct-q4_k_m
LLM_TIMEOUT=120
LLM_MAX_TOKENS_DEFAULT=256
LLM_MAX_TOKENS_HARD_LIMIT=1024
LLM_MAX_REQUEST_BYTES=65536
# Local llm profile only
MODEL_PATH=/models/Qwen3-1.7B-Instruct-Q4_K_M.gguf
LLM_CONTEXT_SIZE=4096
LLM_THREADS=4
LLM_GPU_LAYERS=0
```
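For reference, the local-profile variables correspond roughly to standard `llama-server` flags. This is only a sketch of what the shim is assumed to launch; the actual startup command lives in the `llm` service:
```bash
# Rough, hypothetical mapping of the profile variables to llama-server flags.
# The shim's real startup command may differ.
llama-server \
  --model "${MODEL_PATH}" \
  --ctx-size "${LLM_CONTEXT_SIZE}" \
  --threads "${LLM_THREADS}" \
  -ngl "${LLM_GPU_LAYERS}" \
  --host 0.0.0.0 \
  --port 8080
```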
Recommended production topology for the LLM: keep the gateway on the current vision host and point `LLM_URL` at a separate private machine or VPN-reachable container host. Running the full vision stack and Qwen3 together on a small 4c/8GB VPS will usually degrade both workloads.
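A minimal `.env` sketch for that split topology (the address below is a placeholder; point it at whatever private or VPN-reachable host runs the LLM):
```bash
# Hypothetical split deployment: gateway on the vision host, LLM on a separate private host.
LLM_ENABLED=true
LLM_URL=http://10.8.0.12:8080
```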
Service startup now waits on container healthchecks, so first boot may take longer while models finish loading.
## Health
@@ -48,6 +79,71 @@ Service startup now waits on container healthchecks, so first boot may take long
curl -H "X-API-Key: <your-api-key>" https://vision.klevze.net/health
```
LLM-specific gateway health:
```bash
curl -H "X-API-Key: <your-api-key>" https://vision.klevze.net/ai/health
```
## LLM Smoke Test
Use this checklist on a Docker-capable host after provisioning the GGUF file and setting `LLM_ENABLED=true`.
1. Start the gateway and local LLM profile.
```bash
docker compose --profile llm up -d --build gateway llm
```
2. Confirm the LLM container is running and healthy.
```bash
docker compose ps llm
docker compose logs --tail=100 llm
```
3. Check the internal LLM health contract.
```bash
curl http://127.0.0.1:8080/health
```
Expected fields: `status`, `model`, `context_size`, `threads`.
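As a quick sanity check, the illustrative call below verifies those fields are present (requires `jq`; the values in the output depend on your GGUF file and `.env`):
```bash
# Illustrative field check; output values will differ per deployment.
curl -s http://127.0.0.1:8080/health | jq '{status, model, context_size, threads}'
```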
4. Check gateway health and LLM reachability.
```bash
curl -H "X-API-Key: <your-api-key>" http://127.0.0.1:8003/health
curl -H "X-API-Key: <your-api-key>" http://127.0.0.1:8003/ai/health
```
5. Verify model discovery through the gateway.
```bash
curl -H "X-API-Key: <your-api-key>" http://127.0.0.1:8003/v1/models
```
6. Run a short non-streaming chat completion.
```bash
curl -H "X-API-Key: <your-api-key>" -X POST http://127.0.0.1:8003/ai/chat \
-H "Content-Type: application/json" \
-d '{
"messages": [
{"role": "system", "content": "You are a concise assistant for Skinbase Nova."},
{"role": "user", "content": "Write one sentence about an artist who creates cinematic sci-fi wallpaper packs."}
],
"max_tokens": 80
}'
```
7. If anything fails, inspect the two relevant services first.
```bash
docker compose logs --tail=200 llm
docker compose logs --tail=200 gateway
```
## Universal analyze (ALL)
### With URL
@@ -271,11 +367,51 @@ curl -H "X-API-Key: <your-api-key>" -X POST https://vision.klevze.net/cards/rend
-d '{"url":"https://files.skinbase.org/img/aa/bb/cc/md.webp","title":"Artwork Title"}'
```
## LLM / Chat Completions
The gateway exposes stable text-generation endpoints backed by the internal `llm` service. They reuse the existing `X-API-Key` protection and keep the LLM container internal-only.
### OpenAI-style chat endpoint
```bash
curl -H "X-API-Key: <your-api-key>" -X POST https://vision.klevze.net/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"messages": [
{"role": "system", "content": "You are a concise assistant for Skinbase Nova."},
{"role": "user", "content": "Write a short creator biography for an artist who just hit 10,000 followers."}
],
"temperature": 0.7,
"max_tokens": 220,
"stream": false
}'
```
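Streaming is requested by setting `stream` to `true`. The hypothetical variant below assumes the gateway proxies the usual OpenAI-style server-sent-events stream from `llama-server`; `-N` disables curl buffering so chunks print as they arrive:
```bash
# Hypothetical streaming call; only meaningful if the gateway forwards the SSE stream.
curl -N -H "X-API-Key: <your-api-key>" -X POST https://vision.klevze.net/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "messages": [
          {"role": "user", "content": "Write one sentence about a synthwave wallpaper pack."}
        ],
        "max_tokens": 120,
        "stream": true
      }'
```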
### Project-friendly chat endpoint
```bash
curl -H "X-API-Key: <your-api-key>" -X POST https://vision.klevze.net/ai/chat \
-H "Content-Type: application/json" \
-d '{
"messages": [
{"role": "system", "content": "You are a concise assistant for Skinbase Nova."},
{"role": "user", "content": "Suggest metadata tags for a cyberpunk wallpaper pack."}
],
"max_tokens": 180
}'
```
### List models
```bash
curl -H "X-API-Key: <your-api-key>" https://vision.klevze.net/v1/models
curl -H "X-API-Key: <your-api-key>" https://vision.klevze.net/ai/models
```
## Notes
- Models are loaded at service startup; initial container start can take 1–2 minutes as model weights are downloaded.
- Qdrant data is persisted in the project folder at `./data/qdrant`, so it survives container restarts and recreates.
- The local `llm` profile does **not** auto-download Qwen3 weights. Mount the GGUF file explicitly and let startup fail fast if it is missing.
- Remote image URLs are restricted to public `http`/`https` hosts. Localhost, private IP ranges, and non-image content types are rejected.
- The maturity service uses `Falconsai/nsfw_image_detection` (ViT-based). Thresholds are configurable via `.env`. The model handles photos and stylized digital art but should be calibrated against real Skinbase content before production use.
- For small VPS deployments, prefer `LLM_ENABLED=true` with `LLM_URL` pointing to a separate LLM host instead of running the `llm` profile on the same machine.
- For production: add auth, rate limits, and restrict gateway exposure (private network).
- GPU: you can add NVIDIA runtime later (compose profiles) if needed.