llm: add FastAPI shim, gateway LLM endpoints, tests, and docs

2026-04-12 09:41:21 +02:00
parent baf497b015
commit 59c9584250
15 changed files with 1779 additions and 11 deletions

USAGE.md (198 lines changed)

@@ -1,10 +1,10 @@
# Skinbase Vision Stack — Usage Guide
This document explains how to run and use the Skinbase Vision Stack (Gateway + CLIP, BLIP, YOLO, Qdrant, Card Renderer, Maturity, and optional LLM services).
## Overview
- Services: `gateway`, `clip`, `blip`, `yolo`, `qdrant`, `qdrant-svc`, `card-renderer`, `maturity`, `llm` (FastAPI each except `qdrant`; `llm` is a thin FastAPI shim that manages an internal `llama-server` process).
- Gateway is the public API endpoint; the other services are internal.
## Model overview
@@ -21,6 +21,8 @@ This document explains how to run and use the Skinbase Vision Stack (Gateway + C
- **Maturity**: Dedicated NSFW/maturity classifier. Accepts an image and returns a normalized safety signal including `maturity_label` (`safe`/`mature`), `confidence`, raw `score`, optional sublabels (e.g. `nsfw`), and an `action_hint` (`safe`, `review`, `flag_high`) designed for Nova moderation workflows. Powered by `Falconsai/nsfw_image_detection` (ViT-based, HuggingFace). Thresholds are configurable via environment variables.
- **LLM**: Internal text-generation service backed by `llama.cpp` and a GGUF Qwen3 model. Exposed through the gateway for non-streaming chat completions and model discovery. Intended for Nova workflows such as creator bios, metadata suggestions, moderation helper text, and other short internal generation tasks.
## Prerequisites
- Docker Desktop (with `docker compose`) or a Docker environment.
@@ -55,12 +57,48 @@ MATURITY_ENABLED=true
- `MATURITY_THRESHOLD_REVIEW`: score above this but below mature threshold → `mature` + `review` (default `0.60`).
- `MATURITY_ENABLED`: set to `false` to disable maturity endpoints at the gateway without removing the service.
Optional LLM configuration:
```bash
LLM_URL=http://llm:8080
LLM_ENABLED=false
LLM_TIMEOUT=120
LLM_DEFAULT_MODEL=qwen3-1.7b-instruct-q4_k_m
LLM_MAX_TOKENS_DEFAULT=256
LLM_MAX_TOKENS_HARD_LIMIT=1024
LLM_MAX_REQUEST_BYTES=65536
# Local llm profile only
MODEL_PATH=/models/Qwen3-1.7B-Instruct-Q4_K_M.gguf
LLM_CONTEXT_SIZE=4096
LLM_THREADS=4
LLM_GPU_LAYERS=0
LLM_EXTRA_ARGS=
```
Run from repository root:
```bash
docker compose up -d --build
```
That starts the default vision stack only.
To also start the local LLM service:
```bash
docker compose --profile llm up -d --build
```
Before enabling the `llm` profile, provision the GGUF model described in [models/qwen3/README.md](models/qwen3/README.md) and set `LLM_ENABLED=true` in `.env`.
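As a rough sketch of what provisioning usually looks like, assuming the compose file mounts the repository's `models/` directory at `/models` (the README above is authoritative; the source path below is a placeholder):
```bash
# Place the quantized GGUF where MODEL_PATH points (container path: /models/Qwen3-1.7B-Instruct-Q4_K_M.gguf)
cp /path/to/Qwen3-1.7B-Instruct-Q4_K_M.gguf models/
```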
For small production hosts, the preferred setup is usually to keep the gateway local and point `LLM_URL` at a separate private LLM host:
```bash
LLM_ENABLED=true
LLM_URL=http://private-llm-host:8080
```
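Before cutting the gateway over, it is worth confirming the remote shim is reachable from the gateway host; it exposes the same `/health` endpoint as the local profile (host name below is the placeholder from the example above):
```bash
curl http://private-llm-host:8080/health
```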
Stop:
```bash
docker compose down
```
@@ -82,6 +120,74 @@ Check the gateway health endpoint:
```bash
curl https://vision.klevze.net/health
```
Check LLM-specific gateway health:
```bash
curl -H "X-API-Key: <your-api-key>" https://vision.klevze.net/ai/health
```
## LLM smoke test checklist
Use this sequence on a machine with Docker available, after mounting the GGUF model and setting `LLM_ENABLED=true` for the gateway.
1. Start the gateway with the `llm` profile.
```bash
docker compose --profile llm up -d --build gateway llm
```
2. Confirm the LLM service came up cleanly.
```bash
docker compose ps llm
docker compose logs --tail=100 llm
```
3. Check the repo-owned internal health endpoint.
```bash
curl http://127.0.0.1:8080/health
```
Expected fields: `status`, `model`, `context_size`, `threads`.
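A healthy response might look like the sketch below; only the field names are promised by this checklist, and the values shown are illustrative (taken from the example configuration earlier in this guide):
```json
{
  "status": "ok",
  "model": "qwen3-1.7b-instruct-q4_k_m",
  "context_size": 4096,
  "threads": 4
}
```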
4. Confirm the gateway sees the LLM backend.
```bash
curl -H "X-API-Key: <your-api-key>" http://127.0.0.1:8003/health
curl -H "X-API-Key: <your-api-key>" http://127.0.0.1:8003/ai/health
```
5. Verify model discovery.
```bash
curl -H "X-API-Key: <your-api-key>" http://127.0.0.1:8003/v1/models
curl -H "X-API-Key: <your-api-key>" http://127.0.0.1:8003/ai/models
```
6. Run a small chat request through the gateway.
```bash
curl -X POST http://127.0.0.1:8003/v1/chat/completions \
  -H "X-API-Key: <your-api-key>" \
  -H "Content-Type: application/json" \
  -d '{
    "messages": [
      {"role": "system", "content": "You are a concise assistant for Skinbase Nova."},
      {"role": "user", "content": "Write one short admin help sentence about reviewing wallpaper metadata."}
    ],
    "max_tokens": 60,
    "stream": false
  }'
```
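To pull just the generated text out of step 6, pipe the response through `jq`, assuming the standard OpenAI-style response shape documented under "LLM / Chat endpoints" below:
```bash
curl -s -X POST http://127.0.0.1:8003/v1/chat/completions \
  -H "X-API-Key: <your-api-key>" \
  -H "Content-Type: application/json" \
  -d '{"messages": [{"role": "user", "content": "Say hello in five words."}], "max_tokens": 30, "stream": false}' \
  | jq -r '.choices[0].message.content'
```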
7. If startup or health fails, inspect the relevant logs.
```bash
docker compose logs --tail=200 llm
docker compose logs --tail=200 gateway
```
## Universal analyze (ALL)
Analyze an image by URL (gateway aggregates CLIP, BLIP, YOLO):
@@ -241,7 +347,93 @@ Response fields:
- `review`: score ≥ `MATURITY_THRESHOLD_REVIEW` (default 0.60) but below mature threshold — possible mature, queue for human review.
- `safe`: score below both thresholds — content appears safe.
If the maturity service is unavailable the gateway returns a `502` or `503` error. **Nova must not treat a gateway failure as a `safe` result** — retry or queue for later processing.
## LLM / Chat endpoints
The gateway validates requests, clamps `max_tokens` to configured limits, rejects oversized payloads, and normalizes downstream failures into JSON under an `error` key.
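A quick way to see the payload guard in action is to send a body larger than the default `LLM_MAX_REQUEST_BYTES` (65536) and check for a `413`; this is an illustrative probe, and nothing beyond the status code and the `error` key is specified:
```bash
# Build a >64 KiB user message to exercise the request-size limit (illustrative probe).
BIG=$(head -c 70000 /dev/zero | tr '\0' 'x')
curl -s -o /dev/null -w "%{http_code}\n" -X POST https://vision.klevze.net/v1/chat/completions \
  -H "X-API-Key: <your-api-key>" \
  -H "Content-Type: application/json" \
  -d "{\"messages\": [{\"role\": \"user\", \"content\": \"$BIG\"}]}"
```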
### OpenAI-style chat completions
```bash
curl -X POST https://vision.klevze.net/v1/chat/completions \
  -H "X-API-Key: <your-api-key>" \
  -H "Content-Type: application/json" \
  -d '{
    "messages": [
      {"role": "system", "content": "You are a concise assistant for Skinbase Nova."},
      {"role": "user", "content": "Write a short biography for a creator known for sci-fi environments."}
    ],
    "temperature": 0.7,
    "max_tokens": 220,
    "stream": false
  }'
```
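A successful response should follow the usual OpenAI chat-completions layout; the sketch below assumes that shape, with illustrative values:
```json
{
  "model": "qwen3-1.7b-instruct-q4_k_m",
  "choices": [
    {
      "index": 0,
      "message": {
        "role": "assistant",
        "content": "A veteran digital artist celebrated for sweeping sci-fi environments..."
      },
      "finish_reason": "stop"
    }
  ],
  "usage": {
    "prompt_tokens": 52,
    "completion_tokens": 180,
    "total_tokens": 232
  }
}
```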
Supported request fields:
- `messages` (required)
- `temperature`
- `max_tokens`
- `stream` (`false` only in v1)
- `top_p`
- `stop`
- `presence_penalty`
- `frequency_penalty`
Validation rules:
- At least one message is required.
- Roles must be `system`, `user`, or `assistant`.
- Empty message content is rejected.
- Oversized request bodies return `413`.
- `max_tokens` is clamped to `LLM_MAX_TOKENS_HARD_LIMIT`.
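For example, a message with a role outside the allowed set should be rejected with `422` (illustrative probe; `tool` is deliberately not a supported role):
```bash
curl -s -o /dev/null -w "%{http_code}\n" -X POST https://vision.klevze.net/v1/chat/completions \
  -H "X-API-Key: <your-api-key>" \
  -H "Content-Type: application/json" \
  -d '{"messages": [{"role": "tool", "content": "hi"}]}'
```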
### Project-friendly chat response
```bash
curl -X POST https://vision.klevze.net/ai/chat \
  -H "X-API-Key: <your-api-key>" \
  -H "Content-Type: application/json" \
  -d '{
    "messages": [
      {"role": "system", "content": "You are a helpful metadata assistant."},
      {"role": "user", "content": "Suggest five tags for a fantasy castle wallpaper."}
    ]
  }'
```
Example response:
```json
{
  "model": "qwen3-1.7b-instruct-q4_k_m",
  "content": "fantasy castle, moonlit fortress, medieval towers, epic landscape, digital painting",
  "finish_reason": "stop",
  "usage": {
    "prompt_tokens": 48,
    "completion_tokens": 19,
    "total_tokens": 67
  }
}
```
### Model discovery
```bash
curl -H "X-API-Key: <your-api-key>" https://vision.klevze.net/v1/models
curl -H "X-API-Key: <your-api-key>" https://vision.klevze.net/ai/models
```
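The `/v1/models` response is expected to mirror the OpenAI model-list format; a sketch under that assumption (the `/ai/models` shape is not shown here and may differ):
```json
{
  "object": "list",
  "data": [
    {
      "id": "qwen3-1.7b-instruct-q4_k_m",
      "object": "model"
    }
  ]
}
```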
### Failure modes
- `401`: missing or invalid API key
- `413`: request body exceeds `LLM_MAX_REQUEST_BYTES`
- `422`: validation failure or unsupported streaming request
- `503`: LLM disabled or upstream unavailable
- `504`: upstream timeout
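A minimal client-side handling sketch against `/ai/chat`, mapping these statuses onto retry-versus-fail behavior (assumes `curl` and `jq`; the `error` key is the normalized failure shape described above):
```bash
status=$(curl -s -o /tmp/llm_resp.json -w "%{http_code}" -X POST https://vision.klevze.net/ai/chat \
  -H "X-API-Key: <your-api-key>" \
  -H "Content-Type: application/json" \
  -d '{"messages": [{"role": "user", "content": "ping"}]}')
case "$status" in
  200)     jq -r '.content' /tmp/llm_resp.json ;;                     # success: print generated text
  503|504) echo "LLM unavailable or timed out; retry or queue" >&2 ;; # transient: do not treat as final
  *)       jq -r '.error // .' /tmp/llm_resp.json >&2 ;;              # other failures: surface the error key
esac
```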
## Vector DB (Qdrant)
Use the Qdrant gateway endpoints to store image embeddings and find visually similar images. Embeddings are generated automatically by the CLIP service.
Qdrant point IDs must be either an unsigned integer or a UUID string. If you send another string value, the wrapper may replace it with a generated UUID and store the original value in metadata as `_original_id`.
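For example, a point submitted with the string ID `wallpaper-123` might come back stored like this (illustrative: the UUID is generated per point, and nesting `_original_id` under `payload` is an assumption about where the wrapper keeps metadata):
```json
{
  "id": "6f1c2d3e-4b5a-4c6d-8e9f-0a1b2c3d4e5f",
  "payload": {
    "_original_id": "wallpaper-123"
  }
}
```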