llm: add FastAPI shim, gateway LLM endpoints, tests, and docs

2026-04-12 09:41:21 +02:00
parent baf497b015
commit 59c9584250
15 changed files with 1779 additions and 11 deletions

USAGE.md (198 lines changed)

@@ -1,10 +1,10 @@
# Skinbase Vision Stack — Usage Guide
This document explains how to run and use the Skinbase Vision Stack (Gateway + CLIP, BLIP, YOLO, Qdrant, Card Renderer, Maturity, and optional LLM services).
## Overview
- Services: `gateway`, `clip`, `blip`, `yolo`, `qdrant`, `qdrant-svc`, `card-renderer`, `maturity`, `llm` (FastAPI each except `qdrant`; `llm` is a thin FastAPI shim that manages an internal `llama-server` process).
- Gateway is the public API endpoint; the other services are internal.
## Model overview
@@ -21,6 +21,8 @@ This document explains how to run and use the Skinbase Vision Stack (Gateway + C
- **Maturity**: Dedicated NSFW/maturity classifier. Accepts an image and returns a normalized safety signal including `maturity_label` (`safe`/`mature`), `confidence`, raw `score`, optional sublabels (e.g. `nsfw`), and an `action_hint` (`safe`, `review`, `flag_high`) designed for Nova moderation workflows. Powered by `Falconsai/nsfw_image_detection` (ViT-based, HuggingFace). Thresholds are configurable via environment variables.
- **LLM**: Internal text-generation service backed by `llama.cpp` and a GGUF Qwen3 model. Exposed through the gateway for non-streaming chat completions and model discovery. Intended for Nova workflows such as creator bios, metadata suggestions, moderation helper text, and other short internal generation tasks.
## Prerequisites
- Docker Desktop (with `docker compose`) or a Docker environment.
@@ -55,12 +57,48 @@ MATURITY_ENABLED=true
- `MATURITY_THRESHOLD_REVIEW`: score above this but below mature threshold → `mature` + `review` (default `0.60`).
- `MATURITY_ENABLED`: set to `false` to disable maturity endpoints at the gateway without removing the service.
Optional LLM configuration:
```bash
LLM_URL=http://llm:8080
LLM_ENABLED=false
LLM_TIMEOUT=120
LLM_DEFAULT_MODEL=qwen3-1.7b-instruct-q4_k_m
LLM_MAX_TOKENS_DEFAULT=256
LLM_MAX_TOKENS_HARD_LIMIT=1024
LLM_MAX_REQUEST_BYTES=65536
# Local llm profile only
MODEL_PATH=/models/Qwen3-1.7B-Instruct-Q4_K_M.gguf
LLM_CONTEXT_SIZE=4096
LLM_THREADS=4
LLM_GPU_LAYERS=0
LLM_EXTRA_ARGS=
```
Run from repository root:
```bash
docker compose up -d --build
```
That starts the default vision stack only.
To also start the local LLM service:
```bash
docker compose --profile llm up -d --build
```
Before enabling the `llm` profile, provision the GGUF model described in [models/qwen3/README.md](models/qwen3/README.md) and set `LLM_ENABLED=true` in `.env`.
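As a rough sketch of what provisioning usually looks like, assuming the compose file mounts the repository's `models/` directory at `/models` (the README above is authoritative; the source path below is a placeholder):
```bash
# Place the quantized GGUF where MODEL_PATH points (container path: /models/Qwen3-1.7B-Instruct-Q4_K_M.gguf)
cp /path/to/Qwen3-1.7B-Instruct-Q4_K_M.gguf models/
```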
For small production hosts, the preferred setup is usually to keep the gateway local and point `LLM_URL` at a separate private LLM host:
```bash
LLM_ENABLED=true
LLM_URL=http://private-llm-host:8080
```
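Before cutting the gateway over, it is worth confirming the remote shim is reachable from the gateway host; it exposes the same `/health` endpoint as the local profile (host name below is the placeholder from the example above):
```bash
curl http://private-llm-host:8080/health
```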
Stop:
```bash
docker compose down
```
@@ -82,6 +120,74 @@ Check the gateway health endpoint:
```bash
curl https://vision.klevze.net/health
```
Check LLM-specific gateway health:
```bash
curl -H "X-API-Key: <your-api-key>" https://vision.klevze.net/ai/health
```
## LLM smoke test checklist
Use this sequence on a machine with Docker available, after mounting the GGUF model and setting `LLM_ENABLED=true` for the gateway.
1. Start the gateway with the `llm` profile.
```bash
docker compose --profile llm up -d --build gateway llm
```
2. Confirm the LLM service came up cleanly.
```bash
docker compose ps llm
docker compose logs --tail=100 llm
```
3. Check the repo-owned internal health endpoint.
```bash
curl http://127.0.0.1:8080/health
```
Expected fields: `status`, `model`, `context_size`, `threads`.
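A healthy response might look like the sketch below; only the field names are promised by this checklist, and the values shown are illustrative (taken from the example configuration earlier in this guide):
```json
{
  "status": "ok",
  "model": "qwen3-1.7b-instruct-q4_k_m",
  "context_size": 4096,
  "threads": 4
}
```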
4. Confirm the gateway sees the LLM backend.
```bash
curl -H "X-API-Key: <your-api-key>" http://127.0.0.1:8003/health
curl -H "X-API-Key: <your-api-key>" http://127.0.0.1:8003/ai/health
```
5. Verify model discovery.
```bash
curl -H "X-API-Key: <your-api-key>" http://127.0.0.1:8003/v1/models
curl -H "X-API-Key: <your-api-key>" http://127.0.0.1:8003/ai/models
```
6. Run a small chat request through the gateway.
```bash
curl -X POST http://127.0.0.1:8003/v1/chat/completions \
  -H "X-API-Key: <your-api-key>" \
  -H "Content-Type: application/json" \
  -d '{
    "messages": [
      {"role": "system", "content": "You are a concise assistant for Skinbase Nova."},
      {"role": "user", "content": "Write one short admin help sentence about reviewing wallpaper metadata."}
    ],
    "max_tokens": 60,
    "stream": false
  }'
```
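To pull just the generated text out of step 6, pipe the response through `jq`, assuming the standard OpenAI-style response shape documented under "LLM / Chat endpoints" below:
```bash
curl -s -X POST http://127.0.0.1:8003/v1/chat/completions \
  -H "X-API-Key: <your-api-key>" \
  -H "Content-Type: application/json" \
  -d '{"messages": [{"role": "user", "content": "Say hello in five words."}], "max_tokens": 30, "stream": false}' \
  | jq -r '.choices[0].message.content'
```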
7. If startup or health fails, inspect the relevant logs.
```bash
docker compose logs --tail=200 llm
docker compose logs --tail=200 gateway
```
## Universal analyze (ALL)
Analyze an image by URL (gateway aggregates CLIP, BLIP, YOLO):
@@ -241,7 +347,93 @@ Response fields:
- `review`: score ≥ `MATURITY_THRESHOLD_REVIEW` (default 0.60) but below mature threshold — possible mature, queue for human review.
- `safe`: score below both thresholds — content appears safe.
If the maturity service is unavailable the gateway returns a `502` or `503` error. **Nova must not treat a gateway failure as a `safe` result** — retry or queue for later processing.
## LLM / Chat endpoints
The gateway validates requests, clamps `max_tokens` to configured limits, rejects oversized payloads, and normalizes downstream failures into JSON under an `error` key.
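A quick way to see the payload guard in action is to send a body larger than the default `LLM_MAX_REQUEST_BYTES` (65536) and check for a `413`; this is an illustrative probe, and nothing beyond the status code and the `error` key is specified:
```bash
# Build a >64 KiB user message to exercise the request-size limit (illustrative probe).
BIG=$(head -c 70000 /dev/zero | tr '\0' 'x')
curl -s -o /dev/null -w "%{http_code}\n" -X POST https://vision.klevze.net/v1/chat/completions \
  -H "X-API-Key: <your-api-key>" \
  -H "Content-Type: application/json" \
  -d "{\"messages\": [{\"role\": \"user\", \"content\": \"$BIG\"}]}"
```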
### OpenAI-style chat completions
```bash
curl -X POST https://vision.klevze.net/v1/chat/completions \
  -H "X-API-Key: <your-api-key>" \
  -H "Content-Type: application/json" \
  -d '{
    "messages": [
      {"role": "system", "content": "You are a concise assistant for Skinbase Nova."},
      {"role": "user", "content": "Write a short biography for a creator known for sci-fi environments."}
    ],
    "temperature": 0.7,
    "max_tokens": 220,
    "stream": false
  }'
```
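A successful response should follow the usual OpenAI chat-completions layout; the sketch below assumes that shape, with illustrative values:
```json
{
  "model": "qwen3-1.7b-instruct-q4_k_m",
  "choices": [
    {
      "index": 0,
      "message": {
        "role": "assistant",
        "content": "A veteran digital artist celebrated for sweeping sci-fi environments..."
      },
      "finish_reason": "stop"
    }
  ],
  "usage": {
    "prompt_tokens": 52,
    "completion_tokens": 180,
    "total_tokens": 232
  }
}
```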
Supported request fields:
- `messages` (required)
- `temperature`
- `max_tokens`
- `stream` (`false` only in v1)
- `top_p`
- `stop`
- `presence_penalty`
- `frequency_penalty`
Validation rules:
- At least one message is required.
- Roles must be `system`, `user`, or `assistant`.
- Empty message content is rejected.
- Oversized request bodies return `413`.
- `max_tokens` is clamped to `LLM_MAX_TOKENS_HARD_LIMIT`.
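For example, a message with a role outside the allowed set should be rejected with `422` (illustrative probe; `tool` is deliberately not a supported role):
```bash
curl -s -o /dev/null -w "%{http_code}\n" -X POST https://vision.klevze.net/v1/chat/completions \
  -H "X-API-Key: <your-api-key>" \
  -H "Content-Type: application/json" \
  -d '{"messages": [{"role": "tool", "content": "hi"}]}'
```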
### Project-friendly chat response
```bash
curl -X POST https://vision.klevze.net/ai/chat \
  -H "X-API-Key: <your-api-key>" \
  -H "Content-Type: application/json" \
  -d '{
    "messages": [
      {"role": "system", "content": "You are a helpful metadata assistant."},
      {"role": "user", "content": "Suggest five tags for a fantasy castle wallpaper."}
    ]
  }'
```
Example response:
```json
{
  "model": "qwen3-1.7b-instruct-q4_k_m",
  "content": "fantasy castle, moonlit fortress, medieval towers, epic landscape, digital painting",
  "finish_reason": "stop",
  "usage": {
    "prompt_tokens": 48,
    "completion_tokens": 19,
    "total_tokens": 67
  }
}
```
### Model discovery
```bash
curl -H "X-API-Key: <your-api-key>" https://vision.klevze.net/v1/models
curl -H "X-API-Key: <your-api-key>" https://vision.klevze.net/ai/models
```
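The `/v1/models` response is expected to mirror the OpenAI model-list format; a sketch under that assumption (the `/ai/models` shape is not shown here and may differ):
```json
{
  "object": "list",
  "data": [
    {
      "id": "qwen3-1.7b-instruct-q4_k_m",
      "object": "model"
    }
  ]
}
```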
### Failure modes
- `401`: missing or invalid API key
- `413`: request body exceeds `LLM_MAX_REQUEST_BYTES`
- `422`: validation failure or unsupported streaming request
- `503`: LLM disabled or upstream unavailable
- `504`: upstream timeout
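A minimal client-side handling sketch against `/ai/chat`, mapping these statuses onto retry-versus-fail behavior (assumes `curl` and `jq`; the `error` key is the normalized failure shape described above):
```bash
status=$(curl -s -o /tmp/llm_resp.json -w "%{http_code}" -X POST https://vision.klevze.net/ai/chat \
  -H "X-API-Key: <your-api-key>" \
  -H "Content-Type: application/json" \
  -d '{"messages": [{"role": "user", "content": "ping"}]}')
case "$status" in
  200)     jq -r '.content' /tmp/llm_resp.json ;;                     # success: print generated text
  503|504) echo "LLM unavailable or timed out; retry or queue" >&2 ;; # transient: do not treat as final
  *)       jq -r '.error // .' /tmp/llm_resp.json >&2 ;;              # other failures: surface the error key
esac
```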
## Vector DB (Qdrant)
Use the Qdrant gateway endpoints to store image embeddings and find visually similar images. Embeddings are generated automatically by the CLIP service.
Qdrant point IDs must be either an unsigned integer or a UUID string. If you send another string value, the wrapper may replace it with a generated UUID and store the original value in metadata as `_original_id`.
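For example, a point submitted with the string ID `wallpaper-123` might come back stored like this (illustrative: the UUID is generated per point, and nesting `_original_id` under `payload` is an assumption about where the wrapper keeps metadata):
```json
{
  "id": "6f1c2d3e-4b5a-4c6d-8e9f-0a1b2c3d4e5f",
  "payload": {
    "_original_id": "wallpaper-123"
  }
}
```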