llm: add FastAPI shim, gateway LLM endpoints, tests, and docs

2026-04-12 09:41:21 +02:00
parent baf497b015
commit 59c9584250
15 changed files with 1779 additions and 11 deletions

README.md

@@ -1,7 +1,7 @@
# Skinbase Vision Stack (CLIP + BLIP + YOLO + Qdrant + Card Renderer + Maturity) Dockerized FastAPI
# Skinbase Vision Stack (CLIP + BLIP + YOLO + Qdrant + Card Renderer + Maturity + LLM) Dockerized FastAPI
This repository provides **six standalone vision services** (CLIP / BLIP / YOLO / Qdrant / Card Renderer / Maturity)
and a **Gateway API** that can call them individually or together.
This repository provides internal AI services for image analysis, vector search, card rendering, moderation,
and text generation behind a single **Gateway API**.
## Services & Ports
@@ -13,6 +13,7 @@ and a **Gateway API** that can call them individually or together.
- `qdrant-svc`: internal Qdrant API wrapper
- `card-renderer`: internal card rendering service
- `maturity`: internal NSFW/maturity classifier service
- `llm`: internal text-generation service, a thin FastAPI shim over `llama-server` (enabled via the `llm` compose profile; not exposed publicly)
## Run
@@ -20,6 +21,16 @@ and a **Gateway API** that can call them individually or together.
docker compose up -d --build
```
That starts the default vision stack only. The LLM service is disabled by default so operators are not forced to run Qwen3 on the same host.
To also start the local llama.cpp service:
```bash
docker compose --profile llm up -d --build
```
Before enabling the `llm` profile locally, place the GGUF model file described in [models/qwen3/README.md](models/qwen3/README.md) and set `LLM_ENABLED=true` in `.env`.
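For example, assuming the host directory `models/qwen3/` is what the compose file mounts at `/models` and the filename matches `MODEL_PATH` in the optional LLM configuration below (both are assumptions; [models/qwen3/README.md](models/qwen3/README.md) is authoritative), the layout would look like:
```bash
# Hypothetical layout check before enabling the llm profile
ls models/qwen3/
# Qwen3-1.7B-Instruct-Q4_K_M.gguf
```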
If you use BLIP, create a `.env` file first.
Required variables:
@@ -40,6 +51,26 @@ MATURITY_THRESHOLD_REVIEW=0.60
MATURITY_ENABLED=true
```
Optional LLM configuration:
```bash
LLM_ENABLED=false
LLM_URL=http://llm:8080
LLM_DEFAULT_MODEL=qwen3-1.7b-instruct-q4_k_m
LLM_TIMEOUT=120
LLM_MAX_TOKENS_DEFAULT=256
LLM_MAX_TOKENS_HARD_LIMIT=1024
LLM_MAX_REQUEST_BYTES=65536
# Local llm profile only
MODEL_PATH=/models/Qwen3-1.7B-Instruct-Q4_K_M.gguf
LLM_CONTEXT_SIZE=4096
LLM_THREADS=4
LLM_GPU_LAYERS=0
```
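For reference, the local-profile variables correspond roughly to standard `llama-server` flags. This is only a sketch of what the shim is assumed to launch; the actual startup command lives in the `llm` service:
```bash
# Rough, hypothetical mapping of the profile variables to llama-server flags.
# The shim's real startup command may differ.
llama-server \
  --model "${MODEL_PATH}" \
  --ctx-size "${LLM_CONTEXT_SIZE}" \
  --threads "${LLM_THREADS}" \
  -ngl "${LLM_GPU_LAYERS}" \
  --host 0.0.0.0 \
  --port 8080
```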
Recommended production topology for the LLM: keep the gateway on the current vision host and point `LLM_URL` at a separate private machine or VPN-reachable container host. Running the full vision stack and Qwen3 together on a small 4c/8GB VPS will usually degrade both workloads.
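A minimal `.env` sketch for that split topology (the address below is a placeholder; point it at whatever private or VPN-reachable host runs the LLM):
```bash
# Hypothetical split deployment: gateway on the vision host, LLM on a separate private host.
LLM_ENABLED=true
LLM_URL=http://10.8.0.12:8080
```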
Service startup now waits on container healthchecks, so first boot may take longer while models finish loading.
## Health
@@ -48,6 +79,71 @@ Service startup now waits on container healthchecks, so first boot may take long
curl -H "X-API-Key: <your-api-key>" https://vision.klevze.net/health
```
LLM-specific gateway health:
```bash
curl -H "X-API-Key: <your-api-key>" https://vision.klevze.net/ai/health
```
## LLM Smoke Test
Use this checklist on a Docker-capable host after provisioning the GGUF file and setting `LLM_ENABLED=true`.
1. Start the gateway and local LLM profile.
```bash
docker compose --profile llm up -d --build gateway llm
```
2. Confirm the LLM container is running and healthy.
```bash
docker compose ps llm
docker compose logs --tail=100 llm
```
3. Check the internal LLM health contract.
```bash
curl http://127.0.0.1:8080/health
```
Expected fields: `status`, `model`, `context_size`, `threads`.
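As a quick sanity check, the illustrative call below verifies those fields are present (requires `jq`; the values in the output depend on your GGUF file and `.env`):
```bash
# Illustrative field check; output values will differ per deployment.
curl -s http://127.0.0.1:8080/health | jq '{status, model, context_size, threads}'
```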
4. Check gateway health and LLM reachability.
```bash
curl -H "X-API-Key: <your-api-key>" http://127.0.0.1:8003/health
curl -H "X-API-Key: <your-api-key>" http://127.0.0.1:8003/ai/health
```
5. Verify model discovery through the gateway.
```bash
curl -H "X-API-Key: <your-api-key>" http://127.0.0.1:8003/v1/models
```
6. Run a short non-streaming chat completion.
```bash
curl -H "X-API-Key: <your-api-key>" -X POST http://127.0.0.1:8003/ai/chat \
-H "Content-Type: application/json" \
-d '{
"messages": [
{"role": "system", "content": "You are a concise assistant for Skinbase Nova."},
{"role": "user", "content": "Write one sentence about an artist who creates cinematic sci-fi wallpaper packs."}
],
"max_tokens": 80
}'
```
7. If anything fails, inspect the two relevant services first.
```bash
docker compose logs --tail=200 llm
docker compose logs --tail=200 gateway
```
## Universal analyze (ALL)
### With URL
@@ -271,11 +367,51 @@ curl -H "X-API-Key: <your-api-key>" -X POST https://vision.klevze.net/cards/rend
-d '{"url":"https://files.skinbase.org/img/aa/bb/cc/md.webp","title":"Artwork Title"}'
```
## LLM / Chat Completions
The gateway exposes stable text-generation endpoints backed by the internal `llm` service. They reuse the existing `X-API-Key` protection and keep the LLM container internal-only.
### OpenAI-style chat endpoint
```bash
curl -H "X-API-Key: <your-api-key>" -X POST https://vision.klevze.net/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"messages": [
{"role": "system", "content": "You are a concise assistant for Skinbase Nova."},
{"role": "user", "content": "Write a short creator biography for an artist who just hit 10,000 followers."}
],
"temperature": 0.7,
"max_tokens": 220,
"stream": false
}'
```
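Streaming is requested by setting `stream` to `true`. The hypothetical variant below assumes the gateway proxies the usual OpenAI-style server-sent-events stream from `llama-server`; `-N` disables curl buffering so chunks print as they arrive:
```bash
# Hypothetical streaming call; only meaningful if the gateway forwards the SSE stream.
curl -N -H "X-API-Key: <your-api-key>" -X POST https://vision.klevze.net/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "messages": [
          {"role": "user", "content": "Write one sentence about a synthwave wallpaper pack."}
        ],
        "max_tokens": 120,
        "stream": true
      }'
```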
### Project-friendly chat endpoint
```bash
curl -H "X-API-Key: <your-api-key>" -X POST https://vision.klevze.net/ai/chat \
-H "Content-Type: application/json" \
-d '{
"messages": [
{"role": "system", "content": "You are a concise assistant for Skinbase Nova."},
{"role": "user", "content": "Suggest metadata tags for a cyberpunk wallpaper pack."}
],
"max_tokens": 180
}'
```
### List models
```bash
curl -H "X-API-Key: <your-api-key>" https://vision.klevze.net/v1/models
curl -H "X-API-Key: <your-api-key>" https://vision.klevze.net/ai/models
```
## Notes
- Models are loaded at service startup; initial container start can take 1–2 minutes as model weights are downloaded.
- Qdrant data is persisted in the project folder at `./data/qdrant`, so it survives container restarts and recreates.
- The local `llm` profile does **not** auto-download Qwen3 weights. Mount the GGUF file explicitly and let startup fail fast if it is missing.
- Remote image URLs are restricted to public `http`/`https` hosts. Localhost, private IP ranges, and non-image content types are rejected.
- The maturity service uses `Falconsai/nsfw_image_detection` (ViT-based). Thresholds are configurable via `.env`. The model handles photos and stylized digital art but should be calibrated against real Skinbase content before production use.
- For small VPS deployments, prefer `LLM_ENABLED=true` with `LLM_URL` pointing to a separate LLM host instead of running the `llm` profile on the same machine.
- For production: add auth, rate limits, and restrict gateway exposure (private network).
- GPU: you can add NVIDIA runtime later (compose profiles) if needed.