llm: add FastAPI shim, gateway LLM endpoints, tests, and docs
USAGE.md

@@ -1,10 +1,10 @@

# Skinbase Vision Stack — Usage Guide

This document explains how to run and use the Skinbase Vision Stack (Gateway + CLIP, BLIP, YOLO, Qdrant, Card Renderer, Maturity, and optional LLM services).

## Overview

- Services: `gateway`, `clip`, `blip`, `yolo`, `qdrant`, `qdrant-svc`, `card-renderer`, `maturity`, `llm` (FastAPI each except `qdrant`; `llm` is a thin FastAPI shim that manages an internal `llama-server` process).
- Gateway is the public API endpoint; the other services are internal.

## Model overview

@@ -21,6 +21,8 @@ This document explains how to run and use the Skinbase Vision Stack (Gateway + C

- **Maturity**: Dedicated NSFW/maturity classifier. Accepts an image and returns a normalized safety signal including `maturity_label` (`safe`/`mature`), `confidence`, raw `score`, optional sublabels (e.g. `nsfw`), and an `action_hint` (`safe`, `review`, `flag_high`) designed for Nova moderation workflows. Powered by `Falconsai/nsfw_image_detection` (ViT-based, HuggingFace). Thresholds are configurable via environment variables.
- **LLM**: Internal text-generation service backed by `llama.cpp` and a GGUF Qwen3 model. Exposed through the gateway for non-streaming chat completions and model discovery. Intended for Nova workflows such as creator bios, metadata suggestions, moderation helper text, and other short internal generation tasks.

## Prerequisites

- Docker Desktop (with `docker compose`) or a Docker environment.

@@ -55,12 +57,48 @@ MATURITY_ENABLED=true

- `MATURITY_THRESHOLD_REVIEW`: score above this but below the mature threshold → `mature` + `review` (default `0.60`).
- `MATURITY_ENABLED`: set to `false` to disable maturity endpoints at the gateway without removing the service.

Optional LLM configuration:

```bash
LLM_URL=http://llm:8080
LLM_ENABLED=false
LLM_TIMEOUT=120
LLM_DEFAULT_MODEL=qwen3-1.7b-instruct-q4_k_m
LLM_MAX_TOKENS_DEFAULT=256
LLM_MAX_TOKENS_HARD_LIMIT=1024
LLM_MAX_REQUEST_BYTES=65536

# Local llm profile only
MODEL_PATH=/models/Qwen3-1.7B-Instruct-Q4_K_M.gguf
LLM_CONTEXT_SIZE=4096
LLM_THREADS=4
LLM_GPU_LAYERS=0
LLM_EXTRA_ARGS=
```

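For orientation, here is a minimal sketch of what such a shim can look like, wiring the local-profile variables into a managed `llama-server` process. The helper layout, the internal port (8081), and the exact `/health` payload are assumptions for illustration, not the repo's actual code; the flags shown are `llama-server`'s standard ones.

```python
import os
import shlex
import subprocess

from fastapi import FastAPI

app = FastAPI()
proc: subprocess.Popen | None = None

MODEL_PATH = os.environ.get("MODEL_PATH", "/models/Qwen3-1.7B-Instruct-Q4_K_M.gguf")
CONTEXT_SIZE = int(os.environ.get("LLM_CONTEXT_SIZE", "4096"))
THREADS = int(os.environ.get("LLM_THREADS", "4"))
GPU_LAYERS = int(os.environ.get("LLM_GPU_LAYERS", "0"))

@app.on_event("startup")
def start_llama_server() -> None:
    """Spawn the internal llama-server process from the env configuration."""
    global proc
    cmd = [
        "llama-server",
        "--model", MODEL_PATH,
        "--ctx-size", str(CONTEXT_SIZE),
        "--threads", str(THREADS),
        "--n-gpu-layers", str(GPU_LAYERS),
        "--host", "127.0.0.1",
        "--port", "8081",  # internal port is an assumption; the shim listens on 8080
    ] + shlex.split(os.environ.get("LLM_EXTRA_ARGS", ""))
    proc = subprocess.Popen(cmd)

@app.get("/health")
def health() -> dict:
    # Report the documented health fields; "status" value shape is assumed.
    alive = proc is not None and proc.poll() is None
    return {
        "status": "ok" if alive else "error",
        "model": os.environ.get("LLM_DEFAULT_MODEL", "qwen3-1.7b-instruct-q4_k_m"),
        "context_size": CONTEXT_SIZE,
        "threads": THREADS,
    }
```
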
Run from repository root:

```bash
docker compose up -d --build
```

That starts the default vision stack only.

To also start the local LLM service:

```bash
docker compose --profile llm up -d --build
```

Before enabling the `llm` profile, provision the GGUF model described in [models/qwen3/README.md](models/qwen3/README.md) and set `LLM_ENABLED=true` in `.env`.

For small production hosts, the preferred setup is usually to keep the gateway local and point `LLM_URL` at a separate private LLM host:

```bash
LLM_ENABLED=true
LLM_URL=http://private-llm-host:8080
```

Stop:

```bash
docker compose down
```

@@ -82,6 +120,74 @@ Check the gateway health endpoint:

```bash
curl https://vision.klevze.net/health
```

Check LLM-specific gateway health:

```bash
curl -H "X-API-Key: <your-api-key>" https://vision.klevze.net/ai/health
```

## LLM smoke test checklist

Use this sequence on a machine with Docker available, after you have mounted the GGUF model and set `LLM_ENABLED=true` for the gateway.

1. Start the gateway with the `llm` profile.

```bash
docker compose --profile llm up -d --build gateway llm
```

2. Confirm the LLM service came up cleanly.

```bash
docker compose ps llm
docker compose logs --tail=100 llm
```

3. Check the repo-owned internal health endpoint.

```bash
curl http://127.0.0.1:8080/health
```

Expected fields: `status`, `model`, `context_size`, `threads`.

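An illustrative payload under the default configuration (values, and any fields beyond the four listed, will vary with your model and env settings):

```json
{
  "status": "ok",
  "model": "qwen3-1.7b-instruct-q4_k_m",
  "context_size": 4096,
  "threads": 4
}
```
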
4. Confirm the gateway sees the LLM backend.

```bash
curl -H "X-API-Key: <your-api-key>" http://127.0.0.1:8003/health
curl -H "X-API-Key: <your-api-key>" http://127.0.0.1:8003/ai/health
```

5. Verify model discovery.

```bash
curl -H "X-API-Key: <your-api-key>" http://127.0.0.1:8003/v1/models
curl -H "X-API-Key: <your-api-key>" http://127.0.0.1:8003/ai/models
```

6. Run a small chat request through the gateway.

```bash
curl -X POST http://127.0.0.1:8003/v1/chat/completions \
  -H "X-API-Key: <your-api-key>" \
  -H "Content-Type: application/json" \
  -d '{
    "messages": [
      {"role": "system", "content": "You are a concise assistant for Skinbase Nova."},
      {"role": "user", "content": "Write one short admin help sentence about reviewing wallpaper metadata."}
    ],
    "max_tokens": 60,
    "stream": false
  }'
```

7. If startup or health fails, inspect the relevant logs.

```bash
docker compose logs --tail=200 llm
docker compose logs --tail=200 gateway
```

## Universal analyze (ALL)

Analyze an image by URL (gateway aggregates CLIP, BLIP, YOLO):

@@ -241,7 +347,93 @@ Response fields:

- `review`: score ≥ `MATURITY_THRESHOLD_REVIEW` (default 0.60) but below the mature threshold — possible mature, queue for human review.
- `safe`: score below both thresholds — content appears safe.

If the maturity service is unavailable the gateway returns a `502` or `503` error. **Nova must not treat a gateway failure as a `safe` result** — retry or queue for later processing.

## LLM / Chat endpoints

The gateway validates requests, clamps `max_tokens` to configured limits, rejects oversized payloads, and normalizes downstream failures into JSON under an `error` key.

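As a sketch of that behavior, assuming hypothetical helper names rather than the gateway's actual code:

```python
import os

from fastapi import HTTPException

MAX_TOKENS_DEFAULT = int(os.environ.get("LLM_MAX_TOKENS_DEFAULT", "256"))
MAX_TOKENS_HARD_LIMIT = int(os.environ.get("LLM_MAX_TOKENS_HARD_LIMIT", "1024"))
MAX_REQUEST_BYTES = int(os.environ.get("LLM_MAX_REQUEST_BYTES", "65536"))

def clamp_max_tokens(requested: int | None) -> int:
    # A missing max_tokens falls back to the configured default; anything
    # above the hard limit is reduced to it rather than rejected.
    if requested is None:
        return MAX_TOKENS_DEFAULT
    return max(1, min(requested, MAX_TOKENS_HARD_LIMIT))

def reject_oversized(body: bytes) -> None:
    # Oversized bodies are refused before the request reaches the backend.
    if len(body) > MAX_REQUEST_BYTES:
        raise HTTPException(status_code=413, detail={"error": "request body too large"})
```
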
### OpenAI-style chat completions

```bash
curl -X POST https://vision.klevze.net/v1/chat/completions \
  -H "X-API-Key: <your-api-key>" \
  -H "Content-Type: application/json" \
  -d '{
    "messages": [
      {"role": "system", "content": "You are a concise assistant for Skinbase Nova."},
      {"role": "user", "content": "Write a short biography for a creator known for sci-fi environments."}
    ],
    "temperature": 0.7,
    "max_tokens": 220,
    "stream": false
  }'
```

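The response follows the familiar OpenAI-style chat-completion shape. An illustrative example (values invented; the exact field set assumes the shim forwards `llama-server`'s OpenAI-compatible output):

```json
{
  "object": "chat.completion",
  "model": "qwen3-1.7b-instruct-q4_k_m",
  "choices": [
    {
      "index": 0,
      "message": {"role": "assistant", "content": "An award-winning digital artist..."},
      "finish_reason": "stop"
    }
  ],
  "usage": {"prompt_tokens": 41, "completion_tokens": 180, "total_tokens": 221}
}
```
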
Supported request fields:

- `messages` (required)
- `temperature`
- `max_tokens`
- `stream` (`false` only in v1)
- `top_p`
- `stop`
- `presence_penalty`
- `frequency_penalty`

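For reference, a request body that exercises every supported field might look like this (values are illustrative only):

```json
{
  "messages": [
    {"role": "system", "content": "You are a concise assistant."},
    {"role": "user", "content": "Suggest a tagline for a neon city wallpaper."}
  ],
  "temperature": 0.7,
  "max_tokens": 120,
  "stream": false,
  "top_p": 0.9,
  "stop": ["\n\n"],
  "presence_penalty": 0.0,
  "frequency_penalty": 0.2
}
```
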
Validation rules:

- At least one message is required.
- Roles must be `system`, `user`, or `assistant`.
- Empty message content is rejected.
- Oversized request bodies return `413`.
- `max_tokens` is clamped to `LLM_MAX_TOKENS_HARD_LIMIT`.

### Project-friendly chat response

```bash
curl -X POST https://vision.klevze.net/ai/chat \
  -H "X-API-Key: <your-api-key>" \
  -H "Content-Type: application/json" \
  -d '{
    "messages": [
      {"role": "system", "content": "You are a helpful metadata assistant."},
      {"role": "user", "content": "Suggest five tags for a fantasy castle wallpaper."}
    ]
  }'
```

Example response:

```json
{
  "model": "qwen3-1.7b-instruct-q4_k_m",
  "content": "fantasy castle, moonlit fortress, medieval towers, epic landscape, digital painting",
  "finish_reason": "stop",
  "usage": {
    "prompt_tokens": 48,
    "completion_tokens": 19,
    "total_tokens": 67
  }
}
```

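A minimal Python client sketch for this endpoint (illustrative, not repo code; the client timeout matches the `LLM_TIMEOUT` default of 120 seconds):

```python
import requests

def ai_chat(messages: list[dict], api_key: str) -> str:
    # Returns the generated text from the project-friendly /ai/chat endpoint.
    resp = requests.post(
        "https://vision.klevze.net/ai/chat",
        headers={"X-API-Key": api_key},
        json={"messages": messages},
        timeout=120,  # generation can take a while on CPU-only hosts
    )
    resp.raise_for_status()
    return resp.json()["content"]

tags = ai_chat(
    [
        {"role": "system", "content": "You are a helpful metadata assistant."},
        {"role": "user", "content": "Suggest five tags for a fantasy castle wallpaper."},
    ],
    api_key="<your-api-key>",
)
```
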
### Model discovery

```bash
curl -H "X-API-Key: <your-api-key>" https://vision.klevze.net/v1/models
curl -H "X-API-Key: <your-api-key>" https://vision.klevze.net/ai/models
```

### Failure modes

- `401`: missing or invalid API key
- `413`: request body exceeds `LLM_MAX_REQUEST_BYTES`
- `422`: validation failure or unsupported streaming request
- `503`: LLM disabled or upstream unavailable
- `504`: upstream timeout

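All of these arrive as JSON under an `error` key; the inner shape is not guaranteed, but expect something like:

```json
{
  "error": "LLM backend unavailable"
}
```
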
## Vector DB (Qdrant)

Use the Qdrant gateway endpoints to store image embeddings and find visually similar images. Embeddings are generated automatically by the CLIP service.

Qdrant point IDs must be either an unsigned integer or a UUID string. If you send another string value, the wrapper may replace it with a generated UUID and store the original value in metadata as `_original_id`.

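A sketch of that normalization rule (hypothetical helper; the wrapper's actual code may differ):

```python
import uuid

def normalize_point_id(raw_id, payload: dict) -> tuple:
    # Unsigned integers and valid UUID strings pass through unchanged.
    if isinstance(raw_id, int) and raw_id >= 0:
        return raw_id, payload
    try:
        return str(uuid.UUID(str(raw_id))), payload
    except ValueError:
        pass
    # Anything else is replaced with a generated UUID, and the original
    # value is kept in the point's metadata as `_original_id`.
    return str(uuid.uuid4()), {**payload, "_original_id": raw_id}
```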