llm: add FastAPI shim, gateway LLM endpoints, tests, and docs

README.md

# Skinbase Vision Stack (CLIP + BLIP + YOLO + Qdrant + Card Renderer + Maturity + LLM) – Dockerized FastAPI

This repository provides internal AI services for image analysis, vector search, card rendering, moderation, and text generation behind a single **Gateway API**.

## Services & Ports

- `qdrant-svc`: internal Qdrant API wrapper
- `card-renderer`: internal card rendering service
- `maturity`: internal NSFW/maturity classifier service
- `llm`: internal text-generation service, a thin FastAPI shim over `llama-server` (enabled only via the `llm` compose profile)

## Run

```bash
docker compose up -d --build
```

This starts only the default vision stack. The LLM service is disabled by default so operators are not forced to run Qwen3 on the same host.

To also start the local llama.cpp service:

```bash
docker compose --profile llm up -d --build
```

Before enabling the `llm` profile locally, place the GGUF model file described in [models/qwen3/README.md](models/qwen3/README.md) and set `LLM_ENABLED=true` in `.env`.
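
As a concrete sketch (paths are assumptions: the filename mirrors the sample `MODEL_PATH` below, and `./models/qwen3/` is assumed to be the host directory mounted at `/models`; treat [models/qwen3/README.md](models/qwen3/README.md) as the source of truth):

```bash
# Sketch only: copy the downloaded GGUF into the assumed host mount directory
# and flip the flag the gateway reads at startup.
mkdir -p models/qwen3
cp /path/to/Qwen3-1.7B-Instruct-Q4_K_M.gguf models/qwen3/
echo "LLM_ENABLED=true" >> .env
```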

If you use BLIP, create a `.env` file first.

Required variables:

```bash
MATURITY_THRESHOLD_REVIEW=0.60
MATURITY_ENABLED=true
```

Optional LLM configuration:

```bash
LLM_ENABLED=false
LLM_URL=http://llm:8080
LLM_DEFAULT_MODEL=qwen3-1.7b-instruct-q4_k_m
LLM_TIMEOUT=120
LLM_MAX_TOKENS_DEFAULT=256
LLM_MAX_TOKENS_HARD_LIMIT=1024
LLM_MAX_REQUEST_BYTES=65536

# Local llm profile only
MODEL_PATH=/models/Qwen3-1.7B-Instruct-Q4_K_M.gguf
LLM_CONTEXT_SIZE=4096
LLM_THREADS=4
LLM_GPU_LAYERS=0
```

Recommended production topology for the LLM: keep the gateway on the current vision host and point `LLM_URL` at a separate private machine or VPN-reachable container host. Running the full vision stack and Qwen3 together on a small 4c/8GB VPS will usually degrade both.
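
A `.env` sketch for that split topology (the address is a placeholder for your own private or VPN network):

```bash
# Remote-LLM topology: gateway stays on the vision host, llama.cpp runs elsewhere.
# Do NOT start the local `llm` compose profile in this setup.
LLM_ENABLED=true
# Placeholder private/VPN address of the dedicated LLM host:
LLM_URL=http://10.0.0.12:8080
LLM_DEFAULT_MODEL=qwen3-1.7b-instruct-q4_k_m
```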

Service startup now waits on container healthchecks, so first boot may take longer while models finish loading.

## Health

```bash
curl -H "X-API-Key: <your-api-key>" https://vision.klevze.net/health
```

LLM-specific gateway health:

```bash
curl -H "X-API-Key: <your-api-key>" https://vision.klevze.net/ai/health
```

## LLM Smoke Test

Use this checklist on a Docker-capable host after provisioning the GGUF file and setting `LLM_ENABLED=true`.

1. Start the gateway and local LLM profile.

```bash
docker compose --profile llm up -d --build gateway llm
```

2. Confirm the LLM container is running and healthy.

```bash
docker compose ps llm
docker compose logs --tail=100 llm
```

3. Check the internal LLM health contract.

```bash
curl http://127.0.0.1:8080/health
```

Expected fields: `status`, `model`, `context_size`, `threads`.
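
An illustrative response (field names follow the contract above; values depend on your model and `.env`):

```bash
curl -s http://127.0.0.1:8080/health
# Illustrative output only, not a guaranteed format:
# {"status":"ok","model":"qwen3-1.7b-instruct-q4_k_m","context_size":4096,"threads":4}
```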

4. Check gateway health and LLM reachability.

```bash
curl -H "X-API-Key: <your-api-key>" http://127.0.0.1:8003/health
curl -H "X-API-Key: <your-api-key>" http://127.0.0.1:8003/ai/health
```

5. Verify model discovery through the gateway.

```bash
curl -H "X-API-Key: <your-api-key>" http://127.0.0.1:8003/v1/models
```

6. Run a short non-streaming chat completion.

```bash
curl -H "X-API-Key: <your-api-key>" -X POST http://127.0.0.1:8003/ai/chat \
  -H "Content-Type: application/json" \
  -d '{
    "messages": [
      {"role": "system", "content": "You are a concise assistant for Skinbase Nova."},
      {"role": "user", "content": "Write one sentence about an artist who creates cinematic sci-fi wallpaper packs."}
    ],
    "max_tokens": 80
  }'
```
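
To check just the generated text, a variant through the OpenAI-style endpoint documented below, piped into `jq` (the `.choices[0].message.content` path assumes the standard OpenAI response shape):

```bash
curl -s -H "X-API-Key: <your-api-key>" -X POST http://127.0.0.1:8003/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"messages":[{"role":"user","content":"Say hello in five words."}],"max_tokens":16}' \
  | jq -r '.choices[0].message.content'
```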

7. If anything fails, inspect the two relevant services first.

```bash
docker compose logs --tail=200 llm
docker compose logs --tail=200 gateway
```

## LLM / Chat Completions

The gateway exposes stable text-generation endpoints backed by the internal `llm` service. They reuse the existing `X-API-Key` protection and keep the LLM container internal-only.

### OpenAI-style chat endpoint

```bash
curl -H "X-API-Key: <your-api-key>" -X POST https://vision.klevze.net/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "messages": [
      {"role": "system", "content": "You are a concise assistant for Skinbase Nova."},
      {"role": "user", "content": "Write a short creator biography for an artist who just hit 10,000 followers."}
    ],
    "temperature": 0.7,
    "max_tokens": 220,
    "stream": false
  }'
```
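
Setting `"stream": true` should switch the same endpoint to incremental output; a sketch assuming the gateway proxies llama-server's server-sent events through unchanged:

```bash
# -N disables curl's output buffering so SSE chunks print as they arrive.
curl -N -H "X-API-Key: <your-api-key>" -X POST https://vision.klevze.net/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"messages":[{"role":"user","content":"Stream a two-sentence greeting."}],"stream":true}'
```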

### Project-friendly chat endpoint

```bash
curl -H "X-API-Key: <your-api-key>" -X POST https://vision.klevze.net/ai/chat \
  -H "Content-Type: application/json" \
  -d '{
    "messages": [
      {"role": "system", "content": "You are a concise assistant for Skinbase Nova."},
      {"role": "user", "content": "Suggest metadata tags for a cyberpunk wallpaper pack."}
    ],
    "max_tokens": 180
  }'
```

### List models

```bash
curl -H "X-API-Key: <your-api-key>" https://vision.klevze.net/v1/models
curl -H "X-API-Key: <your-api-key>" https://vision.klevze.net/ai/models
```
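
To print only the model identifiers, a sketch assuming the OpenAI-style list shape (`data[].id`):

```bash
curl -s -H "X-API-Key: <your-api-key>" https://vision.klevze.net/v1/models \
  | jq -r '.data[].id'
```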

## Notes

- Models are loaded at service startup; initial container start can take 1–2 minutes while model weights are downloaded.
- Qdrant data is persisted in the project folder at `./data/qdrant`, so it survives container restarts and recreates.
- The local `llm` profile does **not** auto-download Qwen3 weights. Mount the GGUF file explicitly and let startup fail fast if it is missing (see the pre-flight sketch after this list).
- Remote image URLs are restricted to public `http`/`https` hosts. Localhost, private IP ranges, and non-image content types are rejected.
- The maturity service uses `Falconsai/nsfw_image_detection` (ViT-based). Thresholds are configurable via `.env`. The model handles photos and stylized digital art but should be calibrated against real Skinbase content before production use.
- For small VPS deployments, prefer `LLM_ENABLED=true` with `LLM_URL` pointing to a separate LLM host instead of running the `llm` profile on the same machine.
- For production: add auth, rate limits, and restrict gateway exposure to a private network.
- GPU: an NVIDIA runtime can be added later via compose profiles if needed.
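
A minimal pre-flight sketch for the fail-fast point above (the host path is an assumption mirroring the sample `MODEL_PATH`; adjust it to your actual mount):

```bash
#!/usr/bin/env bash
# Abort before compose starts if the GGUF file backing MODEL_PATH is missing.
# Assumed host location: ./models/qwen3/ mounted into the container at /models.
MODEL_FILE="./models/qwen3/Qwen3-1.7B-Instruct-Q4_K_M.gguf"
if [[ ! -f "$MODEL_FILE" ]]; then
  echo "Missing $MODEL_FILE - see models/qwen3/README.md" >&2
  exit 1
fi
docker compose --profile llm up -d --build
```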