SkinbaseNova/docs/feed-rollout-runbook.md
2026-02-14 15:14:12 +01:00

# Feed Rollout Runbook (clip-cosine-v2, prod set 1)
## Scope
- Candidate: `clip-cosine-v2` with weights `w1=0.52, w2=0.23, w3=0.15, w4=0.10`
- Baseline: `clip-cosine-v1`
- Rollout gates: `10% -> 50% -> 100%`
- Temporary policy: `save_rate` is informational only until save-event schema reliability is confirmed in production.
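
The runbook does not say how traffic is split at each gate. As an illustration only, one common approach is deterministic hash bucketing, sketched below. The function `assign_algo`, the use of SHA-256, and the 100-bucket scheme are all assumptions for this sketch, not the project's actual assignment logic.

```python
import hashlib

# Gate names from this runbook mapped to candidate traffic percentages.
GATE_PERCENT = {"g10": 10, "g50": 50, "g100": 100}


def assign_algo(user_id: str, gate: str,
                baseline: str = "clip-cosine-v1",
                candidate: str = "clip-cosine-v2") -> str:
    """Hypothetical assignment: hash the user into a stable bucket 0-99."""
    digest = hashlib.sha256(user_id.encode("utf-8")).hexdigest()
    bucket = int(digest, 16) % 100
    # Buckets below the gate percentage get the candidate algorithm.
    return candidate if bucket < GATE_PERCENT[gate] else baseline
```

With a scheme like this, a user who sees the candidate at `g10` keeps seeing it at `g50` and `g100`, which keeps per-user experience stable as the gate widens.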
## Pre-flight checks
1. Confirm config values:
   - `DISCOVERY_ROLLOUT_ENABLED=true`
   - `DISCOVERY_ROLLOUT_BASELINE_ALGO_VERSION=clip-cosine-v1`
   - `DISCOVERY_ROLLOUT_CANDIDATE_ALGO_VERSION=clip-cosine-v2`
   - `DISCOVERY_ROLLOUT_ACTIVE_GATE=g10`
   - `DISCOVERY_FORCE_ALGO_VERSION` is empty
2. Confirm candidate weights are active in `config/discovery.php` and env overrides.
3. Confirm ingestion health for discovery events:
   - `event_id` populated for all new events
   - `favorite` and `download` events present in `user_discovery_events`
4. Run daily aggregation:
   - `php artisan analytics:aggregate-feed --date=YYYY-MM-DD`
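
The config check in step 1 can be automated with a small script. The following is a minimal sketch, assuming the values are visible as plain strings in a dictionary (for example the process environment); `preflight_problems` is a hypothetical helper, not part of the project.

```python
# Expected pre-flight values, copied from step 1 above.
EXPECTED = {
    "DISCOVERY_ROLLOUT_ENABLED": "true",
    "DISCOVERY_ROLLOUT_BASELINE_ALGO_VERSION": "clip-cosine-v1",
    "DISCOVERY_ROLLOUT_CANDIDATE_ALGO_VERSION": "clip-cosine-v2",
    "DISCOVERY_ROLLOUT_ACTIVE_GATE": "g10",
    "DISCOVERY_FORCE_ALGO_VERSION": "",  # must be empty or unset
}


def preflight_problems(env):
    """Return a list of human-readable mismatches (empty list means OK)."""
    problems = []
    for key, expected in EXPECTED.items():
        # Treat an unset variable the same as an empty one.
        actual = (env.get(key) or "").strip()
        if actual != expected:
            problems.append(f"{key}: expected {expected!r}, got {actual!r}")
    return problems
```

Calling `preflight_problems(dict(os.environ))` after `import os` would check the current process environment. It does not verify what the application has cached, so it complements rather than replaces step 2.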
## Gate progression
### Gate 1: 10%
- Set: `DISCOVERY_ROLLOUT_ACTIVE_GATE=g10`
- Observe for at least 2 days (preferably 3), and until the minimum sample volume is reached.
- Required checks:
  - CTR delta vs baseline
  - Long-dwell-share delta vs baseline
  - Diversity concentration delta vs baseline
  - Save-rate trend (informational only)

Promote to 50% only if no rollback trigger fires and no persistent warning trend is present.
### Gate 2: 50%
- Set: `DISCOVERY_ROLLOUT_ACTIVE_GATE=g50`
- Observe for 3-5 days with stable daily traffic.
- Apply the same checks and thresholds as in Gate 1.

Promote to 100% only with at least 2 consecutive healthy days.
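
The "2 consecutive healthy days" rule can be checked mechanically once each day has been labeled. This sketch assumes the rule refers to the most recent days and that each day carries a status string such as `"healthy"`, `"warning"`, or `"rollback"`; both the labels and the helper names are hypothetical.

```python
def trailing_healthy_days(daily_status):
    """Count healthy days at the end of the list (most recent day last)."""
    count = 0
    for status in reversed(daily_status):
        if status != "healthy":
            break
        count += 1
    return count


def can_promote_to_100(daily_status, required=2):
    """True if the most recent `required` days were all healthy."""
    return trailing_healthy_days(daily_status) >= required
```

For example, `can_promote_to_100(["healthy", "warning", "healthy", "healthy"])` is `True`, while a warning on the latest day resets the count.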
### Gate 3: 100%
- Set: `DISCOVERY_ROLLOUT_ACTIVE_GATE=g100`
- Keep the baseline available for rapid rollback via the `DISCOVERY_FORCE_ALGO_VERSION` force toggle.
## Monitoring thresholds (candidate vs baseline)
- CTR:
  - Warning: drop >= 3%
  - Rollback: drop >= 5% (or >= 10% in a single severe window)
- Long dwell share (`(dwell_30_120 + dwell_120_plus) / clicks`):
  - Warning: drop >= 4%
  - Rollback: drop >= 8% (or >= 12% in a single severe window)
- Diversity concentration (e.g. top-author/top-category share, near-duplicate concentration):
  - Warning: rise >= 10%
  - Rollback: rise >= 15%
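
The thresholds above can be expressed as a small classifier. This is a sketch under one explicit assumption: the percentages are read as relative change versus baseline, `(candidate - baseline) / baseline * 100`. The runbook does not state whether relative change or absolute percentage points is intended, so confirm that before relying on it. The single-window severe thresholds are left out because the window is not defined here. All function names are hypothetical.

```python
# Thresholds in percent, copied from the list above.
# "drop" means a decrease is adverse; "rise" means an increase is adverse.
THRESHOLDS = {
    "ctr": {"direction": "drop", "warning": 3.0, "rollback": 5.0},
    "long_dwell_share": {"direction": "drop", "warning": 4.0, "rollback": 8.0},
    "diversity_concentration": {"direction": "rise", "warning": 10.0, "rollback": 15.0},
}


def long_dwell_share(dwell_30_120, dwell_120_plus, clicks):
    """(dwell_30_120 + dwell_120_plus) / clicks, or 0.0 when there are no clicks."""
    return (dwell_30_120 + dwell_120_plus) / clicks if clicks else 0.0


def relative_change_pct(baseline, candidate):
    """Assumed definition: relative change of candidate vs baseline, in percent."""
    return (candidate - baseline) / baseline * 100.0


def classify(metric, baseline, candidate):
    """Return "ok", "warning", or "rollback" for one metric."""
    rule = THRESHOLDS[metric]
    change = relative_change_pct(baseline, candidate)
    # Convert to "how far did it move in the bad direction".
    adverse = -change if rule["direction"] == "drop" else change
    if adverse >= rule["rollback"]:
        return "rollback"
    if adverse >= rule["warning"]:
        return "warning"
    return "ok"
```

An improvement (CTR going up, concentration going down) yields a negative adverse value and is always classified as `"ok"`.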
## Rollback actions
### Immediate rollback (fastest)
- Set `DISCOVERY_FORCE_ALGO_VERSION=clip-cosine-v1`
- Reload config/cache as needed in your deployment flow (in a standard Laravel setup, for example, `php artisan config:cache`).
- Verify feed responses show `meta.algo_version=clip-cosine-v1`.
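
The verification step can be scripted against a handful of sampled feed responses. The sketch below assumes each response body is a JSON object with a top-level `meta` object containing `algo_version`; how the responses are fetched is left out, and both helper names are hypothetical.

```python
import json


def served_algo_version(response_body):
    """Extract meta.algo_version from a JSON response body, or None if absent."""
    payload = json.loads(response_body)
    meta = payload.get("meta") or {}
    return meta.get("algo_version")


def rollback_verified(response_bodies, expected="clip-cosine-v1"):
    """True only if at least one response was sampled and all report `expected`."""
    bodies = list(response_bodies)
    return bool(bodies) and all(
        served_algo_version(body) == expected for body in bodies
    )
```

Requiring a non-empty sample avoids reporting success when no responses were actually checked.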
### Standard rollback
- Set `DISCOVERY_ROLLOUT_ACTIVE_GATE=g10` (or disable the rollout with `DISCOVERY_ROLLOUT_ENABLED=false`).
- Keep candidate enabled only for controlled validation traffic.
## Save-event schema note and fix
Observed issue class in mixed environments: save-event writes can fail if discovery event schema differs from code expectations (e.g., `meta`/`metadata` drift, required `event_id`).

Implemented fix path:
- Ingestion now always writes `event_id` and inserts schema-aware metadata (`meta` if present, otherwise `metadata` if present).
- Keep `DISCOVERY_EVAL_SAVE_RATE_INFORMATIONAL=true` until production confirms stable save-event ingestion.

Validation query examples:
- Save events by day:
  - `SELECT event_date, COUNT(*) FROM user_discovery_events WHERE event_type IN ('favorite','download') GROUP BY event_date ORDER BY event_date DESC;`
- Null/empty event id check:
  - `SELECT COUNT(*) FROM user_discovery_events WHERE event_id IS NULL OR event_id = '';`
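
To see what the two validation queries return, they can be tried against a throwaway table. The sketch below uses an in-memory SQLite database with only the three columns the queries reference; the sample rows (including the `click` event type and the dates) are made up for illustration, and the production database is likely a different engine with a wider schema.

```python
import sqlite3

SAVE_EVENTS_BY_DAY = (
    "SELECT event_date, COUNT(*) FROM user_discovery_events "
    "WHERE event_type IN ('favorite','download') "
    "GROUP BY event_date ORDER BY event_date DESC;"
)
MISSING_EVENT_ID = (
    "SELECT COUNT(*) FROM user_discovery_events "
    "WHERE event_id IS NULL OR event_id = '';"
)


def build_sample_db():
    """Create an in-memory table with hypothetical sample rows."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE user_discovery_events "
        "(event_id TEXT, event_type TEXT, event_date TEXT)"
    )
    rows = [
        ("e1", "favorite", "2026-02-13"),
        ("e2", "download", "2026-02-13"),
        ("e3", "click", "2026-02-13"),     # not a save event
        ("e4", "favorite", "2026-02-12"),
        (None, "download", "2026-02-12"),  # missing event_id
        ("", "click", "2026-02-12"),       # empty event_id
    ]
    conn.executemany("INSERT INTO user_discovery_events VALUES (?, ?, ?)", rows)
    return conn


conn = build_sample_db()
save_counts = conn.execute(SAVE_EVENTS_BY_DAY).fetchall()
missing_ids = conn.execute(MISSING_EVENT_ID).fetchone()[0]
```

In a healthy production table the second query should return 0; any non-zero count means some events are still being written without an `event_id`.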
## Daily operator checklist
1. Run feed aggregation for the previous day.
2. Run evaluator and compare commands:
   - `php artisan analytics:evaluate-feed-weights --from=YYYY-MM-DD --to=YYYY-MM-DD --json`
   - `php artisan analytics:compare-feed-ab clip-cosine-v1 clip-cosine-v2 --from=YYYY-MM-DD --to=YYYY-MM-DD --json`
3. Record deltas for CTR, long_dwell_share, diversity concentration.
4. Record save_rate as informational only.
5. Decide: hold, promote gate, or rollback.
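
Filling in the `YYYY-MM-DD` placeholders by hand each day is error-prone. The sketch below builds the three command lines for the previous day. It assumes a single-day window (`--from` and `--to` both set to yesterday), which the checklist implies but does not state; `daily_commands` is a hypothetical helper.

```python
from datetime import date, timedelta


def daily_commands(today, baseline="clip-cosine-v1", candidate="clip-cosine-v2"):
    """Return the checklist commands with dates filled in for the previous day."""
    day = (today - timedelta(days=1)).isoformat()
    return [
        f"php artisan analytics:aggregate-feed --date={day}",
        f"php artisan analytics:evaluate-feed-weights --from={day} --to={day} --json",
        f"php artisan analytics:compare-feed-ab {baseline} {candidate} "
        f"--from={day} --to={day} --json",
    ]
```

Printing `daily_commands(date.today())` gives lines that can be pasted into a shell or a scheduler entry.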
## First 24h verification checklist
1. Confirm rollout activation and gate state:
   - `DISCOVERY_ROLLOUT_ENABLED=true`
   - `DISCOVERY_ROLLOUT_ACTIVE_GATE=g10`
   - `DISCOVERY_FORCE_ALGO_VERSION` empty
2. Verify both algos are receiving traffic in analytics:
   - candidate (`clip-cosine-v2`) should be near 10% share (allow normal variance)
   - baseline (`clip-cosine-v1`) remains dominant
3. Run aggregation/evaluation at least twice in first day (midday + end-of-day):
   - `php artisan analytics:aggregate-feed --date=YYYY-MM-DD`
   - `php artisan analytics:evaluate-feed-weights --from=YYYY-MM-DD --to=YYYY-MM-DD --json`
   - `php artisan analytics:compare-feed-ab clip-cosine-v1 clip-cosine-v2 --from=YYYY-MM-DD --to=YYYY-MM-DD --json`
4. Check guardrails:
   - CTR drop < rollback threshold
   - long_dwell_share drop < rollback threshold
   - diversity concentration rise < rollback threshold
5. Check save-event ingestion health:
   - save events (`favorite`,`download`) are arriving in `user_discovery_events`
   - `event_id` is always populated
6. If any rollback trigger is breached, apply the immediate rollback right away (set `DISCOVERY_FORCE_ALGO_VERSION=clip-cosine-v1`, as described under "Rollback actions").
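
The traffic-share check in step 2 can be made concrete with a tolerance band. In the sketch below, the input is a mapping from algo version to impression count for the day, and the tolerance of 3 percentage points is an arbitrary assumption standing in for "normal variance"; pick a value that matches your traffic volume. Both helper names are hypothetical.

```python
def candidate_share(impressions_by_algo, candidate="clip-cosine-v2"):
    """Fraction of all impressions served by the candidate (0.0 if no traffic)."""
    total = sum(impressions_by_algo.values())
    return impressions_by_algo.get(candidate, 0) / total if total else 0.0


def share_is_plausible(impressions_by_algo, target=0.10, tolerance=0.03):
    """True if the candidate share is within `tolerance` of the gate target."""
    return abs(candidate_share(impressions_by_algo) - target) <= tolerance
```

A share far outside the band at `g10` usually points at a misconfigured gate or a non-empty force toggle rather than at algorithm quality, so it is worth checking before reading any metric deltas.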