# Feed Rollout Runbook (clip-cosine-v2, prod set 1)

## Scope

- Candidate: `clip-cosine-v2` with weights `w1=0.52, w2=0.23, w3=0.15, w4=0.10`
- Baseline: `clip-cosine-v1`
- Rollout gates: `10% -> 50% -> 100%`
- Temporary policy: `save_rate` is informational only until save-event schema reliability is confirmed in production.

## Pre-flight checks

1. Confirm config values:
   - `DISCOVERY_ROLLOUT_ENABLED=true`
   - `DISCOVERY_ROLLOUT_BASELINE_ALGO_VERSION=clip-cosine-v1`
   - `DISCOVERY_ROLLOUT_CANDIDATE_ALGO_VERSION=clip-cosine-v2`
   - `DISCOVERY_ROLLOUT_ACTIVE_GATE=g10`
   - `DISCOVERY_FORCE_ALGO_VERSION` is empty
2. Confirm candidate weights are active in `config/discovery.php` and env overrides.
3. Confirm ingestion health for discovery events:
   - `event_id` populated for all new events
   - `favorite` and `download` events present in `user_discovery_events`
4. Run daily aggregation:
   - `php artisan analytics:aggregate-feed --date=YYYY-MM-DD`

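The config and weight checks above can be automated. The following is a minimal sketch, not part of the rollout tooling: the function name `preflight_problems` is hypothetical, and it assumes the environment values and the effective weights have already been collected into plain dictionaries (how the weights are read from `config/discovery.php` is left out).

```python
import math

# Expected values copied from the pre-flight list in this runbook.
EXPECTED_ENV = {
    "DISCOVERY_ROLLOUT_ENABLED": "true",
    "DISCOVERY_ROLLOUT_BASELINE_ALGO_VERSION": "clip-cosine-v1",
    "DISCOVERY_ROLLOUT_CANDIDATE_ALGO_VERSION": "clip-cosine-v2",
    "DISCOVERY_ROLLOUT_ACTIVE_GATE": "g10",
    "DISCOVERY_FORCE_ALGO_VERSION": "",  # must be empty or unset
}

# Candidate weights from the Scope section.
EXPECTED_WEIGHTS = {"w1": 0.52, "w2": 0.23, "w3": 0.15, "w4": 0.10}


def preflight_problems(env, weights):
    """Return a list of human-readable mismatches; an empty list means OK.

    Hypothetical helper: `env` maps variable names to strings and
    `weights` maps weight names to floats.
    """
    problems = []
    for key, expected in EXPECTED_ENV.items():
        # Treat a missing variable the same as an empty one.
        actual = (env.get(key) or "").strip()
        if actual != expected:
            problems.append(f"{key}: expected {expected!r}, got {actual!r}")
    for name, expected in EXPECTED_WEIGHTS.items():
        actual = weights.get(name)
        if actual is None or not math.isclose(actual, expected, abs_tol=1e-9):
            problems.append(f"{name}: expected {expected}, got {actual}")
    return problems
```

Calling `preflight_problems(dict(os.environ), weights)` on the target host would return an empty list when everything matches. The comparison is deliberately strict (for example, `True` would not match `true`), so loosen it if your deployment normalizes values differently.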
## Gate progression

### Gate 1: 10%

- Set: `DISCOVERY_ROLLOUT_ACTIVE_GATE=g10`
- Observe for at least 2 days, preferably 3, and only evaluate once the minimum sample volume has been reached.
- Required checks:
  - CTR delta vs baseline
  - Long-dwell-share delta vs baseline
  - Diversity concentration delta vs baseline
  - Save-rate trend (informational only)

Promote to 50% only if no rollback trigger has fired and there is no persistent warning trend.

### Gate 2: 50%

- Set: `DISCOVERY_ROLLOUT_ACTIVE_GATE=g50`
- Observe for 3-5 days with stable daily traffic.
- Apply the same checks and thresholds as in Gate 1.

Promote to 100% only after at least 2 consecutive healthy days.

### Gate 3: 100%

- Set: `DISCOVERY_ROLLOUT_ACTIVE_GATE=g100`
- Keep the baseline available for rapid rollback via the force toggle (`DISCOVERY_FORCE_ALGO_VERSION`).

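The promotion rules above can be read as a small decision function. The sketch below is one possible interpretation, not the rollout tooling itself, and it rests on several assumptions: the minimum observation windows are taken as the lower bounds (2 days at `g10`, 3 days at `g50`), a "persistent warning trend" is taken to mean warnings on the two most recent days, and minimum sample volume is not modeled. The name `decide` and the status labels are hypothetical.

```python
NEXT_GATE = {"g10": "g50", "g50": "g100"}

# Assumption: lower bounds of the observation windows in this runbook.
MIN_DAYS = {"g10": 2, "g50": 3}


def decide(gate, daily_status):
    """Suggest 'hold', 'rollback', or 'promote:<gate>'.

    daily_status is an oldest-first list with one entry per day spent at
    the current gate: 'healthy', 'warning', or 'rollback'.
    """
    # Any rollback trigger at this gate overrides everything else.
    if "rollback" in daily_status:
        return "rollback"
    # g100 is the final gate; before that, wait out the observation window.
    if gate == "g100" or len(daily_status) < MIN_DAYS[gate]:
        return "hold"
    last_two = daily_status[-2:]
    if gate == "g10":
        # Assumption: two warning days in a row count as a persistent trend.
        ok = last_two != ["warning", "warning"]
    else:
        # g50 -> g100 requires at least 2 consecutive healthy days.
        ok = last_two == ["healthy", "healthy"]
    return f"promote:{NEXT_GATE[gate]}" if ok else "hold"
```

For example, `decide("g50", ["warning", "healthy", "healthy"])` suggests promotion to `g100`, while a single `rollback` day at any gate suggests rollback. The result is a suggestion for the operator; the final call stays with the person running the rollout.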
## Monitoring thresholds (candidate vs baseline)

- CTR:
  - Warning: drop >= 3%
  - Rollback: drop >= 5% (or >= 10% in a single severe window)
- Long dwell share (`(dwell_30_120 + dwell_120_plus) / clicks`):
  - Warning: drop >= 4%
  - Rollback: drop >= 8% (or >= 12% in a single severe window)
- Diversity concentration (e.g. top-author/top-category share, near-duplicate concentration):
  - Warning: rise >= 10%
  - Rollback: rise >= 15%

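To make the thresholds concrete, here is a minimal sketch that computes long dwell share and classifies one observation window. It assumes the percentages are relative changes versus baseline, which the runbook does not state explicitly, and it reports the "single severe window" level separately because the runbook does not say how long the standard rollback threshold must persist. All function and key names are hypothetical.

```python
# direction: -1 means a drop is adverse, +1 means a rise is adverse.
THRESHOLDS = {
    "ctr": {"direction": -1, "warning": 3.0, "rollback": 5.0, "severe": 10.0},
    "long_dwell_share": {"direction": -1, "warning": 4.0, "rollback": 8.0, "severe": 12.0},
    "diversity_concentration": {"direction": 1, "warning": 10.0, "rollback": 15.0, "severe": None},
}


def long_dwell_share(dwell_30_120, dwell_120_plus, clicks):
    """(dwell_30_120 + dwell_120_plus) / clicks, or 0.0 when there are no clicks."""
    if clicks == 0:
        return 0.0
    return (dwell_30_120 + dwell_120_plus) / clicks


def relative_change_pct(baseline, candidate):
    """Relative change of candidate vs baseline, in percent (assumed definition)."""
    if baseline == 0:
        raise ValueError("baseline must be non-zero")
    return (candidate - baseline) / baseline * 100.0


def classify(metric, change_pct):
    """Return 'ok', 'warning', 'rollback', or 'severe' for one window."""
    rule = THRESHOLDS[metric]
    adverse = change_pct * rule["direction"]  # positive = moving the bad way
    if rule["severe"] is not None and adverse >= rule["severe"]:
        return "severe"
    if adverse >= rule["rollback"]:
        return "rollback"
    if adverse >= rule["warning"]:
        return "warning"
    return "ok"
```

With these definitions, a CTR change of `-3.5` classifies as `warning`, and a diversity concentration change of `+16.0` classifies as `rollback`.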
## Rollback actions

### Immediate rollback (fastest)

- Set `DISCOVERY_FORCE_ALGO_VERSION=clip-cosine-v1`
- Reload config/cache as required by your deployment flow.
- Verify that feed responses show `meta.algo_version=clip-cosine-v1`.

### Standard rollback

- Set `DISCOVERY_ROLLOUT_ACTIVE_GATE=g10` (or disable the rollout)
- Keep the candidate enabled only for controlled validation traffic.

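The verification step for the immediate rollback can be scripted against a handful of sampled feed responses. This sketch assumes the response body is JSON with a top-level `meta` object containing `algo_version`, which is how the `meta.algo_version` check above reads. How the responses are fetched is left out, and both function names are hypothetical.

```python
import json


def served_algo_version(response_body):
    """Extract meta.algo_version from a feed response body (JSON text)."""
    payload = json.loads(response_body)
    # Tolerate a missing or null "meta" object.
    return (payload.get("meta") or {}).get("algo_version")


def rollback_confirmed(response_bodies, expected="clip-cosine-v1"):
    """True only if at least one response was sampled and all report the baseline."""
    versions = [served_algo_version(body) for body in response_bodies]
    return bool(versions) and all(version == expected for version in versions)
```

Sampling several responses rather than one helps catch hosts that are still serving a stale cached config.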
## Save-event schema note and fix

Observed issue class in mixed environments: save-event writes can fail when the discovery event schema differs from what the code expects (e.g., `meta`/`metadata` drift, or a required `event_id`).

Implemented fix path:

- Ingestion now always writes `event_id` and writes metadata to whichever of `meta` or `metadata` the schema provides, preferring `meta`.
- Keep `DISCOVERY_EVAL_SAVE_RATE_INFORMATIONAL=true` until production confirms stable save-event ingestion.

Validation query examples:

- Save events by day:
  - `SELECT event_date, COUNT(*) FROM user_discovery_events WHERE event_type IN ('favorite','download') GROUP BY event_date ORDER BY event_date DESC;`
- Null/empty event id check:
  - `SELECT COUNT(*) FROM user_discovery_events WHERE event_id IS NULL OR event_id = '';`

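The schema-aware write described above can be illustrated with a small stand-in. This is a sketch only: it uses Python and SQLite rather than the application's own ingestion code and production database, it assumes `meta` and `metadata` are column names on `user_discovery_events`, and the helper names are hypothetical. The null/empty `event_id` check reuses the validation query from this section.

```python
import json
import sqlite3
import uuid

TABLE = "user_discovery_events"


def metadata_column(conn):
    """Return 'meta' if that column exists, else 'metadata' if it exists, else None."""
    # PRAGMA table_info rows are (cid, name, type, notnull, dflt_value, pk).
    columns = {row[1] for row in conn.execute(f"PRAGMA table_info({TABLE})")}
    for name in ("meta", "metadata"):
        if name in columns:
            return name
    return None


def insert_event(conn, event_type, event_date, meta):
    """Insert one event, always generating event_id and using whichever
    metadata column the schema provides."""
    fields = {
        "event_id": str(uuid.uuid4()),
        "event_type": event_type,
        "event_date": event_date,
    }
    column = metadata_column(conn)
    if column is not None:
        fields[column] = json.dumps(meta)
    names = ", ".join(fields)
    marks = ", ".join("?" for _ in fields)
    conn.execute(f"INSERT INTO {TABLE} ({names}) VALUES ({marks})", list(fields.values()))


def missing_event_ids(conn):
    """Count rows with a NULL or empty event_id (validation query from this section)."""
    query = f"SELECT COUNT(*) FROM {TABLE} WHERE event_id IS NULL OR event_id = ''"
    return conn.execute(query).fetchone()[0]
```

Running this against an in-memory database (`sqlite3.connect(":memory:")`) with either column name shows the same insert path working on both schema variants, which is the property the fix is meant to guarantee.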
## Daily operator checklist

1. Run feed aggregation for the previous day.
2. Run evaluator and compare commands:
   - `php artisan analytics:evaluate-feed-weights --from=YYYY-MM-DD --to=YYYY-MM-DD --json`
   - `php artisan analytics:compare-feed-ab clip-cosine-v1 clip-cosine-v2 --from=YYYY-MM-DD --to=YYYY-MM-DD --json`
3. Record deltas for CTR, long_dwell_share, diversity concentration.
4. Record save_rate as informational only.
5. Decide: hold, promote gate, or rollback.

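Filling in the `YYYY-MM-DD` placeholders by hand is easy to get wrong, so the daily commands can be generated instead. The sketch below builds the three command lines from this runbook for a given day. It assumes a single-day evaluation uses the same date for `--from` and `--to`, and the helper names are hypothetical.

```python
from datetime import date, timedelta

BASELINE = "clip-cosine-v1"
CANDIDATE = "clip-cosine-v2"


def previous_day(today=None):
    """Return the day before `today` (defaults to the current local date)."""
    return (today or date.today()) - timedelta(days=1)


def daily_commands(day):
    """Build the aggregate, evaluate, and compare commands for one day.

    Each command is an argument list suitable for subprocess.run.
    """
    d = day.isoformat()  # YYYY-MM-DD
    return [
        ["php", "artisan", "analytics:aggregate-feed", f"--date={d}"],
        ["php", "artisan", "analytics:evaluate-feed-weights",
         f"--from={d}", f"--to={d}", "--json"],
        ["php", "artisan", "analytics:compare-feed-ab", BASELINE, CANDIDATE,
         f"--from={d}", f"--to={d}", "--json"],
    ]
```

Printing `" ".join(cmd)` for each entry of `daily_commands(previous_day())` gives copy-pasteable commands. Passing each list to `subprocess.run(cmd, check=True)` from the application root would execute them in order and stop on the first failure.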
## First 24h verification checklist

1. Confirm rollout activation and gate state:
   - `DISCOVERY_ROLLOUT_ENABLED=true`
   - `DISCOVERY_ROLLOUT_ACTIVE_GATE=g10`
   - `DISCOVERY_FORCE_ALGO_VERSION` is empty
2. Verify both algos are receiving traffic in analytics:
   - candidate (`clip-cosine-v2`) should be near 10% share (allow normal variance)
   - baseline (`clip-cosine-v1`) remains dominant
3. Run aggregation/evaluation at least twice in the first day (midday + end-of-day):
   - `php artisan analytics:aggregate-feed --date=YYYY-MM-DD`
   - `php artisan analytics:evaluate-feed-weights --from=YYYY-MM-DD --to=YYYY-MM-DD --json`
   - `php artisan analytics:compare-feed-ab clip-cosine-v1 clip-cosine-v2 --from=YYYY-MM-DD --to=YYYY-MM-DD --json`
4. Check guardrails:
   - CTR drop < rollback threshold
   - long_dwell_share drop < rollback threshold
   - diversity concentration rise < rollback threshold
5. Check save-event ingestion health:
   - save events (`favorite`, `download`) are arriving in `user_discovery_events`
   - `event_id` is always populated
6. If any rollback trigger is breached, apply the emergency rollback preset immediately (see "Immediate rollback" under Rollback actions).

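The traffic-share check in step 2 can be expressed as a simple ratio with a tolerance band. This is a minimal sketch: the per-algo request counts are assumed to come from your analytics, the 2-point tolerance is an illustrative assumption rather than a value from this runbook, and the function names are hypothetical.

```python
def candidate_share(requests_by_algo, candidate="clip-cosine-v2"):
    """Fraction of feed requests served by the candidate (0.0 if no traffic)."""
    total = sum(requests_by_algo.values())
    if total == 0:
        return 0.0
    return requests_by_algo.get(candidate, 0) / total


def share_is_plausible(share, target=0.10, tolerance=0.02):
    """True if the observed share is within `tolerance` of the gate target.

    The default tolerance is an assumption; widen it for low-traffic periods.
    """
    return abs(share - target) <= tolerance
```

For example, 100 candidate requests out of 1,000 total gives a share of 0.1, which passes at the `g10` gate, whereas a share of 0.25 would suggest the gate or the bucketing is misconfigured.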