Upload beautify
123
docs/feed-rollout-runbook.md
Normal file
@@ -0,0 +1,123 @@
# Feed Rollout Runbook (clip-cosine-v2, prod set 1)

## Scope

- Candidate: `clip-cosine-v2` with weights `w1=0.52, w2=0.23, w3=0.15, w4=0.10`
- Baseline: `clip-cosine-v1`
- Rollout gates: `10% -> 50% -> 100%`
- Temporary policy: `save_rate` is informational only until save-event schema reliability is confirmed in production.

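
The runbook does not say how users are split between baseline and candidate at each gate. As an illustration only, the sketch below assumes a deterministic hash of the user id mapped to a bucket in `[0, 100)`. The function name `assigned_algo` and the hashing scheme are hypothetical; only the gate names and algo versions come from this runbook.

```python
import hashlib

# Gate names as used in this runbook; the percentages follow "10% -> 50% -> 100%".
GATE_PERCENT = {"g10": 10, "g50": 50, "g100": 100}


def assigned_algo(user_id: str, gate: str,
                  baseline: str = "clip-cosine-v1",
                  candidate: str = "clip-cosine-v2") -> str:
    """Assumed scheme: hash the user id into a stable bucket in [0, 100)
    and serve the candidate when the bucket is below the gate percentage."""
    digest = hashlib.sha256(user_id.encode("utf-8")).hexdigest()
    bucket = int(digest[:8], 16) % 100
    return candidate if bucket < GATE_PERCENT[gate] else baseline
```

One property of a scheme like this is that a user who sees the candidate at `g10` keeps seeing it at `g50` and `g100`, which keeps per-user experience stable across promotions.
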
## Pre-flight checks

1. Confirm config values:
   - `DISCOVERY_ROLLOUT_ENABLED=true`
   - `DISCOVERY_ROLLOUT_BASELINE_ALGO_VERSION=clip-cosine-v1`
   - `DISCOVERY_ROLLOUT_CANDIDATE_ALGO_VERSION=clip-cosine-v2`
   - `DISCOVERY_ROLLOUT_ACTIVE_GATE=g10`
   - `DISCOVERY_FORCE_ALGO_VERSION` is empty
2. Confirm candidate weights are active in `config/discovery.php` and env overrides.
3. Confirm ingestion health for discovery events:
   - `event_id` populated for all new events
   - `favorite` and `download` events present in `user_discovery_events`
4. Run daily aggregation:
   - `php artisan analytics:aggregate-feed --date=YYYY-MM-DD`

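
Step 1 of the pre-flight checks can be scripted. The following is a minimal sketch, assuming the values are available as a plain key/value mapping (for example a parsed `.env` file). The helper name `preflight_problems` is hypothetical, and the sketch does not inspect whatever configuration the application has cached.

```python
# Expected values copied from pre-flight step 1 of this runbook.
EXPECTED = {
    "DISCOVERY_ROLLOUT_ENABLED": "true",
    "DISCOVERY_ROLLOUT_BASELINE_ALGO_VERSION": "clip-cosine-v1",
    "DISCOVERY_ROLLOUT_CANDIDATE_ALGO_VERSION": "clip-cosine-v2",
    "DISCOVERY_ROLLOUT_ACTIVE_GATE": "g10",
    "DISCOVERY_FORCE_ALGO_VERSION": "",  # must be empty before rollout starts
}


def preflight_problems(env: dict) -> list:
    """Return one message per key whose value differs from EXPECTED.
    A missing key is treated as an empty string (an assumption)."""
    problems = []
    for key, expected in EXPECTED.items():
        actual = env.get(key, "").strip()
        if actual != expected:
            problems.append(f"{key}: expected {expected!r}, got {actual!r}")
    return problems
```

An empty list means the config matches step 1. Steps 2 to 4 still need to be checked separately.
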
## Gate progression

### Gate 1: 10%

- Set: `DISCOVERY_ROLLOUT_ACTIVE_GATE=g10`
- Observe for 2-3 days, and at least until the minimum sample volume is met.
- Required checks:
  - CTR delta vs baseline
  - Long-dwell-share delta vs baseline
  - Diversity concentration delta vs baseline
  - Save-rate trend (informational only)

Promote to 50% only if no rollback trigger fires and no persistent warning trend is present.

### Gate 2: 50%

- Set: `DISCOVERY_ROLLOUT_ACTIVE_GATE=g50`
- Observe for 3-5 days with stable daily traffic.
- Apply the same checks and thresholds as in Gate 1.

Promote to 100% only with at least 2 consecutive healthy days.

### Gate 3: 100%

- Set: `DISCOVERY_ROLLOUT_ACTIVE_GATE=g100`
- Keep the baseline available for rapid rollback via the force toggle (`DISCOVERY_FORCE_ALGO_VERSION`).

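
The promotion rules above can be written down as a small decision function. This is a sketch under stated assumptions: each observed day has already been labelled `"healthy"`, `"warning"`, or `"rollback"`, and "no persistent warning trend" is approximated by requiring the most recent days to be healthy. The name `gate_decision` and its parameters are hypothetical.

```python
def gate_decision(daily_status: list, min_days: int,
                  required_healthy_days: int = 2) -> str:
    """Return "rollback", "promote", or "hold" for the current gate.

    daily_status: oldest-to-newest labels, one per observed day.
    min_days: minimum observation window (e.g. 2 for Gate 1, 3 for Gate 2).
    required_healthy_days: trailing healthy days needed to promote
        (2 is the Gate 2 rule; reusing it for Gate 1 is an assumption).
    """
    if "rollback" in daily_status:
        return "rollback"
    if len(daily_status) < min_days:
        return "hold"
    tail = daily_status[-required_healthy_days:]
    if len(tail) == required_healthy_days and all(s == "healthy" for s in tail):
        return "promote"
    return "hold"
```

For example, `gate_decision(["healthy", "warning", "healthy", "healthy"], min_days=3)` yields `"promote"`, while a trailing `"warning"` day keeps the gate on hold.
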
## Monitoring thresholds (candidate vs baseline)

- CTR:
  - Warning: drop >= 3%
  - Rollback: drop >= 5% (or >= 10% in a single severe window)
- Long dwell share (`(dwell_30_120 + dwell_120_plus) / clicks`):
  - Warning: drop >= 4%
  - Rollback: drop >= 8% (or >= 12% in a single severe window)
- Diversity concentration (e.g. top-author/top-category share, near-duplicate concentration):
  - Warning: rise >= 10%
  - Rollback: rise >= 15%

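
The thresholds above translate into a simple classifier. The sketch below assumes the percentages are relative changes of the candidate against the baseline (the runbook does not say whether they are relative or percentage points), and it leaves out the "single severe window" variants. The names `THRESHOLDS`, `long_dwell_share`, and `classify` are hypothetical.

```python
# metric: (adverse direction, warning %, rollback %), from the list above.
THRESHOLDS = {
    "ctr": ("drop", 3.0, 5.0),
    "long_dwell_share": ("drop", 4.0, 8.0),
    "diversity_concentration": ("rise", 10.0, 15.0),
}


def long_dwell_share(dwell_30_120: int, dwell_120_plus: int, clicks: int) -> float:
    """Formula as given in this runbook: (dwell_30_120 + dwell_120_plus) / clicks."""
    return (dwell_30_120 + dwell_120_plus) / clicks


def classify(metric: str, baseline: float, candidate: float) -> str:
    """Return "ok", "warning", or "rollback" for one metric.
    Assumes relative change in percent: (candidate - baseline) / baseline * 100."""
    direction, warn, rollback = THRESHOLDS[metric]
    change = (candidate - baseline) / baseline * 100.0
    adverse = -change if direction == "drop" else change
    if adverse >= rollback:
        return "rollback"
    if adverse >= warn:
        return "warning"
    return "ok"
```

Under the relative-change assumption, a CTR of 0.096 against a baseline of 0.100 is a 4% drop and classifies as a warning; 0.094 is a 6% drop and classifies as a rollback.
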
## Rollback actions

### Immediate rollback (fastest)

- Set `DISCOVERY_FORCE_ALGO_VERSION=clip-cosine-v1`
- Reload config/cache as needed in your deployment flow.
- Verify feed responses show `meta.algo_version=clip-cosine-v1`.

### Standard rollback

- Set `DISCOVERY_ROLLOUT_ACTIVE_GATE=g10` (or disable rollout)
- Keep candidate enabled only for controlled validation traffic.

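
The verification step of the immediate rollback can be checked against a handful of sampled feed responses. This sketch assumes each response body is a JSON object with a top-level `meta` object holding `algo_version`, which is how `meta.algo_version` is read here. The helper names are hypothetical, and fetching the responses is left to your own tooling.

```python
import json


def served_algo_version(response_body: str):
    """Extract meta.algo_version from one feed response body, or None if absent."""
    payload = json.loads(response_body)
    meta = payload.get("meta") or {}
    return meta.get("algo_version")


def rollback_verified(bodies: list, expected: str = "clip-cosine-v1") -> bool:
    """True only if at least one body was sampled and every one reports `expected`."""
    return bool(bodies) and all(served_algo_version(b) == expected for b in bodies)
```

Sampling several responses after the config reload helps catch workers that are still serving a stale cached config.
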
## Save-event schema note and fix

Observed issue class in mixed environments: save-event writes can fail if the discovery event schema differs from what the code expects (e.g., `meta`/`metadata` naming drift, or a required `event_id`).

Implemented fix path:

- Ingestion now always writes `event_id` and inserts schema-aware metadata (`meta` if present, otherwise `metadata` if present).
- Keep `DISCOVERY_EVAL_SAVE_RATE_INFORMATIONAL=true` until production confirms stable save-event ingestion.

Validation query examples:

- Save events by day:
  - `SELECT event_date, COUNT(*) FROM user_discovery_events WHERE event_type IN ('favorite','download') GROUP BY event_date ORDER BY event_date DESC;`
- Null/empty event id check:
  - `SELECT COUNT(*) FROM user_discovery_events WHERE event_id IS NULL OR event_id = '';`

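
The "schema-aware metadata" rule can be illustrated with a short sketch. It is not the project's ingestion code: it assumes `meta` and `metadata` are column names, that metadata is stored as JSON text, and that a UUID is an acceptable `event_id`. The function names are hypothetical.

```python
import json
import uuid


def metadata_column(table_columns):
    """Pick `meta` if the table has it, otherwise `metadata`, otherwise None."""
    for name in ("meta", "metadata"):
        if name in table_columns:
            return name
    return None


def build_event_row(table_columns, event_type: str, payload: dict,
                    event_id: str = None) -> dict:
    """Build an insertable row that always carries event_id and only writes
    metadata to a column that actually exists in this environment."""
    row = {
        "event_id": event_id or str(uuid.uuid4()),  # UUID is an assumption
        "event_type": event_type,
    }
    column = metadata_column(table_columns)
    if column is not None:
        row[column] = json.dumps(payload)
    return row
```

Resolving the column from the live schema, rather than hard-coding it, is what lets the same code write to environments that have drifted apart.
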
## Daily operator checklist

1. Run feed aggregation for the previous day.
2. Run the evaluator and compare commands:
   - `php artisan analytics:evaluate-feed-weights --from=YYYY-MM-DD --to=YYYY-MM-DD --json`
   - `php artisan analytics:compare-feed-ab clip-cosine-v1 clip-cosine-v2 --from=YYYY-MM-DD --to=YYYY-MM-DD --json`
3. Record deltas for CTR, `long_dwell_share`, and diversity concentration.
4. Record `save_rate` as informational only.
5. Decide: hold, promote gate, or roll back.

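
To avoid typing dates by hand, steps 1 and 2 of the daily checklist can be generated for the previous day. A minimal sketch, assuming a single-day window (`--from` and `--to` both set to yesterday); the helper name `daily_commands` is hypothetical, while the commands and flags are the ones listed in this runbook.

```python
from datetime import date, timedelta


def daily_commands(today: date,
                   baseline: str = "clip-cosine-v1",
                   candidate: str = "clip-cosine-v2") -> list:
    """Return the aggregation, evaluation, and comparison commands
    as argument lists, all targeting the day before `today`."""
    day = (today - timedelta(days=1)).isoformat()  # YYYY-MM-DD
    return [
        ["php", "artisan", "analytics:aggregate-feed", f"--date={day}"],
        ["php", "artisan", "analytics:evaluate-feed-weights",
         f"--from={day}", f"--to={day}", "--json"],
        ["php", "artisan", "analytics:compare-feed-ab", baseline, candidate,
         f"--from={day}", f"--to={day}", "--json"],
    ]
```

Each argument list can be passed to `subprocess.run(cmd, check=True)` from the application root so that a failed step stops the sequence.
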
## First 24h verification checklist

1. Confirm rollout activation and gate state:
   - `DISCOVERY_ROLLOUT_ENABLED=true`
   - `DISCOVERY_ROLLOUT_ACTIVE_GATE=g10`
   - `DISCOVERY_FORCE_ALGO_VERSION` empty
2. Verify both algos are receiving traffic in analytics:
   - candidate (`clip-cosine-v2`) should be near 10% share (allow normal variance)
   - baseline (`clip-cosine-v1`) remains dominant
3. Run aggregation/evaluation at least twice in the first day (midday + end-of-day):
   - `php artisan analytics:aggregate-feed --date=YYYY-MM-DD`
   - `php artisan analytics:evaluate-feed-weights --from=YYYY-MM-DD --to=YYYY-MM-DD --json`
   - `php artisan analytics:compare-feed-ab clip-cosine-v1 clip-cosine-v2 --from=YYYY-MM-DD --to=YYYY-MM-DD --json`
4. Check guardrails:
   - CTR drop < rollback threshold
   - long_dwell_share drop < rollback threshold
   - diversity concentration rise < rollback threshold
5. Check save-event ingestion health:
   - save events (`favorite`, `download`) are arriving in `user_discovery_events`
   - `event_id` is always populated
6. If any rollback trigger is breached, apply the emergency rollback preset immediately (see "Immediate rollback (fastest)" under Rollback actions).

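
The traffic-share check in step 2 can be made explicit. This sketch assumes you can obtain a per-algo count (impressions, requests, or similar) from analytics; the tolerance of 2 percentage points is a placeholder for "normal variance", not a value from this runbook, and the helper names are hypothetical.

```python
def candidate_share(counts_by_algo: dict, candidate: str = "clip-cosine-v2"):
    """Fraction of traffic served by the candidate, or None if there is no traffic."""
    total = sum(counts_by_algo.values())
    if total == 0:
        return None
    return counts_by_algo.get(candidate, 0) / total


def share_near_gate(share, gate_pct: float = 10.0,
                    tolerance_pct: float = 2.0) -> bool:
    """True if the share is within tolerance_pct points of the gate percentage.
    tolerance_pct is an assumed placeholder; tune it to your traffic volume."""
    return share is not None and abs(share * 100.0 - gate_pct) <= tolerance_pct
```

A share far from 10% on the first day usually points at a config problem (wrong gate, or a non-empty force toggle) rather than at the algorithm itself.
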