Rival Platform Observability Stack
Product Requirements Document

AI-governance-first monitoring unifying metrics, logs, and traces across the multi-tenant BPMN workflow platform.

v1.0 | APPROVED | 2026-02-12 | ADR-0007
Executive Summary

The Problem

The NestJS API has zero instrumentation. Prometheus, Grafana, Loki, and Tempo are not deployed. No unified observability profile exists in Docker for local development. This creates blind spots in incident response and prevents data-driven AI governance decisions.

7 New Services

Prometheus, Grafana, Loki, Tempo, Promtail, Node Exporter, AlertManager

4 Metric Families

HTTP request count, duration, errors, and active connections for the NestJS API

4 Dashboards

Platform Overview, Workflow Metrics, AI/LLM Monitoring, Infrastructure

18 Alert Rules

9 prompt-enhancement alerts + 9 Styx workflow alerts, consolidated in Prometheus

MVP Goals

Success Criteria

  • MTTD for P0 incidents: < 5 min
  • MTTR for P0 incidents: < 30 min
  • GOV controls monitored: 8
  • Performance impact: < 1%
  • Metric retention: 13 months
  • Developer onboarding: < 15 min

Goal                   | Success Criteria                                      | Timeline
Complete Visibility    | All services instrumented with metrics, logs, traces | Q1 2026
AI Governance          | GOV-011 through GOV-018 controls mapped and monitored | Q1 2026
Production Parity      | Local Docker observability matches GKE production    | Q1 2026
Operational Excellence | MTTD < 5 min, MTTR < 30 min for P0 incidents          | Q2 2026
Architecture

Three Pillars of Observability

Unified metrics, logs, and traces flowing through OpenTelemetry to purpose-built storage backends with Grafana as the single pane of glass.

NestJS API (prom-client + OTEL SDK) → OTEL Collector (route & transform) → Storage (Prometheus / Loki / Tempo) → Grafana (dashboards & alerts)
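
As a rough sketch of the first hop in this pipeline, the NestJS API could bootstrap the OpenTelemetry Node SDK and export spans over OTLP to the collector. The collector endpoint, service name, and file layout below are illustrative assumptions, not the final implementation.

```typescript
// tracing.ts -- hypothetical OTEL bootstrap, loaded before the NestJS app starts.
// The OTLP endpoint and service name are assumptions for local Docker Compose.
import { NodeSDK } from '@opentelemetry/sdk-node';
import { OTLPTraceExporter } from '@opentelemetry/exporter-trace-otlp-http';
import { getNodeAutoInstrumentations } from '@opentelemetry/auto-instrumentations-node';

const sdk = new NodeSDK({
  serviceName: 'rival-api',
  traceExporter: new OTLPTraceExporter({
    // OTEL Collector OTLP/HTTP receiver (assumed default port)
    url: 'http://otel-collector:4318/v1/traces',
  }),
  instrumentations: [getNodeAutoInstrumentations()],
});

sdk.start();

// Flush pending spans on shutdown so the collector receives the tail of each trace.
process.on('SIGTERM', () => {
  sdk.shutdown().finally(() => process.exit(0));
});
```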

Metrics (Prometheus)

  • rival_api_http_requests_total
  • rival_api_http_request_duration_seconds
  • rival_api_errors_total
  • rival_api_active_connections
  • styx_workflow_* (20+ worker metrics)
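
A minimal sketch of how the four API metric families above might be registered with prom-client. Only the metric names come from this PRD; label names and histogram buckets are illustrative assumptions.

```typescript
// metrics.ts -- sketch of the rival_api_* metric families using prom-client.
// Labels and buckets are assumptions, not the final instrumentation spec.
import { Counter, Histogram, Gauge, collectDefaultMetrics, register } from 'prom-client';

collectDefaultMetrics(); // Node.js process/runtime default metrics

export const httpRequestsTotal = new Counter({
  name: 'rival_api_http_requests_total',
  help: 'Total HTTP requests handled by the NestJS API',
  labelNames: ['method', 'route', 'status_code'],
});

export const httpRequestDuration = new Histogram({
  name: 'rival_api_http_request_duration_seconds',
  help: 'HTTP request duration in seconds',
  labelNames: ['method', 'route'],
  buckets: [0.05, 0.1, 0.25, 0.5, 1, 2.5, 5, 10],
});

export const errorsTotal = new Counter({
  name: 'rival_api_errors_total',
  help: 'Unhandled errors thrown while serving requests',
  labelNames: ['route'],
});

export const activeConnections = new Gauge({
  name: 'rival_api_active_connections',
  help: 'In-flight HTTP connections',
});

export { register };
```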

Logs (Loki)

  • Docker container log aggregation via Promtail
  • Structured JSON log parsing
  • 30-day retention with compaction
  • Label-based filtering by service
  • Log-to-trace correlation via trace ID
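
Log-to-trace correlation works by stamping the active trace ID into every structured log line so Grafana can pivot from a Loki log entry to the matching Tempo trace. A hedged sketch follows, assuming pino for JSON logging; the field names are a convention, not mandated by this PRD.

```typescript
// logger.ts -- sketch: inject the active OTEL trace/span IDs into structured JSON logs
// so Loki derived fields can link a log line to its Tempo trace.
import pino from 'pino';
import { trace } from '@opentelemetry/api';

export const logger = pino({
  // pino merges the mixin's return value into every log object.
  mixin() {
    const span = trace.getActiveSpan();
    if (!span) return {};
    const { traceId, spanId } = span.spanContext();
    return { trace_id: traceId, span_id: spanId };
  },
});

// Usage: every log line now carries the IDs used for log-to-trace correlation.
logger.info({ workflow_id: 'demo' }, 'workflow step completed');
```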

Traces (Tempo + Jaeger)

  • OTLP receivers from OTEL Collector
  • End-to-end distributed tracing
  • API → Workers → CortexOne spans (sketched after this list)
  • 7-day trace retention
  • Trace-to-log and trace-to-metric links
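
To get the API → Workers → CortexOne spans described above, code outside the auto-instrumented HTTP layer can open spans manually. The sketch below uses the public OpenTelemetry API; the tracer name, attribute, and CortexOne client call are illustrative assumptions.

```typescript
// Sketch: manual child span around a CortexOne call inside a Styx worker.
// Context propagation from the API is handled by the OTEL SDK / auto-instrumentations.
import { trace, SpanStatusCode } from '@opentelemetry/api';

const tracer = trace.getTracer('styx-worker');

export async function callCortexOne(payload: unknown): Promise<unknown> {
  return tracer.startActiveSpan('cortexone.invoke', async (span) => {
    try {
      span.setAttribute('styx.target', 'cortexone'); // illustrative attribute
      return await invokeCortexOne(payload);
    } catch (err) {
      span.recordException(err as Error);
      span.setStatus({ code: SpanStatusCode.ERROR });
      throw err;
    } finally {
      span.end();
    }
  });
}

// Placeholder for the real CortexOne client (not part of this PRD excerpt).
async function invokeCortexOne(payload: unknown): Promise<unknown> {
  return payload;
}
```
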
Infrastructure

Docker Compose Services

All observability services run behind the Docker Compose --profile observability flag, keeping the default local stack lean. One command starts everything: ./scripts/dev.sh --observability

Service       | Image                      | Port | Purpose
Prometheus    | prom/prometheus:v2.45.0    | 9090 | Metrics storage & alerting
Grafana       | grafana/grafana:10.2.3     | 3333 | Dashboards & visualization
Loki          | grafana/loki:2.9.0         | 3100 | Log aggregation
Tempo         | grafana/tempo:2.2.0        | 3200 | Trace storage
Promtail      | grafana/promtail:2.9.0     |      | Log shipping to Loki
Node Exporter | prom/node-exporter:v1.6.0  | 9100 | System metrics
AlertManager  | prom/alertmanager:v0.27.0  | 9093 | Alert routing

Persistent Volumes

  • prometheus_data
  • grafana_data
  • loki_data
  • tempo_data
  • alertmanager_data

Auto-Provisioned

  • 5 Grafana data sources (Prometheus, Loki, Tempo, Jaeger, AlertManager)
  • 4 Grafana dashboards auto-loaded on startup
  • 18 alert rules active in Prometheus
  • 36+ recording rules for efficient queries
Dashboards

Grafana Dashboard Suite

Platform Overview

Service Health & Performance

  • Service up/down status panels
  • API request rate & error rate
  • P95 latency tracking
  • Request rate by service (timeseries)
  • Node memory & CPU usage
Workflow Metrics

BPMN Workflow Intelligence

  • Active workflows & SLA breaches
  • Workflow duration distribution
  • Agent execution time & confidence
  • HITL tasks pending
  • Circuit breaker state
AI/LLM Monitoring

Model & Governance Tracking

  • Model usage distribution (Opus/Sonnet/Haiku)
  • Token usage rate & cost tracking
  • Confidence score trends
  • Structured output parse rate
  • Governance compliance status
Infrastructure

System Resource Monitoring

  • CPU, memory, disk usage
  • Process memory by service
  • OTEL collector self-metrics
  • Log volume by container (Loki)
  • Network I/O monitoring
Alerting

18 Consolidated Alert Rules

Alerts previously scattered across 2 YAML files are consolidated into a single Prometheus-native alert configuration with governance control mapping.

AI Quality & Platform Alerts (9)

  • QualityScoreDrift: median < 0.70 for 1h
  • HighRetryRate: > 20% retry rate for 15m
  • ModelRoutingAnomaly: Opus > 40% for 30m
  • LowHaikuUtilization: Haiku < 30% for 2h
  • LatencySpike: P95 > 300s for 10m
  • ThroughputDegradation: < 50% of normal for 15m
  • GovernanceComplianceDrift: parse failure > 10% for 30m
  • HighManualReviewRate: > 50% manual for 1h
  • CortexOneFunctionErrors: > 10% error for 10m

Styx Workflow Alerts (9)

  • styx_workflow_stuck: active > 8h for 5m
  • styx_agent_failure_rate: > 5% for 15m
  • styx_sla_breach_count: > 3 breaches/h for 5m
  • styx_hitl_task_aging: pending > 4h
  • styx_circuit_breaker_open: state = open for 5m
  • styx_confidence_trending_down: avg < 0.70 for 2h
  • styx_high_regeneration_rate: > 30% for 20m
  • styx_cortexone_latency: P95 > 300s for 30m
  • styx_workflow_completion_rate: < 80% for 1h
Compliance

AI Governance Controls Mapping

Every alert rule maps to a governance control (GOV-011 through GOV-018) with evidence collection for SOC2 CC7.2 and EU AI Act compliance.

Control | Description                  | Monitoring Metric                   | Alert
GOV-011 | Model Quality Assurance      | styx_confidence_score               | QualityScoreDrift
GOV-012 | Model Routing Efficiency     | styx_model_selected_total           | ModelRoutingAnomaly
GOV-013 | Cost Optimization            | styx_tokens_used_total              | LowHaikuUtilization
GOV-014 | Human Oversight              | styx_hitl_tasks_pending             | styx_hitl_task_aging
GOV-015 | Structured Output Integrity  | styx_structured_output_parse_total  | GovernanceComplianceDrift, HighRetryRate
GOV-016 | Workflow SLA Compliance      | styx_sla_breach_total               | styx_sla_breach_count
GOV-017 | Audit Trail Completeness     | styx_workflow_duration_seconds      | styx_workflow_stuck
GOV-018 | Circuit Breaker Resilience   | styx_circuit_breaker_state          | styx_circuit_breaker_open
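
One hypothetical way to make this mapping machine-readable for automated evidence collection is a typed constant mirroring the table above. The data values come straight from the table; the structure itself is an assumption for illustration only.

```typescript
// governance-map.ts -- hypothetical machine-readable form of the GOV control mapping,
// usable by an evidence-collection job. Values mirror the table above; the shape is assumed.
interface GovernanceControl {
  control: string;
  description: string;
  metric: string;
  alerts: string[];
}

export const GOVERNANCE_CONTROLS: GovernanceControl[] = [
  { control: 'GOV-011', description: 'Model Quality Assurance', metric: 'styx_confidence_score', alerts: ['QualityScoreDrift'] },
  { control: 'GOV-012', description: 'Model Routing Efficiency', metric: 'styx_model_selected_total', alerts: ['ModelRoutingAnomaly'] },
  { control: 'GOV-013', description: 'Cost Optimization', metric: 'styx_tokens_used_total', alerts: ['LowHaikuUtilization'] },
  { control: 'GOV-014', description: 'Human Oversight', metric: 'styx_hitl_tasks_pending', alerts: ['styx_hitl_task_aging'] },
  { control: 'GOV-015', description: 'Structured Output Integrity', metric: 'styx_structured_output_parse_total', alerts: ['GovernanceComplianceDrift', 'HighRetryRate'] },
  { control: 'GOV-016', description: 'Workflow SLA Compliance', metric: 'styx_sla_breach_total', alerts: ['styx_sla_breach_count'] },
  { control: 'GOV-017', description: 'Audit Trail Completeness', metric: 'styx_workflow_duration_seconds', alerts: ['styx_workflow_stuck'] },
  { control: 'GOV-018', description: 'Circuit Breaker Resilience', metric: 'styx_circuit_breaker_state', alerts: ['styx_circuit_breaker_open'] },
];
```
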
Implementation

5-Phase Rollout

Phase 1 — Week 1-2

Docker Infrastructure

  • 7 new Docker services
  • 9 configuration files
  • 5 persistent volumes
  • --profile observability
Phase 2 — Week 3-4

API Instrumentation

  • OpenTelemetry tracing SDK
  • 4 Prometheus metric families
  • GET /api/metrics endpoint (sketched below)
  • Default metrics for workers
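
A minimal sketch of the GET /api/metrics endpoint, assuming the prom-client registry is exposed through a plain NestJS controller behind a global api route prefix; the class and file names are illustrative.

```typescript
// metrics.controller.ts -- sketch of the metrics endpoint serving the prom-client
// registry in Prometheus text format. Assumes a global 'api' prefix elsewhere so the
// final path is /api/metrics; naming is an assumption, not the final implementation.
import { Controller, Get, Header } from '@nestjs/common';
import { register } from 'prom-client';

@Controller('metrics')
export class MetricsController {
  @Get()
  @Header('Content-Type', register.contentType)
  async getMetrics(): Promise<string> {
    // prom-client serializes all registered metric families, including defaults.
    return register.metrics();
  }
}
```
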
Phase 3 — Week 5-6

Grafana Dashboards

  • Platform Overview
  • Workflow Metrics
  • AI/LLM Monitoring
  • Infrastructure
Phase 4 — Week 7

Alerting

  • 18 consolidated alerts
  • 36+ recording rules
  • AlertManager routing
  • Governance mapping
Phase 5 — Week 8

Docs & Scripts

  • dev.sh --observability
  • ADR-0007 ACCEPTED
  • Local observability runbook
  • Onboarding guide
Differentiators

Why This Architecture

AI-Native Observability

Phoenix (Arize) for LLM traces: token usage, cost, bias detection, hallucination risk. No competitor offers native AI governance monitoring combined with cost-efficient self-hosting.

Compliance-First Design

Every alert mapped to GOV-011 through GOV-018 controls. 13-month retention for SOC2 CC7.2. EU AI Act audit trail built in. Evidence collection automated.

Zero-Lock-in Stack

100% open source: Prometheus, Grafana, Loki, Tempo, OpenTelemetry. Portable across clouds. We own the data. No per-host or per-GB pricing.

Developer Experience

One command: ./scripts/dev.sh --observability. Pre-provisioned dashboards. Local dev parity with GKE production. Under 15-minute onboarding.

Full PRD available at docs/prd/observability-platform.md (1,640 lines, 73KB)