Rival Platform Observability Stack
Product Requirements Document

AI-governance-first monitoring unifying metrics, logs, and traces across the multi-tenant BPMN workflow platform.

v1.0 | APPROVED | 2026-02-12 | ADR-0007
Executive Summary

The Problem

The NestJS API has zero instrumentation. Prometheus, Grafana, Loki, and Tempo are not deployed. No unified observability profile exists in Docker for local development. This creates blind spots in incident response and prevents data-driven AI governance decisions.

7 New Services

Prometheus, Grafana, Loki, Tempo, Promtail, Node Exporter, AlertManager

4 Metric Families

HTTP request count, duration, errors, and active connections for the NestJS API

4 Dashboards

Platform Overview, Workflow Metrics, AI/LLM Monitoring, Infrastructure

18 Alert Rules

9 prompt-enhancement alerts + 9 Styx workflow alerts, consolidated in Prometheus

MVP Goals

Success Criteria

  • MTTD for P0 incidents: < 5 min
  • MTTR for P0 incidents: < 30 min
  • GOV controls monitored: 8
  • Performance impact: < 1%
  • Metric retention: 13 months
  • Developer onboarding: < 15 min

Goal                   | Success Criteria                                      | Timeline
Complete Visibility    | All services instrumented with metrics, logs, traces | Q1 2026
AI Governance          | GOV-011 through GOV-018 controls mapped and monitored | Q1 2026
Production Parity      | Local Docker observability matches GKE production    | Q1 2026
Operational Excellence | MTTD < 5 min, MTTR < 30 min for P0 incidents          | Q2 2026
Architecture

Three Pillars of Observability

Unified metrics, logs, and traces flowing through OpenTelemetry to purpose-built storage backends with Grafana as the single pane of glass.

NestJS API (prom-client + OTEL SDK) → OTEL Collector (route & transform) → Storage (Prometheus / Loki / Tempo) → Grafana (dashboards & alerts)
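
As a rough sketch of the first hop in this pipeline, the NestJS API could bootstrap the OpenTelemetry Node SDK and export spans over OTLP to the collector. The collector endpoint, service name, and file layout below are illustrative assumptions, not the final implementation.

```typescript
// tracing.ts -- hypothetical OTEL bootstrap, loaded before the NestJS app starts.
// The OTLP endpoint and service name are assumptions for local Docker Compose.
import { NodeSDK } from '@opentelemetry/sdk-node';
import { OTLPTraceExporter } from '@opentelemetry/exporter-trace-otlp-http';
import { getNodeAutoInstrumentations } from '@opentelemetry/auto-instrumentations-node';

const sdk = new NodeSDK({
  serviceName: 'rival-api',
  traceExporter: new OTLPTraceExporter({
    // OTEL Collector OTLP/HTTP receiver (assumed default port)
    url: 'http://otel-collector:4318/v1/traces',
  }),
  instrumentations: [getNodeAutoInstrumentations()],
});

sdk.start();

// Flush pending spans on shutdown so the collector receives the tail of each trace.
process.on('SIGTERM', () => {
  sdk.shutdown().finally(() => process.exit(0));
});
```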

Metrics (Prometheus)

  • rival_api_http_requests_total
  • rival_api_http_request_duration_seconds
  • rival_api_errors_total
  • rival_api_active_connections
  • styx_workflow_* (20+ worker metrics)
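
A minimal sketch of how the four API metric families above might be registered with prom-client. Only the metric names come from this PRD; label names and histogram buckets are illustrative assumptions.

```typescript
// metrics.ts -- sketch of the rival_api_* metric families using prom-client.
// Labels and buckets are assumptions, not the final instrumentation spec.
import { Counter, Histogram, Gauge, collectDefaultMetrics, register } from 'prom-client';

collectDefaultMetrics(); // Node.js process/runtime default metrics

export const httpRequestsTotal = new Counter({
  name: 'rival_api_http_requests_total',
  help: 'Total HTTP requests handled by the NestJS API',
  labelNames: ['method', 'route', 'status_code'],
});

export const httpRequestDuration = new Histogram({
  name: 'rival_api_http_request_duration_seconds',
  help: 'HTTP request duration in seconds',
  labelNames: ['method', 'route'],
  buckets: [0.05, 0.1, 0.25, 0.5, 1, 2.5, 5, 10],
});

export const errorsTotal = new Counter({
  name: 'rival_api_errors_total',
  help: 'Unhandled errors thrown while serving requests',
  labelNames: ['route'],
});

export const activeConnections = new Gauge({
  name: 'rival_api_active_connections',
  help: 'In-flight HTTP connections',
});

export { register };
```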

Logs (Loki)

  • Docker container log aggregation via Promtail
  • Structured JSON log parsing
  • 30-day retention with compaction
  • Label-based filtering by service
  • Log-to-trace correlation via trace ID
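
Log-to-trace correlation works by stamping the active trace ID into every structured log line so Grafana can pivot from a Loki log entry to the matching Tempo trace. A hedged sketch follows, assuming pino for JSON logging; the field names are a convention, not mandated by this PRD.

```typescript
// logger.ts -- sketch: inject the active OTEL trace/span IDs into structured JSON logs
// so Loki derived fields can link a log line to its Tempo trace.
import pino from 'pino';
import { trace } from '@opentelemetry/api';

export const logger = pino({
  // pino merges the mixin's return value into every log object.
  mixin() {
    const span = trace.getActiveSpan();
    if (!span) return {};
    const { traceId, spanId } = span.spanContext();
    return { trace_id: traceId, span_id: spanId };
  },
});

// Usage: every log line now carries the IDs used for log-to-trace correlation.
logger.info({ workflow_id: 'demo' }, 'workflow step completed');
```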

Traces (Tempo + Jaeger)

  • OTLP receivers from OTEL Collector
  • End-to-end distributed tracing
  • API → Workers → CortexOne spans (sketched after this list)
  • 7-day trace retention
  • Trace-to-log and trace-to-metric links
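
To get the API → Workers → CortexOne spans described above, code outside the auto-instrumented HTTP layer can open spans manually. The sketch below uses the public OpenTelemetry API; the tracer name, attribute, and CortexOne client call are illustrative assumptions.

```typescript
// Sketch: manual child span around a CortexOne call inside a Styx worker.
// Context propagation from the API is handled by the OTEL SDK / auto-instrumentations.
import { trace, SpanStatusCode } from '@opentelemetry/api';

const tracer = trace.getTracer('styx-worker');

export async function callCortexOne(payload: unknown): Promise<unknown> {
  return tracer.startActiveSpan('cortexone.invoke', async (span) => {
    try {
      span.setAttribute('styx.target', 'cortexone'); // illustrative attribute
      return await invokeCortexOne(payload);
    } catch (err) {
      span.recordException(err as Error);
      span.setStatus({ code: SpanStatusCode.ERROR });
      throw err;
    } finally {
      span.end();
    }
  });
}

// Placeholder for the real CortexOne client (not part of this PRD excerpt).
async function invokeCortexOne(payload: unknown): Promise<unknown> {
  return payload;
}
```
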
Infrastructure

Docker Compose Services

All observability services run behind the Docker Compose --profile observability flag, keeping the default local stack lean. One command starts everything: ./scripts/dev.sh --observability

Service       | Image                      | Port | Purpose
Prometheus    | prom/prometheus:v2.45.0    | 9090 | Metrics storage & alerting
Grafana       | grafana/grafana:10.2.3     | 3333 | Dashboards & visualization
Loki          | grafana/loki:2.9.0         | 3100 | Log aggregation
Tempo         | grafana/tempo:2.2.0        | 3200 | Trace storage
Promtail      | grafana/promtail:2.9.0     |      | Log shipping to Loki
Node Exporter | prom/node-exporter:v1.6.0  | 9100 | System metrics
AlertManager  | prom/alertmanager:v0.27.0  | 9093 | Alert routing

Persistent Volumes

  • prometheus_data
  • grafana_data
  • loki_data
  • tempo_data
  • alertmanager_data

Auto-Provisioned

  • 5 Grafana data sources (Prometheus, Loki, Tempo, Jaeger, AlertManager)
  • 4 Grafana dashboards auto-loaded on startup
  • 18 alert rules active in Prometheus
  • 36+ recording rules for efficient queries
Dashboards

Grafana Dashboard Suite

Platform Overview

Service Health & Performance

  • Service up/down status panels
  • API request rate & error rate
  • P95 latency tracking
  • Request rate by service (timeseries)
  • Node memory & CPU usage
Workflow Metrics

BPMN Workflow Intelligence

  • Active workflows & SLA breaches
  • Workflow duration distribution
  • Agent execution time & confidence
  • HITL tasks pending
  • Circuit breaker state
AI/LLM Monitoring

Model & Governance Tracking

  • Model usage distribution (Opus/Sonnet/Haiku)
  • Token usage rate & cost tracking
  • Confidence score trends
  • Structured output parse rate
  • Governance compliance status
Infrastructure

System Resource Monitoring

  • CPU, memory, disk usage
  • Process memory by service
  • OTEL collector self-metrics
  • Log volume by container (Loki)
  • Network I/O monitoring
Alerting

18 Consolidated Alert Rules

Alerts previously scattered across 2 YAML files are consolidated into a single Prometheus-native alert configuration with governance control mapping.

AI Quality & Platform Alerts (9)

  • QualityScoreDrift: median < 0.70 for 1h
  • HighRetryRate: > 20% retry rate for 15m
  • ModelRoutingAnomaly: Opus > 40% for 30m
  • LowHaikuUtilization: Haiku < 30% for 2h
  • LatencySpike: P95 > 300s for 10m
  • ThroughputDegradation: < 50% of normal for 15m
  • GovernanceComplianceDrift: parse failure > 10% for 30m
  • HighManualReviewRate: > 50% manual for 1h
  • CortexOneFunctionErrors: > 10% error for 10m

Styx Workflow Alerts (9)

  • styx_workflow_stuck: active > 8h for 5m
  • styx_agent_failure_rate: > 5% for 15m
  • styx_sla_breach_count: > 3 breaches/h for 5m
  • styx_hitl_task_aging: pending > 4h
  • styx_circuit_breaker_open: state = open for 5m
  • styx_confidence_trending_down: avg < 0.70 for 2h
  • styx_high_regeneration_rate: > 30% for 20m
  • styx_cortexone_latency: P95 > 300s for 30m
  • styx_workflow_completion_rate: < 80% for 1h
Compliance

AI Governance Controls Mapping

Every alert rule maps to a governance control (GOV-011 through GOV-018) with evidence collection for SOC2 CC7.2 and EU AI Act compliance.

Control | Description                  | Monitoring Metric                   | Alert
GOV-011 | Model Quality Assurance      | styx_confidence_score               | QualityScoreDrift
GOV-012 | Model Routing Efficiency     | styx_model_selected_total           | ModelRoutingAnomaly
GOV-013 | Cost Optimization            | styx_tokens_used_total              | LowHaikuUtilization
GOV-014 | Human Oversight              | styx_hitl_tasks_pending             | styx_hitl_task_aging
GOV-015 | Structured Output Integrity  | styx_structured_output_parse_total  | GovernanceComplianceDrift, HighRetryRate
GOV-016 | Workflow SLA Compliance      | styx_sla_breach_total               | styx_sla_breach_count
GOV-017 | Audit Trail Completeness     | styx_workflow_duration_seconds      | styx_workflow_stuck
GOV-018 | Circuit Breaker Resilience   | styx_circuit_breaker_state          | styx_circuit_breaker_open
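
One hypothetical way to make this mapping machine-readable for automated evidence collection is a typed constant mirroring the table above. The data values come straight from the table; the structure itself is an assumption for illustration only.

```typescript
// governance-map.ts -- hypothetical machine-readable form of the GOV control mapping,
// usable by an evidence-collection job. Values mirror the table above; the shape is assumed.
interface GovernanceControl {
  control: string;
  description: string;
  metric: string;
  alerts: string[];
}

export const GOVERNANCE_CONTROLS: GovernanceControl[] = [
  { control: 'GOV-011', description: 'Model Quality Assurance', metric: 'styx_confidence_score', alerts: ['QualityScoreDrift'] },
  { control: 'GOV-012', description: 'Model Routing Efficiency', metric: 'styx_model_selected_total', alerts: ['ModelRoutingAnomaly'] },
  { control: 'GOV-013', description: 'Cost Optimization', metric: 'styx_tokens_used_total', alerts: ['LowHaikuUtilization'] },
  { control: 'GOV-014', description: 'Human Oversight', metric: 'styx_hitl_tasks_pending', alerts: ['styx_hitl_task_aging'] },
  { control: 'GOV-015', description: 'Structured Output Integrity', metric: 'styx_structured_output_parse_total', alerts: ['GovernanceComplianceDrift', 'HighRetryRate'] },
  { control: 'GOV-016', description: 'Workflow SLA Compliance', metric: 'styx_sla_breach_total', alerts: ['styx_sla_breach_count'] },
  { control: 'GOV-017', description: 'Audit Trail Completeness', metric: 'styx_workflow_duration_seconds', alerts: ['styx_workflow_stuck'] },
  { control: 'GOV-018', description: 'Circuit Breaker Resilience', metric: 'styx_circuit_breaker_state', alerts: ['styx_circuit_breaker_open'] },
];
```
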
Implementation

5-Phase Rollout

Phase 1 — Week 1-2

Docker Infrastructure

  • 7 new Docker services
  • 9 configuration files
  • 5 persistent volumes
  • --profile observability
Phase 2 — Week 3-4

API Instrumentation

  • OpenTelemetry tracing SDK
  • 4 Prometheus metric families
  • GET /api/metrics endpoint (sketched below)
  • Default metrics for workers
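
A minimal sketch of the GET /api/metrics endpoint, assuming the prom-client registry is exposed through a plain NestJS controller behind a global api route prefix; the class and file names are illustrative.

```typescript
// metrics.controller.ts -- sketch of the metrics endpoint serving the prom-client
// registry in Prometheus text format. Assumes a global 'api' prefix elsewhere so the
// final path is /api/metrics; naming is an assumption, not the final implementation.
import { Controller, Get, Header } from '@nestjs/common';
import { register } from 'prom-client';

@Controller('metrics')
export class MetricsController {
  @Get()
  @Header('Content-Type', register.contentType)
  async getMetrics(): Promise<string> {
    // prom-client serializes all registered metric families, including defaults.
    return register.metrics();
  }
}
```
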
Phase 3 — Week 5-6

Grafana Dashboards

  • Platform Overview
  • Workflow Metrics
  • AI/LLM Monitoring
  • Infrastructure
Phase 4 — Week 7

Alerting

  • 18 consolidated alerts
  • 36+ recording rules
  • AlertManager routing
  • Governance mapping
Phase 5 — Week 8

Docs & Scripts

  • dev.sh --observability
  • ADR-0007 ACCEPTED
  • Local observability runbook
  • Onboarding guide
Differentiators

Why This Architecture

AI-Native Observability

Phoenix (Arize) for LLM traces: token usage, cost, bias detection, hallucination risk. No competitor offers native AI governance monitoring combined with cost-efficient self-hosting.

Compliance-First Design

Every alert mapped to GOV-011 through GOV-018 controls. 13-month retention for SOC2 CC7.2. EU AI Act audit trail built in. Evidence collection automated.

Zero-Lock-in Stack

100% open source: Prometheus, Grafana, Loki, Tempo, OpenTelemetry. Portable across clouds. We own the data. No per-host or per-GB pricing.

Developer Experience

One command: ./scripts/dev.sh --observability. Pre-provisioned dashboards. Local dev parity with GKE production. Under 15-minute onboarding.

Full PRD available at docs/prd/observability-platform.md (1,640 lines, 73KB)