SLA Dashboard

Status: Complete

Overview

Service Level Agreements are the contractual backbone of a managed SOC. Enterprise customers buy specific response guarantees: a critical alert must be acknowledged within 15 minutes (MTTA) and fully resolved within 4 hours (MTTR). Without automated tracking, SOC teams rely on manual ticket timestamps and spreadsheets, which are error-prone and cannot provide real-time warning before a breach occurs.

The SLA Dashboard module provides real-time and historical SLA compliance tracking for all tenant tiers. It surfaces per-severity scorecards graded A through F, live breach detection with WebSocket alerts, 30-day MTTA/MTTR trend charts, configurable SLA target definitions, multi-level escalation chain management, multi-channel breach notifications, daily/monthly compliance reports, and penalty tracking.

What Was Proposed

Per-tenant SLA definition with MTTA and MTTR targets by severity (critical, high, medium, low)
Real-time alert lifecycle tracking from creation through acknowledgement to resolution
Background breach detection loop running on a 60-second interval
Warning notifications at configurable threshold (default 80% of SLA window elapsed)
Metric snapshot persistence at 5-minute intervals for historical trend analysis
Historical compliance reports and MTTA/MTTR trend data (up to 365 days)
SLA Configuration CRUD with four predefined templates (standard, premium, enterprise, government)
Multi-level escalation chains with per-severity routing to L1/L2/L3/Manager/CISO
Multi-channel breach notifications: WebSocket, email, and Slack
Daily and monthly SLA reports, period-over-period comparison, and penalty tracking

What's Built

Feature	Status
Per-tenant SLA definition (MTTA/MTTR by severity)	✓ Complete
Real-time alert lifecycle tracking	✓ Complete
60-second background breach detection loop	✓ Complete
Warning notifications at configurable threshold	✓ Complete
Metric snapshot persistence (5-minute interval)	✓ Complete
SLA scorecard endpoint (per-severity compliance, grade, avg MTTA/MTTR)	✓ Complete
MTTA and MTTR trend endpoints (1-365 days)	✓ Complete
Historical compliance report endpoint (date-range filtering)	✓ Complete
SLA Config CRUD with 4 templates (standard/premium/enterprise/government)	✓ Complete
Escalation chain engine (L1 through CISO, multi-step, per-severity)	✓ Complete
Breach notification engine (WebSocket, email, Slack channels)	✓ Complete
Daily and monthly SLA report generation	✓ Complete
Period comparison endpoint (delta analysis between two date ranges)	✓ Complete
Penalty tracking and monthly penalty summary	✓ Complete
Frontend: scorecard grid with A-F grades and MTTA/MTTR progress bars	✓ Complete
Frontend: live metrics panel (open alerts, breaches, warnings, compliance %)	✓ Complete
Frontend: 30-day CSS bar chart trend visualization for MTTA and MTTR	✓ Complete
Frontend: active breaches table with elapsed vs target time	✓ Complete
Frontend: collapsible SLA configuration form with per-severity target inputs	✓ Complete

Architecture

Backend Service: SLAManager

File: app/services/sla_manager.py. The SLAManager is a singleton (sla_manager) instantiated at module import time. It combines a core tracker with four sub-services and exposes them to the router as one object:

SLAManager core — Holds a dict of SLADefinition dataclasses per tenant and a dict of SLATracker dataclasses per alert. Runs a background asyncio task checking for breaches every 60 seconds and persisting metric snapshots every 5 minutes. Sends WebSocket push notifications at the warning threshold and on confirmed breach.
SLAConfigManager — Full CRUD for SLAConfig records (in-memory dict per tenant). Provides an apply_template shortcut for the four named tiers. Configs carry their own field-level overrides on top of the template baseline.
EscalationEngine — Manages per-tenant, per-severity escalation chains. Each chain is an ordered list of EscalationStep objects (level, role, timeout, notification method). The engine tracks current escalation level per alert and emits EscalationEvent records with timestamps for full audit history.
BreachNotifier — Configures multi-channel notification recipients per tenant (WebSocket, email, Slack). Stores a notification history log. Triggers notify_breach() and notify_warning() when the breach detection loop fires.
SLAReportEngine — Generates daily and monthly compliance reports using stored metric snapshots. Supports period-over-period comparison and penalty calculation at the configured per-hour rate.
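The check that the 60-second loop applies to each tracked alert can be sketched as follows. This is a minimal illustration, not the code in sla_manager.py: the function name classify_window is hypothetical, and treating an alert as breached at exactly 100% of the target is an assumption.

```python
from datetime import datetime, timedelta, timezone
from enum import Enum


class SLAStatus(str, Enum):
    COMPLIANT = "compliant"
    WARNING = "warning"
    BREACHED = "breached"


def classify_window(started_at: datetime, now: datetime,
                    target: timedelta, threshold_pct: int = 80) -> SLAStatus:
    """Classify one SLA window by the share of its target that has elapsed."""
    elapsed_pct = (now - started_at) / target * 100  # timedelta / timedelta -> float
    if elapsed_pct >= 100:
        return SLAStatus.BREACHED  # assumption: reaching the target counts as a breach
    if elapsed_pct >= threshold_pct:
        return SLAStatus.WARNING
    return SLAStatus.COMPLIANT


created = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
mtta_target = timedelta(minutes=15)  # default critical MTTA target
for minutes in (5, 13, 16):
    status = classify_window(created, created + timedelta(minutes=minutes), mtta_target)
    print(minutes, status.value)
```

The same function would serve both the MTTA window (creation to acknowledgement) and the MTTR window (creation to resolution) by passing the matching target.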

SLA Templates

Template	Critical MTTA	Critical MTTR	Penalty/Hour	Notes
`standard`	30 min	8 hr	$50.00	Default for most tenants
`premium`	15 min	4 hr	$100.00	Faster response commitments
`enterprise`	5 min	1 hr	$250.00	Aggressive targets, highest penalties
`government`	10 min	2 hr	$200.00	FedRAMP compliance flags, data sovereignty, audit logging
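One way to layer field-level overrides on a template baseline is a plain dict merge. The sketch below uses only the values from the table above; the key penalty_per_hour and the function resolve_config are hypothetical names chosen for illustration, and the real SLAConfig may carry more fields.

```python
from typing import Any, Dict, Optional

# Baselines copied from the template table (critical severity only).
TEMPLATES: Dict[str, Dict[str, Any]] = {
    "standard":   {"critical_mtta_minutes": 30, "critical_mttr_hours": 8, "penalty_per_hour": 50.00},
    "premium":    {"critical_mtta_minutes": 15, "critical_mttr_hours": 4, "penalty_per_hour": 100.00},
    "enterprise": {"critical_mtta_minutes": 5,  "critical_mttr_hours": 1, "penalty_per_hour": 250.00},
    "government": {"critical_mtta_minutes": 10, "critical_mttr_hours": 2, "penalty_per_hour": 200.00},
}


def resolve_config(template: str, overrides: Optional[Dict[str, Any]] = None) -> Dict[str, Any]:
    """Return the template baseline with any overrides applied on top."""
    if template not in TEMPLATES:
        raise ValueError(f"unknown template: {template!r}")
    # Later keys win, so overrides replace baseline values; the baseline dict is not mutated.
    return {**TEMPLATES[template], **(overrides or {})}


config = resolve_config("premium", {"critical_mtta_minutes": 10})
print(config)
```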

Key Enums (sla_manager.py)

class SLAStatus(str, Enum):
    COMPLIANT = "compliant"
    WARNING   = "warning"
    BREACHED  = "breached"

class MetricType(str, Enum):
    MTTA             = "mtta"
    MTTR             = "mttr"
    VOLUME           = "volume"
    AUTO_RESOLVE_RATE = "auto_resolve_rate"

class EscalationLevel(str, Enum):
    L1      = "L1"
    L2      = "L2"
    L3      = "L3"
    MANAGER = "Manager"
    CISO    = "CISO"

class NotificationChannel(str, Enum):
    WEBSOCKET = "websocket"
    EMAIL     = "email"
    SLACK     = "slack"
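Because these enums mix in str, their members compare equal to plain strings and serialize to JSON as their values, which is convenient for API payloads. The short sketch below demonstrates that behavior; the helper next_level is hypothetical and assumes the escalation order follows the enum's definition order.

```python
import json
from enum import Enum
from typing import Optional


class SLAStatus(str, Enum):
    COMPLIANT = "compliant"
    WARNING = "warning"
    BREACHED = "breached"


class EscalationLevel(str, Enum):
    L1 = "L1"
    L2 = "L2"
    L3 = "L3"
    MANAGER = "Manager"
    CISO = "CISO"


def next_level(level: EscalationLevel) -> Optional[EscalationLevel]:
    """Return the level after `level` in definition order, or None at the top."""
    levels = list(EscalationLevel)  # iteration follows definition order
    index = levels.index(level)
    return levels[index + 1] if index + 1 < len(levels) else None


# Members behave like their string values.
is_equal = SLAStatus.BREACHED == "breached"
parsed = SLAStatus("warning")  # lookup by value
payload = json.dumps({"status": SLAStatus.BREACHED})
print(is_equal, parsed.name, payload)
```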

API Endpoints

All endpoints are defined in app/routers/sla.py under the prefix /api/v1/sla.

Core Dashboard Metrics

GET  /api/v1/sla/current?tenant_id=...
     # Live metrics: open_alerts, active_breaches, warnings, overall_compliance_pct, overall_health

GET  /api/v1/sla/scorecard?tenant_id=...
     # Per-severity scorecard: compliance_pct, mtta_avg_min, mtta_target_min, mttr_avg_min, mttr_target_min, grade

GET  /api/v1/sla/compliance?tenant_id=...&start_date=...&end_date=...
     # Historical compliance report with optional ISO 8601 date range

GET  /api/v1/sla/trends/mtta?tenant_id=...&days=30
     # MTTA trend data points (days: 1-365)

GET  /api/v1/sla/trends/mttr?tenant_id=...&days=30
     # MTTR trend data points (days: 1-365)

GET  /api/v1/sla/breaches?tenant_id=...
     # Active and recent SLA breaches

GET  /api/v1/sla/alerts/{alert_id}/tracking
     # Full SLA tracking record for a specific alert (404 if not tracked)
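A client can assemble these URLs with the standard library alone. The helper below is a hypothetical sketch: the base URL http://localhost:8000 and the tenant ID acme are placeholders, not values taken from the project.

```python
from urllib.parse import urlencode

API_PREFIX = "/api/v1/sla"


def sla_url(base_url: str, path: str, **params: object) -> str:
    """Build a URL under the SLA prefix, skipping parameters that are None."""
    query = urlencode({k: v for k, v in params.items() if v is not None})
    url = f"{base_url.rstrip('/')}{API_PREFIX}/{path.lstrip('/')}"
    return f"{url}?{query}" if query else url


base = "http://localhost:8000"  # placeholder host
print(sla_url(base, "scorecard", tenant_id="acme"))
print(sla_url(base, "trends/mtta", tenant_id="acme", days=30))
print(sla_url(base, "compliance", tenant_id="acme", start_date="2024-01-01", end_date=None))
```

Dropping None values keeps optional parameters such as start_date and end_date out of the query string when they are not supplied.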

SLA Target Definition

GET /api/v1/sla/definition?tenant_id=...
    # Returns current SLADefinition for the tenant

PUT /api/v1/sla/definition?tenant_id=...
    Body: SLADefinitionRequest {
      critical_mtta_minutes: int (1-120),   critical_mttr_hours: int (1-48),
      high_mtta_minutes: int (1-240),       high_mttr_hours: int (1-72),
      medium_mtta_hours: int (1-24),        medium_mttr_hours: int (1-168),
      low_mtta_hours: int (1-72),           low_mttr_hours: int (1-720),
      notification_threshold_pct: int (50-99)
    }
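The ranges above can be checked client-side before sending the PUT. The sketch below is a standard-library illustration only; how the router validates SLADefinitionRequest is not shown in this document, and treating every field as required is an assumption.

```python
from typing import Any, Dict, List, Tuple

# Inclusive (min, max) bounds copied from the request body description.
RANGES: Dict[str, Tuple[int, int]] = {
    "critical_mtta_minutes": (1, 120), "critical_mttr_hours": (1, 48),
    "high_mtta_minutes": (1, 240),     "high_mttr_hours": (1, 72),
    "medium_mtta_hours": (1, 24),      "medium_mttr_hours": (1, 168),
    "low_mtta_hours": (1, 72),         "low_mttr_hours": (1, 720),
    "notification_threshold_pct": (50, 99),
}


def validate_definition(body: Dict[str, Any]) -> List[str]:
    """Return a list of human-readable errors; an empty list means the body is valid."""
    errors = []
    for field, (low, high) in RANGES.items():
        value = body.get(field)
        # bool is a subclass of int in Python, so reject it explicitly.
        if isinstance(value, bool) or not isinstance(value, int):
            errors.append(f"{field}: expected an integer")
        elif not low <= value <= high:
            errors.append(f"{field}: {value} is outside {low}-{high}")
    return errors


defaults = {
    "critical_mtta_minutes": 15, "critical_mttr_hours": 4,
    "high_mtta_minutes": 30, "high_mttr_hours": 8,
    "medium_mtta_hours": 4, "medium_mttr_hours": 24,
    "low_mtta_hours": 8, "low_mttr_hours": 72,
    "notification_threshold_pct": 80,
}
print(validate_definition(defaults))
print(validate_definition({**defaults, "critical_mtta_minutes": 0}))
```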

SLA Config CRUD

GET    /api/v1/sla/templates
POST   /api/v1/sla/configs/{tenant_id}
       Body: { name, template: "standard|premium|enterprise|government", overrides: {} }
GET    /api/v1/sla/configs/{tenant_id}
GET    /api/v1/sla/configs/{tenant_id}/{config_id}
PUT    /api/v1/sla/configs/{tenant_id}/{config_id}   # Partial update (all fields optional)
DELETE /api/v1/sla/configs/{tenant_id}/{config_id}
POST   /api/v1/sla/configs/{tenant_id}/apply-template
       Body: { template_name: "standard|premium|enterprise|government" }

Escalation Chains

GET  /api/v1/sla/escalations/{tenant_id}
POST /api/v1/sla/escalations/{tenant_id}
     Body: { severity: "critical|high|medium|low", chain: [EscalationStepRequest, ...] }

POST /api/v1/sla/escalate/{alert_id}?tenant_id=...&severity=high
     # Trigger escalation to next level; 400 if chain exhausted or undefined

GET  /api/v1/sla/escalation-status/{alert_id}
GET  /api/v1/sla/escalation-events/{tenant_id}?limit=50
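The escalate behavior described above (advance one step, fail once the chain is exhausted or undefined) might look roughly like the sketch below. The class names EscalationSketch and EscalationError are hypothetical, the tenant and the "Senior Analyst" role are illustrative, and it is an assumption that the first escalation lands on the first step of the chain.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass
class EscalationStep:
    level: str
    role: str
    timeout_minutes: int
    notification_method: str


class EscalationError(Exception):
    """No further step is available; the endpoint would map this to HTTP 400."""


class EscalationSketch:
    def __init__(self) -> None:
        self.chains: Dict[Tuple[str, str], List[EscalationStep]] = {}
        self.position: Dict[str, int] = {}  # alert_id -> index of the current step

    def set_chain(self, tenant_id: str, severity: str, chain: List[EscalationStep]) -> None:
        self.chains[(tenant_id, severity)] = list(chain)

    def escalate(self, alert_id: str, tenant_id: str, severity: str) -> EscalationStep:
        chain = self.chains.get((tenant_id, severity))
        if not chain:
            raise EscalationError("no escalation chain defined")
        next_index = self.position.get(alert_id, -1) + 1
        if next_index >= len(chain):
            raise EscalationError("escalation chain exhausted")
        self.position[alert_id] = next_index
        return chain[next_index]


engine = EscalationSketch()
engine.set_chain("acme", "high", [
    EscalationStep("L1", "SOC Analyst", 15, "slack"),
    EscalationStep("L2", "Senior Analyst", 30, "email"),
])
first = engine.escalate("alert-1", "acme", "high")
second = engine.escalate("alert-1", "acme", "high")
print(first.level, second.level)
```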

Notifications

GET  /api/v1/sla/notifications/{tenant_id}?limit=50
POST /api/v1/sla/notifications/{tenant_id}/configure
     Body: { channels: ["websocket","email","slack"],
             recipients: { "email": ["soc@example.com"] },
             thresholds: [80, 90, 100] }
GET  /api/v1/sla/notifications/{tenant_id}/config
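With a thresholds list such as [80, 90, 100], the notifier needs to decide which thresholds a given check has newly crossed. The sketch below assumes each threshold fires once, at the first check where the elapsed percentage reaches it; the function name is hypothetical and the real BreachNotifier may work differently.

```python
from typing import Iterable, List


def crossed_thresholds(previous_pct: float, current_pct: float,
                       thresholds: Iterable[int]) -> List[int]:
    """Return thresholds passed between the previous and the current check, ascending."""
    return [t for t in sorted(thresholds) if previous_pct < t <= current_pct]


configured = [80, 90, 100]
print(crossed_thresholds(75, 92, configured))   # two thresholds passed in one interval
print(crossed_thresholds(92, 95, configured))   # nothing new to send
print(crossed_thresholds(95, 130, configured))  # the breach threshold
```

Comparing against the previous check rather than only the current value avoids re-sending the same warning on every 60-second pass.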

Reports and Penalties

GET  /api/v1/sla/reports/{tenant_id}/daily?date=YYYY-MM-DD
GET  /api/v1/sla/reports/{tenant_id}/monthly?month=YYYY-MM
GET  /api/v1/sla/reports/{tenant_id}/trends?metric=mtta|mttr|compliance&days=90
POST /api/v1/sla/reports/{tenant_id}/compare
     Body: { period1_start, period1_end, period2_start, period2_end }  # ISO 8601

GET  /api/v1/sla/penalties/{tenant_id}
GET  /api/v1/sla/penalties/{tenant_id}/summary?month=YYYY-MM
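The penalty formula is not spelled out here, so the following is only one plausible reading: charge the per-hour rate for the time by which a breach overran its target, pro-rated by fractional hours. The function name and the pro-rating rule are both assumptions; an implementation could equally round up to whole hours.

```python
from datetime import timedelta


def breach_penalty(elapsed: timedelta, target: timedelta, rate_per_hour: float) -> float:
    """Penalty for one alert: overrun beyond the target, in hours, times the hourly rate."""
    overrun = elapsed - target
    if overrun <= timedelta(0):
        return 0.0  # resolved within the SLA window, no penalty
    overrun_hours = overrun / timedelta(hours=1)  # timedelta / timedelta -> float
    return round(overrun_hours * rate_per_hour, 2)


# A critical alert on the premium tier (4 hr MTTR, $100.00/hour) resolved after 5.5 hours.
penalty = breach_penalty(timedelta(hours=5, minutes=30), timedelta(hours=4), 100.00)
print(penalty)
```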

Routing

Path	Layer	Description
/sla-dashboard	Frontend route (Next.js App Router)	Main SLA Dashboard page
/api/v1/sla	API prefix (FastAPI router)	All SLA backend endpoints

Data Model

The SLA module stores its state in in-memory Python dataclasses rather than SQLAlchemy ORM tables. This lets the demo system run without database migrations while still serving realistic data. The trade-off is that all SLA state is lost whenever the API process restarts.

SLADefinition

Field	Type	Default	Description
`tenant_id`	str	"default"	Tenant scoping key
`critical_mtta_minutes`	int	15	Max acknowledgement time for critical severity alerts
`critical_mttr_hours`	int	4	Max resolution time for critical alerts
`high_mtta_minutes`	int	30	Max acknowledgement time for high severity alerts
`high_mttr_hours`	int	8	Max resolution time for high alerts
`medium_mtta_hours`	int	4	Max acknowledgement time for medium alerts
`medium_mttr_hours`	int	24	Max resolution time for medium alerts
`low_mtta_hours`	int	8	Max acknowledgement time for low alerts
`low_mttr_hours`	int	72	Max resolution time for low alerts
`notification_threshold_pct`	int	80	% of SLA window elapsed before warning notification fires
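Note that the fields mix units: MTTA is in minutes for critical and high but in hours for medium and low, while MTTR is always in hours. A dataclass with the defaults from the table, plus a hypothetical targets() helper that normalizes both values to timedelta, could look like this sketch.

```python
from dataclasses import dataclass
from datetime import timedelta
from typing import Tuple


@dataclass
class SLADefinition:
    tenant_id: str = "default"
    critical_mtta_minutes: int = 15
    critical_mttr_hours: int = 4
    high_mtta_minutes: int = 30
    high_mttr_hours: int = 8
    medium_mtta_hours: int = 4
    medium_mttr_hours: int = 24
    low_mtta_hours: int = 8
    low_mttr_hours: int = 72
    notification_threshold_pct: int = 80

    def targets(self, severity: str) -> Tuple[timedelta, timedelta]:
        """Return (MTTA target, MTTR target) for a severity. Hypothetical helper."""
        if severity in ("critical", "high"):
            mtta = timedelta(minutes=getattr(self, f"{severity}_mtta_minutes"))
        elif severity in ("medium", "low"):
            mtta = timedelta(hours=getattr(self, f"{severity}_mtta_hours"))
        else:
            raise ValueError(f"unknown severity: {severity!r}")
        mttr = timedelta(hours=getattr(self, f"{severity}_mttr_hours"))
        return mtta, mttr


definition = SLADefinition(tenant_id="acme")  # "acme" is a placeholder tenant
for severity in ("critical", "high", "medium", "low"):
    print(severity, *definition.targets(severity))
```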

SLAMetric (snapshot for trend analysis)

Field	Type	Description
`timestamp`	str (ISO 8601)	When this snapshot was captured (every 5 minutes)
`tenant_id`	str	Tenant this metric belongs to
`metric_type`	MetricType enum	mtta / mttr / volume / auto_resolve_rate
`value`	float	Metric value in seconds (MTTA/MTTR) or count (volume)
`severity`	str	critical / high / medium / low
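To turn 5-minute snapshots into the daily points a trend chart needs, the snapshots can be grouped by calendar day and averaged. The sketch below assumes a simple unweighted mean and converts seconds to minutes; daily_trend and the output keys are hypothetical names, and metric_type is kept as a plain string for brevity.

```python
from collections import defaultdict
from dataclasses import dataclass
from statistics import mean
from typing import Dict, List


@dataclass
class SLAMetric:
    timestamp: str    # ISO 8601
    tenant_id: str
    metric_type: str  # "mtta" / "mttr" / "volume" / "auto_resolve_rate"
    value: float      # seconds for mtta and mttr
    severity: str


def daily_trend(snapshots: List[SLAMetric], tenant_id: str, metric_type: str) -> List[Dict]:
    """Average a time-based metric per calendar day, reported in minutes."""
    by_day = defaultdict(list)
    for snap in snapshots:
        if snap.tenant_id == tenant_id and snap.metric_type == metric_type:
            day = snap.timestamp[:10]  # the YYYY-MM-DD prefix of an ISO 8601 timestamp
            by_day[day].append(snap.value / 60)
    return [{"date": day, "value_min": round(mean(values), 2)}
            for day, values in sorted(by_day.items())]


snapshots = [
    SLAMetric("2024-01-01T12:00:00+00:00", "acme", "mtta", 600.0, "critical"),
    SLAMetric("2024-01-01T12:05:00+00:00", "acme", "mtta", 900.0, "critical"),
    SLAMetric("2024-01-02T09:00:00+00:00", "acme", "mtta", 300.0, "critical"),
    SLAMetric("2024-01-02T09:00:00+00:00", "acme", "mttr", 7200.0, "critical"),
]
trend = daily_trend(snapshots, "acme", "mtta")
print(trend)
```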

EscalationStep

Field	Type	Description
`level`	str (EscalationLevel)	L1 / L2 / L3 / Manager / CISO
`role`	str	Role or team to notify (e.g. "SOC Analyst", "CISO")
`timeout_minutes`	int (1-1440)	Minutes to wait before auto-escalating to the next step
`notification_method`	str	email / slack / pager / phone

Prerequisites

SLA Manager singleton — app/services/sla_manager.py must be imported (which happens via the router registration in app/main.py). The singleton starts background asyncio tasks on startup.
Notification Service — app/services/notification_service.py is called internally by the breach notifier to push WebSocket messages. The WebSocket router must also be registered.
Alerts Router integration — app/routers/alerts.py calls sla_manager.track_alert() on alert creation and updates lifecycle state on acknowledgement and resolution so the breach loop has live data.
Frontend API Client — src/lib/api-client.ts provides api.get() and api.put(). All five dashboard data fetches use Promise.all with individual .catch(() => null) guards so any single endpoint failure falls back to mock data without breaking the entire page.

UI Layout

Page Sections (top to bottom)

Header Row — Orange Timer icon, "SLA Dashboard" h1, subtitle text, and a Refresh button (RefreshCw icon spins while loading === true). Refresh calls loadData() which re-fetches all 5 API endpoints.
SLA Scorecard — 4-column grid (Critical, High, Medium, Low). Each card shows: colored severity dot, compliance percentage (green ≥95%, yellow ≥85%, red <85%), A-F grade badge (green/blue/yellow/orange/red), MTTA average vs target with a progress bar (green if within target, red if over), MTTR average vs target with same color logic. Card border/background color matches the compliance level.
Live Metrics — 5-column row of metric tiles: Open Alerts (gray Activity icon), Active Breaches (red XCircle), Warnings (yellow AlertTriangle), Overall Compliance % (dynamically colored based on thresholds), Overall Health (CheckCircle green / AlertTriangle yellow / XCircle red with status label).
Trend Charts — Side-by-side CSS bar charts (no external charting library). Each bar is green when the daily value is at or below the SLA target and red when over. An orange dashed horizontal line marks the target. Hover tooltip shows date and value in minutes. Date range labels at chart bottom. Legend: Under SLA (green), Over SLA (red), Target (orange dash).
Active Breaches Table — White rounded card with columns: Alert ID (monospace), Severity (colored dot + label), Created At (locale-formatted), Elapsed Time (red font when over target), SLA Target, Status ("Breached" red pill or "Warning" yellow pill). Empty state row: "No active breaches -- all SLAs within targets".
SLA Configuration (collapsible) — Section toggle header with ChevronUp/Down. When expanded: 4-column form grid with one column per severity level. Each column has MTTA Target (min) and MTTR Target (min) numeric inputs. Orange "Save Configuration" button submits to PUT /api/v1/sla/definition. Inline success (CheckCircle green) or failure (XCircle red) message appears next to the button after save.
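The color rules used by the scorecard and trend bars reduce to two small threshold functions. The frontend itself is not written in Python; the sketch below only restates the rules from the layout description, with hypothetical function names.

```python
def compliance_color(compliance_pct: float) -> str:
    """Scorecard color: green at 95% or above, yellow at 85% or above, otherwise red."""
    if compliance_pct >= 95:
        return "green"
    if compliance_pct >= 85:
        return "yellow"
    return "red"


def bar_color(average_minutes: float, target_minutes: float) -> str:
    """Progress and trend bars: green when at or below the target, red when over."""
    return "green" if average_minutes <= target_minutes else "red"


for pct in (99.2, 95.0, 90.0, 84.9):
    print(pct, compliance_color(pct))
print(bar_color(12.0, 15.0), bar_color(18.0, 15.0))
```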