SLA Dashboard

Status: Complete

Overview

Service Level Agreements are the contractual backbone of a managed SOC. Enterprise customers buy specific response guarantees: a critical alert must be acknowledged within 15 minutes (MTTA) and fully resolved within 4 hours (MTTR). Without automated tracking, SOC teams rely on manual ticket timestamps and spreadsheets, which are error-prone and cannot provide real-time warning before a breach occurs.

The SLA Dashboard module provides real-time and historical SLA compliance tracking for all tenant tiers. It surfaces per-severity scorecards graded A through F, live breach detection with WebSocket alerts, 30-day MTTA/MTTR trend charts, configurable SLA target definitions, multi-level escalation chain management, multi-channel breach notifications, daily/monthly compliance reports, and penalty tracking.

What Was Proposed

What's Built

Status      Feature
✓ Complete  Per-tenant SLA definition (MTTA/MTTR by severity)
✓ Complete  Real-time alert lifecycle tracking
✓ Complete  60-second background breach detection loop
✓ Complete  Warning notifications at configurable threshold
✓ Complete  Metric snapshot persistence (5-minute interval)
✓ Complete  SLA scorecard endpoint (per-severity compliance, grade, avg MTTA/MTTR)
✓ Complete  MTTA and MTTR trend endpoints (1-365 days)
✓ Complete  Historical compliance report endpoint (date-range filtering)
✓ Complete  SLA Config CRUD with 4 templates (standard/premium/enterprise/government)
✓ Complete  Escalation chain engine (L1 through CISO, multi-step, per-severity)
✓ Complete  Breach notification engine (WebSocket, email, Slack channels)
✓ Complete  Daily and monthly SLA report generation
✓ Complete  Period comparison endpoint (delta analysis between two date ranges)
✓ Complete  Penalty tracking and monthly penalty summary
✓ Complete  Frontend: scorecard grid with A-F grades and MTTA/MTTR progress bars
✓ Complete  Frontend: live metrics panel (open alerts, breaches, warnings, compliance %)
✓ Complete  Frontend: 30-day CSS bar chart trend visualization for MTTA and MTTR
✓ Complete  Frontend: active breaches table with elapsed vs target time
✓ Complete  Frontend: collapsible SLA configuration form with per-severity target inputs

Architecture

Backend Service: SLAManager

File: app/services/sla_manager.py. SLAManager is a singleton (sla_manager) instantiated at module import time. It combines its own core tracking logic (item 1 below) with four sub-services (items 2 through 5) in one object exposed to the router:

  1. SLAManager core — Holds a dict of SLADefinition dataclasses per tenant and a dict of SLATracker dataclasses per alert. Runs a background asyncio task checking for breaches every 60 seconds and persisting metric snapshots every 5 minutes. Sends WebSocket push notifications at the warning threshold and on confirmed breach.
  2. SLAConfigManager — Full CRUD for SLAConfig records (in-memory dict per tenant). Provides an apply_template shortcut for the four named tiers. Configs carry their own field-level overrides on top of the template baseline.
  3. EscalationEngine — Manages per-tenant, per-severity escalation chains. Each chain is an ordered list of EscalationStep objects (level, role, timeout, notification method). The engine tracks current escalation level per alert and emits EscalationEvent records with timestamps for full audit history.
  4. BreachNotifier — Configures multi-channel notification recipients per tenant (WebSocket, email, Slack). Stores a notification history log. Triggers notify_breach() and notify_warning() when the breach detection loop fires.
  5. SLAReportEngine — Generates daily and monthly compliance reports using stored metric snapshots. Supports period-over-period comparison and penalty calculation at the configured per-hour rate.
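The breach check described in item 1 can be sketched as a pure function that classifies one open alert against one target window. This is a minimal illustration, not the code in sla_manager.py: the function name classify, its parameters, and the choice to treat "elapsed equals target" as a breach are all assumptions.

```python
from datetime import datetime, timedelta, timezone
from enum import Enum


class SLAStatus(str, Enum):
    COMPLIANT = "compliant"
    WARNING = "warning"
    BREACHED = "breached"


def classify(started_at: datetime, now: datetime, target: timedelta,
             warning_threshold_pct: int = 80) -> SLAStatus:
    """Classify one open alert against a single SLA window (hypothetical helper)."""
    elapsed = now - started_at
    # Assumption: reaching the target exactly already counts as a breach.
    if elapsed >= target:
        return SLAStatus.BREACHED
    # Integer arithmetic keeps the comparison exact: elapsed/target >= pct/100.
    if elapsed * 100 >= target * warning_threshold_pct:
        return SLAStatus.WARNING
    return SLAStatus.COMPLIANT


if __name__ == "__main__":
    opened = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
    check_time = opened + timedelta(minutes=13)
    print(classify(opened, check_time, timedelta(minutes=15)).value)
```

A background loop would call a function like this for every open tracker on each 60-second tick and push a notification when the status changes.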

SLA Templates

Template    Critical MTTA  Critical MTTR  Penalty/Hour  Notes
standard    30 min         8 hr           $50.00        Default for most tenants
premium     15 min         4 hr           $100.00       Faster response commitments
enterprise  5 min          1 hr           $250.00       Aggressive targets, highest penalties
government  10 min         2 hr           $200.00       FedRAMP compliance flags, data sovereignty, audit logging
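Applying field-level overrides on top of a template baseline can be sketched as a dictionary merge. The values below come from the table above; the key names (including penalty_per_hour) and the function build_config are hypothetical and may not match the real SLAConfig fields.

```python
# Template baselines taken from the SLA Templates table; key names are assumed.
TEMPLATES = {
    "standard":   {"critical_mtta_minutes": 30, "critical_mttr_hours": 8, "penalty_per_hour": 50.0},
    "premium":    {"critical_mtta_minutes": 15, "critical_mttr_hours": 4, "penalty_per_hour": 100.0},
    "enterprise": {"critical_mtta_minutes": 5,  "critical_mttr_hours": 1, "penalty_per_hour": 250.0},
    "government": {"critical_mtta_minutes": 10, "critical_mttr_hours": 2, "penalty_per_hour": 200.0},
}


def build_config(template: str, overrides: dict | None = None) -> dict:
    """Return a copy of the template baseline with overrides applied on top."""
    if template not in TEMPLATES:
        raise ValueError(f"unknown template: {template!r}")
    config = dict(TEMPLATES[template])  # copy so the baseline is never mutated
    overrides = overrides or {}
    unknown = set(overrides) - set(config)
    if unknown:
        raise ValueError(f"unknown override fields: {sorted(unknown)}")
    config.update(overrides)
    return config


if __name__ == "__main__":
    print(build_config("premium", {"critical_mtta_minutes": 10}))
```

Copying the baseline before merging means one tenant's overrides cannot leak into the shared template.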

Key Enums (sla_manager.py)

from enum import Enum

class SLAStatus(str, Enum):
    COMPLIANT = "compliant"
    WARNING   = "warning"
    BREACHED  = "breached"

class MetricType(str, Enum):
    MTTA             = "mtta"
    MTTR             = "mttr"
    VOLUME           = "volume"
    AUTO_RESOLVE_RATE = "auto_resolve_rate"

class EscalationLevel(str, Enum):
    L1      = "L1"
    L2      = "L2"
    L3      = "L3"
    MANAGER = "Manager"
    CISO    = "CISO"

class NotificationChannel(str, Enum):
    WEBSOCKET = "websocket"
    EMAIL     = "email"
    SLACK     = "slack"

API Endpoints

All endpoints are defined in app/routers/sla.py under the prefix /api/v1/sla.

Core Dashboard Metrics

GET  /api/v1/sla/current?tenant_id=...
     # Live metrics: open_alerts, active_breaches, warnings, overall_compliance_pct, overall_health

GET  /api/v1/sla/scorecard?tenant_id=...
     # Per-severity scorecard: compliance_pct, mtta_avg_min, mtta_target_min, mttr_avg_min, mttr_target_min, grade

GET  /api/v1/sla/compliance?tenant_id=...&start_date=...&end_date=...
     # Historical compliance report with optional ISO 8601 date range

GET  /api/v1/sla/trends/mtta?tenant_id=...&days=30
     # MTTA trend data points (days: 1-365)

GET  /api/v1/sla/trends/mttr?tenant_id=...&days=30
     # MTTR trend data points (days: 1-365)

GET  /api/v1/sla/breaches?tenant_id=...
     # Active and recent SLA breaches

GET  /api/v1/sla/alerts/{alert_id}/tracking
     # Full SLA tracking record for a specific alert (404 if not tracked)
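As an illustration of how a client might address the trend endpoints, the sketch below builds the request URL and enforces the documented 1-365 range for days before any request is sent. The base URL and the helper name trend_url are assumptions; only the path and query parameter names come from the listing above.

```python
from urllib.parse import urlencode

BASE_URL = "http://localhost:8000"  # assumption: local development server


def trend_url(metric: str, tenant_id: str, days: int = 30) -> str:
    """Build the URL for GET /api/v1/sla/trends/{metric} (hypothetical helper)."""
    if metric not in ("mtta", "mttr"):
        raise ValueError("metric must be 'mtta' or 'mttr'")
    if not 1 <= days <= 365:
        raise ValueError("days must be between 1 and 365")
    query = urlencode({"tenant_id": tenant_id, "days": days})
    return f"{BASE_URL}/api/v1/sla/trends/{metric}?{query}"


if __name__ == "__main__":
    print(trend_url("mtta", "acme"))
```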

SLA Target Definition

GET /api/v1/sla/definition?tenant_id=...
    # Returns current SLADefinition for the tenant

PUT /api/v1/sla/definition?tenant_id=...
    Body: SLADefinitionRequest {
      critical_mtta_minutes: int (1-120),   critical_mttr_hours: int (1-48),
      high_mtta_minutes: int (1-240),       high_mttr_hours: int (1-72),
      medium_mtta_hours: int (1-24),        medium_mttr_hours: int (1-168),
      low_mtta_hours: int (1-72),           low_mttr_hours: int (1-720),
      notification_threshold_pct: int (50-99)
    }
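The range constraints in SLADefinitionRequest can be checked with a small standard-library validator, sketched below. The ranges are copied from the body above; the function name validate_definition, the error format, and the assumption that every field is required are illustrative only and say nothing about how the router validates requests.

```python
# Allowed (min, max) per field, copied from the SLADefinitionRequest body.
RANGES = {
    "critical_mtta_minutes": (1, 120), "critical_mttr_hours": (1, 48),
    "high_mtta_minutes": (1, 240),     "high_mttr_hours": (1, 72),
    "medium_mtta_hours": (1, 24),      "medium_mttr_hours": (1, 168),
    "low_mtta_hours": (1, 72),         "low_mttr_hours": (1, 720),
    "notification_threshold_pct": (50, 99),
}


def validate_definition(payload: dict) -> list[str]:
    """Return a list of human-readable errors; an empty list means valid."""
    errors = []
    for name, (low, high) in RANGES.items():
        value = payload.get(name)
        # bool is a subclass of int in Python, so reject it explicitly.
        if not isinstance(value, int) or isinstance(value, bool):
            errors.append(f"{name}: expected an integer")
        elif not low <= value <= high:
            errors.append(f"{name}: {value} is outside {low}-{high}")
    return errors


if __name__ == "__main__":
    sample = {name: low for name, (low, _high) in RANGES.items()}
    sample["critical_mtta_minutes"] = 0
    print(validate_definition(sample))
```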

SLA Config CRUD

GET    /api/v1/sla/templates
POST   /api/v1/sla/configs/{tenant_id}
       Body: { name, template: "standard|premium|enterprise|government", overrides: {} }
GET    /api/v1/sla/configs/{tenant_id}
GET    /api/v1/sla/configs/{tenant_id}/{config_id}
PUT    /api/v1/sla/configs/{tenant_id}/{config_id}   # Partial update (all fields optional)
DELETE /api/v1/sla/configs/{tenant_id}/{config_id}
POST   /api/v1/sla/configs/{tenant_id}/apply-template
       Body: { template_name: "standard|premium|enterprise|government" }

Escalation Chains

GET  /api/v1/sla/escalations/{tenant_id}
POST /api/v1/sla/escalations/{tenant_id}
     Body: { severity: "critical|high|medium|low", chain: [EscalationStepRequest, ...] }

POST /api/v1/sla/escalate/{alert_id}?tenant_id=...&severity=high
     # Trigger escalation to next level; 400 if chain exhausted or undefined

GET  /api/v1/sla/escalation-status/{alert_id}
GET  /api/v1/sla/escalation-events/{tenant_id}?limit=50
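The "escalate to next level, fail when the chain is exhausted" behaviour can be sketched with a small in-memory engine. This is not the EscalationEngine from sla_manager.py: the class MiniEscalationEngine, the ChainExhausted exception, and the assumption that the first call moves an alert onto the first step of the chain are all hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass
class EscalationStep:
    level: str
    role: str
    timeout_minutes: int
    notification_method: str


class ChainExhausted(Exception):
    """Raised when an alert is already at the last step (maps to HTTP 400)."""


class MiniEscalationEngine:
    def __init__(self, chain: list[EscalationStep]):
        self.chain = list(chain)
        self.position: dict[str, int] = {}  # alert_id -> index of current step
        self.events: list[dict] = []        # audit history, oldest first

    def escalate(self, alert_id: str) -> EscalationStep:
        next_index = self.position.get(alert_id, -1) + 1
        if next_index >= len(self.chain):
            raise ChainExhausted(f"no further escalation step for {alert_id}")
        step = self.chain[next_index]
        self.position[alert_id] = next_index
        self.events.append({
            "alert_id": alert_id,
            "level": step.level,
            "timestamp": datetime.now(timezone.utc).isoformat(),
        })
        return step


if __name__ == "__main__":
    engine = MiniEscalationEngine([
        EscalationStep("L1", "SOC Analyst", 15, "slack"),
        EscalationStep("CISO", "CISO", 60, "phone"),
    ])
    print(engine.escalate("alert-1").level)
    print(engine.escalate("alert-1").level)
```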

Notifications

GET  /api/v1/sla/notifications/{tenant_id}?limit=50
POST /api/v1/sla/notifications/{tenant_id}/configure
     Body: { channels: ["websocket","email","slack"],
             recipients: { "email": ["soc@example.com"] },
             thresholds: [80, 90, 100] }
GET  /api/v1/sla/notifications/{tenant_id}/config
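One plausible reading of the thresholds list is "percentage of the SLA window elapsed", matching notification_threshold_pct in the SLA definition. Under that assumption, the sketch below works out which thresholds should fire for an alert without repeating ones already sent. The function name and the already_sent bookkeeping are hypothetical.

```python
def newly_crossed(thresholds: list[int], elapsed_pct: float,
                  already_sent: set[int]) -> list[int]:
    """Return thresholds reached by elapsed_pct that have not fired yet, ascending."""
    return [t for t in sorted(thresholds)
            if t <= elapsed_pct and t not in already_sent]


if __name__ == "__main__":
    sent: set[int] = set()
    for pct in (75.0, 85.0, 101.0):
        fired = newly_crossed([80, 90, 100], pct, sent)
        sent.update(fired)
        print(pct, fired)
```

Tracking what has already been sent per alert keeps a 60-second polling loop from emitting the same warning on every tick.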

Reports and Penalties

GET  /api/v1/sla/reports/{tenant_id}/daily?date=YYYY-MM-DD
GET  /api/v1/sla/reports/{tenant_id}/monthly?month=YYYY-MM
GET  /api/v1/sla/reports/{tenant_id}/trends?metric=mtta|mttr|compliance&days=90
POST /api/v1/sla/reports/{tenant_id}/compare
     Body: { period1_start, period1_end, period2_start, period2_end }  # ISO 8601

GET  /api/v1/sla/penalties/{tenant_id}
GET  /api/v1/sla/penalties/{tenant_id}/summary?month=YYYY-MM
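The document states only that penalties are calculated at a configured per-hour rate. A minimal sketch, assuming the penalty is pro-rated on the time spent beyond the MTTR target and rounded to cents, could look like this; both function names and the pro-rating rule are assumptions.

```python
def breach_penalty(elapsed_hours: float, target_hours: float,
                   rate_per_hour: float) -> float:
    """Penalty for one alert: hours beyond the target times the hourly rate."""
    overage_hours = max(0.0, elapsed_hours - target_hours)
    return round(overage_hours * rate_per_hour, 2)


def monthly_total(breaches: list[tuple[float, float]], rate_per_hour: float) -> float:
    """Sum penalties over (elapsed_hours, target_hours) pairs for one month."""
    return round(sum(breach_penalty(e, t, rate_per_hour) for e, t in breaches), 2)


if __name__ == "__main__":
    # A critical alert on the premium tier (4 hr target, $100.00/hour)
    # that took 5.5 hours to resolve.
    print(breach_penalty(5.5, 4, 100.0))
```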

Routing

Path            Layer                                Description
/sla-dashboard  Frontend route (Next.js App Router)  Main SLA Dashboard page
/api/v1/sla     API prefix (FastAPI router)          All SLA backend endpoints

Data Model

The SLA module uses in-memory Python dataclasses rather than SQLAlchemy ORM tables. This lets the demo system run without database migrations while still providing realistic data. Because the state lives only in process memory, it is lost whenever the API restarts.

SLADefinition

Field                       Type  Default    Description
tenant_id                   str   "default"  Tenant scoping key
critical_mtta_minutes       int   15         Max acknowledgement time for critical severity alerts
critical_mttr_hours         int   4          Max resolution time for critical alerts
high_mtta_minutes           int   30         Max acknowledgement time for high severity alerts
high_mttr_hours             int   8          Max resolution time for high alerts
medium_mtta_hours           int   4          Max acknowledgement time for medium alerts
medium_mttr_hours           int   24         Max resolution time for medium alerts
low_mtta_hours              int   8          Max acknowledgement time for low alerts
low_mttr_hours              int   72         Max resolution time for low alerts
notification_threshold_pct  int   80         % of SLA window elapsed before warning notification fires
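The definition mixes units: critical and high MTTA targets are in minutes, while every other target is in hours. A sketch of normalising them to timedelta values before comparison is shown below. The dataclass mirrors the fields and defaults in the table above; the helper targets_for is hypothetical.

```python
from dataclasses import dataclass
from datetime import timedelta


@dataclass
class SLADefinition:
    tenant_id: str = "default"
    critical_mtta_minutes: int = 15
    critical_mttr_hours: int = 4
    high_mtta_minutes: int = 30
    high_mttr_hours: int = 8
    medium_mtta_hours: int = 4
    medium_mttr_hours: int = 24
    low_mtta_hours: int = 8
    low_mttr_hours: int = 72
    notification_threshold_pct: int = 80


def targets_for(definition: SLADefinition, severity: str) -> tuple[timedelta, timedelta]:
    """Return (mtta_target, mttr_target) for a severity as timedeltas."""
    if severity in ("critical", "high"):
        mtta = timedelta(minutes=getattr(definition, f"{severity}_mtta_minutes"))
    elif severity in ("medium", "low"):
        mtta = timedelta(hours=getattr(definition, f"{severity}_mtta_hours"))
    else:
        raise ValueError(f"unknown severity: {severity!r}")
    mttr = timedelta(hours=getattr(definition, f"{severity}_mttr_hours"))
    return mtta, mttr


if __name__ == "__main__":
    mtta, mttr = targets_for(SLADefinition(), "critical")
    print(mtta, mttr)
```

With the defaults, the critical acknowledgement window is 15 minutes, so an 80% notification threshold corresponds to a warning 12 minutes after the alert opens.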

SLAMetric (snapshot for trend analysis)

Field        Type             Description
timestamp    str (ISO 8601)   When this snapshot was captured (every 5 minutes)
tenant_id    str              Tenant this metric belongs to
metric_type  MetricType enum  mtta / mttr / volume / auto_resolve_rate
value        float            Metric value in seconds (MTTA/MTTR) or count (volume)
severity     str              critical / high / medium / low
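To show how 5-minute snapshots could feed a daily trend chart, the sketch below groups snapshots by calendar day and averages them in minutes. Snapshots are represented as plain dicts with the fields from the table above; the function name and the choice to bucket by the date prefix of the ISO 8601 timestamp (which assumes all timestamps share one timezone) are assumptions.

```python
from collections import defaultdict
from statistics import mean


def daily_average_minutes(snapshots: list[dict], metric_type: str,
                          severity: str) -> dict[str, float]:
    """Average MTTA or MTTR per day, converted from seconds to minutes."""
    buckets: dict[str, list[float]] = defaultdict(list)
    for snap in snapshots:
        if snap["metric_type"] == metric_type and snap["severity"] == severity:
            day = snap["timestamp"][:10]  # "YYYY-MM-DD" prefix of ISO 8601
            buckets[day].append(snap["value"] / 60)
    return {day: mean(values) for day, values in sorted(buckets.items())}


if __name__ == "__main__":
    data = [
        {"timestamp": "2024-05-01T10:00:00+00:00", "tenant_id": "acme",
         "metric_type": "mtta", "value": 600.0, "severity": "critical"},
        {"timestamp": "2024-05-01T10:05:00+00:00", "tenant_id": "acme",
         "metric_type": "mtta", "value": 1200.0, "severity": "critical"},
    ]
    print(daily_average_minutes(data, "mtta", "critical"))
```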

EscalationStep

Field                Type                   Description
level                str (EscalationLevel)  L1 / L2 / L3 / Manager / CISO
role                 str                    Role or team to notify (e.g. "SOC Analyst", "CISO")
timeout_minutes      int (1-1440)           Minutes to wait before auto-escalating to the next step
notification_method  str                    email / slack / pager / phone

Prerequisites

UI Layout

Page Sections (top to bottom)