Platform Health & Self-Healing

Autonomous platform health management with circuit breakers, auto-recovery, and self-optimization across all subsystems.

Status: Built
Circuit Breakers: 7
Health States: 4
Frontend: Live

Overview

The Self-Healing Engine (self_healing.py) monitors all platform subsystems: database, Redis, SIEM connections, ML models, threat intel feeds, advisory engine, and WebSocket connections. It recovers automatically from transient failures, uses circuit breakers to keep a failing subsystem from causing cascading failures, and logs every healing action for audit. The engine runs 4 concurrent health check loops at intervals of 30 seconds, 60 seconds, 120 seconds, and 1 hour.
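
As a rough illustration of running several check loops concurrently, the sketch below uses asyncio. Only the four intervals come from the description above; the loop names, `run_periodically`, `run_all_loops`, and the `max_runs` parameter are hypothetical and are not taken from self_healing.py.

```python
import asyncio
from typing import Awaitable, Callable, Dict, List, Optional

# Intervals from the overview; the loop names are hypothetical labels.
LOOP_INTERVALS_SECONDS: Dict[str, float] = {
    "fast": 30,
    "standard": 60,
    "slow": 120,
    "hourly": 3600,
}

Check = Callable[[], Awaitable[None]]


async def run_periodically(check: Check, interval: float,
                           max_runs: Optional[int] = None) -> int:
    """Run `check` every `interval` seconds; return the number of runs."""
    runs = 0
    while max_runs is None or runs < max_runs:
        try:
            await check()
        except Exception as exc:
            # A failing check must not terminate its loop.
            print(f"health check failed: {exc!r}")
        runs += 1
        await asyncio.sleep(interval)
    return runs


async def run_all_loops(checks: Dict[str, Check],
                        intervals: Optional[Dict[str, float]] = None,
                        max_runs: Optional[int] = None) -> List[int]:
    """Start one loop per interval and run them all concurrently."""
    intervals = LOOP_INTERVALS_SECONDS if intervals is None else intervals
    loops = [run_periodically(checks[name], seconds, max_runs)
             for name, seconds in intervals.items()]
    return list(await asyncio.gather(*loops))
```

With `max_runs=None` the loops run until cancelled; a bounded `max_runs` is only there to make the sketch easy to exercise.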

What Was Proposed

What's Built

Feature | Status | Details
Health Status Engine | Complete | 4 states: healthy, degraded, critical, recovering
Circuit Breakers | Complete | 7 breakers: database, redis, siem_sentinel, siem_splunk, threat_intel, advisory_feeds, ml_pipeline. States: closed/open/half_open
Subsystem Monitoring | Complete | Checks: _check_database, _check_redis, _check_ml_models, _check_advisory_engine, _check_websocket_connections
Auto-Recovery | Complete | Transient failure recovery with exponential backoff via circuit breaker recovery_timeout
Healing Audit Log | Complete | All healing actions logged with timestamp, subsystem, action, platform_status
Force Heal | Complete | Admin endpoint to force-heal specific subsystems

Architecture

Circuit Breaker Pattern
CircuitBreaker(name, failure_threshold=5, recovery_timeout=60)
    |
    State Machine:
      CLOSED (normal)  -- record_success() keeps it closed
        |                  record_failure() increments failure_count
        v
      OPEN (blocking)  -- failure_count >= threshold
        |                  rejects all requests
        v (after recovery_timeout)
      HALF_OPEN (testing) -- lets one request through
        |                    success -> CLOSED
        v                    failure -> OPEN

Monitored Services:
  database, redis, siem_sentinel, siem_splunk,
  threat_intel, advisory_feeds, ml_pipeline
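
The state machine above can be sketched in a few lines of Python. This is a minimal illustration, not the code in self_healing.py: the constructor signature, `record_success()`, `record_failure()`, and `failure_count` come from the diagram, while `allow_request()`, the `CircuitState` enum, and the injectable `clock` parameter are assumptions made for the example.

```python
import time
from enum import Enum


class CircuitState(str, Enum):
    CLOSED = "closed"
    OPEN = "open"
    HALF_OPEN = "half_open"


class CircuitBreaker:
    def __init__(self, name, failure_threshold=5, recovery_timeout=60,
                 clock=time.monotonic):
        self.name = name
        self.failure_threshold = failure_threshold
        self.recovery_timeout = recovery_timeout
        self.failure_count = 0
        self.state = CircuitState.CLOSED
        self._opened_at = None
        self._clock = clock  # assumption: injectable so tests can fake time

    def allow_request(self) -> bool:
        """Hypothetical gate: may the caller contact the service now?"""
        if self.state is CircuitState.CLOSED:
            return True
        if self.state is CircuitState.OPEN:
            if self._clock() - self._opened_at >= self.recovery_timeout:
                # Timeout elapsed: let exactly one trial request through.
                self.state = CircuitState.HALF_OPEN
                return True
            return False
        # HALF_OPEN: a trial request is already in flight.
        return False

    def record_success(self) -> None:
        self.failure_count = 0
        self.state = CircuitState.CLOSED

    def record_failure(self) -> None:
        self.failure_count += 1
        failed_trial = self.state is CircuitState.HALF_OPEN
        if failed_trial or self.failure_count >= self.failure_threshold:
            self.state = CircuitState.OPEN
            self._opened_at = self._clock()
```

A caller would check `allow_request()` before contacting the service and then report the outcome with `record_success()` or `record_failure()`. In this sketch a trial request that never reports back leaves the breaker in HALF_OPEN; a real implementation would need a policy for that case.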

Services Involved
self_healing.py       -- SelfHealingEngine: main health loop, circuit breakers, healing log
log_healer.py         -- LogHealer: log-based error detection and auto-healing
log_source_monitor.py -- LogSourceMonitor: monitors log source health and availability

API Routing

Router prefix: /api/v1/platform
Tag: platform-health

Method | Path | Description
GET | /health | Full platform health status across all subsystems
GET | /circuit-breakers | Current state of all 7 circuit breakers
GET | /healing-log | Recent self-healing actions (max 200)
POST | /heal/{subsystem} | Force-heal a specific subsystem (admin)
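
To show how the router prefix combines with the endpoint paths, here is a small client-side sketch using only the standard library. The helper names and the base URL `http://localhost:8000` are placeholders chosen for the example. How the admin endpoint authenticates is not described here, so no credentials are attached.

```python
from urllib.parse import quote
from urllib.request import Request

ROUTER_PREFIX = "/api/v1/platform"  # from the routing section above


def platform_request(base_url: str, method: str, path: str) -> Request:
    """Build (but do not send) a request for a platform-health endpoint."""
    url = base_url.rstrip("/") + ROUTER_PREFIX + path
    return Request(url, method=method)


def force_heal_request(base_url: str, subsystem: str) -> Request:
    """Build the POST request that force-heals one subsystem."""
    # Percent-encode the subsystem so it stays a single path segment.
    return platform_request(base_url, "POST", f"/heal/{quote(subsystem, safe='')}")


# Example: request objects for a health check and a forced heal of redis.
health = platform_request("http://localhost:8000", "GET", "/health")
heal = force_heal_request("http://localhost:8000", "redis")
```

Passing either object to `urllib.request.urlopen` would send it; the sketch stops short of that so it runs without a live server.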

Prerequisites

Data Model

The Self-Healing Engine uses in-memory state (no database models). Key structures:

Component | Type | Description
status | HealthStatus enum | healthy, degraded, critical, recovering
subsystems | dict[str, dict] | Per-subsystem health state
circuit_breakers | dict[str, CircuitBreaker] | 7 circuit breakers with state/failure tracking
healing_log | list[dict] | Audit log of healing actions
check_interval | int | 30 seconds between checks
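
The structures in the table could be held together roughly as follows. This is a sketch under assumptions: the field names, the four status values, the log entry keys, and the 200-entry cap on returned actions come from this page, while the `EngineState` class and the `log_healing` and `recent_healing_actions` helpers are invented for illustration.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from enum import Enum
from typing import Dict, List

MAX_HEALING_LOG = 200  # cap on entries returned by the healing-log endpoint


class HealthStatus(str, Enum):
    HEALTHY = "healthy"
    DEGRADED = "degraded"
    CRITICAL = "critical"
    RECOVERING = "recovering"


@dataclass
class EngineState:
    """Hypothetical container for the engine's in-memory state."""
    status: HealthStatus = HealthStatus.HEALTHY
    subsystems: Dict[str, dict] = field(default_factory=dict)
    circuit_breakers: Dict[str, object] = field(default_factory=dict)
    healing_log: List[dict] = field(default_factory=list)
    check_interval: int = 30  # seconds between checks

    def log_healing(self, subsystem: str, action: str) -> dict:
        """Append one audit entry with the four documented fields."""
        entry = {
            "timestamp": datetime.now(timezone.utc).isoformat(),
            "subsystem": subsystem,
            "action": action,
            "platform_status": self.status.value,
        }
        self.healing_log.append(entry)
        return entry

    def recent_healing_actions(self, limit: int = MAX_HEALING_LOG) -> List[dict]:
        """Return the newest entries, never more than MAX_HEALING_LOG."""
        limit = max(0, min(limit, MAX_HEALING_LOG))
        # Guard limit == 0: a [-0:] slice would return the whole list.
        return self.healing_log[-limit:] if limit else []
```

Because the state lives only in memory, a process restart clears the healing log; whether the real engine also trims the stored list, rather than only the returned slice, is not stated here.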

UI Description

File: platform/frontend/src/app/platform-health/page.tsx