Platform Health & Self-Healing
Autonomous platform health management with circuit breakers, auto-recovery, and self-optimization across all subsystems.
| Status | Circuit Breakers | Health States | Frontend |
|---|---|---|---|
| Built | 7 | 4 | Live |
Overview
The Self-Healing Engine (self_healing.py) monitors all platform subsystems: database, Redis, SIEM connections, ML models, threat intel feeds, advisory engine, and WebSocket connections. It recovers automatically from transient failures, uses circuit breakers to keep one failing dependency from cascading into others, and logs every healing action for audit. The engine runs 4 concurrent health check loops, at 30-second, 60-second, 120-second, and 1-hour intervals.
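The four concurrent loops could be driven by asyncio along the following lines. This is a minimal sketch, not the code in self_healing.py: the names CHECK_INTERVALS, run_periodically, and start_loops are hypothetical, and the `iterations` parameter exists only so the sketch can be exercised without running forever.

```python
import asyncio
from typing import Awaitable, Callable, Dict, Optional

# Intervals in seconds, taken from the overview: 30s, 60s, 120s, 1 hour.
CHECK_INTERVALS = (30, 60, 120, 3600)

Check = Callable[[], Awaitable[None]]


async def run_periodically(interval: float, check: Check,
                           iterations: Optional[int] = None) -> None:
    """Run `check` every `interval` seconds.

    An exception raised by one run is reported and swallowed so that a
    single failing check does not terminate its loop.
    """
    count = 0
    while iterations is None or count < iterations:
        try:
            await check()
        except Exception as exc:
            print(f"health check failed: {exc!r}")
        count += 1
        await asyncio.sleep(interval)


async def start_loops(checks: Dict[float, Check],
                      iterations: Optional[int] = None) -> None:
    """Start one loop per (interval, check) pair and run them concurrently."""
    await asyncio.gather(
        *(run_periodically(interval, check, iterations)
          for interval, check in checks.items())
    )
```

Keeping each interval in its own coroutine means a slow hourly check cannot delay the 30-second one, which is the usual reason for running the loops concurrently rather than from a single timer.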
What Was Proposed
- Monitor all subsystems: DB, Redis, SIEM connections, ML models, threat intel feeds
- Auto-recover from transient failures
- Circuit breakers to prevent cascade failures
- Self-optimize resource usage
- Full audit trail of all healing actions
- Force-heal capability for admin override
What's Built
| Feature | Status | Details |
|---|---|---|
| Health Status Engine | Complete | 4 states: healthy, degraded, critical, recovering |
| Circuit Breakers | Complete | 7 breakers: database, redis, siem_sentinel, siem_splunk, threat_intel, advisory_feeds, ml_pipeline. States: closed/open/half_open |
| Subsystem Monitoring | Complete | Checks: _check_database, _check_redis, _check_ml_models, _check_advisory_engine, _check_websocket_connections |
| Auto-Recovery | Complete | Transient failure recovery with exponential backoff via circuit breaker recovery_timeout |
| Healing Audit Log | Complete | All healing actions logged with timestamp, subsystem, action, platform_status |
| Force Heal | Complete | Admin endpoint to force-heal specific subsystems |
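The Auto-Recovery row mentions exponential backoff tied to the breaker's recovery_timeout, but the exact schedule is not documented here. One common shape, shown purely as an assumption, doubles the wait after each failed recovery attempt and caps it. The function name backoff_timeout and the 1-hour cap are hypothetical; only the 60-second base comes from the CircuitBreaker default.

```python
def backoff_timeout(attempt: int, base: float = 60.0, cap: float = 3600.0) -> float:
    """Return the wait in seconds before the next recovery attempt.

    attempt=0 yields `base`; each later attempt doubles the wait, up to `cap`.
    """
    if attempt < 0:
        raise ValueError("attempt must be non-negative")
    # Clamp the exponent so very large attempt counts cannot overflow a float.
    return min(base * (2 ** min(attempt, 30)), cap)


# First eight waits under the assumed defaults.
schedule = [backoff_timeout(n) for n in range(8)]
```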
Architecture
Circuit Breaker Pattern
CircuitBreaker(name, failure_threshold=5, recovery_timeout=60)

State Machine:
    CLOSED (normal)       record_success() keeps it closed
       |                  record_failure() increments failure_count
       v                  (when failure_count >= failure_threshold)
    OPEN (blocking)       rejects all requests
       |
       v                  (after recovery_timeout)
    HALF_OPEN (testing)   lets one request through
       +-- success -> CLOSED
       +-- failure -> OPEN

Monitored Services:
    database, redis, siem_sentinel, siem_splunk,
    threat_intel, advisory_feeds, ml_pipeline
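The state machine above can be sketched in a few dozen lines. This is an illustrative version under stated assumptions, not the class in self_healing.py: the constructor arguments, record_success(), and record_failure() come from the diagram, while BreakerState, allow_request(), last_failure_time as an attribute, and the injectable `clock` are hypothetical additions.

```python
import time
from enum import Enum


class BreakerState(str, Enum):
    CLOSED = "closed"
    OPEN = "open"
    HALF_OPEN = "half_open"


class CircuitBreaker:
    def __init__(self, name, failure_threshold=5, recovery_timeout=60,
                 clock=time.monotonic):
        self.name = name
        self.failure_threshold = failure_threshold
        self.recovery_timeout = recovery_timeout
        self._clock = clock  # injectable so tests can control time (assumption)
        self.state = BreakerState.CLOSED
        self.failure_count = 0
        self.last_failure_time = None

    def allow_request(self) -> bool:
        """Return True if a call to the protected service may proceed."""
        if self.state is BreakerState.CLOSED:
            return True
        if self.state is BreakerState.OPEN:
            elapsed = self._clock() - self.last_failure_time
            if elapsed >= self.recovery_timeout:
                # Timeout elapsed: admit exactly one trial request.
                self.state = BreakerState.HALF_OPEN
                return True
            return False
        # HALF_OPEN: the trial request is already in flight.
        return False

    def record_success(self) -> None:
        self.failure_count = 0
        self.state = BreakerState.CLOSED

    def record_failure(self) -> None:
        self.failure_count += 1
        self.last_failure_time = self._clock()
        # A failed trial reopens immediately; otherwise open at the threshold.
        if (self.state is BreakerState.HALF_OPEN
                or self.failure_count >= self.failure_threshold):
            self.state = BreakerState.OPEN
```

A caller would check allow_request() before touching the service, then report the outcome with record_success() or record_failure(). Using a monotonic clock avoids spurious state changes when the wall clock is adjusted.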
Services Involved
- self_healing.py -- SelfHealingEngine: main health loop, circuit breakers, healing log
- log_healer.py -- LogHealer: log-based error detection and auto-healing
- log_source_monitor.py -- LogSourceMonitor: monitors log source health and availability
API Routing
Router prefix: /api/v1/platform — Tag: platform-health
| Method | Path | Description |
|---|---|---|
| GET | /health | Full platform health status across all subsystems |
| GET | /circuit-breakers | Current state of all 7 circuit breakers |
| GET | /healing-log | Recent self-healing actions (max 200) |
| POST | /heal/{subsystem} | Force-heal a specific subsystem (admin) |
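Each path is relative to the router prefix, so the full URLs can be assembled as shown in this small standard-library sketch. BASE_URL is a hypothetical host, build_request is an illustrative helper, and passing "redis" as the subsystem assumes subsystem names match the circuit breaker names. Authentication for the admin endpoint is not described in this document and is left out.

```python
from urllib.request import Request

BASE_URL = "http://localhost:8000"  # hypothetical host, adjust for your deployment
PREFIX = "/api/v1/platform"         # router prefix from this section


def build_request(method: str, path: str) -> Request:
    """Build (but do not send) a request for a platform-health endpoint."""
    return Request(f"{BASE_URL}{PREFIX}{path}", method=method)


health = build_request("GET", "/health")
breakers = build_request("GET", "/circuit-breakers")
healing_log = build_request("GET", "/healing-log")
heal_redis = build_request("POST", "/heal/redis")  # subsystem name is an assumption
```

Passing any of these objects to urllib.request.urlopen would send the request.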
Prerequisites
- PostgreSQL for database health checks
- Redis for cache health checks
- SIEM connections (Sentinel, Splunk) for connectivity checks
- ML Pipeline for model health checks
- Advisory Engine for feed health checks
Data Model
The Self-Healing Engine uses in-memory state (no database models). Key structures:
| Component | Type | Description |
|---|---|---|
| status | HealthStatus enum | healthy, degraded, critical, recovering |
| subsystems | dict[str, dict] | Per-subsystem health state |
| circuit_breakers | dict[str, CircuitBreaker] | 7 circuit breakers with state/failure tracking |
| healing_log | list[dict] | Audit log of healing actions |
| check_interval | int | 30 seconds between checks |
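As a rough picture of how these structures might fit together, the sketch below models the in-memory state with a dataclass. The field names and the four log entry keys come from this document; the class name EngineState, the methods log_healing() and recent_healing_log(), and the ISO-8601 UTC timestamp format are assumptions made for illustration.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from enum import Enum
from typing import Any, Dict, List


class HealthStatus(str, Enum):
    HEALTHY = "healthy"
    DEGRADED = "degraded"
    CRITICAL = "critical"
    RECOVERING = "recovering"


@dataclass
class EngineState:
    status: HealthStatus = HealthStatus.HEALTHY
    subsystems: Dict[str, dict] = field(default_factory=dict)
    circuit_breakers: Dict[str, Any] = field(default_factory=dict)
    healing_log: List[dict] = field(default_factory=list)
    check_interval: int = 30  # seconds between checks

    def log_healing(self, subsystem: str, action: str) -> dict:
        """Append an audit entry carrying the four documented fields."""
        entry = {
            "timestamp": datetime.now(timezone.utc).isoformat(),
            "subsystem": subsystem,
            "action": action,
            "platform_status": self.status.value,
        }
        self.healing_log.append(entry)
        return entry

    def recent_healing_log(self, limit: int = 200) -> List[dict]:
        """Return the newest `limit` entries, oldest first."""
        return self.healing_log[-limit:] if limit > 0 else []
```

Because the state lives only in memory, the healing log and breaker counters in a sketch like this are lost on process restart.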
UI Description
File: platform/frontend/src/app/platform-health/page.tsx
- Health Status Banner -- Color-coded platform status: green (healthy), yellow (degraded), red (critical), blue (recovering)
- Circuit Breaker Grid -- Card per service showing state (closed/open/half_open), failure count, last failure time
- Subsystem Status Table -- Per-subsystem health with check timestamps and status indicators
- Healing Log Timeline -- Scrollable log of all healing actions with timestamp, subsystem, action taken, and platform status at time of action
- Force Heal Controls -- Admin buttons to force-heal individual subsystems