Platform Integrity Monitor
Overview
The Platform Integrity Monitor is a self-healing, autonomous infrastructure subsystem built into the ThreatOps SOCaaS API. It addresses a fundamental operational challenge of a fast-moving, multi-module platform: keeping the frontend UI in sync with the backend API surface as new features are continuously deployed.
In practice this means: every time a new FastAPI router is added to the backend, there must be a corresponding Next.js page in the frontend. Without automation, this becomes a manual gap-tracking problem. The Platform Integrity Monitor solves it by continuously scanning both the API router directory and the Next.js app directory, detecting mismatches, auto-generating placeholder frontend pages, queuing Kubernetes rollout deploys, and healing runtime errors — all without human intervention.
The module is composed of three tightly integrated services: PlatformIntegrityMonitor (gap detection + page scaffolding), AutoDeployPipeline (Kubernetes rollout management), and LogHealer (error-pattern-driven auto-healing). All three run as background async loops inside the FastAPI process and expose their state through the /api/v1/integrity router.
What Was Proposed
- Continuous monitoring of API routes vs. frontend pages to detect coverage gaps
- Automatic generation of scaffold frontend pages for any API route that lacks a UI
- Autonomous rebuild and redeploy pipeline triggered when new pages are generated
- Real-time deployment status tracking with per-step audit trail
- Pod log streaming and Kubernetes event inspection for production diagnostics
- Log-based error pattern detection with automated healing actions
- A unified dashboard endpoint combining integrity, deploy, and healer status
- Manual trigger endpoints for on-demand scans, deploys, and healing passes
What's Built
| Feature | Status |
|---|---|
| 30-minute integrity scan loop (API vs. frontend gap detection) | ✓ Complete |
| Filesystem-based API route discovery (scans routers/*.py filenames) | ✓ Complete |
| Next.js App Router page discovery (scans app/**/page.tsx) | ✓ Complete |
| 14-entry ROUTE_TO_PAGE_MAP for known route-to-page relationships | ✓ Complete |
| Frontend Factory Engine integration for page scaffolding | ✓ Complete |
| Fallback basic template generator for direct file creation | ✓ Complete |
| Kubernetes in-cluster rollout restart via AppsV1Api | ✓ Complete |
| Rollout status polling with 120s timeout | ✓ Complete |
| Pod log streaming via CoreV1Api (per-deployment, configurable tail) | ✓ Complete |
| Kubernetes event listing for the threatops namespace | ✓ Complete |
| Log-based auto-healer with 10 error patterns and 6 healing actions | ✓ Complete |
| 60-second healer scan loop with per-pattern cooldown enforcement | ✓ Complete |
| Unified dashboard endpoint combining all three subsystem statuses | ✓ Complete |
| Frontend page with live 30s polling, Run Scan Now, and Deploy Now buttons | ✓ Complete |
| Scan history table (last 5 scans) and generated pages log (last 10) | ✓ Complete |
| Deploy pipeline view with per-step status icons (CheckCircle2 / XCircle) | ✓ Complete |
Architecture
Service Entry Points
All three services are singletons instantiated at module import time and started by the FastAPI application lifespan handler:
import asyncio

from app.services.platform_integrity import integrity_monitor  # PlatformIntegrityMonitor
from app.services.auto_deploy import auto_deploy  # AutoDeployPipeline
from app.services.log_healer import log_healer  # LogHealer

# Started in app/main.py lifespan:
asyncio.create_task(integrity_monitor.start())  # 30-min scan loop
asyncio.create_task(auto_deploy.start())  # 2-min deploy watch loop
asyncio.create_task(log_healer.start())  # 60-sec healer loop
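The startup pattern above can be sketched as a lifespan-style async context manager. This is a minimal sketch, not the actual app/main.py: DemoService is a hypothetical stand-in for the three services, and the extra services parameter is an assumption made so the example runs on its own (a FastAPI lifespan handler receives only the app). It also keeps references to the tasks and cancels them on shutdown, which the excerpt above does not show.

```python
import asyncio
from contextlib import asynccontextmanager


class DemoService:
    """Hypothetical stand-in for a service exposing an async start() loop."""

    def __init__(self, interval_seconds: float) -> None:
        self.interval_seconds = interval_seconds
        self.ticks = 0

    async def start(self) -> None:
        # Loop forever; cancellation at shutdown ends the loop.
        while True:
            self.ticks += 1
            await asyncio.sleep(self.interval_seconds)


@asynccontextmanager
async def lifespan(app, services):
    # Keep references to the tasks: the event loop holds only weak
    # references, so an unreferenced task may be garbage-collected.
    tasks = [asyncio.create_task(s.start()) for s in services]
    try:
        yield  # the application serves requests while suspended here
    finally:
        for task in tasks:
            task.cancel()
        # Wait for cancellation to finish; CancelledError is collected, not raised.
        await asyncio.gather(*tasks, return_exceptions=True)


async def demo() -> list[int]:
    services = [DemoService(0.01), DemoService(0.01), DemoService(0.01)]
    async with lifespan(None, services):
        await asyncio.sleep(0.05)
    return [s.ticks for s in services]
```

Calling asyncio.run(demo()) starts the three loops, lets them tick briefly, and shuts them down cleanly. Startup is not blocked because each loop runs as its own task.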
Integrity Scan Flow
1. Discover API routes: scans *.py filenames from app/routers/ (excluding __init__.py and websocket.py) and also parses app/main.py for prefix="/api/v1/..." patterns. Returns a set of route prefix strings such as /incidents and /alerts.
2. Discover frontend pages: finds page.tsx files under src/app/ using rglob and converts file paths to route strings (e.g., src/app/incidents/page.tsx → /incidents). Dynamic routes containing [ are skipped.
3. Detect gaps: iterates ROUTE_TO_PAGE_MAP. For each entry, checks that the API route exists in the discovered set and that the expected frontend page path is absent. Only entries that meet both conditions produce a gap record.
4. Generate pages: scaffolds each missing page through the Frontend Factory Engine (a PageRequirement with type data_table and features: search, sort, pagination, auto-refresh). Falls back to writing a basic Next.js template directly to the filesystem if the Factory is unavailable.
5. Flag for deploy: when pages were generated, sets integrity_monitor.pending_deploy = True. The AutoDeployPipeline checks this flag every 2 minutes and triggers a rollout restart automatically.
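The discovery and gap-detection steps of the scan flow can be sketched as follows. This is an illustrative sketch only: the function names are hypothetical, the map is a two-entry subset, the filename-to-prefix rule (underscores become hyphens) is an assumption, and the app/main.py prefix parsing is left out.

```python
from pathlib import Path
import tempfile

# Two-entry subset of the route map, for illustration only.
ROUTE_TO_PAGE_MAP = {"/incidents": "/incidents", "/advisories": "/threat-advisories"}
EXCLUDED = {"__init__.py", "websocket.py"}


def discover_api_routes(routers_dir: Path) -> set[str]:
    # Assumption: the route prefix is the module filename with "_" -> "-".
    return {
        "/" + f.stem.replace("_", "-")
        for f in routers_dir.glob("*.py")
        if f.name not in EXCLUDED
    }


def discover_frontend_pages(app_dir: Path) -> set[str]:
    pages = set()
    for page in app_dir.rglob("page.tsx"):
        rel = page.parent.relative_to(app_dir).as_posix()
        if "[" in rel:  # skip dynamic routes such as incidents/[id]
            continue
        pages.add("/" if rel == "." else "/" + rel)
    return pages


def find_gaps(api_routes: set[str], pages: set[str]) -> list[dict]:
    # A gap needs both conditions: the API route exists AND the page is absent.
    return [
        {"api_prefix": api, "frontend_path": page}
        for api, page in ROUTE_TO_PAGE_MAP.items()
        if api in api_routes and page not in pages
    ]


def demo() -> list[dict]:
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        routers = root / "routers"
        routers.mkdir()
        for name in ("__init__.py", "incidents.py", "advisories.py"):
            (routers / name).touch()
        app = root / "src" / "app"
        for sub in ("incidents", "incidents/[id]"):
            (app / sub).mkdir(parents=True)
            (app / sub / "page.tsx").touch()
        return find_gaps(discover_api_routes(routers), discover_frontend_pages(app))
```

In the demo tree, /incidents has a page and /advisories does not, so only the /advisories entry is reported as a gap. A map entry whose API route does not exist on disk produces no gap record.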
Auto-Deploy Pipeline
The AutoDeployPipeline uses the Kubernetes Python client (kubernetes package) loaded with in-cluster config (config.load_incluster_config()) and falls back to kubeconfig for local development. Deploys execute three sequential steps:
- Check Deployment: Calls AppsV1Api.read_namespaced_deployment(name, "threatops") to verify the deployment exists and retrieve current replica counts.
- Rollout Restart: Patches the deployment template annotation kubectl.kubernetes.io/restartedAt with the current UTC timestamp, triggering a rolling restart.
- Wait for Rollout: Polls ready_replicas and updated_replicas every 5 seconds until they equal spec.replicas, or times out after 120 seconds.
Default deployment target: threatops-frontend. Specific deployments can be targeted via POST /api/v1/integrity/trigger-deploy/{deployment_name}.
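The restart and wait steps can be sketched without a cluster. In this sketch the function names are hypothetical, and get_counts is an assumed callable that returns (desired, ready, updated). With the kubernetes client, the patch body would be passed to AppsV1Api.patch_namespaced_deployment(name, namespace, body), and get_counts could wrap read_namespaced_deployment, treating a missing ready_replicas or updated_replicas as 0.

```python
import time
from datetime import datetime, timezone
from typing import Callable, Optional, Tuple


def build_restart_patch(now: Optional[datetime] = None) -> dict:
    """Patch body that bumps the restartedAt pod-template annotation."""
    stamp = (now or datetime.now(timezone.utc)).isoformat()
    return {
        "spec": {
            "template": {
                "metadata": {
                    "annotations": {"kubectl.kubernetes.io/restartedAt": stamp}
                }
            }
        }
    }


def wait_for_rollout(
    get_counts: Callable[[], Tuple[int, int, int]],
    timeout: float = 120.0,
    interval: float = 5.0,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> bool:
    """Poll until ready == updated == desired; False if `timeout` elapses."""
    deadline = clock() + timeout
    while True:
        desired, ready, updated = get_counts()
        if ready == desired and updated == desired:
            return True
        if clock() >= deadline:
            return False
        sleep(interval)
```

Injecting sleep and clock keeps the polling loop testable without waiting for real time to pass. Changing the pod-template annotation is what makes the Deployment controller roll new pods.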
Three Subsystems
1. PlatformIntegrityMonitor
File: platform/api/app/services/platform_integrity.py
The core gap detection and scaffolding engine. Maintains in-memory state: scan_results (last 50 scans), generated_pages (cumulative log), pending_deploy flag, and last_scan timestamp. The 14-entry ROUTE_TO_PAGE_MAP defines the expected API-to-frontend page relationships:
ROUTE_TO_PAGE_MAP = {
"/incidents": "/incidents",
"/alerts": "/alerts",
"/detection-rules": "/detection-rules",
"/tenants": "/tenants",
"/reports": "/reports",
"/compliance": "/compliance",
"/playbooks": "/playbooks",
"/advisories": "/threat-advisories",
"/autonomous": "/autonomous-soc",
"/ml": "/ml-models",
"/platform": "/platform-health",
"/siem": "/siem-status",
"/audit": "/audit-logs",
"/blog": "/blog",
}
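For a gap whose expected page is, say, /threat-advisories, the fallback template path might look roughly like the sketch below. The template text, the function names, and the choice never to overwrite an existing page are all assumptions; the actual fallback template is not shown in this document.

```python
from pathlib import Path
import tempfile

# Hypothetical placeholder template; doubled braces are literal braces
# in str.format().
BASIC_TEMPLATE = """'use client';

export default function {component}() {{
  return <main className="p-6">{title}</main>;
}}
"""


def page_file_for(app_dir: Path, frontend_path: str) -> Path:
    # "/threat-advisories" -> <app_dir>/threat-advisories/page.tsx
    return app_dir / frontend_path.strip("/") / "page.tsx"


def write_basic_page(app_dir: Path, frontend_path: str) -> Path:
    target = page_file_for(app_dir, frontend_path)
    if target.exists():
        return target  # assumption: never overwrite an existing page
    words = frontend_path.strip("/").replace("/", "-").split("-")
    component = "".join(w.capitalize() for w in words) + "Page"
    title = " ".join(w.capitalize() for w in words)
    target.parent.mkdir(parents=True, exist_ok=True)
    target.write_text(BASIC_TEMPLATE.format(component=component, title=title))
    return target
```

Calling write_basic_page(Path("src/app"), "/threat-advisories") would create src/app/threat-advisories/page.tsx with a ThreatAdvisoriesPage component, leaving any existing file untouched.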
2. AutoDeployPipeline
File: platform/api/app/services/auto_deploy.py
Manages Kubernetes-native deploys using the official kubernetes Python client. Maintains deploy_history (last 20 records) and an is_deploying lock to prevent concurrent deploys. Each DeployRecord contains a list of DeployStep objects with description, exit status, stdout, stderr, and timestamp. Supports both pod log streaming (by label selector) and namespace event listing.
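The bounded history and the concurrency guard could be modeled as in the sketch below. MiniPipeline, DeployInProgress, and run_steps are hypothetical names, and the dataclasses carry only a subset of the documented fields; the real class layout may differ.

```python
import asyncio
from collections import deque
from dataclasses import dataclass, field
from datetime import datetime, timezone


def _now() -> str:
    return datetime.now(timezone.utc).isoformat()


@dataclass
class DeployStep:
    description: str
    exit_code: int
    stdout: str = ""
    stderr: str = ""
    timestamp: str = field(default_factory=_now)


@dataclass
class DeployRecord:
    started_at: str = field(default_factory=_now)
    steps: list = field(default_factory=list)
    status: str = "running"


class DeployInProgress(Exception):
    """Raised when a deploy is requested while another one is running."""


class MiniPipeline:
    def __init__(self) -> None:
        self.deploy_history = deque(maxlen=20)  # oldest records fall off
        self.is_deploying = False

    async def deploy(self, run_steps) -> DeployRecord:
        # No await between the check and the set, so the guard is
        # race-free within a single event loop.
        if self.is_deploying:
            raise DeployInProgress()  # a router could map this to HTTP 409
        self.is_deploying = True
        record = DeployRecord()
        self.deploy_history.append(record)
        try:
            record.steps = await run_steps()
            ok = all(step.exit_code == 0 for step in record.steps)
            record.status = "success" if ok else "failed"
        finally:
            self.is_deploying = False
        return record
```

A deque with maxlen=20 caps memory use without any trimming code. The plain boolean flag is enough here because all callers share one event loop; a multi-process deployment would need an external lock.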
3. LogHealer
File: platform/api/app/services/log_healer.py
Autonomous error-pattern detector and healer. Scans Kubernetes events and pod logs from threatops-api and threatops-frontend every 60 seconds. Matches log text against 10 regex patterns and dispatches one of 6 healing actions with per-pattern cooldown enforcement:
| Pattern Name | Severity | Regex Match | Healing Action | Cooldown |
|---|---|---|---|---|
| crash_loop | critical | CrashLoopBackOff / Back-off restarting | ROLLOUT_RESTART | 300s |
| oom_killed | critical | OOMKilled / Out of memory / MemoryPressure | ROLLOUT_RESTART | 300s |
| image_pull_error | critical | ImagePullBackOff / ErrImagePull / Failed to pull image | NOTIFY | 600s |
| db_connection_refused | high | connection refused.*5432 / asyncpg.*ConnectionRefused | REFRESH_DB_POOL | 60s |
| db_pool_exhausted | high | connection pool exhausted / QueuePool limit | REFRESH_DB_POOL | 60s |
| readiness_probe_failed | high | Readiness probe failed / Liveness probe failed | ROLLOUT_RESTART | 180s |
| ssl_certificate_error | high | SSL.*certificate.*expired / TLS handshake.*failed | NOTIFY | 600s |
| disk_pressure | high | DiskPressure / no space left on device | CLEAR_CACHE | 300s |
| redis_connection_error | medium | redis.*ConnectionError / redis.*TimeoutError | RESET_CIRCUIT_BREAKER | 120s |
| unhandled_exception | medium | Traceback (most recent call last) / 500 Internal Server Error | NOTIFY | 60s |
The six healing action types and their implementations:
- ROLLOUT_RESTART: Calls auto_deploy.deploy(deployment_name) to trigger a Kubernetes rolling restart
- REFRESH_DB_POOL: Calls self_healing_engine.force_heal("database") to refresh the SQLAlchemy connection pool
- RESET_CIRCUIT_BREAKER: Iterates self_healing_engine.circuit_breakers and resets any open breakers to closed state with failure_count=0
- CLEAR_CACHE: Triggers Python garbage collection (gc.collect()) and marks cache cleared
- SCALE_UP: Defined in enum; reserved for future Kubernetes HPA scale-out integration
- NOTIFY: Records an alert in the healing log for operator review (used for issues requiring manual resolution: image pull errors, SSL errors, unhandled exceptions)
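Pattern matching with per-pattern cooldowns can be sketched as follows. The class and field names are hypothetical, only two rows of the pattern table are included, and the exact regex spellings and flags are assumptions. The sketch returns the actions that are due rather than executing them.

```python
import re
import time
from dataclasses import dataclass


@dataclass
class ErrorPattern:
    name: str
    regex: re.Pattern
    action: str
    cooldown: float  # seconds


# Two rows from the pattern table; regex spellings are assumptions.
PATTERNS = [
    ErrorPattern(
        "crash_loop",
        re.compile(r"CrashLoopBackOff|Back-off restarting"),
        "ROLLOUT_RESTART",
        300,
    ),
    ErrorPattern(
        "db_pool_exhausted",
        re.compile(r"connection pool exhausted|QueuePool limit", re.IGNORECASE),
        "REFRESH_DB_POOL",
        60,
    ),
]


class CooldownMatcher:
    def __init__(self, patterns, clock=time.monotonic):
        self.patterns = patterns
        self.clock = clock
        self.last_fired = {}  # pattern name -> time of last dispatched action

    def scan(self, log_text: str) -> list:
        """Return (pattern name, action) pairs that are due for dispatch."""
        due = []
        now = self.clock()
        for p in self.patterns:
            if not p.regex.search(log_text):
                continue
            last = self.last_fired.get(p.name)
            if last is not None and now - last < p.cooldown:
                continue  # matched, but still cooling down
            self.last_fired[p.name] = now
            due.append((p.name, p.action))
        return due
```

The cooldown keeps a persistent error, such as a pod stuck in CrashLoopBackOff, from triggering a restart on every 60-second pass. Each pattern tracks its own timer, so a database pool refresh is not delayed by a recent rollout restart.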
API Endpoints
Router file: platform/api/app/routers/integrity.py — Prefix: /api/v1/integrity
Integrity Monitor Endpoints
GET /api/v1/integrity/status
# Returns PlatformIntegrityMonitor state:
# running (bool), last_scan (ISO string), total_scans (int),
# pages_generated_total (int), pending_deploy (bool),
# recent_scans (last 5), generated_pages (last 10)
POST /api/v1/integrity/scan
# Triggers a manual integrity scan immediately
# Returns: { status: "completed", result: { timestamp, api_routes, frontend_pages,
# missing_pages, pages_generated, gaps[], generated[] } }
GET /api/v1/integrity/gaps
# Returns the current gap snapshot (live, not from scan history):
# { api_routes[], frontend_pages[], gaps[{ api_prefix, frontend_path, page_name }],
# total_api_routes, total_frontend_pages, total_gaps }
Deploy Pipeline Endpoints
GET /api/v1/integrity/deploy-status
# AutoDeployPipeline state: is_deploying, total_deploys,
# recent_deploys (last 5), last_deploy
POST /api/v1/integrity/deploy
# Marks pending deploy as complete (clears the pending_deploy flag)
# Does NOT trigger a new deploy
POST /api/v1/integrity/trigger-deploy
# Force triggers rollout restart for threatops-frontend
# Returns 409 if a deploy is already in progress
# Returns: { status: "triggered", result: DeployRecord }
POST /api/v1/integrity/trigger-deploy/{deployment_name}
# Trigger rollout restart for a specific deployment by name
# e.g. POST /api/v1/integrity/trigger-deploy/threatops-api
Pod Observability Endpoints
GET /api/v1/integrity/pods/logs/{deployment_name}?tail=100
# Stream recent pod logs for a named deployment
# Returns: { deployment, pods[{ pod, status, logs }] }
GET /api/v1/integrity/pods/events
# List last 30 Kubernetes events in the threatops namespace
# Returns: { namespace: "threatops", events[{ type, reason, message, object, timestamp, count }] }
Log Healer Endpoints
GET /api/v1/integrity/healer/status
# LogHealer state: running, last_scan, scan_interval_seconds,
# total_errors_detected, total_heals_triggered, error_counts,
# recent_heals (last 20), patterns_monitored, pattern_list
POST /api/v1/integrity/healer/scan
# On-demand log scan + heal pass
# Returns: { status, timestamp, total_errors_detected,
# total_heals_triggered, recent_heals[] }
Combined Dashboard Endpoint
GET /api/v1/integrity/dashboard
# Single-call combined status for UI polling:
# { integrity: {...}, deploy: {...}, healer: {...} }
Frontend Routes
| Layer | Path | Description |
|---|---|---|
| Frontend | /platform-integrity | Next.js App Router page (src/app/platform-integrity/page.tsx) |
| API | /api/v1/integrity/status | Monitor status, fetched on mount and every 30s |
| API | /api/v1/integrity/gaps | Live gap list, fetched on mount and every 30s |
| API | /api/v1/integrity/deploy-status | Deploy pipeline status, fetched as optional supplement |
| API | /api/v1/integrity/scan | POST, triggered by "Run Scan Now" button |
| API | /api/v1/integrity/trigger-deploy | POST, triggered by "Deploy Now" button |
Frontend Page File
File: platform/frontend/src/app/platform-integrity/page.tsx
The page is a React client component ('use client') that polls all three status endpoints in parallel via Promise.all every 30 seconds using setInterval in a useEffect. Two primary actions are available: runScan() (POST /scan then re-fetch all) and triggerDeploy() (POST /trigger-deploy then re-fetch all). Both actions use loading state flags (scanning, deploying) to disable buttons and show spinner icons during execution.
Data Model
All three services maintain in-memory state only — no dedicated database tables are used by the Platform Integrity Monitor. State is lost on API pod restart. The frontend TypeScript interfaces are defined in page.tsx:
IntegrityStatus
| Field | Type | Description |
|---|---|---|
| running | boolean | Whether the 30-min scan loop is active |
| last_scan | string \| null | ISO 8601 timestamp of the most recent scan |
| total_scans | number | Cumulative count of scans since startup |
| pages_generated_total | number | Total auto-generated pages since startup |
| pending_deploy | boolean | True when generated pages are awaiting a deploy |
| recent_scans | ScanResult[] | Last 5 scan result objects |
| generated_pages | GeneratedPage[] | Last 10 auto-generated page records |
ScanResult
| Field | Type | Description |
|---|---|---|
| timestamp | string | ISO 8601 scan start time |
| api_routes | number | Count of discovered API route prefixes |
| frontend_pages | number | Count of discovered Next.js pages |
| missing_pages | number | Count of gaps detected |
| pages_generated | number | Count of pages auto-generated in this scan |
| gaps | string[] | List of API route prefixes that had no frontend page |
| generated | string[] | List of file paths for pages generated in this scan |
GapsResponse
| Field | Type | Description |
|---|---|---|
| api_routes | string[] | Sorted list of all discovered API route prefixes |
| frontend_pages | string[] | Sorted list of all discovered frontend page routes |
| gaps | GapInfo[] | Current gap records with api_prefix, frontend_path, page_name |
| total_api_routes | number | Total API route count |
| total_frontend_pages | number | Total frontend page count |
| total_gaps | number | Current gap count |
DeployRecord
| Field | Type | Description |
|---|---|---|
| started_at | string | ISO 8601 deploy start time |
| completed_at | string \| undefined | ISO 8601 completion time (absent if failed mid-way) |
| steps | DeployStep[] | Ordered list of deploy step results |
| status | string | 'running' \| 'success' \| 'partial' \| 'failed' |
| error | string \| undefined | Top-level error message if deploy failed |
DeployStep
| Field | Type | Description |
|---|---|---|
| description | string | Human-readable step description (e.g. "check_deployment", "rollout_restart") |
| exit_code | number | 0 for success, non-zero for failure |
| stdout | string \| undefined | Step stdout output |
| stderr | string \| undefined | Step stderr output |
| error | string \| undefined | Error message if step failed |
| timestamp | string | ISO 8601 timestamp when step completed |
UI Layout
Page Structure
The page uses a p-6 max-w-7xl mx-auto space-y-6 container and renders seven distinct sections:
- Header: Title "Platform Integrity Monitor" and subtitle "Autonomous frontend scaffold & deploy pipeline — Scans every 30 minutes". Action buttons: "Run Scan Now" (orange, ScanSearch icon with spin while scanning) and "Deploy Now" (slate-800, Rocket icon with bounce while deploying).
- Status Cards Row: 4-column grid: Monitor Status (green/slate Activity icon showing Running/Stopped with last scan time), Coverage (blue ArrowRightLeft icon showing frontend pages / API routes ratio), Active Gaps (amber/green AlertCircle icon with gap count), Pages Generated (orange FileCode2 icon with total count and scan count).
- Pending Deploy Banner: Conditionally shown when status.pending_deploy is true. Amber background with Rocket icon and an inline "Deploy Now" button.
- Gaps & Generated Pages (2-col grid): Left panel lists current gaps with API prefix and expected frontend path. Right panel shows the generated pages log with route, source API route, and timestamp. Both panels show empty states with centered icons when no data exists.
- Recent Scans Table: Shows last 5 scans (most recent first) with columns: Timestamp, API Routes count, Frontend Pages count, Gaps (amber/green badge), Generated (orange/slate badge).
- Deploy Pipeline: Lists recent deployments in reverse chronological order. Each deploy shows status icon (CheckCircle2 / Loader2 / XCircle), status text with color coding, start timestamp, optional error message, and a per-step checklist with individual exit code status icons.
- Info Footer: Orange-tinted description box explaining scan intervals (30 min), deploy check interval (5 min), and the end-to-end automated flow.
Prerequisites
- Kubernetes In-Cluster Config: The API pod must run inside the threatops Kubernetes namespace with a ServiceAccount that has RBAC permissions to read/patch Deployments, read Pods, and list Events. The platform/infrastructure/kubernetes/base/api-rbac.yaml manifest defines these permissions.
- kubernetes Python Package: kubernetes must be in platform/api/requirements.txt. Used by both auto_deploy.py and log_healer.py for all K8s API calls.
- Filesystem Access: The API container must have read/write access to the frontend source tree at platform/frontend/src/app/ and read access to platform/api/app/routers/ for gap detection to function. In the deployed AKS setup, this is handled via a shared volume mount.
- Frontend Factory Service: app/services/frontend_factory.py should be importable for the higher-quality page scaffolding path. If unavailable, the monitor falls back to a basic template writer.
- Self-Healing Engine: app/services/self_healing.py must be available for the LogHealer's REFRESH_DB_POOL and RESET_CIRCUIT_BREAKER actions.
- FastAPI Lifespan Handler: All three services must be started in the application lifespan (app/main.py). They are background async tasks and do not block startup.
- Deployments Present: The threatops-frontend and threatops-api Kubernetes Deployments must exist in the threatops namespace for deploy and log operations to succeed.
- No Database Required: All state is in-memory. The module operates independently of PostgreSQL and Redis.