Platform Integrity Monitor

Complete

Overview

The Platform Integrity Monitor is a self-healing, autonomous infrastructure subsystem built into the ThreatOps SOCaaS API. It addresses a fundamental operational challenge of a fast-moving, multi-module platform: keeping the frontend UI in sync with the backend API surface as new features are continuously deployed.

In practice, every new FastAPI router added to the backend needs a corresponding Next.js page in the frontend. Without automation, tracking those gaps is manual work. The Platform Integrity Monitor continuously scans both the API router directory and the Next.js app directory, detects mismatches, auto-generates placeholder frontend pages, queues Kubernetes rollout restarts, and heals common runtime errors without human intervention. Issues it cannot fix on its own, such as image pull or SSL certificate errors, are recorded for operator review.

The module is composed of three tightly integrated services: PlatformIntegrityMonitor (gap detection + page scaffolding), AutoDeployPipeline (Kubernetes rollout management), and LogHealer (error-pattern-driven auto-healing). All three run as background async loops inside the FastAPI process and expose their state through the /api/v1/integrity router.

What Was Proposed

Continuous monitoring of API routes vs. frontend pages to detect coverage gaps
Automatic generation of scaffold frontend pages for any API route that lacks a UI
Autonomous rebuild and redeploy pipeline triggered when new pages are generated
Real-time deployment status tracking with per-step audit trail
Pod log streaming and Kubernetes event inspection for production diagnostics
Log-based error pattern detection with automated healing actions
A unified dashboard endpoint combining integrity, deploy, and healer status
Manual trigger endpoints for on-demand scans, deploys, and healing passes

What's Built

Feature	Status
30-minute integrity scan loop (API vs. frontend gap detection)	✓ Complete
Filesystem-based API route discovery (scans `routers/*.py` filenames)	✓ Complete
Next.js App Router page discovery (scans `app/**/page.tsx`)	✓ Complete
14-entry ROUTE_TO_PAGE_MAP for known route-to-page relationships	✓ Complete
Frontend Factory Engine integration for page scaffolding	✓ Complete
Fallback basic template generator for direct file creation	✓ Complete
Kubernetes in-cluster rollout restart via `AppsV1Api`	✓ Complete
Rollout status polling with 120s timeout	✓ Complete
Pod log streaming via `CoreV1Api` (per-deployment, configurable tail)	✓ Complete
Kubernetes event listing for the `threatops` namespace	✓ Complete
Log-based auto-healer with 10 error patterns and 6 healing actions	✓ Complete
60-second healer scan loop with per-pattern cooldown enforcement	✓ Complete
Unified dashboard endpoint combining all three subsystem statuses	✓ Complete
Frontend page with live 30s polling, Run Scan Now, and Deploy Now buttons	✓ Complete
Scan history table (last 5 scans) and generated pages log (last 10)	✓ Complete
Deploy pipeline view with per-step status icons (CheckCircle2 / XCircle)	✓ Complete

Architecture

Service Entry Points

All three services are singletons instantiated at module import time and started by the FastAPI application lifespan handler:

import asyncio

from app.services.platform_integrity import integrity_monitor   # PlatformIntegrityMonitor
from app.services.auto_deploy import auto_deploy                 # AutoDeployPipeline
from app.services.log_healer import log_healer                   # LogHealer

# Started in app/main.py lifespan:
asyncio.create_task(integrity_monitor.start())  # 30-min scan loop
asyncio.create_task(auto_deploy.start())        # 2-min deploy watch loop
asyncio.create_task(log_healer.start())         # 60-sec healer loop

Integrity Scan Flow

Discover API Routes: Reads all *.py filenames from app/routers/ (excluding __init__.py and websocket.py). Also parses app/main.py for prefix="/api/v1/..." patterns. Returns a set of route prefix strings such as /incidents and /alerts.

Discover Frontend Pages: Recursively finds all page.tsx files under src/app/ using rglob. Converts file paths to route strings (e.g., src/app/incidents/page.tsx → /incidents). Skips dynamic routes containing [.

Find Gaps: Checks the 14-entry ROUTE_TO_PAGE_MAP. For each entry, verifies both that the API route exists in the discovered set and that the expected frontend page path is absent. Only entries that meet both conditions produce a gap record.

Generate Missing Pages: For each gap, tries the Frontend Factory Engine first (submits a PageRequirement with type data_table and features: search, sort, pagination, auto-refresh). Falls back to writing a basic Next.js template directly to the filesystem if the Factory is unavailable.

Flag for Deploy: If any pages were generated, sets integrity_monitor.pending_deploy = True. The AutoDeployPipeline checks this flag every 2 minutes and triggers a rollout restart automatically.
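The discovery and gap-finding steps above can be sketched as follows. This is a minimal illustration rather than the module's actual code: the function names, the abbreviated map, the underscore-to-hyphen filename conversion, and the way page_name is derived are all assumptions made for the example.

```python
from pathlib import Path

# Abbreviated copy of the map for illustration; the real one has 14 entries.
ROUTE_TO_PAGE_MAP = {
    "/incidents": "/incidents",
    "/advisories": "/threat-advisories",
    "/ml": "/ml-models",
}

EXCLUDED_ROUTER_FILES = {"__init__.py", "websocket.py"}


def discover_api_routes(routers_dir: Path) -> set:
    """Derive route prefixes from router filenames (incidents.py -> /incidents).

    Assumption: underscores in filenames map to hyphens in the prefix.
    """
    return {
        "/" + path.stem.replace("_", "-")
        for path in routers_dir.glob("*.py")
        if path.name not in EXCLUDED_ROUTER_FILES
    }


def discover_frontend_pages(app_dir: Path) -> set:
    """Map every page.tsx under app_dir to its route, skipping dynamic segments."""
    pages = set()
    for page in app_dir.rglob("page.tsx"):
        rel = page.parent.relative_to(app_dir).as_posix()
        if "[" in rel:  # dynamic route such as incidents/[id]
            continue
        pages.add("/" if rel == "." else "/" + rel)
    return pages


def find_gaps(api_routes: set, frontend_pages: set) -> list:
    """A gap exists when the API route is present and its mapped page is absent."""
    return [
        {"api_prefix": api, "frontend_path": page, "page_name": page.strip("/")}
        for api, page in ROUTE_TO_PAGE_MAP.items()
        if api in api_routes and page not in frontend_pages
    ]
```

Because a gap requires both conditions, a map entry whose API router does not exist yet is ignored instead of being reported.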

Auto-Deploy Pipeline

The AutoDeployPipeline uses the Kubernetes Python client (kubernetes package) loaded with in-cluster config (config.load_incluster_config()) and falls back to kubeconfig for local development. Deploys execute three sequential steps:

Check Deployment — Calls AppsV1Api.read_namespaced_deployment(name, "threatops") to verify the deployment exists and retrieve current replica counts.
Rollout Restart — Patches the deployment template annotation kubectl.kubernetes.io/restartedAt with the current UTC timestamp, triggering a rolling restart.
Wait for Rollout — Polls ready_replicas and updated_replicas every 5 seconds until they equal spec.replicas, or times out after 120 seconds.
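
The restart and wait steps can be illustrated without a cluster. The sketch below is an assumption-laden outline, not the pipeline's real code: build_restart_patch, rollout_complete, and wait_for_rollout are hypothetical helper names, and read_counts stands in for a call that reads spec.replicas, status.updated_replicas, and status.ready_replicas from the Deployment.

```python
import time
from datetime import datetime, timezone
from typing import Callable, Optional, Tuple


def build_restart_patch(now: Optional[datetime] = None) -> dict:
    """Patch body that bumps the restartedAt annotation on the pod template."""
    stamp = (now or datetime.now(timezone.utc)).isoformat()
    return {
        "spec": {"template": {"metadata": {"annotations": {
            "kubectl.kubernetes.io/restartedAt": stamp,
        }}}}
    }


def rollout_complete(desired: int, updated: Optional[int], ready: Optional[int]) -> bool:
    """Counts can be None while pods are still starting, so treat None as 0."""
    return (updated or 0) == desired and (ready or 0) == desired


def wait_for_rollout(
    read_counts: Callable[[], Tuple[int, Optional[int], Optional[int]]],
    timeout: float = 120.0,
    interval: float = 5.0,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> bool:
    """Poll until the rollout completes (True) or the timeout elapses (False)."""
    deadline = clock() + timeout
    while True:
        if rollout_complete(*read_counts()):
            return True
        if clock() >= deadline:
            return False
        sleep(interval)
```

With the official client, the patch body would be passed to AppsV1Api.patch_namespaced_deployment(name, namespace, body). Injecting sleep and clock keeps the polling loop testable without waiting in real time.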

Default deployment target: threatops-frontend. Specific deployments can be targeted via POST /api/v1/integrity/trigger-deploy/{deployment_name}.

Three Subsystems

1. PlatformIntegrityMonitor

File: platform/api/app/services/platform_integrity.py

The core gap detection and scaffolding engine. Maintains in-memory state: scan_results (last 50 scans), generated_pages (cumulative log), pending_deploy flag, and last_scan timestamp. The 14-entry ROUTE_TO_PAGE_MAP defines the expected API-to-frontend page relationships:

ROUTE_TO_PAGE_MAP = {
    "/incidents":       "/incidents",
    "/alerts":          "/alerts",
    "/detection-rules": "/detection-rules",
    "/tenants":         "/tenants",
    "/reports":         "/reports",
    "/compliance":      "/compliance",
    "/playbooks":       "/playbooks",
    "/advisories":      "/threat-advisories",
    "/autonomous":      "/autonomous-soc",
    "/ml":              "/ml-models",
    "/platform":        "/platform-health",
    "/siem":            "/siem-status",
    "/audit":           "/audit-logs",
    "/blog":            "/blog",
}
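
The in-memory state described for this subsystem might be kept along these lines. This is a sketch under stated assumptions: the IntegrityState class and its method names are hypothetical, and generated pages are stored as plain path strings here, whereas the real service keeps richer records.

```python
from collections import deque
from typing import Any, Dict, List, Optional


class IntegrityState:
    """Bounded scan history plus the counters exposed by the status endpoint."""

    def __init__(self, max_scans: int = 50) -> None:
        self.scan_results: deque = deque(maxlen=max_scans)  # oldest scans drop off
        self.generated_pages: List[str] = []                # cumulative log
        self.pending_deploy = False
        self.last_scan: Optional[str] = None
        self.total_scans = 0                                # not capped like the deque
        self.running = False

    def record_scan(self, result: Dict[str, Any]) -> None:
        self.scan_results.append(result)
        self.total_scans += 1
        self.last_scan = result["timestamp"]
        if result["generated"]:
            self.generated_pages.extend(result["generated"])
            self.pending_deploy = True  # picked up later by the deploy loop

    def status(self) -> Dict[str, Any]:
        return {
            "running": self.running,
            "last_scan": self.last_scan,
            "total_scans": self.total_scans,
            "pages_generated_total": len(self.generated_pages),
            "pending_deploy": self.pending_deploy,
            "recent_scans": list(self.scan_results)[-5:],
            "generated_pages": self.generated_pages[-10:],
        }
```

Keeping total_scans as a separate counter matters because the deque silently discards entries once it holds 50, so its length cannot serve as a cumulative count.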

2. AutoDeployPipeline

File: platform/api/app/services/auto_deploy.py

Manages Kubernetes-native deploys using the official kubernetes Python client. Maintains deploy_history (last 20 records) and an is_deploying lock to prevent concurrent deploys. Each DeployRecord contains a list of DeployStep objects with description, exit status, stdout, stderr, and timestamp. Supports both pod log streaming (by label selector) and namespace event listing.
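
The combination of a bounded deploy_history and an is_deploying guard could look roughly like the sketch below. DeployTracker, DeployBusyError, and the run_steps callback are hypothetical names introduced for illustration, and the status rule shown (success only if every step exits 0) is an assumption that ignores the 'partial' state.

```python
import asyncio
from collections import deque
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Awaitable, Callable, List, Optional


@dataclass
class DeployRecord:
    deployment: str
    started_at: str
    status: str = "running"
    steps: List[dict] = field(default_factory=list)
    error: Optional[str] = None


class DeployBusyError(RuntimeError):
    """Raised when a deploy is already in flight (an API layer could map this to 409)."""


class DeployTracker:
    def __init__(self, max_history: int = 20) -> None:
        self.deploy_history: deque = deque(maxlen=max_history)
        self.is_deploying = False

    async def deploy(
        self, deployment: str, run_steps: Callable[[str], Awaitable[List[dict]]]
    ) -> DeployRecord:
        # No await between the check and the set, so this is race-free on one event loop.
        if self.is_deploying:
            raise DeployBusyError(f"deploy already running, refusing {deployment}")
        self.is_deploying = True
        record = DeployRecord(deployment, datetime.now(timezone.utc).isoformat())
        self.deploy_history.append(record)
        try:
            record.steps = await run_steps(deployment)
            ok = all(step["exit_code"] == 0 for step in record.steps)
            record.status = "success" if ok else "failed"
        except Exception as exc:  # record the failure instead of losing it
            record.status = "failed"
            record.error = str(exc)
        finally:
            self.is_deploying = False  # always release the guard
        return record
```

A plain boolean is enough here because all three services share a single asyncio event loop; a multi-process deployment would need a lock held outside the process.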

3. LogHealer

File: platform/api/app/services/log_healer.py

Autonomous error-pattern detector and healer. Scans Kubernetes events and pod logs from threatops-api and threatops-frontend every 60 seconds. Matches log text against 10 regex patterns and dispatches the healing action mapped to each match, subject to a per-pattern cooldown. Six action types are defined; the patterns below use five of them, and SCALE_UP is reserved:

Pattern Name	Severity	Regex Match	Healing Action	Cooldown
`crash_loop`	critical	CrashLoopBackOff / Back-off restarting	ROLLOUT_RESTART	300s
`oom_killed`	critical	OOMKilled / Out of memory / MemoryPressure	ROLLOUT_RESTART	300s
`image_pull_error`	critical	ImagePullBackOff / ErrImagePull / Failed to pull image	NOTIFY	600s
`db_connection_refused`	high	connection refused.5432 / asyncpg.ConnectionRefused	REFRESH_DB_POOL	60s
`db_pool_exhausted`	high	connection pool exhausted / QueuePool limit	REFRESH_DB_POOL	60s
`readiness_probe_failed`	high	Readiness probe failed / Liveness probe failed	ROLLOUT_RESTART	180s
`ssl_certificate_error`	high	SSL.certificate.expired / TLS handshake.*failed	NOTIFY	600s
`disk_pressure`	high	DiskPressure / no space left on device	CLEAR_CACHE	300s
`redis_connection_error`	medium	redis.ConnectionError / redis.TimeoutError	RESET_CIRCUIT_BREAKER	120s
`unhandled_exception`	medium	Traceback (most recent call last) / 500 Internal Server Error	NOTIFY	60s

The six healing action types and their implementations:

ROLLOUT_RESTART — Calls auto_deploy.deploy(deployment_name) to trigger a Kubernetes rolling restart
REFRESH_DB_POOL — Calls self_healing_engine.force_heal("database") to refresh the SQLAlchemy connection pool
RESET_CIRCUIT_BREAKER — Iterates self_healing_engine.circuit_breakers and resets any open breakers to closed state with failure_count=0
CLEAR_CACHE — Triggers Python garbage collection (gc.collect()) and marks cache cleared
SCALE_UP — Defined in enum; reserved for future Kubernetes HPA scale-out integration
NOTIFY — Records an alert in the healing log for operator review (used for issues requiring manual resolution: image pull errors, SSL errors, unhandled exceptions)
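
Pattern matching with per-pattern cooldowns can be sketched as follows. Only three of the ten patterns are reproduced, matching is case-sensitive by assumption, and ErrorPattern and CooldownDispatcher are hypothetical names rather than the LogHealer's actual classes.

```python
import re
import time
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple


@dataclass(frozen=True)
class ErrorPattern:
    name: str
    regex: "re.Pattern[str]"
    action: str
    cooldown_seconds: int


# Subset of the pattern table above, for illustration only.
PATTERNS = [
    ErrorPattern("crash_loop",
                 re.compile(r"CrashLoopBackOff|Back-off restarting"),
                 "ROLLOUT_RESTART", 300),
    ErrorPattern("db_pool_exhausted",
                 re.compile(r"connection pool exhausted|QueuePool limit"),
                 "REFRESH_DB_POOL", 60),
    ErrorPattern("redis_connection_error",
                 re.compile(r"redis\.ConnectionError|redis\.TimeoutError"),
                 "RESET_CIRCUIT_BREAKER", 120),
]


class CooldownDispatcher:
    """Returns the actions to run for a log chunk, skipping patterns still cooling down."""

    def __init__(self, clock: Callable[[], float] = time.monotonic) -> None:
        self._clock = clock
        self._last_fired: Dict[str, float] = {}

    def scan(self, log_text: str) -> List[Tuple[str, str]]:
        now = self._clock()
        actions = []
        for pattern in PATTERNS:
            if not pattern.regex.search(log_text):
                continue
            last = self._last_fired.get(pattern.name)
            if last is not None and now - last < pattern.cooldown_seconds:
                continue  # matched, but healed too recently
            self._last_fired[pattern.name] = now
            actions.append((pattern.name, pattern.action))
        return actions
```

The cooldown is what keeps a persistent CrashLoopBackOff message from triggering a rollout restart on every 60-second pass.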

API Endpoints

Router file: platform/api/app/routers/integrity.py — Prefix: /api/v1/integrity

Integrity Monitor Endpoints

GET   /api/v1/integrity/status
  # Returns PlatformIntegrityMonitor state:
  # running (bool), last_scan (ISO string), total_scans (int),
  # pages_generated_total (int), pending_deploy (bool),
  # recent_scans (last 5), generated_pages (last 10)

POST  /api/v1/integrity/scan
  # Triggers a manual integrity scan immediately
  # Returns: { status: "completed", result: { timestamp, api_routes, frontend_pages,
  #             missing_pages, pages_generated, gaps[], generated[] } }

GET   /api/v1/integrity/gaps
  # Returns the current gap snapshot (live, not from scan history):
  # { api_routes[], frontend_pages[], gaps[{ api_prefix, frontend_path, page_name }],
  #   total_api_routes, total_frontend_pages, total_gaps }

Deploy Pipeline Endpoints

GET   /api/v1/integrity/deploy-status
  # AutoDeployPipeline state: is_deploying, total_deploys,
  # recent_deploys (last 5), last_deploy

POST  /api/v1/integrity/deploy
  # Marks pending deploy as complete (clears the pending_deploy flag)
  # Does NOT trigger a new deploy

POST  /api/v1/integrity/trigger-deploy
  # Force triggers rollout restart for threatops-frontend
  # Returns 409 if a deploy is already in progress
  # Returns: { status: "triggered", result: DeployRecord }

POST  /api/v1/integrity/trigger-deploy/{deployment_name}
  # Trigger rollout restart for a specific deployment by name
  # e.g. POST /api/v1/integrity/trigger-deploy/threatops-api

Pod Observability Endpoints

GET   /api/v1/integrity/pods/logs/{deployment_name}?tail=100
  # Stream recent pod logs for a named deployment
  # Returns: { deployment, pods[{ pod, status, logs }] }

GET   /api/v1/integrity/pods/events
  # List last 30 Kubernetes events in the threatops namespace
  # Returns: { namespace: "threatops", events[{ type, reason, message, object, timestamp, count }] }

Log Healer Endpoints

GET   /api/v1/integrity/healer/status
  # LogHealer state: running, last_scan, scan_interval_seconds,
  # total_errors_detected, total_heals_triggered, error_counts,
  # recent_heals (last 20), patterns_monitored, pattern_list

POST  /api/v1/integrity/healer/scan
  # On-demand log scan + heal pass
  # Returns: { status, timestamp, total_errors_detected,
  #             total_heals_triggered, recent_heals[] }

Combined Dashboard Endpoint

GET   /api/v1/integrity/dashboard
  # Single-call combined status for UI polling:
  # { integrity: {...}, deploy: {...}, healer: {...} }

Frontend Routes

Layer	Path	Description
Frontend	/platform-integrity	Next.js App Router page (src/app/platform-integrity/page.tsx)
API	/api/v1/integrity/status	Monitor status — fetched on mount and every 30s
API	/api/v1/integrity/gaps	Live gap list — fetched on mount and every 30s
API	/api/v1/integrity/deploy-status	Deploy pipeline status — fetched as optional supplement
API	/api/v1/integrity/scan	POST — triggered by "Run Scan Now" button
API	/api/v1/integrity/trigger-deploy	POST — triggered by "Deploy Now" button

Frontend Page File

File: platform/frontend/src/app/platform-integrity/page.tsx

The page is a React client component ('use client') that polls all three status endpoints in parallel via Promise.all every 30 seconds using setInterval in a useEffect. Two primary actions are available: runScan() (POST /scan then re-fetch all) and triggerDeploy() (POST /trigger-deploy then re-fetch all). Both actions use loading state flags (scanning, deploying) to disable buttons and show spinner icons during execution.

Data Model

All three services maintain in-memory state only — no dedicated database tables are used by the Platform Integrity Monitor. State is lost on API pod restart. The frontend TypeScript interfaces are defined in page.tsx:

IntegrityStatus

Field	Type	Description
`running`	boolean	Whether the 30-min scan loop is active
`last_scan`	string | null	ISO 8601 timestamp of the most recent scan
`total_scans`	number	Cumulative count of scans since startup
`pages_generated_total`	number	Total auto-generated pages since startup
`pending_deploy`	boolean	True when generated pages are awaiting a deploy
`recent_scans`	ScanResult[]	Last 5 scan result objects
`generated_pages`	GeneratedPage[]	Last 10 auto-generated page records

ScanResult

Field	Type	Description
`timestamp`	string	ISO 8601 scan start time
`api_routes`	number	Count of discovered API route prefixes
`frontend_pages`	number	Count of discovered Next.js pages
`missing_pages`	number	Count of gaps detected
`pages_generated`	number	Count of pages auto-generated in this scan
`gaps`	string[]	List of API route prefixes that had no frontend page
`generated`	string[]	List of file paths for pages generated in this scan

GapsResponse

Field	Type	Description
`api_routes`	string[]	Sorted list of all discovered API route prefixes
`frontend_pages`	string[]	Sorted list of all discovered frontend page routes
`gaps`	GapInfo[]	Current gap records with api_prefix, frontend_path, page_name
`total_api_routes`	number	Total API route count
`total_frontend_pages`	number	Total frontend page count
`total_gaps`	number	Current gap count

DeployRecord

Field	Type	Description
`started_at`	string	ISO 8601 deploy start time
`completed_at`	string | undefined	ISO 8601 completion time (absent if failed mid-way)
`steps`	DeployStep[]	Ordered list of deploy step results
`status`	string	'running' | 'success' | 'partial' | 'failed'
`error`	string | undefined	Top-level error message if deploy failed

DeployStep

Field	Type	Description
`description`	string	Human-readable step description (e.g. "check_deployment", "rollout_restart")
`exit_code`	number	0 for success, non-zero for failure
`stdout`	string | undefined	Step stdout output
`stderr`	string | undefined	Step stderr output
`error`	string | undefined	Error message if step failed
`timestamp`	string	ISO 8601 timestamp when step completed

UI Layout

Page Structure

The page uses a p-6 max-w-7xl mx-auto space-y-6 container and renders seven distinct sections:

Header — Title "Platform Integrity Monitor" and subtitle "Autonomous frontend scaffold & deploy pipeline — Scans every 30 minutes". Action buttons: "Run Scan Now" (orange, ScanSearch icon with spin while scanning) and "Deploy Now" (slate-800, Rocket icon with bounce while deploying).
Status Cards Row — 4-column grid: Monitor Status (green/slate Activity icon showing Running/Stopped with last scan time), Coverage (blue ArrowRightLeft icon showing frontend pages / API routes ratio), Active Gaps (amber/green AlertCircle icon with gap count), Pages Generated (orange FileCode2 icon with total count and scan count).
Pending Deploy Banner — Conditionally shown when status.pending_deploy is true. Amber background with Rocket icon and an inline "Deploy Now" button.
Gaps & Generated Pages (2-col grid) — Left panel lists current gaps with API prefix and expected frontend path. Right panel shows the generated pages log with route, source API route, and timestamp. Both panels show empty states with centered icons when no data exists.
Recent Scans Table — Shows last 5 scans (most recent first) with columns: Timestamp, API Routes count, Frontend Pages count, Gaps (amber/green badge), Generated (orange/slate badge).
Deploy Pipeline — Lists recent deployments in reverse chronological order. Each deploy shows status icon (CheckCircle2 / Loader2 / XCircle), status text with color coding, start timestamp, optional error message, and a per-step checklist with individual exit code status icons.
Info Footer — Orange-tinted description box explaining scan intervals (30 min), deploy check interval (2 min), and the end-to-end automated flow.

Prerequisites

Kubernetes In-Cluster Config — The API pod must run inside the threatops Kubernetes namespace with a ServiceAccount that has RBAC permissions to read/patch Deployments, read Pods, and list Events. The manifest platform/infrastructure/kubernetes/base/api-rbac.yaml defines these permissions.
kubernetes Python Package — kubernetes must be in platform/api/requirements.txt. Used by both auto_deploy.py and log_healer.py for all K8s API calls.
Filesystem Access — The API container must have read/write access to the frontend source tree at platform/frontend/src/app/ and read access to platform/api/app/routers/ for gap detection to function. In the deployed AKS setup, this is handled via a shared volume mount.
Frontend Factory Service — app/services/frontend_factory.py should be importable for the higher-quality page scaffolding path. If it is unavailable, the monitor falls back to a basic template writer.
Self-Healing Engine — app/services/self_healing.py must be available for the LogHealer's REFRESH_DB_POOL and RESET_CIRCUIT_BREAKER actions.
FastAPI Lifespan Handler — All three services must be started in the application lifespan (app/main.py). They are background async tasks and do not block startup.
Deployments Present — The threatops-frontend and threatops-api Kubernetes Deployments must exist in the threatops namespace for deploy and log operations to succeed.
No Database Required — All state is in-memory. The module operates independently of PostgreSQL and Redis.