Platform Integrity Monitor

Complete

Overview

The Platform Integrity Monitor is a self-healing, autonomous infrastructure subsystem built into the ThreatOps SOCaaS API. It addresses a fundamental operational challenge of a fast-moving, multi-module platform: keeping the frontend UI in sync with the backend API surface as new features are continuously deployed.

In practice this means: every time a new FastAPI router is added to the backend, there must be a corresponding Next.js page in the frontend. Without automation, this becomes a manual gap-tracking problem. The Platform Integrity Monitor solves it by continuously scanning both the API router directory and the Next.js app directory, detecting mismatches, auto-generating placeholder frontend pages, queuing Kubernetes rollout deploys, and healing runtime errors — all without human intervention.

The module is composed of three tightly integrated services: PlatformIntegrityMonitor (gap detection + page scaffolding), AutoDeployPipeline (Kubernetes rollout management), and LogHealer (error-pattern-driven auto-healing). All three run as background async loops inside the FastAPI process and expose their state through the /api/v1/integrity router.

What Was Proposed

What's Built

Feature                                                                    Status
30-minute integrity scan loop (API vs. frontend gap detection)             ✓ Complete
Filesystem-based API route discovery (scans routers/*.py filenames)        ✓ Complete
Next.js App Router page discovery (scans app/**/page.tsx)                  ✓ Complete
14-entry ROUTE_TO_PAGE_MAP for known route-to-page relationships           ✓ Complete
Frontend Factory Engine integration for page scaffolding                   ✓ Complete
Fallback basic template generator for direct file creation                 ✓ Complete
Kubernetes in-cluster rollout restart via AppsV1Api                        ✓ Complete
Rollout status polling with 120s timeout                                   ✓ Complete
Pod log streaming via CoreV1Api (per-deployment, configurable tail)        ✓ Complete
Kubernetes event listing for the threatops namespace                       ✓ Complete
Log-based auto-healer with 10 error patterns and 6 healing actions         ✓ Complete
60-second healer scan loop with per-pattern cooldown enforcement           ✓ Complete
Unified dashboard endpoint combining all three subsystem statuses          ✓ Complete
Frontend page with live 30s polling, Run Scan Now, and Deploy Now buttons  ✓ Complete
Scan history table (last 5 scans) and generated pages log (last 10)        ✓ Complete
Deploy pipeline view with per-step status icons (CheckCircle2 / XCircle)   ✓ Complete

Architecture

Service Entry Points

All three services are singletons instantiated at module import time and started by the FastAPI application lifespan handler:

import asyncio

from app.services.platform_integrity import integrity_monitor   # PlatformIntegrityMonitor
from app.services.auto_deploy import auto_deploy                 # AutoDeployPipeline
from app.services.log_healer import log_healer                   # LogHealer

# Started in app/main.py lifespan:
asyncio.create_task(integrity_monitor.start())  # 30-min scan loop
asyncio.create_task(auto_deploy.start())        # 2-min deploy watch loop
asyncio.create_task(log_healer.start())         # 60-sec healer loop
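
A hedged sketch of how a lifespan-style context manager might start the three loops and cancel them on shutdown. BackgroundService and the services parameter are hypothetical stand-ins: a real FastAPI lifespan handler receives only the app, and the actual service classes are not shown in this document.

```python
import asyncio
from contextlib import asynccontextmanager


class BackgroundService:
    """Hypothetical stand-in for a service that exposes an async start() loop."""

    def __init__(self, name: str, interval_seconds: float) -> None:
        self.name = name
        self.interval_seconds = interval_seconds
        self.running = False
        self.ticks = 0

    async def start(self) -> None:
        self.running = True
        try:
            while True:
                self.ticks += 1  # one scan/check pass would happen here
                await asyncio.sleep(self.interval_seconds)
        finally:
            self.running = False  # also runs when the task is cancelled


@asynccontextmanager
async def lifespan(app, services):
    """Start every service as a task, then cancel all of them on shutdown."""
    tasks = [asyncio.create_task(s.start(), name=s.name) for s in services]
    try:
        yield tasks
    finally:
        for task in tasks:
            task.cancel()
        # return_exceptions=True absorbs the CancelledError raised by each task.
        await asyncio.gather(*tasks, return_exceptions=True)


async def demo() -> list:
    services = [
        BackgroundService("integrity_monitor", 1800),  # 30-min scan loop
        BackgroundService("auto_deploy", 120),         # 2-min deploy watch loop
        BackgroundService("log_healer", 60),           # 60-sec healer loop
    ]
    async with lifespan(None, services):
        await asyncio.sleep(0)  # yield once so each loop runs its first pass
    return services
```

Keeping references to the created tasks, as the sketch does, lets the shutdown path cancel them explicitly rather than leaving them to be torn down with the event loop.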

Integrity Scan Flow

  1. Discover API Routes: Reads all *.py filenames from app/routers/ (excluding __init__.py and websocket.py) and also parses app/main.py for prefix="/api/v1/..." patterns. Returns a set of route prefix strings such as /incidents and /alerts.
  2. Discover Frontend Pages: Recursively finds all page.tsx files under src/app/ using rglob and converts each file path to a route string (e.g., src/app/incidents/page.tsx → /incidents). Skips dynamic routes, i.e. paths containing [.
  3. Find Gaps: Iterates over the 14-entry ROUTE_TO_PAGE_MAP. For each entry, checks that the API route exists in the discovered set and that the expected frontend page path is absent. Only entries that satisfy both conditions produce a gap record.
  4. Generate Missing Pages: For each gap, tries the Frontend Factory Engine first (submits a PageRequirement with type data_table and features: search, sort, pagination, auto-refresh). Falls back to writing a basic Next.js template directly to the filesystem if the Factory is unavailable.
  5. Flag for Deploy: If any pages were generated, sets integrity_monitor.pending_deploy = True. The AutoDeployPipeline checks this flag every 2 minutes and triggers a rollout restart automatically.
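
The two discovery steps can be sketched with pathlib and re. This is a minimal illustration, not the monitor's actual code: the function names, the underscore-to-hyphen conversion of router filenames, and the exact prefix regex are assumptions.

```python
import re
from pathlib import Path

EXCLUDED_ROUTERS = {"__init__.py", "websocket.py"}
# Assumed shape of the include_router(..., prefix="/api/v1/...") calls in main.py.
PREFIX_RE = re.compile(r'prefix="/api/v1(/[^"]+)"')


def discover_api_routes(routers_dir: Path, main_py: Path) -> set[str]:
    """Collect route prefixes from router filenames and from main.py."""
    routes = set()
    for router_file in routers_dir.glob("*.py"):
        if router_file.name in EXCLUDED_ROUTERS:
            continue
        # Assumption: detection_rules.py corresponds to /detection-rules.
        routes.add("/" + router_file.stem.replace("_", "-"))
    if main_py.exists():
        routes.update(PREFIX_RE.findall(main_py.read_text()))
    return routes


def discover_frontend_pages(app_dir: Path) -> set[str]:
    """Convert every page.tsx under app_dir into its route string."""
    pages = set()
    for page in app_dir.rglob("page.tsx"):
        rel = page.parent.relative_to(app_dir).as_posix()
        if "[" in rel:
            continue  # dynamic segment such as incidents/[id]
        pages.add("/" if rel == "." else "/" + rel)
    return pages
```

Returning sets keeps the later gap check a pair of plain membership tests.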

Auto-Deploy Pipeline

The AutoDeployPipeline uses the Kubernetes Python client (kubernetes package) loaded with in-cluster config (config.load_incluster_config()) and falls back to kubeconfig for local development. Deploys execute three sequential steps:

  1. Check Deployment — Calls AppsV1Api.read_namespaced_deployment(name, "threatops") to verify the deployment exists and retrieve current replica counts.
  2. Rollout Restart — Patches the deployment template annotation kubectl.kubernetes.io/restartedAt with the current UTC timestamp, triggering a rolling restart.
  3. Wait for Rollout — Polls ready_replicas and updated_replicas every 5 seconds until they equal spec.replicas, or times out after 120 seconds.

Default deployment target: threatops-frontend. Specific deployments can be targeted via POST /api/v1/integrity/trigger-deploy/{deployment_name}.
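
The restart and wait steps can be illustrated without a cluster. In this sketch, build_restart_patch, wait_for_rollout, and ReplicaCounts are hypothetical names; the Kubernetes read is injected as a callable so the polling logic can be exercised on its own. In a real pipeline the patch body would presumably be passed to AppsV1Api.patch_namespaced_deployment, and the callable would wrap read_namespaced_deployment (coercing unset replica fields, which the client reports as None, to 0).

```python
import time
from datetime import datetime, timezone
from typing import Callable, NamedTuple, Optional


class ReplicaCounts(NamedTuple):
    desired: int   # spec.replicas
    updated: int   # status.updated_replicas
    ready: int     # status.ready_replicas


def build_restart_patch(now: Optional[datetime] = None) -> dict:
    """Patch body that changes the pod template annotation, forcing a rollout."""
    stamp = (now or datetime.now(timezone.utc)).isoformat()
    return {
        "spec": {"template": {"metadata": {"annotations": {
            "kubectl.kubernetes.io/restartedAt": stamp,
        }}}}
    }


def wait_for_rollout(
    read_counts: Callable[[], ReplicaCounts],
    timeout_s: float = 120.0,
    poll_s: float = 5.0,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> bool:
    """Poll until updated and ready replicas match the desired count, or time out."""
    deadline = clock() + timeout_s
    while True:
        counts = read_counts()
        if counts.updated == counts.desired and counts.ready == counts.desired:
            return True
        if clock() >= deadline:
            return False
        sleep(poll_s)
```

Injecting sleep and clock is a testing convenience of the sketch; the defaults reproduce the 5-second poll and 120-second timeout described above.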

Three Subsystems

1. PlatformIntegrityMonitor

File: platform/api/app/services/platform_integrity.py

The core gap detection and scaffolding engine. Maintains in-memory state: scan_results (last 50 scans), generated_pages (cumulative log), pending_deploy flag, and last_scan timestamp. The 14-entry ROUTE_TO_PAGE_MAP defines the expected API-to-frontend page relationships:

ROUTE_TO_PAGE_MAP = {
    "/incidents":       "/incidents",
    "/alerts":          "/alerts",
    "/detection-rules": "/detection-rules",
    "/tenants":         "/tenants",
    "/reports":         "/reports",
    "/compliance":      "/compliance",
    "/playbooks":       "/playbooks",
    "/advisories":      "/threat-advisories",
    "/autonomous":      "/autonomous-soc",
    "/ml":              "/ml-models",
    "/platform":        "/platform-health",
    "/siem":            "/siem-status",
    "/audit":           "/audit-logs",
    "/blog":            "/blog",
}
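
Given the map, gap detection reduces to two membership tests per entry. The sketch below is an assumption-laden illustration: find_gaps and page_name_for are hypothetical names, and the way page_name is derived from the path is a guess. The record keys (api_prefix, frontend_path, page_name) follow the gaps endpoint response documented later.

```python
def page_name_for(frontend_path: str) -> str:
    """Hypothetical display name: '/threat-advisories' -> 'Threat Advisories'."""
    return frontend_path.strip("/").replace("-", " ").title()


def find_gaps(api_routes: set, frontend_pages: set, route_map: dict) -> list:
    """Return a gap record for each mapped API route whose page is missing."""
    gaps = []
    for api_prefix, frontend_path in route_map.items():
        route_exists = api_prefix in api_routes
        page_missing = frontend_path not in frontend_pages
        if route_exists and page_missing:
            gaps.append({
                "api_prefix": api_prefix,
                "frontend_path": frontend_path,
                "page_name": page_name_for(frontend_path),
            })
    return gaps
```

Because only mapped routes are checked, an API router with no entry in the map never produces a gap under this logic.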

2. AutoDeployPipeline

File: platform/api/app/services/auto_deploy.py

Manages Kubernetes-native deploys using the official kubernetes Python client. Maintains deploy_history (last 20 records) and an is_deploying lock to prevent concurrent deploys. Each DeployRecord contains a list of DeployStep objects with description, exit status, stdout, stderr, and timestamp. Supports both pod log streaming (by label selector) and namespace event listing.
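
One way to picture the bounded history and the is_deploying guard is sketched below. It is synchronous for brevity (the real service runs as an async loop), DeployTracker and DeployInProgress are hypothetical names, and the success/failed transitions are assumptions; the field names follow the DeployRecord and DeployStep tables in the Data Model section.

```python
from collections import deque
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Callable, Optional


def _now() -> str:
    return datetime.now(timezone.utc).isoformat()


@dataclass
class DeployStep:
    description: str
    exit_code: int
    stdout: Optional[str] = None
    stderr: Optional[str] = None
    timestamp: str = field(default_factory=_now)


@dataclass
class DeployRecord:
    started_at: str = field(default_factory=_now)
    completed_at: Optional[str] = None
    steps: list = field(default_factory=list)
    status: str = "running"
    error: Optional[str] = None


class DeployInProgress(RuntimeError):
    """Raised when a deploy is requested while another one is running."""


class DeployTracker:
    def __init__(self, max_history: int = 20) -> None:
        self.deploy_history = deque(maxlen=max_history)  # oldest records drop off
        self.is_deploying = False

    def run(self, steps: list) -> DeployRecord:
        """Run (description, action) pairs in order, stopping at the first failure."""
        if self.is_deploying:
            raise DeployInProgress("a deploy is already in progress")
        self.is_deploying = True
        record = DeployRecord()
        self.deploy_history.append(record)
        try:
            for description, action in steps:
                try:
                    record.steps.append(DeployStep(description, 0, stdout=action()))
                except Exception as exc:
                    record.steps.append(DeployStep(description, 1, stderr=str(exc)))
                    record.status, record.error = "failed", str(exc)
                    return record  # completed_at stays unset on failure
            record.status, record.completed_at = "success", _now()
            return record
        finally:
            self.is_deploying = False
```

An API layer could translate DeployInProgress into the 409 response described for the trigger-deploy endpoint.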

3. LogHealer

File: platform/api/app/services/log_healer.py

Autonomous error-pattern detector and healer. Scans Kubernetes events and pod logs from threatops-api and threatops-frontend every 60 seconds. Matches log text against 10 regex patterns and dispatches one of 6 healing actions with per-pattern cooldown enforcement:

Pattern Name            Severity  Regex Match                                                    Healing Action         Cooldown
crash_loop              critical  CrashLoopBackOff / Back-off restarting                         ROLLOUT_RESTART        300s
oom_killed              critical  OOMKilled / Out of memory / MemoryPressure                     ROLLOUT_RESTART        300s
image_pull_error        critical  ImagePullBackOff / ErrImagePull / Failed to pull image         NOTIFY                 600s
db_connection_refused   high      connection refused.*5432 / asyncpg.*ConnectionRefused          REFRESH_DB_POOL        60s
db_pool_exhausted       high      connection pool exhausted / QueuePool limit                    REFRESH_DB_POOL        60s
readiness_probe_failed  high      Readiness probe failed / Liveness probe failed                 ROLLOUT_RESTART        180s
ssl_certificate_error   high      SSL.*certificate.*expired / TLS handshake.*failed              NOTIFY                 600s
disk_pressure           high      DiskPressure / no space left on device                         CLEAR_CACHE            300s
redis_connection_error  medium    redis.*ConnectionError / redis.*TimeoutError                   RESET_CIRCUIT_BREAKER  120s
unhandled_exception     medium    Traceback (most recent call last) / 500 Internal Server Error  NOTIFY                 60s
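
The matching and cooldown behaviour can be sketched as follows, using three rows from the table. This is illustrative only: ErrorPattern and CooldownHealer are hypothetical names, the " / " in the Regex Match column is read as regex alternation, and case-sensitive matching is an assumption.

```python
import re
import time
from dataclasses import dataclass


@dataclass(frozen=True)
class ErrorPattern:
    name: str
    severity: str
    regex: re.Pattern
    action: str
    cooldown_s: int


PATTERNS = [
    ErrorPattern("crash_loop", "critical",
                 re.compile(r"CrashLoopBackOff|Back-off restarting"),
                 "ROLLOUT_RESTART", 300),
    ErrorPattern("db_pool_exhausted", "high",
                 re.compile(r"connection pool exhausted|QueuePool limit"),
                 "REFRESH_DB_POOL", 60),
    ErrorPattern("redis_connection_error", "medium",
                 re.compile(r"redis.*ConnectionError|redis.*TimeoutError"),
                 "RESET_CIRCUIT_BREAKER", 120),
]


class CooldownHealer:
    def __init__(self, patterns, clock=time.monotonic) -> None:
        self.patterns = patterns
        self.clock = clock
        self.error_counts = {}   # pattern name -> times detected
        self._last_heal = {}     # pattern name -> clock value of last heal

    def scan(self, log_text: str) -> list:
        """Return the healing actions to dispatch for this chunk of log text."""
        actions = []
        now = self.clock()
        for pattern in self.patterns:
            if not pattern.regex.search(log_text):
                continue
            self.error_counts[pattern.name] = self.error_counts.get(pattern.name, 0) + 1
            last = self._last_heal.get(pattern.name)
            if last is not None and now - last < pattern.cooldown_s:
                continue  # still cooling down: count the error, skip the heal
            self._last_heal[pattern.name] = now
            actions.append(pattern.action)
        return actions
```

Tracking the cooldown per pattern name means a suppressed crash_loop heal does not delay a db_pool_exhausted heal in the same scan.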

The six healing action types and their implementations:

API Endpoints

Router file: platform/api/app/routers/integrity.py — Prefix: /api/v1/integrity

Integrity Monitor Endpoints

GET   /api/v1/integrity/status
  # Returns PlatformIntegrityMonitor state:
  # running (bool), last_scan (ISO string), total_scans (int),
  # pages_generated_total (int), pending_deploy (bool),
  # recent_scans (last 5), generated_pages (last 10)

POST  /api/v1/integrity/scan
  # Triggers a manual integrity scan immediately
  # Returns: { status: "completed", result: { timestamp, api_routes, frontend_pages,
  #             missing_pages, pages_generated, gaps[], generated[] } }

GET   /api/v1/integrity/gaps
  # Returns the current gap snapshot (live, not from scan history):
  # { api_routes[], frontend_pages[], gaps[{ api_prefix, frontend_path, page_name }],
  #   total_api_routes, total_frontend_pages, total_gaps }

Deploy Pipeline Endpoints

GET   /api/v1/integrity/deploy-status
  # AutoDeployPipeline state: is_deploying, total_deploys,
  # recent_deploys (last 5), last_deploy

POST  /api/v1/integrity/deploy
  # Marks pending deploy as complete (clears the pending_deploy flag)
  # Does NOT trigger a new deploy

POST  /api/v1/integrity/trigger-deploy
  # Force triggers rollout restart for threatops-frontend
  # Returns 409 if a deploy is already in progress
  # Returns: { status: "triggered", result: DeployRecord }

POST  /api/v1/integrity/trigger-deploy/{deployment_name}
  # Trigger rollout restart for a specific deployment by name
  # e.g. POST /api/v1/integrity/trigger-deploy/threatops-api

Pod Observability Endpoints

GET   /api/v1/integrity/pods/logs/{deployment_name}?tail=100
  # Stream recent pod logs for a named deployment
  # Returns: { deployment, pods[{ pod, status, logs }] }

GET   /api/v1/integrity/pods/events
  # List last 30 Kubernetes events in the threatops namespace
  # Returns: { namespace: "threatops", events[{ type, reason, message, object, timestamp, count }] }

Log Healer Endpoints

GET   /api/v1/integrity/healer/status
  # LogHealer state: running, last_scan, scan_interval_seconds,
  # total_errors_detected, total_heals_triggered, error_counts,
  # recent_heals (last 20), patterns_monitored, pattern_list

POST  /api/v1/integrity/healer/scan
  # On-demand log scan + heal pass
  # Returns: { status, timestamp, total_errors_detected,
  #             total_heals_triggered, recent_heals[] }

Combined Dashboard Endpoint

GET   /api/v1/integrity/dashboard
  # Single-call combined status for UI polling:
  # { integrity: {...}, deploy: {...}, healer: {...} }
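
A minimal sketch of how a combined payload like this might be assembled from the three services. build_dashboard is a hypothetical helper, and isolating a failing subsystem into an error entry is a design assumption of the sketch, not documented behaviour.

```python
from typing import Callable, Mapping


def build_dashboard(sources: Mapping[str, Callable[[], dict]]) -> dict:
    """Call each status getter and collect the results under its key."""
    payload = {}
    for key, get_status in sources.items():
        try:
            payload[key] = get_status()
        except Exception as exc:
            # Assumption: one failing subsystem should not blank the whole dashboard.
            payload[key] = {"error": str(exc)}
    return payload
```

A caller would pass one getter per subsystem, keyed as integrity, deploy, and healer, to match the response shape shown above.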

Frontend Routes

Path                              Description
/platform-integrity               Next.js App Router page (src/app/platform-integrity/page.tsx)
/api/v1/integrity/status          Monitor status, fetched on mount and every 30s
/api/v1/integrity/gaps            Live gap list, fetched on mount and every 30s
/api/v1/integrity/deploy-status   Deploy pipeline status, fetched as optional supplement
/api/v1/integrity/scan            POST, triggered by "Run Scan Now" button
/api/v1/integrity/trigger-deploy  POST, triggered by "Deploy Now" button

Frontend Page File

File: platform/frontend/src/app/platform-integrity/page.tsx

The page is a React client component ('use client') that polls all three status endpoints in parallel via Promise.all every 30 seconds using setInterval in a useEffect. Two primary actions are available: runScan() (POST /scan then re-fetch all) and triggerDeploy() (POST /trigger-deploy then re-fetch all). Both actions use loading state flags (scanning, deploying) to disable buttons and show spinner icons during execution.

Data Model

All three services maintain in-memory state only — no dedicated database tables are used by the Platform Integrity Monitor. State is lost on API pod restart. The frontend TypeScript interfaces are defined in page.tsx:

IntegrityStatus

Field                  Type             Description
running                boolean          Whether the 30-min scan loop is active
last_scan              string | null    ISO 8601 timestamp of the most recent scan
total_scans            number           Cumulative count of scans since startup
pages_generated_total  number           Total auto-generated pages since startup
pending_deploy         boolean          True when generated pages are awaiting a deploy
recent_scans           ScanResult[]     Last 5 scan result objects
generated_pages        GeneratedPage[]  Last 10 auto-generated page records

ScanResult

Field            Type      Description
timestamp        string    ISO 8601 scan start time
api_routes       number    Count of discovered API route prefixes
frontend_pages   number    Count of discovered Next.js pages
missing_pages    number    Count of gaps detected
pages_generated  number    Count of pages auto-generated in this scan
gaps             string[]  List of API route prefixes that had no frontend page
generated        string[]  List of file paths for pages generated in this scan

GapsResponse

Field                 Type       Description
api_routes            string[]   Sorted list of all discovered API route prefixes
frontend_pages        string[]   Sorted list of all discovered frontend page routes
gaps                  GapInfo[]  Current gap records with api_prefix, frontend_path, page_name
total_api_routes      number     Total API route count
total_frontend_pages  number     Total frontend page count
total_gaps            number     Current gap count

DeployRecord

Field         Type                Description
started_at    string              ISO 8601 deploy start time
completed_at  string | undefined  ISO 8601 completion time (absent if failed mid-way)
steps         DeployStep[]        Ordered list of deploy step results
status        string              'running' | 'success' | 'partial' | 'failed'
error         string | undefined  Top-level error message if deploy failed

DeployStep

Field        Type                Description
description  string              Human-readable step description (e.g. "check_deployment", "rollout_restart")
exit_code    number              0 for success, non-zero for failure
stdout       string | undefined  Step stdout output
stderr       string | undefined  Step stderr output
error        string | undefined  Error message if step failed
timestamp    string              ISO 8601 timestamp when step completed

UI Layout

Page Structure

The page uses a p-6 max-w-7xl mx-auto space-y-6 container and renders seven distinct sections:

  1. Header — Title "Platform Integrity Monitor" and subtitle "Autonomous frontend scaffold & deploy pipeline — Scans every 30 minutes". Action buttons: "Run Scan Now" (orange, ScanSearch icon with spin while scanning) and "Deploy Now" (slate-800, Rocket icon with bounce while deploying).
  2. Status Cards Row — 4-column grid: Monitor Status (green/slate Activity icon showing Running/Stopped with last scan time), Coverage (blue ArrowRightLeft icon showing frontend pages / API routes ratio), Active Gaps (amber/green AlertCircle icon with gap count), Pages Generated (orange FileCode2 icon with total count and scan count).
  3. Pending Deploy Banner — Conditionally shown when status.pending_deploy is true. Amber background with Rocket icon and an inline "Deploy Now" button.
  4. Gaps & Generated Pages (2-col grid) — Left panel lists current gaps with API prefix and expected frontend path. Right panel shows the generated pages log with route, source API route, and timestamp. Both panels show empty states with centered icons when no data exists.
  5. Recent Scans Table — Shows last 5 scans (most recent first) with columns: Timestamp, API Routes count, Frontend Pages count, Gaps (amber/green badge), Generated (orange/slate badge).
  6. Deploy Pipeline — Lists recent deployments in reverse chronological order. Each deploy shows status icon (CheckCircle2 / Loader2 / XCircle), status text with color coding, start timestamp, optional error message, and a per-step checklist with individual exit code status icons.
  7. Info Footer — Orange-tinted description box explaining scan intervals (30 min), deploy check interval (5 min), and the end-to-end automated flow.

Prerequisites