Security Analytics

Status: Complete

Overview

The Security Analytics module exposes the internals of the ThreatOps classical machine learning pipeline to operators and analysts. Rather than treating AI triage as a black box, this module provides a live window into model health, training progress, version history, feature importance rankings, and the ensemble scoring strategy that drives every alert disposition decision on the platform.

The frontend page at /analytics polls the ML backend every 30 seconds and renders the state of three sklearn models: an Alert Classifier, an Anomaly Detector, and a Threat Scorer. Alongside the model cards it renders a training buffer progress bar, the 70/30 ensemble blend visualization, a ranked bar chart of the top 17 RandomForest feature importances, and a summary of the four possible disposition outcomes and their confidence thresholds. The backend is served entirely through the /api/v1/ml router, which also handles real-time predictions, analyst feedback recording, forced retraining, and model version history retrieval for rollback workflows.

| What Was Proposed | What's Built |
|---|---|
| Alert Classifier (RandomForest, sklearn) | ✓ Complete |
| Anomaly Detector (IsolationForest, sklearn) | ✓ Complete |
| Threat Scorer (GradientBoosting, sklearn) | ✓ Complete |
| 70/30 ensemble blend (ML + rule-based) | ✓ Complete |
| Auto-retrain at 100 analyst feedback samples | ✓ Complete |
| Analyst feedback recording endpoint with disposition validation | ✓ Complete |
| Feature importance extraction (RandomForest) | ✓ Complete |
| Model version history (Azure Blob persistent storage) | ✓ Complete |
| Real-time prediction endpoint with sample test | ✓ Complete |
| ML health check with latency percentiles | ✓ Complete |
| Performance log with historical retrain records | ✓ Complete |
| Frontend analytics page (model cards, training buffer, ensemble viz, feature chart) | ✓ Complete |
| Four disposition outcomes with threshold documentation | ✓ Complete |

Architecture

ML Router

File: platform/api/app/routers/ml.py — Prefix: /api/v1/ml

A thin FastAPI router that delegates all logic to the ml_pipeline singleton imported from app/ml/training_pipeline.py. The router is stateless — it does not hold any model state itself. The pipeline singleton manages model loading from Azure Blob storage at startup, maintains the training feedback buffer in memory, and schedules retraining when the buffer crosses the configured threshold (default: 100 samples).

ML Training Pipeline (app/ml/training_pipeline.py)

The singleton ml_pipeline encapsulates three sklearn model wrappers: the Alert Classifier (RandomForest), the Anomaly Detector (IsolationForest), and the Threat Scorer (GradientBoosting).

The pipeline exposes predict_ensemble(alert: dict), which runs all three models and blends results using the 70/30 formula: confidence = 0.7 * ml_score + 0.3 * rule_score. It also manages record_feedback(), which appends analyst-labeled samples to the training buffer and triggers trigger_retrain() when the buffer reaches threshold.
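A minimal sketch of this orchestration, with the model internals stubbed out. Only the 70/30 weights and the 100-sample threshold come from the source; the class shape and field names are illustrative:

```python
# Sketch of predict_ensemble() blending and the feedback-buffer retrain
# trigger. Real model inference is replaced by stub scores.
ML_WEIGHT, RULE_WEIGHT = 0.7, 0.3
RETRAIN_THRESHOLD = 100


class EnsemblePipeline:
    def __init__(self):
        self.training_buffer: list[dict] = []
        self.retrain_count = 0

    def predict_ensemble(self, alert: dict) -> dict:
        # Stub scores; the real pipeline runs all three sklearn models.
        ml_score = alert.get("ml_score", 0.5)
        rule_score = alert.get("rule_confidence", 0.5)
        blended = ML_WEIGHT * ml_score + RULE_WEIGHT * rule_score
        return {"blended_confidence": round(blended, 3)}

    def record_feedback(self, alert: dict, disposition: str) -> None:
        # Analyst-labeled samples accumulate until the threshold is hit.
        self.training_buffer.append({"alert": alert, "disposition": disposition})
        if len(self.training_buffer) >= RETRAIN_THRESHOLD:
            self.trigger_retrain()

    def trigger_retrain(self) -> None:
        # The real implementation refits all three models on the buffer.
        self.retrain_count += 1
        self.training_buffer.clear()
```

With ml_score = 0.68 and rule_score = 0.72 this yields the blended confidence of 0.692 shown in the PredictResponse example below.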

Model Store (app/ml/model_store.py)

Handles persistence of trained model artifacts to and from Azure Blob storage (container: ml-models on storage account stroconmlmodels). Each model version is saved with a timestamp and version number. The store provides get_all_model_versions() and get_version_history(model_name) for the rollback UI.
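The version bookkeeping can be illustrated with an in-memory sketch; the actual Azure Blob upload/download is elided, and the method names here are assumptions, not the real model_store API:

```python
# In-memory sketch of the version-history bookkeeping that model_store
# layers over Azure Blob storage. Artifact bytes are accepted but not
# uploaded anywhere in this illustration.
from datetime import datetime, timezone


class ModelStore:
    def __init__(self):
        self._versions: dict[str, list[dict]] = {}

    def save_version(self, model_name: str, version: str, artifact: bytes) -> dict:
        record = {
            "version": version,
            "saved_at": datetime.now(timezone.utc).isoformat(),
            # Mirrors the blob layout shown in the versions response below.
            "blob_path": f"ml-models/{model_name}/v{version}.pkl",
        }
        self._versions.setdefault(model_name, []).append(record)
        return record

    def get_version_history(self, model_name: str) -> dict:
        history = self._versions.get(model_name, [])
        return {
            "model_name": model_name,
            "latest": history[-1] if history else None,
            "history": history,
            "total_versions": len(history),
        }
```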

ML Models

Alert Classifier (RandomForest)

Supervised multi-class classifier that categorizes incoming alerts by threat category. Trained on labeled analyst feedback (true_positive, false_positive, suspicious, benign). Exposes an accuracy metric and per-feature importances. Version tracked as stats.classifier.version.
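A self-contained example of how a RandomForest's importances can be extracted and ranked for the bar chart; the feature names and training data here are synthetic, not the platform's real feature set:

```python
# Train a small RandomForest on synthetic labeled feedback and extract
# feature importances sorted descending, as the /feature-importances
# endpoint does. Data and feature names are illustrative.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

FEATURES = ["severity_encoded", "rule_confidence", "correlation_count"]
rng = np.random.default_rng(42)

# Synthetic labels: high severity + rule confidence -> true_positive.
X = rng.random((200, len(FEATURES)))
y = np.where(X[:, 0] + X[:, 1] > 1.0, "true_positive", "false_positive")

clf = RandomForestClassifier(n_estimators=50, random_state=42).fit(X, y)

# Importances always sum to 1.0; sort descending for the ranked chart.
importances = dict(sorted(
    zip(FEATURES, clf.feature_importances_),
    key=lambda kv: kv[1], reverse=True,
))
```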

Anomaly Detector (IsolationForest)

Unsupervised anomaly detection with contamination=0.10, meaning 10% of training samples are expected to be anomalous. Does not require labeled data. Displayed as a static 10% contamination rate in the UI since it has no accuracy metric in the traditional sense. Version tracked as stats.anomaly_detector.version.
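A small sketch of the detector configuration on synthetic data (the feature vectors are illustrative, only contamination=0.10 comes from the source):

```python
# Fit an IsolationForest with the documented 10% contamination rate on a
# tight cluster of "routine" feature vectors, then flag an obvious outlier.
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)
normal = rng.normal(0.0, 0.1, size=(200, 2))  # routine alert features
outlier = np.array([[5.0, 5.0]])              # clearly anomalous vector

det = IsolationForest(contamination=0.10, random_state=0).fit(normal)

# predict() returns -1 for anomalies and 1 for inliers.
flag = det.predict(outlier)[0]
```

Because contamination sets the decision threshold, roughly 10% of the training data itself ends up flagged, which is why the UI shows a static 10% rate rather than an accuracy figure.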

Threat Scorer (GradientBoosting)

Produces a continuous risk score from 0 to 100 for each alert. Built with 100 estimators. The score feeds directly into the ensemble blending formula. Higher scores push alerts toward critical_immediate disposition. Version tracked as stats.threat_scorer.version.
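A sketch of a 100-estimator scorer clamped to the 0-100 scale; the training data and the clamping helper are illustrative assumptions, not the platform's actual implementation:

```python
# Train a GradientBoostingRegressor (100 estimators, per the docs) on
# synthetic risk labels and clamp predictions to the 0-100 risk scale.
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

rng = np.random.default_rng(1)
X = rng.random((300, 3))
# Synthetic ground truth: risk grows with the first two features.
y = np.clip(60 * X[:, 0] + 40 * X[:, 1] + rng.normal(0, 2, 300), 0, 100)

scorer = GradientBoostingRegressor(n_estimators=100, random_state=1).fit(X, y)


def risk_score(features: np.ndarray) -> float:
    # Clamp so downstream disposition logic always sees a 0-100 score.
    return float(np.clip(scorer.predict(features.reshape(1, -1))[0], 0.0, 100.0))
```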

API Endpoints

ML Router — /api/v1/ml

| Method | Path | Description |
|---|---|---|
| GET | /api/v1/ml/stats | Current model versions, accuracy, training buffer size, last retrain timestamp, per-class distribution, and data drift indicators |
| GET | /api/v1/ml/health | ML pipeline health: model load status per model, prediction latency percentiles (p50, p95, min, max, mean), buffer size, last prediction timestamp, scheduler status, storage backend info |
| POST | /api/v1/ml/predict | Run full ensemble prediction on a provided alert dict; returns classification, anomaly flag, risk score, blended confidence, and disposition recommendation |
| GET | /api/v1/ml/predict/sample | Run prediction on a hardcoded sample alert (PowerShell execution on DESKTOP-ADMIN01) for integration testing; returns result + sample_alert |
| POST | /api/v1/ml/feedback | Record analyst feedback for training; accepts alert dict, disposition (true_positive \| false_positive \| suspicious \| benign), and optional risk_score override; triggers retrain if buffer threshold reached |
| POST | /api/v1/ml/retrain | Force immediate retrain of all models with accumulated buffer data (admin use; bypasses threshold check) |
| GET | /api/v1/ml/performance | Historical model performance log plus current stats snapshot; shows accuracy trends across retrain events |
| GET | /api/v1/ml/feature-importances | RandomForest feature importance dict sorted descending; used by the analytics page for the ranked bar chart visualization |
| GET | /api/v1/ml/versions | Version history for all three models from Azure Blob model_store |
| GET | /api/v1/ml/versions/{model_name} | Version history for a specific model (alert_classifier \| anomaly_detector \| threat_scorer); returns latest version and full history list |

Frontend Route

| Route | File | Description |
|---|---|---|
| /analytics | src/app/analytics/page.tsx | Security Analytics page — model cards, training buffer, ensemble strategy, feature importances bar chart, disposition breakdown |

The page is a Next.js App Router client component ("use client"). On mount it fires two requests in parallel via Promise.allSettled: GET /api/v1/ml/stats and GET /api/v1/ml/feature-importances. Either can fail independently without breaking the page — the component renders whatever data arrives. Auto-refreshes every 30 seconds via setInterval. API calls use the shared api client from @/lib/api-client.

Ensemble Strategy

The ensemble combines outputs from all three models into a single blended confidence score that drives the final alert disposition. The formula is applied in ml_pipeline.predict_ensemble():

blended_confidence = (0.7 × ml_score) + (0.3 × rule_score)

The 70% ML weight reflects confidence in the trained models when sufficient labeled data is available. The 30% rule-based weight preserves expert-authored detection logic as a safety net, particularly for novel attack patterns that the models have not yet seen in training data. The split is visualized in the frontend as a two-segment progress bar (cyan for ML, emerald for rule-based).

Score Sources

| Component | Weight | Source |
|---|---|---|
| ML Score | 70% | Blended output of AlertClassifier + AnomalyDetector + ThreatScorer predictions |
| Rule Score | 30% | Rule-based confidence from RiskScoringModel in the 6-stage triage pipeline |

The 6-stage triage pipeline (entity extraction, threat intel enrichment, UEBA, correlation, rule scoring, ML ensemble) feeds both the rule_score and the raw alert features into the ML models. The full pipeline is documented in the Alerts module.

Disposition Logic

The final blended confidence score maps to one of four disposition labels. These labels drive what the Autonomous SOC Engine does with the alert — auto-close it, notify an analyst, open a SOAR playbook, or page the incident response team.

benign_auto_resolved

Confidence > 85% that alert is benign. Automatically closed without analyst review. Feeds a "benign" label into the ML training buffer for future retraining.

requires_investigation

Confidence between 50% and 85%. Insufficient certainty for automated closure. Alert queued for analyst review. The ML models will receive the analyst's final verdict as feedback.

suspicious_escalate

Confidence below 50%, or a high anomaly score. Alert escalated to the incidents queue. A SOAR playbook may be triggered depending on severity. Analyst must investigate and close.

critical_immediate

High risk score (>80 on 0–100 scale) and high severity. Immediate escalation. Incident created automatically. On-call analyst paged. SOAR playbook triggered without waiting for analyst action.

The thresholds are: >85% blended confidence → auto-resolve; 50–85% → investigate; <50% → escalate; high threat score + high severity → critical immediate. These thresholds are implemented in TriageService in app/services/triage_service.py.
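Transcribed into code, the mapping might look like the sketch below. The exact TriageService implementation may differ; in particular, which severities count as "high severity" for the critical path is an assumption here:

```python
# Direct transcription of the documented thresholds into a mapping function.
# Signature and severity set are illustrative, not TriageService's real API.
def disposition(blended_confidence: float, risk_score: float, severity: str) -> str:
    # Critical path first: high threat score (>80) AND high severity.
    if risk_score > 80 and severity in ("high", "critical"):
        return "critical_immediate"
    if blended_confidence > 0.85:       # >85% -> auto-resolve
        return "benign_auto_resolved"
    if blended_confidence >= 0.50:      # 50-85% -> analyst review
        return "requires_investigation"
    return "suspicious_escalate"        # <50% -> escalate
```

Applied to the PredictResponse example below (blended confidence 0.692, risk score 74.2, high severity), this yields requires_investigation.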

Data Models

The analytics module does not define its own SQLAlchemy database models. All data is maintained in-memory by the ml_pipeline singleton, with model artifacts persisted to Azure Blob via model_store. The structures below document the API response shapes consumed by the frontend.

MLStats Response (GET /api/v1/ml/stats)

{
  "classifier": {
    "version": "1.3.0",
    "accuracy": 0.923,
    "samples_trained": 1450
  },
  "anomaly_detector": {
    "version": "1.1.0",
    "accuracy": null,              // unsupervised — no accuracy metric
    "samples_trained": null
  },
  "threat_scorer": {
    "version": "1.2.0",
    "accuracy": 0.891,
    "samples_trained": 1450
  },
  "training_buffer": 34,           // samples accumulated since last retrain
  "retrain_threshold": 100,        // samples needed to trigger auto-retrain
  "total_retrains": 14,
  "last_retrain": "2026-02-28T18:42:00",
  "feature_importances": {
    "severity_encoded": 0.187,
    "rule_confidence": 0.142,
    "correlation_count": 0.118,
    // ... up to 17 features
  }
}

FeedbackRequest (POST /api/v1/ml/feedback)

{
  "alert": {
    "title": "Lateral movement detected on DC01",
    "severity": "high",
    "source_siem": "sentinel",
    "mitre_tactic": "lateral_movement",
    "rule_confidence": 0.72,
    "correlation_count": 5
  },
  "disposition": "true_positive",   // true_positive | false_positive | suspicious | benign
  "risk_score": 87.5                // optional analyst-provided override
}

PredictResponse (POST /api/v1/ml/predict)

{
  "classification": "suspicious",
  "anomaly_score": -0.31,           // IsolationForest raw score (lower = more anomalous)
  "risk_score": 74.2,               // ThreatScorer output (0–100)
  "ml_confidence": 0.68,
  "rule_score": 0.72,
  "blended_confidence": 0.692,      // 0.7 * ml_confidence + 0.3 * rule_score
  "disposition": "requires_investigation",
  "reasoning": "High rule confidence, moderate ML confidence. Correlation count of 5 suggests related activity."
}

FeatureImportances Response (GET /api/v1/ml/feature-importances)

{
  "feature_importances": {
    "severity_encoded": 0.187,
    "rule_confidence": 0.142,
    "correlation_count": 0.118,
    "hour_of_day": 0.094,
    "source_siem_encoded": 0.082,
    "mitre_tactic_encoded": 0.079,
    "ioc_count": 0.071,
    "asset_count": 0.063,
    // ... additional features
  }
}

ModelVersionHistory (GET /api/v1/ml/versions/{model_name})

{
  "model_name": "alert_classifier",
  "latest": {
    "version": "1.3.0",
    "saved_at": "2026-02-28T18:42:00",
    "accuracy": 0.923,
    "samples": 1450,
    "blob_path": "ml-models/alert_classifier/v1.3.0.pkl"
  },
  "history": [ /* array of version records */ ],
  "total_versions": 14
}

Prerequisites

UI Layout

Analytics Page — /analytics

The page uses a white/slate design with colored badges and gradient bars. No chart library dependency — all visualizations are built with plain CSS div elements and inline width styles driven by the data values.