ML Models & Training Pipeline

Three-model sklearn ensemble for alert classification, anomaly detection, and threat scoring with continuous learning from analyst feedback.

Status: Built · ML Models: 3 · Pipeline Files: 5 · Frontend: Live

Overview

The ML Pipeline provides the AI scoring backbone for ThreatOps. It uses three classical ML models from scikit-learn, bootstrapped with synthetic training data and continuously improved through analyst feedback. The ensemble combines a Random Forest classifier (alert disposition), an Isolation Forest anomaly detector (anomaly flagging), and a Gradient Boosting threat scorer (0-100 risk score). Models are persisted via Azure Blob Storage with a local fallback. A background scheduler checks hourly and triggers auto-retraining when 50 or more new samples have accumulated.

What Was Proposed

What's Built

| Feature | Status | Details |
|---|---|---|
| AlertClassifier | Complete | Random Forest, 4 classes (true_positive, false_positive, suspicious, benign), bootstrapped with 2000 synthetic samples |
| AnomalyDetector | Complete | Isolation Forest, contamination-based anomaly flagging |
| ThreatScorer | Complete | Gradient Boosting Regressor, outputs a 0-100 risk score |
| Feature Engineering | Complete | AlertFeatureExtractor: severity mapping, SIEM source encoding, MITRE tactic mapping, temporal features, IOC/asset counts |
| Training Pipeline | Complete | Continuous learning: collect feedback, accumulate samples, auto-retrain at 100 samples (feedback-triggered) or 50 (scheduled) |
| Model Store | Complete | Azure Blob Storage (account stroconmlmodels, container ml-models) with local /tmp fallback; version tracking per model |
| Ensemble Prediction | Complete | Combined prediction: disposition + confidence + risk_score + is_anomaly + anomaly_score |
| Health Monitoring | Complete | Model load status, prediction latency percentiles (p50, p95), training buffer size, scheduler status |
| Feature Importances | Complete | Random Forest feature importances exposed for interpretability |
| Version History | Complete | Per-model version history from ModelStore for rollback UI |
| Background Scheduler | Complete | Hourly check, auto-retrain at 50+ samples since last retrain |
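The health endpoint's latency percentiles can be computed directly with NumPy. A toy illustration (the latency samples below are fabricated):

```python
import numpy as np

# Fabricated per-prediction latencies in milliseconds
latencies_ms = np.array([4.2, 5.1, 4.8, 12.0, 5.5, 6.1, 4.9, 30.2, 5.0, 5.3])

# p50 (median) and p95, as reported by the health endpoint
p50, p95 = np.percentile(latencies_ms, [50, 95])
```

The p95 figure is the more useful alerting signal here, since a handful of slow predictions barely moves the median.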

Architecture

ML Pipeline Architecture
platform/api/app/ml/
  __init__.py
  feature_engineering.py   -- AlertFeatureExtractor (raw alert -> feature vector)
  models.py                -- AlertClassifier (RF), AnomalyDetector (IF), ThreatScorer (GB)
  training_pipeline.py     -- TrainingPipeline (feedback collection, retrain, scheduler)
  model_store.py           -- ModelStore (Azure Blob Storage + local fallback)
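To make the feature-engineering step concrete, here is a minimal sketch in the spirit of feature_engineering.py. All field names, mappings, and the vector layout are illustrative assumptions, not the actual implementation:

```python
import numpy as np

# Assumed severity ordinals and SIEM source list -- illustrative only
SEVERITY_MAP = {"low": 1, "medium": 2, "high": 3, "critical": 4}
SIEM_SOURCES = ["sentinel", "splunk", "qradar"]

def extract_features(alert: dict) -> np.ndarray:
    """Turn a raw alert dict into a fixed-length numeric feature vector."""
    severity = SEVERITY_MAP.get(alert.get("severity", "low"), 1)
    # One-hot encode the SIEM source
    source = [1.0 if alert.get("source") == s else 0.0 for s in SIEM_SOURCES]
    # Temporal features: cyclical encoding of hour of day
    hour = alert.get("hour_of_day", 0)
    hour_sin = np.sin(2 * np.pi * hour / 24)
    hour_cos = np.cos(2 * np.pi * hour / 24)
    # Simple count features
    ioc_count = len(alert.get("iocs", []))
    asset_count = len(alert.get("assets", []))
    return np.array([severity, *source, hour_sin, hour_cos, ioc_count, asset_count])

vec = extract_features({"severity": "high", "source": "splunk", "hour_of_day": 3,
                        "iocs": ["1.2.3.4"], "assets": ["host-01"]})
```

The cyclical hour encoding keeps 23:00 and 01:00 numerically close, which matters for temporal anomaly signals.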

Prediction Flow:
  Raw Alert -> AlertFeatureExtractor.extract() -> feature_vector (numpy array)
                |
                +-> AlertClassifier.predict()    -> disposition, confidence
                +-> AnomalyDetector.predict()    -> is_anomaly, anomaly_score
                +-> ThreatScorer.predict()       -> risk_score (0-100)
                |
                v
           Ensemble Result: { disposition, confidence, risk_score, is_anomaly, anomaly_score }
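The fan-out above can be sketched as follows. The model types and output fields match the doc, but the training data here is random and the wiring is an assumption, not the real code:

```python
import numpy as np
from sklearn.ensemble import (RandomForestClassifier, IsolationForest,
                              GradientBoostingRegressor)

# Stand-in training data (random) -- the real models are bootstrapped
# with synthetic alerts and refined with analyst feedback
rng = np.random.default_rng(0)
X = rng.random((200, 8))
classes = ["true_positive", "false_positive", "suspicious", "benign"]
y_cls = rng.integers(0, 4, 200)
y_score = rng.random(200) * 100  # 0-100 risk labels

clf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y_cls)
anomaly = IsolationForest(contamination=0.05, random_state=0).fit(X)
scorer = GradientBoostingRegressor(random_state=0).fit(X, y_score)

def predict_ensemble(features: np.ndarray) -> dict:
    """Run all three models on one feature vector and merge the outputs."""
    proba = clf.predict_proba([features])[0]
    label = int(clf.classes_[np.argmax(proba)])
    return {
        "disposition": classes[label],
        "confidence": float(proba.max()),
        "risk_score": float(np.clip(scorer.predict([features])[0], 0, 100)),
        "is_anomaly": bool(anomaly.predict([features])[0] == -1),  # -1 = outlier
        "anomaly_score": float(anomaly.score_samples([features])[0]),
    }

result = predict_ensemble(X[0])
```

Note the clip on the regressor output: GradientBoostingRegressor is unbounded, so a 0-100 guarantee has to be enforced after prediction.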

Training Flow:
  Analyst Feedback -> record_feedback() -> training_buffer (JSON, persisted)
                                            |
                                            +-- buffer >= 100 samples -> trigger_retrain()
                                            +-- scheduler (hourly) -> 50+ new samples -> retrain
                                            |
                                            v
                                    retrain all 3 models -> persist via ModelStore -> log performance
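A minimal sketch of the two retrain triggers above. The thresholds (100 feedback-driven, 50 scheduled) come from the doc; the buffer file layout and function signatures are assumptions:

```python
import json
import tempfile
from pathlib import Path

# Assumed buffer location -- the real pipeline persists this via ModelStore
BUFFER_PATH = Path(tempfile.gettempdir()) / "training_buffer.json"
FEEDBACK_THRESHOLD = 100   # immediate retrain on feedback accumulation
SCHEDULED_THRESHOLD = 50   # hourly scheduler check

def retrain_all(buffer: list) -> None:
    ...  # retrain classifier, anomaly detector, scorer; persist; log performance

def record_feedback(features: list, label: str) -> bool:
    """Append one labeled sample; return True if a retrain was triggered."""
    buffer = json.loads(BUFFER_PATH.read_text()) if BUFFER_PATH.exists() else []
    buffer.append({"features": features, "label": label})
    BUFFER_PATH.write_text(json.dumps(buffer))
    if len(buffer) >= FEEDBACK_THRESHOLD:
        retrain_all(buffer)
        return True
    return False

def scheduled_check(samples_since_last_retrain: int) -> bool:
    """Called hourly by the background scheduler."""
    return samples_since_last_retrain >= SCHEDULED_THRESHOLD
```

Persisting the buffer as JSON means feedback survives process restarts, which is what lets the hourly scheduler pick up where a crashed worker left off.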

Score Blending (6-Stage Triage)
Final Score = 0.70 * ML_ensemble_score + 0.30 * rule_based_score
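The blend is a straight weighted average, assuming both inputs are on the same 0-100 scale:

```python
def blend_score(ml_score: float, rule_score: float) -> float:
    """Final triage score: 70% ML ensemble, 30% rule-based (both 0-100)."""
    return 0.70 * ml_score + 0.30 * rule_score

blend_score(80.0, 50.0)  # ~71.0: a strong ML signal dominates a middling rule score
```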

6-Stage Triage Pipeline:
  1. Entity Extraction
  2. Threat Intelligence Lookup
  3. UEBA (User and Entity Behavior Analytics)
  4. Alert Correlation
  5. Rule-Based Scoring (372 detection rules in 20+ modules)
  6. ML Ensemble (AlertClassifier + AnomalyDetector + ThreatScorer)

API Routing

Router prefix: /api/v1/ml — Tag: ml-models

| Method | Path | Description |
|---|---|---|
| GET | /stats | Model versions, accuracy, training buffer, per-class distribution |
| GET | /health | Pipeline health: load status, latency percentiles, scheduler status |
| POST | /predict | Run ML ensemble prediction on an alert |
| GET | /predict/sample | Test prediction on a sample PowerShell alert |
| POST | /feedback | Record analyst feedback for model training |
| POST | /retrain | Force retrain of all models (admin) |
| GET | /performance | Historical model performance log |
| GET | /feature-importances | Random Forest feature importances |
| GET | /versions | Version history for all models |
| GET | /versions/{model_name} | Version history for a specific model |
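A hypothetical request body for POST /api/v1/ml/predict. The exact alert schema is an assumption based on the feature list above, not the documented contract:

```python
import json

# Assumed alert shape -- field names are illustrative
payload = {
    "severity": "high",
    "source": "splunk",
    "mitre_tactics": ["TA0002"],
    "iocs": ["1.2.3.4"],
    "assets": ["host-01"],
}

# Example call (requires the API to be running):
#   import requests
#   resp = requests.post("http://localhost:8000/api/v1/ml/predict", json=payload)
#   resp.json()  # {disposition, confidence, risk_score, is_anomaly, anomaly_score}

body = json.dumps(payload)
```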

Prerequisites

Data Model

ML models use in-memory state with persistence via ModelStore. No SQLAlchemy models required.

AlertClassifier

| Attribute | Type | Description |
|---|---|---|
| model | RandomForestClassifier | sklearn RF with 100 estimators |
| classes | list[str] | ["true_positive", "false_positive", "suspicious", "benign"] |
| version | str | Semantic version (e.g., "1.0.0") |
| trained_samples | int | Number of training samples |
| accuracy | float | Test-set accuracy |

AnomalyDetector

| Attribute | Type | Description |
|---|---|---|
| model | IsolationForest | sklearn IF with contamination parameter |
| version | str | Semantic version |

ThreatScorer

| Attribute | Type | Description |
|---|---|---|
| model | GradientBoostingRegressor | sklearn GBR, outputs a 0-100 risk score |
| version | str | Semantic version |
| r2_score | float | R-squared on test set |

UI Description

File: platform/frontend/src/app/ml-models/page.tsx

The ML Models dashboard provides full visibility into the AI pipeline: