# ML Models & Training Pipeline
Three-model sklearn ensemble for alert classification, anomaly detection, and threat scoring with continuous learning from analyst feedback.
## Overview
The ML Pipeline provides the AI scoring backbone for ThreatOps. It uses three classical ML models from scikit-learn, bootstrapped with synthetic training data and continuously improved through analyst feedback. The ensemble combines a Random Forest classifier (alert disposition), an Isolation Forest anomaly detector (anomaly flagging), and a Gradient Boosting threat scorer (0-100 risk score). Models are persisted via Azure Blob Storage with local fallback. A background scheduler checks hourly for auto-retraining when 50+ new samples are available.
## What Was Proposed
- Three-model ML ensemble: AlertClassifier, AnomalyDetector, ThreatScorer
- Feature engineering from raw alert payloads
- Continuous learning from analyst feedback (true_positive, false_positive, suspicious, benign)
- Auto-retrain when threshold samples reached
- Model versioning with rollback capability
- Azure Blob Storage persistence surviving pod restarts
- Health monitoring and latency tracking
- Score blending: 70% ML + 30% rule-based
## What's Built
| Feature | Status | Details |
|---|---|---|
| AlertClassifier | Complete | Random Forest, 4 classes (true_positive, false_positive, suspicious, benign), bootstrapped with 2000 synthetic samples |
| AnomalyDetector | Complete | Isolation Forest, contamination-based anomaly flagging |
| ThreatScorer | Complete | Gradient Boosting Regressor, outputs 0-100 risk score |
| Feature Engineering | Complete | AlertFeatureExtractor: severity mapping, SIEM source encoding, MITRE tactic mapping, temporal features, IOC/asset counts |
| Training Pipeline | Complete | Continuous learning: collect feedback, accumulate samples, auto-retrain at 100 (feedback) or 50 (scheduled) |
| Model Store | Complete | Azure Blob Storage (stroconmlmodels, container: ml-models) with local /tmp fallback. Version tracking per model |
| Ensemble Prediction | Complete | Combined prediction: disposition + confidence + risk_score + is_anomaly + anomaly_score |
| Health Monitoring | Complete | Model load status, prediction latency percentiles (p50, p95), training buffer size, scheduler status |
| Feature Importances | Complete | Random Forest feature importances exposed for interpretability |
| Version History | Complete | Per-model version history from ModelStore for rollback UI |
| Background Scheduler | Complete | Hourly check, auto-retrain at 50+ samples since last retrain |
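The feature-engineering row above can be illustrated with a minimal sketch. The specific encodings, feature order, and field names here are illustrative assumptions, not the actual `AlertFeatureExtractor` implementation:

```python
import numpy as np
from datetime import datetime

# Hypothetical encodings -- the real mappings live in feature_engineering.py.
SEVERITY_MAP = {"low": 1, "medium": 2, "high": 3, "critical": 4}
SIEM_SOURCES = ["splunk", "sentinel", "qradar", "elastic"]

def extract_features(alert: dict) -> np.ndarray:
    """Turn a raw alert payload into a fixed-length numeric vector."""
    ts = datetime.fromisoformat(alert["timestamp"])
    source_onehot = [1.0 if alert.get("source") == s else 0.0 for s in SIEM_SOURCES]
    return np.array([
        SEVERITY_MAP.get(alert.get("severity", "low"), 1),  # ordinal severity
        ts.hour,                                             # temporal feature
        float(ts.weekday() >= 5),                            # weekend flag
        len(alert.get("iocs", [])),                          # IOC count
        len(alert.get("assets", [])),                        # asset count
        *source_onehot,                                      # SIEM source encoding
    ], dtype=float)

vec = extract_features({
    "timestamp": "2024-05-01T03:15:00",
    "severity": "high",
    "source": "sentinel",
    "iocs": ["1.2.3.4"],
    "assets": ["host-1", "host-2"],
})
```

The key property is that every alert, regardless of source schema, maps to the same fixed-length vector the three models share.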
## Architecture
```
platform/api/app/ml/
├── __init__.py
├── feature_engineering.py   # AlertFeatureExtractor (raw alert -> feature vector)
├── models.py                # AlertClassifier (RF), AnomalyDetector (IF), ThreatScorer (GB)
├── training_pipeline.py     # TrainingPipeline (feedback collection, retrain, scheduler)
└── model_store.py           # ModelStore (Azure Blob Storage + local fallback)
```
**Prediction Flow:**

```
Raw Alert -> AlertFeatureExtractor.extract() -> feature_vector (numpy array)
    |
    +-> AlertClassifier.predict()  -> disposition, confidence
    +-> AnomalyDetector.predict()  -> is_anomaly, anomaly_score
    +-> ThreatScorer.predict()     -> risk_score (0-100)
    |
    v
Ensemble Result: { disposition, confidence, risk_score, is_anomaly, anomaly_score }
```
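A compact sketch of this ensemble prediction, using the three sklearn estimators named above. The training data here is random stand-in data (the real models bootstrap from 2000 synthetic samples), and the score aggregation details are assumptions:

```python
import numpy as np
from sklearn.ensemble import (
    RandomForestClassifier, IsolationForest, GradientBoostingRegressor,
)

# Stand-in training data; the real pipeline uses synthetic alert samples.
rng = np.random.default_rng(0)
X = rng.random((200, 5))
y_cls = rng.choice(["true_positive", "false_positive", "suspicious", "benign"], 200)
y_score = rng.random(200) * 100  # 0-100 risk labels

clf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y_cls)
det = IsolationForest(contamination=0.1, random_state=0).fit(X)
scorer = GradientBoostingRegressor(random_state=0).fit(X, y_score)

def ensemble_predict(x: np.ndarray) -> dict:
    """Run one feature vector through all three models and merge the outputs."""
    x = x.reshape(1, -1)
    proba = clf.predict_proba(x)[0]
    return {
        "disposition": clf.classes_[int(np.argmax(proba))],
        "confidence": float(np.max(proba)),
        "risk_score": float(np.clip(scorer.predict(x)[0], 0, 100)),
        "is_anomaly": bool(det.predict(x)[0] == -1),   # sklearn flags anomalies as -1
        "anomaly_score": float(-det.score_samples(x)[0]),
    }

result = ensemble_predict(X[0])
```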
**Training Flow:**

```
Analyst Feedback -> record_feedback() -> training_buffer (JSON, persisted)
    |
    +-- buffer >= 100 samples -> trigger_retrain()
    +-- scheduler (hourly)    -> 50+ new samples -> retrain
    |
    v
retrain all 3 models -> persist via ModelStore -> log performance
```
**Score Blending:** `Final Score = 0.70 * ML_ensemble_score + 0.30 * rule_based_score`
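The blend is a plain weighted sum; for example, an ML ensemble score of 80 and a rule-based score of 50 blend to 71:

```python
ML_WEIGHT, RULE_WEIGHT = 0.70, 0.30

def blended_score(ml_score: float, rule_score: float) -> float:
    """Blend the ML ensemble score with the rule-based score (both on 0-100)."""
    return ML_WEIGHT * ml_score + RULE_WEIGHT * rule_score
```

Keeping a 30% rule-based floor means a confidently wrong model cannot fully suppress a rule that fired.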
**6-Stage Triage Pipeline:**
1. Entity Extraction
2. Threat Intelligence Lookup
3. UEBA (User and Entity Behavior Analytics)
4. Alert Correlation
5. Rule-Based Scoring (372 detection rules in 20+ modules)
6. ML Ensemble (AlertClassifier + AnomalyDetector + ThreatScorer)
## API Routing
Router prefix: `/api/v1/ml` (tag: `ml-models`)
## Prerequisites
- scikit-learn -- RandomForestClassifier, IsolationForest, GradientBoostingRegressor
- numpy -- feature vector computation
- Azure Blob Storage -- account `stroconmlmodels`, container `ml-models` (optional; falls back to `/tmp/ml_models/`)
- Models bootstrap with synthetic data on first start (no external training data required)
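The Blob-Storage-with-local-fallback behavior can be sketched like this. The connection handling and blob naming are assumptions; the real `ModelStore` also tracks a version history per model:

```python
import os
import pickle
from pathlib import Path

LOCAL_DIR = Path("/tmp/ml_models")  # documented local fallback location
CONNECTION_STRING = os.environ.get("AZURE_STORAGE_CONNECTION_STRING", "")

def save_model(name: str, version: str, model: object) -> str:
    """Persist a pickled model, preferring Azure Blob Storage over local disk."""
    blob_name = f"{name}/{version}.pkl"
    payload = pickle.dumps(model)
    try:
        from azure.storage.blob import BlobServiceClient  # optional dependency
        service = BlobServiceClient.from_connection_string(CONNECTION_STRING)
        container = service.get_container_client("ml-models")
        container.upload_blob(blob_name, payload, overwrite=True)
        return f"azure://ml-models/{blob_name}"
    except Exception:
        # No Azure SDK installed or no valid connection string: fall back to /tmp.
        path = LOCAL_DIR / blob_name
        path.parent.mkdir(parents=True, exist_ok=True)
        path.write_bytes(payload)
        return str(path)
```

Because the fallback shares the `name/version.pkl` layout, a pod that comes up without Blob Storage credentials still finds locally cached models at the same relative paths.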
## Data Model
ML models use in-memory state with persistence via ModelStore. No SQLAlchemy models required.
### AlertClassifier
| Attribute | Type | Description |
|---|---|---|
| model | RandomForestClassifier | sklearn RF with 100 estimators |
| classes | list[str] | ["true_positive", "false_positive", "suspicious", "benign"] |
| version | str | Semantic version (e.g., "1.0.0") |
| trained_samples | int | Number of training samples |
| accuracy | float | Test set accuracy |
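A minimal sketch of a wrapper holding these attributes, assuming a standard train/test split is used to compute the reported accuracy (the actual class in `models.py` may differ):

```python
import numpy as np
from dataclasses import dataclass, field
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

CLASSES = ["true_positive", "false_positive", "suspicious", "benign"]

@dataclass
class AlertClassifier:
    model: RandomForestClassifier = field(
        default_factory=lambda: RandomForestClassifier(n_estimators=100, random_state=0))
    version: str = "1.0.0"
    trained_samples: int = 0
    accuracy: float = 0.0

    def train(self, X: np.ndarray, y: np.ndarray) -> None:
        """Fit on 80% of the data; record held-out accuracy on the other 20%."""
        X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)
        self.model.fit(X_tr, y_tr)
        self.trained_samples = len(X_tr)
        self.accuracy = float(self.model.score(X_te, y_te))

# Stand-in random data; the real bootstrap uses 2000 synthetic samples.
rng = np.random.default_rng(1)
clf = AlertClassifier()
clf.train(rng.random((500, 6)), rng.choice(CLASSES, 500))
```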
### AnomalyDetector
| Attribute | Type | Description |
|---|---|---|
| model | IsolationForest | sklearn IF with contamination parameter |
| version | str | Semantic version |
### ThreatScorer
| Attribute | Type | Description |
|---|---|---|
| model | GradientBoostingRegressor | sklearn GBR, outputs 0-100 |
| version | str | Semantic version |
| r2_score | float | R-squared on test set |
## UI Description
File: `platform/frontend/src/app/ml-models/page.tsx`
The ML Models dashboard provides full visibility into the AI pipeline:
- Model Cards -- Per-model cards showing version, accuracy/R2, trained samples, last updated timestamp
- Feature Importance Chart -- Bar chart of RF feature importances for interpretability
- Training Status -- Buffer size vs threshold indicator, retrain button, auto-retrain schedule
- Live Prediction Test -- Submit a sample alert and see ensemble results: disposition, confidence, risk score, anomaly flag
- Feedback Interface -- Submit analyst feedback (disposition + optional risk score) for model training
- Performance Log -- Historical retrain log with accuracy/version changes per model
- Version History -- Per-model version timeline with rollback indicators
- Health Panel -- Pipeline health: model load status, prediction latency percentiles, storage backend info