LLM as Planner for Classic ML
Use a generative model to plan, configure, and evaluate traditional ML pipelines. Gain speed, explainability, and low latency with a governed loop that adapts to drift.
Many production wins still come from decision trees, linear models, Naive Bayes, and evolutionary search. These methods excel on structured data with tight latency and clear compliance needs. A practical way to combine their strengths with modern reasoning is to use a Generative AI model as a planner and orchestrator. The planner reads the task and constraints, selects an appropriate traditional pipeline, produces a structured plan or configuration, triggers training and evaluation, then keeps the system adapting as data shifts.
A short refresher
- Planner versus solver: the LLM plans and explains. The solver is a concrete ML pipeline such as a decision tree, random forest or GBDT, logistic or linear regression, SVM, Naive Bayes, or a genetic algorithm for search and threshold tuning.
Why this pairing works:
- Classic models often deliver lower cost and lower latency on tabular tasks.
- LLMs are good at decomposing problems, proposing features, and filling templates.
- A governed feedback loop supports continuous adaptation without chaos.
Core architecture that is lean, safe, and reproducible
1) Planner
The planner ingests a task description, a data schema or profile, constraints such as SLA or explainability, and the latest evaluation report. It emits a structured plan.
{
  "task": "binary_classification",
  "candidate_algorithms": ["DecisionTreeClassifier", "LogisticRegression", "GradientBoosting"],
  "chosen": "DecisionTreeClassifier",
  "params": {"max_depth": 12, "min_samples_leaf": 50, "class_weight": "balanced"},
  "features": [
    "sender_domain_age",
    "url_count",
    "has_dkim",
    "body_char_ngrams_tfidf:5000"
  ],
  "target": "is_spam",
  "cv": {"type": "stratified_kfold", "folds": 5},
  "metrics": ["roc_auc", "f1", "precision@0.5"],
  "constraints": {"p99_latency_ms": 3, "explainability": "high"}
}
Prefer configurations over free-form code. Plans are easier to review, safer to execute, and more reproducible.
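To make the idea concrete, here is a minimal sketch of how a harness might turn a reviewed plan into a scikit-learn estimator. The registry keys mirror the example plan above; the build_estimator helper and the fixed seed are illustrative assumptions, not a prescribed API.

# Minimal sketch: map an approved plan (a dict like the JSON above) onto an estimator.
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import GradientBoostingClassifier

REGISTRY = {
    "DecisionTreeClassifier": DecisionTreeClassifier,
    "LogisticRegression": LogisticRegression,
    "GradientBoosting": GradientBoostingClassifier,
}

def build_estimator(plan: dict):
    """Instantiate the planner's chosen algorithm with its params; reject unknown names."""
    cls = REGISTRY[plan["chosen"]]  # KeyError on anything outside the allow list
    return cls(**plan["params"], random_state=42)  # pinned seed for reproducibility

Because the plan is data rather than code, this small mapping is the only place where configuration meets executable logic.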
2) Tool layer with typed interfaces
Expose a small set of trusted tools:
load_data(snapshot_id) → df
split(cv_spec) → folds
train_decision_tree(params, X, y) → model
evaluate(model, X_val, y_val, metrics) → report
explain(model, X_val) → feature_importances, exemplars
export_model(model) → artifact_uri
deploy_canary(artifact_uri, traffic=0.1)
monitor(metric_spec) → time_series
Each tool uses typed inputs and outputs, resource limits, and timeouts. No arbitrary shell, no internet.
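As a sketch of what one such tool can look like in Python, the wrapper below combines a typed input, a resource cap, and a typed result; the TrainResult dataclass and the max_depth cap are illustrative assumptions.

# Illustrative typed tool: validated inputs, a resource cap, and a typed result.
from dataclasses import dataclass
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

@dataclass(frozen=True)
class TrainResult:
    model: DecisionTreeClassifier

def train_decision_tree(params: dict, X: pd.DataFrame, y: pd.Series,
                        max_depth_cap: int = 32) -> TrainResult:
    """Train a decision tree under a resource cap; no shell access, no network."""
    if (params.get("max_depth") or 0) > max_depth_cap:
        raise ValueError("max_depth exceeds the configured resource cap")
    model = DecisionTreeClassifier(**params)
    model.fit(X, y)
    return TrainResult(model=model)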
3) Execution harness
- Pinned libraries and random seeds for determinism.
- Version everything, from data snapshot to plan to model to evaluation report; see the manifest sketch after this list.
- Human gates for sensitive or high risk changes.
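A minimal sketch of the first two items, assuming a JSON-serializable plan and report; the manifest fields and helper name are illustrative, not a fixed schema.

# Pin seeds for determinism and hash the run's inputs and outputs into a manifest.
import hashlib, json, random
import numpy as np

SEED = 42
random.seed(SEED)
np.random.seed(SEED)

def run_manifest(snapshot_id: str, plan: dict, report: dict) -> dict:
    """Bind data snapshot, plan, and evaluation report together for audit and rollback."""
    payload = json.dumps({"snapshot": snapshot_id, "plan": plan, "report": report},
                         sort_keys=True).encode()
    return {"snapshot_id": snapshot_id, "seed": SEED,
            "sha256": hashlib.sha256(payload).hexdigest()}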
4) Governed loop
- Champion and challenger with promotion rules.
- Drift monitoring for features, labels, and performance; see the PSI sketch after this list.
- Incident playbooks for rollback and alerting.
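One common signal behind the drift-monitoring item is the population stability index (PSI). The sketch below computes it for a single feature; the bin count and the 0.2 alert level are conventional assumptions, not requirements.

# Population stability index (PSI) between a reference window and a live window.
import numpy as np

def psi(expected: np.ndarray, actual: np.ndarray, bins: int = 10) -> float:
    """Compare binned distributions of one feature; larger values mean more drift."""
    edges = np.histogram_bin_edges(expected, bins=bins)
    e_pct = np.histogram(expected, bins=edges)[0] / len(expected) + 1e-6
    a_pct = np.histogram(actual, bins=edges)[0] / len(actual) + 1e-6
    return float(np.sum((a_pct - e_pct) * np.log(a_pct / e_pct)))

# A PSI above roughly 0.2 is a common rule of thumb for raising a drift alert.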
Why configurations beat code when an LLM is involved
- Safety from hallucinated imports or insecure patterns.
- Simple reviews through diffs and linters.
- Faster approvals and audits.
- Tight control over libraries such as scikit-learn, LightGBM, XGBoost, or online learning frameworks when needed.
A concrete walk-through: adaptive spam filtering
Objective. Minimize expected cost
$$ \mathbb{E}[\text{cost}] = c_{FP}\cdot FP + c_{FN}\cdot FN $$
subject to p99 latency under 3 milliseconds and strong explainability.
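That objective translates directly into threshold selection. A minimal sketch, assuming a validation set of labels and scores and placeholder costs c_fp and c_fn supplied by the business:

# Pick the decision threshold that minimizes expected cost on validation data.
import numpy as np

def best_threshold(y_true: np.ndarray, scores: np.ndarray,
                   c_fp: float = 1.0, c_fn: float = 5.0) -> float:
    """Scan candidate thresholds and return the one with the lowest expected cost."""
    thresholds = np.linspace(0.01, 0.99, 99)
    costs = []
    for t in thresholds:
        pred = scores >= t
        fp = np.sum(pred & (y_true == 0))
        fn = np.sum(~pred & (y_true == 1))
        costs.append(c_fp * fp + c_fn * fn)
    return float(thresholds[int(np.argmin(costs))])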
Planning. The planner proposes DecisionTree or GBDT with balanced classes, and a feature set that includes domain age, SPF, DKIM, and DMARC status, URL and TLD statistics, character n-gram TF-IDF, emoji density, and homoglyph ratio. It suggests stratified five-fold cross validation with ROC AUC, F1, and precision at a business-aligned threshold. It adds leakage checks and segment analysis for VIP customers and regions.
Training and evaluation. The harness runs cross validation, logs per fold metrics, confusion matrices, and calibration curves. If GBDT wins within the latency budget, it becomes the challenger.
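A sketch of that evaluation step, assuming numpy feature and label arrays and the plan parameters shown earlier; the harness, not the LLM, runs this code.

# Stratified 5-fold cross validation with per-fold ROC AUC and F1, as the plan requests.
from sklearn.model_selection import StratifiedKFold
from sklearn.metrics import roc_auc_score, f1_score
from sklearn.tree import DecisionTreeClassifier

def cross_validate(X, y, params, folds=5, seed=42):
    """Return per-fold metrics so the report shows variance, not just a mean."""
    cv = StratifiedKFold(n_splits=folds, shuffle=True, random_state=seed)
    report = []
    for train_idx, val_idx in cv.split(X, y):
        model = DecisionTreeClassifier(**params, random_state=seed)
        model.fit(X[train_idx], y[train_idx])
        scores = model.predict_proba(X[val_idx])[:, 1]
        report.append({"roc_auc": roc_auc_score(y[val_idx], scores),
                       "f1": f1_score(y[val_idx], scores >= 0.5)})
    return report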
Canary and monitoring. Deploy the challenger to a fraction of traffic. Watch false positives for VIPs, drift in URL and TLD distributions, and latency headroom.
Adaptation to new attacks. Suppose a new wave relies on emoji obfuscation. Drift alerts fire. The planner proposes new features such as emoji_density, unicode_homoglyph_ratio, and suspicious block flags. It may invoke a small genetic search over feature interactions and thresholds while honoring the same latency budget. Retraining produces a new challenger that must beat the champion on global and segment metrics before promotion.
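The genetic search itself can stay very small. A hedged sketch, evolving a feature-subset bit mask against a caller-supplied fitness function (lower expected cost is better); the population size, operators, and fitness interface are all assumptions.

# Tiny genetic search over feature subsets; fitness(mask) returns expected cost.
import random

def genetic_feature_search(n_features, fitness, pop_size=20, generations=15, seed=42):
    rng = random.Random(seed)
    pop = [[rng.randint(0, 1) for _ in range(n_features)] for _ in range(pop_size)]
    for _ in range(generations):
        parents = sorted(pop, key=fitness)[: pop_size // 2]    # truncation selection
        children = []
        while len(children) < pop_size - len(parents):
            a, b = rng.sample(parents, 2)
            cut = rng.randrange(1, n_features)
            child = a[:cut] + b[cut:]                          # one-point crossover
            child[rng.randrange(n_features)] ^= 1              # single-bit mutation
            children.append(child)
        pop = parents + children
    return min(pop, key=fitness)                               # best mask found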
Explainability and audit. Decision paths and SHAP style summaries are stored along with counterfactual checks such as what happens if has_dkim flips to true. These artifacts travel with the model.
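One of those counterfactual checks could look like the sketch below, assuming a pandas row whose columns match the training features; the has_dkim column comes from the example plan, and the helper name is illustrative.

# Counterfactual probe: how does the spam score change if has_dkim flips to true?
import pandas as pd

def dkim_counterfactual(model, row: pd.DataFrame) -> dict:
    """Score one example before and after flipping has_dkim, and store the delta."""
    flipped = row.copy()
    flipped["has_dkim"] = 1
    before = float(model.predict_proba(row)[:, 1][0])
    after = float(model.predict_proba(flipped)[:, 1][0])
    return {"score_before": before, "score_after": after, "delta": after - before}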
Where this pattern shines
- Structured problems such as fraud, churn, quality scoring, routing, and spam or phishing.
- Tight inference budgets measured in a few milliseconds.
- Compliance and policy reviews that favor interpretable models.
- Rapid iteration through feature proposals and retrains instead of heavy model overhauls.
When this is not the right fit
- Vision, audio, and video tasks that rely on modern deep learning for accuracy.
- Hard real-time loops where even a planning pass is too expensive. In those cases, fix the solver family and let the planner tune thresholds or features offline.
Guardrails that make it production worthy
- Clear objectives and cost aligned thresholds.
- A frozen, leakage free holdout for final checks.
- Security through allow listed tools, isolation, and resource caps.
- Provenance for data, plans, models, and reports through hashing and manifests.
- Fairness and segment health dashboards with promotion blocks on degradation.
- Safe fallbacks and automatic rollback triggers.
Implementation blueprint
Tool specification. The planner fills these schemas.
{
  "tools": [
    {"name": "load_data", "input_schema": {"snapshot_id": "string"}, "output_schema": {"df": "table"}},
    {"name": "split", "input_schema": {"cv_spec": "json"}, "output_schema": {"folds": "list"}},
    {"name": "train_decision_tree", "input_schema": {"params": "json", "X": "table", "y": "vector"}, "output_schema": {"model": "model"}},
    {"name": "evaluate", "input_schema": {"model": "model", "X_val": "table", "y_val": "vector", "metrics": "list"}, "output_schema": {"report": "json"}},
    {"name": "explain", "input_schema": {"model": "model", "X_val": "table"}, "output_schema": {"explanations": "json"}},
    {"name": "export_model", "input_schema": {"model": "model"}, "output_schema": {"artifact_uri": "string"}},
    {"name": "deploy_canary", "input_schema": {"artifact_uri": "string", "traffic": "float"}, "output_schema": {"deployment_id": "string"}},
    {"name": "monitor", "input_schema": {"metric_spec": "json"}, "output_schema": {"time_series": "json"}}
  ]
}
Promotion rule. A challenger must beat the champion on the primary metric by a predefined margin on validation and canary, must not degrade any key segment beyond a tolerance, and must meet latency and explainability constraints.
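Expressed as a pure function over evaluation reports, the rule might look like the sketch below; the metric names, margin, and tolerance values are illustrative assumptions.

# Promotion check: the challenger must win by a margin, hold every segment, and fit the SLA.
def should_promote(champion: dict, challenger: dict,
                   margin: float = 0.005, segment_tol: float = 0.01,
                   p99_budget_ms: float = 3.0) -> bool:
    if challenger["roc_auc"] < champion["roc_auc"] + margin:
        return False                                        # must beat the champion clearly
    for segment, score in challenger["segments"].items():
        if score < champion["segments"][segment] - segment_tol:
            return False                                    # no key segment may degrade
    return challenger["p99_latency_ms"] <= p99_budget_ms    # SLA still holds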
FAQ
Why not fine-tune the LLM for everything?
Many tasks are cheaper, faster, and more interpretable with classic ML. Save large models for cases that demand them.
What about RAG or agents?
The planner is a focused agent with a minimal, typed toolbelt. Narrow scope improves reliability.
Is chain of thought required?
Reasoning is required. Store concise rationales and the structured plan for auditability. Verbose traces are optional.
Can this run online?
Yes. Swap in incremental learners such as Hoeffding trees or online Naive Bayes and schedule micro updates under drift budgets.
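A minimal sketch of that online variant, using scikit-learn's partial_fit on a Naive Bayes model as a stand-in for any incremental learner; the batch source and feature extraction are assumptions.

# Online Naive Bayes: fold small labeled batches into the model as drift budgets allow.
from sklearn.naive_bayes import MultinomialNB

model = MultinomialNB()
CLASSES = [0, 1]  # not spam / spam

def micro_update(model, X_batch, y_batch):
    """Incrementally update the model; classes must be declared on the first call."""
    model.partial_fit(X_batch, y_batch, classes=CLASSES)
    return model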
Closing thought
Let the LLM decide. Let classic ML deliver. With a planner that proposes features and configurations, and a harness that enforces metrics and safety, you get adaptability to new attack vectors without giving up speed, clarity, or reliability.