LLM as Planner for Classic ML

Published on August 28, 2025

Use a generative model to plan, configure, and evaluate traditional ML pipelines. Gain speed, explainability, and low latency with a governed loop that adapts to drift.

Many production wins still come from decision trees, linear models, Naive Bayes, and evolutionary search. These methods excel on structured data with tight latency and clear compliance needs. A practical way to combine their strengths with modern reasoning is to use a Generative AI model as a planner and orchestrator. The planner reads the task and constraints, selects an appropriate traditional pipeline, produces a structured plan or configuration, triggers training and evaluation, then keeps the system adapting as data shifts.


A short refresher

Why this pairing works:

  - Classic models such as decision trees, linear models, and Naive Bayes are fast, cheap to train, and easy to explain on the structured data where most production wins still happen.
  - They meet tight latency and compliance requirements at inference time.
  - The generative model supplies what they lack: reading the task, reasoning over constraints, and replanning when the data shifts.


Core architecture that is lean, safe, and reproducible

Figure: The planner reads task and constraints, emits a configuration, executes via typed tools, and closes the loop with evaluation, deployment, and monitoring.

1) Planner

The planner ingests a task description, a data schema or profile, constraints such as SLA or explainability, and the latest evaluation report. It emits a structured plan.

{
  "task": "binary_classification",
  "candidate_algorithms": ["DecisionTreeClassifier","LogisticRegression","GradientBoosting"],
  "chosen": "DecisionTreeClassifier",
  "params": {"max_depth": 12, "min_samples_leaf": 50, "class_weight": "balanced"},
  "features": [
    "sender_domain_age",
    "url_count",
    "has_dkim",
    "body_char_ngrams_tfidf:5000"
  ],
  "target": "is_spam",
  "cv": {"type":"stratified_kfold","folds":5},
  "metrics": ["roc_auc","f1","precision@0.5"],
  "constraints": {"p99_latency_ms": 3, "explainability": "high"}
}

Prefer configurations over free-form code. Plans are easier to review, safer to execute, and more reproducible.
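As a concrete gate, the harness can refuse any plan that fails schema validation before a single tool runs. A minimal sketch using the jsonschema package; the schema below is illustrative, not a canonical spec.

from jsonschema import ValidationError, validate

# Illustrative plan schema; a real deployment would version and extend this.
PLAN_SCHEMA = {
    "type": "object",
    "required": ["task", "chosen", "params", "target", "cv", "metrics", "constraints"],
    "properties": {
        "task": {"enum": ["binary_classification", "regression"]},
        "chosen": {"type": "string"},
        "params": {"type": "object"},
        "cv": {"type": "object", "required": ["type", "folds"]},
        "metrics": {"type": "array", "items": {"type": "string"}},
        "constraints": {"type": "object"},
    },
}

def accept_plan(plan: dict) -> bool:
    # Fail closed: anything that does not match the schema never executes.
    try:
        validate(instance=plan, schema=PLAN_SCHEMA)
        return True
    except ValidationError:
        return False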

2) Tool layer with typed interfaces

Expose a small set of trusted tools:

  - load_data and split for snapshot loading and cross validation folds
  - train_decision_tree and its siblings for fitting allow-listed estimators
  - evaluate and explain for metric reports and explanations
  - export_model, deploy_canary, and monitor for packaging, rollout, and observation

Each tool uses typed inputs and outputs, resource limits, and timeouts. No arbitrary shell, no internet.
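A sketch of what a typed, allow-listed tool layer can look like in Python; the Tool fields and registry shape here are assumptions, not a fixed API.

from dataclasses import dataclass
from typing import Any, Callable

@dataclass(frozen=True)
class Tool:
    name: str
    input_schema: dict       # JSON-schema-like contract for inputs
    output_schema: dict      # ... and for outputs
    fn: Callable[..., Any]   # the trusted implementation
    timeout_s: float = 60.0  # hard wall-clock budget per call

REGISTRY: dict[str, Tool] = {}

def register(tool: Tool) -> None:
    REGISTRY[tool.name] = tool

def call(name: str, **kwargs: Any) -> Any:
    # Unknown tool names fail closed; there is no escape hatch to a shell.
    if name not in REGISTRY:
        raise PermissionError(f"tool {name!r} is not allow-listed")
    return REGISTRY[name].fn(**kwargs)  # a real harness also enforces timeout_s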

3) Execution harness

The harness turns an accepted plan into deterministic runs: it loads the referenced data snapshot, runs cross validation through the typed tools, logs per-fold artifacts, and returns an evaluation report whose inputs and outputs are hashed for provenance.
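To make this concrete, here is a sketch of how an accepted plan could map onto scikit-learn inside the harness; the estimator table and the direct use of scikit-learn scorer names are assumptions of the sketch.

from sklearn.ensemble import GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_validate
from sklearn.tree import DecisionTreeClassifier

# Allow-listed estimators the planner may choose from.
ESTIMATORS = {
    "DecisionTreeClassifier": DecisionTreeClassifier,
    "LogisticRegression": LogisticRegression,
    "GradientBoosting": GradientBoostingClassifier,
}

def run_plan(plan: dict, X, y, random_state: int = 0) -> dict:
    # Instantiate the chosen estimator with the planner's parameters.
    model = ESTIMATORS[plan["chosen"]](**plan["params"])
    cv = StratifiedKFold(n_splits=plan["cv"]["folds"], shuffle=True,
                         random_state=random_state)
    # The plan's metric labels would need an explicit mapping to scikit-learn
    # scorer names; "roc_auc" and "f1" happen to match directly.
    report = cross_validate(model, X, y, cv=cv, scoring=["roc_auc", "f1"])
    return {key: values.tolist() for key, values in report.items()}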

4) Governed loop

Evaluation gates deployment, a canary gates promotion, and monitoring feeds drift signals back to the planner, so every retrain is triggered by evidence rather than a schedule.


Why configurations beat code when an LLM is involved

A configuration can be validated against a schema before it runs, diffed in review, and replayed byte for byte. Free-form generated code offers none of those guarantees: it widens the attack surface, resists static checks, and blurs provenance.


A concrete walk through: adaptive spam filtering

Objective. Minimize expected cost

$$ \mathbb{E}[\text{cost}] = c_{FP}\cdot FP + c_{FN}\cdot FN $$

subject to p99 latency under 3 milliseconds and strong explainability.
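That objective also pins down the decision threshold. A sketch that picks the threshold minimizing expected cost on validation scores; c_fp and c_fn are the business-supplied costs from the formula above.

import numpy as np

def min_cost_threshold(y_true: np.ndarray, scores: np.ndarray,
                       c_fp: float, c_fn: float) -> float:
    # Sweep every distinct score as a candidate threshold.
    thresholds = np.unique(scores)
    costs = []
    for t in thresholds:
        pred = scores >= t
        fp = np.sum(pred & (y_true == 0))   # false positives at this threshold
        fn = np.sum(~pred & (y_true == 1))  # false negatives at this threshold
        costs.append(c_fp * fp + c_fn * fn)
    return float(thresholds[int(np.argmin(costs))])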

Planning. The planner proposes DecisionTree or GBDT with balanced classes, and a feature set that includes domain age, SPF/DKIM/DMARC status, URL and TLD statistics, character n-gram TF-IDF, emoji density, and homoglyph ratio. It suggests stratified five-fold cross validation with ROC AUC, F1, and precision at a business-aligned threshold, and adds leakage checks plus segment analysis for VIP customers and regions.

Training and evaluation. The harness runs cross validation, logging per-fold metrics, confusion matrices, and calibration curves. If GBDT wins within the latency budget, it becomes the challenger.

Canary and monitoring. Deploy the challenger to a fraction of traffic. Watch false positives for VIPs, drift in URL and TLD distributions, and latency headroom.

Adaptation to new attacks. Suppose a new wave relies on emoji obfuscation. Drift alerts fire. The planner proposes new features such as emoji_density, unicode_homoglyph_ratio, and suspicious block flags. It may invoke a small genetic search over feature interactions and thresholds while honoring the same latency budget. Retraining produces a new challenger that must beat the champion on global and segment metrics before promotion.
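For illustration, the two proposed features might start life as simple heuristics like these; the Unicode range and the homoglyph proxy are assumptions, not production detectors.

import unicodedata

def emoji_density(text: str) -> float:
    # Share of characters in common emoji blocks (illustrative range only).
    if not text:
        return 0.0
    emoji = sum(1 for ch in text if 0x1F300 <= ord(ch) <= 0x1FAFF)
    return emoji / len(text)

def unicode_homoglyph_ratio(text: str) -> float:
    # Share of letters that are not Latin: a cheap proxy for homoglyph
    # substitution such as Cyrillic lookalikes inside English words.
    letters = [ch for ch in text if ch.isalpha()]
    if not letters:
        return 0.0
    suspicious = sum(1 for ch in letters
                     if not unicodedata.name(ch, "").startswith("LATIN"))
    return suspicious / len(letters)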

Explainability and audit. Decision paths and SHAP-style summaries are stored along with counterfactual checks, such as what happens if has_dkim flips to true. These artifacts travel with the model.
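A counterfactual check of that kind fits in one function. A sketch assuming a fitted scikit-learn classifier, numeric features, and a pandas Series holding the row.

def counterfactual_delta(model, row, feature="has_dkim", flipped_value=1):
    # Score the original row, flip one feature, and record the prediction delta.
    base = model.predict_proba(row.to_frame().T)[0, 1]
    cf = row.copy()
    cf[feature] = flipped_value
    flipped = model.predict_proba(cf.to_frame().T)[0, 1]
    return {"base_score": base, "flipped_score": flipped,
            "delta": flipped - base}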


Where this pattern shines

  - Structured, tabular problems with tight latency budgets and strong explainability or compliance requirements.
  - Adversarial, drifting domains such as spam and abuse, where fast replanning beats raw model capacity.
  - Teams that need auditable artifacts: plans, reports, and models with end-to-end provenance.

When this is not the right fit

  - Tasks that genuinely demand a large model end to end, such as open-ended generation or rich unstructured inputs.
  - Problems without a measurable objective or a stable evaluation harness, because the governed loop then has nothing to govern.


Guardrails that make it production worthy

  1. Clear objectives and cost aligned thresholds.
  2. A frozen, leakage free holdout for final checks.
  3. Security through allow-listed tools, isolation, and resource caps.
  4. Provenance for data, plans, models, and reports through hashing and manifests.
  5. Fairness and segment health dashboards with promotion blocks on degradation.
  6. Safe fallbacks and automatic rollback triggers.

Implementation blueprint

Tool specification. The planner fills these schemas.

{
  "tools": [
    {"name":"load_data","input_schema":{"snapshot_id":"string"},"output_schema":{"df":"table"}},
    {"name":"split","input_schema":{"cv_spec":"json"},"output_schema":{"folds":"list"}},
    {"name":"train_decision_tree","input_schema":{"params":"json","X":"table","y":"vector"},"output_schema":{"model":"model"}},
    {"name":"evaluate","input_schema":{"model":"model","X_val":"table","y_val":"vector","metrics":"list"},"output_schema":{"report":"json"}},
    {"name":"explain","input_schema":{"model":"model","X_val":"table"},"output_schema":{"explanations":"json"}},
    {"name":"export_model","input_schema":{"model":"model"},"output_schema":{"artifact_uri":"string"}},
    {"name":"deploy_canary","input_schema":{"artifact_uri":"string","traffic":"float"},"output_schema":{"deployment_id":"string"}},
    {"name":"monitor","input_schema":{"metric_spec":"json"},"output_schema":{"time_series":"json"}}
  ]
}

Promotion rule. A challenger must beat the champion on the primary metric by a predefined margin on validation and canary, must not degrade any key segment beyond a tolerance, and must meet latency and explainability constraints.
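Spelled out in code, the gate might look like this; every margin, tolerance, and report field below is a placeholder, not a recommended value.

def should_promote(challenger: dict, champion: dict,
                   margin: float = 0.005, segment_tolerance: float = 0.01,
                   p99_budget_ms: float = 3.0) -> bool:
    # 1. Beat the champion on the primary metric by a predefined margin.
    if challenger["roc_auc"] < champion["roc_auc"] + margin:
        return False
    # 2. No key segment may degrade beyond the tolerance.
    for segment, score in challenger["segments"].items():
        if score < champion["segments"][segment] - segment_tolerance:
            return False
    # 3. Hard constraints: latency headroom and explainability.
    return (challenger["p99_latency_ms"] <= p99_budget_ms
            and challenger["explainability"] == "high")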


FAQ

Why not fine-tune the LLM for everything?
Many tasks are cheaper, faster, and more interpretable with classic ML. Save large models for cases that demand them.

What about RAG or agents?
The planner is a focused agent with a minimal, typed toolbelt. Narrow scope improves reliability.

Is chain-of-thought required?
Reasoning is required. Store concise rationales and the structured plan for auditability. Verbose traces are optional.

Can this run online?
Yes. Swap in incremental learners such as Hoeffding trees or online Naive Bayes and schedule micro updates under drift budgets.
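A sketch of the incremental variant, assuming the river library and its learn_one / predict_one API.

from typing import Optional
from river import tree

# Incremental learner: a Hoeffding tree that updates one event at a time.
model = tree.HoeffdingTreeClassifier()

def score_and_learn(event: dict, label: Optional[bool] = None):
    # Score first; labels often arrive late in production, so learning is optional.
    pred = model.predict_one(event)  # features as a plain dict
    if label is not None:
        model.learn_one(event, label)
    return pred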


Closing thought

Let the LLM decide. Let classic ML deliver. With a planner that proposes features and configurations, and a harness that enforces metrics and safety, you get adaptability to new attack vectors without giving up speed, clarity, or reliability.