Building ML Models in Microsoft Fabric

Train and deploy machine learning models in Microsoft Fabric using Data Science capabilities. MLflow integration, model registry, and batch scoring.

By Administrator

Microsoft Fabric Data Science provides a complete ML lifecycle environment integrated directly into the analytics platform—from data exploration and feature engineering to model training, experiment tracking, deployment, and batch scoring. Unlike standalone ML platforms (SageMaker, Vertex AI, Databricks ML) that require separate data movement pipelines, Fabric Data Science operates directly on OneLake data, eliminating the traditional gap between data engineering and data science. Models trained in Fabric can score data in Lakehouses, power predictions in Power BI reports, and run as batch scoring jobs—all within the same capacity and governance framework. Our Microsoft Fabric consulting team helps organizations implement production ML workflows within the Fabric platform.

Fabric Data Science Architecture

| Component | Purpose | Key Feature |
|---|---|---|
| Notebooks | Interactive model development | Python, PySpark, R; pre-installed ML libraries |
| Experiments (MLflow) | Track training runs | Parameters, metrics, artifacts, model comparison |
| Model Registry | Version and manage models | Stage management, lineage tracking, deployment |
| Batch Scoring | Score data at scale | Spark-based, Lakehouse input/output |
| PREDICT function | In-database scoring | SQL and Spark PREDICT() for real-time inference |
| SynapseML | Pre-built ML capabilities | AutoML, cognitive services, distributed training |

All components share OneLake storage and Fabric capacity, so there is no data duplication between your data engineering Lakehouses and your ML training environments.

End-to-End ML Workflow

Phase 1: Data Exploration and Feature Engineering

Start in a Fabric notebook with direct access to Lakehouse tables:

Load data from OneLake: Read Delta tables directly using Spark DataFrames—no data copy or export required. The same Silver and Gold layer tables prepared by your data engineering team are immediately available for ML.
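A minimal sketch of this step, assuming a hypothetical Gold-layer table name (`spark` is the Spark session that Fabric notebooks predefine):

```python
# Sketch: loading a Gold-layer Delta table in a Fabric notebook.
# The table name is a hypothetical placeholder; `spark` is predefined in Fabric.

GOLD_TABLE = "gold_customer_orders"  # hypothetical Lakehouse table

def load_training_data(spark, table_name=GOLD_TABLE):
    """Read a Lakehouse Delta table into a Spark DataFrame — no copy or export."""
    return spark.read.table(table_name)

# In a Fabric notebook you would then run:
# df = load_training_data(spark)
# df.printSchema()
```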

Exploratory Data Analysis (EDA): Use pandas, matplotlib, seaborn, and plotly (all pre-installed) for visualization and statistical analysis. Fabric notebooks render plots inline, making iterative EDA fast and visual.

Feature Engineering: Create training features by:

  • Aggregating transactional data (customer lifetime value from order history)
  • Encoding categorical variables (one-hot, target encoding)
  • Creating time-based features (days since last purchase, rolling averages)
  • Joining data from multiple Lakehouse tables using Spark SQL
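The aggregation, time-based, and encoding steps above can be sketched with pandas on a small in-memory sample (in Fabric you would read these rows from a Lakehouse table; all names are illustrative):

```python
# Illustrative feature engineering on a tiny in-memory sample.
import pandas as pd

orders = pd.DataFrame({
    "customer_id": [1, 1, 2, 2, 2],
    "amount": [120.0, 80.0, 40.0, 60.0, 100.0],
    "order_date": pd.to_datetime(
        ["2024-01-05", "2024-03-10", "2024-02-01", "2024-02-20", "2024-04-01"]),
    "segment": ["retail", "retail", "wholesale", "wholesale", "wholesale"],
})

as_of = pd.Timestamp("2024-05-01")  # scoring/reference date

# Aggregate transactional data per customer
features = (orders.groupby("customer_id")
            .agg(lifetime_value=("amount", "sum"),
                 order_count=("amount", "size"),
                 last_order=("order_date", "max"))
            .reset_index())

# Time-based feature
features["days_since_last_purchase"] = (as_of - features["last_order"]).dt.days

# One-hot encode each customer's (modal) segment
segment = orders.groupby("customer_id")["segment"].agg(lambda s: s.mode()[0])
features = features.join(pd.get_dummies(segment, prefix="segment"),
                         on="customer_id")
```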

Save features to a Feature Table: Write engineered features back to a dedicated Lakehouse table. This creates a reusable feature store that multiple experiments can reference—ensuring consistency between training and scoring.
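The write-back can be sketched as follows (the feature table name is hypothetical; `features_df` stands for the engineered Spark DataFrame):

```python
# Sketch: persisting engineered features as a reusable Lakehouse feature table.
# The table name is a hypothetical placeholder.

FEATURE_TABLE = "ml_features_customer_churn"

def save_feature_table(features_df, table_name=FEATURE_TABLE):
    # Overwrite keeps the table in sync with the latest feature run;
    # Delta versioning still preserves prior versions for time travel.
    (features_df.write
        .format("delta")
        .mode("overwrite")
        .saveAsTable(table_name))
```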

Phase 2: Model Training

Fabric notebooks support all major ML frameworks with no additional installation:

| Framework | Best For | Pre-installed |
|---|---|---|
| Scikit-learn | Traditional ML (classification, regression, clustering) | Yes |
| XGBoost / LightGBM | Gradient boosting (tabular data, Kaggle-winning algorithms) | Yes |
| PyTorch | Deep learning (NLP, computer vision, custom architectures) | Yes |
| TensorFlow/Keras | Deep learning (production deployment, TF Serving) | Yes |
| SynapseML | AutoML, pre-built cognitive services, distributed training | Yes |
| Prophet | Time series forecasting | Yes |

For tabular business data (customer churn prediction, demand forecasting, lead scoring), scikit-learn and XGBoost/LightGBM deliver the best results with the simplest workflow. Deep learning frameworks are needed primarily for unstructured data (text classification, image recognition).

Phase 3: Experiment Tracking with MLflow

Fabric natively integrates MLflow for experiment management. Every training run should be tracked:

Autologging: Enable MLflow autologging at the start of your notebook. Fabric automatically logs all training parameters, performance metrics, and model artifacts for scikit-learn, XGBoost, LightGBM, PyTorch, and TensorFlow models without writing explicit logging code.

Experiment Comparison: The Fabric Experiments UI provides a visual comparison of all runs in an experiment—parameter values, metric charts, and artifact inspection side by side. Identify the best-performing model configuration quickly without building custom comparison code.

Key Metrics to Track:

  • Classification: Accuracy, Precision, Recall, F1, AUC-ROC, confusion matrix
  • Regression: RMSE, MAE, R-squared, residual distribution
  • Forecasting: MAPE, MASE, forecast vs actuals visualization
  • Training metadata: Training duration, data size, feature count, hyperparameters
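The classification metrics listed above can be computed with scikit-learn (pre-installed in Fabric); a toy example on hand-written labels and scores:

```python
# Computing standard classification metrics with scikit-learn on toy data.
from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             f1_score, roc_auc_score, confusion_matrix)

y_true  = [0, 0, 1, 1, 1, 0, 1, 0]          # ground-truth labels
y_score = [0.1, 0.4, 0.8, 0.3, 0.9, 0.2, 0.7, 0.6]  # model probabilities
y_pred  = [1 if s >= 0.5 else 0 for s in y_score]    # 0.5 decision threshold

metrics = {
    "accuracy":  accuracy_score(y_true, y_pred),
    "precision": precision_score(y_true, y_pred),
    "recall":    recall_score(y_true, y_pred),
    "f1":        f1_score(y_true, y_pred),
    "auc_roc":   roc_auc_score(y_true, y_score),  # uses raw scores, not labels
}
cm = confusion_matrix(y_true, y_pred)  # rows: actual, columns: predicted
```

With autologging enabled, these values are captured per run; logging them explicitly with `mlflow.log_metrics(metrics)` also works.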

Phase 4: Model Registration and Versioning

Once you identify the best model from your experiments, register it in the Fabric Model Registry:

  1. From the experiment run details, click "Register Model"
  2. Name the model descriptively (e.g., "customer-churn-classifier-v3")
  3. Add a description documenting the model's purpose, training data, and performance
  4. Set the model stage: None → Staging → Production → Archived

The registry tracks model lineage—linking each registered model back to the specific experiment run, training code, data version, and hyperparameters that produced it. This is essential for audit, reproducibility, and compliance.

Phase 5: Deployment and Scoring

Batch Scoring: The most common deployment pattern for business analytics. Create a notebook that loads the registered model, reads new data from a Lakehouse table, generates predictions, and writes results back to a Lakehouse table. Schedule this notebook to run daily, weekly, or after each data refresh.

PREDICT Function: Fabric supports a PREDICT() function usable in SQL and Spark notebooks. Register your model, then call PREDICT directly in SQL queries against Lakehouse tables—enabling prediction scoring without Python code.

Power BI Integration: Connect Power BI to the Lakehouse table containing prediction results. Build reports that show predicted customer churn risk alongside actual customer metrics, forecast demand alongside inventory levels, or scored leads alongside sales pipeline data.

AutoML with SynapseML

For organizations without dedicated data science teams, SynapseML provides automated machine learning:

  • Automated feature engineering: Detects feature types and applies appropriate transformations
  • Algorithm selection: Tests multiple algorithms (logistic regression, random forest, gradient boosting, neural networks) and selects the best performer
  • Hyperparameter tuning: Grid search and Bayesian optimization across the parameter space
  • Model explanation: Generates feature importance rankings and partial dependence plots

AutoML does not replace expert data science for complex problems, but it provides a strong baseline that can be deployed quickly while more sophisticated models are developed.

Best Practices

  • Version training data: Use Delta Lake time travel to pin the exact dataset version used for each experiment. This ensures reproducibility.
  • Separate feature engineering from training: Reusable feature tables enable multiple experiments without re-computing features each time
  • Enable autologging early: Track everything from the first experiment. You cannot retroactively log parameters from untracked runs.
  • Use the staging workflow: Never deploy directly to production. Stage models, validate against holdout data, compare with the current production model, then promote.
  • Monitor model drift: Schedule periodic comparisons of model predictions vs actual outcomes. When accuracy degrades beyond threshold, retrain.
  • Document business context: Register models with clear descriptions of what they predict, what actions the business should take on predictions, and known limitations.
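The first practice, pinning the dataset version with Delta time travel, can be sketched as follows (table name and version number are hypothetical; `spark` is predefined in Fabric notebooks):

```python
# Sketch: pinning the training dataset with Delta time travel.
# Table name and version are hypothetical; find versions via DESCRIBE HISTORY.

TRAIN_TABLE  = "ml_features_customer_churn"
DATA_VERSION = 42  # the Delta table version used for this experiment

def load_pinned_training_data(spark, version=DATA_VERSION):
    # versionAsOf reads the table exactly as it was at that Delta version,
    # so the experiment is reproducible even after later feature refreshes.
    return spark.read.option("versionAsOf", version).table(TRAIN_TABLE)

# Log the version alongside the run, e.g. mlflow.log_param("data_version", DATA_VERSION)
```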

Frequently Asked Questions

What ML frameworks does Fabric support?

Fabric supports Scikit-learn, PyTorch, TensorFlow, XGBoost, LightGBM, and other Python-based ML libraries. You can install additional packages as needed.

Can I use AutoML in Microsoft Fabric?

Yes, Fabric includes automated machine learning capabilities that can automatically select algorithms, tune hyperparameters, and generate feature engineering suggestions.
