Navigating the Fabric of Modern Data

Our Journey with data platform engineering in Microsoft Fabric

What a journey it has been, discovering the world of Modern Data Platform Engineering in Fabric. Our exploration through the landscape of Fabric has taught us a lot. At Navara, we’ve always been committed to adhering to the best practices of Software Engineering. So, how did we uphold these standards in our approach to this groundbreaking and unique technology? In this blog post, we invite you to join us on our journey through each step, as well as the many paths that are yet to be discovered.

To understand what we did and why we did it, it’s crucial to first grasp what Microsoft Fabric is. Microsoft Fabric is an integrated data platform that unifies data engineering, integration, and analytics in a single environment. It transforms how companies handle data by simplifying workflows, enhancing collaboration, and reducing the need for multiple separate tools. It offers robust connections with other major players in the big data ecosystem, like Snowflake and Databricks, and provides easy access to storing data, making sense of that data, and optionally using it for AI. With Fabric, companies can build their own end-to-end data solutions while needing little data engineering and data science knowledge.

Therefore, our approach should empower these users to be self-sufficient and effectively leverage Fabric’s capabilities. We focused on creating a platform that not only adheres to best practices but also guides users in adopting these standards effortlessly—ensuring that even those without deep technical backgrounds can confidently build, maintain, and evolve their data solutions.

What it promises to be and what it currently is are still somewhat apart. Those limitations, however, can be solved by introducing some engineering grit. At the same time, Microsoft is offering significant backing for this project, and the number of major players that have joined the Microsoft Fabric platform gives assurance that it is here to stay and will improve over the long term.

For now though, we have been working with Fabric for a while and we came to enjoy Fabric’s potential, such as the ease of use for setting up complex data frameworks, and dislike some of its limitations, like the absence of proper CI/CD integration. So, we set out to harness its capabilities while maintaining our commitment to software engineering best practices. As we delve into our Fabric journey, we’ll explore how we’ve implemented Software Engineering best practices in the form of CI/CD, Monitoring, Data Engineering Principles and more.

Data Engineering

As part of building our data platform, we started by designing, implementing and testing Best Practices in the Data Engineering domain, such as the Medallion Architecture, compartmentalized data pipelines and reusable Notebooks.

We implemented the well-known Medallion architecture, a data design pattern that organizes data into three distinct layers:

  • Bronze (Raw): This layer contains the raw, unprocessed data as it’s ingested from various sources.
  • Silver (Refined): Here, the data is cleansed, validated, and transformed into a more structured format.
  • Gold (Business-Ready): The final layer where data is aggregated and optimized for specific business use cases and analytics.

This layered approach enhances data quality, traceability, and reusability, allowing for more efficient data processing and analysis.
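To make the layered idea concrete, here is a minimal sketch in plain Python, standing in for the Spark notebooks we actually use; the field names ("region", "amount") and the cleansing rules are purely illustrative:

```python
# Minimal medallion sketch: plain Python stands in for Spark notebooks.
# Field names and validation rules are illustrative, not our real schema.

def to_silver(bronze_rows):
    """Cleanse and validate raw (bronze) records into the silver layer."""
    silver = []
    for row in bronze_rows:
        if row.get("amount") is None:  # drop invalid records
            continue
        silver.append({
            "region": row["region"].strip().upper(),  # normalize
            "amount": float(row["amount"]),
        })
    return silver

def to_gold(silver_rows):
    """Aggregate silver records into a business-ready (gold) view."""
    totals = {}
    for row in silver_rows:
        totals[row["region"]] = totals.get(row["region"], 0.0) + row["amount"]
    return totals

bronze = [
    {"region": " eu ", "amount": "10.5"},
    {"region": "US", "amount": None},  # bad record, filtered out in silver
    {"region": "EU", "amount": "4.5"},
]
print(to_gold(to_silver(bronze)))  # {'EU': 15.0}
```

Each layer only consumes the output of the previous one, which is what gives the pattern its traceability: a bad gold number can be traced back through silver to the exact raw record.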

To further enhance the platform’s resilience, we focused on reducing the Mean Time to Recovery (MTTR) by compartmentalizing data pipelines. By breaking down pipelines into smaller, independent units, we can isolate and address issues more efficiently. This modular approach not only speeds up recovery when problems arise but also makes the overall system easier to troubleshoot and scale.

Moreover, we made it a priority to de-duplicate code throughout our data engineering processes, by leveraging reusable notebooks and building Directed Acyclic Graphs (DAGs) to orchestrate multiple of these notebooks to run simultaneously. This practice simplified maintenance, reduced errors, and ultimately made the system more robust.
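The DAG idea can be sketched as follows; `run_notebook` is a stand-in for the real notebook invocation inside Fabric, and the notebook names are made up:

```python
# Sketch of DAG-based notebook orchestration using the standard library.
# run_notebook stands in for the actual notebook call inside Fabric.
from graphlib import TopologicalSorter

def run_notebook(name):
    # Placeholder for invoking a reusable notebook in Fabric.
    return f"ran {name}"

# Each notebook maps to the set of notebooks it depends on.
dag = {
    "ingest_sales": set(),
    "ingest_hr": set(),
    "clean": {"ingest_sales", "ingest_hr"},
    "aggregate": {"clean"},
}

def execute(dag):
    ts = TopologicalSorter(dag)
    ts.prepare()
    results = []
    while ts.is_active():
        ready = ts.get_ready()      # notebooks whose dependencies are done;
        for node in ready:          # these could run simultaneously
            results.append(run_notebook(node))
            ts.done(node)
    return results

print(execute(dag))  # the two ingest notebooks first, 'aggregate' last
```

Both ingest notebooks become ready at the same time and could run in parallel, while `clean` and `aggregate` wait for their dependencies, which is exactly the behavior we want from the orchestrated pipelines.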

Fabric Deployment Pattern

In the Data Engineering domain we wanted to have a secure and patterned approach to setting up data and code management. When we were designing our Fabric Deployment Pattern, there were a few key considerations to keep in mind: the separation of roles, management of access levels, efficient collaboration between teams, ensuring adherence to governance standards and more. Some basic Patterns are explained in Deployment patterns for Microsoft Fabric, but they do not tell the whole story.

The Deployment Pattern we have landed on separates the responsibilities of Data Engineers, Data Modellers and Power BI developers, which enforces distinct Data Governance. The figure below provides a high-level overview of how this Deployment Pattern is structured. Currently, no domains are shown in the pattern. However, it can be organized by domain (e.g., HR, Sales, etc.) if needed, with the overall explanation remaining consistent regardless of such divisions.

For this approach, the workspaces are split into Bronze-Silver, Data-Modelling (Gold), and Analytics. This Deployment Pattern emphasizes data governance and security by separating the responsibilities of Data Engineers, Data Modellers and Power BI developers across distinct workspaces in the Medallion data architecture.

Bronze-Silver Workspace contains both raw and transformed data. Raw data is ingested, stored, and then cleaned and transformed within this workspace, ensuring consistent governance over the entire data processing pipeline. Access is tightly controlled, allowing only authorized data engineers to work with and transform the data, maintaining security and data quality throughout the process.

The Data-Modelling (Gold) Workspace delivers curated, business-ready datasets derived from the Bronze-Silver workspace. This separation ensures that only vetted, refined data is available for consumption, limiting exposure and enhancing data security. The Gold layer offers controlled access, preventing unauthorized users from making changes, thereby safeguarding data integrity.

The Analytics Workspace is dedicated to Power BI developers for creating reports and dashboards using the trusted, curated data from the Gold workspace. By keeping analytics work separate, it maintains data integrity while ensuring analytics teams have secure, governed access to consistent and reliable datasets.

By organizing these distinct workspaces, the Deployment Pattern upholds strict data governance, enforces access control, and minimizes security risks at each step of the data lifecycle.

Continuous Integration & Continuous Delivery

Robust data engineering now formed the foundation of our approach. However, we quickly realized that while Fabric offered powerful capabilities, its built-in Deployment Pipelines didn’t quite meet our standards for CI/CD. The Deployment Pipelines were somewhat unreliable and lacked the transparency we needed to track changes effectively. CI/CD is crucial for modern software development, as it automates the process of integrating code changes, running tests, and deploying updates, resulting in faster, more reliable software releases and improved team productivity. So, we rolled up our sleeves and got to work crafting a solution that would bring these engineering best practices to Fabric.

The first element we focused on was deploying infrastructure robustly. So we started creating a custom Fabric Terraform provider tailored to our specific needs. Interestingly, just as we were deep into this process, Microsoft surprised us by releasing their own provider – a plot twist we hadn’t seen coming as it wasn’t on the roadmap! Since then, we’ve been in close contact with Microsoft to collaborate on shaping their official Terraform provider.
With a Terraform provider available now, we turned our attention to building a robust CI/CD pipeline that would leverage this foundation. We’ve built a robust CI/CD pipeline using GitHub Actions, which runs our test suites utilizing the Data Factory Testing Framework and pytest for locally executable code. To maintain the integrity of our tests, we’ve implemented mocks for Fabric-specific code. This approach allows us to thoroughly validate our changes without relying on external dependencies. End-to-end testing is something we still have planned to implement in the future.

Our workflow also embraces the classic dev/test/prod environment model, but with a Fabric twist. When changes are pushed from the Fabric environment, they’re automatically created as a pull request to the test environment (in our case in GitHub, but it can also be integrated with Azure DevOps). Here, the changes face a gauntlet of automated tests. Only after passing these tests and receiving additional human scrutiny can changes be merged into the production environment. To keep everything in sync, we’ve developed custom scripts that automatically align the Fabric environment with our repository.
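A trimmed-down sketch of such a pull-request workflow could look like this (branch names, paths and package names are illustrative, not our actual configuration):

```yaml
# Illustrative GitHub Actions workflow: run the test suite on every PR.
name: fabric-ci
on:
  pull_request:
    branches: [test, main]
jobs:
  test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: "3.11"
      - run: pip install pytest data-factory-testing-framework
      - run: pytest tests/  # Fabric-specific calls are mocked, so this runs locally
```

Because the Fabric-specific calls are mocked, the same suite runs identically on a developer laptop and in the CI runner.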

Observability

With our data pipelines and deployment processes in place, the next challenge was ensuring visibility into our entire data ecosystem. Data governance and monitoring are crucial if you want to have any control over your systems. We’ve integrated Microsoft Purview for comprehensive data governance, utilizing its data catalog for access control management and keeping track of what data is available for consumption. The Terraform provider helps here as well: we use it to automate the creation of default security roles and groups, as well as workspaces, for easy responsibility separation. This streamlines access management.

For data quality assurance, we’ve developed custom SQL queries that run against various data sources. These insights are communicated to data owners via Power BI reports, fostering a culture of data quality improvement. We’ve also implemented Data Activator for real-time notifications when data quality falls below predefined thresholds.
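As a simplified illustration of such a quality check (sqlite3 stands in for the real data sources, and the table, column and threshold are made up), here is a query that flags a table whose null rate exceeds a threshold:

```python
# Simplified data-quality check: compute the null rate of a column and
# compare it to a threshold. sqlite3 stands in for the real data sources.
import sqlite3

THRESHOLD = 0.10  # illustrative: alert if more than 10% of emails are missing

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE customers (id INTEGER, email TEXT)")
con.executemany(
    "INSERT INTO customers VALUES (?, ?)",
    [(1, "a@x.com"), (2, None), (3, "c@x.com"), (4, "d@x.com")],
)

null_rate = con.execute(
    "SELECT 1.0 * SUM(CASE WHEN email IS NULL THEN 1 ELSE 0 END) / COUNT(*) "
    "FROM customers"
).fetchone()[0]

# In our setup, results like this feed a Power BI report, and Data Activator
# fires a notification when the threshold is breached.
print(f"null rate: {null_rate:.0%}, alert: {null_rate > THRESHOLD}")
```

The same pattern generalizes to other checks (duplicates, out-of-range values, stale partitions) by swapping the query.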

Our monitoring strategy has two main parts:

1. Data Quality and Governance:
– Microsoft Purview for data cataloging and lineage tracking.
– Custom SQL queries for data quality checks.
– Power BI for visualizing data quality metrics.
– Data Activator for anomaly detection and alerting.

2. Process Monitoring:
– Fabric’s built-in Monitor for overseeing data pipelines and notebooks.
– Custom logging system that stores ingestion logs in a dedicated warehouse.
– SQL queries against this warehouse to identify bottlenecks.
– Usage tracking to monitor table access frequency through Fabric Scanner API.
– Slack integrations and Data Activator for immediate notifications.
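The process-monitoring side can be sketched the same way: ingestion logs land in a warehouse table, and a query surfaces the slowest pipelines as bottleneck candidates (sqlite3 and the schema below are illustrative stand-ins for our actual logging warehouse):

```python
# Sketch of process monitoring: query ingestion logs for bottlenecks.
# sqlite3 and the log schema are stand-ins for our logging warehouse.
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE ingestion_log (pipeline TEXT, duration_s REAL)")
con.executemany(
    "INSERT INTO ingestion_log VALUES (?, ?)",
    [("sales", 120.0), ("sales", 180.0), ("hr", 30.0), ("crm", 600.0)],
)

# Average duration per pipeline, slowest first: candidate bottlenecks.
bottlenecks = con.execute(
    "SELECT pipeline, AVG(duration_s) AS avg_s "
    "FROM ingestion_log GROUP BY pipeline ORDER BY avg_s DESC"
).fetchall()

print(bottlenecks)  # [('crm', 600.0), ('sales', 150.0), ('hr', 30.0)]
```

In practice, a result like this is what triggers a Slack or Data Activator notification rather than a print statement.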

This dual approach ensures we maintain high data quality standards while keeping our data processes running smoothly. By leveraging both Fabric’s native tools and our custom solutions, we’ve created a robust, observable data ecosystem that allows us to proactively manage our data platform.

Future of Fabric

So, we have demonstrated the power of combining cutting-edge technology like Fabric with time-tested software engineering principles and we have shown that it is very much possible! Now, what’s next for Fabric and our journey within it? At FabCon 2024 in Stockholm, Microsoft dropped a bunch of exciting new features—full integration of Copilot, new AI tools like the AI Skill, a native Fabric Spark engine, and a lot more. It’s clear they’re doubling down on Fabric, integrating it with existing services like AI Studio and existing big data providers like Snowflake, which means there’s a lot of potential ahead.

With a strong foundation already built, our next move is to bring all these new features into our setup. We’re planning to make full use of Copilot to create more intuitive, AI-driven workflows. Adding the native Fabric Spark engine will help us get more out of our data processing, while AI Skill and other AI upgrades will let us introduce some smarter machine learning and automation features.

We’re also ready to take data analytics in Fabric to the next level. That means building advanced analytics solutions that go beyond what we’ve done before—like tapping into predictive analytics, real-time insights, and more sophisticated visualizations. Plus, we’ll be putting more focus on data governance and compliance, making sure everything stays organized, transparent, and secure as we scale up.

Another big goal is to keep collaborating with Microsoft as they roll out more features. We want to stay involved, give feedback, and help shape where Fabric goes from here. By staying ahead of the curve, we’ll make sure our platform remains a best-in-class example, ready to take on whatever the future holds. Follow us if you would like to receive more content about the “Navara Fabric journey”.

If you are interested in discussing your challenges with the NAVARA Fabric team, please let us know. We would love to get into contact with like-minded users and engineers of Fabric who are keen to embrace new technologies.

Written by:
Sem Uijen, Microsoft Fabric Engineer (MSc Data Science)
Abe Brandsma, Microsoft Azure Engineer (MSc Artificial Intelligence)

