September 10th, 2024
Understanding the intermediate layer in data transformation projects is vital. This stage serves as a bridge between raw data and final data products such as dashboards or machine learning models, and examining it demystifies both its purpose and its structure, revealing how it simplifies complex transformations. In the intermediate layer, atomic building blocks are brought together into more intricate, connected molecular shapes: varied forms with specific purposes that are essential stepping stones toward more complex data products.

Structurally, intermediate models live in subdirectories under models/intermediate, organized by business grouping. Unlike the staging layer, the intermediate layer is business-conformed, meaning models are split by their area of business concern rather than by their source system; finance-related models, for example, might be housed in a finance subdirectory. This organization enhances clarity and aligns data models with business needs, making it easier for stakeholders to understand and use the data appropriately.

Naming conventions play a key role in making the transformations within this layer understandable. File names typically follow the pattern int_[entity]s_[verb]s.sql, which makes it possible for anyone, even someone who doesn't know SQL, to grasp what is happening in each model just by reading the file name. For instance, a model named int_payments_pivoted_to_orders clearly indicates that it pivots payments to the order grain.
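To make that convention concrete, here is a minimal sketch of what a model like int_payments_pivoted_to_orders might contain. This is an illustration, not the canonical implementation: the stg_payments staging model, its columns (order_id, payment_method, amount), and the list of payment methods are all assumptions.

```sql
-- models/intermediate/finance/int_payments_pivoted_to_orders.sql
-- Sketch only: pivots one-row-per-payment data up to order grain.
-- Assumes a stg_payments staging model with order_id, payment_method,
-- and amount columns; the payment method list is illustrative.

{% set payment_methods = ['credit_card', 'bank_transfer', 'gift_card'] %}

with payments as (

    select * from {{ ref('stg_payments') }}

),

pivoted as (

    select
        order_id,
        {% for payment_method in payment_methods %}
        sum(case when payment_method = '{{ payment_method }}' then amount else 0 end)
            as {{ payment_method }}_amount{% if not loop.last %},{% endif %}
        {% endfor %}
    from payments
    group by order_id

)

select * from pivoted
```

The file name alone tells a reader the input (payments), the operation (pivoted), and the output grain (orders), before they read a line of SQL.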
One important principle to keep in mind is to avoid over-optimizing too early. The goal is a single source of truth, where departments such as finance and marketing operate on unified models rather than separate ones; over-optimizing leads to unnecessary complexity and fragmentation. If there are fewer than ten mart models and no significant friction in developing and using them, it is advisable to forgo extensive subdirectory structures until the project grows large enough to need them. Subdirectories implemented thoughtfully keep larger projects manageable without burdening smaller ones.

Intermediate models should generally not be exposed in the main production schema, since they are not intended for final output targets like dashboards or applications. Keeping them separate means only the necessary models are exposed to end users, preserving a clean, well-governed production environment. Two materialization strategies are common. The first is to materialize them ephemerally, which keeps unnecessary models out of the warehouse and requires minimal configuration; the trade-off is harder troubleshooting, because ephemeral models are interpolated into the models that reference them rather than existing independently. The alternative is to materialize them as views in a custom schema with special permissions, which provides added insight during development and makes troubleshooting easier as the number and complexity of models grow. This second approach requires a bit more setup, but the benefits in clarity and manageability often outweigh the initial effort.

Whichever strategy is chosen, maintaining a tidy data warehouse is crucial. The organizational knowledge graph encoded into dbt includes the DAG, the files and folder structures, and the warehouse output itself, so well-named, well-grouped schemas, tables, and views are essential to a user-friendly experience. This organization aids data governance and discoverability, ensuring that intermediate models serve their purpose without cluttering the main production environment.
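As a sketch of how those two materialization strategies might be expressed, dbt's config() macro can be set at the top of a model file. The custom schema name below is a hypothetical choice, not a prescribed one.

```sql
-- At the top of an intermediate model file, e.g.
-- models/intermediate/finance/int_payments_pivoted_to_orders.sql.
-- Pick ONE of these config calls per model.

-- Option 1: ephemeral. The model is never built in the warehouse;
-- dbt interpolates its SQL as a CTE into whatever model refs it.
{{ config(materialized='ephemeral') }}

-- Option 2: a view in a custom schema. The name 'intermediate' is a
-- hypothetical choice; by default dbt appends it to the target schema
-- (e.g. analytics_intermediate). Pair it with warehouse-side
-- permissions to keep it out of end users' way.
{{ config(materialized='view', schema='intermediate') }}
```

The same choices can also be applied to a whole directory at once from the project configuration, which avoids repeating the macro in every file.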
In practice, intermediate models address a few recurring needs. The common use cases are structural simplification, re-graining, and isolating complex operations.

Structural simplification means combining a reasonable number of entities or concepts into intermediate models that can then be joined to generate a mart. Each intermediate model handles one piece of the overall complexity, which reduces the complexity of the mart itself and yields increased readability, greater flexibility, and a more approachable testing surface area.

Re-graining adjusts the grain of a model to ensure it is at the correct level of detail before it is combined with other components. For example, if a mart for order items requires fanning out orders based on quantity, creating a new row for each individual item, that fan-out belongs in its own intermediate model. This separation maintains clarity and guarantees the grain is correct before the data is mixed with anything else.

Isolating complex operations follows the same logic: moving particularly complex or difficult-to-understand logic into its own intermediate model makes it easier to refine and troubleshoot, and makes downstream models more readable. In the quantity fan-out example, isolating the logic allows for quick debugging and thorough testing before the transformation is integrated into other models.

In short, the intermediate layer bridges the gap between raw data and final data products by creating organized, understandable, and manageable models. Separating intermediate models from the main production schema, choosing an appropriate materialization strategy, and applying these common patterns makes data models more readable, more flexible, and easier to troubleshoot, ultimately supporting more effective and accurate data products, better decision-making, and better insights.
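Finally, as a concrete illustration of the re-graining and isolation patterns above, here is a minimal sketch of the quantity fan-out. The stg_order_items model and its columns are assumptions, and the generate_series call assumes a Postgres-style warehouse; other warehouses have equivalent functions.

```sql
-- models/intermediate/int_order_items_fanned_out.sql
-- Sketch only: re-grains one row per (order, item) into one row per
-- individual unit, so downstream marts can count and join at unit grain.
-- Assumes stg_order_items(order_id, item_sku, quantity) and a warehouse
-- that supports generate_series(), e.g. Postgres.

with order_items as (

    select * from {{ ref('stg_order_items') }}

)

select
    order_items.order_id,
    order_items.item_sku,
    units.unit_number
from order_items
cross join lateral
    generate_series(1, order_items.quantity) as units(unit_number)
```

Because the fan-out lives in its own model, it can be tested in isolation (for instance, asserting that the row count equals the sum of quantities) before any mart depends on it.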