Building a Data Warehouse — Naming Conventions
Naming conventions are one of the first steps toward ensuring proper cataloging, development consistency, and faster onboarding experiences. However, they are often overlooked until later stages, with teams investing large amounts of time in refactoring — and we all know where that leads to.
This document proposes a set of base conventions to be used as a template for data warehousing naming conventions, thus ensuring standards are enforced from the outset of a data engineering project.
TLDR; The document takes us through naming conventions for four different object types:
- File storage naming conventions — How to name and structure file storage for data lakes
- Data object naming conventions — Naming data objects such as tables and view
- Table naming conventions — Naming tables as per the Kimball schema design
- Attribute (column) naming conventions
- Column Naming Conventions (Abbreviations)
Why are naming conventions important?
- It improves readability.
- It speeds up development.
- Onboarding new team members is faster.
It improves the discovery process
When looking for a specific attribute or understanding what it means becomes a simple task just by looking at how they’re named. This is definitely a critical aspect to take into consideration when considering data teams exist to support businesses. More time will be spent in reading these attributes than writing them.
It speeds up development
Having a hard time naming a table, dimension, or attribute in a clear and simple form? Struggling to include additional modifiers that support variations of the same attributes? Naming conventions ensure the developer spends less time being creative on names and can simply focus on developing code.
In addition, having clear conventions facilitates the developer to ensure meaning is not lost, including the proper granularity.
Onboarding new team members
Clear processes are often the fastest way to onboard new team members and make sure they are as productive as possible from early on. Enforcing naming conventions is another part of the fast onboarding, by facilitating not having to dwell on attribute meaning and just following the documentation and understanding design decisions.
Storage Naming Conventions
As modern data storage platforms such as Lakehouse emerge, it is increasingly important to ensure standards are met from the outset and not just in the presentation layer.
When storing data in the lake — S3, Blob Storage — a pattern should be followed according to the corresponding stage.
Object naming conventions
All objects from schemas to blobs should follow four main naming conventions:
- Use single names when possible
– e.g. common not commons - Use lowercase names
– e.g. common not COMMON - Use underscores to join multiple words
– e.g. data_source not datasource - Keep object names short but clear, ideally up to 10 characters
– e.g. dw_aggreation not data_warehouse_aggregation_layer - Time-related objects should be stored in positional notation to facilitate the lookup process
– e.g. when referring to January, use 01 instead of 1
Storage structure layer
The storage structure is highly contingent on the associated layer. However, consider the case of a bronze layer that acts as a sink. These are typically divided by sources or components. With this in mind, the structure should reflect the ability to capture multiple sources and filter the associated data based on specific time ranges.
Note: This is not applicable to formats such as Delta and Hudi, given that in this case, you should strive to keep everything under a single common structure
The Bronze storage pattern should abide by the following rule:
- bronze/%component_name%/%year%/%month%/%day%/%hour%/%minute%/%second%/%source_name%.parquet
For instance, if we consider a Bronze layer which acts as a sink, for a Shopify extraction:
- bronze/shopify/2023/01/09/12/40/12/shopify.parquet
When working with Silver layers, a similar pattern emerges yet, this time, disregarding the time metadata given it is injected directly into your storage layer — Delta, Hudi, Iceberg:
- silver/%component_name%/%partitions%/%source_name%.parquet
Data object naming conventions
Object types should be distinguished based on a suffix that determines the object’s type. The reasoning behind this approach is to allow similar objects to co-exist within the same schema, whilst using logical (views) instances to abstract access to the physical (tables) resource. This enables effective management of the physical instances (tables) without compromising the end result.
Table naming conventions
Table naming conventions are a critical aspect to assist in the process of data cataloging. A proper table naming convention can help both stakeholders and developers alike, facilitating the discovery process in the former case, and ensuring the granularity is respected when developing in the latter.
Attribute (column) naming conventions
Clear attribute conventions are arguably one of the most important steps to ensure your resources are clear and readily available for stakeholders. They are the first step to ensure you don’t get asked questions such as:
I want to know the number of orders, which column should I use? Order Qty or Order Amt?
Attribute conventions must be enforced by both developers and stakeholders alike, requiring constant communication and method to ensure operational concepts match the data warehouses’.
Attribute naming conventions (abbreviations)
Most often than not, especially when working with financial data, we find ourselves constantly reusing the same suffixes to determine specific types of modifiers. For instance, the inclusion or exclusion operators in addition to time delta computation.
This section focuses on modifier abbreviations that can be used in combination with any of the attribute naming conventions.