The design patterns of a business’s data lake are the foundation of its future development and operations. Security is necessary not only to enforce compliance, but also to keep the lake from becoming a "data swamp": a disorganized, undocumented collection of data.
Design
The overall design of a data lake is vital to its success. Coupled with security and the right organizational processes, the data lake will provide valuable new insights, ultimately giving executives more tools for making strategic decisions.
Conversely, a poorly designed data lake quickly becomes a mess of raw data with no efficient way to discover or acquire it, which increases extract, transform and load (ETL) development time and ultimately limits the data lake's success.
The primary design effort should focus on the storage hierarchy and how data will flow from one directory to the next. Though this is a simple concept, as laid out below, it sets up the entire foundation of how a data lake is leveraged, from the data ingestion layer to the data access layer.
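As a minimal sketch of such a hierarchy (the downstream zone and source names here are hypothetical), data might land and flow like this:

    /raw/{source}/{dataset}/{yyyy}/{MM}/{dd}/    <- landing area, data kept in its native format
    /raw/twitter/tweets/2017/06/01/              <- e.g. one day of one dataset from one source
    /curated/{subject}/{dataset}/                <- hypothetical downstream zone for operationalized data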

Data lake zones
Operational zones
Raw zone: This is the first area in the data lake, where data is landed and stored indefinitely in its native raw format. The data lives here until it is operationalized. Operationalization occurs once the business has identified value in the data, which does not happen until a measurement has been defined. The purpose of this area is to keep source data in its raw, unaggregated format; an example would be capturing social media data before knowing how it will be used. Access to this area should be restricted from most of the business. Transformations should not occur during ingestion into the raw zone; rather, data should be moved from source systems to the raw zone as quickly and efficiently as possible.
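For example, landing data in the raw zone can be a straight, untransformed transfer using a tool such as AdlCopy (the account names, container and paths below are hypothetical):

    AdlCopy /source https://landingstore.blob.core.windows.net/twitter/
            /dest swebhdfs://mydatalake.azuredatalakestore.net/raw/twitter/tweets/2017/06/01/
            /sourcekey <storage-account-key>

Because no parsing or aggregation happens in flight, ingestion stays fast and the source data remains intact for whatever future use the business identifies.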
Data tagging – both automated and manual – is also included in this zone. Tags for datasets can be stored in Azure Data Catalog, which allows business analysts and subject matter experts to understand what data lives not only in Azure Data Lake, but in Azure at large.
The folder structure for organizing data is separated by source, dataset and date ingested. Big data tools such as U-SQL can then consume data across multiple folders in a single, non-iterative pass by exposing parts of the folder path as virtual columns.
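As a sketch, assuming tweets land under /raw/twitter/tweets/{yyyy}/{MM}/{dd}/ as CSV files (the dataset and column names are hypothetical), a U-SQL file set can read every daily folder at once and surface the ingestion date as a virtual column:

    @tweets =
        EXTRACT id string,
                text string,
                ingestDate DateTime // virtual column derived from the folder path, not the file contents
        FROM "/raw/twitter/tweets/{ingestDate:yyyy}/{ingestDate:MM}/{ingestDate:dd}/{*}.csv"
        USING Extractors.Csv();

    // Filtering on the virtual column lets the engine prune whole folders
    // instead of iterating over them one at a time.
    @recent =
        SELECT id, text
        FROM @tweets
        WHERE ingestDate >= new DateTime(2017, 6, 1);

Because the date lives in the path rather than inside the files, new daily folders are picked up automatically without changing the script.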

