As the volume, velocity and variety of their data grow, businesses increasingly depend on data lakes for data storage, governance, blending and analysis. A data lake is a system that stores data in its raw format. It enables businesses to collect a larger volume and variety of data without the rigidity and overhead of traditional data warehouse architectures. Additionally, data lakes provide a place for data-focused users to experiment with datasets and find value without involving IT or spinning up a large project.
The case for a data lake
Traditional enterprise data warehouses (EDW) and data marts require days or weeks to plan, design, model and develop before data is made visible to end-users. During this period, key elements in the business may have changed, requiring re-design and protracting time-to-value. EDW rigidity and rigor often push end-users to build their own solutions using spreadsheets, local databases and other proprietary tools. This inevitably creates data silos, shadow IT and a fragmented data landscape. Furthermore, because few business data resources are cataloged, the business draws on only a limited set of data to answer its questions, and decision makers end up acting on incomplete information.
A well-designed data lake balances the structure provided by a data warehouse with the flexibility of a file system. It’s important to understand that a data lake is different from a file system: raw data cannot simply be added to the lake and made available to the business. Process and design are still required to enable end-users, and security and governance still apply. Applying a specific architecture enables the business to be nimble while retaining control.
While the beneficiaries of a data lake solution span the entire business, the users who access the lake directly are limited. Not all business users will be interested in accessing the data lake directly, nor should they spend their time on data blending and analysis. Instead, the whole of the lake is made available to data scientists and analysts, while a vetted and curated dataset is made available to the business at large.
Data lake architecture
The overall architecture and flow of a data lake can be categorized into three primary pillars: operations, discovery and organization.
Operations
Data movement covers the ingestion of data into the data lake and its extraction from the lake. Data might be pushed or pulled depending on the technology chosen and the purpose of the movement. Ingestion is the most critical component because the source systems can produce a variety of data streams and formats.
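To make the push and pull distinction concrete, here is a minimal Python sketch of ingestion into a raw zone. It is an illustration under stated assumptions, not a reference implementation. The function names (land_raw, pull, push), the raw/<source>/ directory layout and the sidecar metadata file are all hypothetical choices made for this example. A real lake would typically sit on object storage or a distributed file system instead of a local directory.

```python
import hashlib
import json
import tempfile
from datetime import datetime, timezone
from pathlib import Path
from typing import Callable


def land_raw(lake_root: Path, source: str, name: str, payload: bytes) -> Path:
    """Store a payload unchanged in the raw zone, next to a small metadata file.

    The raw/<source>/ layout and the .meta.json sidecar are assumptions
    made for this sketch.
    """
    target_dir = lake_root / "raw" / source
    target_dir.mkdir(parents=True, exist_ok=True)

    target = target_dir / name
    target.write_bytes(payload)  # no parsing or conversion: the raw format is kept

    metadata = {
        "source": source,
        "name": name,
        "bytes": len(payload),
        "sha256": hashlib.sha256(payload).hexdigest(),
        "ingested_at": datetime.now(timezone.utc).isoformat(),
    }
    sidecar = target_dir / (name + ".meta.json")
    sidecar.write_text(json.dumps(metadata, indent=2))
    return target


def pull(lake_root: Path, source: str, name: str, fetch: Callable[[], bytes]) -> Path:
    """Pull: the lake asks the source system for data, e.g. on a schedule."""
    return land_raw(lake_root, source, name, fetch())


def push(lake_root: Path, source: str, name: str, payload: bytes) -> Path:
    """Push: the source system hands data to the lake when it has some."""
    return land_raw(lake_root, source, name, payload)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        # Pull a CSV export from a (hypothetical) CRM system.
        pull(root, "crm", "accounts.csv", lambda: b"id,name\n1,Acme\n")
        # Accept a JSON event pushed by a (hypothetical) sensor gateway.
        push(root, "sensors", "reading.json", b'{"t": 21.5}')
        for path in sorted(root.rglob("*")):
            if path.is_file():
                print(path.relative_to(root))
```

Both paths end in the same landing step, so CSV, JSON or any other format arrives byte for byte as the source produced it. The sidecar metadata gives later cataloging and governance steps something to work with.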

