Traditional enterprise data warehouses (EDWs) and data marts require days or weeks of planning, modeling, and development before data becomes visible to end users. A well-designed data lake balances the structure of a data warehouse with the flexibility of a file system. The overall architecture and flow of a data lake can be grouped into three primary pillars: operations, discovery, and organization. This article, part of our series on Azure Data Lakes, focuses on the operations pillar.
The operations of a data lake involve data movement, processing and orchestration, which we explore in more detail below.
Data movement
Data movement covers the tools, practices, and patterns used to ingest data into the data lake. Processing within the lake and extraction out of it can also be considered data movement, but ingestion deserves the most attention because of the variety of sources and tool types involved. A data lake is typically designed to support both batch ingestion and streaming ingestion from IoT Hub, Event Hubs, and other streaming components, as sketched below.
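The sketch below illustrates the streaming side in Python, using the azure-eventhub and azure-storage-file-datalake SDKs to land micro-batches of events as files in the lake's raw zone. The account, hub, and container names are placeholders, and the flush-every-1,000-events policy is an illustrative assumption rather than a prescribed pattern.

```python
# Minimal sketch: consume events from an Event Hub and land them in the
# data lake as micro-batch files. Names and thresholds are hypothetical.
from azure.eventhub import EventHubConsumerClient
from azure.storage.filedatalake import DataLakeServiceClient

lake = DataLakeServiceClient(
    account_url="https://mydatalake.dfs.core.windows.net",  # hypothetical account
    credential="<storage-account-key>",
)
raw_zone = lake.get_file_system_client("raw")

buffer = []

def on_event(partition_context, event):
    """Buffer incoming events; flush a micro-batch file every 1,000 events."""
    buffer.append(event.body_as_str())
    if len(buffer) >= 1000:
        path = (
            f"telemetry/partition={partition_context.partition_id}/"
            f"offset={event.offset}.json"
        )
        raw_zone.get_file_client(path).upload_data(
            "\n".join(buffer), overwrite=True
        )
        buffer.clear()
        # Needs a checkpoint store configured to persist across restarts.
        partition_context.update_checkpoint(event)

consumer = EventHubConsumerClient.from_connection_string(
    "<event-hub-connection-string>",  # hypothetical
    consumer_group="$Default",
    eventhub_name="telemetry",
)
with consumer:
    consumer.receive(on_event=on_event, starting_position="-1")
```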
Key considerations regarding the ingestion of data into a data lake include:
- Metadata: Metadata can be captured either during the ingestion process or in a batch post-process within the data lake. A metadata strategy should be in place before ingestion planning begins: know what metadata will be captured at ingestion and where it will live, whether in Azure Data Catalog, HCatalog, or a custom metadata catalog that supports data integration automation as well as data exploration and discovery. The format of the data also matters: what format will downstream processing in the lake expect? Depending on the scenario, it may be beneficial to land data in Hadoop-ecosystem file formats such as Avro (see the first sketch after this list).
- Batch and real time: Azure Data Lake and the associated HDFS tooling let you work across thousands of files at petabyte scale. A streaming source may land many small files throughout the day, while a batch process may land one or more terabyte-sized files per day. Partitioning data at ingestion is worth planning up front because it drives query and processing performance (see the second sketch after this list). Also remember that not every organizational data flow will pass through the data lake; SSIS packages, for example, may integrate relational databases directly with your data warehouse.
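For the metadata consideration, here is a minimal sketch of what capture-at-ingestion can look like: a small record assembled when a file lands, written as a JSON sidecar next to the data. The field names and the sidecar convention are illustrative assumptions; in practice the same record could be registered in Azure Data Catalog, HCatalog, or a custom store.

```python
# Minimal sketch: assemble ingestion metadata and write it as a sidecar
# file. All field names are illustrative, not a prescribed schema.
import json
from datetime import datetime, timezone

def build_ingestion_metadata(source_system, lake_path, row_count, schema):
    """Assemble the metadata captured when a file lands in the lake."""
    return {
        "source_system": source_system,
        "lake_path": lake_path,
        "row_count": row_count,
        "schema": schema,                        # column names and types
        "format": lake_path.rsplit(".", 1)[-1],  # e.g. avro, json, csv
        "ingested_at": datetime.now(timezone.utc).isoformat(),
    }

metadata = build_ingestion_metadata(
    source_system="sales_crm",  # hypothetical source system
    lake_path="raw/sales/2024/01/15/orders.avro",
    row_count=48213,
    schema={"order_id": "long", "amount": "double", "region": "string"},
)

# Write the sidecar next to the data file so discovery tools can find it.
with open("orders.avro.meta.json", "w") as f:
    json.dump(metadata, f, indent=2)
```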
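For the batch and real-time consideration, a minimal sketch of partition-at-ingestion: a helper that builds date-partitioned paths so that daily batch files and intraday streaming micro-batches land in one layout that downstream queries can prune. The zone and dataset names are hypothetical.

```python
# Minimal sketch: build year/month/day partitioned lake paths at ingestion
# so query engines can prune to only the partitions they need.
from datetime import datetime, timezone

def partitioned_path(zone, dataset, filename, ts=None):
    """Build a date-partitioned lake path for a landing file."""
    ts = ts or datetime.now(timezone.utc)
    return (f"{zone}/{dataset}/"
            f"year={ts:%Y}/month={ts:%m}/day={ts:%d}/{filename}")

# A daily batch extract and a streaming micro-batch share the same layout,
# so one path pattern serves both in downstream processing jobs.
print(partitioned_path("raw", "orders", "orders_full.avro"))
# e.g. raw/orders/year=2024/month=01/day=15/orders_full.avro
```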