Traditional enterprise data warehouses (EDWs) and data marts require days or weeks of planning, modeling, and development before data becomes visible to end users. A well-designed data lake balances the structure of a data warehouse with the flexibility of a file system. The overall architecture and flow of a data lake can be grouped into three primary pillars: operations, discovery, and organization. This article, part of our series on Azure Data Lakes, focuses on the operations pillar.
The operations of a data lake involve data movement, processing and orchestration, which we explore in more detail below.
Data movement
Data movement covers the tools, practices, and patterns used to ingest data into the data lake. Processing within the lake and extraction out of it can also be considered data movement, but ingestion deserves the most attention because of the variety of sources and tool types involved. A data lake is typically designed to support both batch ingestion and streaming ingestion from IoT Hub, Event Hubs, and other streaming components, as sketched below.
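The sketch below illustrates the streaming side in Python, using the azure-eventhub and azure-storage-file-datalake SDKs to land micro-batches of events as files in the lake's raw zone. The account, hub, and container names are placeholders, and the flush-every-1,000-events policy is an illustrative assumption rather than a prescribed pattern.

```python
# Minimal sketch: consume events from an Event Hub and land them in the
# data lake as micro-batch files. Names and thresholds are hypothetical.
from azure.eventhub import EventHubConsumerClient
from azure.storage.filedatalake import DataLakeServiceClient

lake = DataLakeServiceClient(
    account_url="https://mydatalake.dfs.core.windows.net",  # hypothetical account
    credential="<storage-account-key>",
)
raw_zone = lake.get_file_system_client("raw")

buffer = []

def on_event(partition_context, event):
    """Buffer incoming events; flush a micro-batch file every 1,000 events."""
    buffer.append(event.body_as_str())
    if len(buffer) >= 1000:
        path = (
            f"telemetry/partition={partition_context.partition_id}/"
            f"offset={event.offset}.json"
        )
        raw_zone.get_file_client(path).upload_data(
            "\n".join(buffer), overwrite=True
        )
        buffer.clear()
        # Needs a checkpoint store configured to persist across restarts.
        partition_context.update_checkpoint(event)

consumer = EventHubConsumerClient.from_connection_string(
    "<event-hub-connection-string>",  # hypothetical
    consumer_group="$Default",
    eventhub_name="telemetry",
)
with consumer:
    consumer.receive(on_event=on_event, starting_position="-1")
```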
Key considerations regarding the ingestion of data into a data lake include:
- Metadata: Metadata can be captured either during the ingestion process or in a batch post-process within the data lake. A metadata strategy should be in place before ingestion planning begins: know what metadata will be captured at ingestion and where it will live, whether in Azure Data Catalog, HCatalog, or a custom metadata catalog that supports data integration automation as well as data exploration and discovery. The format of the data also matters: what format will downstream processing in the lake expect? Depending on the scenario, it may be beneficial to land data in Hadoop-ecosystem file formats such as Avro (see the first sketch after this list).
- Batch and real time: Azure Data Lake and the associated HDFS tooling let you work across thousands of files at petabyte scale. A streaming source may land many small files throughout the day, while a batch process may land one or more terabyte-sized files per day. Partitioning data at ingestion is worth planning up front because it drives query and processing performance (see the second sketch after this list). Also remember that not every organizational data flow will pass through the data lake; SSIS packages, for example, may integrate relational databases directly with your data warehouse.
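For the metadata consideration, here is a minimal sketch of what capture-at-ingestion can look like: a small record assembled when a file lands, written as a JSON sidecar next to the data. The field names and the sidecar convention are illustrative assumptions; in practice the same record could be registered in Azure Data Catalog, HCatalog, or a custom store.

```python
# Minimal sketch: assemble ingestion metadata and write it as a sidecar
# file. All field names are illustrative, not a prescribed schema.
import json
from datetime import datetime, timezone

def build_ingestion_metadata(source_system, lake_path, row_count, schema):
    """Assemble the metadata captured when a file lands in the lake."""
    return {
        "source_system": source_system,
        "lake_path": lake_path,
        "row_count": row_count,
        "schema": schema,                        # column names and types
        "format": lake_path.rsplit(".", 1)[-1],  # e.g. avro, json, csv
        "ingested_at": datetime.now(timezone.utc).isoformat(),
    }

metadata = build_ingestion_metadata(
    source_system="sales_crm",  # hypothetical source system
    lake_path="raw/sales/2024/01/15/orders.avro",
    row_count=48213,
    schema={"order_id": "long", "amount": "double", "region": "string"},
)

# Write the sidecar next to the data file so discovery tools can find it.
with open("orders.avro.meta.json", "w") as f:
    json.dump(metadata, f, indent=2)
```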
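For the batch and real-time consideration, a minimal sketch of partition-at-ingestion: a helper that builds date-partitioned paths so that daily batch files and intraday streaming micro-batches land in one layout that downstream queries can prune. The zone and dataset names are hypothetical.

```python
# Minimal sketch: build year/month/day partitioned lake paths at ingestion
# so query engines can prune to only the partitions they need.
from datetime import datetime, timezone

def partitioned_path(zone, dataset, filename, ts=None):
    """Build a date-partitioned lake path for a landing file."""
    ts = ts or datetime.now(timezone.utc)
    return (f"{zone}/{dataset}/"
            f"year={ts:%Y}/month={ts:%m}/day={ts:%d}/{filename}")

# A daily batch extract and a streaming micro-batch share the same layout,
# so one path pattern serves both in downstream processing jobs.
print(partitioned_path("raw", "orders", "orders_full.avro"))
# e.g. raw/orders/year=2024/month=01/day=15/orders_full.avro
```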