AZURE DATA FACTORY INTERVIEW QUESTIONS
• The amount of data generated these days is massive, and it comes from a variety of sources. When we migrate this data to the cloud, a few things need to be taken care of.
Data can take any form because it comes from several sources, and each source transfers or channels the data in a different way and in a different format. When we move this data to the cloud or to a specific storage location, we must ensure that it is well managed; that is, we must transform the data and remove any unneeded parts. In terms of data movement, we must ensure that data is collected from the various sources and brought to a common location where it can be stored.
Azure Data Factory is a cloud-based integration service for orchestrating and automating data movement and transformation.
• You can use Azure Data Factory to create and schedule data-driven workflows (called pipelines) that ingest data from various data sources, and to process and transform the data using compute services such as HDInsight Hadoop, Spark, Azure Data Lake Analytics, and Azure Machine Learning.
• Azure Integration Runtime: The Azure Integration Runtime can copy data between cloud data stores and dispatch activities to a range of compute services, such as Azure HDInsight or SQL Server, where the transformation takes place.
• Self-Hosted Integration Runtime: The Self-Hosted Integration Runtime is software that is essentially the same as the Azure Integration Runtime, but it must be installed on an on-premises machine or on a virtual machine inside a virtual network. A self-hosted IR can run copy activities between a public cloud data store and a data store on a private network. It can also dispatch transformation activities to compute resources on a private network. We make use of Self Hosted IR b.
There is no hard limit on the number of integration runtime instances you can have in a data factory. There is, however, a limit on the number of VM cores that the integration runtime can use per subscription for SSIS package execution.
A data warehouse is a traditional way of storing data that is still widely used today. A data lake is complementary to a data warehouse; for example, data held in a data lake can also be stored in a data warehouse, provided certain rules are followed.
DATA LAKE vs. DATA WAREHOUSE
• Relationship: A data lake is complementary to a data warehouse. A data warehouse may be sourced from the data lake.
• Data: In a data lake, data can be detailed or raw and can take any shape or form; you simply take the data and load it into the lake. In a data warehouse, data is filtered, summarised, and refined.
• Schema: A data lake uses schema on read (the data is not structured up front, and you can define the schema in any number of ways). A data warehouse uses schema on write (data is written in structured form or to a particular schema).
• Language: A data lake uses a single language to process the data.
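The difference between schema on read and schema on write can be illustrated with a minimal Python sketch. It is only an in-memory analogy, not Azure code: the `Lake` and `Warehouse` classes, the `conforms` helper, and the record fields (`model`, `price`) are all hypothetical names made up for this example.

```python
import json

# Hypothetical schema: field name -> expected Python type.
CAR_SCHEMA = {"model": str, "price": float}


def conforms(record, schema):
    """Return True if the record has exactly the schema's fields and types."""
    return set(record) == set(schema) and all(
        isinstance(record[field], expected) for field, expected in schema.items()
    )


class Warehouse:
    """Schema on write: every record is validated before it is stored."""

    def __init__(self, schema):
        self.schema = schema
        self.rows = []

    def write(self, record):
        if not conforms(record, self.schema):
            raise ValueError(f"record does not match schema: {record}")
        self.rows.append(record)


class Lake:
    """Schema on read: raw data is stored as-is and interpreted when read."""

    def __init__(self):
        self.raw = []

    def write(self, raw_line):
        self.raw.append(raw_line)  # no validation at write time

    def read(self, schema):
        """Parse the raw lines and keep only those that fit the given schema."""
        rows = []
        for line in self.raw:
            try:
                record = json.loads(line)
            except json.JSONDecodeError:
                continue  # unparseable data stays in the lake but is skipped
            if isinstance(record, dict) and conforms(record, schema):
                rows.append(record)
        return rows
```

In this sketch the lake accepts anything and defers the decision about structure to the reader, who may apply a different schema on each read, while the warehouse rejects non-conforming records up front.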
Azure Blob Storage is a service for storing massive volumes of unstructured object data, such as text or binary data. Blob Storage can be used to expose data publicly or to store application data privately. Blob Storage is commonly used for the following purposes:
•Serving images or documents directly to browsers
•Storing data for distributed access
•Streaming video and audio
•Storing data for backup and restore, disaster recovery, and archiving
•Storing data for analysis by an on-premises or Azure-hosted service
AZURE DATA LAKE STORAGE GEN1 vs. AZURE BLOB STORAGE
• Purpose: Azure Data Lake Storage Gen1 is storage optimised for big data analytics workloads. Azure Blob Storage is a general-purpose object store that can be used for a wide range of storage scenarios, including big data analytics.
• Structure: Data Lake Storage Gen1 is a hierarchical file system. Blob Storage is an object store with a flat namespace.
• Key concepts: A Data Lake Storage Gen1 account contains folders, which in turn contain data stored as files. A storage account has containers, which hold data in the form of blobs.
• Use cases: Data Lake Storage Gen1 suits batch, interactive, and streaming analytics and machine learning data, such as log files, IoT data, click streams, and large datasets. Blob Storage suits text or binary data of any form, such as application back-end data, backup data, media storage for streaming, and so on.
If we extract data from an Azure SQL Server database and something has to be processed, it is processed and then stored in the Data Lake Store.
Steps for Creating an ETL Process
•Create a Linked Service for the source data store, which is the SQL Server database.
•Assume that we have a cars dataset as the source data.
•Create a Linked Service for the destination data store, which is Azure Data Lake Store.
•Create a dataset for saving the data.
•Create the pipeline and add a copy activity.
•Schedule the pipeline by adding a trigger.
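The steps above can be sketched as plain Python dictionaries, loosely modelled on the JSON definitions that Data Factory uses for linked services, datasets, pipelines, and triggers. Every name here (`SqlServerLS`, `CarsSource`, `CopyCarsPipeline`, and so on), the placeholder connection values, and the simplified property layout are assumptions for illustration only; the real schema should be taken from the Data Factory documentation. The sketch builds the definitions and checks that the references between them resolve. It does not call Azure.

```python
def reference(name, kind):
    """Build a reference object pointing at another named definition."""
    return {"referenceName": name, "type": kind}


# Linked services for the source and destination data stores (placeholder values).
linked_services = {
    "SqlServerLS": {"type": "SqlServer", "connectionString": "<placeholder>"},
    "DataLakeLS": {"type": "AzureDataLakeStore", "dataLakeStoreUri": "<placeholder>"},
}

# Datasets: the cars source table and the location where the data is saved.
datasets = {
    "CarsSource": {"linkedServiceName": reference("SqlServerLS", "LinkedServiceReference")},
    "CarsSink": {"linkedServiceName": reference("DataLakeLS", "LinkedServiceReference")},
}

# Pipeline with a single copy activity from the source to the sink dataset.
pipelines = {
    "CopyCarsPipeline": {
        "activities": [
            {
                "name": "CopyCars",
                "type": "Copy",
                "inputs": [reference("CarsSource", "DatasetReference")],
                "outputs": [reference("CarsSink", "DatasetReference")],
            }
        ]
    }
}

# Trigger that schedules the pipeline (hypothetical daily recurrence).
triggers = {
    "DailyTrigger": {
        "type": "ScheduleTrigger",
        "recurrence": {"frequency": "Day", "interval": 1},
        "pipelines": [reference("CopyCarsPipeline", "PipelineReference")],
    }
}


def unresolved_references():
    """Return referenced names that have no matching definition."""
    missing = []
    for dataset in datasets.values():
        name = dataset["linkedServiceName"]["referenceName"]
        if name not in linked_services:
            missing.append(name)
    for pipeline in pipelines.values():
        for activity in pipeline["activities"]:
            for ref in activity["inputs"] + activity["outputs"]:
                if ref["referenceName"] not in datasets:
                    missing.append(ref["referenceName"])
    for trigger in triggers.values():
        for ref in trigger["pipelines"]:
            if ref["referenceName"] not in pipelines:
                missing.append(ref["referenceName"])
    return missing
```

The order of the definitions mirrors the order of the steps: each later object refers to an earlier one by name, which is why the linked services and datasets have to exist before the pipeline and its trigger.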
HDINSIGHT vs. AZURE DATA LAKE ANALYTICS
• Service model: HDInsight is a Platform as a Service (PaaS). Azure Data Lake Analytics is a Software as a Service (SaaS).
• Processing: With HDInsight, to process a data set we must first configure the cluster with predefined nodes and then use a language such as Pig or Hive to process the data. With Azure Data Lake Analytics, it is simply a matter of passing a query written for data processing; Azure Data Lake Analytics creates the necessary compute nodes on demand, based on our instructions, and processes the data set.
• Flexibility: Because we configure the cluster ourselves with HDInsight, we can create and control it as we see fit, and all Hadoop subprojects, including Spark and Kafka, can be used without restriction. Azure Data Lake Analytics does not offer the same degree of control over the cluster.
• To schedule a pipeline, use the schedule trigger or the tumbling window trigger.
• The trigger uses a wall-clock calendar schedule, which can schedule pipelines periodically or in calendar-based recurring patterns (for example, on Mondays at 6:00 PM and Thursdays at 9:00 PM).
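As a rough illustration of a calendar-based recurring pattern such as the Monday 6:00 PM and Thursday 9:00 PM example, the following Python sketch computes upcoming run times from a wall-clock schedule. It is not how Data Factory implements triggers; the `SCHEDULE` table and the `next_runs` helper are hypothetical names, and time zones are ignored.

```python
from datetime import datetime, time, timedelta

# Hypothetical schedule: weekday (Monday=0) -> wall-clock time of the run.
SCHEDULE = {0: time(18, 0), 3: time(21, 0)}  # Mondays 6:00 PM, Thursdays 9:00 PM


def next_runs(after, count, schedule=SCHEDULE):
    """Return the first `count` scheduled run times strictly after `after`."""
    if not schedule:
        return []  # nothing is scheduled, so there are no runs to report
    runs = []
    day = after.date()
    while len(runs) < count:
        run_time = schedule.get(day.weekday())
        if run_time is not None:
            candidate = datetime.combine(day, run_time)
            if candidate > after:
                runs.append(candidate)
        day += timedelta(days=1)
    return runs
```

Calling `next_runs(datetime(2024, 1, 1, 12, 0), 3)` walks forward day by day from Monday 1 January 2024 and collects the next three slots that match the schedule.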
• Parameters are a first-class, top-level concept in Data Factory. You can define parameters at the pipeline level and pass arguments when you run the pipeline on demand or via a trigger.
Each activity within the pipeline can consume the parameter value that is passed to the pipeline run by using the @parameter construct.
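To make the idea of passing arguments to a run more concrete, here is a toy Python resolver that substitutes pipeline parameters into an activity setting. It assumes parameters are referenced with the `@pipeline().parameters.<name>` expression form; the `resolve` function, the `declared` and `supplied` dictionaries, and the sample path are hypothetical, and Data Factory's real expression language is considerably richer than this plain text substitution.

```python
import re

# Matches expressions of the form @pipeline().parameters.<name>
PARAM_PATTERN = re.compile(r"@pipeline\(\)\.parameters\.(\w+)")


def resolve(value, declared, supplied):
    """Replace parameter expressions in `value` with the arguments for this run.

    `declared` maps parameter names to default values (None means no default);
    `supplied` holds the arguments passed on demand or by a trigger.
    """

    def lookup(match):
        name = match.group(1)
        if name not in declared:
            raise KeyError(f"undeclared pipeline parameter: {name}")
        result = supplied.get(name, declared[name])
        if result is None:
            raise ValueError(f"no value supplied for parameter: {name}")
        return str(result)

    return PARAM_PATTERN.sub(lookup, value)
```

For example, with `declared = {"year": None, "folder": "raw"}`, a run that supplies `{"year": 2024}` resolves `"@pipeline().parameters.folder/@pipeline().parameters.year"` by taking `folder` from its default and `year` from the supplied argument.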
• You no longer have to bring your own Azure Databricks clusters; Data Factory manages cluster creation and tear-down.
• Blob datasets and Azure Data Lake Storage Gen2 datasets are separated into delimited text and Apache Parquet datasets.
You can still use Data Lake Storage Gen2 and Blob storage to store those files. Use the appropriate linked service for those storage engines.