Pass the Databricks Certification Databricks-Certified-Professional-Data-Engineer Questions and answers with ValidTests

Questions # 21:

A production cluster has 3 executor nodes and uses the same virtual machine type for the driver and the executors.

When evaluating the Ganglia Metrics for this cluster, which indicator would signal a bottleneck caused by code executing on the driver?

Options:

The Five Minute Load Average remains consistent/flat

Bytes Received never exceeds 80 million bytes per second

Total Disk Space remains constant

Network I/O never spikes

Overall cluster CPU utilization is around 25%

Expert Solution

Answer

Explanation

Correct answer: Overall cluster CPU utilization is around 25%.

A bottleneck occurs when a single component or resource limits the performance or capacity of the whole system, causing slow execution, high latency, or low throughput. The Ganglia metrics show how cluster resources such as CPU, memory, disk, and network are being used.

The cluster has four nodes of the same virtual machine type (the driver plus 3 executors). Overall CPU utilization of around 25% is consistent with one of the four nodes running at full CPU capacity while the other three are idle or underutilized. This suggests that code executing on the driver is consuming the driver's CPU while the executors wait for tasks or data to process. This can happen when the code has driver-side operations that are not parallelized or distributed, such as collecting large amounts of data to the driver, performing complex calculations on the driver, or using non-Spark libraries on the driver.

The driver node manages the execution of the Spark application, including scheduling tasks, managing the execution plan, and interacting with the cluster manager, so the executors cannot make progress while the driver is busy with its own work.

Verified References: [Databricks Certified Data Engineer Professional], under “Spark Core” section; Databricks Documentation, under “View cluster status and event logs - Ganglia metrics” section; Databricks Documentation, under “Avoid collecting large RDDs” section.
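
The 25% figure can be checked with a little arithmetic. The sketch below is only an illustration: the node names and the per-node utilization values are idealized assumptions (a fully busy driver and fully idle executors), not real Ganglia output.

```python
def cluster_cpu_utilization(per_node_utilization):
    """Average CPU utilization across equally sized nodes (values in 0.0-1.0)."""
    return sum(per_node_utilization) / len(per_node_utilization)


# Idealized assumption: the driver is saturated and the three executors are idle.
nodes = {
    "driver": 1.0,
    "executor-1": 0.0,
    "executor-2": 0.0,
    "executor-3": 0.0,
}

overall = cluster_cpu_utilization(list(nodes.values()))
print(f"Overall cluster CPU utilization: {overall:.0%}")
```

Because all four nodes use the same virtual machine type, a simple unweighted average is a reasonable approximation here; with mixed node sizes the average would need to be weighted by core count.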

Questions # 22:

A small company based in the United States has recently contracted a consulting firm in India to implement several new data engineering pipelines to power artificial intelligence applications. All the company's data is stored in regional cloud storage in the United States.

The workspace administrator at the company is uncertain about where the Databricks workspace used by the contractors should be deployed.

Assuming that all data governance considerations are accounted for, which statement accurately informs this decision?

Options:

Databricks runs HDFS on cloud volume storage; as such, cloud virtual machines must be deployed in the region where the data is stored.

Databricks workspaces do not rely on any regional infrastructure; as such, the decision should be made based upon what is most convenient for the workspace administrator.

Cross-region reads and writes can incur significant costs and latency; whenever possible, compute should be deployed in the same region the data is stored.

Databricks leverages user workstations as the driver during interactive development; as such, users should always use a workspace deployed in a region they are physically near.

Databricks notebooks send all executable code from the user's browser to virtual machines over the open internet; whenever possible, choosing a workspace region near the end users is the most secure.

Questions # 23:

A data engineer is testing a collection of mathematical functions, one of which calculates the area under a curve as described by another function.

Which kind of test does the above scenario exemplify?

Options:

Integration

Unit

Manual

Functional
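
To make the scenario concrete, here is a minimal sketch of testing one mathematical function in isolation. The function name `area_under_curve`, the trapezoidal rule, and the test names are all assumptions chosen for illustration; the question does not say how the engineer's function is implemented.

```python
import math


def area_under_curve(f, a, b, steps=1000):
    """Approximate the integral of f over [a, b] with the trapezoidal rule."""
    width = (b - a) / steps
    total = 0.5 * (f(a) + f(b))
    for i in range(1, steps):
        total += f(a + i * width)
    return total * width


def test_area_of_linear_function():
    # The integral of f(x) = x over [0, 2] is 2; trapezoids are exact for
    # straight lines, up to floating-point rounding.
    assert math.isclose(area_under_curve(lambda x: x, 0.0, 2.0), 2.0, rel_tol=1e-9)


def test_area_of_sine_over_half_period():
    # The integral of sin(x) over [0, pi] is 2; allow for approximation error.
    assert math.isclose(area_under_curve(math.sin, 0.0, math.pi), 2.0, abs_tol=1e-5)
```

Each test exercises a single function against a known result and needs no other system or component, so a test runner such as pytest could collect and run these functions as written.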

Questions # 24:

What is the first line of a Databricks Python notebook when viewed in a text editor?

Options:

%python

# Databricks notebook source

-- Databricks notebook source

// Databricks notebook source
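
For context, a Python notebook exported in source format is an ordinary `.py` file whose first line is a marker comment, with cells separated by `# COMMAND ----------` lines. The sketch below checks for that marker; the two-cell `example_source` and the helper name `is_python_notebook_source` are hypothetical and exist only for this illustration.

```python
NOTEBOOK_MARKER = "# Databricks notebook source"
CELL_SEPARATOR = "# COMMAND ----------"

# Hypothetical two-cell notebook as it might appear in a text editor.
example_source = "\n".join([
    NOTEBOOK_MARKER,
    "import sys",
    "",
    CELL_SEPARATOR,
    "",
    "print(sys.version)",
])


def is_python_notebook_source(text):
    """Return True if the first line is the Python notebook marker comment."""
    lines = text.splitlines()
    return bool(lines) and lines[0].strip() == NOTEBOOK_MARKER


print(is_python_notebook_source(example_source))
```

The marker uses the comment syntax of the notebook's default language, which is why SQL and Scala notebooks start with `--` and `//` respectively.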

Questions # 25:

When scheduling Structured Streaming jobs for production, which configuration automatically recovers from query failures and keeps costs low?

Options:

Cluster: New Job Cluster; Retries: Unlimited; Maximum Concurrent Runs: Unlimited

Cluster: New Job Cluster; Retries: None; Maximum Concurrent Runs: 1

Cluster: Existing All-Purpose Cluster; Retries: Unlimited; Maximum Concurrent Runs: 1

Cluster: New Job Cluster; Retries: Unlimited; Maximum Concurrent Runs: 1

Cluster: Existing All-Purpose Cluster; Retries: None; Maximum Concurrent Runs: 1

Questions # 26:

The Databricks workspace administrator has configured interactive clusters for each of the data engineering groups. To control costs, clusters are set to terminate after 30 minutes of inactivity. Each user should be able to execute workloads against their assigned clusters at any time of the day.

Assuming users have been added to a workspace but not granted any permissions, which of the following describes the minimal permissions a user would need to start and attach to an already configured cluster?

Options:

"Can Manage" privileges on the required cluster

Workspace Admin privileges, cluster creation allowed. "Can Attach To" privileges on the required cluster

Cluster creation allowed. "Can Attach To" privileges on the required cluster

"Can Restart" privileges on the required cluster

Cluster creation allowed. "Can Restart" privileges on the required cluster

Questions # 27:

Which Python variable contains a list of directories to be searched when trying to locate required modules?

Options:

importlib.resource path

sys.path

os.path

pypi.path

pylib.source
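
As a quick illustration of the module search path, the sketch below prints the entries Python consults on import and appends one more. The directory `/tmp/my_project/src` is a hypothetical path used only for this example.

```python
import sys

# sys.path is a plain list of strings (typically directories), searched in order
# whenever an import statement needs to locate a module.
for entry in sys.path:
    print(entry)

# Appending an entry makes modules inside that directory importable.
extra_dir = "/tmp/my_project/src"  # hypothetical path
if extra_dir not in sys.path:
    sys.path.append(extra_dir)
```

Changes made this way last only for the running interpreter; the `PYTHONPATH` environment variable is the usual way to seed the list at startup.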

Questions # 28:

The data science team has requested assistance in accelerating queries on free-form text from user reviews. The data is currently stored in Parquet with the schema below:

item_id INT, user_id INT, review_id INT, rating FLOAT, review STRING

The review column contains the full text of the review left by the user. Specifically, the data science team is looking to identify if any of 30 key words exist in this field.

A junior data engineer suggests converting this data to Delta Lake will improve query performance.

Which response to the junior data engineer's suggestion is correct?

Options:

Delta Lake statistics are not optimized for free text fields with high cardinality.

Text data cannot be stored with Delta Lake.

ZORDER ON review will need to be run to see performance gains.

The Delta log creates a term matrix for free text fields to support selective filtering.

Delta Lake statistics are only collected on the first 4 columns in a table.
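
To see why per-file statistics matter here, consider how min/max values are used to skip files. The sketch below is a simplified model and not Delta Lake's implementation: the `FileStats` class, the file names, and the sample values are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class FileStats:
    """Hypothetical per-file min/max statistics for two columns."""
    path: str
    min_rating: float
    max_rating: float
    min_review: str
    max_review: str


files = [
    FileStats("part-000.parquet", 1.0, 3.5, "A bit noisy...", "Works as described"),
    FileStats("part-001.parquet", 4.0, 5.0, "Arrived late but...", "Would buy again"),
]


def candidate_files_for_min_rating(stats, threshold):
    """Range predicate: a file is skipped when its max is below the threshold."""
    return [f.path for f in stats if f.max_rating >= threshold]


def candidate_files_for_keyword(stats, keyword):
    """Substring predicate: min/max only bound lexicographic order, so a keyword
    in the middle of a review cannot be ruled out and every file must be read."""
    return [f.path for f in stats]


print(candidate_files_for_min_rating(files, 4.5))
print(candidate_files_for_keyword(files, "refund"))
```

In this model a filter on `rating` prunes one of the two files, while a keyword search over `review` prunes none, because nearly every review is a distinct string and its min/max bounds say nothing about which words it contains.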

Questions # 29:

The marketing team is looking to share data in an aggregate table with the sales organization, but the field names used by the teams do not match, and a number of marketing-specific fields have not been approved for the sales org.

Which of the following solutions addresses the situation while emphasizing simplicity?

Options:

Create a view on the marketing table selecting only those fields approved for the sales team; alias the names of any fields that should be standardized to the sales naming conventions.

Use a CTAS statement to create a derivative table from the marketing table; configure a production job to propagate changes.

Add a parallel table write to the current production pipeline, updating a new sales table that varies as required from the marketing table.

Create a new table with the required schema and use Delta Lake's DEEP CLONE functionality to sync up changes committed to one table to the corresponding table.
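
The view-based option can be sketched in a few lines. The example below uses Python's built-in `sqlite3` module as a stand-in for a Databricks SQL environment, and the table name, view name, and column names are all hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Hypothetical marketing table: campaign_notes is not approved for the sales org.
conn.execute(
    "CREATE TABLE marketing_agg (cust_id INTEGER, rev_total REAL, campaign_notes TEXT)"
)
conn.executemany(
    "INSERT INTO marketing_agg VALUES (?, ?, ?)",
    [(1, 120.0, "internal"), (2, 75.5, "internal")],
)

# The view exposes only approved fields, renamed to the sales conventions.
conn.execute(
    """
    CREATE VIEW sales_agg AS
    SELECT cust_id AS customer_id, rev_total AS total_revenue
    FROM marketing_agg
    """
)

cursor = conn.execute("SELECT * FROM sales_agg ORDER BY customer_id")
columns = [description[0] for description in cursor.description]
rows = cursor.fetchall()
print(columns)
print(rows)
```

Because a view stores only the query, new rows written to the marketing table show up in it without any job to copy or synchronize data.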

Questions # 30:

Which statement describes Delta Lake optimized writes?

Options:

A shuffle occurs prior to writing to try to group data together, resulting in fewer files, instead of each executor writing multiple files based on directory partitions.

Optimized writes use logical partitions instead of directory partitions; because partition boundaries are only represented in metadata, fewer small files are written.

An asynchronous job runs after the write completes to detect if files could be further compacted; if yes, an OPTIMIZE job is executed toward a default of 1 GB.

Before a job cluster terminates, OPTIMIZE is executed on all tables modified during the most recent job.
