Pass the Databricks Certification Databricks-Certified-Professional-Data-Engineer Questions and answers with ValidTests

Viewing page 3 out of 6 pages
Viewing questions 21-30 out of questions
Questions # 21:

A production cluster has 3 executor nodes and uses the same virtual machine type for the driver and executor.

When evaluating the Ganglia Metrics for this cluster, which indicator would signal a bottleneck caused by code executing on the driver?

Options:

A.

The Five Minute Load Average remains consistent/flat

B.

Bytes Received never exceeds 80 million bytes per second

C.

Total Disk Space remains constant

D.

Network I/O never spikes

E.

Overall cluster CPU utilization is around 25%

Questions # 22:

A small company based in the United States has recently contracted a consulting firm in India to implement several new data engineering pipelines to power artificial intelligence applications. All the company's data is stored in regional cloud storage in the United States.

The workspace administrator at the company is uncertain about where the Databricks workspace used by the contractors should be deployed.

Assuming that all data governance considerations are accounted for, which statement accurately informs this decision?

Options:

A.

Databricks runs HDFS on cloud volume storage; as such, cloud virtual machines must be deployed in the region where the data is stored.

B.

Databricks workspaces do not rely on any regional infrastructure; as such, the decision should be made based upon what is most convenient for the workspace administrator.

C.

Cross-region reads and writes can incur significant costs and latency; whenever possible, compute should be deployed in the same region where the data is stored.

D.

Databricks leverages user workstations as the driver during interactive development; as such, users should always use a workspace deployed in a region they are physically near.

E.

Databricks notebooks send all executable code from the user's browser to virtual machines over the open internet; whenever possible, choosing a workspace region near the end users is the most secure.

Questions # 23:

A data engineer is testing a collection of mathematical functions, one of which calculates the area under a curve as described by another function.

Which kind of test does the above scenario exemplify?

Options:

A.

Integration

B.

Unit

C.

Manual

D.

Functional
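
To make the scenario concrete, here is a minimal sketch of what a test of one mathematical function in isolation might look like. The function name `area_under_curve`, the trapezoidal rule, and the test cases are all assumptions chosen for illustration; the question does not say how the engineer's functions are implemented.

```python
import math
import unittest


def area_under_curve(f, a, b, steps=1000):
    """Approximate the integral of f over [a, b] with the trapezoidal rule.

    Hypothetical helper: the source does not specify an implementation.
    """
    width = (b - a) / steps
    total = 0.5 * (f(a) + f(b))
    for i in range(1, steps):
        total += f(a + i * width)
    return total * width


class AreaUnderCurveTest(unittest.TestCase):
    def test_constant_function(self):
        # The area under y = 2 on [0, 3] is a 2-by-3 rectangle.
        self.assertAlmostEqual(area_under_curve(lambda x: 2.0, 0.0, 3.0), 6.0)

    def test_sine_half_period(self):
        # The exact integral of sin(x) over [0, pi] is 2.
        self.assertAlmostEqual(
            area_under_curve(math.sin, 0.0, math.pi), 2.0, places=4
        )
```

Saved to a file, the tests could be run with `python -m unittest <file>`. Each test exercises the single function with a known input and a known expected result, and touches no other component.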

Questions # 24:

What is the first line of a Databricks Python notebook when viewed in a text editor?

Options:

A.

%python

B.

# Databricks notebook source

C.

-- Databricks notebook source

D.

//Databricks notebook source
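
For context, a Python notebook exported from Databricks in source format is a plain `.py` file that opens with a marker comment and separates cells with `# COMMAND ----------`. The sketch below checks for that marker and splits a file into cells. The helper names and the sample text are made up for illustration and are not part of any Databricks API.

```python
NOTEBOOK_HEADER = "# Databricks notebook source"
CELL_SEPARATOR = "# COMMAND ----------"


def looks_like_python_notebook(source_text):
    """Return True if the first line is the Databricks Python notebook marker."""
    lines = source_text.splitlines()
    return bool(lines) and lines[0].strip() == NOTEBOOK_HEADER


def split_cells(source_text):
    """Drop the header line and split the remainder on the cell separator."""
    body = "\n".join(source_text.splitlines()[1:])
    return [cell.strip() for cell in body.split(CELL_SEPARATOR) if cell.strip()]


# Illustrative sample, not taken from a real workspace.
sample = """# Databricks notebook source
print("first cell")

# COMMAND ----------

print("second cell")
"""

print(looks_like_python_notebook(sample))
print(split_cells(sample))
```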

Questions # 25:

When scheduling Structured Streaming jobs for production, which configuration automatically recovers from query failures and keeps costs low?

Options:

A.

Cluster: New Job Cluster;

Retries: Unlimited;

Maximum Concurrent Runs: Unlimited

B.

Cluster: New Job Cluster;

Retries: None;

Maximum Concurrent Runs: 1

C.

Cluster: Existing All-Purpose Cluster;

Retries: Unlimited;

Maximum Concurrent Runs: 1

D.

Cluster: New Job Cluster;

Retries: Unlimited;

Maximum Concurrent Runs: 1

E.

Cluster: Existing All-Purpose Cluster;

Retries: None;

Maximum Concurrent Runs: 1

Questions # 26:

The Databricks workspace administrator has configured interactive clusters for each of the data engineering groups. To control costs, clusters are set to terminate after 30 minutes of inactivity. Each user should be able to execute workloads against their assigned clusters at any time of the day.

Assuming users have been added to a workspace but not granted any permissions, which of the following describes the minimal permissions a user would need to start and attach to an already configured cluster?

Options:

A.

"Can Manage" privileges on the required cluster

B.

Workspace Admin privileges, cluster creation allowed. "Can Attach To" privileges on the required cluster

C.

Cluster creation allowed. "Can Attach To" privileges on the required cluster

D.

"Can Restart" privileges on the required cluster

E.

Cluster creation allowed. "Can Restart" privileges on the required cluster

Questions # 27:

Which Python variable contains a list of directories to be searched when trying to locate required modules?

Options:

A.

importlib.resource_path

B.

sys.path

C.

os.path

D.

pypi.path

E.

pylib.source
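
As background on module lookup, the standard library exposes the import search path as an ordinary list of strings that can be inspected and modified at runtime. A minimal sketch follows; the directory `/tmp/my_modules` is a hypothetical path used only for illustration.

```python
import sys

# sys.path is a plain list; the import system searches its entries in order.
for entry in sys.path:
    print(entry)

# Hypothetical extra location for project modules.
extra_dir = "/tmp/my_modules"
if extra_dir not in sys.path:
    # Appending makes modules in that directory importable for this process only.
    sys.path.append(extra_dir)
```

Changes made this way last only for the current interpreter session.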

Questions # 28:

The data science team has requested assistance in accelerating queries on free form text from user reviews. The data is currently stored in Parquet with the below schema:

item_id INT, user_id INT, review_id INT, rating FLOAT, review STRING

The review column contains the full text of the review left by the user. Specifically, the data science team is looking to identify if any of 30 key words exist in this field.

A junior data engineer suggests converting this data to Delta Lake will improve query performance.

Which response to the junior data engineer's suggestion is correct?

Options:

A.

Delta Lake statistics are not optimized for free text fields with high cardinality.

B.

Text data cannot be stored with Delta Lake.

C.

ZORDER ON review will need to be run to see performance gains.

D.

The Delta log creates a term matrix for free text fields to support selective filtering.

E.

Delta Lake statistics are only collected on the first 4 columns in a table.
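
To reason about this scenario, it can help to see how per-file minimum and maximum values interact with different kinds of filters. The following is a toy model in plain Python, not Delta Lake or Spark code; the `DataFile` class, the file names, and the review strings are all invented for illustration.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class DataFile:
    """Toy stand-in for one data file and the strings stored in one column."""
    name: str
    reviews: List[str]

    def min_max(self):
        # Lexicographic bounds, the kind of summary a min/max statistic holds.
        return min(self.reviews), max(self.reviews)


def may_contain_exact(data_file, value):
    """An equality filter can rule out a file whose bounds exclude the value."""
    low, high = data_file.min_max()
    return low <= value <= high


def may_contain_keyword(data_file, keyword):
    """A keyword may sit anywhere inside a longer string, so lexicographic
    bounds on the whole string cannot rule the file out."""
    return True


files = [
    DataFile("part-0", ["amazing battery life", "arrived broken"]),
    DataFile("part-1", ["works fine", "would not buy again, broken hinge"]),
]

# Equality on the full string: only files whose bounds include the value are read.
print([f.name for f in files if may_contain_exact(f, "works fine")])

# Keyword search: every file still has to be read.
print([f.name for f in files if may_contain_keyword(f, "broken")])
```

In this toy model the equality filter reads only `part-1`, while the keyword filter reads both files, even though "broken" lies outside the lexicographic bounds of `part-0`.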

Questions # 29:

The marketing team is looking to share data in an aggregate table with the sales organization, but the field names used by the teams do not match, and a number of marketing-specific fields have not been approved for the sales org.

Which of the following solutions addresses the situation while emphasizing simplicity?

Options:

A.

Create a view on the marketing table that selects only the fields approved for the sales team, and alias the names of any fields that should be standardized to the sales naming conventions.

B.

Use a CTAS statement to create a derivative table from the marketing table; configure a production job to propagate changes.

C.

Add a parallel table write to the current production pipeline, updating a new sales table that varies as required from the marketing table.

D.

Create a new table with the required schema and use Delta Lake's DEEP CLONE functionality to sync up changes committed to one table to the corresponding table.
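
As an illustration of the view-based approach mentioned in the options, the sketch below uses Python's built-in `sqlite3` module as a stand-in for a SQL warehouse. The table name, column names, and rows are hypothetical, and the syntax on Databricks SQL may differ in detail.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Hypothetical marketing table; campaign_notes stands in for a field
# that has not been approved for the sales organization.
conn.execute(
    "CREATE TABLE marketing_agg (cust_id INTEGER, mkt_spend REAL, campaign_notes TEXT)"
)
conn.executemany(
    "INSERT INTO marketing_agg VALUES (?, ?, ?)",
    [(1, 120.5, "internal"), (2, 80.0, "internal")],
)

# The view exposes only approved fields and renames them with aliases.
conn.execute(
    """
    CREATE VIEW sales_view AS
    SELECT cust_id AS customer_id, mkt_spend AS total_spend
    FROM marketing_agg
    """
)

cursor = conn.execute("SELECT * FROM sales_view ORDER BY customer_id")
columns = [description[0] for description in cursor.description]
rows = cursor.fetchall()
print(columns)
print(rows)
```

Because a view stores a query rather than a copy of the data, rows added to the underlying table later appear in the view without any extra job.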

Questions # 30:

Which statement describes Delta Lake optimized writes?

Options:

A.

A shuffle occurs prior to writing to try to group data together, resulting in fewer files, instead of each executor writing multiple files based on directory partitions.

B.

Optimized writes use logical partitions instead of directory partitions; because partition boundaries are only represented in metadata, fewer small files are written.

C.

An asynchronous job runs after the write completes to detect if files could be further compacted; if so, an OPTIMIZE job is executed toward a default of 1 GB.

D.

Before a job cluster terminates, OPTIMIZE is executed on all tables modified during the most recent job.
