Pass the Databricks Certification Databricks-Certified-Professional-Data-Engineer Questions and answers with ValidTests

Questions # 61:

The data governance team has instituted a requirement that all tables containing Personally Identifiable Information (PII) must be clearly annotated. This includes adding column comments, table comments, and setting the custom table property "contains_pii" = true.

The following SQL DDL statement is executed to create a new table:

Which command allows manual confirmation that these three requirements have been met?

Options:

DESCRIBE EXTENDED dev.pii_test

DESCRIBE DETAIL dev.pii_test

SHOW TBLPROPERTIES dev.pii_test

DESCRIBE HISTORY dev.pii_test

SHOW TABLES dev

Questions # 62:

A data engineer is performing a join operation to combine values from a static userLookup table with a streaming DataFrame streamingDF.

Which code block attempts to perform an invalid stream-static join?

Options:

userLookup.join(streamingDF, ["user_id"], how="inner")

streamingDF.join(userLookup, ["user_id"], how="outer")

streamingDF.join(userLookup, ["user_id"], how="left")

streamingDF.join(userLookup, ["user_id"], how="inner")

userLookup.join(streamingDF, ["user_id"], how="right")
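For background, the join types that Structured Streaming accepts between one streaming and one static DataFrame can be summarized as a small lookup. The sketch below is plain Python rather than Spark code, and the function name `is_supported_stream_static_join` is hypothetical. It encodes the rules as listed in the join support matrix of the Structured Streaming programming guide: an inner join works in either order, a one-sided outer join works only when the streaming DataFrame is on the preserved side, and a full outer join is not supported.

```python
def is_supported_stream_static_join(left_is_streaming: bool, how: str) -> bool:
    """Return True if a join between one streaming and one static DataFrame is supported.

    left_is_streaming: True for streamingDF.join(staticDF, ...),
                       False for staticDF.join(streamingDF, ...).
    how: one of "inner", "left", "right", "outer".
    """
    if how == "inner":
        # Inner joins are supported with the stream on either side.
        return True
    if how == "left":
        # A left outer join preserves the left side, so the left side must be the stream.
        return left_is_streaming
    if how == "right":
        # A right outer join preserves the right side, so the right side must be the stream.
        return not left_is_streaming
    if how == "outer":
        # A full outer join would also have to preserve the static side.
        return False
    raise ValueError(f"join type not covered by this sketch: {how!r}")


print(is_supported_stream_static_join(True, "left"))
print(is_supported_stream_static_join(False, "left"))
```

The sketch covers only the four join types named in the docstring; anything else raises a `ValueError` instead of guessing.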

Questions # 63:

Which of the following technologies can be used to identify key areas of text when parsing Spark Driver log4j output?

Options:

Regex

Julia

pyspark.ml.feature

Scala Datasets

C++
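As an illustration of what identifying key areas of text in a log line can look like, here is a minimal Python sketch. The sample line and the layout it assumes (timestamp, level, logger name, message) are hypothetical; real driver logs follow whatever log4j pattern the cluster is configured with, so the expression would need adjusting.

```python
import re

# Hypothetical log4j-style line; the actual layout depends on the configured pattern.
SAMPLE = "24/01/15 10:32:07 ERROR TaskSetManager: Task 3 in stage 5.0 failed 4 times"

# Named groups pull out the parts of the line that are usually of interest.
LOG_LINE = re.compile(
    r"^(?P<timestamp>\d{2}/\d{2}/\d{2} \d{2}:\d{2}:\d{2}) "
    r"(?P<level>[A-Z]+) "
    r"(?P<logger>[\w.$]+): "
    r"(?P<message>.*)$"
)


def parse_log_line(line: str):
    """Return a dict of the named fields, or None if the line does not match."""
    match = LOG_LINE.match(line)
    return match.groupdict() if match else None


parsed = parse_log_line(SAMPLE)
print(parsed["level"], parsed["logger"])
```

Returning `None` for non-matching lines lets a caller skip stack-trace continuation lines without raising.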

Questions # 64:

A Databricks job has been configured with 3 tasks, each of which is a Databricks notebook. Task A does not depend on other tasks. Tasks B and C run in parallel, with each having a serial dependency on Task A.

If task A fails during a scheduled run, which statement describes the results of this run?

Options:

Because all tasks are managed as a dependency graph, no changes will be committed to the Lakehouse until all tasks have successfully been completed.

Tasks B and C will attempt to run as configured; any changes made in task A will be rolled back due to task failure.

Unless all tasks complete successfully, no changes will be committed to the Lakehouse; because task A failed, all commits will be rolled back automatically.

Tasks B and C will be skipped; some logic expressed in task A may have been committed before task failure.

Tasks B and C will be skipped; task A will not commit any changes because of stage failure.
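The dependency layout described in this question can be modeled with a toy simulation. This is plain Python and does not call the Databricks Jobs API; the function `simulate_run` and the status strings are hypothetical, and the model assumes a task runs only when every upstream task succeeded. It tracks task status only and says nothing about what a task may have written before failing.

```python
from graphlib import TopologicalSorter


def simulate_run(dependencies, failing):
    """Toy model of one job run.

    dependencies: maps each task name to the set of tasks it depends on.
    failing: set of task names whose execution fails.
    A task is skipped when any upstream task did not succeed.
    """
    status = {}
    # static_order() yields each task after all of its dependencies.
    for task in TopologicalSorter(dependencies).static_order():
        upstream = dependencies.get(task, set())
        if any(status[u] != "succeeded" for u in upstream):
            status[task] = "skipped"
        elif task in failing:
            status[task] = "failed"
        else:
            status[task] = "succeeded"
    return status


# Tasks B and C each depend on task A, as in the question.
deps = {"A": set(), "B": {"A"}, "C": {"A"}}
print(simulate_run(deps, failing={"A"}))
```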

Questions # 65:

Each configuration below is identical to the extent that each cluster has 400 GB total of RAM, 160 total cores and only one Executor per VM.

Given a job with at least one wide transformation, which of the following cluster configurations will result in maximum performance?

Options:

• Total VMs: 1
• 400 GB per Executor
• 160 Cores / Executor

• Total VMs: 8
• 50 GB per Executor
• 20 Cores / Executor

• Total VMs: 4
• 100 GB per Executor
• 40 Cores / Executor

• Total VMs: 2
• 200 GB per Executor
• 80 Cores / Executor
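The per-executor figures in these configurations follow from dividing the fixed totals by the number of VMs. A short Python check of that arithmetic, assuming one executor per VM as the question states (the helper name `per_executor` is hypothetical):

```python
TOTAL_RAM_GB = 400
TOTAL_CORES = 160


def per_executor(total_vms: int):
    """Return (GB of RAM, cores) per executor, assuming one executor per VM."""
    return TOTAL_RAM_GB // total_vms, TOTAL_CORES // total_vms


for vms in (1, 2, 4, 8):
    ram, cores = per_executor(vms)
    print(f"{vms} VM(s): {ram} GB and {cores} cores per executor")
```

The calculation only confirms that the four configurations are consistent with the stated totals; it does not model shuffle or network cost.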

Questions # 66:

A junior data engineer has configured a workload that posts the following JSON to the Databricks REST API endpoint 2.0/jobs/create.

Assuming that all configurations and referenced resources are available, which statement describes the result of executing this workload three times?

Options:

Three new jobs named "Ingest new data" will be defined in the workspace, and they will each run once daily.

The logic defined in the referenced notebook will be executed three times on new clusters with the configurations of the provided cluster ID.

Three new jobs named "Ingest new data" will be defined in the workspace, but no jobs will be executed.

One new job named "Ingest new data" will be defined in the workspace, but it will not be executed.

The logic defined in the referenced notebook will be executed three times on the referenced existing all purpose cluster.

Questions # 67:

A user new to Databricks is trying to troubleshoot long execution times for some pipeline logic they are working on. Presently, the user is executing code cell-by-cell, using display() calls to confirm code is producing the logically correct results as new transformations are added to an operation. To get a measure of average time to execute, the user is running each cell multiple times interactively.

Which of the following adjustments will get a more accurate measure of how code is likely to perform in production?

Options:

Scala is the only language that can be accurately tested using interactive notebooks; because the best performance is achieved by using Scala code compiled to JARs, all PySpark and Spark SQL logic should be refactored.

The only way to meaningfully troubleshoot code execution times in development notebooks is to use production-sized data and production-sized clusters with Run All execution.

Production code development should only be done using an IDE; executing code against a local build of open source Spark and Delta Lake will provide the most accurate benchmarks for how code will perform in production.

Calling display() forces a job to trigger, while many transformations will only add to the logical query plan; because of caching, repeated execution of the same logic does not provide meaningful results.

The Jobs UI should be leveraged to occasionally run the notebook as a job and track execution time during incremental code development because Photon can only be enabled on clusters launched for scheduled jobs.

Questions # 68:

A new data engineer notices that a critical field was omitted from an application that writes its Kafka source to Delta Lake. This happened even though the critical field was in the Kafka source. That field was further missing from data written to dependent, long-term storage. The retention threshold on the Kafka service is seven days. The pipeline has been in production for three months.

Which describes how Delta Lake can help to avoid data loss of this nature in the future?

Options:

The Delta log and Structured Streaming checkpoints record the full history of the Kafka producer.

Delta Lake schema evolution can retroactively calculate the correct value for newly added fields, as long as the data was in the original source.

Delta Lake automatically checks that all fields present in the source data are included in the ingestion layer.

Data can never be permanently dropped or deleted from Delta Lake, so data loss is not possible under any circumstance.

Ingesting all raw data and metadata from Kafka to a bronze Delta table creates a permanent, replayable history of the data state.

Questions # 69:

A junior data engineer has been asked to develop a streaming data pipeline with a grouped aggregation using DataFrame df. The pipeline needs to calculate the average humidity and average temperature for each non-overlapping five-minute interval. Incremental state information should be maintained for 10 minutes for late-arriving data.

Streaming DataFrame df has the following schema:

"device_id INT, event_time TIMESTAMP, temp FLOAT, humidity FLOAT"

Code block:

Choose the response that correctly fills in the blank within the code block to complete this task.

Options:

withWatermark("event_time", "10 minutes")

awaitArrival("event_time", "10 minutes")

await("event_time + '10 minutes'")

slidingWindow("event_time", "10 minutes")

delayWrite("event_time", "10 minutes")
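The two time settings in this question, a five-minute non-overlapping window and a 10-minute allowance for late data, can be illustrated without Spark. The sketch below is plain Python with hypothetical helper names. It is a simplified model: the threshold is taken as the latest event time seen minus the allowance, and a window counts as open until that threshold passes the window's end. It ignores the fact that the streaming engine updates the threshold only between micro-batches.

```python
from datetime import datetime, timedelta

WINDOW = timedelta(minutes=5)
LATE_THRESHOLD = timedelta(minutes=10)


def tumbling_window(event_time: datetime):
    """Return the (start, end) of the five-minute window containing event_time."""
    epoch = datetime(1970, 1, 1)
    # Distance from the previous five-minute boundary.
    offset = (event_time - epoch) % WINDOW
    start = event_time - offset
    return start, start + WINDOW


def late_data_threshold(max_event_time_seen: datetime) -> datetime:
    """Events in windows ending before this time are treated as too late."""
    return max_event_time_seen - LATE_THRESHOLD


def window_still_open(event_time: datetime, max_event_time_seen: datetime) -> bool:
    """True while state for the event's window would still be kept in this model."""
    _, end = tumbling_window(event_time)
    return late_data_threshold(max_event_time_seen) <= end


event = datetime(2024, 1, 1, 12, 7, 30)
print(tumbling_window(event))
print(window_still_open(event, datetime(2024, 1, 1, 12, 19)))
print(window_still_open(event, datetime(2024, 1, 1, 12, 21)))
```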

Questions # 70:

Which statement describes integration testing?

Options:

Validates interactions between subsystems of your application

Requires an automated testing framework

Requires manual intervention

Validates an application use case

Validates behavior of individual elements of your application
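To make the distinction between test scopes concrete, the sketch below contrasts a test of one function in isolation with a test that exercises two components together. All names (`parse_reading`, `InMemoryStore`, `ingest`) are hypothetical and exist only for this illustration.

```python
def parse_reading(raw: str) -> dict:
    """Parse a 'device_id,temp' string into a dict."""
    device_id, temp = raw.split(",")
    return {"device_id": int(device_id), "temp": float(temp)}


class InMemoryStore:
    """Stand-in for a storage subsystem."""

    def __init__(self):
        self.rows = []

    def save(self, row: dict) -> None:
        self.rows.append(row)


def ingest(raw_lines, store) -> int:
    """Parse each line and hand it to the store; return the number of stored rows."""
    for line in raw_lines:
        store.save(parse_reading(line))
    return len(store.rows)


def test_parse_reading_alone():
    # Exercises a single function in isolation.
    assert parse_reading("7,21.5") == {"device_id": 7, "temp": 21.5}


def test_parser_and_store_together():
    # Exercises the hand-off between the parser and the store.
    store = InMemoryStore()
    assert ingest(["1,20.0", "2,22.5"], store) == 2
    assert store.rows[1]["temp"] == 22.5


test_parse_reading_alone()
test_parser_and_store_together()
```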
