
Databricks Certified Data Engineer Professional (Databricks-Certified-Professional-Data-Engineer) practice questions and answers from ValidTests

Page 1 of 4 (questions 1-10)
Question # 1:

A Delta Lake table was created with the below query:

[Image: the original CREATE TABLE statement, not reproduced here]

Realizing that the original query had a typographical error, the below code was executed:

ALTER TABLE prod.sales_by_stor RENAME TO prod.sales_by_store

Which result will occur after running the second command?

Options:

A.

The table reference in the metastore is updated and no data is changed.

B.

The table name change is recorded in the Delta transaction log.

C.

All related files and metadata are dropped and recreated in a single ACID transaction.

D.

The table reference in the metastore is updated and all data files are moved.

E.

A new Delta transaction log is created for the renamed table.

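As a quick way to explore this rename behavior, the two statements can be reproduced in a Databricks notebook, where spark is the predefined SparkSession. This is a minimal sketch: the column list is a placeholder, since the original CREATE TABLE statement is only shown as an image.

# Stand-in for the original table created under the misspelled name.
# The columns are illustrative; the original DDL is not reproduced here.
spark.sql("CREATE TABLE IF NOT EXISTS prod.sales_by_stor (store_id INT, total DOUBLE) USING DELTA")

# The corrective rename from the question.
spark.sql("ALTER TABLE prod.sales_by_stor RENAME TO prod.sales_by_store")

# Inspect the outcome: table location and metadata, plus what (if anything)
# was recorded in the Delta transaction log.
spark.sql("DESCRIBE EXTENDED prod.sales_by_store").show(truncate=False)
spark.sql("DESCRIBE HISTORY prod.sales_by_store").show(truncate=False)
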
Question # 2:

The data governance team has instituted a requirement that all tables containing Personally Identifiable Information (PII) must be clearly annotated. This includes adding column comments, table comments, and setting the custom table property "contains_pii" = true.

The following SQL DDL statement is executed to create a new table:

[Image: the CREATE TABLE DDL statement, not reproduced here]

Which command allows manual confirmation that these three requirements have been met?

Options:

A.

DESCRIBE EXTENDED dev.pii_test

B.

DESCRIBE DETAIL dev.pii_test

C.

SHOW TBLPROPERTIES dev.pii_test

D.

DESCRIBE HISTORY dev.pii_test

E.

SHOW TABLES dev

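For hands-on practice with these annotations, the sketch below creates a table carrying column comments, a table comment, and the contains_pii property, then inspects it. The table name dev.pii_test and its columns are assumptions for illustration only, not the DDL from the question.

# Create a table that satisfies all three governance requirements.
spark.sql("""
CREATE TABLE IF NOT EXISTS dev.pii_test (
  id BIGINT COMMENT 'Surrogate key',
  name STRING COMMENT 'PII: customer name'
)
USING DELTA
COMMENT 'Contains PII'
TBLPROPERTIES ('contains_pii' = 'true')
""")

# Inspect column comments, the table comment, and table properties.
spark.sql("DESCRIBE EXTENDED dev.pii_test").show(truncate=False)
spark.sql("SHOW TBLPROPERTIES dev.pii_test").show(truncate=False)
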
Question # 3:

The data architect has mandated that all tables in the Lakehouse should be configured as external Delta Lake tables.

Which approach will ensure that this requirement is met?

Options:

A.

Whenever a database is being created, make sure that the location keyword is used.

B.

When configuring an external data warehouse for all table storage, leverage Databricks for all ELT.

C.

Whenever a table is being created, make sure that the location keyword is used.

D.

When tables are created, make sure that the external keyword is used in the create table statement.

E.

When the workspace is being configured, make sure that external cloud object storage has been mounted.

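To see what the LOCATION keyword does in practice, the sketch below creates a table against an explicit storage path and then checks how the metastore classifies it. The table name and path are placeholders.

# Creating a table with an explicit LOCATION makes it external (unmanaged):
# the data lives at the given path rather than in the default managed location.
spark.sql("""
CREATE TABLE IF NOT EXISTS sales_external (
  order_id BIGINT,
  amount DOUBLE
)
USING DELTA
LOCATION '/mnt/example/sales_external'
""")

# DESCRIBE EXTENDED reports the table type (EXTERNAL vs. MANAGED) and location.
spark.sql("DESCRIBE EXTENDED sales_external").show(truncate=False)
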
Question # 4:

Which statement regarding Spark configuration on the Databricks platform is true?

Options:

A.

Spark configuration properties set for an interactive cluster with the Clusters UI will impact all notebooks attached to that cluster.

B.

When the same Spark configuration property is set for an interactive cluster and a notebook attached to the same interactive cluster.

C.

Spark configuration set within a notebook will affect all SparkSessions attached to the same interactive cluster.

D.

The Databricks REST API can be used to modify the Spark configuration properties for an interactive cluster without interrupting jobs.

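One way to explore how these settings interact is to read and set a Spark SQL property from a notebook and compare it with what was configured in the cluster UI. spark.sql.shuffle.partitions is used here purely as an example property.

# Value of the property as seen by this notebook's SparkSession
# (it reflects any cluster-level setting unless overridden in this session).
print(spark.conf.get("spark.sql.shuffle.partitions"))

# Override the property for this SparkSession only.
spark.conf.set("spark.sql.shuffle.partitions", "64")
print(spark.conf.get("spark.sql.shuffle.partitions"))
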
Question # 5:

A Delta Lake table representing metadata about content posts from users has the following schema:

user_id LONG, post_text STRING, post_id STRING, longitude FLOAT, latitude FLOAT, post_time TIMESTAMP, date DATE

This table is partitioned by the date column. A query is run with the following filter:

longitude < 20 & longitude > -20

Which statement describes how data will be filtered?

Options:

A.

Statistics in the Delta Log will be used to identify partitions that might include files in the filtered range.

B.

No file skipping will occur because the optimizer does not know the relationship between the partition column and the longitude.

C.

The Delta Engine will use row-level statistics in the transaction log to identify the files that meet the filter criteria.

D.

Statistics in the Delta Log will be used to identify data files that might include records in the filtered range.

E.

The Delta Engine will scan the parquet file footers to identify each row that meets the filter criteria.

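To experiment with this scenario, the sketch below creates a table with the stated schema and partitioning, then runs the filter and prints the physical plan. The table name is a placeholder.

# Delta table matching the schema in the question, partitioned by date.
spark.sql("""
CREATE TABLE IF NOT EXISTS user_posts (
  user_id BIGINT,    -- LONG in the question's schema; BIGINT is the equivalent SQL type
  post_text STRING,
  post_id STRING,
  longitude FLOAT,
  latitude FLOAT,
  post_time TIMESTAMP,
  date DATE
)
USING DELTA
PARTITIONED BY (date)
""")

# The filter from the question. longitude is not the partition column, so
# partition pruning cannot apply to it; inspect the plan for pushed filters.
df = spark.table("user_posts").filter("longitude < 20 AND longitude > -20")
df.explain()
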
Question # 6:

Each configuration below is identical to the extent that each cluster has 400 GB of RAM in total, 160 cores in total, and only one Executor per VM.

Given a job with at least one wide transformation, which of the following cluster configurations will result in maximum performance?

Options:

A.

• Total VMs: 1

• 400 GB per Executor

• 160 Cores / Executor

B.

• Total VMs: 8

• 50 GB per Executor

• 20 Cores / Executor

C.

• Total VMs: 4

• 100 GB per Executor

• 40 Cores / Executor

D.

• Total VMs: 2

• 200 GB per Executor

• 80 Cores / Executor

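For context on what a wide transformation implies, the short sketch below runs a groupBy aggregation on a synthetic dataset; the shuffle it triggers is what makes the executor count, cores per executor, and memory per executor matter here.

from pyspark.sql import functions as F

# A wide transformation: groupBy/agg requires a shuffle (an Exchange in the plan).
df = spark.range(0, 10_000_000).withColumn("key", F.col("id") % 1000)
agg = df.groupBy("key").agg(F.count("*").alias("cnt"))
agg.explain()  # look for the Exchange (shuffle) stage

# Force execution without writing output (noop sink, Spark 3.0+).
agg.write.format("noop").mode("overwrite").save()
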
Question # 7:

In order to facilitate near real-time workloads, a data engineer is creating a helper function to leverage the schema detection and evolution functionality of Databricks Auto Loader. The desired function will automatically detect the schema of the source directory, incrementally process JSON files as they arrive in a source directory, and automatically evolve the schema of the table when new fields are detected.

The function is displayed below with a blank:

[Image: the helper function definition with a blank, not reproduced here]

Which response correctly fills in the blank to meet the specified requirements?

[Image: candidate code snippets for Options A-E, not reproduced here]

Options:

A.

Option A

B.

Option B

C.

Option C

D.

Option D

E.

Option E

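Because both the function body and the answer options are shown only as images, the sketch below is not any of the lettered options; it is simply one common shape of an Auto Loader ingest with schema inference and evolution, with placeholder paths and table name, assuming a Databricks notebook where spark is available.

# Auto Loader (cloudFiles) reading JSON incrementally, inferring the schema at
# the schema location and evolving the target Delta table as new fields appear.
def ingest_json(source_path, checkpoint_path, table_name):
    return (spark.readStream
        .format("cloudFiles")
        .option("cloudFiles.format", "json")
        .option("cloudFiles.schemaLocation", checkpoint_path)
        .load(source_path)
        .writeStream
        .option("checkpointLocation", checkpoint_path)
        .option("mergeSchema", "true")
        .table(table_name))
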
Question # 8:

In order to prevent accidental commits to production data, a senior data engineer has instituted a policy that all development work will reference clones of Delta Lake tables. After testing both deep and shallow clone, development tables are created using shallow clone.

A few weeks after initial table creation, the cloned versions of several tables implemented as Type 1 Slowly Changing Dimension (SCD) stop working. The transaction logs for the source tables show that vacuum was run the day before.

Why are the cloned tables no longer working?

Options:

A.

The data files compacted by vacuum are not tracked by the cloned metadata; running refresh on the cloned table will pull in recent changes.

B.

Because Type 1 changes overwrite existing records, Delta Lake cannot guarantee data consistency for cloned tables.

C.

The metadata created by the clone operation is referencing data files that were purged as invalid by the vacuum command.

D.

Running vacuum automatically invalidates any shallow clones of a table; deep clone should always be used when a cloned table will be repeatedly queried.

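To reproduce this situation end to end, the sketch below builds a shallow clone and a deep clone of a source table and then vacuums the source; all table names are placeholders.

# Shallow clone copies only Delta metadata and keeps referencing the source's
# data files; deep clone also copies the data files themselves.
spark.sql("CREATE TABLE IF NOT EXISTS dev.customers_shallow SHALLOW CLONE prod.customers")
spark.sql("CREATE TABLE IF NOT EXISTS dev.customers_deep DEEP CLONE prod.customers")

# VACUUM removes data files that the source table no longer references,
# subject to the retention window.
spark.sql("VACUUM prod.customers")
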
Question # 9:

A new data engineer notices that a critical field was omitted from an application that writes its Kafka source to Delta Lake. The field was present in the Kafka source, yet it is also missing from the data written to dependent, long-term storage. The retention threshold on the Kafka service is seven days. The pipeline has been in production for three months.

Which describes how Delta Lake can help to avoid data loss of this nature in the future?

Options:

A.

The Delta log and Structured Streaming checkpoints record the full history of the Kafka producer.

B.

Delta Lake schema evolution can retroactively calculate the correct value for newly added fields, as long as the data was in the original source.

C.

Delta Lake automatically checks that all fields present in the source data are included in the ingestion layer.

D.

Data can never be permanently dropped or deleted from Delta Lake, so data loss is not possible under any circumstance.

E.

Ingesting all raw data and metadata from Kafka to a bronze Delta table creates a permanent, replayable history of the data state.

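As an illustration of the bronze-layer approach referenced in the options, the sketch below lands the complete Kafka record (value plus metadata) in a Delta table so it can be replayed later; broker, topic, checkpoint path, and table name are placeholders.

# Persist the full Kafka payload (key, value, topic, partition, offset,
# timestamp) to a bronze Delta table for a permanent, replayable history.
raw = (spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "broker:9092")  # placeholder
    .option("subscribe", "events")                     # placeholder topic
    .load())

(raw.writeStream
    .option("checkpointLocation", "/tmp/checkpoints/bronze_events")  # placeholder
    .table("bronze_events"))
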
Question # 10:

Which of the following is true of Delta Lake and the Lakehouse?

Options:

A.

Because Parquet compresses data row by row, strings will only be compressed when a character is repeated multiple times.

B.

Delta Lake automatically collects statistics on the first 32 columns of each table, which are leveraged in data skipping based on query filters.

C.

Views in the Lakehouse maintain a valid cache of the most recent versions of source tables at all times.

D.

Primary and foreign key constraints can be leveraged to ensure duplicate values are never entered into a dimension table.

E.

Z-order can only be applied to numeric values stored in Delta Lake tables.

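Two of the features these options touch on can be exercised directly: the number of leading columns Delta collects file-level statistics for, and Z-ordering during OPTIMIZE. The sketch below is illustrative only; the table and column names are placeholders.

# delta.dataSkippingNumIndexedCols controls how many leading columns get
# per-file statistics (32 by default); OPTIMIZE ... ZORDER BY co-locates data
# on the chosen columns to improve file skipping for filters on them.
spark.sql("ALTER TABLE sales SET TBLPROPERTIES ('delta.dataSkippingNumIndexedCols' = '32')")
spark.sql("OPTIMIZE sales ZORDER BY (customer_id, order_date)")
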