
Databricks-Certified-Associate-Developer-for-Apache-Spark-3.5 Practice Questions (31-40)
Question # 31:

A data engineer wants to create a Streaming DataFrame that reads from a Kafka topic called feed.

Which code fragment should be inserted in line 5 to meet the requirement?

Code context:

spark \
    .readStream \
    .format("kafka") \
    .option("kafka.bootstrap.servers", "host1:port1,host2:port2") \
    .[LINE 5] \
    .load()

Options:

A.

.option("subscribe", "feed")

B.

.option("subscribe.topic", "feed")

C.

.option("kafka.topic", "feed")

D.

.option("topic", "feed")

Question # 32:

A data engineer writes the following code to join two DataFrames df1 and df2:

df1 = spark.read.csv("sales_data.csv")    # ~10 GB
df2 = spark.read.csv("product_data.csv")  # ~8 MB

result = df1.join(df2, df1.product_id == df2.product_id)

Which join strategy will Spark use?

Options:

A.

Shuffle join, because AQE is not enabled, and Spark uses a static query plan

B.

Broadcast join, as df2 is smaller than the default broadcast threshold

C.

Shuffle join, as the size difference between df1 and df2 is too large for a broadcast join to work efficiently

D.

Shuffle join because no broadcast hints were provided
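
For context, a minimal sketch of making the broadcast explicit; broadcast() is the documented PySpark hint, and the default spark.sql.autoBroadcastJoinThreshold is 10 MB:

from pyspark.sql.functions import broadcast

# df2 (~8 MB) is below the default 10 MB auto-broadcast threshold, so Spark can
# plan a broadcast hash join on its own; the hint below simply makes it explicit.
result = df1.join(broadcast(df2), df1.product_id == df2.product_id)
result.explain()  # the physical plan should show BroadcastHashJoin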

Question # 33:

A developer needs to write the output of a complex chain of Spark transformations to a Parquet table called events.liveLatest.

Consumers of this table query it frequently with filters on both year and month of the event_ts column (a timestamp).

The current code:

from pyspark.sql import functions as F

final = df.withColumn("event_year", F.year("event_ts")) \
    .withColumn("event_month", F.month("event_ts")) \
    .bucketBy(42, ["event_year", "event_month"]) \
    .saveAsTable("events.liveLatest")

However, consumers report poor query performance.

Which change will enable efficient querying by year and month?

Options:

A.

Replace .bucketBy() with .partitionBy("event_year", "event_month")

B.

Change the bucket count (42) to a lower number

C.

Add .sortBy() after .bucketBy()

D.

Replace .bucketBy() with .partitionBy("event_year") only
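
For illustration, a minimal sketch of the write path using partitionBy(), assuming the same df and column names as the question; with year/month as partition columns, filters on them can prune whole directories instead of scanning all files:

from pyspark.sql import functions as F

(df.withColumn("event_year", F.year("event_ts"))
   .withColumn("event_month", F.month("event_ts"))
   .write
   .partitionBy("event_year", "event_month")   # one directory per year/month combination
   .format("parquet")
   .saveAsTable("events.liveLatest"))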

Question # 34:

Which feature of Spark Connect should be considered when designing an application that plans to enable remote interaction with a Spark cluster?

Options:

A.

It is primarily used for data ingestion into Spark from external sources.

B.

It provides a way to run Spark applications remotely in any programming language.

C.

It can be used to interact with any remote cluster using the REST API.

D.

It allows for remote execution of Spark jobs.
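
For context, a minimal sketch of the Spark Connect client API (available since Spark 3.4); the host and port below are placeholders:

from pyspark.sql import SparkSession

spark = SparkSession.builder.remote("sc://spark-host:15002").getOrCreate()
df = spark.range(10)   # the query plan is built in the client process
print(df.count())      # execution happens on the remote cluster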

Question # 35:

What is the benefit of using Pandas API on Spark for data transformations?

Options:

A.

It executes queries faster using all the available cores in the cluster as well as provides Pandas's rich set of features.

B.

It is available only with Python, thereby reducing the learning curve.

C.

It runs on a single node only, utilizing memory efficiently.

D.

It computes results immediately using eager execution.
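
For context, a minimal sketch of the pandas API on Spark; the file path and column names are illustrative placeholders:

import pyspark.pandas as ps

psdf = ps.read_csv("/data/sales.csv")            # distributed and lazily evaluated
totals = psdf.groupby("region")["amount"].sum()  # familiar pandas syntax
print(totals.sort_values(ascending=False).head())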

Question # 36:

A developer wants to refactor some older Spark code to leverage built-in functions introduced in Spark 3.5.0. The existing code performs array manipulations manually. Which of the following code snippets utilizes new built-in functions in Spark 3.5.0 for array operations?

Options:

A.

result_df = prices_df \
    .withColumn("valid_price", F.when(F.col("spot_price") > F.lit(min_price), 1).otherwise(0))

B.

result_df = prices_df \
    .agg(F.count_if(F.col("spot_price") >= F.lit(min_price)))

C.

result_df = prices_df \
    .agg(F.min("spot_price"), F.max("spot_price"))

D.

result_df = prices_df \
    .agg(F.count("spot_price").alias("spot_price")) \
    .filter(F.col("spot_price") > F.lit("min_price"))
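
For reference, a minimal, runnable sketch of count_if(), which pyspark.sql.functions documents as new in 3.5.0; the sample data and threshold are illustrative only and an active SparkSession named spark is assumed:

from pyspark.sql import functions as F

prices_df = spark.createDataFrame([(8.5,), (12.0,), (15.3,)], ["spot_price"])  # sample data
min_price = 10.0                                                               # illustrative threshold

result_df = prices_df.agg(
    F.count_if(F.col("spot_price") >= F.lit(min_price)).alias("num_at_or_above_min")
)
result_df.show()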

Question # 37:

What is the difference between df.cache() and df.persist() for a Spark DataFrame?

Options:

A.

Both cache() and persist() can be used to set the default storage level (MEMORY_AND_DISK_SER)

B.

Both functions perform the same operation. The persist() function provides improved performance as its default storage level is DISK_ONLY.

C.

persist() - Persists the DataFrame with the default storage level (MEMORY_AND_DISK_SER) and cache() - Can be used to set different storage levels to persist the contents of the DataFrame.

D.

cache() - Persists the DataFrame with the default storage level (MEMORY_AND_DISK) and persist() - Can be used to set different storage levels to persist the contents of the DataFrame
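
For context, a minimal sketch contrasting the two calls: cache() takes no arguments, while persist() accepts a StorageLevel. The DataFrames below are illustrative only.

from pyspark import StorageLevel

df_a = spark.range(1000)
df_b = spark.range(1000)

df_a.cache()                          # default storage level for DataFrames (MEMORY_AND_DISK)
df_b.persist(StorageLevel.DISK_ONLY)  # persist() lets you choose the storage level
df_a.unpersist()
df_b.unpersist()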

Question # 38:

The following code fragment results in an error:

@F.udf(T.IntegerType())
def simple_udf(t: str) -> str:
    return answer * 3.14159

Which code fragment should be used instead?

Options:

A.

@F.udf(T.IntegerType())
def simple_udf(t: int) -> int:
    return t * 3.14159

B.

@F.udf(T.DoubleType())
def simple_udf(t: float) -> float:
    return t * 3.14159

C.

@F.udf(T.DoubleType())
def simple_udf(t: int) -> int:
    return t * 3.14159

D.

@F.udf(T.IntegerType())
def simple_udf(t: float) -> float:
    return t * 3.14159
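
For illustration, a minimal, runnable sketch of a UDF whose declared Spark return type matches what the Python function actually returns (a float); the sample data is added only for completeness:

from pyspark.sql import functions as F, types as T

@F.udf(T.DoubleType())
def simple_udf(t: float) -> float:
    return t * 3.14159

df = spark.createDataFrame([(1.0,), (2.0,)], ["t"])
df.select(simple_udf("t").alias("scaled")).show()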

Question # 39:

A data scientist at an e-commerce company is working with user data obtained from its subscriber database and has stored the data in a DataFrame df_user.

Before further processing, the data scientist wants to create another DataFrame df_user_non_pii and store only the non-PII columns.

The PII columns in df_user are name, email, and birthdate.

Which code snippet can be used to meet this requirement?

Options:

A.

df_user_non_pii = df_user.drop("name", "email", "birthdate")

B.

df_user_non_pii = df_user.dropFields("name", "email", "birthdate")

C.

df_user_non_pii = df_user.select("name", "email", "birthdate")

D.

df_user_non_pii = df_user.remove("name", "email", "birthdate")
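
For context, a minimal sketch of DataFrame.drop() with a small illustrative df_user; drop() returns a new DataFrame without the named columns and leaves the original unchanged:

df_user = spark.createDataFrame(
    [("Ana", "ana@example.com", "1990-01-01", 42)],
    ["name", "email", "birthdate", "user_id"],
)

df_user_non_pii = df_user.drop("name", "email", "birthdate")
df_user_non_pii.printSchema()  # only non-PII columns remain (here: user_id)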

Question # 40:

A data engineer needs to write a DataFrame df to a Parquet file, partitioned by the column country, and overwrite any existing data at the destination path.

Which code should the data engineer use to accomplish this task in Apache Spark?

Options:

A.

df.write.mode("overwrite").partitionBy("country").parquet("/data/output")

B.

df.write.mode("append").partitionBy("country").parquet("/data/output")

C.

df.write.mode("overwrite").parquet("/data/output")

D.

df.write.partitionBy("country").parquet("/data/output")
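
For illustration, a minimal sketch of a partitioned, overwriting Parquet write; the sample DataFrame is illustrative and the output path is the one from the question:

df = spark.createDataFrame([("US", 1), ("DE", 2)], ["country", "value"])

df.write.mode("overwrite").partitionBy("country").parquet("/data/output")
# Layout on disk: /data/output/country=US/... and /data/output/country=DE/...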
