
Pass the Databricks ML Data Scientist Databricks-Machine-Learning-Associate Questions and answers with ValidTests


Viewing page 1 out of 3 pages
Viewing questions 1-10 out of questions
Questions # 1:

Which of the following describes the relationship between native Spark DataFrames and pandas API on Spark DataFrames?

Options:

A.

pandas API on Spark DataFrames are single-node versions of Spark DataFrames with additional metadata

B.

pandas API on Spark DataFrames are more performant than Spark DataFrames

C.

pandas API on Spark DataFrames are made up of Spark DataFrames and additional metadata

D.

pandas API on Spark DataFrames are less mutable versions of Spark DataFrames

Questions # 2:

A data scientist has a Spark DataFrame spark_df. They want to create a new Spark DataFrame that contains only the rows from spark_df where the value in column price is greater than 0.

Which of the following code blocks will accomplish this task?

Options:

A.

spark_df[spark_df["price"] > 0]

B.

spark_df.filter(col("price") > 0)

C.

SELECT * FROM spark_df WHERE price > 0

D.

spark_df.loc[spark_df["price"] > 0,:]

E.

spark_df.loc[:,spark_df["price"] > 0]

Questions # 3:

Which of the following is a benefit of using vectorized pandas UDFs instead of standard PySpark UDFs?

Options:

A.

The vectorized pandas UDFs allow for the use of type hints

B.

The vectorized pandas UDFs process data in batches rather than one row at a time

C.

The vectorized pandas UDFs allow for pandas API use inside of the function

D.

The vectorized pandas UDFs work on distributed DataFrames

E.

The vectorized pandas UDFs process data in memory rather than spilling to disk

Questions # 4:

A data scientist has been given an incomplete notebook from the data engineering team. The notebook uses a Spark DataFrame spark_df on which the data scientist needs to perform further feature engineering. Unfortunately, the data scientist has not yet learned the PySpark DataFrame API.

Which of the following blocks of code can the data scientist run to be able to use the pandas API on Spark?

Options:

A.

import pyspark.pandas as ps

df = ps.DataFrame(spark_df)

B.

import pyspark.pandas as ps

df = ps.to_pandas(spark_df)

C.

spark_df.to_sql()

D.

import pandas as pd

df = pd.DataFrame(spark_df)

E.

spark_df.to_pandas()

Questions # 5:

A data scientist uses 3-fold cross-validation when optimizing model hyperparameters for a regression problem. The following root-mean-squared-error values are calculated on each of the validation folds:

• 10.0

• 12.0

• 17.0

Which of the following values represents the overall cross-validation root-mean-squared error?

Options:

A.

13.0

B.

17.0

C.

12.0

D.

39.0

E.

10.0
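Note: a common convention for summarizing k-fold cross-validation is to average the validation metric over the folds; Spark ML's CrossValidator reports its avgMetrics this way. The sketch below applies that convention to the three fold values from the question. The helper name mean_cv_metric is illustrative, not part of any library.

```python
def mean_cv_metric(fold_metrics):
    """Average a validation metric over the cross-validation folds."""
    return sum(fold_metrics) / len(fold_metrics)


# RMSE values on the three validation folds, taken from the question.
fold_rmse = [10.0, 12.0, 17.0]

overall_rmse = mean_cv_metric(fold_rmse)
print(overall_rmse)  # 13.0
```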

Questions # 6:

A data scientist has created a linear regression model that uses log(price) as a label variable. Using this model, they have performed inference and the predictions and actual label values are in Spark DataFrame preds_df.

They are using the following code block to evaluate the model:

regression_evaluator.setMetricName("rmse").evaluate(preds_df)

Which of the following changes should the data scientist make to evaluate the RMSE in a way that is comparable with price?

Options:

A.

They should exponentiate the computed RMSE value

B.

They should take the log of the predictions before computing the RMSE

C.

They should evaluate the MSE of the log predictions to compute the RMSE

D.

They should exponentiate the predictions before computing the RMSE
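Note: an RMSE computed on log(price) values is expressed in log units, and exponentiating that single number does not turn it into an error in price units. The sketch below uses plain Python and made-up prices (assumptions for illustration, not data from the question) to contrast the two scales: the values are mapped back with exp first, and the RMSE is computed afterwards.

```python
import math


def rmse(actual, predicted):
    """Root-mean-squared error between two equal-length sequences."""
    squared_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Hypothetical labels and predictions on the log(price) scale.
log_actual = [math.log(100.0), math.log(200.0)]
log_predicted = [math.log(110.0), math.log(180.0)]

# RMSE in log units: not directly comparable with price.
rmse_log_scale = rmse(log_actual, log_predicted)

# Map labels and predictions back to the price scale, then compute the RMSE.
actual_price = [math.exp(v) for v in log_actual]
predicted_price = [math.exp(v) for v in log_predicted]
rmse_price_scale = rmse(actual_price, predicted_price)

print(rmse_log_scale, rmse_price_scale)
```

With these made-up numbers the price errors are 10 and 20, so the price-scale RMSE is sqrt(250), while exp(rmse_log_scale) is a value near 1 with no such interpretation.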

Questions # 7:

A data scientist is developing a single-node machine learning model. They have a large number of model configurations to test as a part of their experiment. As a result, the model tuning process takes too long to complete. Which of the following approaches can be used to speed up the model tuning process?

Options:

A.

Implement MLflow Experiment Tracking

B.

Scale up with Spark ML

C.

Enable autoscaling clusters

D.

Parallelize with Hyperopt

Questions # 8:

A data scientist has produced three new models for a single machine learning problem. In the past, the solution used just one model. All four models have nearly the same prediction latency, but a machine learning engineer suggests that the new solution will be less time efficient during inference.

In which situation will the machine learning engineer be correct?

Options:

A.

When the new solution requires if-else logic determining which model to use to compute each prediction

B.

When the new solution's models have an average latency that is larger than the size of the original model

C.

When the new solution requires the use of fewer feature variables than the original model

D.

When the new solution requires that each model computes a prediction for every record

E.

When the new solution's models have an average size that is larger than the size of the original model

Questions # 9:

A machine learning engineer has been notified that a new Staging version of a model registered to the MLflow Model Registry has passed all tests. As a result, the machine learning engineer wants to put this model into production by transitioning it to the Production stage in the Model Registry.

From which of the following pages in Databricks Machine Learning can the machine learning engineer accomplish this task?

Options:

A.

The home page of the MLflow Model Registry

B.

The experiment page in the Experiments observatory

C.

The model version page in the MLflow Model Registry

D.

The model page in the MLflow Model Registry

Questions # 10:

A data scientist is working with a feature set with the following schema:

[Image: feature set schema for Question # 10; not reproduced here]

The customer_id column is the primary key in the feature set. Each of the columns in the feature set has missing values. They want to replace the missing values by imputing a common value for each feature.

Which of the following lists all of the columns in the feature set that need to be imputed using the most common value of the column?

Options:

A.

customer_id, loyalty_tier

B.

loyalty_tier

C.

units

D.

spend

E.

customer_id
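Note: imputing with the most common value (the mode) is typically used for categorical features, where a mean or median is not defined. The sketch below shows the mechanics in plain Python. The tier values are hypothetical, and because the schema image is not reproduced here, the sketch makes no claim about the column types in the question.

```python
from statistics import mode


def impute_with_mode(values):
    """Replace None entries with the most common observed value."""
    observed = [v for v in values if v is not None]
    most_common = mode(observed)
    return [most_common if v is None else v for v in values]


# Hypothetical categorical feature with two missing entries.
tiers = ["gold", None, "silver", "gold", None]

imputed = impute_with_mode(tiers)
print(imputed)  # ['gold', 'gold', 'silver', 'gold', 'gold']
```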
