Pass the Databricks ML Data Scientist Databricks-Machine-Learning-Associate Questions and answers with ValidTests

Viewing page 2 out of 3 pages
Viewing questions 11-20 out of questions
Questions # 11:

A data scientist wants to explore the Spark DataFrame spark_df. The exploration should include visual histograms showing the distribution of the numeric features.

Which of the following lines of code can the data scientist run to accomplish the task?

Options:

A.

spark_df.describe()

B.

dbutils.data(spark_df).summarize()

C.

This task cannot be accomplished in a single line of code.

D.

spark_df.summary()

E.

dbutils.data.summarize(spark_df)

Questions # 12:

A data scientist is developing a machine learning pipeline using AutoML on Databricks Machine Learning.

Which of the following steps will the data scientist need to perform outside of their AutoML experiment?

Options:

A.

Model tuning

B.

Model evaluation

C.

Model deployment

D.

Exploratory data analysis

Questions # 13:

A machine learning engineering team has a Job with three successive tasks. Each task runs a single notebook. The team has been alerted that the Job has failed in its latest run.

Which of the following approaches can the team use to identify which task is the cause of the failure?

Options:

A.

Run each notebook interactively

B.

Review the matrix view in the Job's runs

C.

Migrate the Job to a Delta Live Tables pipeline

D.

Change each Task’s setting to use a dedicated cluster

Questions # 14:

A data scientist has replaced missing values in their feature set with each respective feature variable’s median value. A colleague suggests that the data scientist is throwing away valuable information by doing this.

Which of the following approaches can they take to include as much information as possible in the feature set?

Options:

A.

Impute the missing values using each respective feature variable's mean value instead of the median value

B.

Refrain from imputing the missing values in favor of letting the machine learning algorithm determine how to handle them

C.

Remove all feature variables that originally contained missing values from the feature set

D.

Create a binary feature variable for each feature that contained missing values indicating whether each row's value has been imputed

E.

Create a constant feature variable for each feature that contained missing values indicating the percentage of rows from the feature that was originally missing
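As an illustration of the indicator-feature idea mentioned in the options, here is a minimal sketch in plain Python. The function name `impute_with_indicator` and the sample `ages` column are hypothetical and not part of the question; a real pipeline would typically do this with a DataFrame library rather than lists.

```python
from statistics import median

def impute_with_indicator(values):
    """Fill missing entries (None) with the median of the observed values.

    Returns the imputed column plus a parallel 0/1 column that records
    which rows were originally missing, so that information is not lost.
    """
    observed = [v for v in values if v is not None]
    fill_value = median(observed)
    imputed = [fill_value if v is None else v for v in values]
    was_imputed = [1 if v is None else 0 for v in values]
    return imputed, was_imputed

# Hypothetical feature column with two missing entries.
ages = [34, None, 29, 41, None]
imputed_ages, age_was_imputed = impute_with_indicator(ages)
print(imputed_ages)
print(age_was_imputed)
```

The second column lets a downstream model distinguish a genuinely observed median-like value from one that was filled in.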

Questions # 15:

Which of the following describes the relationship between native Spark DataFrames and pandas API on Spark DataFrames?

Options:

A.

pandas API on Spark DataFrames are single-node versions of Spark DataFrames with additional metadata

B.

pandas API on Spark DataFrames are more performant than Spark DataFrames

C.

pandas API on Spark DataFrames are made up of Spark DataFrames and additional metadata

D.

pandas API on Spark DataFrames are less mutable versions of Spark DataFrames

E.

pandas API on Spark DataFrames are unrelated to Spark DataFrames

Questions # 16:

A machine learning engineer is trying to scale a machine learning pipeline by distributing its feature engineering process.

Which of the following feature engineering tasks will be the least efficient to distribute?

Options:

A.

One-hot encoding categorical features

B.

Target encoding categorical features

C.

Imputing missing feature values with the mean

D.

Imputing missing feature values with the true median

E.

Creating binary indicator features for missing values
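To make the contrast between mean and true-median imputation concrete, the sketch below models a distributed dataset as a list of partitions. It is a simplified illustration under that assumption, not Spark code, and the function names are hypothetical. A mean can be combined from small per-partition summaries, while an exact median needs a global ordering of every value.

```python
from statistics import median

# Hypothetical data split across three workers.
partitions = [[1.0, 2.0, 100.0], [3.0, 4.0], [5.0, 6.0, 7.0, 8.0]]

def distributed_mean(parts):
    # Each partition only has to report (sum, count); combining is cheap.
    partials = [(sum(p), len(p)) for p in parts]
    total = sum(s for s, _ in partials)
    count = sum(n for _, n in partials)
    return total / count

def exact_median(parts):
    # An exact median depends on the global order of all values,
    # so every value has to be gathered (or globally sorted) first.
    all_values = [v for p in parts for v in p]
    return median(all_values)

def median_of_partition_medians(parts):
    # Combining per-partition medians is cheap but generally NOT exact.
    return median(median(p) for p in parts)

print(distributed_mean(partitions))
print(exact_median(partitions))
print(median_of_partition_medians(partitions))
```

In this example the median of the partition medians differs from the exact median, which is why per-partition shortcuts that work for the mean do not carry over.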

Questions # 17:

A data scientist is using MLflow to track their machine learning experiment. As a part of each of their MLflow runs, they are performing hyperparameter tuning. The data scientist would like to have one parent run for the tuning process with a child run for each unique combination of hyperparameter values. All parent and child runs are being manually started with mlflow.start_run.

Which of the following approaches can the data scientist use to accomplish this MLflow run organization?

Options:

A.

They can turn on Databricks Autologging

B.

They can specify nested=True when starting the child run for each unique combination of hyperparameter values

C.

They can start each child run inside the parent run's indented code block using mlflow.start_run()

D.

They can start each child run with the same experiment ID as the parent run

E.

They can specify nested=True when starting the parent run for the tuning process

Questions # 18:

A data scientist learned during their training to always use 5-fold cross-validation in their model development workflow. A colleague suggests that there are cases where a train-validation split could be preferred over k-fold cross-validation when k > 2.

Which of the following describes a potential benefit of using a train-validation split over k-fold cross-validation in this scenario?

Options:

A.

A holdout set is not necessary when using a train-validation split

B.

Reproducibility is achievable when using a train-validation split

C.

Fewer hyperparameter values need to be tested when using a train-validation split

D.

Bias is avoidable when using a train-validation split

E.

Fewer models need to be trained when using a train-validation split
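The training cost of the two validation strategies can be compared with simple arithmetic. The helper below is a hypothetical sketch that counts model fits for a hyperparameter search; it ignores any final refit on the full training data, which some libraries perform in addition.

```python
def models_trained(num_hyperparameter_combinations, num_folds=None):
    """Count model fits needed to evaluate every hyperparameter combination.

    num_folds=None models a single train-validation split (one fit per
    combination); otherwise each combination is fit once per fold.
    """
    if num_folds is None:
        return num_hyperparameter_combinations
    return num_hyperparameter_combinations * num_folds

# Hypothetical search over 6 hyperparameter combinations.
split_fits = models_trained(6)
kfold_fits = models_trained(6, num_folds=5)
print(split_fits, kfold_fits)
```

With the same six combinations, 5-fold cross-validation fits five times as many models as a single train-validation split.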

Questions # 19:

A data scientist is performing hyperparameter tuning using an iterative optimization algorithm. Each evaluation of unique hyperparameter values is being trained on a single compute node. They are performing eight total evaluations across eight total compute nodes. While the accuracy of the model does vary over the eight evaluations, they notice there is no trend of improvement in the accuracy. The data scientist believes this is due to the parallelization of the tuning process.

Which change could the data scientist make to improve their model accuracy over the course of their tuning process?

Options:

A.

Change the number of compute nodes to be half or less than half of the number of evaluations.

B.

Change the number of compute nodes and the number of evaluations to be much larger but equal.

C.

Change the iterative optimization algorithm used to facilitate the tuning process.

D.

Change the number of compute nodes to be double or more than double the number of evaluations.
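The interaction between parallelism and an iterative optimizer can be sketched with a simplified batch model. This is an assumption made for illustration: evaluations are launched in synchronized batches of size `parallelism`, and only results from completed batches can guide later choices. Real tuning tools may schedule asynchronously, and the function names here are hypothetical.

```python
import math

def sequential_rounds(num_evaluations, parallelism):
    # Number of batches the optimizer gets to run one after another.
    return math.ceil(num_evaluations / parallelism)

def informed_evaluations(num_evaluations, parallelism):
    # Evaluations launched after at least one earlier batch has finished,
    # i.e. the ones that can use previous results to pick better values.
    return max(0, num_evaluations - parallelism)

for nodes in (8, 4, 2):
    print(nodes, sequential_rounds(8, nodes), informed_evaluations(8, nodes))
```

Under this model, running eight evaluations on eight nodes leaves a single round in which no evaluation benefits from earlier results, while fewer nodes give the optimizer more rounds to learn from.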

Questions # 20:

A data scientist has developed a random forest regressor rfr and included it as the final stage in a Spark ML Pipeline pipeline. They then set up a cross-validation process with pipeline as the estimator in the following code block:

[Image: code block for Question # 20]

Which of the following is a negative consequence of including pipeline as the estimator in the cross-validation process rather than rfr as the estimator?

Options:

A.

The process will have a longer runtime because all stages of pipeline need to be refit or retransformed with each model

B.

The process will leak data from the training set to the test set during the evaluation phase

C.

The process will be unable to parallelize tuning due to the distributed nature of pipeline

D.

The process will leak data prep information from the validation sets to the training sets for each model
