Pass the Databricks ML Data Scientist Databricks-Machine-Learning-Associate Questions and answers with ValidTests

Viewing page 2 out of 3 pages

Viewing questions 11-20 out of questions

Questions # 11:

A data scientist wants to explore the Spark DataFrame spark_df. They want the exploration to include visual histograms showing the distribution of numeric features.

Which of the following lines of code can the data scientist run to accomplish the task?

Options:

spark_df.describe()

dbutils.data(spark_df).summarize()

This task cannot be accomplished in a single line of code.

spark_df.summary()

dbutils.data.summarize(spark_df)

Questions # 12:

A data scientist is developing a machine learning pipeline using AutoML on Databricks Machine Learning.

Which of the following steps will the data scientist need to perform outside of their AutoML experiment?

Options:

Model tuning

Model evaluation

Model deployment

Exploratory data analysis

Questions # 13:

A machine learning engineering team has a Job with three successive tasks. Each task runs a single notebook. The team has been alerted that the Job has failed in its latest run.

Which of the following approaches can the team use to identify which task is the cause of the failure?

Options:

Run each notebook interactively

Review the matrix view in the Job's runs

Migrate the Job to a Delta Live Tables pipeline

Change each Task’s setting to use a dedicated cluster

Questions # 14:

A data scientist has replaced missing values in their feature set with each respective feature variable’s median value. A colleague suggests that the data scientist is throwing away valuable information by doing this.

Which of the following approaches can they take to include as much information as possible in the feature set?

Options:

Impute the missing values using each respective feature variable's mean value instead of the median value

Refrain from imputing the missing values in favor of letting the machine learning algorithm determine how to handle them

Remove all feature variables that originally contained missing values from the feature set

Create a binary feature variable for each feature that contained missing values indicating whether each row's value has been imputed

Create a constant feature variable for each feature that contained missing values indicating the percentage of rows from the feature that was originally missing
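The binary-indicator option above can be sketched in plain Python. This is a minimal illustration rather than a Spark implementation; the column `age` and the helper `impute_with_indicator` are hypothetical names introduced here.

```python
from statistics import median

def impute_with_indicator(values):
    """Return (imputed_values, was_missing_flags) for one feature column."""
    observed = [v for v in values if v is not None]
    fill = median(observed)  # median of the non-missing values only
    imputed = [fill if v is None else v for v in values]
    flags = [1 if v is None else 0 for v in values]  # 1 marks an imputed row
    return imputed, flags

age = [30, None, 50, 40, None]  # hypothetical feature column
age_imputed, age_was_missing = impute_with_indicator(age)
```

The flag column preserves the fact that a value was originally missing, which the imputed column alone no longer shows.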

Questions # 15:

Which of the following describes the relationship between native Spark DataFrames and pandas API on Spark DataFrames?

Options:

pandas API on Spark DataFrames are single-node versions of Spark DataFrames with additional metadata

pandas API on Spark DataFrames are more performant than Spark DataFrames

pandas API on Spark DataFrames are made up of Spark DataFrames and additional metadata

pandas API on Spark DataFrames are less mutable versions of Spark DataFrames

pandas API on Spark DataFrames are unrelated to Spark DataFrames

Questions # 16:

A machine learning engineer is trying to scale a machine learning pipeline by distributing its feature engineering process.

Which of the following feature engineering tasks will be the least efficient to distribute?

Options:

One-hot encoding categorical features

Target encoding categorical features

Imputing missing feature values with the mean

Imputing missing feature values with the true median

Creating binary indicator features for missing values
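To see why the mean and the true median behave differently when data is split across workers, the following plain-Python sketch treats each inner list as one partition. The data is hypothetical, and the sketch is not how Spark itself computes these statistics.

```python
from statistics import median

partitions = [[1, 2, 3], [4, 100], [5, 6, 7, 8]]  # hypothetical partitioned data

# Mean: each partition reports (sum, count), and the partials combine exactly.
partials = [(sum(p), len(p)) for p in partitions]
total, count = map(sum, zip(*partials))
global_mean = total / count

# Median: combining per-partition medians does not generally give the true median.
median_of_medians = median(median(p) for p in partitions)
true_median = median(v for p in partitions for v in p)
```

Here the per-partition medians combine to a different value than the median of all rows, which is why an exact median typically needs a global sort or comparable data movement.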

Questions # 17:

A data scientist is using MLflow to track their machine learning experiment. As a part of each of their MLflow runs, they are performing hyperparameter tuning. The data scientist would like to have one parent run for the tuning process with a child run for each unique combination of hyperparameter values. All parent and child runs are being manually started with mlflow.start_run.

Which of the following approaches can the data scientist use to accomplish this MLflow run organization?

Options:

They can turn on Databricks Autologging

They can specify nested=True when starting the child run for each unique combination of hyperparameter values

They can start each child run inside the parent run's indented code block using mlflow.start_run()

They can start each child run with the same experiment ID as the parent run

They can specify nested=True when starting the parent run for the tuning process
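For reference, a parent run with one child run per hyperparameter combination might be organized as in the sketch below. The helpers `expand_grid` and `tune`, the run name, the metric name, and the `train_and_score` callback are all hypothetical; only `mlflow.start_run`, `mlflow.log_params`, and `mlflow.log_metric` are MLflow calls.

```python
from itertools import product

def expand_grid(grid):
    """Return one dict per unique combination of hyperparameter values."""
    names = list(grid)
    return [dict(zip(names, combo)) for combo in product(*grid.values())]

def tune(grid, train_and_score):
    """Log one parent run with a nested child run per combination."""
    import mlflow  # imported here so expand_grid works without MLflow installed

    with mlflow.start_run(run_name="tuning"):        # parent run
        for params in expand_grid(grid):
            with mlflow.start_run(nested=True):      # child run under the parent
                mlflow.log_params(params)
                mlflow.log_metric("score", train_and_score(params))

combos = expand_grid({"max_depth": [3, 5], "n_estimators": [50, 100]})
```

Calling `tune(grid, train_and_score)` requires MLflow and a user-supplied scoring function; `expand_grid` alone only enumerates the combinations.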

Questions # 18:

A data scientist learned during their training to always use 5-fold cross-validation in their model development workflow. A colleague suggests that there are cases where a train-validation split could be preferred over k-fold cross-validation when k > 2.

Which of the following describes a potential benefit of using a train-validation split over k-fold cross-validation in this scenario?

Options:

A holdout set is not necessary when using a train-validation split

Reproducibility is achievable when using a train-validation split

Fewer hyperparameter values need to be tested when using a train-validation split

Bias is avoidable when using a train-validation split

Fewer models need to be trained when using a train-validation split
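The training cost of the two validation schemes can be compared with simple arithmetic. The sketch below assumes a hypothetical grid of 3 x 4 hyperparameter values and ignores any final refit on the full training data.

```python
def models_trained(n_combinations, n_folds=1):
    """Number of model fits during tuning, excluding any final refit."""
    return n_combinations * n_folds

n_combinations = 3 * 4  # hypothetical grid: 3 values x 4 values

cv_fits = models_trained(n_combinations, n_folds=5)  # 5-fold cross-validation
split_fits = models_trained(n_combinations)          # single train-validation split
```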

Questions # 19:

A data scientist is performing hyperparameter tuning using an iterative optimization algorithm. Each evaluation of unique hyperparameter values is being trained on a single compute node. They are performing eight total evaluations across eight total compute nodes. While the accuracy of the model does vary over the eight evaluations, they notice there is no trend of improvement in the accuracy. The data scientist believes this is due to the parallelization of the tuning process.

Which change could the data scientist make to improve their model accuracy over the course of their tuning process?

Options:

Change the number of compute nodes to be half or less than half of the number of evaluations.

Change the number of compute nodes and the number of evaluations to be much larger but equal.

Change the iterative optimization algorithm used to facilitate the tuning process.

Change the number of compute nodes to be double or more than double the number of evaluations.
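An iterative optimizer proposes new hyperparameter values based on the results of earlier evaluations, so the degree of parallelism limits how much history each evaluation can use. The sketch below makes a simplifying assumption that evaluations run in synchronous batches, one per compute node; the function name `results_visible` is hypothetical.

```python
def results_visible(n_evals, parallelism):
    """For each evaluation, count the finished results available when it starts.

    Simplifying assumption: evaluations run in synchronous batches of
    `parallelism`, and a batch only sees results from earlier batches.
    """
    return [(i // parallelism) * parallelism for i in range(n_evals)]

all_parallel = results_visible(8, 8)   # eight evaluations on eight nodes
half_parallel = results_visible(8, 4)  # eight evaluations on four nodes
sequential = results_visible(8, 1)     # eight evaluations on one node
```

Under this assumption, running all eight evaluations at once means none of them can learn from another's result.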

Questions # 20:

A data scientist has developed a random forest regressor rfr and included it as the final stage in a Spark ML Pipeline pipeline. They then set up a cross-validation process with pipeline as the estimator in the following code block:

[Image: code block for Question # 20, not included in this text]

Which of the following is a negative consequence of including pipeline as the estimator in the cross-validation process rather than rfr as the estimator?

Options:

The process will have a longer runtime because all stages of pipeline need to be refit or retransformed with each model

The process will leak data from the training set to the test set during the evaluation phase

The process will be unable to parallelize tuning due to the distributed nature of pipeline

The process will leak data prep information from the validation sets to the training sets for each model
