Pass the Google Cloud Certified Professional-Data-Engineer Questions and answers with ValidTests

Viewing page 4 out of 12 pages
Viewing questions 31-40 out of questions
Questions # 31:

You need to create a data pipeline that copies time-series transaction data so that it can be queried from within BigQuery by your data science team for analysis. Every hour, thousands of transactions are updated with a new status. The size of the initial dataset is 1.5 PB, and it will grow by 3 TB per day. The data is heavily structured, and your data science team will build machine learning models based on this data. You want to maximize performance and usability for your data science team. Which two strategies should you adopt? Choose 2 answers.

Options:

A.

Denormalize the data as much as possible.

B.

Preserve the structure of the data as much as possible.

C.

Use BigQuery UPDATE to further reduce the size of the dataset.

D.

Develop a data pipeline where status updates are appended to BigQuery instead of updated.

E.

Copy a daily snapshot of transaction data to Cloud Storage and store it as an Avro file. Use BigQuery’s support for external data sources to query.
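
To make the append-versus-update distinction in option D concrete, here is a minimal plain-Python sketch, not BigQuery code. The row layout (`txn_id`, `status`, `updated_at`) and the sample values are hypothetical; the point is only that when every status change is stored as a new row, the current status can still be derived at read time by taking the most recent row per transaction.

```python
from datetime import datetime

# Hypothetical append-only log: each status change is a new row,
# and existing rows are never modified in place.
status_rows = [
    {"txn_id": "t1", "status": "PENDING", "updated_at": datetime(2024, 1, 1, 9, 0)},
    {"txn_id": "t2", "status": "PENDING", "updated_at": datetime(2024, 1, 1, 9, 5)},
    {"txn_id": "t1", "status": "SETTLED", "updated_at": datetime(2024, 1, 1, 10, 0)},
]


def latest_status(rows):
    """Return the most recent status per transaction from an append-only log."""
    latest = {}
    for row in rows:
        current = latest.get(row["txn_id"])
        # Keep the row with the newest timestamp for each transaction.
        if current is None or row["updated_at"] > current["updated_at"]:
            latest[row["txn_id"]] = row
    return {txn_id: row["status"] for txn_id, row in latest.items()}


print(latest_status(status_rows))
```

In a warehouse, the same "latest row per key" logic would typically be expressed as a query over the appended rows rather than as application code.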

Questions # 32:

You have a data pipeline that writes data to Cloud Bigtable using well-designed row keys. You want to monitor your pipeline to determine when to increase the size of your Cloud Bigtable cluster. Which two actions can you take to accomplish this? Choose 2 answers.

Options:

A.

Review Key Visualizer metrics. Increase the size of the Cloud Bigtable cluster when the Read pressure index is above 100.

B.

Review Key Visualizer metrics. Increase the size of the Cloud Bigtable cluster when the Write pressure index is above 100.

C.

Monitor the latency of write operations. Increase the size of the Cloud Bigtable cluster when there is a sustained increase in write latency.

D.

Monitor storage utilization. Increase the size of the Cloud Bigtable cluster when utilization increases above 70% of max capacity.

E.

Monitor latency of read operations. Increase the size of the Cloud Bigtable cluster if read operations take longer than 100 ms.
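
The threshold-style checks described in the options can be sketched in plain Python. This is only an illustration of the two kinds of rule (a utilization ratio and a sustained latency increase), not a call into any Cloud Bigtable or Cloud Monitoring API. The 70% figure comes from option D; the baseline latency, the sample count, and both function names are assumptions made for this sketch.

```python
def storage_over_threshold(used_bytes, max_bytes, threshold=0.70):
    """True when storage utilization exceeds the given fraction of capacity."""
    return used_bytes / max_bytes > threshold


def sustained_increase(latencies_ms, baseline_ms, min_samples=3):
    """True when the last `min_samples` readings all exceed the baseline.

    A single spike does not count; the increase has to persist.
    """
    recent = latencies_ms[-min_samples:]
    return len(recent) == min_samples and all(x > baseline_ms for x in recent)


# Hypothetical readings: the last three samples stay above a 10 ms baseline.
print(sustained_increase([5, 6, 12, 14, 15], baseline_ms=10))
print(storage_over_threshold(used_bytes=75, max_bytes=100))
```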

Questions # 33:

You created an analytics environment on Google Cloud so that your data scientist team can explore data without impacting the on-premises Apache Hadoop solution. The data in the on-premises Hadoop Distributed File System (HDFS) cluster is in Optimized Row Columnar (ORC) formatted files with multiple columns of Hive partitioning. The data scientist team needs to be able to explore the data in a similar way as they used the on-premises HDFS cluster with SQL on the Hive query engine. You need to choose the most cost-effective storage and processing solution. What should you do?

Options:

A.

Import the ORC files to Bigtable tables for the data scientist team.

B.

Import the ORC files to BigQuery tables for the data scientist team.

C.

Copy the ORC files to Cloud Storage, then deploy a Dataproc cluster for the data scientist team.

D.

Copy the ORC files to Cloud Storage, then create external BigQuery tables for the data scientist team.
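
The question mentions files with multiple columns of Hive partitioning. In the Hive layout, partition columns are encoded in the directory names as `key=value` segments. The following sketch shows how such a path can be decoded; the bucket name, table name, and partition keys in the example path are hypothetical, and the helper is not part of any Google Cloud library.

```python
def parse_hive_partitions(path):
    """Extract key=value partition segments from a Hive-style object path."""
    partitions = {}
    for segment in path.split("/"):
        if "=" in segment:
            # Split on the first "=" only, so values may contain "=".
            key, _, value = segment.partition("=")
            partitions[key] = value
    return partitions


# Hypothetical object path with two partition columns.
example = "gs://example-bucket/sales/year=2023/month=07/part-0000.orc"
print(parse_hive_partitions(example))
```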

Questions # 34:

Your team is responsible for developing and maintaining ETLs in your company. One of your Dataflow jobs is failing because of some errors in the input data, and you need to improve the reliability of the pipeline (including the ability to reprocess all failing data).

What should you do?

Options:

A.

Add a filtering step to skip these types of errors in the future, and extract erroneous rows from logs.

B.

Add a try...catch block to your DoFn that transforms the data, and extract erroneous rows from logs.

C.

Add a try...catch block to your DoFn that transforms the data, and write erroneous rows to Pub/Sub directly from the DoFn.

D.

Add a try...catch block to your DoFn that transforms the data, and use a sideOutput to create a PCollection that can be stored to Pub/Sub later.
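
Several of the options describe catching errors inside the transform and routing bad rows somewhere they can be reprocessed. The sketch below is a plain-Python analogue of that idea, often called a dead-letter pattern. It does not use the Beam or Dataflow SDK: two lists stand in for the main output and the separate error output, and `parse_amount` is a hypothetical transform invented for the example.

```python
def parse_amount(record):
    """Hypothetical transform: parse 'id,amount' into an (id, float) pair."""
    record_id, amount = record.split(",")
    return record_id, float(amount)


def process_with_dead_letter(records, transform):
    """Apply `transform` to each record, diverting failures instead of crashing."""
    good, dead_letter = [], []
    for record in records:
        try:
            good.append(transform(record))
        except Exception as exc:
            # Keep the original record so it can be reprocessed later.
            dead_letter.append({"record": record, "error": repr(exc)})
    return good, dead_letter


good, failed = process_with_dead_letter(["a,1.5", "b,oops", "c"], parse_amount)
print(good)
print([f["record"] for f in failed])
```

Keeping the raw failing record next to the error message is what makes later reprocessing possible; a log line alone would have to be parsed back into data.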

Questions # 35:

You are planning to use Google's Dataflow SDK to analyze customer data such as that shown below. Your project requirement is to extract only the customer name from the data source and then write to an output PCollection.

Tom,555 X street

Tim,553 Y street

Sam, 111 Z street

Which operation is best suited for the above data processing requirement?

Options:

A.

ParDo

B.

Sink API

C.

Source API

D.

Data extraction
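
The requirement in this question is an element-wise operation: each input line is processed independently and yields one output value. The per-element logic can be sketched in plain Python as follows, using the three sample rows from the question. The function name `extract_name` is an assumption for illustration, and the list comprehension merely stands in for whatever per-element transform the SDK would apply.

```python
def extract_name(line):
    """Return the customer name, i.e. the text before the first comma."""
    return line.split(",", 1)[0].strip()


lines = ["Tom,555 X street", "Tim,553 Y street", "Sam, 111 Z street"]

# Apply the function to every element independently.
names = [extract_name(line) for line in lines]
print(names)  # ['Tom', 'Tim', 'Sam']
```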

Questions # 36:

Which role must be assigned to a service account used by the virtual machines in a Dataproc cluster so they can execute jobs?

Options:

A.

Dataproc Worker

B.

Dataproc Viewer

C.

Dataproc Runner

D.

Dataproc Editor

Questions # 37:

Does Dataflow process batch data pipelines or streaming data pipelines?

Options:

A.

Only Batch Data Pipelines

B.

Both Batch and Streaming Data Pipelines

C.

Only Streaming Data Pipelines

D.

None of the above

Questions # 38:

Which Cloud Dataflow / Beam feature should you use to aggregate data in an unbounded data source every hour based on the time when the data entered the pipeline?

Options:

A.

An hourly watermark

B.

An event time trigger

C.

The withAllowedLateness method

D.

A processing time trigger
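
The question distinguishes the time an event occurred from the time it entered the pipeline. The plain-Python sketch below shows how the same events fall into different hourly buckets depending on which timestamp is used. It only groups by timestamp and does not model Beam windows, triggers, or watermarks; the field names `event_time` and `arrived_at` and the sample data are hypothetical.

```python
from collections import defaultdict
from datetime import datetime

# Hypothetical events: each records when it happened and when it arrived.
events = [
    {"event_time": datetime(2024, 1, 1, 8, 55), "arrived_at": datetime(2024, 1, 1, 9, 2)},
    {"event_time": datetime(2024, 1, 1, 9, 10), "arrived_at": datetime(2024, 1, 1, 9, 11)},
    {"event_time": datetime(2024, 1, 1, 9, 50), "arrived_at": datetime(2024, 1, 1, 10, 1)},
]


def hourly_counts(events, time_field):
    """Count events per hour, bucketing by the chosen timestamp field."""
    counts = defaultdict(int)
    for event in events:
        # Truncate the timestamp to the start of its hour.
        hour = event[time_field].replace(minute=0, second=0, microsecond=0)
        counts[hour] += 1
    return dict(counts)


by_arrival = hourly_counts(events, "arrived_at")
by_event = hourly_counts(events, "event_time")
print(by_arrival)
print(by_event)
```

With this data, the event that occurred at 08:55 but arrived at 09:02 lands in the 09:00 bucket when grouping by arrival and in the 08:00 bucket when grouping by event time.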

Questions # 39:

Which of the following is not true about Dataflow pipelines?

Options:

A.

Pipelines are a set of operations

B.

Pipelines represent a data processing job

C.

Pipelines represent a directed graph of steps

D.

Pipelines can share data between instances

Questions # 40:

Cloud Dataproc is a managed Apache Hadoop and Apache _____ service.

Options:

A.

Blaze

B.

Spark

C.

Fire

D.

Ignite
