Pass the Google Cloud Certified Professional-Data-Engineer Questions and answers with ValidTests

Viewing page 4 out of 12 pages

Viewing questions 31-40 out of questions

Questions # 31:

You need to create a data pipeline that copies time-series transaction data so that it can be queried from within BigQuery by your data science team for analysis. Every hour, thousands of transactions are updated with a new status. The size of the initial dataset is 1.5 PB, and it will grow by 3 TB per day. The data is heavily structured, and your data science team will build machine learning models based on this data. You want to maximize performance and usability for your data science team. Which two strategies should you adopt? Choose 2 answers.

Options:

Denormalize the data as much as possible.

Preserve the structure of the data as much as possible.

Use BigQuery UPDATE to further reduce the size of the dataset.

Develop a data pipeline where status updates are appended to BigQuery instead of updated.

Copy a daily snapshot of transaction data to Cloud Storage and store it as an Avro file. Use BigQuery’s support for external data sources to query.
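One of the options above describes appending status updates instead of updating rows in place. As a minimal plain-Python sketch of that idea (not BigQuery code), the current status of each transaction can be recovered from an append-only log by keeping the most recent row per transaction. The field names `txn_id`, `status`, and `updated_at` and the sample rows are assumptions made for illustration.

```python
from datetime import datetime

# Hypothetical append-only log: each status change is a new row, never an UPDATE.
events = [
    {"txn_id": "t1", "status": "PENDING", "updated_at": datetime(2024, 1, 1, 9, 0)},
    {"txn_id": "t2", "status": "PENDING", "updated_at": datetime(2024, 1, 1, 9, 5)},
    {"txn_id": "t1", "status": "SETTLED", "updated_at": datetime(2024, 1, 1, 10, 0)},
]


def latest_status(rows):
    """Return {txn_id: status} using the most recent row per transaction."""
    latest = {}
    for row in rows:
        seen = latest.get(row["txn_id"])
        # Keep this row if it is the first or the newest seen for the transaction.
        if seen is None or row["updated_at"] > seen["updated_at"]:
            latest[row["txn_id"]] = row
    return {txn_id: row["status"] for txn_id, row in latest.items()}


current_statuses = latest_status(events)
```

Here `current_statuses` maps `t1` to `SETTLED` and `t2` to `PENDING`, while the full history of status changes stays available in `events`.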

Questions # 32:

You have a data pipeline that writes data to Cloud Bigtable using well-designed row keys. You want to monitor your pipeline to determine when to increase the size of your Cloud Bigtable cluster. Which two actions can you take to accomplish this? Choose 2 answers.

Options:

Review Key Visualizer metrics. Increase the size of the Cloud Bigtable cluster when the Read pressure index is above 100.

Review Key Visualizer metrics. Increase the size of the Cloud Bigtable cluster when the Write pressure index is above 100.

Monitor the latency of write operations. Increase the size of the Cloud Bigtable cluster when there is a sustained increase in write latency.

Monitor storage utilization. Increase the size of the Cloud Bigtable cluster when utilization increases above 70% of max capacity.

Monitor latency of read operations. Increase the size of the Cloud Bigtable cluster if read operations take longer than 100 ms.

Questions # 33:

You created an analytics environment on Google Cloud so that your data scientist team can explore data without impacting the on-premises Apache Hadoop solution. The data in the on-premises Hadoop Distributed File System (HDFS) cluster is in Optimized Row Columnar (ORC) formatted files with multiple columns of Hive partitioning. The data scientist team needs to be able to explore the data in a similar way as they used the on-premises HDFS cluster with SQL on the Hive query engine. You need to choose the most cost-effective storage and processing solution. What should you do?

Options:

Import the ORC files to Bigtable tables for the data scientist team.

Import the ORC files to BigQuery tables for the data scientist team.

Copy the ORC files to Cloud Storage, then deploy a Dataproc cluster for the data scientist team.

Copy the ORC files to Cloud Storage, then create external BigQuery tables for the data scientist team.

Expert Solution

Answer: D (copy the ORC files to Cloud Storage, then create external BigQuery tables)

Explanation

The requirements are:

Explore ORC formatted files with Hive partitioning.

Mimic the SQL on Hive query engine experience.

Cost-effective storage and processing.

Avoid impacting the on-premises Hadoop solution.

Let's analyze the options:

Option A (Import to Bigtable): Bigtable is a NoSQL database, not suited for SQL-based exploration of ORC files or Hive-style partitioning directly. This would require significant data transformation and a different query paradigm. Not cost-effective for this use case.

Option B (Import to BigQuery native tables): Importing data into BigQuery native storage is an option. BigQuery can load ORC files. This provides excellent query performance. However, it involves an ETL step (importing) and storage costs for the data within BigQuery, which might be higher than keeping it in its original format on Cloud Storage if query patterns are exploratory and not extremely frequent on all data.

Option C (Copy to Cloud Storage, deploy Dataproc): Dataproc allows you to run Hadoop/Spark (and thus Hive) clusters on Google Cloud. This would provide a very similar experience ("SQL on the Hive query engine"). However, running a persistent Dataproc cluster incurs compute costs for the cluster nodes, even when not actively querying. While ephemeral clusters are possible, they add operational overhead for exploratory queries. Storage on Cloud Storage is cost-effective.

Option D (Copy to Cloud Storage, create external BigQuery tables): This is often the most cost-effective and straightforward solution for this scenario.

Cost-effective Storage: Cloud Storage is a low-cost option for storing files like ORC.

SQL Interface: BigQuery provides a familiar SQL interface.

External Tables: BigQuery can query data directly from Cloud Storage (including ORC files) using external tables. This avoids the need to load data into BigQuery's managed storage, saving on storage costs and ETL effort.

Hive Partitioning: BigQuery external tables support Hive partitioning layouts. When you define the external table, you can specify the partitioning scheme, and BigQuery will use partition pruning to scan only relevant partitions, improving performance and reducing costs for queries that filter on partition keys. This directly mimics the Hive experience.

Processing Cost: You only pay for the data scanned by BigQuery queries, which aligns with exploratory analysis.
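The partition pruning mentioned above can be illustrated with a small plain-Python sketch. It is only a model of the idea, not how BigQuery implements it: partition values are read from `key=value` path segments, and files whose values do not match the filter are skipped without being opened. The bucket name, the partition columns `dt` and `country`, and the file names are assumptions made for illustration.

```python
def parse_hive_partitions(path):
    """Extract key=value partition segments from a Hive-style object path."""
    parts = {}
    for segment in path.split("/"):
        if "=" in segment:
            key, _, value = segment.partition("=")
            parts[key] = value
    return parts


def prune(paths, **filters):
    """Keep only files whose partition values match every filter."""
    return [
        p for p in paths
        if all(parse_hive_partitions(p).get(k) == v for k, v in filters.items())
    ]


# Hypothetical object names; bucket and column names are illustrative only.
files = [
    "gs://example-bucket/sales/dt=2024-01-01/country=US/part-0.orc",
    "gs://example-bucket/sales/dt=2024-01-01/country=DE/part-0.orc",
    "gs://example-bucket/sales/dt=2024-01-02/country=US/part-0.orc",
]

# Only the first file matches both partition filters, so only it would be scanned.
selected = prune(files, dt="2024-01-01", country="US")
```

A query that filters on both partition keys would, in this model, read one of the three files; a query with no partition filter would read all of them.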

Comparing D with B: External tables are generally more cost-effective for storage and initial setup if the data is already in ORC and an ETL process into BigQuery native storage is to be avoided. Query performance might be slightly lower than with native tables but is often sufficient for exploration, especially with partitioning.

Comparing D with C: BigQuery external tables are serverless, meaning there is no cluster to manage or pay for when idle. Dataproc requires managing and paying for a cluster. For exploration, the serverless nature of BigQuery is usually more cost-effective.

Therefore, copying ORC files to Cloud Storage and using BigQuery external tables is the most cost-effective solution that meets all requirements.
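As a rough sketch of what defining such an external table could look like, the helper below assembles a BigQuery `CREATE EXTERNAL TABLE` statement as a string. The function name, the table name, and the bucket path are hypothetical, and the statement is not executed here; the option names should be checked against the current BigQuery DDL reference before use.

```python
def external_table_ddl(table, uri_prefix, file_format="ORC"):
    """Build a CREATE EXTERNAL TABLE statement for Hive-partitioned files.

    table      -- hypothetical fully qualified table name, e.g. "ds.sales_ext"
    uri_prefix -- Cloud Storage prefix that sits above the key=value folders
    """
    prefix = uri_prefix.rstrip("/")
    return (
        f"CREATE EXTERNAL TABLE `{table}`\n"
        "WITH PARTITION COLUMNS\n"  # partition keys are detected from the paths
        "OPTIONS (\n"
        f"  format = '{file_format}',\n"
        f"  uris = ['{prefix}/*'],\n"
        f"  hive_partition_uri_prefix = '{prefix}'\n"
        ");"
    )


# Illustrative names only; nothing is sent to BigQuery.
ddl = external_table_ddl("analytics.sales_ext", "gs://example-bucket/sales/")
print(ddl)
```

The data stays in Cloud Storage; only the table definition lives in BigQuery, which is what keeps storage cost and setup effort low in this scenario.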

References:

Google Cloud Documentation: BigQuery > External data sources > Querying Cloud Storage data. "You can query data in Cloud Storage by using external tables or federated queries. External tables are tables that read data directly from files in Cloud Storage."

Google Cloud Documentation: BigQuery > External data sources > Supported formats and compression types. ORC is a supported format.

Google Cloud Documentation: BigQuery > Creating and using tables > Creating external tables. "External tables let you query data stored in Cloud Storage as if it were a standard BigQuery table. You can use external tables to query data in various formats, including... ORC..."

Google Cloud Documentation: BigQuery > Creating and using tables > Querying partitioned external tables. "You can create an external table that is partitioned on Hive partitioning keys. When you query a Hive partitioned external table, BigQuery performs partition pruning to skip reading unnecessary partitions." This directly addresses the "Hive partitioning" and "explore data in a similar way" requirements.

Google Cloud Blog: "Choosing the right data processing option on GCP: BigQuery vs. Dataproc" (and similar articles) often highlight BigQuery external tables as a cost-effective way to query data in place on Cloud Storage, especially for data lake scenarios.

Questions # 34:

Your team is responsible for developing and maintaining ETLs in your company. One of your Dataflow jobs is failing because of errors in the input data, and you need to improve the reliability of the pipeline (including being able to reprocess all failing data).

What should you do?

Options:

Add a filtering step to skip these types of errors in the future, extract erroneous rows from logs.

Add a try… catch block to your DoFn that transforms the data, extract erroneous rows from logs.

Add a try… catch block to your DoFn that transforms the data, write erroneous rows to PubSub directly from the DoFn.

Add a try… catch block to your DoFn that transforms the data, use a sideOutput to create a PCollection that can be stored to PubSub later.
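Several of the options above combine a try/catch around the transform with a separate destination for the rows that fail. A minimal plain-Python analogy of that pattern follows; it does not use the Dataflow or Beam API. Failing rows are captured, together with the error, in a second collection so they can be stored and reprocessed later. The `parse_amount` transform and the sample rows are assumptions made for illustration.

```python
def parse_amount(row):
    """Hypothetical transform: parse 'id,amount' into (id, float)."""
    record_id, amount = row.split(",")
    return record_id, float(amount)


def process(rows, transform):
    """Apply transform to each row; route failures to a separate output."""
    main_output, dead_letter = [], []
    for row in rows:
        try:
            main_output.append(transform(row))
        except Exception as exc:
            # Keep the raw row and the error so the row can be reprocessed.
            dead_letter.append({"row": row, "error": repr(exc)})
    return main_output, dead_letter


good, bad = process(["a,1.5", "b,oops", "c"], parse_amount)
```

In this run `good` holds the one parsed record and `bad` holds the two raw rows that failed, so the job keeps running and no input is lost.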

Questions # 35:

You are planning to use Google's Dataflow SDK to analyze customer data such as the sample displayed below. Your project requirement is to extract only the customer name from the data source and then write it to an output PCollection.

Tom,555 X street

Tim,553 Y street

Sam, 111 Z street

Which operation is best suited for the above data processing requirement?

Options:

ParDo

Sink API

Source API

Data extraction
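The processing requirement itself, taking each input line and emitting only the customer name, is an element-wise operation. A minimal plain-Python sketch of it, using the three sample lines from the question and no Dataflow SDK calls, might look like this; the function name `extract_name` is hypothetical.

```python
# The three sample records from the question.
lines = ["Tom,555 X street", "Tim,553 Y street", "Sam, 111 Z street"]


def extract_name(line):
    """Return the text before the first comma, stripped of whitespace."""
    return line.split(",", 1)[0].strip()


# Apply the same function independently to every element.
names = [extract_name(line) for line in lines]
```

Each output element depends only on its own input element, which is the property that lets a pipeline run the function in parallel across workers.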

Questions # 36:

Which role must be assigned to a service account used by the virtual machines in a Dataproc cluster so they can execute jobs?

Options:

Dataproc Worker

Dataproc Viewer

Dataproc Runner

Dataproc Editor

Questions # 37:

Does Dataflow process batch data pipelines or streaming data pipelines?

Options:

Only Batch Data Pipelines

Both Batch and Streaming Data Pipelines

Only Streaming Data Pipelines

None of the above

Questions # 38:

Which Cloud Dataflow / Beam feature should you use to aggregate data in an unbounded data source every hour based on the time when the data entered the pipeline?

Options:

An hourly watermark

An event time trigger

The withAllowedLateness method

A processing time trigger
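The distinction the question relies on is between the time an event occurred (event time) and the time it entered the pipeline (processing time). The plain-Python sketch below, which does not use the Beam API, groups the same elements into hourly buckets under each notion of time. The field names `event_time` and `arrival_time` and the sample values are assumptions made for illustration.

```python
from collections import defaultdict
from datetime import datetime


def hour_bucket(ts):
    """Truncate a timestamp to the start of its hour."""
    return ts.replace(minute=0, second=0, microsecond=0)


def aggregate_hourly(elements, time_field):
    """Sum 'value' per hour, bucketing on the chosen timestamp field."""
    totals = defaultdict(int)
    for element in elements:
        totals[hour_bucket(element[time_field])] += element["value"]
    return dict(totals)


# Hypothetical elements: the first one occurred at 09:55 but arrived at 10:02.
elements = [
    {"value": 1, "event_time": datetime(2024, 1, 1, 9, 55),
     "arrival_time": datetime(2024, 1, 1, 10, 2)},
    {"value": 2, "event_time": datetime(2024, 1, 1, 10, 10),
     "arrival_time": datetime(2024, 1, 1, 10, 11)},
]

by_arrival = aggregate_hourly(elements, "arrival_time")
by_event = aggregate_hourly(elements, "event_time")
```

Bucketing on arrival time puts both elements in the 10:00 hour, while bucketing on event time splits them between the 09:00 and 10:00 hours.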

Questions # 39:

Which of the following is not true about Dataflow pipelines?

Options:

Pipelines are a set of operations

Pipelines represent a data processing job

Pipelines represent a directed graph of steps

Pipelines can share data between instances

Questions # 40:

Cloud Dataproc is a managed Apache Hadoop and Apache _____ service.

Options:

Blaze

Spark

Fire

Ignite
