Pass the Google Cloud Certified Professional-Data-Engineer Questions and answers with ValidTests

Viewing page 10 out of 12 pages

Viewing questions 91-100 out of questions

Questions # 91:

Your company has a hybrid cloud initiative. You have a complex data pipeline that moves data between cloud provider services and leverages services from each of the cloud providers. Which cloud-native service should you use to orchestrate the entire pipeline?

Options:

Cloud Dataflow

Cloud Composer

Cloud Dataprep

Cloud Dataproc

Questions # 92:

You need to create a data pipeline that copies time-series transaction data so that it can be queried from within BigQuery by your data science team for analysis. Every hour, thousands of transactions are updated with a new status. The size of the initial dataset is 1.5 PB, and it will grow by 3 TB per day. The data is heavily structured, and your data science team will build machine learning models based on this data. You want to maximize performance and usability for your data science team. Which two strategies should you adopt? Choose 2 answers.

Options:

Denormalize the data as much as possible.

Preserve the structure of the data as much as possible.

Use BigQuery UPDATE to further reduce the size of the dataset.

Develop a data pipeline where status updates are appended to BigQuery instead of updated.

Copy a daily snapshot of transaction data to Cloud Storage and store it as an Avro file. Use BigQuery’s support for external data sources to query.

Questions # 93:

You are designing an Apache Beam pipeline to enrich data from Cloud Pub/Sub with static reference data from BigQuery. The reference data is small enough to fit in memory on a single worker. The pipeline should write enriched results to BigQuery for analysis. Which job type and transforms should this pipeline use?

Options:

Batch job, PubSubIO, side-inputs

Streaming job, PubSubIO, JdbcIO, side-outputs

Streaming job, PubSubIO, BigQueryIO, side-inputs

Streaming job, PubSubIO, BigQueryIO, side-outputs

Questions # 94:

You set up a streaming data insert into a Redis cluster via a Kafka cluster. Both clusters are running on Compute Engine instances. You need to encrypt data at rest with encryption keys that you can create, rotate, and destroy as needed. What should you do?

Options:

Create a dedicated service account, and use encryption at rest to reference your data stored in your Compute Engine cluster instances as part of your API service calls.

Create encryption keys in Cloud Key Management Service. Use those keys to encrypt your data in all of the Compute Engine cluster instances.

Create encryption keys locally. Upload your encryption keys to Cloud Key Management Service. Use those keys to encrypt your data in all of the Compute Engine cluster instances.

Create encryption keys in Cloud Key Management Service. Reference those keys in your API service calls when accessing the data in your Compute Engine cluster instances.

Questions # 95:

You are designing a data processing pipeline. The pipeline must be able to scale automatically as load increases. Messages must be processed at least once, and must be ordered within windows of 1 hour. How should you design the solution?

Options:

Use Apache Kafka for message ingestion and use Cloud Dataproc for streaming analysis.

Use Apache Kafka for message ingestion and use Cloud Dataflow for streaming analysis.

Use Cloud Pub/Sub for message ingestion and Cloud Dataproc for streaming analysis.

Use Cloud Pub/Sub for message ingestion and Cloud Dataflow for streaming analysis.

Questions # 96:

You have created an external table for Apache Hive partitioned data that resides in a Cloud Storage bucket, which contains a large number of files. You notice that queries against this table are slow. You want to improve the performance of these queries. What should you do?

Options:

Migrate the Hive partitioned data objects to a multi-region Cloud Storage bucket.

Create an individual external table for each Hive partition by using a common table name prefix. Use wildcard table queries to reference the partitioned data.

Change the storage class of the Hive partitioned data objects from Coldline to Standard.

Upgrade the external table to a BigLake table. Enable metadata caching for the table.

Questions # 97:

You have uploaded 5 years of log data to Cloud Storage. A user reported that some data points in the log data are outside of their expected ranges, which indicates errors. You need to address this issue and be able to run the process again in the future while keeping the original data for compliance reasons. What should you do?

Options:

Import the data from Cloud Storage into BigQuery. Create a new BigQuery table, and skip the rows with errors.

Create a Compute Engine instance and create a new copy of the data in Cloud Storage. Skip the rows with errors.

Create a Cloud Dataflow workflow that reads the data from Cloud Storage, checks for values outside the expected range, sets the value to an appropriate default, and writes the updated records to a new dataset in Cloud Storage.

Questions # 98:

You are preparing an organization-wide dataset. You need to preprocess customer data stored in a restricted bucket in Cloud Storage. The data will be used to create consumer analyses. You need to follow data privacy requirements, including protecting certain sensitive data elements, while also retaining all of the data for potential future use cases. What should you do?

Options:

Use Dataflow and the Cloud Data Loss Prevention API to mask sensitive data. Write the processed data in BigQuery.

Use the Cloud Data Loss Prevention API and Dataflow to detect and remove sensitive fields from the data in Cloud Storage. Write the filtered data in BigQuery.

Use Dataflow and Cloud KMS to encrypt sensitive fields and write the encrypted data in BigQuery. Share the encryption key by following the principle of least privilege.

Use customer-managed encryption keys (CMEK) to directly encrypt the data in Cloud Storage. Use federated queries from BigQuery. Share the encryption key by following the principle of least privilege.

Expert Solution

Answer

Explanation

The core requirements are to protect sensitive data elements (data privacy) while retaining all data for potential future use, and then to use this preprocessed data for consumer analyses.

Retaining All Data: This immediately makes option B (remove sensitive fields) unsuitable because it involves data loss.

Protecting Sensitive Data for Analysis & Future Use: Masking is a de-identification technique that redacts or replaces sensitive data with a substitute, allowing the data structure and usability for analysis to be maintained without exposing the original sensitive values. This aligns with protecting data while still making it usable.

Cloud Data Loss Prevention (DLP) API: This service is specifically designed to discover, classify, and protect sensitive data. It offers various de-identification techniques, including masking.

Dataflow: This is a serverless, fast, and cost-effective service for unified stream and batch data processing. It's well-suited for transforming large datasets, such as those read from Cloud Storage, and can integrate with the DLP API for de-identification.

Writing to BigQuery: BigQuery is an ideal destination for an organization-wide dataset for consumer analyses.

Therefore, using Dataflow to read the data from Cloud Storage, leveraging the Cloud DLP API to mask (a form of de-identification) the sensitive elements, and then writing the processed (masked) data to BigQuery is the most appropriate solution. This approach protects privacy for the consumer analyses dataset while the original, unaltered data can still be retained in the restricted Cloud Storage bucket for future use cases that might require access to the original sensitive information (under strict governance).
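To make the masking idea concrete, here is a minimal sketch in plain Python. It does not call the Cloud DLP API or Dataflow; the `mask_record` helper, the field names, and the sample record are all hypothetical and only illustrate how character masking keeps the record's shape while hiding the sensitive value, and how the source record stays unchanged.

```python
def mask_record(record, sensitive_fields, masking_char="*"):
    """Return a copy of `record` with the named fields character-masked.

    Hypothetical helper for illustration only: a real pipeline would
    delegate detection and masking to the Cloud DLP API.
    The input record is not modified, mirroring how the original data
    stays untouched in the restricted Cloud Storage bucket.
    """
    masked = dict(record)  # shallow copy so the source record is preserved
    for field in sensitive_fields:
        if masked.get(field) is not None:
            # Replace every character with the masking symbol, keeping length.
            masked[field] = masking_char * len(str(masked[field]))
    return masked


# Hypothetical sample record; field names are assumptions.
raw = {"customer_id": "c-001", "email": "ann@example.com", "basket_total": 42.5}
masked = mask_record(raw, sensitive_fields=["email"])
print(masked)
```

The masked copy keeps every field, so downstream analysis sees the same schema, whereas removing the field (option B) would change the schema and discard information from the analysis dataset.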

Let's analyze why other options are less suitable:

Option B: "Remove sensitive fields" means data loss, which contradicts the requirement to retain all data for potential future use cases.

Option C: Encrypting sensitive fields with Cloud KMS and writing them to BigQuery is a valid way to protect data. However, for "consumer analyses," masked data is generally more directly usable than encrypted data. Analysts would typically work with de-identified (e.g., masked) data rather than directly querying encrypted fields and managing decryption keys for analytical purposes. While decryption is possible, masking often provides a better balance of privacy and utility for broad analysis. The question also implies creating a dataset for analysis, where masking makes the data ready-to-use for that purpose. The original data remains in Cloud Storage.

Option D: Using CMEK encrypts the entire object in Cloud Storage at rest. While this protects the data in Cloud Storage, federated queries from BigQuery would access the raw, unmasked data (assuming decryption occurs seamlessly). This doesn't address the preprocessing requirement of protecting certain sensitive data elements within the data itself for the consumer analyses dataset. The goal is to create a de-identified dataset for analysis, not just secure the raw data at rest.

Reference:

Google Cloud Documentation: Cloud Data Loss Prevention > De-identification overview. "De-identification is the process of removing identifying information from data. Cloud DLP uses de-identification techniques such as masking, tokenization, pseudonymization, date shifting, and more to help you protect sensitive data."

Google Cloud Documentation: Cloud Data Loss Prevention > Basic de-identification > Masking. "Masking hides parts of data by replacing characters with a symbol, such as an asterisk (*) or hash (#)."

Google Cloud Documentation: Dataflow > Overview. "Dataflow is a fully managed streaming analytics service that minimizes latency, processing time, and cost through autoscaling and batch processing."

Google Cloud Solution: Automating the de-identification of PII in large-scale datasets using Cloud DLP and Dataflow. This solution guide explicitly outlines using Dataflow and the DLP API for de-identifying (including masking) data from Cloud Storage and loading it into BigQuery. "You can use Cloud DLP to scan data for sensitive elements and then apply de-identification techniques such as redaction, masking, or tokenization." and "This tutorial uses Dataflow to orchestrate the de-identification process."

Questions # 99:

You are using BigQuery and Data Studio to design a customer-facing dashboard that displays large quantities of aggregated data. You expect a high volume of concurrent users. You need to optimize the dashboard to provide quick visualizations with minimal latency. What should you do?

Options:

Use BigQuery BI Engine with materialized views.

Use BigQuery BI Engine with streaming data.

Use BigQuery BI Engine with authorized views.

Use BigQuery BI Engine with logical views.

Questions # 100:

You are running a Dataflow streaming pipeline, with Streaming Engine and Horizontal Autoscaling enabled. You have set the maximum number of workers to 1000. The input of your pipeline is Pub/Sub messages with notifications from Cloud Storage. One of the pipeline transforms reads CSV files and emits an element for every CSV line. The job performance is low, the pipeline is using only 10 workers, and you notice that the autoscaler is not spinning up additional workers. What should you do to improve performance?

Options:

Use Dataflow Prime, and enable Right Fitting to increase the worker resources.

Update the job to increase the maximum number of workers.

Enable Vertical Autoscaling to let the pipeline use larger workers.

Change the pipeline code, and introduce a Reshuffle step to prevent fusion.
