Pass the Amazon Web Services AWS Certified Data Engineer Data-Engineer-Associate Questions and answers with ValidTests

Questions # 11:

A company currently uses a provisioned Amazon EMR cluster that includes general purpose Amazon EC2 instances. The EMR cluster uses EMR managed scaling between one and five task nodes for the company's long-running Apache Spark extract, transform, and load (ETL) job. The company runs the ETL job every day.

When the company runs the ETL job, the EMR cluster quickly scales up to five nodes. The EMR cluster often reaches maximum CPU usage, but the memory usage remains under 30%.

The company wants to modify the EMR cluster configuration to reduce the EMR costs to run the daily ETL job.

Which solution will meet these requirements MOST cost-effectively?

Options:

Increase the maximum number of task nodes for EMR managed scaling to 10.

Change the task node type from general purpose EC2 instances to memory optimized EC2 instances.

Switch the task node type from general purpose EC2 instances to compute optimized EC2 instances.

Reduce the scaling cooldown period for the provisioned EMR cluster.

Expert Solution

Questions # 12:

A company is using an AWS Transfer Family server to migrate data from an on-premises environment to AWS. Company policy mandates the use of TLS 1.2 or above to encrypt the data in transit.

Which solution will meet these requirements?

Options:

Generate new SSH keys for the Transfer Family server. Make the old keys and the new keys available for use.

Update the security group rules for the on-premises network to allow only connections that use TLS 1.2 or above.

Update the security policy of the Transfer Family server to specify a minimum protocol version of TLS 1.2.

Install an SSL certificate on the Transfer Family server to encrypt data transfers by using TLS 1.2.

Expert Solution

Questions # 13:

A retail company has a customer data hub in an Amazon S3 bucket. Employees from many countries use the data hub to support company-wide analytics. A governance team must ensure that the company's data analysts can access data only for customers who are within the same country as the analysts.

Which solution will meet these requirements with the LEAST operational effort?

Options:

Create a separate table for each country's customer data. Provide access to each analyst based on the country that the analyst serves.

Register the S3 bucket as a data lake location in AWS Lake Formation. Use the Lake Formation row-level security features to enforce the company's access policies.

Move the data to AWS Regions that are close to the countries where the customers are. Provide access to each analyst based on the country that the analyst serves.

Load the data into Amazon Redshift. Create a view for each country. Create separate IAM roles for each country to provide access to data from each country. Assign the appropriate roles to the analysts.

Expert Solution

Questions # 14:

A company is migrating on-premises workloads to AWS. The company wants to reduce overall operational overhead. The company also wants to explore serverless options.

The company's current workloads use Apache Pig, Apache Oozie, Apache Spark, Apache HBase, and Apache Flink. The on-premises workloads process petabytes of data in seconds. The company must maintain similar or better performance after the migration to AWS.

Which extract, transform, and load (ETL) service will meet these requirements?

Options:

AWS Glue

Amazon EMR

AWS Lambda

Amazon Redshift

Expert Solution

Questions # 15:

A gaming company uses Amazon Kinesis Data Streams to collect clickstream data. The company uses Amazon Kinesis Data Firehose delivery streams to store the data in JSON format in Amazon S3. Data scientists at the company use Amazon Athena to query the most recent data to obtain business insights.

The company wants to reduce Athena costs but does not want to recreate the data pipeline.

Which solution will meet these requirements with the LEAST management effort?

Options:

A.
Change the Firehose output format to Apache Parquet. Provide a custom S3 object YYYYMMDD prefix expression and specify a large buffer size. For the existing data, create an AWS Glue extract, transform, and load (ETL) job. Configure the ETL job to combine small JSON files, convert the JSON files to large Parquet files, and add the YYYYMMDD prefix. Use the ALTER TABLE ADD PARTITION statement to reflect the partition on the existing Athena table.

B.
Create an Apache Spark job that combines JSON files and converts the JSON files to Apache Parquet files. Launch an Amazon EMR ephemeral cluster every day to run the Spark job to create new Parquet files in a different S3 location. Use the ALTER TABLE SET LOCATION statement to reflect the new S3 location on the existing Athena table.

C.
Create a Kinesis data stream as a delivery destination for Firehose. Use Amazon Managed Service for Apache Flink (previously known as Amazon Kinesis Data Analytics) to run Apache Flink on the Kinesis data stream. Use Flink to aggregate the data and save the data to Amazon S3 in Apache Parquet format with a custom S3 object YYYYMMDD prefix. Use the ALTER TABLE ADD PARTITION statement to reflect the partition on the existing Athena table.

D.
Integrate an AWS Lambda function with Firehose to convert source records to Apache Parquet and write them to Amazon S3. In parallel, run an AWS Glue extract, transform, and load (ETL) job to combine the JSON files and convert the JSON files to large Parquet files. Create a custom S3 object YYYYMMDD prefix. Use the ALTER TABLE ADD PARTITION statement to reflect the partition on the existing Athena table.

Expert Solution

Answer

A

Explanation

Step 1: Understanding the Problem
The company collects clickstream data via Amazon Kinesis Data Streams and stores it in JSON format in Amazon S3 using Kinesis Data Firehose. They use Amazon Athena to query the data, but they want to reduce Athena costs while maintaining the same data pipeline.
Since Athena charges based on the amount of data scanned during queries, reducing the data size (by converting JSON to a more efficient format like Apache Parquet) is a key solution to lowering costs.
Step 2: Why Option A is Correct
Option A provides a straightforward way to reduce costs with minimal management overhead:
Changing the Firehose output format to Parquet: Parquet is a columnar data format, which is more compact and efficient than JSON for Athena queries. It significantly reduces the amount of data scanned, which in turn reduces Athena query costs.
Custom S3 Object Prefix (YYYYMMDD): Adding a date-based prefix helps in partitioning the data, which further improves query efficiency in Athena by limiting the data scanned to only relevant partitions.
AWS Glue ETL Job for Existing Data: To handle existing data stored in JSON format, a one-time AWS Glue ETL job can combine small JSON files, convert them to Parquet, and apply the YYYYMMDD prefix. This ensures consistency in the S3 bucket structure and allows Athena to efficiently query historical data.
ALTER TABLE ADD PARTITION: This command updates Athena's table metadata to reflect the new partitions, ensuring that future queries target only the required data.
Step 3: Why Other Options Are Not Ideal
Option B (Apache Spark on EMR) introduces higher management effort by requiring the setup of Apache Spark jobs and an Amazon EMR cluster. While it achieves the goal of converting JSON to Parquet, it involves running and maintaining an EMR cluster, which adds operational complexity.
Option C (Kinesis and Apache Flink) is a more complex solution involving Apache Flink, which adds a real-time streaming layer to aggregate data. Although Flink is a powerful tool for stream processing, it adds unnecessary overhead in this scenario since the company already uses Kinesis Data Firehose for batch delivery to S3.
Option D (AWS Lambda with Firehose) suggests using AWS Lambda to convert records in real time. While Lambda can work in some cases, it's generally not the best tool for handling large-scale data transformations like JSON-to-Parquet conversion due to potential scaling and invocation limitations. Additionally, running parallel Glue jobs further complicates the setup.
Step 4: How Option A Minimizes Costs
By using Apache Parquet, Athena queries become more efficient, as Athena will scan significantly less data, directly reducing query costs.
Firehose natively supports Parquet as an output format, so enabling this conversion in Firehose requires minimal effort. Once set, new data will automatically be stored in Parquet format in S3, without requiring any custom coding or ongoing management.
The AWS Glue ETL job for historical data ensures that existing JSON files are also converted to Parquet format, ensuring consistency across the data stored in S3.
Conclusion:
Option A meets the requirement to reduce Athena costs without recreating the data pipeline, using Firehose’s native support for Apache Parquet and a simple one-time AWS Glue ETL job for existing data. This approach involves minimal management effort compared to the other solutions.
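As a rough illustration of the partition step in this explanation, the sketch below builds an Athena ALTER TABLE ADD PARTITION statement for one day of data stored under a YYYYMMDD prefix. The table name, bucket name, and the partition column name dt are hypothetical placeholders, not values from the question; the function only formats the statement and does not submit it to Athena.

```python
from datetime import date


def add_partition_statement(table: str, bucket: str, day: date,
                            partition_key: str = "dt") -> str:
    """Build an ALTER TABLE ADD PARTITION statement for a single day.

    Assumes objects for each day live under s3://<bucket>/<YYYYMMDD>/ and
    that the table is partitioned by one string column (hypothetical: dt).
    """
    stamp = day.strftime("%Y%m%d")  # YYYYMMDD, matching the S3 prefix
    location = f"s3://{bucket}/{stamp}/"
    return (
        f"ALTER TABLE {table} ADD IF NOT EXISTS "
        f"PARTITION ({partition_key} = '{stamp}') "
        f"LOCATION '{location}'"
    )


# Hypothetical table and bucket names, for illustration only.
statement = add_partition_statement(
    "clickstream", "example-clickstream-bucket", date(2024, 1, 15)
)
print(statement)
```

A query that filters on the partition column can then be limited to the matching prefix, which is where the reduction in scanned data comes from.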

Questions # 16:

A company hosts its applications on Amazon EC2 instances. The company must use SSL/TLS connections that encrypt data in transit to communicate securely with AWS infrastructure that is managed by a customer.
A data engineer needs to implement a solution to simplify the generation, distribution, and rotation of digital certificates. The solution must automatically renew and deploy SSL/TLS certificates.
Which solution will meet these requirements with the LEAST operational overhead?

Options:

A.
Store self-managed certificates on the EC2 instances.

B.
Use AWS Certificate Manager (ACM).

C.
Implement custom automation scripts in AWS Secrets Manager.

D.
Use Amazon Elastic Container Service (Amazon ECS) Service Connect.

Expert Solution

Answer

B

Explanation

The best solution for managing SSL/TLS certificates on EC2 instances with minimal operational overhead is to use AWS Certificate Manager (ACM). ACM simplifies certificate management by automating the provisioning, renewal, and deployment of certificates.
AWS Certificate Manager (ACM):
ACM manages SSL/TLS certificates for EC2 and other AWS resources, including automatic certificate renewal. This reduces the need for manual management and avoids operational complexity.
ACM also integrates with other AWS services to simplify secure connections between AWS infrastructure and customer-managed environments.
Alternatives Considered:
A (Self-managed certificates): Managing certificates manually on EC2 instances increases operational overhead and lacks automatic renewal.
C (Secrets Manager automation): While Secrets Manager can store keys and certificates, it requires custom automation for rotation and does not handle SSL/TLS certificates directly.
D (ECS Service Connect): This is unrelated to SSL/TLS certificate management and would not address the operational need.
Reference: AWS Certificate Manager Documentation

Questions # 17:

A company has a data warehouse in Amazon Redshift. To comply with security regulations, the company needs to log and store all user activities and connection activities for the data warehouse.
Which solution will meet these requirements?

Options:

A.
Create an Amazon S3 bucket. Enable logging for the Amazon Redshift cluster. Specify the S3 bucket in the logging configuration to store the logs.

B.
Create an Amazon Elastic File System (Amazon EFS) file system. Enable logging for the Amazon Redshift cluster. Write logs to the EFS file system.

C.
Create an Amazon Aurora MySQL database. Enable logging for the Amazon Redshift cluster. Write the logs to a table in the Aurora MySQL database.

D.
Create an Amazon Elastic Block Store (Amazon EBS) volume. Enable logging for the Amazon Redshift cluster. Write the logs to the EBS volume.

Expert Solution

Answer

A

Explanation

Problem Analysis:
The company must log all user activities and connection activities in Amazon Redshift for security compliance.
Key Considerations:
Redshift supports audit logging, which can be configured to write logs to an S3 bucket.
S3 provides durable, scalable, and cost-effective storage for logs.
Solution Analysis:
Option A: S3 for Logging
Standard approach for storing Redshift logs.
Easy to set up and manage with minimal cost.
Option B: Amazon EFS
EFS is unnecessary for this use case and less cost-efficient than S3.
Option C: Aurora MySQL
Using a database to store logs increases complexity and cost.
Option D: EBS Volume
EBS is not a scalable option for log storage compared to S3.
Final Recommendation:
Enable Redshift audit logging and specify an S3 bucket as the destination.
References: Amazon Redshift Audit Logging, Storing Logs in Amazon S3
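The recommendation above could be scripted roughly as follows. This is a minimal sketch, not a definitive setup: the cluster, bucket, and parameter group names are hypothetical, and RecordingClient is a stand-in written for this example so the code runs without AWS credentials. In real use, a client from boto3.client("redshift") would be passed in instead. The two calls, enable_logging and modify_cluster_parameter_group, are boto3 Redshift operations; the second one is included because the user activity log also depends on the enable_user_activity_logging parameter being set to true.

```python
class RecordingClient:
    """Stand-in for a boto3 Redshift client; records calls instead of sending them."""

    def __init__(self):
        self.calls = []

    def __getattr__(self, name):
        # Any unknown attribute is treated as an API operation and recorded.
        def method(**kwargs):
            self.calls.append((name, kwargs))
            return {}
        return method


def enable_audit_logging(redshift, cluster_id, bucket, parameter_group,
                         prefix="redshift-audit/"):
    """Turn on audit logging to S3 and enable the user activity log."""
    # Deliver audit logs (connection log, user log) to the S3 bucket.
    redshift.enable_logging(
        ClusterIdentifier=cluster_id,
        BucketName=bucket,
        S3KeyPrefix=prefix,
    )
    # The user activity log additionally requires this cluster parameter.
    redshift.modify_cluster_parameter_group(
        ParameterGroupName=parameter_group,
        Parameters=[
            {"ParameterName": "enable_user_activity_logging",
             "ParameterValue": "true"}
        ],
    )


# Hypothetical resource names, for illustration only.
client = RecordingClient()
enable_audit_logging(client, "example-cluster", "example-audit-bucket",
                     "example-parameter-group")
for name, kwargs in client.calls:
    print(name, kwargs)
```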

Questions # 18:

An ecommerce company processes millions of orders each day. The company uses AWS Glue ETL to collect data from multiple sources, clean the data, and store the data in an Amazon S3 bucket in CSV format by using the S3 Standard storage class. The company uses the stored data to conduct daily analysis.
The company wants to optimize costs for data storage and retrieval.
Which solution will meet this requirement?

Options:

A.
Transition the data to Amazon S3 Glacier Flexible Retrieval.

B.
Transition the data from Amazon S3 to an Amazon Aurora cluster.

C.
Configure AWS Glue ETL to transform the incoming data to Apache Parquet format.

D.
Configure AWS Glue ETL to use Amazon EMR to process incoming data in parallel.

Expert Solution

Answer

C

Explanation

Apache Parquet is a columnar storage format that is much more space-efficient than row-based formats like CSV, especially for analytics workloads. Transforming data from CSV to Parquet significantly reduces storage costs and improves query performance. According to the study guide:
“Parquet is a columnar storage file format that is optimized for use with analytics workloads, providing efficient storage and fast query performance.”
–Ace the AWS Certified Data Engineer - Associate Certification - version 2 - apple.pdf
By switching to Parquet, the company can reduce both storage size and retrieval times, making it the optimal choice for cost-effective data analysis.
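To make the columnar argument concrete, here is a toy sketch that compares how many bytes of field data a reader would touch to answer a query on one column under a row-oriented layout versus a column-oriented one. The sample records and function names are invented for this example, and the model deliberately ignores compression, encoding, and file metadata, all of which matter in real Parquet files.

```python
# Invented sample records; every value is kept as text, as it would be in CSV.
orders = [
    {"order_id": "1001", "customer": "alice", "country": "DE", "total": "19.99"},
    {"order_id": "1002", "customer": "bob", "country": "US", "total": "5.00"},
    {"order_id": "1003", "customer": "carol", "country": "US", "total": "42.50"},
]


def bytes_read_row_layout(rows):
    """Row layout: whole records are stored together, so every field is read."""
    return sum(len(value) for row in rows for value in row.values())


def bytes_read_column_layout(rows, column):
    """Column layout: values of one column are stored together and read alone."""
    return sum(len(row[column]) for row in rows)


row_bytes = bytes_read_row_layout(orders)
col_bytes = bytes_read_column_layout(orders, "total")
print(f"row layout: {row_bytes} bytes, column layout: {col_bytes} bytes")
```

The gap grows with the number of columns a query does not need, which is why wide analytics tables tend to benefit most from a columnar format.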

Questions # 19:

A data engineer has two datasets that contain sales information for multiple cities and states. One dataset is named reference, and the other dataset is named primary.
The data engineer needs a solution to determine whether a specific set of values in the city and state columns of the primary dataset exactly match the same specific values in the reference dataset. The data engineer wants to use Data Quality Definition Language (DQDL) rules in an AWS Glue Data Quality job.
Which rule will meet these requirements?

Options:

A.
DatasetMatch "reference" "city->ref_city, state->ref_state" = 1.0

B.
ReferentialIntegrity "city,state" "reference.{ref_city,ref_state}" = 1.0

C.
DatasetMatch "reference" "city->ref_city, state->ref_state" = 100

D.
ReferentialIntegrity "city,state" "reference.{ref_city,ref_state}" = 100

Expert Solution

Answer

A

Explanation

The DatasetMatch rule in DQDL checks for full value equivalence between mapped fields. A value of 1.0 indicates a 100% match. The correct syntax and metric for an exact match scenario are:
“Use DatasetMatch when comparing mapped fields between two datasets. The comparison score of 1.0 confirms a perfect match.”
–Ace the AWS Certified Data Engineer - Associate Certification - version 2 - apple.pdf
Options with “100” use incorrect syntax since DQDL uses floating-point scores (e.g., 1.0, 0.95), not percentages.

Questions # 20:

A company stores customer records in Amazon S3. The company must not delete or modify the customer record data for 7 years after each record is created. The root user also must not have the ability to delete or modify the data.
A data engineer wants to use S3 Object Lock to secure the data.
Which solution will meet these requirements?

Options:

A.
Enable governance mode on the S3 bucket. Use a default retention period of 7 years.

B.
Enable compliance mode on the S3 bucket. Use a default retention period of 7 years.

C.
Place a legal hold on individual objects in the S3 bucket. Set the retention period to 7 years.

D.
Set the retention period for individual objects in the S3 bucket to 7 years.

Expert Solution

Answer

B

Explanation

The company wants to ensure that no customer records are deleted or modified for 7 years, and even the root user should not have the ability to change the data. S3 Object Lock in Compliance Mode is the correct solution for this scenario.
Option B: Enable compliance mode on the S3 bucket. Use a default retention period of 7 years. In Compliance Mode, even the root user cannot delete or modify locked objects during the retention period. This ensures that the data is protected for the entire 7-year duration as required. Compliance mode is stricter than governance mode and prevents all forms of alteration, even by privileged users.
Option A (Governance Mode) still allows certain privileged users (like the root user) to bypass the lock, which does not meet the company's requirement.
Option C (legal hold) and Option D (setting retention per object) do not fully address the requirement to block root user modifications.
Reference: Amazon S3 Object Lock Documentation
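A default compliance-mode retention of 7 years could be applied along these lines. This is a sketch under stated assumptions: the bucket name is hypothetical, the bucket is assumed to already have Object Lock enabled, and RecordingClient is a stand-in written for this example so the code runs without AWS access. In real use, a client from boto3.client("s3") would be passed in; put_object_lock_configuration is the boto3 S3 operation used here.

```python
class RecordingClient:
    """Stand-in for a boto3 S3 client; records calls instead of sending them."""

    def __init__(self):
        self.calls = []

    def __getattr__(self, name):
        # Any unknown attribute is treated as an API operation and recorded.
        def method(**kwargs):
            self.calls.append((name, kwargs))
            return {}
        return method


def set_compliance_default_retention(s3, bucket, years=7):
    """Set a bucket-level default retention rule in COMPLIANCE mode."""
    return s3.put_object_lock_configuration(
        Bucket=bucket,
        ObjectLockConfiguration={
            "ObjectLockEnabled": "Enabled",
            "Rule": {
                # COMPLIANCE (not GOVERNANCE) is the mode the answer relies on.
                "DefaultRetention": {"Mode": "COMPLIANCE", "Years": years}
            },
        },
    )


# Hypothetical bucket name, for illustration only.
client = RecordingClient()
set_compliance_default_retention(client, "example-customer-records")
operation, arguments = client.calls[0]
print(operation, arguments)
```

The default rule applies to objects written after it is set, so each new record starts its own 7-year retention period at creation time.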
