A data scientist is attempting to identify sentences that are conceptually similar to each other within a set of text files. Which of the following is the best way to prepare the data set to accomplish this task after data ingestion?
→ Embeddings (e.g., word2vec for individual words, sentence transformers for whole sentences) are dense vector representations of text in which items with similar meaning lie close together in a high-dimensional space. Comparing sentence vectors, for example with cosine similarity, measures how conceptually close two sentences are, which is essential for tasks like semantic similarity or clustering.
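To make the comparison step concrete, here is a minimal sketch of ranking sentences by cosine similarity. The three-dimensional vectors in `toy_embeddings` are hand-picked, hypothetical values, not the output of any real model; in practice an embedding model would produce them, typically with hundreds of dimensions. The helper names `cosine_similarity` and `most_similar` are also just illustrative.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hypothetical embeddings, made up for illustration only.
# A real embedding model would generate these vectors from the text.
toy_embeddings = {
    "The cat sat on the mat.": [0.9, 0.1, 0.0],
    "A kitten rested on the rug.": [0.8, 0.2, 0.1],
    "Interest rates rose sharply.": [0.0, 0.1, 0.9],
}

def most_similar(query, embeddings):
    """Return (sentence, score) for the sentence closest to `query`."""
    query_vec = embeddings[query]
    candidates = [
        (sentence, cosine_similarity(query_vec, vec))
        for sentence, vec in embeddings.items()
        if sentence != query
    ]
    return max(candidates, key=lambda pair: pair[1])

best_sentence, best_score = most_similar("The cat sat on the mat.", toy_embeddings)
print(best_sentence, round(best_score, 3))
```

With these made-up vectors, the two pet sentences score close to 1 against each other while the finance sentence scores near 0, even though the pet sentences share almost no words.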
Why the other options are incorrect:
B: Extrapolation predicts values beyond a dataset’s observed range, which is not relevant to comparing text.
C: Sampling reduces data volume but doesn’t help measure similarity between sentences.
D: One-hot encoding records which words are present but treats every word as unrelated to every other word, so it carries no semantic information.
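The limitation of option D can be sketched in a few lines. This example assumes a simple whitespace tokenizer and a binary presence vector per sentence; the function names and the two sample phrases are hypothetical, chosen only to show that synonyms get no credit.

```python
import math

def one_hot_sentence(sentence, vocabulary):
    """Binary vector: 1 if the vocabulary word occurs in the sentence, else 0."""
    words = set(sentence.lower().split())
    return [1 if word in words else 0 for word in vocabulary]

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Two phrases with similar meaning but no words in common (illustrative data).
phrase_a = "cat sleeps"
phrase_b = "kitten naps"

vocabulary = sorted(set(phrase_a.split()) | set(phrase_b.split()))
vec_a = one_hot_sentence(phrase_a, vocabulary)
vec_b = one_hot_sentence(phrase_b, vocabulary)

similarity = cosine_similarity(vec_a, vec_b)
print(vocabulary, vec_a, vec_b, similarity)
```

Because the two phrases share no tokens, their vectors are orthogonal and the similarity is exactly zero, whereas an embedding model would be expected to place them close together.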
Official References:
CompTIA DataX (DY0-001) Study Guide – Section 6.3: “Embeddings transform text into numeric vectors, enabling similarity computation and semantic analysis.”