2023 100% Free Associate-Developer-Apache-Spark Daily Practice Exam With 179 Questions [Q64-Q86]




What does the Databricks Associate Developer Apache Spark Exam cost?

The cost of the Databricks Associate Developer Apache Spark Exam is 200 USD per attempt.


Why should you take the Databricks Associate Developer Apache Spark Exam?

If you are a developer who is interested in learning more about Spark and Big Data technologies, then you should definitely consider taking the Databricks Associate Developer Apache Spark Exam. This exam will help you learn how to use the technologies that are being used in the real world. Databricks Associate Developer Apache Spark exam dumps are the best way to prepare for this exam.

Modern businesses need to be agile to adapt to a fast-paced environment. With the advent of big data, cloud computing, and the Internet of Things, enterprises face the challenge of managing, processing, analyzing, and integrating vast amounts of data. These challenges require new skills and a new approach to problem solving. Apache Spark is a high-performance analytics engine for processing and analyzing large datasets. The Databricks Associate Developer Apache Spark Exam covers the skills required to build data-driven applications using Apache Spark.

 

NEW QUESTION 64
The code block displayed below contains an error. The code block should return a DataFrame in which column predErrorAdded contains the results of Python function add_2_if_geq_3 as applied to numeric and nullable column predError in DataFrame transactionsDf. Find the error.
Code block:
def add_2_if_geq_3(x):
    if x is None:
        return x
    elif x >= 3:
        return x+2
    return x

add_2_if_geq_3_udf = udf(add_2_if_geq_3)

transactionsDf.withColumnRenamed("predErrorAdded", add_2_if_geq_3_udf(col("predError")))

  • A. Instead of col("predError"), the actual DataFrame with the column needs to be passed, like so transactionsDf.predError.
  • B. The udf() method does not declare a return type.
  • C. UDFs are only available through the SQL API, but not in the Python API as shown in the code block.
  • D. The Python function is unable to handle null values, resulting in the code block crashing on execution.
  • E. The operator used for adding the column does not add column predErrorAdded to the DataFrame.

Answer: E

Explanation:
Correct code block:
def add_2_if_geq_3(x):
    if x is None:
        return x
    elif x >= 3:
        return x+2
    return x

add_2_if_geq_3_udf = udf(add_2_if_geq_3)

transactionsDf.withColumn("predErrorAdded", add_2_if_geq_3_udf(col("predError"))).show()
Instead of withColumnRenamed, you should use the withColumn operator.
The udf() method does not declare a return type.
It is fine that the udf() method does not declare a return type; this is not a required argument. However, the default return type is StringType, which may not be ideal for numeric, nullable data. The code will nevertheless run without a specified return type.
The Python function is unable to handle null values, resulting in the code block crashing on execution.
The Python function is able to handle null values: this is what the statement if x is None does.
UDFs are only available through the SQL API, but not in the Python API as shown in the code block.
No, they are available through the Python API. The code in the code block that concerns UDFs is correct.
Instead of col("predError"), the actual DataFrame with the column needs to be passed, like so transactionsDf.predError.
You may choose to use the transactionsDf.predError syntax, but the col("predError") syntax is fine.
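The null handling can be checked without a Spark session. The following is a minimal sketch in plain Python, assuming the function is called once per row value, with None standing in for a null in column predError. The sample values are made up for illustration.

```python
def add_2_if_geq_3(x):
    # None stands in for a null value; it is passed through unchanged.
    if x is None:
        return x
    elif x >= 3:
        return x + 2
    return x


# Illustrative inputs, not taken from transactionsDf.
results = [add_2_if_geq_3(v) for v in [None, 2, 3, 6]]
print(results)  # [None, 2, 5, 8]
```

Without the None check, the comparison None >= 3 would raise a TypeError in Python 3, which is why the first branch matters for a nullable column.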

 

NEW QUESTION 65
Which of the following code blocks returns a one-column DataFrame of all values in column supplier of DataFrame itemsDf that do not contain the letter X? In the DataFrame, every value should only be listed once.
Sample of DataFrame itemsDf:
+------+--------------------+--------------------+-------------------+
|itemId|            itemName|          attributes|           supplier|
+------+--------------------+--------------------+-------------------+
|     1|Thick Coat for Wa...|[blue, winter, cozy]|Sports Company Inc.|
|     2|Elegant Outdoors ...|[red, summer, fre...|              YetiX|
|     3|   Outdoors Backpack|[green, summer, t...|Sports Company Inc.|
+------+--------------------+--------------------+-------------------+

  • A. itemsDf.filter(!col('supplier').contains('X')).select(col('supplier')).unique()
  • B. itemsDf.select(~col('supplier').contains('X')).distinct()
  • C. itemsDf.filter(not(col('supplier').contains('X'))).select('supplier').unique()
  • D. itemsDf.filter(col(supplier).not_contains('X')).select(supplier).distinct()
  • E. itemsDf.filter(~col('supplier').contains('X')).select('supplier').distinct()

Answer: E

Explanation:
Output of correct code block:
+-------------------+
|           supplier|
+-------------------+
|Sports Company Inc.|
+-------------------+
The key to this question is knowing which operator negates a condition: the ~ (not) operator. In addition, you should know that DataFrames have no unique() method; use distinct() instead.
Static notebook | Dynamic notebook: See test 1
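Why ~ works where Python's not does not can be illustrated with operator overloading. The sketch below uses a hypothetical ToyColumn class (not the PySpark Column class) that overloads ~ to build a negated expression and refuses to be converted to a plain boolean, which is roughly how an unevaluated column expression has to behave.

```python
class ToyColumn:
    """Hypothetical stand-in for a column expression; not the PySpark class."""

    def __init__(self, expr):
        self.expr = expr

    def contains(self, text):
        return ToyColumn(f"contains({self.expr}, '{text}')")

    def __invert__(self):
        # ~column builds a new, negated expression.
        return ToyColumn(f"NOT {self.expr}")

    def __bool__(self):
        # `not column` asks Python for a truth value, which an
        # unevaluated expression cannot provide.
        raise ValueError("cannot convert a column expression into a bool")


negated = ~ToyColumn("supplier").contains("X")
print(negated.expr)

try:
    not ToyColumn("supplier").contains("X")
    not_worked = True
except ValueError:
    not_worked = False
print(not_worked)
```

The ! prefix used in one of the answer options is not a negation operator in Python at all, so that option fails even earlier.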

 

NEW QUESTION 66
Which of the following is the idea behind dynamic partition pruning in Spark?

  • A. Dynamic partition pruning performs wide transformations on disk instead of in memory.
  • B. Dynamic partition pruning is intended to skip over the data you do not need in the results of a query.
  • C. Dynamic partition pruning reoptimizes query plans based on runtime statistics collected during query execution.
  • D. Dynamic partition pruning reoptimizes physical plans based on data types and broadcast variables.
  • E. Dynamic partition pruning concatenates columns of similar data types to optimize join performance.

Answer: B

Explanation:
Dynamic partition pruning reoptimizes query plans based on runtime statistics collected during query execution.
No - this is what adaptive query execution does, but not dynamic partition pruning.
Dynamic partition pruning concatenates columns of similar data types to optimize join performance.
Wrong, this answer does not make sense, especially related to dynamic partition pruning.
Dynamic partition pruning reoptimizes physical plans based on data types and broadcast variables.
It is true that dynamic partition pruning works in joins using broadcast variables. This actually happens in both the logical optimization and the physical planning stage. However, data types do not play a role for the reoptimization.
Dynamic partition pruning performs wide transformations on disk instead of in memory.
This answer does not make sense. Dynamic partition pruning is meant to accelerate Spark - performing any transformation involving disk instead of memory resources would decelerate Spark and certainly achieve the opposite effect of what dynamic partition pruning is intended for.
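The idea of skipping unneeded data can be sketched with a toy model in plain Python. This is an illustration under simplifying assumptions, not Spark's implementation: the "fact table" is a dictionary with one list of rows per partition key, and all names and values are hypothetical.

```python
# Toy "fact table": one list of rows per partition key (a date).
sales_partitions = {
    "2023-01-01": [("2023-01-01", 10)],
    "2023-01-02": [("2023-01-02", 20)],
    "2023-01-03": [("2023-01-03", 30)],
}

# Small "dimension table"; which keys pass its filter is only known at run time.
dates = [("2023-01-02", "holiday"), ("2023-01-03", "workday")]
wanted_keys = {day for day, kind in dates if kind == "holiday"}

# Pruning: read only the partitions whose key survives the dimension filter.
scanned = [key for key in sales_partitions if key in wanted_keys]
joined = [row for key in scanned for row in sales_partitions[key]]

print(scanned)
print(joined)
```

Two of the three partitions are never read, which is the effect the correct answer describes.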

 

NEW QUESTION 67
Which of the following code blocks returns a DataFrame with a single column in which all items in column attributes of DataFrame itemsDf are listed that contain the letter i?
Sample of DataFrame itemsDf:
+------+----------------------------------+-----------------------------+-------------------+
|itemId|itemName                          |attributes                   |supplier           |
+------+----------------------------------+-----------------------------+-------------------+
|1     |Thick Coat for Walking in the Snow|[blue, winter, cozy]         |Sports Company Inc.|
|2     |Elegant Outdoors Summer Dress     |[red, summer, fresh, cooling]|YetiX              |
|3     |Outdoors Backpack                 |[green, summer, travel]      |Sports Company Inc.|
+------+----------------------------------+-----------------------------+-------------------+

  • A. itemsDf.select(explode("attributes").alias("attributes_exploded")).filter(attributes_exploded.contains("i"))
  • B. itemsDf.select(explode("attributes")).filter("attributes_exploded".contains("i"))
  • C. itemsDf.select(col("attributes").explode().alias("attributes_exploded")).filter(col("attributes_exploded").co
  • D. itemsDf.select(explode("attributes").alias("attributes_exploded")).filter(col("attributes_exploded").contain
  • E. itemsDf.explode(attributes).alias("attributes_exploded").filter(col("attributes_exploded").contains("i"))

Answer: D

Explanation:
Result of correct code block:
+-------------------+
|attributes_exploded|
+-------------------+
|             winter|
|            cooling|
+-------------------+
To solve this question, you need to know about explode(). This operation helps you to split up arrays into single rows. If you did not have a chance to familiarize yourself with this method yet, find more examples in the documentation (link below).
Note that explode() is a method made available through pyspark.sql.functions - it is not available as a method of a DataFrame or a Column, as written in some of the answer options.
More info: pyspark.sql.functions.explode - PySpark 3.1.2 documentation
Static notebook | Dynamic notebook: See test 2
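What explode() followed by a filter does to the sample data can be mirrored in plain Python. This is only an analogue of the DataFrame operations, using the attributes arrays from the sample above.

```python
attributes = [
    ["blue", "winter", "cozy"],
    ["red", "summer", "fresh", "cooling"],
    ["green", "summer", "travel"],
]

# "Explode": one output row per array element.
attributes_exploded = [item for row in attributes for item in row]

# Filter: keep only the elements that contain the letter i.
with_i = [item for item in attributes_exploded if "i" in item]
print(with_i)  # ['winter', 'cooling']
```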

 

NEW QUESTION 68
Which of the following code blocks stores DataFrame itemsDf in executor memory and, if insufficient memory is available, serializes it and saves it to disk?

  • A. itemsDf.cache()
  • B. itemsDf.write.option('destination', 'memory').save()
  • C. itemsDf.store()
  • D. itemsDf.cache(StorageLevel.MEMORY_AND_DISK)
  • E. itemsDf.persist(StorageLevel.MEMORY_ONLY)

Answer: A

Explanation:
The key to solving this question is knowing (or reading in the documentation) that, by default, cache() stores values to memory and writes any partitions for which there is insufficient memory to disk. persist() can achieve the exact same behavior, however not with the StorageLevel.MEMORY_ONLY option listed here. It is also worth noting that cache() does not have any arguments.
If you have trouble finding the storage level information in the documentation, please also see this student Q&A thread that sheds some light here.
Static notebook | Dynamic notebook: See test 2

 

NEW QUESTION 69
Which of the following is a viable way to improve Spark's performance when dealing with large amounts of data, given that there is only a single application running on the cluster?

  • A. Increase values for the properties spark.default.parallelism and spark.sql.shuffle.partitions
  • B. Increase values for the properties spark.sql.parallelism and spark.sql.shuffle.partitions
  • C. Increase values for the properties spark.sql.parallelism and spark.sql.partitions
  • D. Increase values for the properties spark.dynamicAllocation.maxExecutors, spark.default.parallelism, and spark.sql.shuffle.partitions
  • E. Decrease values for the properties spark.default.parallelism and spark.sql.partitions

Answer: A

Explanation:
Decrease values for the properties spark.default.parallelism and spark.sql.partitions
No, these values need to be increased.
Increase values for the properties spark.sql.parallelism and spark.sql.partitions
Wrong, there is no property spark.sql.parallelism, and no property spark.sql.partitions either.
Increase values for the properties spark.sql.parallelism and spark.sql.shuffle.partitions
See above.
Increase values for the properties spark.dynamicAllocation.maxExecutors, spark.default.parallelism, and spark.sql.shuffle.partitions
The property spark.dynamicAllocation.maxExecutors is only in effect if dynamic allocation is enabled, using the spark.dynamicAllocation.enabled property. It is disabled by default. Dynamic allocation can be useful when running multiple applications on the same cluster in parallel. However, in this case there is only a single application running on the cluster, so enabling dynamic allocation would not yield a performance benefit.
More info: Practical Spark Tips For Data Scientists | Experfy.com and Basics of Apache Spark Configuration Settings | by Halil Ertan | Towards Data Science (https://bit.ly/3gA0A6w, https://bit.ly/2QxhNTr)

 

NEW QUESTION 70
The code block shown below should return the number of columns in the CSV file stored at location filePath.
From the CSV file, only lines should be read that do not start with a # character. Choose the answer that correctly fills the blanks in the code block to accomplish this.
Code block:
__1__(__2__.__3__.csv(filePath, __4__).__5__)

  • A. 1. DataFrame
    2. spark
    3. read()
    4. escape='#'
    5. shape[0]
  • B. 1. len
    2. pyspark
    3. DataFrameReader
    4. comment='#'
    5. columns
  • C. 1. size
    2. pyspark
    3. DataFrameReader
    4. comment='#'
    5. columns
  • D. 1. len
    2. spark
    3. read
    4. comment='#'
    5. columns
  • E. 1. size
    2. spark
    3. read()
    4. escape='#'
    5. columns

Answer: D

Explanation:
Correct code block:
len(spark.read.csv(filePath, comment='#').columns)
This is a challenging question with difficulties in an unusual context: The boundary between DataFrame and the DataFrameReader. It is unlikely that a question of this difficulty level appears in the exam. However, solving it helps you get more comfortable with the DataFrameReader, a subject you will likely have to deal with in the exam.
Before dealing with the inner parentheses, it is easier to figure out the outer parentheses, gaps 1 and 5. Given the code block, the object in gap 5 would have to be evaluated by the object in gap 1, returning the number of columns in the read-in CSV. One answer option includes DataFrame in gap 1 and shape[0] in gap 5. Since DataFrame cannot be used to evaluate shape[0], we can discard this answer option.
Other answer options include size in gap 1. size() is not a built-in Python command, so if we use it, it would have to come from somewhere else. pyspark.sql.functions includes a size() method, but this method only returns the length of an array or map stored within a column (documentation linked below).
So, using a size() method is not an option here. This leaves us with two potentially valid answers.
We have to pick between gaps 2 and 3 being spark.read or pyspark.DataFrameReader. Looking at the documentation (linked below), DataFrameReader is a class in the pyspark.sql module, which means that we cannot access it as pyspark.DataFrameReader. Moreover, spark.read makes sense because on Databricks, spark references the current Spark session (pyspark.sql.SparkSession) and spark.read therefore returns a DataFrameReader (also see documentation below). This leaves only one correct answer option.
More info:
- pyspark.sql.functions.size - PySpark 3.1.2 documentation
- pyspark.sql.DataFrameReader.csv - PySpark 3.1.2 documentation
- pyspark.sql.SparkSession.read - PySpark 3.1.2 documentation
Static notebook | Dynamic notebook: See test 3
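The logic of the correct code block (skip comment lines, then count columns) can be reproduced with Python's standard library. The sketch below is an analogue, not the Spark reader itself; the file content is hypothetical and held in a string instead of a file at filePath.

```python
import csv
import io

# Hypothetical file content; lines starting with # are comments.
raw = (
    "# exported 2023\n"
    "itemId,itemName,supplier\n"
    "1,Coat,Sports Company Inc.\n"
    "# end of file\n"
)

# Drop comment lines, similar in spirit to comment='#' in the Spark reader.
data_lines = [line for line in io.StringIO(raw) if not line.startswith("#")]
rows = list(csv.reader(data_lines))

# Counting the fields of the first row mirrors len(df.columns).
num_columns = len(rows[0])
print(num_columns)  # 3
```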

 

NEW QUESTION 71
Which of the following code blocks immediately removes the previously cached DataFrame transactionsDf from memory and disk?

  • A. transactionsDf.persist()
  • B. transactionsDf.unpersist()
  • C. transactionsDf.clearCache()
  • D. del transactionsDf
  • E. array_remove(transactionsDf, "*")

Answer: B

Explanation:
transactionsDf.unpersist()
Correct. The DataFrame.unpersist() command does exactly what the question asks for - it removes all cached parts of the DataFrame from memory and disk.
del transactionsDf
False. While this option can help remove the DataFrame from memory and disk, it does not do so immediately. The reason is that this command just notifies the Python garbage collector that the transactionsDf now may be deleted from memory. However, the garbage collector does not do so immediately and, if you wanted it to run immediately, would need to be specifically triggered to do so. Find more information linked below.
array_remove(transactionsDf, "*")
Incorrect. The array_remove method from pyspark.sql.functions is used for removing elements from arrays in columns that match a specific condition. Also, the first argument would be a column, and not a DataFrame as shown in the code block.
transactionsDf.persist()
No. This code block does exactly the opposite of what is asked for: It caches (writes) DataFrame transactionsDf to memory and disk. Note that even though you do not pass in a specific storage level here, Spark will use the default storage level (MEMORY_AND_DISK).
transactionsDf.clearCache()
Wrong. Spark's DataFrame does not have a clearCache() method.
More info: pyspark.sql.DataFrame.unpersist - PySpark 3.1.2 documentation, python - How to delete an RDD in PySpark for the purpose of releasing resources? - Stack Overflow
Static notebook | Dynamic notebook: See test 3

 

NEW QUESTION 72
Which of the following describes the most efficient way to resize a DataFrame from 16 to 8 partitions?

  • A. Use a wide transformation to reduce the number of partitions.
  • B. Use operation DataFrame.coalesce(0.5) to halve the number of partitions in the DataFrame.
  • C. Use operation DataFrame.coalesce(8) to fully shuffle the DataFrame and reduce the number of partitions.
  • D. Use operation DataFrame.repartition(8) to shuffle the DataFrame and reduce the number of partitions.
  • E. Use a narrow transformation to reduce the number of partitions.

Answer: E

Explanation:
Use a narrow transformation to reduce the number of partitions.
Correct! DataFrame.coalesce(n) is a narrow transformation, and in fact the most efficient way to resize the DataFrame of all options listed. One would run DataFrame.coalesce(8) to resize the DataFrame.
Use operation DataFrame.coalesce(8) to fully shuffle the DataFrame and reduce the number of partitions.
Wrong. The coalesce operation avoids a full shuffle, but will shuffle data if needed. This answer is incorrect because it says "fully shuffle" - this is something the coalesce operation will not do. As a general rule, it will reduce the number of partitions with the very least movement of data possible. More info: distributed computing - Spark - repartition() vs coalesce() - Stack Overflow
Use operation DataFrame.coalesce(0.5) to halve the number of partitions in the DataFrame.
Incorrect, since the num_partitions parameter needs to be an integer number defining the exact number of partitions desired after the operation. More info: pyspark.sql.DataFrame.coalesce - PySpark 3.1.2 documentation
Use operation DataFrame.repartition(8) to shuffle the DataFrame and reduce the number of partitions.
No. The repartition operation will fully shuffle the DataFrame. This is not the most efficient way of reducing the number of partitions of all listed options.
Use a wide transformation to reduce the number of partitions.
No. While possible via the DataFrame.repartition(n) command, the resulting full shuffle is not the most efficient way of reducing the number of partitions.
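The difference between merging partitions and redistributing rows can be sketched with a toy model. The function toy_coalesce below is hypothetical and groups neighbouring partitions by index; how Spark actually assigns partitions to each other is not modelled here.

```python
def toy_coalesce(partitions, n):
    """Merge whole partitions into n groups; no row is redistributed individually."""
    merged = [[] for _ in range(n)]
    for index, partition in enumerate(partitions):
        # Neighbouring partitions land in the same target partition.
        merged[index * n // len(partitions)].extend(partition)
    return merged


# 16 partitions with one illustrative row each.
sixteen = [[i] for i in range(16)]
eight = toy_coalesce(sixteen, 8)
print(len(eight))
print(eight[0], eight[7])
```

Each input partition contributes to exactly one output partition, which is the property that makes this kind of resizing cheaper than a full shuffle.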

 

NEW QUESTION 73
The code block displayed below contains an error. The code block should configure Spark so that DataFrames up to a size of 20 MB will be broadcast to all worker nodes when performing a join.
Find the error.
Code block:
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", 20)

  • A. Spark will only apply the limit to threshold joins and not to other joins.
  • B. The passed limit has the wrong variable type.
  • C. The command is evaluated lazily and needs to be followed by an action.
  • D. The correct option to write configurations is through spark.config and not spark.conf.
  • E. Spark will only broadcast DataFrames that are much smaller than the default value.

Answer: E

Explanation:
This question is hard. Let's assess the different answers one by one.
Spark will only broadcast DataFrames that are much smaller than the default value.
This is correct. The default value is 10 MB (10485760 bytes). Since the configuration for spark.sql.autoBroadcastJoinThreshold expects a number in bytes (and not megabytes), the code block sets the limit to merely 20 bytes, instead of the requested 20 * 1024 * 1024 (= 20971520) bytes.
The command is evaluated lazily and needs to be followed by an action.
No, this command is evaluated right away!
Spark will only apply the limit to threshold joins and not to other joins.
There are no "threshold joins", so this option does not make any sense.
The correct option to write configurations is through spark.config and not spark.conf.
No, it is indeed spark.conf!
The passed limit has the wrong variable type.
The configuration expects the number of bytes, a number, as an input. So, the 20 provided in the code block is fine.
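The byte arithmetic from the explanation can be verified directly. The commented-out last line sketches how the corrected setting might look, assuming a SparkSession named spark is available, as it is on Databricks.

```python
MB = 1024 * 1024

default_bytes = 10 * MB    # the default of 10 MB mentioned above
threshold_bytes = 20 * MB  # the 20 MB the question asks for

print(default_bytes)    # 10485760
print(threshold_bytes)  # 20971520

# Assumes an existing SparkSession named `spark`:
# spark.conf.set("spark.sql.autoBroadcastJoinThreshold", threshold_bytes)
```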

 

NEW QUESTION 74
The code block displayed below contains an error. The code block should count the number of rows that have a predError of either 3 or 6. Find the error.
Code block:
transactionsDf.filter(col('predError').in([3, 6])).count()

  • A. The number of rows cannot be determined with the count() operator.
  • B. Numbers 3 and 6 need to be passed as string variables.
  • C. The method used on column predError is incorrect.
  • D. Instead of a list, the values need to be passed as single arguments to the in operator.
  • E. Instead of filter, the select method should be used.

Answer: C

Explanation:
Correct code block:
transactionsDf.filter(col('predError').isin([3, 6])).count()
The isin method is the correct one to use here - the in method does not exist for the Column object.
More info: pyspark.sql.Column.isin - PySpark 3.1.2 documentation
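One detail worth adding: in is a reserved keyword in Python, so a call such as .in(...) is rejected by the parser before Spark is involved at all. The sketch below checks this with the standard library only; the helper parses() is a name introduced here for illustration, and col is never actually called.

```python
import keyword

print(keyword.iskeyword("in"))    # True
print(keyword.iskeyword("isin"))  # False


def parses(source):
    """Return True if the expression is syntactically valid Python."""
    try:
        compile(source, "<example>", "eval")
        return True
    except SyntaxError:
        return False


in_parses = parses("col('predError').in([3, 6])")
isin_parses = parses("col('predError').isin([3, 6])")
print(in_parses, isin_parses)
```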

 

NEW QUESTION 75
Which of the following code blocks sorts DataFrame transactionsDf both by column storeId in ascending and by column productId in descending order, in this priority?

  • A. transactionsDf.sort(col(storeId)).desc(col(productId))
  • B. transactionsDf.sort("storeId").sort(desc("productId"))
  • C. transactionsDf.sort("storeId", desc("productId"))
  • D. transactionsDf.order_by(col(storeId), desc(col(productId)))
  • E. transactionsDf.sort("storeId", asc("productId"))

Answer: C

Explanation:
In this question it is important to realize that you are asked to sort transactionsDf by two columns. This means that the sorting of the second column depends on the sorting of the first column.
So, any option that sorts the entire DataFrame (through chaining sort statements) will not work. The two columns need to be channeled through the same call to sort().
Also, order_by is not a valid DataFrame API method.
More info: pyspark.sql.DataFrame.sort - PySpark 3.1.2 documentation
Static notebook | Dynamic notebook: See test 2
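Sorting by two keys in one call, with the second key descending, can be mirrored in plain Python. This is an analogue of sort("storeId", desc("productId")) on a few illustrative rows, not the DataFrame API itself.

```python
# Illustrative rows; only the two sort columns are shown.
rows = [
    {"storeId": 25, "productId": 1},
    {"storeId": 2, "productId": 2},
    {"storeId": 25, "productId": 3},
    {"storeId": 3, "productId": 2},
]

# One sort with a composite key: storeId ascending, then productId descending.
ordered = sorted(rows, key=lambda r: (r["storeId"], -r["productId"]))
pairs = [(r["storeId"], r["productId"]) for r in ordered]
print(pairs)  # [(2, 2), (3, 2), (25, 3), (25, 1)]
```

The productId order only matters within rows that share a storeId, which is what "in this priority" means in the question.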

 

NEW QUESTION 76
Which of the following code blocks returns a one-column DataFrame for which every row contains an array of all integer numbers from 0 up to, but not including, the number given in column predError of DataFrame transactionsDf, and null if predError is null?
Sample of DataFrame transactionsDf:
+-------------+---------+-----+-------+---------+----+
|transactionId|predError|value|storeId|productId|   f|
+-------------+---------+-----+-------+---------+----+
|            1|        3|    4|     25|        1|null|
|            2|        6|    7|      2|        2|null|
|            3|        3| null|     25|        3|null|
|            4|     null| null|      3|        2|null|
|            5|     null| null|   null|        2|null|
|            6|        3|    2|     25|        2|null|
+-------------+---------+-----+-------+---------+----+

  • A. def count_to_target(target):
           result = list(range(target))
           return result

       count_to_target_udf = udf(count_to_target, ArrayType(IntegerType()))

       df = transactionsDf.select(count_to_target_udf('predError'))
  • B. def count_to_target(target):
           if target is None:
               return

           result = list(range(target))
           return result

       transactionsDf.select(count_to_target(col('predError')))
  • C. def count_to_target(target):
           if target is None:
               return

           result = [range(target)]
           return result

       count_to_target_udf = udf(count_to_target, ArrayType[IntegerType])

       transactionsDf.select(count_to_target_udf(col('predError')))
  • D. def count_to_target(target):
           if target is None:
               return

           result = list(range(target))
           return result

       count_to_target_udf = udf(count_to_target)

       transactionsDf.select(count_to_target_udf('predError'))
  • E. def count_to_target(target):
           if target is None:
               return

           result = list(range(target))
           return result

       count_to_target_udf = udf(count_to_target, ArrayType(IntegerType()))

       transactionsDf.select(count_to_target_udf('predError'))

Answer: E

Explanation:
Correct code block:
def count_to_target(target):
    if target is None:
        return

    result = list(range(target))
    return result

count_to_target_udf = udf(count_to_target, ArrayType(IntegerType()))

transactionsDf.select(count_to_target_udf('predError'))
Output of correct code block:
+--------------------------+
|count_to_target(predError)|
+--------------------------+
|                 [0, 1, 2]|
|        [0, 1, 2, 3, 4, 5]|
|                 [0, 1, 2]|
|                      null|
|                      null|
|                 [0, 1, 2]|
+--------------------------+
This question is not exactly easy. You need to be familiar with the syntax around UDFs (user-defined functions). Specifically, in this question it is important to pass the correct types to the udf method - returning an array of a specific type rather than just a single type means you need to think harder about type implications than usual.
Remember that in Spark, you always pass types in an instantiated way like ArrayType(IntegerType()), not like ArrayType(IntegerType). The parentheses () are the key here - make sure you do not forget those.
You should also pay attention that you actually pass the UDF count_to_target_udf, and not the Python method count_to_target to the select() operator.
Finally, null values are always a tricky case with UDFs. So, take care that the code can handle them correctly.
More info: How to Turn Python Functions into PySpark Functions (UDF) - Chang Hsin Lee - Committing my thoughts to words.
Static notebook | Dynamic notebook: See test 3
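The Python function inside the UDF can be exercised on its own. A minimal sketch in plain Python, assuming None stands in for a null in predError and using the predError values from the sample DataFrame:

```python
def count_to_target(target):
    # Nulls arrive as None; returning None yields null in the result.
    if target is None:
        return None
    return list(range(target))


# predError values from the sample DataFrame, in row order.
outputs = [count_to_target(v) for v in [3, 6, 3, None, None, 3]]
print(outputs)

# The variant [range(target)] wraps a single range object in a list
# instead of producing a list of integers.
wrapped = [range(3)]
print(len(wrapped))  # 1
```

Note that range(target) stops before target, which matches the output shown above: [0, 1, 2] for a predError of 3.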

 

NEW QUESTION 77
Which of the following code blocks reads in the parquet file stored at location filePath, given that all columns in the parquet file contain only whole numbers and are stored in the most appropriate format for this kind of data?

  • A. spark.read.schema(
           StructType([
               StructField("transactionId", IntegerType(), True),
               StructField("predError", IntegerType(), True)]
       )).format("parquet").load(filePath)
  • B. spark.read.schema([
           StructField("transactionId", NumberType(), True),
           StructField("predError", IntegerType(), True)
       ]).load(filePath)
  • C. spark.read.schema(
           StructType([
               StructField("transactionId", StringType(), True),
               StructField("predError", IntegerType(), True)]
       )).parquet(filePath)
  • D. spark.read.schema(
           StructType(
               StructField("transactionId", IntegerType(), True),
               StructField("predError", IntegerType(), True)
       )).load(filePath)
  • E. spark.read.schema([
           StructField("transactionId", IntegerType(), True),
           StructField("predError", IntegerType(), True)
       ]).load(filePath, format="parquet")

Answer: A

Explanation:
The schema passed into schema should be of type StructType or a string, so all entries in which a list is passed are incorrect.
In addition, since all numbers are whole numbers, the IntegerType() data type is the correct option here.
NumberType() is not a valid data type and StringType() would fail, since the parquet file is stored in the "most appropriate format for this kind of data", meaning that it is most likely an IntegerType, and Spark does not convert data types if a schema is provided.
Also note that StructType accepts only a single argument (a list of StructFields). So, passing multiple arguments is invalid.
Finally, Spark needs to know which format the file is in. However, all of the options listed are valid here, since Spark assumes parquet as a default when no file format is specifically passed.
More info: pyspark.sql.DataFrameReader.schema - PySpark 3.1.2 documentation and StructType - PySpark 3.1.2 documentation

 

NEW QUESTION 78
Which of the following code blocks returns a copy of DataFrame itemsDf where the column supplier has been renamed to manufacturer?

  • A. itemsDf.withColumnsRenamed("supplier", "manufacturer")
  • B. itemsDf.withColumnRenamed(col("manufacturer"), col("supplier"))
  • C. itemsDf.withColumnRenamed("supplier", "manufacturer")
  • D. itemsDf.withColumn(["supplier", "manufacturer"])
  • E. itemsDf.withColumn("supplier").alias("manufacturer")

Answer: C

Explanation:
itemsDf.withColumnRenamed("supplier", "manufacturer")
Correct! This uses the relatively trivial DataFrame method withColumnRenamed for renaming column supplier to column manufacturer.
Note that the question asks for "a copy of DataFrame itemsDf". This may be confusing if you are not familiar with Spark yet. RDDs (Resilient Distributed Datasets) are the foundation of Spark DataFrames and are immutable. As such, DataFrames are immutable, too. Any command that changes anything in the DataFrame therefore necessarily returns a copy, or a new version, of it that has the changes applied.
itemsDf.withColumnsRenamed("supplier", "manufacturer")
Incorrect. Spark's DataFrame API does not have a withColumnsRenamed() method.
itemsDf.withColumnRenamed(col("manufacturer"), col("supplier"))
No. Watch out - although the col() method works for many methods of the DataFrame API, withColumnRenamed is not one of them. As outlined in the documentation linked below, withColumnRenamed expects strings.
itemsDf.withColumn(["supplier", "manufacturer"])
Wrong. While DataFrame.withColumn() exists in Spark, it has a different purpose than renaming columns.
withColumn is typically used to add columns to DataFrames, taking the name of the new column as a first, and a Column as a second argument. Learn more via the documentation that is linked below.
itemsDf.withColumn("supplier").alias("manufacturer")
No. While DataFrame.withColumn() exists, it requires 2 arguments. Furthermore, the alias() method on DataFrames would not help the cause of renaming a column much. DataFrame.alias() can be useful in addressing the input of join statements. However, this is far outside of the scope of this question. If you are curious nevertheless, check out the link below.
More info: pyspark.sql.DataFrame.withColumnRenamed - PySpark 3.1.1 documentation, pyspark.sql.DataFrame.withColumn - PySpark 3.1.1 documentation, and pyspark.sql.DataFrame.alias - PySpark 3.1.2 documentation (https://bit.ly/3aSB5tm, https://bit.ly/2Tv4rbE, https://bit.ly/2RbhBd2)
Static notebook | Dynamic notebook: See test 1 (https://flrs.github.io/spark_practice_tests_code/#1/31.html, https://bit.ly/sparkpracticeexams_import_instructions)

 

NEW QUESTION 79
The code block shown below should return a single-column DataFrame with a column named consonant_ct that, for each row, shows the number of consonants in column itemName of DataFrame itemsDf. Choose the answer that correctly fills the blanks in the code block to accomplish this.
DataFrame itemsDf:
+------+----------------------------------+-----------------------------+-------------------+
|itemId|itemName                          |attributes                   |supplier           |
+------+----------------------------------+-----------------------------+-------------------+
|1     |Thick Coat for Walking in the Snow|[blue, winter, cozy]         |Sports Company Inc.|
|2     |Elegant Outdoors Summer Dress     |[red, summer, fresh, cooling]|YetiX              |
|3     |Outdoors Backpack                 |[green, summer, travel]      |Sports Company Inc.|
+------+----------------------------------+-----------------------------+-------------------+
Code block:
itemsDf.select(__1__(__2__(__3__(__4__), "a|e|i|o|u|\s", "")).__5__("consonant_ct"))

  • A. 1. lower
    2. regexp_replace
    3. length
    4. "itemName"
    5. alias
  • B. 1. size
    2. regexp_extract
    3. lower
    4. col("itemName")
    5. alias
  • C. 1. size
    2. regexp_replace
    3. lower
    4. "itemName"
    5. alias
  • D. 1. length
    2. regexp_extract
    3. upper
    4. col("itemName")
    5. as
  • E. 1. length
    2. regexp_replace
    3. lower
    4. col("itemName")
    5. alias

Answer: E

Explanation:
Correct code block:
itemsDf.select(length(regexp_replace(lower(col("itemName")), "a|e|i|o|u|\s", "")).alias("consonant_ct"))
Returned DataFrame:
+------------+
|consonant_ct|
+------------+
|          19|
|          16|
|          10|
+------------+
This question tries to make you think about the string functions Spark provides and in which order they should be applied. Arguably the most difficult part, the regular expression "a|e|i|o|u|\s", is not a numbered blank. However, if you are not familiar with the string functions, it may be a good idea to review those before the exam.
The size operator and the length operator can easily be confused. size works on arrays, while length works on strings. Luckily, this is something you can read up about in the documentation.
The code block works by first converting all uppercase letters in column itemName into lowercase (the lower() part). Then, it replaces all vowels and whitespace characters with an empty string "" (the regexp_replace() part). Now, only lowercase consonants without spaces remain in each value. Then, per row, the length operator counts these remaining characters. Note that column itemName in itemsDf does not include any numbers or other characters, so we do not need to make any provisions for these. Finally, by using the alias() operator, we rename the resulting column to consonant_ct.
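To check the returned numbers by hand, the lower, regexp_replace, length chain described above can be mirrored in plain Python. This is only a sketch: the helper name consonant_count is made up for illustration, and Python's re module stands in for Spark's regex engine, which is assumed to treat this simple pattern the same way.

```python
import re


def consonant_count(item_name: str) -> int:
    """Plain-Python analog of length(regexp_replace(lower(col), pattern, ""))."""
    # lower(): normalise case so the pattern only needs lowercase vowels
    lowered = item_name.lower()
    # regexp_replace(): remove every vowel and every whitespace character
    stripped = re.sub(r"a|e|i|o|u|\s", "", lowered)
    # length(): count the characters that are left
    return len(stripped)


# The three itemName values from itemsDf
item_names = [
    "Thick Coat for Walking in the Snow",
    "Elegant Outdoors Summer Dress",
    "Outdoors Backpack",
]

counts = [consonant_count(name) for name in item_names]
print(counts)  # [19, 16, 10]
```

The printed values match the consonant_ct column of the returned DataFrame.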
More info:
- lower: pyspark.sql.functions.lower - PySpark 3.1.2 documentation
- regexp_replace: pyspark.sql.functions.regexp_replace - PySpark 3.1.2 documentation
- length: pyspark.sql.functions.length - PySpark 3.1.2 documentation
- alias: pyspark.sql.Column.alias - PySpark 3.1.2 documentation
Static notebook | Dynamic notebook: See test 2

 

NEW QUESTION 80
Which of the following code blocks returns a DataFrame showing the mean value of column "value" of DataFrame transactionsDf, grouped by its column storeId?

  • A. transactionsDf.groupBy("storeId").avg(col("value"))
  • B. transactionsDf.groupBy(col(storeId).avg())
  • C. transactionsDf.groupBy("storeId").agg(avg("value"))
  • D. transactionsDf.groupBy("value").average()
  • E. transactionsDf.groupBy("storeId").agg(average("value"))

Answer: C

Explanation:
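This question tests your knowledge about how to use the groupBy and agg pattern in Spark. Using the documentation, you can find out that there is no average() method in pyspark.sql.functions. Note also that the avg() method of grouped data expects column names as strings, so passing col("value") to it does not work.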
Static notebook | Dynamic notebook: See test 2
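If the groupBy and agg pattern feels abstract, it may help to see what it computes. The following is a plain-Python sketch, not Spark code: the rows and the helper name mean_by_key are hypothetical, and the function only imitates the result of grouping by storeId and averaging value.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical (storeId, value) rows standing in for transactionsDf
rows = [(1, 10.0), (1, 20.0), (2, 5.0), (2, 7.0), (2, 9.0)]


def mean_by_key(pairs):
    """Group values by key, then average each group."""
    groups = defaultdict(list)
    for key, value in pairs:
        # groupBy step: collect all values that share a key
        groups[key].append(value)
    # agg(avg(...)) step: reduce every group to its mean
    return {key: mean(values) for key, values in groups.items()}


result = mean_by_key(rows)
print(result)  # {1: 15.0, 2: 7.0}
```

The two steps in the helper correspond to groupBy("storeId") and agg(avg("value")) in the correct answer.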

 

NEW QUESTION 81
Which of the following describes characteristics of the Spark driver?

  • A. If set in the Spark configuration, Spark scales the Spark driver horizontally to improve parallel processing performance.
  • B. The Spark driver processes partitions in an optimized, distributed fashion.
  • C. In a non-interactive Spark application, the Spark driver automatically creates the SparkSession object.
  • D. The Spark driver's responsibility includes scheduling queries for execution on worker nodes.
  • E. The Spark driver requests the transformation of operations into DAG computations from the worker nodes.

Answer: D

Explanation:
The Spark driver's responsibility includes scheduling queries for execution on worker nodes.
Correct. The driver plans the work and schedules it for execution on the executors that run on the worker nodes.
The Spark driver requests the transformation of operations into DAG computations from the worker nodes.
No, the Spark driver transforms operations into DAG computations itself.
If set in the Spark configuration, Spark scales the Spark driver horizontally to improve parallel processing performance.
No. There is always a single driver per application, but one or more executors.
The Spark driver processes partitions in an optimized, distributed fashion.
No, this is what executors do.
In a non-interactive Spark application, the Spark driver automatically creates the SparkSession object.
Wrong. In a non-interactive Spark application, you need to create the SparkSession object. In an interactive Spark shell, the Spark driver instantiates the object for you.

 

NEW QUESTION 82
Which of the following code blocks returns approximately 1000 rows, some of them potentially being duplicates, from the 2000-row DataFrame transactionsDf that only has unique rows?

  • A. transactionsDf.sample(False, 0.5)
  • B. transactionsDf.take(1000).distinct()
  • C. transactionsDf.sample(True, 0.5)
  • D. transactionsDf.take(1000)
  • E. transactionsDf.sample(True, 0.5, force=True)

Answer: C

Explanation:
To solve this question, you need to know that DataFrame.sample() is not guaranteed to return the exact fraction of the number of rows specified as an argument. Furthermore, since duplicates may be returned, you should understand that the operator's withReplacement argument should be set to True. A force= argument for the operator does not exist.
While the take() method returns an exact number of rows, it just takes the first specified number of rows (1000 in this question) from the DataFrame. Since the DataFrame does not include duplicate rows, none of the returned rows can be duplicates when using take(), so the correct answer cannot involve take(). In addition, take() returns a list of Row objects rather than a DataFrame, so calling distinct() on its result fails.
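The difference between sampling with and without replacement can be simulated in plain Python. This is a simplified model under stated assumptions, not Spark's implementation: it assumes each row is kept independently, with a Poisson-distributed number of copies when sampling with replacement and a single coin flip otherwise. The function names poisson and sample_rows are made up for this sketch.

```python
import math
import random


def poisson(rng: random.Random, lam: float) -> int:
    """Draw one Poisson(lam) value using Knuth's multiplication method."""
    threshold = math.exp(-lam)
    k, p = 0, 1.0
    while True:
        p *= rng.random()
        if p <= threshold:
            return k
        k += 1


def sample_rows(rows, with_replacement: bool, fraction: float, seed=None):
    """Simplified model of row sampling; the output size is only approximate."""
    rng = random.Random(seed)
    out = []
    for row in rows:
        if with_replacement:
            # A row may be emitted zero, one, or several times
            copies = poisson(rng, fraction)
        else:
            # A row is emitted at most once
            copies = 1 if rng.random() < fraction else 0
        out.extend([row] * copies)
    return out


rows = list(range(2000))  # 2000 unique rows
with_dupes = sample_rows(rows, True, 0.5, seed=42)
no_dupes = sample_rows(rows, False, 0.5, seed=42)

# Total rows vs. distinct rows for each variant
print(len(with_dupes), len(set(with_dupes)))
print(len(no_dupes), len(set(no_dupes)))
```

Both variants return roughly 1000 rows rather than exactly 1000, and only the variant with replacement can contain the same row more than once.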
More info: pyspark.sql.DataFrame.sample - PySpark 3.1.2 documentation
Static notebook | Dynamic notebook: See test 2

 

NEW QUESTION 83
Which of the following code blocks adds a column predErrorSqrt to DataFrame transactionsDf that is the square root of column predError?

  • A. transactionsDf.select(sqrt(predError))
  • B. transactionsDf.select(sqrt("predError"))
  • C. transactionsDf.withColumn("predErrorSqrt", col("predError").sqrt())
  • D. transactionsDf.withColumn("predErrorSqrt", sqrt(predError))
  • E. transactionsDf.withColumn("predErrorSqrt", sqrt(col("predError")))

Answer: E

Explanation:
transactionsDf.withColumn("predErrorSqrt", sqrt(col("predError")))
Correct. The DataFrame.withColumn() operator is used to add a new column to a DataFrame. It takes two arguments: the name of the new column (here: predErrorSqrt) and a Column expression as the new column. In PySpark, you can build a Column expression with col("predError") or by other means, for example transactionsDf.predError. withColumn() itself requires a Column object as its second argument, while functions such as sqrt() also accept just the column name as a string, "predError".
The question asks for the square root. sqrt() is a function in pyspark.sql.functions and calculates the square root. It takes a Column or a column name as an input. Here it is the predError column of DataFrame transactionsDf expressed through col("predError").
transactionsDf.withColumn("predErrorSqrt", sqrt(predError))
Incorrect. In this expression, sqrt(predError) is incorrect syntax. You cannot refer to predError in this way - to Spark it looks as if you are trying to refer to the non-existent Python variable predError.
You could pass transactionsDf.predError, col("predError") (as in the correct solution), or even just "predError" instead.
transactionsDf.select(sqrt(predError))
Wrong. Here, the explanation just above this one about how to refer to predError applies.
transactionsDf.select(sqrt("predError"))
No. While this is correct syntax, it will return a single-column DataFrame only containing a column showing the square root of column predError. However, the question asks for a column to be added to the original DataFrame transactionsDf.
transactionsDf.withColumn("predErrorSqrt", col("predError").sqrt())
No. The issue with this statement is that column col("predError") has no sqrt() method. sqrt() is a member of pyspark.sql.functions, but not of pyspark.sql.Column.
More info: pyspark.sql.DataFrame.withColumn - PySpark 3.1.2 documentation and pyspark.sql.functions.sqrt - PySpark 3.1.2 documentation
Static notebook | Dynamic notebook: See test 2

 

NEW QUESTION 84
Which of the following describes a shuffle?

  • A. A shuffle is a process that compares data across executors.
  • B. A shuffle is a process that is executed during a broadcast hash join.
  • C. A shuffle is a process that allocates partitions to executors.
  • D. A shuffle is a Spark operation that results from DataFrame.coalesce().
  • E. A shuffle is a process that compares data across partitions.

Answer: E

Explanation:
A shuffle is a Spark operation that results from DataFrame.coalesce().
No. DataFrame.coalesce() does not result in a shuffle.
A shuffle is a process that allocates partitions to executors.
This is incorrect.
A shuffle is a process that is executed during a broadcast hash join.
No, broadcast hash joins avoid shuffles and yield performance benefits if at least one of the two tables is small in size (<= 10 MB by default). Broadcast hash joins can avoid shuffles because instead of exchanging partitions between executors, they broadcast a small table to all executors that then perform the rest of the join operation locally.
A shuffle is a process that compares data across executors.
No, in a shuffle, data is compared across partitions, and not executors.
More info: Spark Repartition & Coalesce - Explained (https://bit.ly/32KF7zS)
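One way to picture why a shuffle involves data from multiple partitions is a toy redistribution by key. The sketch below is plain Python with hypothetical data and a made-up helper name, shuffle_by_key. It assumes a simple "key modulo number of partitions" placement rule, which only illustrates the idea and is not Spark's partitioner.

```python
# Hypothetical input: two partitions holding (storeId, value) rows
partitions = [
    [(1, 10), (2, 5), (3, 7)],
    [(2, 8), (1, 4), (3, 1)],
]


def shuffle_by_key(parts, num_out):
    """Move every row to the output partition chosen by its key."""
    out = [[] for _ in range(num_out)]
    for part in parts:
        for key, value in part:
            # Rows with the same key always land in the same output partition,
            # no matter which input partition they came from
            out[key % num_out].append((key, value))
    return out


shuffled = shuffle_by_key(partitions, 2)
for index, part in enumerate(shuffled):
    print(index, part)
# 0 [(2, 5), (2, 8)]
# 1 [(1, 10), (3, 7), (1, 4), (3, 1)]
```

After the redistribution, all rows for a given storeId sit in one partition, which is what an operation such as a grouped aggregation needs before it can combine them.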

 

NEW QUESTION 85
The code block displayed below contains an error. The code block should use Python method find_most_freq_letter to find the letter present most in column itemName of DataFrame itemsDf and return it in a new column most_frequent_letter. Find the error.
Code block:
find_most_freq_letter_udf = udf(find_most_freq_letter)
itemsDf.withColumn("most_frequent_letter", find_most_freq_letter("itemName"))

  • A. The UDF method is not registered correctly, since the return type is missing.
  • B. Spark is not adding a column.
  • C. The "itemName" expression should be wrapped in col().
  • D. UDFs do not exist in PySpark.
  • E. Spark is not using the UDF method correctly.

Answer: E

Explanation:
Correct code block:
find_most_freq_letter_udf = udf(find_most_freq_letter)
itemsDf.withColumn("most_frequent_letter", find_most_freq_letter_udf("itemName"))
Spark should use the previously registered find_most_freq_letter_udf method here, but it is not doing that in the original code block. There, it just uses the non-UDF version of the Python method.
Note that typically, we would have to specify a return type for udf(). This case is an exception, since the default return type for udf() is a string, which is what we are expecting here. If we wanted to return an integer instead, we would have to register the Python function as a UDF using find_most_freq_letter_udf = udf(find_most_freq_letter, IntegerType()).
More info: pyspark.sql.functions.udf - PySpark 3.1.1 documentation
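The question does not show the body of find_most_freq_letter. For readers who want something concrete to experiment with, here is one hypothetical way such a function could be written in plain Python. The handling of None, the case folding, and the decision to ignore non-letter characters are all assumptions of this sketch.

```python
from collections import Counter


def find_most_freq_letter(text):
    """Return the most frequent letter in text, or None if there is none."""
    if text is None:
        # A nullable column can pass None into a Python function
        return None
    # Assumption: compare case-insensitively and ignore spaces and punctuation
    letters = [ch for ch in text.lower() if ch.isalpha()]
    if not letters:
        return None
    # most_common(1) returns a list with one (letter, count) pair
    return Counter(letters).most_common(1)[0][0]


print(find_most_freq_letter("Outdoors Backpack"))  # o
```

The function returns a string, which fits the default return type of udf() mentioned above. On its own it is just a Python function. To apply it to a DataFrame column, it still has to be wrapped with udf(), and the wrapped version has to be the one that is called.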

 

NEW QUESTION 86
......


Exam Format and Content

  • Exam Format: Multiple choice questions

  • Exam Length: 60 questions

  • Exam Duration: 120 minutes

  • Language: The exam is available in Python or Scala only.

  • Passing score: 70%

 
