Tuesday, February 25, 2025

Explode - PySpark

In PySpark, the explode function is used to transform each element of a collection-like column (e.g., array or map) into a separate row.

Suppose we have a DataFrame df with a column fruits that contains an array of fruit names:

from pyspark.sql import SparkSession
from pyspark.sql.functions import explode
spark = SparkSession.builder.appName("Explode Example").getOrCreate()

data = [ ("John", ["Apple", "Banana", "Cherry"]),
("Mary", ["Orange", "Grapes", "Peach"]),
("David", ["Pear", "Watermelon", "Strawberry"])
]

df = spark.createDataFrame(data, ["name", "fruits"])
df.show()

Display the contents of "fruits" only

df.select("fruits").show(truncate=False)

- John: [Apple, Banana, Cherry]
- Mary: [Orange, Grapes, Peach]
- David: [Pear, Watermelon, Strawberry]

Output:
+-----+--------------------+
| name| fruits|
+-----+--------------------+
| John|[Apple, Banana, ...|
| Mary|[Orange, Grapes, ...|
|David|[Pear, Watermelon...|
+-----+--------------------+

Now, let's use the explode function to transform each element of the fruits array into a separate row:

df_exploded = df.withColumn("fruit", explode("fruits"))
df_exploded.show()

Here's what's happening:

1. df.withColumn(): This method adds a new column to the existing DataFrame df.
2. "fruit": This is the name of the new column being added.
3. explode("fruits"): This is the transformation being applied to create the new column. Output:

+-----+--------------------+------+
| name| fruits| fruit|
+-----+--------------------+------+
| John|[Apple, Banana, ...| Apple|
| John|[Apple, Banana, ...|Banana|
| John|[Apple, Banana, ...|Cherry|
| Mary|[Orange, Grapes, ...|Orange|
| Mary|[Orange, Grapes, ...|Grapes|
| Mary|[Orange, Grapes, ...| Peach|
|David|[Pear, Watermelon...| Pear|
|David|[Pear, Watermelon...|Watermelon|
|David|[Pear, Watermelon...|Strawberry|
+-----+--------------------+------+


As you can see, the explode function has transformed each element of the fruits array into a separate row, with the corresponding name value.

To select only the "name" and "fruit" columns from the df_exploded DataFrame, you can use the select method:

df_name_fruit = df_exploded.select("name", "fruit")
df_name_fruit.show()
Output:
+-----+------+
| name| fruit|
+-----+------+
| John| Apple|
| John|Banana|
| John|Cherry|
| Mary|Orange|
| Mary|Grapes|
| Mary| Peach|
|David| Pear|
|David|Watermelon|
|David|Strawberry|
+-----+------+


By using select, you're creating a new DataFrame (df_name_fruit) that contains only the specified columns.

No comments:

Post a Comment

Data synchronization in Lakehouse

Data synchronization in Lakebase ensures that transactional data and analytical data remain up-to-date across the lakehouse and Postgres d...