In PySpark, the explode function is used to transform each element of a collection-like column (e.g., array or map) into a separate row.
Suppose we have a DataFrame df with a column fruits that contains an array of fruit names:
from pyspark.sql import SparkSession
from pyspark.sql.functions import explode
spark = SparkSession.builder.appName("Explode Example").getOrCreate()
data = [
("John", ["Apple", "Banana", "Cherry"]),
("Mary", ["Orange", "Grapes", "Peach"]),
("David", ["Pear", "Watermelon", "Strawberry"])
]
df = spark.createDataFrame(data, ["name", "fruits"])
df.show()
Output:
+-----+--------------------+
| name|              fruits|
+-----+--------------------+
| John|[Apple, Banana, C...|
| Mary|[Orange, Grapes, ...|
|David|[Pear, Watermelon...|
+-----+--------------------+
Notice that show() truncates long array values by default. To display the full contents of the fruits column, select it and turn truncation off:
df.select("fruits").show(truncate=False)
This confirms each row's complete array:
- John: [Apple, Banana, Cherry]
- Mary: [Orange, Grapes, Peach]
- David: [Pear, Watermelon, Strawberry]
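Before exploding, it can also help to confirm that fruits really is an array column, since explode expects an ArrayType (or MapType) column. A quick check (the schema in the comments is what Spark should infer for this data):
# Inspect the inferred schema
df.printSchema()
# root
#  |-- name: string (nullable = true)
#  |-- fruits: array (nullable = true)
#  |    |-- element: string (containsNull = true)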
Now, let's use the explode function to transform each element of the fruits array into a separate row:
df_exploded = df.withColumn("fruit", explode("fruits"))
df_exploded.show()
Here's what's happening:
1. df.withColumn(): This method adds a new column to the existing DataFrame df.
2. "fruit": This is the name of the new column being added.
3. explode("fruits"): This is the transformation being applied; it generates one output row for each element of the fruits array, and that element becomes the value of the new fruit column.
Output:
+-----+--------------------+----------+
| name|              fruits|     fruit|
+-----+--------------------+----------+
| John|[Apple, Banana, C...|     Apple|
| John|[Apple, Banana, C...|    Banana|
| John|[Apple, Banana, C...|    Cherry|
| Mary|[Orange, Grapes, ...|    Orange|
| Mary|[Orange, Grapes, ...|    Grapes|
| Mary|[Orange, Grapes, ...|     Peach|
|David|[Pear, Watermelon...|      Pear|
|David|[Pear, Watermelon...|Watermelon|
|David|[Pear, Watermelon...|Strawberry|
+-----+--------------------+----------+
As you can see, the explode function has transformed each element of the fruits array into a separate row, repeating the corresponding name value (and the original fruits array) on each one.
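As an aside, the withColumn step and the column selection that follows can be collapsed into a single select call, using alias to name the generated column; a minimal sketch using the df defined above:
# One-step equivalent: explode inside select, renamed with alias()
df.select("name", explode("fruits").alias("fruit")).show()
The step-by-step version below keeps the intermediate df_exploded DataFrame around, which is handy if you still need the original array column.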
To select only the "name" and "fruit" columns from the df_exploded DataFrame, you can use the select method:
df_name_fruit = df_exploded.select("name", "fruit")
df_name_fruit.show()
Output:
+-----+----------+
| name|     fruit|
+-----+----------+
| John|     Apple|
| John|    Banana|
| John|    Cherry|
| Mary|    Orange|
| Mary|    Grapes|
| Mary|     Peach|
|David|      Pear|
|David|Watermelon|
|David|Strawberry|
+-----+----------+
By using select, you're creating a new DataFrame (df_name_fruit) that contains only the specified columns.
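As mentioned at the start, explode also works on map columns: exploding a MapType column produces one row per key/value pair, returned in two columns named key and value. A minimal sketch, with a made-up fruit_counts map column for illustration:
# Hypothetical map column: each person maps a fruit to a quantity
map_data = [
    ("John", {"Apple": 3, "Banana": 2}),
    ("Mary", {"Orange": 1}),
]
df_map = spark.createDataFrame(map_data, ["name", "fruit_counts"])
# Exploding a map yields one row per entry, in columns named "key" and "value"
df_map.select("name", explode("fruit_counts")).show()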