Tuesday, February 25, 2025

Collect - PySpark

Here's an example of using the collect() method in PySpark (note that `spark` is a SparkSession, created here explicitly so the sample runs on its own):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

data = [("John", 25), ("Mary", 31), ("David", 42)]
df = spark.createDataFrame(data, ["Name", "Age"])
result = df.collect()
print(result)

Output:

[Row(Name='John', Age=25), Row(Name='Mary', Age=31), Row(Name='David', Age=42)]
The collect method returns all the rows in the DataFrame as a list of Row objects. Note that this can be memory-intensive for large DataFrames.

To display only the age from the first row, you can use the following code:

data = [("John", 25), ("Mary", 31), ("David", 42)]
df = spark.createDataFrame(data, ["Name", "Age"])

first_row = df.first()
age = first_row.Age
print(age)
Output:
25

Alternatively, you can use:
print(df.first().Age)
Both methods will display the age from the first row, which is 25.

In PySpark, collect() returns a list of Row objects. If you assign that list to a variable, say result = df.collect(), then:
result[0] refers to the first Row object in the list.
result[0][0] refers to the first element (or column value) within that first Row object.
So result[0][0] essentially gives you the value of the first column in the first row of the DataFrame.
Here's an example:

data = [("John", 25), ("Mary", 31), ("David", 42)]
df = spark.createDataFrame(data, ["Name", "Age"])
result = df.collect()
print(result[0][0])
# Output: John
