Here's an example of using the collect() method in PySpark:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
data = [("John", 25), ("Mary", 31), ("David", 42)]
df = spark.createDataFrame(data, ["Name", "Age"])
result = df.collect()
print(result)
Output:
[Row(Name='John', Age=25), Row(Name='Mary', Age=31), Row(Name='David', Age=42)]
The collect() method returns all rows of the DataFrame to the driver as a list of Row objects. Note that this can be memory-intensive for large DataFrames, since every row is pulled into the driver's memory.
To display only the age from the first row, you can use the following code:
data = [("John", 25), ("Mary", 31), ("David", 42)]
df = spark.createDataFrame(data, ["Name", "Age"])
first_row = df.first()
age = first_row.Age
print(age)
Output:
25
Alternatively, you can use:
print(df.first().Age)
Both methods will display the age from the first row, which is 25.
In PySpark, collect() returns a list of Row objects. If you store that list in a variable, say result = df.collect(), then:
result[0] refers to the first Row object in the list.
result[0][0] refers to the first element (or column value) within that first Row object.
So, result[0][0] essentially gives you the value of the first column in the first row of the DataFrame.
Here's an example:
data = [("John", 25), ("Mary", 31), ("David", 42)]
df = spark.createDataFrame(data, ["Name", "Age"])
result = df.collect()
print(result[0][0])
# Output: John
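Why do both result[0][0] and result[0].Name work? A Row behaves much like a named tuple: it supports positional indexing and attribute access at the same time. The sketch below models this with collections.namedtuple from the standard library (a stand-in for illustration, not PySpark's actual Row class):

```python
from collections import namedtuple

# Stand-in for PySpark's Row: like Row, a namedtuple supports
# both positional indexing and attribute access.
Person = namedtuple("Person", ["Name", "Age"])

# Model what df.collect() returns: a plain Python list of row objects.
result = [Person("John", 25), Person("Mary", 31), Person("David", 42)]

print(result[0][0])    # first column of the first row -> John
print(result[0].Name)  # same value via attribute access -> John
print(result[0].Age)   # -> 25
```

Once collect() has run, the result is ordinary Python data, so all the usual list and tuple operations apply.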