Tuesday, February 25, 2025

PySpark Functions - filter, groupBy, agg

from pyspark.sql import SparkSession
from pyspark.sql.functions import avg, max

# Create a Spark session
spark = SparkSession.builder.appName("example").getOrCreate()
# Sample data
data = [("Alice", 25, "New York"),
        ("Bob", 30, "Los Angeles"),
        ("Charlie", 35, "Chicago")]

# Create DataFrame
columns = ["Name", "Age", "City"]
df = spark.createDataFrame(data, columns)
# Show the DataFrame
df.show()
# Filter the DataFrame (where() is an alias for filter())
filtered_data = df.filter(df.Age > 30)
filtered_data.show()

# Group by 'City' and calculate the average 'Age'
grouped_data = df.groupBy("City").agg(avg("Age").alias("Average_Age"))
# Show the result
grouped_data.show()

# Use the agg() function to calculate the average and maximum age for each city
aggregated_data = df.groupBy("City").agg(avg("Age").alias("Average_Age"), max("Age").alias("Max_Age"))
# Show the result
aggregated_data.show()

