Tuesday, February 25, 2025

PySpark - Inner Join
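
An inner join returns only the rows whose join key appears in both DataFrames. The example below builds two small DataFrames that share a Name column and joins them on it.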

from pyspark.sql import SparkSession

# Create a Spark session
spark = SparkSession.builder.appName("example").getOrCreate()
# Sample data for DataFrame 1
data1 = [("Alice", 25, "New York"), ("Bob", 30, "Los Angeles"), ("Charlie", 35, "Chicago")]
# Sample data for DataFrame 2
data2 = [("Alice", "F"), ("Bob", "M"), ("Charlie", "M")]
# Column names for the DataFrames
columns1 = ["Name", "Age", "City"]
columns2 = ["Name", "Gender"]

# Create the DataFrames
df1 = spark.createDataFrame(data1, columns1)
df2 = spark.createDataFrame(data2, columns2)

# Show the DataFrames
df1.show()
df2.show()
# Perform an inner join on the "Name" column
joined_df = df1.join(df2, on="Name", how="inner")
# Show the result
joined_df.show()
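
With the sample data above, every Name appears in both DataFrames, so the inner join keeps all three rows (row order in show() output may vary):

+-------+---+-----------+------+
|   Name|Age|       City|Gender|
+-------+---+-----------+------+
|  Alice| 25|   New York|     F|
|    Bob| 30|Los Angeles|     M|
|Charlie| 35|    Chicago|     M|
+-------+---+-----------+------+

As a side note, the same join can also be written with an explicit column condition, sketched below. Unlike joining on a column name string, which coalesces the key into a single column, this form keeps the Name column from both DataFrames, so one copy is typically dropped:

# Equivalent inner join with an explicit condition; drop the duplicate key column
joined_explicit = df1.join(df2, df1["Name"] == df2["Name"], "inner").drop(df2["Name"])
joined_explicit.show()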
