Tuesday, February 25, 2025

PySpark - Select

from pyspark.sql import SparkSession
# Create a Spark session
spark = SparkSession.builder.appName("example").getOrCreate()
# Sample data
data = [("Alice", 25, "New York"),
        ("Bob", 30, "Los Angeles"),
        ("Charlie", 35, "Chicago")]

# Create DataFrame
columns = ["Name", "Age", "City"]
df = spark.createDataFrame(data, columns)
# Select specific columns
selected_columns = df.select("Name", "City")
# Show the result
selected_columns.show()
print(type(selected_columns))

In this example, a Spark session is created and a DataFrame df is built with the columns 'Name', 'Age', and 'City'. The select method returns a new DataFrame, selected_columns, containing only the 'Name' and 'City' columns; the print(type(...)) call confirms that the result is itself a pyspark.sql.DataFrame, so further transformations can be chained onto it.
