Tuesday, February 25, 2025

How to use Split function in PySpark

from pyspark.sql import SparkSession
from pyspark.sql.functions import split

# Initialize a Spark session
spark = SparkSession.builder \
    .appName("Split Column") \
    .getOrCreate()

# Sample data

data = [
(1, "John Doe"),
(2, "Jane Smith"),
(3, "Alice Johnson")
]

# Create DataFrame from sample data
df = spark.createDataFrame(data, ["id", "full_name"])
# Split the 'full_name' column into 'first_name' and 'last_name'

df_split = df.withColumn("first_name", split(df["full_name"], " ").getItem(0)) \
    .withColumn("last_name", split(df["full_name"], " ").getItem(1))

# Show the resulting DataFrame
df_split.show()

# Stop the Spark session

spark.stop()


In this example:
We initialize a Spark session.

We create a DataFrame from sample data with an id column and a full_name column.

We use the split function to split the full_name column into two new columns: first_name and last_name.
We display the resulting DataFrame.

The split function splits the full_name column into an array of strings based on the delimiter (a space in this case), and then getItem(0) and getItem(1) extract the first and last names, respectively.
