Tuesday, February 25, 2025

How to use Split function in PySpark

from pyspark.sql import SparkSession
from pyspark.sql.functions import split

# Initialize a Spark session
spark = SparkSession.builder \
    .appName("Split Column") \
    .getOrCreate()

# Sample data

data = [
(1, "John Doe"),
(2, "Jane Smith"),
(3, "Alice Johnson")
]

# Create DataFrame from sample data
df = spark.createDataFrame(data, ["id", "full_name"])
# Split the 'full_name' column into 'first_name' and 'last_name'

df_split = df.withColumn("first_name", split(df["full_name"], " ").getItem(0)) \
    .withColumn("last_name", split(df["full_name"], " ").getItem(1))

# Show the resulting DataFrame
df_split.show()

# Stop the Spark session

spark.stop()


In this example:
We initialize a Spark session.

We create a DataFrame from sample data with an id column and a full_name column.

We use the split function to split the full_name column into two new columns: first_name and last_name.
We display the resulting DataFrame.

The split function splits the full_name column into an array of strings based on the delimiter (a space in this case), and then getItem(0) and getItem(1) extract the first and last names, respectively.
