from pyspark.sql import SparkSession
from pyspark.sql.functions import split
# Initialize a Spark session
spark = SparkSession.builder \
    .appName("Split Column") \
    .getOrCreate()
# Sample data
data = [
(1, "John Doe"),
(2, "Jane Smith"),
(3, "Alice Johnson")
]
# Create DataFrame from sample data
df = spark.createDataFrame(data, ["id", "full_name"])
# Split the 'full_name' column into 'first_name' and 'last_name'
df_split = df.withColumn("first_name", split(df["full_name"], " ").getItem(0)) \
    .withColumn("last_name", split(df["full_name"], " ").getItem(1))
# Show the resulting DataFrame
df_split.show()
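# Expected output (column widths are approximate):
# +---+-------------+----------+---------+
# | id|    full_name|first_name|last_name|
# +---+-------------+----------+---------+
# |  1|     John Doe|      John|      Doe|
# |  2|   Jane Smith|      Jane|    Smith|
# |  3|Alice Johnson|     Alice|  Johnson|
# +---+-------------+----------+---------+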
# Stop the Spark session
spark.stop()
In this example:
We initialize a Spark session.
We create a DataFrame from sample data with an id column and a full_name column.
We use the split function to split the full_name column into two new columns: first_name and last_name.
We display the resulting DataFrame.
The split function splits the full_name column into an array of strings based on the delimiter (a space in this case), and then we use getItem(0) and getItem(1) to extract the first and last names, respectively.
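Note that getItem(1) only works cleanly when every name has exactly two space-separated parts; a value like "Mary Ann Lee" would drop "Lee". As a rough sketch that goes beyond the original example (it assumes Spark 3.0+ for the limit argument to split, and uses element_at, both from pyspark.sql.functions), here are two ways to cope with extra tokens:

from pyspark.sql.functions import split, element_at, col

# Option 1: split on the first space only (limit=2), so everything after it
# becomes the last name ("Mary Ann Lee" -> first_name "Mary", last_name "Ann Lee")
df_limited = df.withColumn("name_parts", split(col("full_name"), " ", 2)) \
    .withColumn("first_name", col("name_parts").getItem(0)) \
    .withColumn("last_name", col("name_parts").getItem(1)) \
    .drop("name_parts")

# Option 2: take the last token as the last name; element_at accepts
# negative indexes, which count from the end of the array
df_last_token = df.withColumn(
    "last_name", element_at(split(col("full_name"), " "), -1)
)

df_limited.show()
df_last_token.show()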