Tuesday, February 25, 2025

PySpark - Commonly used functions

Data Manipulation

1. withColumn(): Add a new column to a DataFrame.
2. withColumnRenamed(): Rename an existing column.
3. drop(): Drop one or more columns from a DataFrame.
4. cast(): Cast a column to a different data type (a Column method, typically used inside withColumn() or select()).

Data Analysis

1. count(): Count the number of rows in a DataFrame.
2. sum(): Calculate the sum of a column.
3. avg(): Calculate the average of a column.
4. max(): Find the maximum value in a column.
5. min(): Find the minimum value in a column.

Data Transformation

1. explode(): Transform an array column into separate rows.
2. flatten(): Flatten an array of arrays into a single array.
3. split(): Split a string column into an array.

DataFrame Operations

1. distinct(): Return a DataFrame with unique rows.
2. intersect(): Return a DataFrame with the distinct rows common to two DataFrames.
3. exceptAll(): Return a DataFrame with rows in the first DataFrame but not in the second, preserving duplicates.
4. repartition(): Repartition a DataFrame to increase or decrease the number of partitions (triggers a full shuffle).
5. coalesce(): Reduce the number of partitions of a DataFrame without a full shuffle.

Sorting and Sampling

1. orderBy(): Order a DataFrame by one or more columns (an alias of sort()).
2. sort(): Sort a DataFrame by one or more columns, ascending by default.
3. limit(): Limit the number of rows in a DataFrame.
4. sample(): Return a sampled subset of a DataFrame.
5. randomSplit(): Split a DataFrame into multiple DataFrames randomly.

Statistical Functions

1. corr(): Calculate the correlation between two columns.
2. cov(): Calculate the covariance between two columns.
3. skewness(): Calculate the skewness of a column.
4. kurtosis(): Calculate the kurtosis of a column.
5. approxQuantile(): Calculate approximate quantiles of a numeric column within a given relative error.

User-Defined Functions and Custom Transformations

1. udf(): Create a user-defined function (UDF) to transform data.
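2. apply(): Apply a pandas UDF to each group of a grouped DataFrame (GroupedData.apply(); deprecated in favor of applyInPandas()).
3. transform(): Chain a custom function that takes a DataFrame and returns a DataFrame.
4. map(): Apply a Python function to every row through the underlying RDD (df.rdd.map()); PySpark DataFrames have no map() method of their own.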

String Functions

1. concat(): Concatenate two or more string columns.
2. length(): Calculate the length of a string column.
3. lower(): Convert a string column to lowercase.
4. upper(): Convert a string column to uppercase.
5. trim(): Trim leading and trailing whitespace from a string column.

Date and Time Functions

1. current_date(): Return the current date.
2. current_timestamp(): Return the current timestamp.
3. date_format(): Format a date column.
4. hour(): Extract the hour from a timestamp column.
5. dayofweek(): Extract the day of the week from a date column as an integer (1 = Sunday through 7 = Saturday).
