Tuesday, February 25, 2025

PySpark - Commonly used functions

DataFrame Operations

1. select(): Select specific columns from a DataFrame.
2. filter(): Filter rows based on conditions.
3. where(): An alias for filter(); both accept Column expressions or SQL expression strings.
4. groupBy(): Group rows by one or more columns.
5. agg(): Perform aggregation operations (e.g., sum, count, avg).
6. join(): Join two DataFrames based on a common column.
7. union(): Combine the rows of two DataFrames with matching schemas (columns are matched by position).

Data Manipulation

1. withColumn(): Add a new column to a DataFrame.
2. withColumnRenamed(): Rename an existing column.
3. drop(): Drop one or more columns from a DataFrame.
4. cast(): Cast a column to a different data type (a Column method, typically used inside withColumn()).

Data Analysis

1. count(): Count the number of rows in a DataFrame.
2. sum(): Calculate the sum of a column.
3. avg(): Calculate the average of a column.
4. max(): Find the maximum value in a column.
5. min(): Find the minimum value in a column.

Data Transformation

1. explode(): Transform an array column into separate rows.
2. flatten(): Flatten an array of arrays into a single array.
3. split(): Split a string column into an array.
