Data synchronization in Lakehouse
Data synchronization in Lakebase ensures that transactional and analytical data remain up to date across the lakehouse and the Postgres database without requiring complex ETL pipelines.
How It Works
Sync from Delta Lake: Lakebase allows automatic synchronization from Delta tables to Postgres tables, ensuring that data updates are reflected in real-time.
Managed Sync: Instead of manually moving data, Lakebase provides a fully managed synchronization process that continuously updates records.
Optional Secondary Indexes: Users can define indexes to optimize query performance on synchronized data (see the sketch after this list).
Change Data Capture (CDC): Lakebase supports CDC, meaning it tracks inserts, updates, and deletes to maintain consistency.
Multi-Cloud Support: Synchronization works across different cloud environments, ensuring flexibility and scalability.
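To illustrate the "optional secondary indexes" point, here is a minimal sketch, assuming the synced data has landed in a Lakebase Postgres database reachable with ordinary Postgres credentials. The host, database, table, and column names below are placeholders, and psycopg2 is used only because Lakebase speaks the standard Postgres protocol.
import psycopg2

# Placeholder connection details -- substitute your Lakebase instance's values.
conn = psycopg2.connect(
    host="your-lakebase-host",
    dbname="your_database",
    user="your_user",
    password="your_password",
    sslmode="require",
)

with conn, conn.cursor() as cur:
    # Hypothetical synced table; the index speeds up point lookups on customer_id.
    cur.execute("CREATE INDEX IF NOT EXISTS idx_orders_customer ON synced_orders (customer_id);")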
Key Benefits
Eliminates ETL Complexity: No need for custom pipelines—data flows seamlessly.
Real-Time Updates: Ensures low-latency access to fresh data.
Optimized for AI & ML: Supports feature serving and retrieval-augmented generation (RAG).
Secure & Governed: Works with Unity Catalog for authentication and data governance.
While the terms data synchronization and data replication are often used interchangeably, they refer to different things:
Data Synchronization
Ensures that two or more copies of data remain consistent and up-to-date.
Can involve incremental updates, meaning only changed data is transferred.
Often used in distributed systems where data needs to be continuously updated across multiple locations.
Example: Keeping a mobile app's local database in sync with a central cloud database.
Data Replication
Creates exact copies of data across multiple locations.
Typically involves bulk transfers, meaning entire datasets are copied.
Used for backup, disaster recovery, and load balancing.
Example: A read replica of a database used to distribute query load.
Key Differences
Synchronization focuses on keeping data updated across systems, while replication ensures identical copies exist.
Synchronization can be real-time or scheduled, whereas replication is often one-time or periodic.
Synchronization is more dynamic, while replication is more static.
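To make the contrast concrete, here is a toy Python sketch (not tied to any particular product) in which synchronization transfers only the changed rows while replication copies the whole dataset:
# Toy illustration: plain dictionaries stand in for tables.
source = {1: "alice", 2: "bob", 3: "carol"}
target = {1: "alice", 2: "bob"}

def replicate(src):
    # Replication: produce a complete, identical copy of the source.
    return dict(src)

def synchronize(src, tgt):
    # Synchronization: apply only the inserts, updates, and deletes (a CDC-style delta).
    for key in [k for k in tgt if k not in src]:
        del tgt[key]
    tgt.update({k: v for k, v in src.items() if tgt.get(k) != v})
    return tgt

print(replicate(source))            # copies all 3 rows every time
print(synchronize(source, target))  # transfers only row 3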
Databricks Notes
Thursday, June 12, 2025
What is Lakebase
Lakebase is a new serverless Postgres database developed by Databricks. It is designed to integrate seamlessly with data lakehouses, making it easier to manage both transactional and analytical data in a single environment.
Lakebase is built for the AI era, supporting high-speed queries and scalability while eliminating the complexity of traditional database management. It allows developers to sync data between lakehouse tables and Lakebase records automatically, continuously, or based on specific conditions.
Seamless Integration: It connects operational databases with data lakes, eliminating silos between transactional and analytical workloads.
Scalability & Performance: Built on Postgres, it supports high-speed queries and efficient scaling for AI-driven applications.
Simplified Management: Fully managed by Databricks, reducing the complexity of provisioning and maintaining databases.
AI & ML Capabilities: Supports feature serving, retrieval-augmented generation (RAG), and other AI-driven workflows.
Multi-Cloud Support: Works across different cloud environments, ensuring flexibility and reliability.
Best Practices
Optimize Data Synchronization: Use managed sync between Delta Lake and Lakebase to avoid complex ETL pipelines.
Leverage AI & ML Features: Take advantage of feature serving and retrieval-augmented generation (RAG) for AI-driven applications.
Ensure Secure Access: Use Unity Catalog for authentication and governance, ensuring controlled access to data.
Monitor Performance: Regularly analyze query performance and optimize indexes to maintain efficiency (see the sketch after this list).
Utilize Multi-Cloud Flexibility: Deploy across different cloud environments for scalability and reliability.
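For the "Monitor Performance" practice, one possible approach, again assuming plain Postgres connectivity and the same placeholder names as the earlier index sketch, is to run EXPLAIN ANALYZE through a standard client and check whether your secondary indexes are actually being used:
import psycopg2

# Same placeholder connection details as in the earlier sketch.
conn = psycopg2.connect(host="your-lakebase-host", dbname="your_database",
                        user="your_user", password="your_password", sslmode="require")

with conn, conn.cursor() as cur:
    # EXPLAIN ANALYZE executes the query and reports the actual plan and timings.
    cur.execute("EXPLAIN ANALYZE SELECT * FROM synced_orders WHERE customer_id = %s;", (42,))
    for (line,) in cur.fetchall():
        print(line)  # look for 'Index Scan using idx_orders_customer' rather than 'Seq Scan'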
Thursday, May 8, 2025
Explain the query processing in PySpark
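The physical plan below comes from a simple grouped aggregation over a cached DataFrame; code along the following lines (the column names match the plan, the data values are illustrative) would produce a plan of this shape:
from pyspark.sql import SparkSession
from pyspark.sql.functions import sum as _sum

spark = SparkSession.builder.appName("ExplainDemo").getOrCreate()

data = [("Alice", "Engineering", 90000),
        ("Bob", "Engineering", 85000),
        ("Cara", "Marketing", 70000)]
df = spark.createDataFrame(data, ["Name", "Department", "Salary"]).cache()

# Total salary per department, then print the physical plan.
df.groupBy("Department").agg(_sum("Salary").alias("TotalSalary")).explain()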
== Physical Plan ==
AdaptiveSparkPlan isFinalPlan=false
+- HashAggregate(keys=[Department#1], functions=[sum(Salary#2L)])
   +- Exchange hashpartitioning(Department#1, 200), ENSURE_REQUIREMENTS, [id=#60]
      +- HashAggregate(keys=[Department#1], functions=[partial_sum(Salary#2L)])
         +- InMemoryTableScan [Department#1, Salary#2L]
               +- InMemoryRelation [Name#0, Department#1, Salary#2L], StorageLevel(disk, memory, deserialized, 1 replicas)
                     +- *(1) Scan ExistingRDD[Name#0,Department#1,Salary#2L]
Let's break down what the physical plan is showing you:
1. **AdaptiveSparkPlan:**
The top node, `AdaptiveSparkPlan isFinalPlan=false`, indicates that Spark's Adaptive Query Execution (AQE) is enabled. AQE means Spark can adjust its physical plan at runtime based on the actual data and statistics. Here, it informs you that the current plan is not final and may be optimized further as the job executes.
2. **Final Global Aggregation (HashAggregate):**
The next step is a `HashAggregate` node that groups by the key `[Department#1]` and applies the aggregation function `sum(Salary#2L)`. This is the final aggregation step: it merges the partial sums produced further down the plan into the total salary per department. (If the query had no grouping keys at all, this node would instead perform a single global aggregation over the whole dataset.)
3. **Data Exchange (Exchange):**
Before reaching the final aggregation, there's an `Exchange` node. This node handles the data shuffle by redistributing rows across 200 partitions based on hash partitioning of the department column. The exchange ensures that all rows for the same department end up in the same partition so that the subsequent aggregation can compute the final sum correctly. The `ENSURE_REQUIREMENTS` note indicates that Spark is satisfying physical properties (like partitioning) required by the following operators.
4. **Partial Aggregation (HashAggregate):**
Beneath the exchange, another `HashAggregate` node appears. This node computes partial sums of salaries per department. Partial aggregation is a common technique in distributed computing because it reduces the amount of data that has to be shuffled over the network by performing some of the aggregation locally within each partition.
5. **Data Source (InMemoryTableScan and InMemoryRelation):**
The data is ultimately sourced from an `InMemoryTableScan` on the columns `[Department#1, Salary#2L]`. This scan reads data from an `InMemoryRelation`, which is a cached version of your dataset stored in memory (with a storage level that includes disk fallback and a single replica). The `Scan ExistingRDD` at the bottom indicates that this cached DataFrame (or RDD) is being scanned to provide the required columns to the aggregation pipeline.
In summary, the plan shows that Spark is reading data from memory, performing a local (partial) aggregation to compute partial sums of salaries by department, then shuffling the data so that all rows with the same department are grouped together, and finally computing the global sum for each department. This multi-phase strategy (partial then final aggregation) is used to optimize performance and reduce data movement across nodes.
Thursday, April 24, 2025
How to flatten a complex JSON file - Example 2
from pyspark.sql import SparkSession
from pyspark.sql.types import ArrayType, StructType
from pyspark.sql.functions import col, explode_outer
def flatten(df):
    """
    Recursively flattens a PySpark DataFrame with nested structures.

    For any column whose type is either ArrayType or StructType:
    - If it is a StructType, the function expands each field of the struct into a
      new column. The new column names are in the form "parentField_childField".
    - If it is an ArrayType, the function uses `explode_outer` to convert each
      element of the array into a separate row (which is useful for arrays of
      structs or primitive types).

    Parameters:
        df (DataFrame): The input DataFrame that may contain nested columns.

    Returns:
        DataFrame: A flattened DataFrame with no nested columns.
    """
    # Identify columns whose type is either ArrayType or StructType.
    complex_fields = {field.name: field.dataType
                      for field in df.schema.fields
                      if isinstance(field.dataType, (ArrayType, StructType))}
    while complex_fields:
        for col_name, col_type in complex_fields.items():
            if isinstance(col_type, StructType):
                # For a struct: expand its fields as separate columns.
                expanded = [col(col_name + '.' + subfield.name).alias(col_name + '_' + subfield.name)
                            for subfield in col_type.fields]
                df = df.select("*", *expanded).drop(col_name)
            elif isinstance(col_type, ArrayType):
                # For an array: explode it so that each element becomes a separate row.
                df = df.withColumn(col_name, explode_outer(col_name))
        # Recompute the schema to check whether more nested columns remain.
        complex_fields = {field.name: field.dataType
                          for field in df.schema.fields
                          if isinstance(field.dataType, (ArrayType, StructType))}
    return df
Example Usage
if __name__ == "__main__":
    spark = SparkSession.builder.appName("FlattenNestedJson").getOrCreate()

    # Replace this with the path to your nested employee JSON file.
    json_file_path = "/path/to/employee_record.json"

    # Read the nested JSON file.
    df = spark.read.json(json_file_path)

    # Apply the flatten function.
    flat_df = flatten(df)

    # Display the flattened DataFrame.
    flat_df.show(truncate=False)

    spark.stop()
Detecting Complex Types: The function first builds a dictionary (complex_fields) of column names pointing to their data types for any field that is a nested structure (either an array or a struct).
Processing Structs: For each field of type StructType, the code iterates over its subfields and creates new columns named in the pattern "parent_subfield". The original nested column is then dropped.
Processing Arrays: For columns with ArrayType, the function calls explode_outer which converts each element of the array into a separate row (ensuring that null or empty arrays are handled gracefully).
Iterative Flattening: After processing the current set of nested fields, the function rebuilds the dictionary to catch any newly exposed nested fields. This loop continues until no more complex types remain.
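As a quick self-contained check (toy data, not part of the original post), the snippet below builds a tiny DataFrame with one struct column and one array column and runs it through flatten; the struct becomes address_city and address_zip columns, and each phone number becomes its own row:
from pyspark.sql import SparkSession, Row

spark = SparkSession.builder.appName("FlattenDemo").getOrCreate()

demo = spark.createDataFrame([
    Row(name="John", address=Row(city="Springfield", zip="62704"), phones=["555-1234", "555-5678"]),
    Row(name="Jane", address=Row(city="Riverside", zip="92501"), phones=["555-9876"]),
])

# Uses the flatten() function defined above.
flatten(demo).show(truncate=False)
# Resulting columns: name, phones, address_city, address_zip.
# John appears on two rows (one per phone number), Jane on one.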
How to flatten a complex JSON file - Example 1
from pyspark.sql.types import *
from pyspark.sql.functions import *
def flatten(df):
    # Compute complex fields (lists and structs) in the schema.
    complex_fields = dict([(field.name, field.dataType)
                           for field in df.schema.fields
                           if type(field.dataType) == ArrayType or type(field.dataType) == StructType])

    while len(complex_fields) != 0:
        col_name = list(complex_fields.keys())[0]
        print("Processing :" + col_name + " Type : " + str(type(complex_fields[col_name])))

        # If StructType, convert all sub-elements to columns, i.e. flatten structs.
        if (type(complex_fields[col_name]) == StructType):
            expanded = [col(col_name + '.' + k).alias(col_name + '_' + k)
                        for k in [n.name for n in complex_fields[col_name]]]
            df = df.select("*", *expanded).drop(col_name)

        # If ArrayType, add the array elements as rows using the explode function,
        # i.e. explode arrays.
        elif (type(complex_fields[col_name]) == ArrayType):
            df = df.withColumn(col_name, explode_outer(col_name))

        # Recompute remaining complex fields in the schema.
        complex_fields = dict([(field.name, field.dataType)
                               for field in df.schema.fields
                               if type(field.dataType) == ArrayType or type(field.dataType) == StructType])

    return df
Sample Nested Employee Data
[
{
"employeeId": "E001",
"name": {
"first": "John",
"last": "Doe"
},
"contact": {
"email": "john.doe@example.com",
"phones": [
"555-1234",
"555-5678"
]
},
"address": {
"street": "123 Elm St",
"city": "Springfield",
"state": "IL",
"zipcode": "62704"
},
"department": {
"deptId": "D001",
"deptName": "Engineering"
},
"projects": [
{ "projectId": "P001",
"projectName": "Redesign Website", "duration": "3 months" },
{
"projectId": "P002",
"projectName": "Develop Mobile App",
"duration": "6 months"
}
]
},
{
"employeeId": "E002",
"name": {
"first": "Jane",
"last": "Smith"
},
"contact": {
"email": "jane.smith@example.com",
"phones": [
"555-9876"
]
},
"address": {
"street": "456 Oak St",
"city": "Riverside",
"state": "CA",
"zipcode": "92501"
},
"department": {
"deptId": "D002",
"deptName": "Marketing"
},
"projects": [
{
"projectId": "P003",
"projectName": "Product Launch",
"duration": "2 months"
}
]
}
]
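A usage sketch for this version, assuming the sample above is saved as a multi-line JSON file (the path below is a placeholder):
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("FlattenEmployees").getOrCreate()

# The sample is a JSON array spread over multiple lines, so multiLine must be enabled.
df = spark.read.option("multiLine", True).json("/path/to/employees.json")

flat_df = flatten(df)
flat_df.show(truncate=False)

# Expected flattened columns include employeeId, name_first, name_last, contact_email,
# contact_phones, address_street, address_city, address_state, address_zipcode,
# department_deptId, department_deptName, projects_projectId, projects_projectName,
# and projects_duration. E001 yields 4 rows (2 phones x 2 projects), E002 yields 1 row.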
Thursday, April 17, 2025
PySpark PartitionBy - Example
Read a CSV file, group by Year to check the row counts, and then write the data partitioned by Year so that each year lands in its own partition directory.
df = spark.read.format("csv").option("inferSchema", True).option("header", True).option("sep", ",").load("/FileStore/tables/baby_name/input/")
display(df)
df.groupBy("Year").count().show(truncate=False)
df.write.option("header", True).partitionBy("Year").mode("overwrite").csv("/FileStore/tables/flower_name/output/")
Read a CSV file and write it partitioned by both Year and Color, so that each Year/Color combination gets its own partition directory.
df.write.option("header", True).partitionBy("Year", "Color").mode("overwrite").csv("/FileStore/tables/flower_name/output/")
Read a CSV file and, while still partitioning by Year, cap the number of records written per output file (here 4,200 records per file via maxRecordsPerFile).
df.write.option("header", True).option("maxRecordsPerFile", 4200).partitionBy("Year").mode("overwrite").csv("/FileStore/tables/flower_name/output/")
dbutils.fs.rm("/FileStore/tables/flower_name/output/", True)
dbutils.fs.mkdirs("/FileStore/tables/flower_name/output/")
Below is an example of how you might partition a dataset of employee records by a year column using PySpark.
from pyspark.sql import SparkSession
from pyspark.sql.functions import year, to_date, col

# Initialize the Spark session.
spark = SparkSession.builder \
    .appName("EmployeeRecordsPartitioning") \
    .getOrCreate()

# Read the employee records from a CSV file. Adjust the file path and options as needed.
df = spark.read.csv("employee_records.csv", header=True, inferSchema=True)

# Optional: Convert the 'hire_date' column from string to date format.
# (Assumes hire_date is stored in "yyyy-MM-dd" format.)
df = df.withColumn("hire_date", to_date(col("hire_date"), "yyyy-MM-dd"))

# Extract the year from the 'hire_date' column.
# If your dataset already has a year column, this step isn't necessary.
df = df.withColumn("year", year(col("hire_date")))

# Display a few rows to verify the new 'year' column.
df.show()

# Write the DataFrame, partitioning the records into separate directories by the 'year' value.
# The resulting partitioned data is stored in Parquet format at the specified output path.
output_path = "/path/to/output/employee_partitioned"
df.write.mode("overwrite").partitionBy("year").parquet(output_path)

# Stop the Spark session.
spark.stop()
Tuesday, April 15, 2025
Databricks EXPLAIN Plan
The Databricks EXPLAIN plan is a built-in tool that lets you peek under the hood of your Spark SQL queries or DataFrame operations. Its main purpose is to show exactly how your high-level statement is translated, optimized, and executed across your cluster. Here's a streamlined summary:
Multiple Layers of Query Representation:
Parsed Logical Plan: This is where Spark first interprets your query's syntax without yet resolving table or column names.
Analyzed Logical Plan: In this stage, Spark resolves these names and data types, transforming your raw query into one that reflects the structure of your data.
Optimized Logical Plan: Spark then applies various optimization rules such as predicate pushdown, projection pruning, and join reordering—essentially refining the query for efficiency without changing its result.
Physical Plan: Finally, the engine decides on specific execution strategies (like scans, joins, and shuffles) and constructs a plan that details how your operations will run on the cluster.
Modes of EXPLAIN:
Simple Mode (default): Shows only the final physical plan.
Extended Mode: Provides all stages—from the parsed plan through to the physical plan.
Formatted Mode: Organizes the output into a neat overview (physical plan outline) and detailed node information.
Cost and Codegen Modes: Offer additional insights such as cost statistics (when available) or even generated code for physical operations.
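A minimal sketch of invoking these modes from PySpark (the table name is a placeholder; explain(mode=...) is available from Spark 3.0 onward):
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.table("sales")          # placeholder table name

df.explain()                       # simple mode: final physical plan only
df.explain(mode="extended")        # parsed, analyzed, optimized, and physical plans
df.explain(mode="formatted")       # plan outline plus per-node details
df.explain(mode="cost")            # logical plans with cost statistics where available
df.explain(mode="codegen")         # generated code for whole-stage code generation

# The SQL equivalent:
spark.sql("EXPLAIN FORMATTED SELECT * FROM sales WHERE amount > 100").show(truncate=False)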
Why It’s Valuable:
Debugging and Performance Tuning: By examining each layer, you can identify expensive operations (e.g., data shuffles) or inefficient join strategies, which is crucial for optimizing performance and debugging complex queries.
Understanding Spark’s Optimizations: It offers transparency into how Catalyst (Spark’s optimizer) works, helping you appreciate the transition from high-level code to the low-level execution tasks actually run on your hardware.
In essence, the Databricks EXPLAIN plan is like having a roadmap of how your data moves and transforms from the moment you write your query to the time results are delivered. This detail is invaluable for both debugging query issues and refining performance, especially as your datasets and transformations grow more complex.