Databricks Notes: Optimizing Table Performance with Z‑Ordering

Monday, April 14, 2025

Optimizing Table Performance with Z‑Ordering

Optimizing Table Performance with Z‑Ordering

Over time, frequent incremental loads (plus file-level operations like compaction) can result in many small files. Queries filtering on certain columns might have to scan many files, which can slow performance. Z‑Ordering is a technique that reorganizes data on disk based on one or more columns. When your table is physically organized by those columns, queries that filter on them can skip reading irrelevant files.

Example: Optimize and Z‑Order a Delta Table

Once your Delta table has been updated with incremental loads, you can run the following SQL command to improve query performance: # Optimize the table and perform Z‑Ordering on the 'id' column spark.sql("OPTIMIZE my_table ZORDER BY (id)")

Explanation:

OPTIMIZE Command: This command compacts small files into larger ones.
ZORDER BY: By ordering the data using the specified column (id in this case), Delta Lake clusters similar values together. This reduction in file-level fragmentation means that queries filtering on id will scan fewer files—cutting down the overall I/O and speeding up query execution .
Tip: You can Z‑Order on multiple columns if your query filters often include more than one attribute (e.g., ZORDER BY (country, city)).

Databricks Notes

Monday, April 14, 2025

Optimizing Table Performance with Z‑Ordering

No comments:

Post a Comment

Data synchronization in Lakehouse

Report Abuse

Labels