Delta Lake Deletion Vectors: A Complete Overview

ANGE KOUAME
5 min read · Nov 27, 2023


1. What is Delta Lake?

Delta Lake, first introduced by Databricks in 2019, is a hybrid architecture that merges the strengths of data warehouses and data lakes into a unified, open-source architecture suited to big data. It combines the data management capabilities (quality, updates, etc.) of traditional data warehouse systems with the data lake's ability to process and handle varied data types, from structured to unstructured.

Delta Lake achieves this by bringing ACID transactions and robust data management to data lakes, ensuring data integrity and supporting complex operations, thereby overcoming the limitations of both systems.

As Databricks summarizes it:

The lakehouse is an open data management architecture that combines the flexibility, cost-efficiency and scale of the data lake with the data management, schema enforcement and ACID transactions of the traditional warehouse

In short: the best of both worlds.

1.1 ACID transactions

What does ACID mean?

Atomicity: Unlike traditional Parquet formats in data lakes, where operations can have intermediate states, atomicity ensures that each operation either completely succeeds or fails without partial completion.

Consistency: This guarantees the transition from one valid state of the table to another, maintaining data integrity throughout transactions.

Isolation: Transactions like merges, updates, deletes and overwrites occur independently of one another. This isolation ensures that a consistent state of the table is maintained, even amid concurrent read/write operations.

Durability: Once operations are executed, they are permanently recorded. This durability enables the tracking of data transformations over time and supports ‘time travel’ to previous states of the data (see the sketch below).
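To make time travel concrete, here is a minimal sketch using standard Delta Lake SQL; the table name events is only a placeholder for any Delta table with some history.

-- Query a Delta table as it existed at an earlier version or point in time
-- (`events` is a placeholder table name for this sketch)
SELECT * FROM events VERSION AS OF 0;
SELECT * FROM events TIMESTAMP AS OF '2023-11-01';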

1.2 How transactions work under the hood on Delta Lake

Delta tables are essentially a collection of Parquet files coupled with a _delta_log directory. This architecture enables both schema-on-write and declarative schema evolution. The _delta_log folder contains JSON and CRC files that record every transformation made on the table since its inception.

By default, tables created in Databricks (Runtime 8.0 and above) use the Delta format.

For the purpose of this demonstration, let's create a user table partitioned by month. The table describes actions made by social network users across months.

CREATE TABLE IF NOT EXISTS user (user_id INT, month STRING, action STRING)
PARTITIONED BY (month);

INSERT INTO user
VALUES
(1, 'January', 'login'),
(2, 'February', 'logout'),
(3, 'March', 'comment'),
(4, 'April', 'like'),
(3, 'May', 'share'),
(6, 'June', 'upload'),
(3, 'July', 'update'),
(8, 'August', 'delete'),
(3, 'September', 'register'),
(10, 'October', 'purchase');

The partitioned table should look like this:
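You can also check the partition layout with a standard Spark SQL command:

-- List the partitions created by the insert (one directory per month)
SHOW PARTITIONS user;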

and the _delta_log folder should initially contain two JSON files (a quick way to inspect them is shown after the list):

  • 0…0.json => creation of the table
  • 0…1.json => insertion of the different values
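Rather than opening these JSON files by hand, you can read the same transaction log through SQL:

-- Each row corresponds to one commit in _delta_log
-- (version 0 = table creation, version 1 = the INSERT)
DESCRIBE HISTORY user;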

Generally speaking, it is good practice to partition your table on keys like dates or other columns with low cardinality. Combined with partition pruning and predicate pushdown, this reduces the amount of data that queries have to load.
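For example, a filter on the partition column only needs to scan the matching partition directory; a minimal sketch on our demo table:

-- Only the month=March partition is scanned thanks to partition pruning
SELECT user_id, action
FROM user
WHERE month = 'March';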

Let's now suppose that, following a GDPR request, we want to delete user_id = 3 from our user table. Remember that user_id is not a partition column.

DELETE FROM user
WHERE user_id = 3;

Before Delta Lake 2.4, the Parquet files in the affected partitions (March, May, July, September) would be loaded into memory, the specified rows removed, and the resulting files rewritten in those partitions.

Here is an overview of the _delta_log after this transaction:

The deletion process in Delta Lake is initially a soft delete: the old data files, such as the previous month=March Parquet file, are not immediately removed from physical storage. They are only purged during a VACUUM operation on the table. While querying the dataset on month with predicate pushdown remains straightforward and fast, updating it, especially on columns that are not partition keys, can be less efficient, because the update requires rewriting entire files, which is more time-consuming.
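You can see this rewrite in the operation metrics of the DELETE commit; a quick sketch (the exact metric names, such as numRemovedFiles and numAddedFiles, can vary slightly across Delta versions):

-- The operationMetrics column of the latest commit shows how many files
-- were removed and rewritten by the DELETE
DESCRIBE HISTORY user LIMIT 1;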

Here come deletion vectors to the rescue.

2. What are deletion vectors?

Delta Lake introduces an innovative approach to streamlining delete, upsert, and merge operations on Delta tables. It centers on a crucial objective: reducing, whenever possible, the need to rewrite the Parquet files affected by these modifications. How do you leverage this feature? It is a matter of enabling a specific configuration via TBLPROPERTIES on your table, or setting the corresponding parameters at the notebook or cluster level within your data ecosystem.

Databricks recommends using Databricks Runtime 14.1 and above to write tables with deletion vectors to leverage all optimizations. You can read tables with deletion vectors enabled in Databricks Runtime 12.1 and above.

To activate this option, simply set a new table property:

ALTER TABLE user SET TBLPROPERTIES ('delta.enableDeletionVectors' = true);
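You can also enable the property at table creation time, or set a default for newly created tables at the session level. The sketch below assumes a hypothetical user_dv table and uses the Databricks "properties.defaults" convention for the session setting; check the exact key against your runtime.

-- Enable deletion vectors when creating a new table
CREATE TABLE IF NOT EXISTS user_dv (user_id INT, month STRING, action STRING)
PARTITIONED BY (month)
TBLPROPERTIES ('delta.enableDeletionVectors' = 'true');

-- Session-level default for newly created tables
-- (follows the spark.databricks.delta.properties.defaults.* convention;
-- verify the exact key on your Databricks Runtime)
SET spark.databricks.delta.properties.defaults.enableDeletionVectors = true;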

2.1 How does this feature work?

Deletion vectors vastly improve the performance of merge, delete, and update operations on Delta Lake tables. They do this by letting writers mark the positions of deleted rows separately from the data files themselves, which is also known as a “soft delete”. The positions of the deleted rows are encoded in a highly compressed bitmap format, RoaringBitmap, and can be compacted into the data files separately later.

Instead of rewriting several Parquet files, this transaction only produces a single deletion_vectors_xxxx.bin file representing the modifications across all the partitions.
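A quick way to observe this on our demo table: once the property is enabled, re-run the delete and look at the latest commit. With deletion vectors, the metrics report deletion vectors being written rather than whole partitions being rewritten (the exact metric names, such as numDeletionVectorsAdded, depend on the Delta version).

-- Same GDPR-style delete, now served by a deletion vector instead of
-- rewriting the March/May/July/September Parquet files
DELETE FROM user
WHERE user_id = 3;

-- The latest commit's operationMetrics now reference deletion vectors
DESCRIBE HISTORY user LIMIT 1;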

To optimize storage, you can periodically apply the deletion vectors and remove them from your table directory with REORG:

REORG TABLE user APPLY (PURGE);

-- apply the purge only to specific partitions
REORG TABLE user
WHERE month IN ('March', 'April', 'May')
APPLY (PURGE);

This action applies the deletion vectors to the targeted Parquet files, rewriting them without the soft-deleted rows (making them smaller in the case of deletes, or updating them otherwise), and removes the .bin files that were created.
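Note that REORG is itself recorded as a new commit: the old Parquet files and the applied .bin deletion vectors are only physically removed from storage once a VACUUM runs and they fall outside the retention window. A minimal sketch:

-- Physically remove files no longer referenced by the table
-- (the default retention is 7 days, i.e. 168 hours)
VACUUM user RETAIN 168 HOURS;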

In conclusion, this new feature from the Delta Lake open-source team can speed up incremental updates of tables. Benchmarks run while testing the feature suggest workload speedups of roughly 2x to 10x.

If you want to go further and analyze the design, here is the link: https://docs.google.com/document/d/1lv35ZPfioopBbzQ7zT82LOev7qV7x4YNLkMr2-L5E_M/edit?pli=1#heading=h.dd2cc57oc5wk

Feel free to follow me on Medium and LinkedIn (https://www.linkedin.com/in/ange-kouame/) for more news around Spark and Databricks.
