Battle of the file formats: Parquet, Delta Lake, Iceberg, Hudi

Tapas Das
15 min read · Aug 17, 2024


With the growing popularity of the data lakehouse, a battle has emerged between four popular formats for storing and processing massive amounts of data efficiently: Parquet, Delta Lake, Iceberg, and Hudi.

Let’s take a deeper dive into these formats to understand their inner workings, features, benefits, and limitations.

Apache Parquet

Apache Parquet is a columnar storage file format designed for efficient data processing, particularly in the context of big data applications. It is an open-source format developed as part of the Apache Hadoop ecosystem, but it is widely used beyond Hadoop in various data processing frameworks, including Apache Spark, Apache Drill, and Amazon Athena.

Structure

A Parquet file is divided into several components, which allow it to efficiently store and retrieve data:

  • File Header: Contains the magic number that identifies the file as a Parquet file.
  • Row Groups: A Parquet file is divided into row groups, each containing a large number of rows. A row group is the minimum unit of data that can be read from or written to a Parquet file.
  • Column Chunks: Each row group contains column chunks, where each chunk stores the data for a particular column across all rows in that row group. The columnar nature allows for efficient access to individual columns.
  • Pages: Column chunks are further divided into pages. Pages are the smallest unit of storage within a Parquet file and can be encoded and compressed individually. Types of pages include data pages (storing the actual column data), dictionary pages (storing dictionary entries if dictionary encoding is used), and index pages (for quick lookups).
  • File Footer: The footer contains metadata about the file, including the schema, the number of rows, and the location of row groups and column chunks within the file. It also includes the file’s magic number at the end, serving as a consistency check.

For more info, refer: File Format | Parquet (apache.org)
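
To make this layout concrete, here is a small sketch using the PyArrow library (an assumption on my part; any Parquet reader exposes similar metadata). It writes a toy table split into two row groups and then inspects the footer metadata; the file name and column names are invented for the example.

    # Illustrative sketch (assumes `pip install pyarrow`); names are made up.
    import pyarrow as pa
    import pyarrow.parquet as pq

    # Write a small table split into two row groups of 500 rows each.
    table = pa.table({
        "user_id": list(range(1_000)),
        "country": ["IN", "US"] * 500,
    })
    pq.write_table(table, "users.parquet", row_group_size=500, compression="snappy")

    # The footer metadata exposes the schema, row groups, and column chunks.
    meta = pq.ParquetFile("users.parquet").metadata
    print(meta.num_rows, meta.num_row_groups)       # 1000 2
    print(meta.row_group(0).column(0).compression)  # SNAPPY
    print(meta.schema)                              # column names and physical types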

Key Features

  • Columnar Storage: Parquet organizes data by columns rather than rows. This means that all values for a particular column are stored together, which is highly efficient for queries that only need to access specific columns rather than the entire dataset.
  • Efficient Data Compression: Due to its columnar storage, Parquet allows for better compression rates. Similar data types are stored together, which enables more effective compression techniques, reducing storage space and I/O bandwidth.
  • Schema Evolution: Parquet supports schema evolution, meaning that you can add or modify columns in the schema without breaking backward compatibility. This is particularly useful in dynamic data environments.
  • Splittable Format: Parquet files can be split into smaller chunks for parallel processing. This feature is crucial for distributed data processing systems, allowing large datasets to be processed concurrently by multiple nodes.
  • Compatibility: Parquet is designed to work with various big data processing tools and frameworks. It is supported by most data processing engines, such as Apache Hadoop, Apache Spark, and others, making it a versatile format for big data analytics.
  • Support for Complex Data Types: Parquet supports complex nested data structures, including arrays, maps, and structs. This allows it to efficiently store and query hierarchical or multi-dimensional data.
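
Columnar storage is what makes these features pay off at read time: a query engine can fetch only the column chunks it needs, and row-group statistics let it skip data that cannot match a filter. A minimal PyArrow sketch of column projection and filtering, with a hypothetical file and hypothetical column names:

    # Read only selected columns (and filter rows) from a Parquet file;
    # the file and column names are hypothetical.
    import pyarrow.parquet as pq

    orders = pq.read_table(
        "orders.parquet",
        columns=["order_id", "amount"],     # column projection: only these chunks are read
        filters=[("country", "=", "IN")],   # row-group/page pruning via column statistics
    )
    print(orders.num_rows)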

Benefits

  • Query Performance: Because of its columnar storage, Parquet is highly efficient for queries that need to access only a subset of columns, reducing the amount of data that needs to be read from disk.
  • Storage Efficiency: The combination of columnar storage and advanced compression techniques leads to significant storage savings, particularly for large datasets with repetitive data.
  • Interoperability: Parquet’s widespread support across various big data tools and platforms makes it an ideal choice for environments that need to process and analyze large datasets with multiple tools.
  • Cost-Effective: By reducing storage requirements and improving query efficiency, Parquet helps lower the overall cost of data processing and storage.

Limitations

  • No ACID Transactions: Parquet is a file format, not a data management system, so it doesn’t support ACID (Atomicity, Consistency, Isolation, Durability) transactions. This can make it challenging to manage concurrent writes or ensure data consistency.
  • Limited Schema Evolution: While Parquet supports schema evolution to some extent, it’s limited compared to more sophisticated table formats like Delta Lake or Iceberg. Changing the schema might require rewriting the data.
  • No Data Management Features: Parquet doesn’t offer data versioning, time travel, or built-in support for handling streaming data.
  • Complex File Management: Users must manually manage Parquet files, which can be error-prone, especially with large-scale datasets.

Delta Lake

Delta Lake is an open-source storage layer that brings ACID (Atomicity, Consistency, Isolation, Durability) transactions to data stored as Apache Parquet files. It is used primarily in data lake architectures to ensure data reliability and consistency while enabling complex, large-scale data processing. Delta Lake is tightly integrated with Apache Spark, but it can also be used with other processing engines.

Structure

Delta Lake’s structure builds on top of the underlying Parquet files, with additional components to manage data consistency, versioning, and ACID transactions.

  • Transaction Log: The transaction log is the core component of Delta Lake. It records every change made to the data, including inserts, updates, deletes, and schema changes. The log is stored as a sequence of JSON files that detail the operations performed on the data.
  • Parquet Files: The actual data in a Delta Lake table is stored in Parquet files. Each transaction may create new Parquet files, and older files might be retained or removed depending on the operations performed.
  • Checkpoints: To optimize performance, Delta Lake periodically writes out checkpoints that summarize the state of the transaction log up to a certain point. These checkpoints allow faster recovery and query performance by reducing the need to read through the entire transaction log.
  • Data Files: The Parquet files containing the actual data are organized into a hierarchical directory structure. Each version of the data corresponds to a set of Parquet files, and the transaction log keeps track of which files belong to which version.

For more info, refer: Welcome to the Delta Lake documentation — Delta Lake Documentation
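
A hedged sketch of this layout using the deltalake Python package (the delta-rs bindings, which work without Spark); the table path and columns are made up. Each write appends a numbered JSON commit under _delta_log/, and the table object resolves the current state from that log.

    # Sketch using the `deltalake` (delta-rs) package; path and columns are invented.
    import pandas as pd
    from deltalake import DeltaTable, write_deltalake

    # Two writes -> two JSON commits in /tmp/events/_delta_log/
    write_deltalake("/tmp/events", pd.DataFrame({"id": [1, 2], "value": ["a", "b"]}))
    write_deltalake("/tmp/events", pd.DataFrame({"id": [3], "value": ["c"]}), mode="append")

    dt = DeltaTable("/tmp/events")
    print(dt.version())   # 1 (versions are zero-based, one per commit)
    print(dt.files())     # Parquet data files referenced by the latest version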

Key Features

  • ACID Transactions: Delta Lake ensures that all operations on the data are atomic, consistent, isolated, and durable. This allows for reliable data ingestion and updates without risking data corruption or inconsistency, even in the presence of concurrent reads and writes.
  • Scalable Metadata Handling: Delta Lake uses a transaction log to track changes to the data, allowing it to efficiently manage large-scale datasets with millions or billions of files. This transaction log is a central component that ensures data consistency and enables features like time travel.
  • Time Travel: Delta Lake allows users to access previous versions of the data by using the transaction log. This feature is useful for auditing, debugging, or reverting to a previous state of the data.
  • Data Versioning: Every change to the data (such as updates, inserts, and deletes) creates a new version in the transaction log. This versioning system enables features like time travel and allows for safe rollbacks if necessary.
  • Schema Enforcement and Evolution: Delta Lake enforces a schema at write time, ensuring that all data conforms to a specified structure. Additionally, it supports schema evolution, allowing the schema to change over time in a controlled manner (e.g., adding new columns).
  • Unified Batch and Streaming Processing: Delta Lake can handle both batch and streaming data in a unified way. This means that you can ingest real-time streaming data and batch data into the same Delta Lake table and query it without needing to handle them separately.
  • Data Lineage: Delta Lake keeps track of data transformations and the lineage of data over time. This feature is valuable for understanding how data has changed, ensuring data quality, and meeting regulatory requirements.
  • Support for Deletes, Updates, and Merges: Unlike traditional data lakes, where such operations can be challenging and inefficient, Delta Lake natively supports these operations, allowing for easier and more efficient data management.
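
To illustrate the upsert and time-travel features, here is a PySpark sketch. It assumes an existing SparkSession named spark configured with the delta-spark package, an existing Delta table at a hypothetical path, and an updates_df DataFrame with matching columns; none of these names come from the article.

    # PySpark sketch; `spark`, `updates_df`, the path, and the columns are assumed.
    from delta.tables import DeltaTable

    # Upsert: MERGE incoming records into an existing Delta table.
    target = DeltaTable.forPath(spark, "/tmp/delta/customers")
    (target.alias("t")
        .merge(updates_df.alias("s"), "t.customer_id = s.customer_id")
        .whenMatchedUpdateAll()
        .whenNotMatchedInsertAll()
        .execute())

    # Time travel: read the table as it was at an earlier version.
    v0 = spark.read.format("delta").option("versionAsOf", 0).load("/tmp/delta/customers")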

Benefits

  • Data Reliability: Delta Lake provides strong guarantees around data consistency and reliability, making it easier to manage large datasets with complex workflows and ensuring that the data is trustworthy.
  • Efficient Data Management: With support for ACID transactions, Delta Lake simplifies operations like updates, deletes, and merges, which are traditionally difficult to perform efficiently in data lakes.
  • Time Travel and Data Versioning: The ability to query previous versions of the data and rollback changes is a powerful feature for data auditing, debugging, and compliance with data governance policies.
  • Performance Improvements: Delta Lake’s transaction log and checkpoints improve query performance by reducing the overhead associated with managing large datasets. This is particularly beneficial in scenarios with frequent updates or streaming data.
  • Unified Architecture: By supporting both batch and streaming data in a unified framework, Delta Lake simplifies the architecture of data pipelines, allowing for more streamlined and efficient data processing.
  • Open Source and Ecosystem Integration: Delta Lake is open-source and has broad support in the big data community. It integrates well with Apache Spark and other big data tools, making it easy to adopt within existing infrastructures.

Limitations

  • Tight Coupling with Spark: Delta Lake is tightly integrated with Apache Spark. While it can be used with other engines like Presto or Hive, the experience is best with Spark. This can limit its adoption in non-Spark environments.
  • Limited Ecosystem Support: Compared to more open formats like Iceberg or Parquet, Delta Lake has a smaller ecosystem of tools and platforms that natively support it.
  • Potential Vendor Lock-In: Delta Lake was initially developed by Databricks, and while it is open source, some advanced features and optimizations are more accessible in the Databricks ecosystem.
  • Overhead for Small Files: Like other formats that provide transaction support, Delta Lake can incur overhead when dealing with many small files due to the maintenance of transaction logs.

Apache Iceberg

Apache Iceberg, developed by Netflix, is an open-source table format designed for managing large-scale datasets in data lakes. Iceberg provides a high-performance and reliable framework for working with massive datasets, offering features such as schema evolution, partitioning, and versioning. Unlike traditional file formats, Iceberg is designed to handle petabytes of data efficiently, making it ideal for big data applications.

Structure

The internal structure of Apache Iceberg is designed to manage large datasets efficiently while ensuring flexibility and performance:

  • Table Metadata: Iceberg tables are defined by metadata files that store information about the schema, partitioning, snapshots, and other properties of the table. The metadata file is the core component that drives how data is read and written.
  • Manifests and Manifest Lists:
      • Manifests: These are files that contain metadata about data files within a table. A manifest tracks a set of data files, including their file paths, partition values, and row counts. Each manifest represents a portion of the table’s data.
      • Manifest Lists: These files aggregate multiple manifests into a single snapshot, representing the entire state of the table at a given point in time. When a new snapshot is created, a new manifest list is generated.
  • Data Files: The actual data in an Iceberg table is stored in Parquet, ORC, or Avro files. These data files are managed and organized according to the table’s partitioning and stored in the data lake’s underlying storage system.
  • Snapshots: Each snapshot represents a consistent view of the table at a specific time, including all the data files that comprise the table at that point. Snapshots are immutable, and new snapshots are created for every change (insert, update, delete) to the table.
  • Partitioning: Iceberg partitions data dynamically and efficiently. Instead of relying on rigid directory structures for partitions, it uses partitioning expressions that allow flexible data grouping and efficient query pruning.

For more info, refer: Introduction — Apache Iceberg™
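
One way to see these layers in practice is through Iceberg’s built-in metadata tables, which expose snapshots, manifests, and data files as queryable tables. A PySpark sketch, assuming a SparkSession named spark with an Iceberg catalog named demo and a hypothetical db.events table:

    # PySpark sketch; the catalog, database, and table names are hypothetical.
    spark.sql("SELECT snapshot_id, committed_at, operation FROM demo.db.events.snapshots").show()
    spark.sql("SELECT path, added_data_files_count FROM demo.db.events.manifests").show()
    spark.sql("SELECT file_path, record_count FROM demo.db.events.files").show()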

Key Features

  • Schema Evolution: Iceberg allows you to change the schema of a table without requiring costly migrations or full rewrites of data. This includes adding, renaming, or dropping columns, and changing column types. Iceberg ensures backward and forward compatibility with different schema versions.
  • Partitioning: Iceberg introduces a flexible and efficient approach to partitioning. It supports partitioning strategies that can be defined independently of the data’s physical layout. This enables efficient querying by pruning partitions that don’t match query filters, improving performance.
  • ACID Transactions: Iceberg provides ACID (Atomicity, Consistency, Isolation, Durability) guarantees for data modifications, allowing safe and reliable data updates, inserts, and deletes. This is essential for ensuring data consistency in multi-user environments.
  • Data Versioning and Time Travel: Iceberg tracks changes to data over time, enabling users to query historical versions of the data. This feature, known as “time travel,” is useful for auditing, debugging, and recovering from accidental modifications.
  • Snapshot Isolation: Iceberg uses a snapshot-based isolation model, where each operation (like insert, update, or delete) creates a new snapshot of the table’s state. This allows concurrent operations to be handled efficiently without conflicts.
  • Efficient Metadata Management: Iceberg’s metadata is designed to scale with the size of the dataset, enabling efficient querying and management of large tables. It includes metadata files that describe the structure and layout of the data, which helps optimize query planning.
  • Multi-Engine Support: Iceberg is engine-agnostic and integrates with various data processing engines like Apache Spark, Apache Flink, Trino, and Presto. This flexibility allows users to choose the best tools for their workloads.
  • Hidden Partitioning: Iceberg abstracts the complexity of partitioning from the user. It automatically handles partition pruning and ensures that users don’t need to manage partitions manually, reducing the likelihood of errors.
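
A short PySpark sketch of hidden partitioning, schema evolution, and time travel, under the same assumptions as above (a SparkSession named spark, an Iceberg catalog named demo); the table name, snapshot ID, and timestamp are placeholders.

    # Hidden partitioning: partition by a transform of a column, not by directory layout.
    spark.sql("""
        CREATE TABLE demo.db.events (id BIGINT, ts TIMESTAMP, payload STRING)
        USING iceberg
        PARTITIONED BY (days(ts))
    """)

    # Schema evolution: metadata-only change, no rewrite of existing data files.
    spark.sql("ALTER TABLE demo.db.events ADD COLUMN country STRING")

    # Time travel: query an earlier snapshot (placeholder ID) or point in time.
    spark.sql("SELECT * FROM demo.db.events VERSION AS OF 123456789").show()
    spark.sql("SELECT * FROM demo.db.events TIMESTAMP AS OF '2024-08-01 00:00:00'").show()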

Benefits

  • Scalability: Iceberg is designed to handle tables with billions of records and petabytes of data. Its efficient metadata management and partitioning strategies allow it to scale without performance degradation.
  • Reliability: With ACID transaction support and snapshot isolation, Iceberg ensures data consistency and reliability, even in multi-user or concurrent environments.
  • Flexibility: The ability to evolve schemas and use flexible partitioning strategies makes Iceberg adaptable to changing data requirements and reduces the complexity of managing large datasets.
  • Performance: Iceberg optimizes query performance through techniques like partition pruning, efficient metadata handling, and hidden partitioning. This results in faster queries and lower resource usage.
  • Integration: Iceberg’s compatibility with various data processing engines allows it to be used in diverse environments, making it a versatile choice for big data processing.
  • Version Control and Time Travel: The ability to query historical data versions and revert to previous states provides powerful tools for data management, auditing, and recovery.

Limitations

  • Complexity: Iceberg introduces more complexity in terms of setup and configuration compared to simpler formats like Parquet or even Delta Lake. It requires a deeper understanding to take full advantage of its features.
  • Performance Overhead: The extra features like schema evolution, hidden partitioning, and versioning can introduce overhead, especially in scenarios where these features are not needed.
  • Evolving Ecosystem: Although it is gaining traction, Iceberg’s ecosystem and community support are still maturing compared to more established formats.
  • Compatibility: While Iceberg is designed to be compatible with many compute engines (like Spark, Presto, and Flink), full compatibility and optimized performance may require additional configuration or newer versions of these engines.

Apache Hudi

Apache Hudi (Hadoop Upserts Deletes and Incrementals) is an open-source data management framework that provides the ability to manage large datasets on top of distributed storage systems like HDFS, Amazon S3, and Google Cloud Storage. Hudi brings stream processing capabilities to batch processing frameworks, enabling efficient data ingestion, updates, and deletes in big data environments. It was originally developed by Uber to address the challenges of maintaining data freshness in data lakes.

Structure

Apache Hudi organizes data in a data lake using a structured approach that facilitates efficient data management, querying, and processing:

  • Hudi Tables: Hudi organizes data into tables, which can be of two types: Copy on Write (COW) and Merge on Read (MOR). Each table is a collection of files stored in a distributed file system like HDFS or cloud storage.
  • Copy on Write (COW) Tables: Data is stored in columnar base files (e.g., Parquet). When data is updated, a new version of the file is created, and the old version is marked for deletion during cleaning.
  • Merge on Read (MOR) Tables: Data is stored in a combination of base files (in Parquet format) and delta logs (in Avro format). When data is updated, only the delta logs are updated, and the base files are compacted periodically to incorporate the changes.
  • Commit Timeline: Hudi maintains a commit timeline that tracks all changes to the dataset over time. Each commit represents a write operation (insert, update, delete) and includes metadata about the operation, such as the timestamp, file paths, and the affected records.
  • Data Files: Hudi stores the actual data in Parquet files for base data and Avro files for delta logs (in the case of MOR tables). These files are organized by partitioning schemes (e.g., by date, region) to improve query performance.
  • Indexes: Hudi uses indexes to map record keys to specific file locations. This indexing mechanism is crucial for efficiently performing upserts and deletes by quickly locating the records that need to be updated or removed.
  • Delta Logs: In MOR tables, Hudi maintains delta logs that store incremental changes to the data. These logs allow for quick updates and serve as a temporary storage area before the data is compacted into base files.
  • Compaction: Hudi periodically compacts delta logs into base files in MOR tables. Compaction is the process of merging the changes recorded in delta logs with the base files to optimize query performance.

For more info, refer: Apache Hudi Stack | Apache Hudi
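
To ground the COW/MOR distinction, here is a hedged PySpark sketch of writing a Hudi table. It assumes a SparkSession with the hudi-spark bundle on the classpath and an existing trips_df DataFrame; the table name, path, and field names are invented.

    # PySpark sketch; `trips_df`, the path, and the field names are assumed.
    hudi_options = {
        "hoodie.table.name": "trips",
        "hoodie.datasource.write.table.type": "MERGE_ON_READ",  # or "COPY_ON_WRITE"
        "hoodie.datasource.write.recordkey.field": "trip_id",
        "hoodie.datasource.write.precombine.field": "ts",
        "hoodie.datasource.write.partitionpath.field": "city",
    }

    (trips_df.write.format("hudi")
        .options(**hudi_options)
        .mode("overwrite")
        .save("/tmp/hudi/trips"))
    # The commit timeline is kept under /tmp/hudi/trips/.hoodie/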

Key Features

  • Upserts and Deletes: Hudi supports the ability to perform upserts (updates and inserts) and deletes on data stored in a data lake. This allows for the correction of records, updates to existing records, and the deletion of outdated or incorrect data, making it easier to maintain data quality.
  • Incremental Data Processing: Hudi enables incremental data processing by tracking changes to the data. It allows for the efficient ingestion and processing of only the new or updated records rather than the entire dataset, reducing resource consumption and processing time.
  • ACID Transactions: Hudi provides ACID (Atomicity, Consistency, Isolation, Durability) semantics for data operations, ensuring data consistency and reliability even in the presence of concurrent writes and reads.
  • Indexing: Hudi maintains an index that maps record keys to file locations, enabling efficient record lookups during upserts and deletes. This index improves the performance of write operations and ensures that records are updated correctly.
  • Snapshot Isolation and Time Travel: Hudi allows users to query data as it existed at a specific point in time, providing snapshot isolation. This time travel feature is useful for auditing, debugging, and historical analysis.
  • Optimized Data Layout: Hudi supports two table types: Copy on Write (columnar base files only, read-optimized) and Merge on Read (columnar base files plus row-based delta logs, write-optimized). This flexibility allows users to choose between write-optimized and read-optimized storage, depending on their workload requirements.
  • Compaction and Cleaning: Hudi automatically manages the data layout by performing compaction, which merges small files into larger ones to improve query performance. It also supports cleaning policies to remove outdated or redundant data files, freeing up storage space.
  • Integration with Big Data Ecosystem: Hudi integrates well with various big data processing engines like Apache Spark, Apache Flink, Apache Hive, and Presto. It can be used with existing data pipelines and storage systems.
  • Support for Bulk Inserts: Hudi supports bulk insert operations for large initial loads of data, providing high throughput for data ingestion.
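
A sketch of an upsert followed by an incremental read, under the same assumptions as the previous block; updates_df is another assumed DataFrame, and the begin instant time is a placeholder commit timestamp from the Hudi timeline.

    # Upsert new records into the same table.
    (updates_df.write.format("hudi")
        .options(**hudi_options)
        .option("hoodie.datasource.write.operation", "upsert")
        .mode("append")
        .save("/tmp/hudi/trips"))

    # Incremental query: read only records changed since a given commit (placeholder time).
    incremental = (spark.read.format("hudi")
        .option("hoodie.datasource.query.type", "incremental")
        .option("hoodie.datasource.read.begin.instanttime", "20240817000000")
        .load("/tmp/hudi/trips"))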

Benefits

  • Efficient Data Management: Hudi’s support for upserts and deletes makes it easy to manage evolving datasets, allowing users to correct, update, and delete records efficiently in a data lake environment.
  • Reduced Data Latency: Hudi enables near-real-time data ingestion and processing by supporting incremental data updates and time travel queries. This reduces the latency between data arrival and availability for querying.
  • Scalability: Hudi is designed to handle large-scale datasets, supporting efficient indexing, compaction, and partitioning to manage petabytes of data without compromising performance.
  • Data Consistency and Reliability: With ACID transactions and snapshot isolation, Hudi ensures that data remains consistent and reliable, even in the face of concurrent reads and writes.
  • Cost Efficiency: Hudi’s compaction and cleaning features help reduce storage costs by minimizing data redundancy and optimizing file sizes. This is especially important in cloud environments where storage costs can be significant.
  • Flexibility: Hudi’s dual storage formats (COW and MOR) allow users to optimize their data layout based on their specific workload requirements, whether they prioritize write performance or query performance.
  • Integration with Existing Ecosystems: Hudi’s compatibility with various data processing engines and storage systems makes it easy to integrate into existing big data pipelines, leveraging the power of tools like Apache Spark and Apache Flink.

Limitations

  • Complexity of Configuration: Apache Hudi offers a wide range of features (like incremental data processing and real-time views), but this can also make it complex to configure and manage, especially for teams that don’t need all the advanced features.
  • Tight Coupling with Spark: Similar to Delta Lake, Hudi is primarily optimized for use with Apache Spark. Although it supports other engines, the experience may not be as seamless.
  • Potential Performance Overhead: The transaction and versioning features of Hudi can introduce performance overhead, particularly in scenarios with high-frequency updates or deletes.
  • Data Duplication: Hudi provides support for different types of views (Copy-on-Write and Merge-on-Read), but this can sometimes lead to data duplication and increased storage requirements.
  • Limited Ecosystem: Although Hudi is growing in popularity, its ecosystem is still catching up with more mature alternatives like Delta Lake or Iceberg.

Summary

Choosing the right format depends on specific use cases and requirements.

  • Parquet is a simple, efficient format but lacks data management features.
  • Delta Lake offers ACID transactions but is tightly coupled with Spark.
  • Iceberg is highly flexible but introduces complexity and performance overhead.
  • Hudi provides real-time data management but can be complex and Spark-dependent.

By understanding the strengths and weaknesses of each format, you can make informed decisions for your data architecture.
