You provide just the right balance of nitty-gritty details needed to understand the _why_ (it's done the way it is), which makes understanding the _how_ (to best use it) much clearer.
Would you consider doing something similar for ORC?
Great one!! One question on this: if pages are the lowest unit of data in Parquet, does each task (or core) read one page or one row group? As far as I've seen, the maximum number of cores used to read a Parquet file is limited by the number of row groups. But if a page is the lowest unit of data, then it should be limited by the number of pages, not row groups. Any thoughts on this?
Yeah, Spark and other Arrow-based readers usually operate at the row-group level, and that's where the parallelism comes from. Pages are internal to column chunks and are designed for I/O efficiency and compression, not for parallel execution, because they're too small to make distributed processing efficient (the coordination cost would be huge). Also, row groups are based on rows :), which means they contain all the columns, so they're self-contained in a way. This lets nodes or processes work on those pieces independently.
Maybe there are other reasons too, but those are the main ones, I think. I hope that makes sense to you.
Amazing. Thanks, guys. It's an eye-opener.
I am a data engineer; this article gave me an orgasm.
You're welcome, I guess ...
Fantastic drill down!
Got it. Thanks!!