7 Comments
User's avatar
amrud's avatar

Amazing. Thnx guys. It's an eye opener.

Expand full comment
Jatin Jangid's avatar

I am a data engineer this article gave me orgasm

Expand full comment
luminousmen's avatar

You're welcome, I guess ...

Expand full comment
Nadav Lavy's avatar

Fantastic drill down!

You provide just the right balance of the nitty details needed to understand the _why_ its one the way it does - makes understanding the _how_ (to best use) much clearer.

Would you consider doing something similar for ORC?

Expand full comment
Praveen Kumar B N's avatar

Great one!! One question on this, if pages are lowest unit of data in parquet, does each task (or core) read one page or one row group? As far as I've seen, maximum number of cores it uses to read a parquet file is limited by number of row groups. But if page is the lowest unit of data, then it should be limited by number of pages and not row groups. Any thoughts on this?

Expand full comment
luminousmen's avatar

That's a great question!

Yeah, Spark and other arrow-based readers usually operate at the row-group level, and that's where the parallelism comes from. Pages are internal to column chunks and are designed for I/O efficiency and compression - not for parallel execution- because they're too small to make distributed processing efficient (the coordination cost would be huge). Also, row-groups are based on rows :), which means they contain all the columns, so they're self-contained in a way. This allows nodes or processes to work with those pieces independently.

Maybe there are other reasons too, but those are the main ones, I think. I hope that makes sense to you

Expand full comment
Praveen Kumar B N's avatar

Got it . Thanks!!

Expand full comment