You provide just the right balance of nitty-gritty details needed to understand the _why_ (it's done the way it is), which makes understanding the _how_ (to best use it) much clearer.
Would you consider doing something similar for ORC?
Great one!! One question on this: if pages are the lowest unit of data in Parquet, does each task (or core) read one page or one row group? As far as I've seen, the maximum number of cores used to read a Parquet file is limited by the number of row groups. But if a page is the lowest unit of data, then it should be limited by the number of pages, not row groups. Any thoughts on this?
Yeah, Spark and other Arrow-based readers usually operate at the row-group level, and that's where the parallelism comes from. Pages are internal to column chunks and are designed for I/O efficiency and compression, not for parallel execution, because they're too small to make distributed processing efficient (the coordination cost would be huge). Also, row groups are based on rows :), which means they contain all the columns, so they're self-contained in a way. This lets nodes or processes work on those pieces independently.
Maybe there are other reasons too, but those are the main ones, I think. I hope that makes sense to you.
Amazing. Thanks, guys. It's an eye-opener.
I am a data engineer; this article gave me an orgasm.
You're welcome, I guess ...
Fantastic drill down!
Got it. Thanks!!