Blog | luminousmen

Choosing the Right Compression Codec

A Guide for People Who Move Data

luminousmen
Jul 29, 2025

Let me tell you about the time I shot myself in the foot with Gzip.

It started innocently enough. We were building a new data pipeline: daily ingestion of CSV files from an upstream team, landing in Azure Blob Storage and then feeding an Apache Spark job downstream. You've probably built something like it.

And because I thought I was being smart (saving costs, shaving off network transfer time, keeping the files tidy; to be fair, I don't remember the exact "why"), I told the data producer team, "Hey, compress the files with Gzip before dropping them into object storage." They said, "Cool." I said, "Great."

Fast forward a couple weeks — things started to smell.

Jobs were slowing down. Stages were stalling — half the tasks were stuck at 0%, while others finished in seconds. Executors would randomly spin up, read one file, and then sit there doing nothing.

At first, I thought I'd messed up the partitioning. Maybe I needed more shuffle memory. Maybe autoscaling was drunk again. You know how it goes: blame literally everything except the obvious. Isn't that the first rule of software engineering?
