Using Apache Parquet to store data
Are you a data scientist using CSV files to store your data? What if I told you there is a better way? Can you imagine a
- lighter
- faster
- cheaper
file format to save your datasets?
Don’t get me wrong, I actually love CSVs. They’re super easy to open with any text editor, inspect, and share with others. Plus, they’ve become the go-to format for datasets in the AI/ML community, so they definitely have their perks.
However, there’s a little hiccup with CSVs that can be a bit annoying. You see, they’re stored as a list of rows, which makes them kinda slow when it comes to querying data: even if you only need a couple of columns, every row still has to be read in full. SQL-style analytical queries and CSV just don’t mesh well together. On top of that, CSVs are plain text, so they tend to hog a lot of disk space, making storage a bit of a headache.
Is there an alternative to CSVs? Yes!
Welcome Parquet
Apache Parquet is a columnar storage format available to any project […], regardless of the choice of data processing framework, data model or programming language.
— https://parquet.apache.org/
As an alternative to the CSV row-oriented format, we have a column-oriented format: Parquet.
Parquet is an open-source format for storing data, licensed under the Apache License. Data engineers are well acquainted with Parquet. But, sadly, data scientists are still lagging behind.
How is Parquet format different from CSV? Let’s imagine you have this dataset.
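For concreteness, here is a small, made-up dataset built with pandas (the column names and values are purely illustrative):

```python
import pandas as pd

# A small, made-up example dataset (names and values are purely illustrative)
df = pd.DataFrame({
    "user_id": [1, 2, 3, 4],
    "country": ["ES", "US", "US", "FR"],
    "age": [34, 29, 41, 23],
    "spend_usd": [120.50, 80.00, 200.75, 15.30],
})
print(df)
```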
Internally, the CSV file stores the data based on its rows.
Parquet, on the other hand, stores the data based on its columns.
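You can actually see that column-oriented layout by writing the example dataset to Parquet and inspecting the file’s metadata. A minimal sketch using pyarrow (the engine pandas typically uses for Parquet when it is installed):

```python
import pandas as pd
import pyarrow.parquet as pq

# Same made-up example dataset as above
df = pd.DataFrame({
    "user_id": [1, 2, 3, 4],
    "country": ["ES", "US", "US", "FR"],
    "age": [34, 29, 41, 23],
    "spend_usd": [120.50, 80.00, 200.75, 15.30],
})
df.to_parquet("example.parquet")

# A Parquet file is split into row groups, each holding one chunk per column
meta = pq.ParquetFile("example.parquet").metadata
print(meta)                         # row count, column count, number of row groups
print(meta.row_group(0).column(0))  # one column chunk: its type, encodings and compression
```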
Why is column storage better than row storage, you ask? Two technical reasons and one business reason.
Tech reason #1: Parquet files are way smaller than CSVs. Parquet compresses each column based on its data type (integers, strings, dates, and so on), which makes the files way smaller. A massive 1TB CSV file can shrink to a Parquet file of roughly 100GB (around 10% of the original size), depending on how compressible the data is.
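The exact savings depend on your data and the compression codec, so treat that 10% figure as a rough order of magnitude. You can measure it yourself with a quick sketch like this (the DataFrame is made up; Parquet uses Snappy compression by default here):

```python
import os

import numpy as np
import pandas as pd

# A made-up DataFrame: one million rows of repetitive, highly compressible data
df = pd.DataFrame({
    "category": np.random.choice(["a", "b", "c"], size=1_000_000),
    "value": np.random.randint(0, 100, size=1_000_000),
})

df.to_csv("data.csv", index=False)
df.to_parquet("data.parquet")  # Snappy compression by default

print("CSV size:    ", os.path.getsize("data.csv"), "bytes")
print("Parquet size:", os.path.getsize("data.parquet"), "bytes")
```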
Tech reason #2: Parquet files are lightning-fast when it comes to querying data. Since the data is organized in columns, it’s super quick to scan and extract what you need. So, if you’re running an SQL query and only need certain columns, it won’t waste time scanning the rest. That means faster queries and less waiting around.
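In pandas, this column pruning is one keyword argument away: read_parquet can load only the columns you ask for. A small sketch, reusing the hypothetical data.parquet file from the previous snippet:

```python
import pandas as pd

# Only the 'value' column is read from disk; the other columns are never touched
df = pd.read_parquet("data.parquet", columns=["value"])
print(df.head())
```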
Business reason #3: Parquet files are a budget-friendly choice. If you’re using storage services like AWS S3 or Google Cloud Storage, they charge you based on how much data you store or scan. But since Parquet files are lightweight and speedy to scan, you’ll end up storing the same data for a fraction of the cost. So, it’s a win-win for your wallet!
And now the cherry on top of the cake: working with Parquet files in Pandas is as easy as working with CSVs.
- Wanna read a data file? Stop doing pd.read_csv('file.csv'); instead do pd.read_parquet('file.parquet')
- Wanna save data to disk? Stop doing df.to_csv('my_file.csv'); instead do df.to_parquet('my_file.parquet')
- Trick: Wanna transform all your old CSV files into Parquet? Simple: pd.read_csv('my_file.csv').to_parquet('my_file.parquet'). And if you have a whole folder of CSVs to convert, see the sketch right after this list.
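A minimal sketch for converting a whole folder of CSVs in one go (the data/ folder name is just an example):

```python
from pathlib import Path

import pandas as pd

# Convert every CSV in a (hypothetical) data/ folder to Parquet, side by side
for csv_path in Path("data").glob("*.csv"):
    pd.read_csv(csv_path).to_parquet(csv_path.with_suffix(".parquet"))
```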
In summary, Parquet is a better storage choice than row-oriented formats like CSV because it provides several advantages:
- Columnar storage layout, which allows for efficient compression and encoding of the data, resulting in smaller files and faster read/write operations.
- Schema evolution support, which allows columns to be added or removed without requiring a full rewrite of the data.
- Predicate pushdown support, which allows data to be filtered at the storage level rather than read and then filtered at the application level, resulting in faster query execution (see the sketch right after this list).
- Wide support across big data processing frameworks such as Apache Spark, Hive, and Impala.
- A design aimed at high-performance processing and storage of large datasets, which makes it a natural fit for big data analytics.
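That predicate pushdown is also reachable from pandas: with the pyarrow engine, read_parquet accepts a filters argument, so rows are filtered while the file is read instead of after everything has been loaded. A sketch, again using the hypothetical data.parquet file from earlier:

```python
import pandas as pd

# With the pyarrow engine, the filter is applied while reading the file,
# so row groups that cannot contain matching rows are skipped entirely
df = pd.read_parquet(
    "data.parquet",
    columns=["category", "value"],   # column pruning
    filters=[("value", ">=", 90)],   # predicate pushdown
)
print(len(df))
```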
I hope that you will find this article insightful and informative. If you enjoyed it, please consider sharing the link with your friends, family, and colleagues. If you have any suggestions or feedback, please feel free to leave a comment. And if you’d like to stay updated on my future content, please consider following and subscribing using the provided link. Thank you for your support!