
Parquet at HuggingFace

Work

This work was carried out for Hugging Face.

2022 2024 2025 Data Visualization Hyparquet Observable Parquet Python Svelte Tailwind CSS TypeScript

In 2022, we decided to push adoption of the Parquet format for datasets, as it is generally well suited to machine learning data. To that end, we set out to automatically convert every dataset to Parquet. The converted files are available in a special “branch” of the dataset repository. Read the docs for more information.

Convert every dataset to Parquet

In 2024, I developed a Parquet metadata viewer for Hugging Face, allowing users to easily inspect the metadata of their Parquet files. The component is based on the GGUF viewer for ML model metadata.

The viewer is a Svelte component which uses hyparquet to parse the Parquet files.

Parquet metadata viewer on Hugging Face
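As a rough illustration of where that metadata lives: a Parquet file starts and ends with the magic bytes `PAR1`, and the 4 bytes before the trailing magic hold the little-endian length of the footer metadata. The sketch below only locates the footer; it does not decode the Thrift-encoded content, and it is not hyparquet's API. The `locate_footer` helper and the fake file are made up for the example.

```python
import struct
from typing import Tuple

MAGIC = b"PAR1"  # magic bytes of an unencrypted Parquet file


def locate_footer(data: bytes) -> Tuple[int, int]:
    """Return (offset, length) of the footer metadata inside `data`.

    Hypothetical helper: it only finds the footer bytes, it does not
    parse the Thrift structure they contain.
    """
    if len(data) < 12 or data[:4] != MAGIC or data[-4:] != MAGIC:
        raise ValueError("not a Parquet file (missing PAR1 magic bytes)")
    # The footer length is a 4-byte little-endian integer placed just
    # before the trailing magic bytes.
    (length,) = struct.unpack("<I", data[-8:-4])
    offset = len(data) - 8 - length
    if offset < 4:
        raise ValueError("footer length exceeds file size")
    return offset, length


# Stand-in bytes, not real column data or real Thrift metadata.
fake_footer = b"\x00" * 10
fake_file = (
    MAGIC + b"column data" + fake_footer
    + struct.pack("<I", len(fake_footer)) + MAGIC
)
offset, length = locate_footer(fake_file)
print(offset, length)
```

Because the footer sits at the end of the file, a reader can find it from the tail alone, without scanning the column data.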

In 2025, I experimented with displaying the content differences between two Parquet files. Parquet is a columnar format, and each column is stored in pages within “row groups”. By comparing the sizes of these pages, we can detect (with some margin of uncertainty) which pages differ between two files. Try the tool in the Parquet diff notebook.

Diff of two Parquet files on Hugging Face
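The size comparison can be sketched as follows. This is a minimal illustration under assumptions, not the notebook's actual code: the `Page` record and the sample sizes are hypothetical, and in practice the sizes would be read from the two Parquet files.

```python
from typing import List, NamedTuple, Tuple


class Page(NamedTuple):
    """Hypothetical record describing one page of one column."""
    column: str
    row_group: int
    index: int            # position of the page within its column chunk
    compressed_size: int  # size in bytes as stored in the file


PageKey = Tuple[str, int, int]


def pages_that_differ(old: List[Page], new: List[Page]) -> List[PageKey]:
    """List pages whose size differs, or that exist in only one file."""
    old_sizes = {(p.column, p.row_group, p.index): p.compressed_size for p in old}
    new_sizes = {(p.column, p.row_group, p.index): p.compressed_size for p in new}
    keys = sorted(old_sizes.keys() | new_sizes.keys())
    # A missing page yields None, which never equals an integer size.
    return [k for k in keys if old_sizes.get(k) != new_sizes.get(k)]


# Made-up sizes for illustration only.
old_pages = [Page("id", 0, 0, 120), Page("id", 0, 1, 118), Page("text", 0, 0, 900)]
new_pages = [
    Page("id", 0, 0, 120),
    Page("id", 0, 1, 118),
    Page("text", 0, 0, 940),
    Page("text", 0, 1, 300),
]
changed = pages_that_differ(old_pages, new_pages)
print(changed)
```

The uncertainty comes from the fact that two pages of equal size can still hold different content, so a size match is only a hint that a page is unchanged, while a size mismatch does show that the stored bytes differ.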

We also used visual tools to show how parameters used when writing Parquet files, such as compression or CDC (content-defined chunking), affect the difference between two files after an operation (read Parquet Content-Defined Chunking for more details).

Diff of two Parquet files on Hugging Face
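To give an intuition for why content-defined chunking keeps diffs small, here is a generic byte-level sketch. It is not the Parquet writer's implementation: it hashes a sliding window with `zlib.crc32` as a simple stand-in for the rolling hash that real implementations typically use, and `WINDOW`, `DIVISOR`, and the helper names are arbitrary choices for the example.

```python
import random
import zlib
from typing import List

WINDOW = 16   # bytes examined to decide whether a chunk ends here (assumed)
DIVISOR = 64  # a boundary occurs on roughly 1 position out of 64 (assumed)


def cdc_chunks(data: bytes) -> List[bytes]:
    """Cut `data` wherever the hash of the last WINDOW bytes hits a target."""
    chunks, start = [], 0
    for end in range(WINDOW, len(data) + 1):
        if zlib.crc32(data[end - WINDOW:end]) % DIVISOR == 0:
            chunks.append(data[start:end])
            start = end
    if start < len(data):
        chunks.append(data[start:])  # trailing bytes after the last boundary
    return chunks


def fixed_chunks(data: bytes, size: int = 64) -> List[bytes]:
    """Cut `data` every `size` bytes, regardless of content."""
    return [data[i:i + size] for i in range(0, len(data), size)]


def lost_chunks(old: List[bytes], new: List[bytes]) -> int:
    """Count distinct chunks of `old` that no longer appear in `new`."""
    return len(set(old) - set(new))


rng = random.Random(0)
data = bytes(rng.getrandbits(8) for _ in range(4096))
edited = b"\x00" + data  # insert a single byte at the start

fixed_lost = lost_chunks(fixed_chunks(data), fixed_chunks(edited))
cdc_lost = lost_chunks(cdc_chunks(data), cdc_chunks(edited))
print("fixed-size chunks lost:", fixed_lost)
print("content-defined chunks lost:", cdc_lost)
```

Inserting one byte shifts every fixed-size boundary, so the chunks after the edit no longer line up with the originals. Content-defined boundaries depend only on nearby bytes, so they move along with the data and only the chunk touched by the edit changes.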