Parquet ~ for storing data in columnar format on disk
Arrow ~ for storing in-memory data in columnar format
Existing big data crunching systems like PySpark incur huge serde (serialization/deserialization) overhead: 70-80% of CPU time can be wasted on serde. Leveraging Parquet and Arrow helps speed up data interchange and boosts overall data read/write performance.
Parquet offers compact, type-aware encodings with compression, alongside highly optimized I/O (projection pushdown ~ column pruning; predicate pushdown ~ filters based on column statistics)
The Arrow columnar format provides excellent CPU cache locality and the ability to leverage vectorized (i.e., SIMD) operations and pipelining on modern Intel CPUs.
All systems that adopt Arrow share the same in-memory format, so data can be exchanged between them without conversion.
RStudio provides Arrow support for Python and R via bindings built on top of the C++ Arrow library.
The Feather library leverages Arrow's columnar memory layout and simple metadata serialized with Google's FlatBuffers.
Currently Feather is suitable for quick analysis of large datasets, but not for long-term storage.
pip install feather-format

# R
library(feather)
path <- "file.feather"
df <- read_feather(path)

# Python
import feather
path = "file.feather"
df = feather.read_dataframe(path)
Benchmarking Feather on a laptop: https://blog.rstudio.org/2016/03/29/feather/
More on Arrow: