Parquet ~ for storing data in columnar format on disk
Arrow ~ for storing in-memory data in columnar format

Existing big-data crunching systems like PySpark incur huge serialization/deserialization (serde) overhead: 70–80% of CPU time can be wasted on serde. Leveraging Parquet and Arrow speeds up data interchange and boosts overall data read/write performance.
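As a rough sketch of the idea (the file name and the pandas hand-off are illustrative, not from the slides), pyarrow can read Parquet straight into Arrow's columnar memory, so no row-by-row deserialization happens on the way to pandas:

import pyarrow.parquet as pq

# Read the Parquet file directly into an Arrow table (columnar in
# memory, decoded in bulk rather than row by row).
table = pq.read_table("events.parquet")

# Hand the columns to pandas; primitive columns convert in bulk.
df = table.to_pandas()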

Ref: http://www.slideshare.net/wesm/apache-arrow-and-python-the-latest

[Image: slide by @dremio]

Parquet offers compact, type-aware encodings with compression, alongside highly optimized I/O: projection pushdown (column pruning, reading only the columns a query needs) and predicate pushdown (skipping row groups whose column statistics cannot satisfy a filter).
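A minimal pyarrow sketch of both pushdowns ("events.parquet" and the column names are hypothetical):

import pyarrow.parquet as pq

# Projection pushdown: only the requested columns are read from disk.
table = pq.read_table("events.parquet", columns=["user_id", "amount"])

# Predicate pushdown: row groups whose min/max statistics cannot
# satisfy the filter are skipped without being decoded.
table = pq.read_table("events.parquet", filters=[("amount", ">", 100)])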


The Arrow columnar format provides excellent CPU cache locality and makes it possible to leverage vectorized (SIMD) operations and instruction pipelining on modern CPUs.
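For illustration, Arrow's compute kernels run over contiguous column buffers, which is what makes vectorized execution possible (a small sketch using pyarrow.compute):

import pyarrow as pa
import pyarrow.compute as pc

# Values sit in one contiguous buffer and nulls in a separate validity
# bitmap, so the sum kernel scans memory sequentially.
arr = pa.array([1.0, 2.0, None, 4.0])
print(pc.sum(arr))  # 7.0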

All participating systems share the same in-memory format, so data can move between them without conversion.
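Because the layout is identical everywhere, hand-offs can even be zero-copy. A small sketch (assumes a null-free numeric column, where pyarrow can expose its buffer directly):

import pyarrow as pa

arr = pa.array([1, 2, 3], type=pa.int64())

# to_numpy(zero_copy_only=True) returns a NumPy view over Arrow's own
# buffer; it raises an error if a copy would be required.
view = arr.to_numpy(zero_copy_only=True)
print(view)  # [1 2 3]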

[Image: slide by @dremio]

RStudio brings Arrow support to Python and R through bindings built on top of the C++ Arrow library.

The Feather library leverages Arrow's columnar memory layout along with simple metadata encoded using Google's FlatBuffers.

Currently Feather is suited to fast, short-lived analysis of large datasets; it is not yet suitable for long-term storage.

Using Feather from R:

devtools::install_github("wesm/feather/R")  # install the R package from GitHub

library(feather)
path <- "file.feather"
write_feather(df, path)
df <- read_feather(path)

Using Feather from Python:

pip install feather-format  # install the Python package (shell)

import feather
path = 'file.feather'
feather.write_dataframe(df, path)
df = feather.read_dataframe(path)
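Putting the Python side together as a self-contained round trip (the DataFrame contents here are arbitrary):

import pandas as pd
import feather

df = pd.DataFrame({"x": range(5), "y": list("abcde")})

# Write using Arrow's columnar layout, then read it back unchanged.
feather.write_dataframe(df, "demo.feather")
df2 = feather.read_dataframe("demo.feather")
assert df.equals(df2)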

Feather benchmarks on a laptop: https://blog.rstudio.org/2016/03/29/feather/
