From: gal salomon
Date: Fri, 14 Jan 2022 15:47:02 +0000 (+0200)
Subject: adding parquet documentation
X-Git-Url: http://git-server-git.apps.pok.os.sepia.ceph.com/?a=commitdiff_plain;h=1b47f81886c100e0449cc191e361a02c6768955d;p=ceph.git

adding parquet documentation

Signed-off-by: gal salomon
---

diff --git a/doc/radosgw/s3select.rst b/doc/radosgw/s3select.rst
index 3e38eb6ca91..1ad21a71ea2 100644
--- a/doc/radosgw/s3select.rst
+++ b/doc/radosgw/s3select.rst
@@ -530,6 +530,26 @@ CSV parsing behavior
 | | tag | "**IGNORE**" value means to skip the first line |
 +---------------------------------+-----------------+-----------------------------------------------------------------------+
 
+Parquet format processing
+-------------------------
+
+ | The Parquet implementation gives s3select queries access to columnar objects (Parquet format).
+ | The s3select-engine contains a Parquet reader (apache/arrow) that reads only
+ | the columns a query refers to, which reduces the number of I/O operations.
+ | The s3select-engine uses a GetObj-RangeScan callback to access these types
+ | of objects.
+ | A Parquet object is identified by its name (\*.parquet) and by the magic number
+ | stored in the object. Thus, when an s3select query is sent, there are two main flows: one
+ | for CSV and the other for the Parquet format.
+ | RGW chooses the flow according to the object name.
+ |
+ | When Parquet processing starts, the Parquet reader (part of the s3select-engine) takes charge of the flow.
+ | It calls GetObject-rangeScan through an RGW callback.
+ | The rangeScan results are returned to send_response_data, then to the caller (the Parquet reader), and back to the s3select-engine.
+ | This flow repeats until the end of the query.
+ |
+ | The s3select repo contains tests for the Parquet flow.
+ | For each query executed on a CSV object, the test framework also executes the same query on a Parquet object (generated from that CSV) and validates that the results are identical.
 
 BOTO3
 -----
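
Note on the Parquet identification described in the patch: the check by object name and magic number can be illustrated with a minimal sketch. It is not the RGW implementation. The function names `looks_like_parquet` and `choose_flow` are hypothetical, and combining both checks in one place is an assumption made for illustration (the patch states that RGW chooses the flow by object name). The only format fact relied on is that an unencrypted Parquet file begins and ends with the 4-byte marker `PAR1`.

```python
# Illustrative sketch only; names below are hypothetical, not RGW code.

PARQUET_MAGIC = b"PAR1"  # 4-byte marker at both ends of an unencrypted Parquet file


def looks_like_parquet(name: str, data: bytes) -> bool:
    """Return True when the object name and the magic bytes both indicate Parquet."""
    if not name.endswith(".parquet"):
        return False
    # Smallest possible layout: leading magic, 4-byte footer length, trailing magic.
    if len(data) < 12:
        return False
    return data[:4] == PARQUET_MAGIC and data[-4:] == PARQUET_MAGIC


def choose_flow(name: str, data: bytes) -> str:
    """Pick the processing flow; anything not recognised as Parquet is treated as CSV here."""
    return "parquet" if looks_like_parquet(name, data) else "csv"
```

A real deployment would read only the first and last bytes of the object with range requests instead of holding the whole object in memory; the sketch takes the full byte string to stay self-contained.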