From: gal salomon <gal.salomon@gmail.com>
Date: Fri, 14 Jan 2022 15:47:02 +0000 (+0200)
Subject: adding parquet documentation
X-Git-Url: http://git-server-git.apps.pok.os.sepia.ceph.com/?a=commitdiff_plain;h=1b47f81886c100e0449cc191e361a02c6768955d;p=ceph.git

adding parquet documentation

Signed-off-by: gal salomon <gal.salomon@gmail.com>
---

diff --git a/doc/radosgw/s3select.rst b/doc/radosgw/s3select.rst
index 3e38eb6ca91..1ad21a71ea2 100644
--- a/doc/radosgw/s3select.rst
+++ b/doc/radosgw/s3select.rst
@@ -530,6 +530,26 @@ CSV parsing behavior
 |                                 | tag             | "**IGNORE**" value means to skip the first line                       |
 +---------------------------------+-----------------+-----------------------------------------------------------------------+       
 
+Parquet format processing
+-------------------------
+
+    | The Parquet implementation enables access to columnar objects (Parquet format) using s3select queries.
+    | The s3select-engine contains a Parquet reader (apache/arrow) that accesses only
+    | the columns a query requires, which saves a lot of IOPS.
+    | The s3select-engine uses a GetObj-RangeScan callback to access these types
+    | of objects.
+    | A Parquet object is identified by its name (\*.parquet) and by the magic number stored
+    | in the object. Thus, upon sending an s3select query, there are two main flows: one
+    | for CSV and the other for the Parquet format.
+    | RGW chooses the flow according to the object name.
+    |
+    | When Parquet processing begins, the Parquet reader (part of the s3select-engine) takes charge of the flow.
+    | It calls GetObject-rangeScan (using an RGW callback).
+    | The rangeScan results return to send_response_data, then to the caller (the Parquet reader), and then to the s3select-engine.
+    | This flow repeats until end-of-query.
+    |
+    | The s3select repo contains tests for the Parquet flow.
+    | For each query executed on a CSV object, the framework also executes the same query on a Parquet object (generated from the CSV) and validates that the results are identical.
 
 BOTO3
 -----