adding parquet documentation

author gal salomon <gal.salomon@gmail.com>

Fri, 14 Jan 2022 15:47:02 +0000 (17:47 +0200)

committer gal salomon <gal.salomon@gmail.com>

Fri, 14 Jan 2022 15:47:02 +0000 (17:47 +0200)
diff --git a/doc/radosgw/s3select.rst b/doc/radosgw/s3select.rst
index 3e38eb6ca91d7cd7c4bc2d5ebc31a24f32b68f40..1ad21a71ea225002d455b438ac3d49c5ad7eb738 100644
--- a/doc/radosgw/s3select.rst
+++ b/doc/radosgw/s3select.rst
@@ -530,6 +530,26 @@ CSV parsing behavior
  |                                 | tag             | "**IGNORE**" value means to skip the first line                       |
  +---------------------------------+-----------------+-----------------------------------------------------------------------+       
  
+Parquet format processing
+-------------------------
+
+    | Parquet support lets s3select queries run on columnar objects (Parquet format).
+    | The s3select-engine contains a Parquet reader (apache/arrow) that fetches only
+    | the columns a query references, which avoids reading unused columns and saves IOPS.
+    | The s3select-engine uses a GetObj-RangeScan callback to access these types
+    | of objects.
+    | A Parquet object is identified by its name (\*.parquet) and by the magic number
+    | stored in the object. Thus, when an s3select query is sent, there are two main
+    | flows: one for CSV and one for Parquet.
+    | RGW chooses the flow according to the object name.
+    |
+    | When Parquet processing begins, the Parquet reader (part of the s3select-engine) takes charge of the flow.
+    | It calls GetObject-rangeScan through an RGW callback.
+    | The rangeScan results are passed to send_response_data, then back to the caller (the Parquet reader), and from there to the s3select-engine.
+    | This flow repeats until the end of the query.
+    |
+    | The s3select repo contains tests for the Parquet flow.
+    | For each query executed on a CSV object, the test framework also executes the same query on a Parquet object (generated from that CSV) and validates that the results are identical.
  
  BOTO3
  -----
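
Note on the added section: the sketch below illustrates the identification and range-read behaviour it describes, as a toy model rather than RGW's or apache/arrow's actual code. It relies only on the Parquet envelope (the 4-byte magic `PAR1` at both ends of the file, and a 4-byte little-endian footer length just before the trailing magic). The names `RangeRead`, `looks_like_parquet`, `footer_length` and `CountingObject` are hypothetical; `range_read` stands in for the GetObj-RangeScan callback.

```python
import struct
from typing import Callable

PARQUET_MAGIC = b"PAR1"  # Parquet files begin and end with these 4 bytes

# Hypothetical stand-in for the range-scan callback: (offset, length) -> bytes
RangeRead = Callable[[int, int], bytes]


def looks_like_parquet(name: str, size: int, range_read: RangeRead) -> bool:
    """Check the object-name suffix and the leading/trailing magic number."""
    # 12 bytes = leading magic + footer length + trailing magic
    if not name.endswith(".parquet") or size < 12:
        return False
    head = range_read(0, 4)
    tail = range_read(size - 4, 4)
    return head == PARQUET_MAGIC and tail == PARQUET_MAGIC


def footer_length(size: int, range_read: RangeRead) -> int:
    """Read the little-endian footer length stored before the trailing magic."""
    (length,) = struct.unpack("<I", range_read(size - 8, 4))
    return length


class CountingObject:
    """In-memory object that records every range request made against it."""

    def __init__(self, data: bytes) -> None:
        self.data = data
        self.requests = []  # list of (offset, length) tuples

    def range_read(self, offset: int, length: int) -> bytes:
        self.requests.append((offset, length))
        return self.data[offset:offset + length]


# Build a fake object: a valid Parquet envelope around placeholder bytes.
body = b"x" * 1000    # placeholder for column chunks (not real Parquet data)
footer = b"f" * 50    # placeholder for the file metadata
blob = (PARQUET_MAGIC + body + footer
        + struct.pack("<I", len(footer)) + PARQUET_MAGIC)
obj = CountingObject(blob)

is_parquet = looks_like_parquet("data.parquet", len(blob), obj.range_read)
flen = footer_length(len(blob), obj.range_read)
bytes_fetched = sum(length for _, length in obj.requests)
```

Here the object is classified and its footer located with three small range requests, without touching the placeholder column data. That is the property the range-scan callback is meant to provide for columnar objects.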