
Always apply per-file schema during parquet read #18


Merged: 1 commit merged from alamb/fix_load_try_2 into pydantic:move-predicate on Apr 4, 2025

Conversation


@alamb alamb commented Apr 4, 2025

#17 figured out the root cause of the problem (not updating the metadata used to create a reader), but I think that fix only applies when there is a page index to load. The ClickBench queries in particular do not have page indexes, so that code path is skipped.

I updated the code to always update the reader_metadata.
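
For context, a minimal sketch of the pattern this change follows, written against the arrow-rs parquet crate's async reader API (the open_builder function and the needs_page_index flag are illustrative stand-ins, not the actual DataFusion opener code): load the footer metadata without the page index first, then always rebuild the reader metadata with the final options before constructing the stream builder, whether or not a page index ends up being loaded.

use std::sync::Arc;
use parquet::arrow::arrow_reader::{ArrowReaderMetadata, ArrowReaderOptions};
use parquet::arrow::async_reader::AsyncFileReader;
use parquet::arrow::ParquetRecordBatchStreamBuilder;
use parquet::errors::Result;

// Illustrative sketch: `needs_page_index` stands in for whatever decision the
// opener makes after inspecting the pruning predicates.
async fn open_builder<R: AsyncFileReader + Send + 'static>(
    mut input: R,
    needs_page_index: bool,
) -> Result<ParquetRecordBatchStreamBuilder<R>> {
    // Don't load the page index yet - decide later whether it is needed
    let options = ArrowReaderOptions::new().with_page_index(false);
    let reader_metadata = ArrowReaderMetadata::load_async(&mut input, options).await?;

    // ... inspect the schema, build pruning predicates, etc. ...

    // Always refresh the reader metadata with the final options, even when no
    // page index has to be loaded (the bug was that this refresh happened only
    // on the page-index path).
    let options = ArrowReaderOptions::new().with_page_index(needs_page_index);
    let reader_metadata = if needs_page_index {
        // The page index is not stored inline in the footer, so loading it
        // means going back to the file for more metadata.
        ArrowReaderMetadata::load_async(&mut input, options).await?
    } else {
        // Reuse the ParquetMetaData already read from the footer.
        ArrowReaderMetadata::try_new(Arc::clone(reader_metadata.metadata()), options)?
    };

    Ok(ParquetRecordBatchStreamBuilder::new_with_metadata(
        input,
        reader_metadata,
    ))
}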

@alamb alamb left a comment


I tested this and the performance goes back to that of main.

// Don't load the page index yet - we will decide later if we need it
let options = ArrowReaderOptions::new().with_page_index(false);

// Don't load the page index yet. Since it is not stored inline in
@alamb (Author):

This is all comments and renaming variables so they better reflect what they are (e.g. metadata --> reader_metadata, to distinguish it from the ParquetMetaData).

if let Some(merged) =
apply_file_schema_type_coercions(&table_schema, &physical_file_schema)
{
physical_file_schema = Arc::new(merged);
options = options.with_schema(Arc::clone(&physical_file_schema));
@alamb (Author):

This is the actual fix -- to update reader_metadata here.
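
A minimal sketch of what updating reader_metadata amounts to here, assuming the arrow-rs parquet crate API (the refresh_with_schema helper and its arguments are hypothetical, not the actual opener code): once the coerced physical_file_schema has been attached to the ArrowReaderOptions, the ArrowReaderMetadata is rebuilt so the reader decodes into that schema instead of the schema inferred from the file footer.

use std::sync::Arc;

use arrow_schema::SchemaRef;
use parquet::arrow::arrow_reader::{ArrowReaderMetadata, ArrowReaderOptions};
use parquet::errors::Result;

// Hypothetical helper: rebuild the reader metadata with options that carry the
// coerced per-file schema. The underlying ParquetMetaData is reused unchanged;
// only the Arrow schema used for decoding is replaced.
fn refresh_with_schema(
    reader_metadata: &ArrowReaderMetadata,
    physical_file_schema: SchemaRef,
    options: ArrowReaderOptions,
) -> Result<ArrowReaderMetadata> {
    let options = options.with_schema(physical_file_schema);
    ArrowReaderMetadata::try_new(Arc::clone(reader_metadata.metadata()), options)
}

If this rebuild is skipped, the builder keeps using the previously loaded metadata and the schema attached to the options never takes effect, which is why the PR updates reader_metadata unconditionally.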

@adriangb adriangb merged commit 34993f2 into pydantic:move-predicate Apr 4, 2025
27 checks passed
@alamb alamb deleted the alamb/fix_load_try_2 branch April 4, 2025 17:40
let options = ArrowReaderOptions::new().with_page_index(false);

// Don't load the page index yet. Since it is not stored inline in
// the footer, loading the page index if it is not needed will do


Good comments!

@alamb (Author):

This is some of the trickiest code in the parquet reader, I think.

adriangb added a commit that referenced this pull request Apr 6, 2025
… ParquetOpener (apache#15561)

* parquet reader: move pruning predicate creation from ParquetSource to ParquetOpener

* use file schema, avoid loading page index if unnecessary

* Add comment

* add comment

* Add comment

* remove check

* fix clippy

* update sqllogictest

* restore to explain plans

* reverted

* modify access

* Fix ArrowReaderOptions should read with physical_file_schema so we do… (#17)

* Fix ArrowReaderOptions should read with physical_file_schema so we don't need to cast back to utf8

* Fix fmt

* Update opener.rs

* Always apply per-file schema during parquet read (#18)

* Update datafusion/datasource-parquet/src/opener.rs

---------

Co-authored-by: Qi Zhu <821684824@qq.com>
Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org>