
Always apply per-file schema during parquet read #18


Merged: 1 commit merged from alamb/fix_load_try_2 into pydantic:move-predicate on Apr 4, 2025

Conversation


@alamb alamb commented Apr 4, 2025

#17 figured out the root cause of the problem (not updating the metadata used to create a reader), but I think that fix only applies when there is a page index to load. The ClickBench queries in particular do not have page indexes, so that code path is skipped.

I updated the code to always update the reader_metadata.
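
For context, a minimal sketch of the pattern this change follows, written against the arrow-rs parquet crate's async reader API (the open_builder function and the needs_page_index flag are illustrative stand-ins, not the actual DataFusion opener code): load the footer metadata without the page index first, then always rebuild the reader metadata with the final options before constructing the stream builder, whether or not a page index ends up being loaded.

use std::sync::Arc;
use parquet::arrow::arrow_reader::{ArrowReaderMetadata, ArrowReaderOptions};
use parquet::arrow::async_reader::AsyncFileReader;
use parquet::arrow::ParquetRecordBatchStreamBuilder;
use parquet::errors::Result;

// Illustrative sketch: `needs_page_index` stands in for whatever decision the
// opener makes after inspecting the pruning predicates.
async fn open_builder<R: AsyncFileReader + Send + 'static>(
    mut input: R,
    needs_page_index: bool,
) -> Result<ParquetRecordBatchStreamBuilder<R>> {
    // Don't load the page index yet - decide later whether it is needed
    let options = ArrowReaderOptions::new().with_page_index(false);
    let reader_metadata = ArrowReaderMetadata::load_async(&mut input, options).await?;

    // ... inspect the schema, build pruning predicates, etc. ...

    // Always refresh the reader metadata with the final options, even when no
    // page index has to be loaded (the bug was that this refresh happened only
    // on the page-index path).
    let options = ArrowReaderOptions::new().with_page_index(needs_page_index);
    let reader_metadata = if needs_page_index {
        // The page index is not stored inline in the footer, so loading it
        // means going back to the file for more metadata.
        ArrowReaderMetadata::load_async(&mut input, options).await?
    } else {
        // Reuse the ParquetMetaData already read from the footer.
        ArrowReaderMetadata::try_new(Arc::clone(reader_metadata.metadata()), options)?
    };

    Ok(ParquetRecordBatchStreamBuilder::new_with_metadata(
        input,
        reader_metadata,
    ))
}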

@alamb alamb left a comment


I tested this and the performance goes back to that of main.

// Don't load the page index yet - we will decide later if we need it
let options = ArrowReaderOptions::new().with_page_index(false);

// Don't load the page index yet. Since it is not stored inline in
@alamb (Author):

This is all comments and renaming variables so they better reflect what they are (e.g. metadata --> reader_metadata, to distinguish it from the ParquetMetaData).

if let Some(merged) =
apply_file_schema_type_coercions(&table_schema, &physical_file_schema)
{
physical_file_schema = Arc::new(merged);
options = options.with_schema(Arc::clone(&physical_file_schema));
@alamb (Author):

This is the actual fix -- to update reader_metadata here.
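
A minimal sketch of what updating reader_metadata amounts to here, assuming the arrow-rs parquet crate API (the refresh_with_schema helper and its arguments are hypothetical, not the actual opener code): once the coerced physical_file_schema has been attached to the ArrowReaderOptions, the ArrowReaderMetadata is rebuilt so the reader decodes into that schema instead of the schema inferred from the file footer.

use std::sync::Arc;

use arrow_schema::SchemaRef;
use parquet::arrow::arrow_reader::{ArrowReaderMetadata, ArrowReaderOptions};
use parquet::errors::Result;

// Hypothetical helper: rebuild the reader metadata with options that carry the
// coerced per-file schema. The underlying ParquetMetaData is reused unchanged;
// only the Arrow schema used for decoding is replaced.
fn refresh_with_schema(
    reader_metadata: &ArrowReaderMetadata,
    physical_file_schema: SchemaRef,
    options: ArrowReaderOptions,
) -> Result<ArrowReaderMetadata> {
    let options = options.with_schema(physical_file_schema);
    ArrowReaderMetadata::try_new(Arc::clone(reader_metadata.metadata()), options)
}

If this rebuild is skipped, the builder keeps using the previously loaded metadata and the schema attached to the options never takes effect, which is why the PR updates reader_metadata unconditionally.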

@adriangb adriangb merged commit 34993f2 into pydantic:move-predicate Apr 4, 2025
27 checks passed
@alamb alamb deleted the alamb/fix_load_try_2 branch April 4, 2025 17:40
let options = ArrowReaderOptions::new().with_page_index(false);

// Don't load the page index yet. Since it is not stored inline in
// the footer, loading the page index if it is not needed will do


Good comments!

@alamb (Author):

This is some of the trickiest code in the parquet reader, I think.

adriangb added a commit that referenced this pull request Apr 6, 2025
… ParquetOpener (apache#15561)

* parquet reader: move pruning predicate creation from ParquetSource to ParquetOpener

* use file schema, avoid loading page index if unnecessary

* Add comment

* add comment

* Add comment

* remove check

* fix clippy

* update sqllogictest

* restore to explain plans

* reverted

* modify access

* Fix ArrowReaderOptions should read with physical_file_schema so we do… (#17)

* Fix ArrowReaderOptions should read with physical_file_schema so we don't need to cast back to utf8

* Fix fmt

* Update opener.rs

* Always apply per-file schema during parquet read (#18)

* Update datafusion/datasource-parquet/src/opener.rs

---------

Co-authored-by: Qi Zhu <821684824@qq.com>
Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org>