-
Notifications
You must be signed in to change notification settings - Fork 0
Always apply per-file schema during parquet read #18
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I tested this and the performance goes back to the performance of main
// Don't load the page index yet - we will decide later if we need it | ||
let options = ArrowReaderOptions::new().with_page_index(false); | ||
|
||
// Don't load the page index yet. Since it is not stored inline in |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this is all comments and renaming variables so they better reflect what they are (metadata
--> reader_metadata
for example to distinguish between ParquetMetaData
)
if let Some(merged) = | ||
apply_file_schema_type_coercions(&table_schema, &physical_file_schema) | ||
{ | ||
physical_file_schema = Arc::new(merged); | ||
options = options.with_schema(Arc::clone(&physical_file_schema)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ths is the actual fix -- to update reader_metadata
here
let options = ArrowReaderOptions::new().with_page_index(false); | ||
|
||
// Don't load the page index yet. Since it is not stored inline in | ||
// the footer, loading the page index if it is not needed will do |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good comments!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this is some of the trickiest code in the parquet reader I think
… ParquetOpener (apache#15561) * parquet reader: move pruning predicate creation from ParquetSource to ParquetOpener * use file schema, avoid loading page index if unecessary * Add comment * add comment * Add comment * remove check * fix clippy * update sqllogictest * restore to explain plans * reverted * modify access * Fix ArrowReaderOptions should read with physical_file_schema so we do… (#17) * Fix ArrowReaderOptions should read with physical_file_schema so we don't need to cast back to utf8 * Fix fmt * Update opener.rs * Always apply per-file schema during parquet read (#18) * Update datafusion/datasource-parquet/src/opener.rs --------- Co-authored-by: Qi Zhu <821684824@qq.com> Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org>
Target parquet reader: move pruning predicate creation from ParquetSource to ParquetOpener apache/datafusion#15561
Follow on to @zhuqi-lucas's PR Fix ArrowReaderOptions should read with physical_file_schema so we do… #17
#17 figured out the root cause of the problem (not updating the metadata used to create a reader) but I think it only applies when there is a page index to load. The clickbench queries in particular do not have page indexes so the code is skipped
I updated the code to always update the reader_metadata