Skip to content

add dot support in DSL indentifier #3989

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 12 commits into from
Jan 29, 2024
Merged

Conversation

JerryKwan
Copy link
Collaborator

add dot support in DSL indentifier

Description

In the previous implementation, the identifier only supports { a..z | A..Z | 0..9 | _ }, and does not support set partition_key on key in json field and key with dot

How was this PR tested?

Closes:#3380

add dot support in DSL indentifier
@fmassot fmassot requested a review from trinity-1686a October 19, 2023 07:35
@fmassot
Copy link
Collaborator

fmassot commented Oct 19, 2023

@trinity-1686a can you take a look?

Copy link
Contributor

@trinity-1686a trinity-1686a left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi,
your changes seems to work fine for a json like

{
  "abc.def": 123
}

but not for

{
  "abc": {
    "def": 123
  }
}

(in that case, hash_attribute search for the literal key abc.def and finds nothing)

We want to support both use cases (probably with slightly different syntax, like having abc\.def to search for the literal key abc.def, and abc.def for nested keys).

Do you want to implement the whole thing (that is, a bit more change to the dsl, and teach hash_attribute how to handle nested keys), or leave things mostly as is and keep the issue open for now? (in the 2nd case, I'd still prefer for the dsl to want . for not nested keys to be escaped, so it's not a breaking change when support for nested key is implemented)

@JerryKwan
Copy link
Collaborator Author

@trinity-1686a Good catch.
Do you mean the function named hash_attribute?
https://github.com/quickwit-oss/quickwit/blob/main/quickwit/quickwit-doc-mapper/src/routing_expression/mod.rs#L72-L81
I think it would be better to do some changes to support nested keys, and considering the key names with dot (.)
Although I don't think using dot (.) in some key names is a good practice

@fmassot
Copy link
Collaborator

fmassot commented Oct 19, 2023

Although I don't think using dot (.) in some key names is a good practice

We do not have the choice, users send data with arbitrary keys.

I agree with @trinity-1686a on the syntax:

  • abc\.def to search for the literal key abc.def
  • abc.def for nested keys

@JerryKwan
Copy link
Collaborator Author

JerryKwan commented Oct 19, 2023

Okay, let's use abc\.def to search for the literal key and abc.def for nested keys

@trinity-1686a
Copy link
Contributor

(I took the liberty of editing your message, the \ got eaten my markdown and made the message unclear)

refactor hash_attribute to hash nested keys
add test case
minor bug fix
add more test cases
@JerryKwan
Copy link
Collaborator Author

JerryKwan commented Oct 22, 2023

@trinity-1686a @fmassot
I did some changes to hash_attribute(), and it should support the usage of abd.def for escaped dot and abd.def for nested keys
Would you please help to review again? thank you

Copy link
Collaborator

@fmassot fmassot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We are on the indexing critical path here so we need to be extra careful when doing extra allocation. This make this issue an interesting problem! Can you review your code and try to remove all allocations?

@fmassot
Copy link
Collaborator

fmassot commented Oct 22, 2023

@JerryKwan: you could start by adding a benchmark to monitor the performance (thanks @guilload for the suggestion!). There is a benchmark for the doc mapper, we could probably use it.

@JerryKwan
Copy link
Collaborator Author

@fmassot @guilload
Thanks for the suggestions.
Based on we are on the indexing critical path, we should minimize unnecessary allocations whenever possible. Any suggestions about how we can do it?

optimize routing expression
add benchmark test for routing expression
@JerryKwan
Copy link
Collaborator Author

@fmassot @trinity-1686a
Would you please help to review again? thank you

@fmassot
Copy link
Collaborator

fmassot commented Nov 12, 2023

@JerryKwan can you share the bench results before/after?

@JerryKwan
Copy link
Collaborator Author

@fmassot
This is the bench result before change on my dev environment

test tag_pruning::test::test_match_tag_field_name ... ignored

test result: ok. 0 passed; 0 failed; 180 ignored; 0 measured; 0 filtered out; finished in 0.00s

     Running benches/doc_to_json_bench.rs (target/release/deps/doc_to_json_bench-02707d090c296519)
Gnuplot not found, using plotters backend
Benchmarking simple-json-to-doc/simple-json-to-doc: Warming up for 3.0000 s
Warning: Unable to complete 100 samples in 5.0s. You may wish to increase target time to 5.7s, enable flat sampling, or reduce sample count to 60.
simple-json-to-doc/simple-json-to-doc
                        time:   [952.95 µs 957.72 µs 963.15 µs]
                        thrpt:  [86.808 MiB/s 87.301 MiB/s 87.738 MiB/s]
Found 16 outliers among 100 measurements (16.00%)
  3 (3.00%) high mild
  13 (13.00%) high severe
Benchmarking simple-json-to-doc/simple-json-to-doc-tantivy: Warming up for 3.0000 s
Warning: Unable to complete 100 samples in 5.0s. You may wish to increase target time to 6.0s, enable flat sampling, or reduce sample count to 60.
simple-json-to-doc/simple-json-to-doc-tantivy
                        time:   [1.1479 ms 1.1519 ms 1.1565 ms]
                        thrpt:  [72.293 MiB/s 72.583 MiB/s 72.839 MiB/s]
Found 4 outliers among 100 measurements (4.00%)
  2 (2.00%) high mild
  2 (2.00%) high severe

And this is the one after changes

test tag_pruning::test::test_extract_tags_from_query_range_query_conjunction ... ignored
test tag_pruning::test::test_match_tag_field_name ... ignored

test result: ok. 0 passed; 0 failed; 187 ignored; 0 measured; 0 filtered out; finished in 0.01s

     Running benches/doc_to_json_bench.rs (target/release/deps/doc_to_json_bench-02707d090c296519)
Gnuplot not found, using plotters backend
Benchmarking simple-json-to-doc/simple-json-to-doc: Warming up for 3.0000 s
Warning: Unable to complete 100 samples in 5.0s. You may wish to increase target time to 7.5s, enable flat sampling, or reduce sample count to 50.
simple-json-to-doc/simple-json-to-doc
                        time:   [964.23 µs 968.95 µs 974.78 µs]
                        thrpt:  [86.750 MiB/s 87.272 MiB/s 87.700 MiB/s]
                 change:
                        time:   [+3.8428% +6.3502% +8.8367%] (p = 0.00 < 0.05)
                        thrpt:  [-8.1193% -5.9710% -3.7006%]
                        Performance has regressed.
Benchmarking simple-json-to-doc/simple-json-to-doc-tantivy: Warming up for 3.0000 s
Warning: Unable to complete 100 samples in 5.0s. You may wish to increase target time to 5.8s, enable flat sampling, or reduce sample count to 60.
simple-json-to-doc/simple-json-to-doc-tantivy
                        time:   [1.0735 ms 1.0766 ms 1.0803 ms]
                        thrpt:  [78.280 MiB/s 78.542 MiB/s 78.774 MiB/s]
                 change:
                        time:   [-7.0396% -6.5161% -6.0295%] (p = 0.00 < 0.05)
                        thrpt:  [+6.4163% +6.9703% +7.5727%]
                        Performance has improved.
Found 9 outliers among 100 measurements (9.00%)
  7 (7.00%) high mild
  2 (2.00%) high severe

     Running benches/routing_expression_bench.rs (target/release/deps/routing_expression_bench-7249b0fa809dafee)
Gnuplot not found, using plotters backend
simple-routing-expression/simple-json-to-doc
                        time:   [4.9654 ms 5.0183 ms 5.0766 ms]
                        thrpt:  [51.631 MiB/s 52.231 MiB/s 52.787 MiB/s]
Found 15 outliers among 100 measurements (15.00%)
  15 (15.00%) high severe
simple-routing-expression/simple-eval-hash
                        time:   [3.5201 ms 3.5315 ms 3.5442 ms]
                        thrpt:  [73.955 MiB/s 74.221 MiB/s 74.461 MiB/s]
Found 3 outliers among 100 measurements (3.00%)
  2 (2.00%) high mild
  1 (1.00%) high severe

@JerryKwan
Copy link
Collaborator Author

@trinity-1686a Thanks for reviewing, I will refactor the codes asap, thank you

@trinity-1686a
Copy link
Contributor

hi @JerryKwan, it looks like you haven't had time to make progress on this PR. Would you mind if I take over?

@JerryKwan
Copy link
Collaborator Author

@trinity-1686a
Sorry for the trouble introduced. please help to take it over,thank you

@trinity-1686a
Copy link
Contributor

no problem 😉

@trinity-1686a trinity-1686a dismissed fmassot’s stale review January 26, 2024 10:40

allocations were removed

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants