JSON benchmark #72
I'm curious as to why libraries like ultrajson and orjson weren't explored. They aren't command-line tools, but neither is pandas, right? Is it perhaps because the code required to implement the challenges is large enough that they are considered too inconvenient to use in the same way pandas was used (i.e., …)
Thank you @davidatbu! I should mention that spyql leverages orjson, which has a considerable impact on performance. spyql supports both the json module from the standard library and orjson as its JSON decoder/encoder. Performance-wise, for 1GB of input data, orjson decreases processing time by 20-30%. So orjson is part of the reason why a Python-based tool outperforms tools written in C, Go, etc., and it deserves credit. You can find more info about the performance impact of orjson in #70.
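As a minimal sketch of why orjson is close to a drop-in swap for the stdlib module (the sample record below is made up for illustration, not taken from the benchmark datasets), the main API difference to watch for is that `orjson.dumps` returns `bytes` rather than `str`:

```python
import json

try:
    import orjson  # third-party; pip install orjson
    HAVE_ORJSON = True
except ImportError:
    HAVE_ORJSON = False

# Hypothetical sample record, only to illustrate the API difference.
record = {"title": "The Colour of Magic", "year": 1983, "tags": ["fantasy"]}
line = json.dumps(record)

# Decoding: orjson.loads accepts str or bytes and returns the same dict.
decoded_std = json.loads(line)
if HAVE_ORJSON:
    assert orjson.loads(line) == decoded_std

# Encoding: note that orjson.dumps returns bytes, not str.
encoded_std = json.dumps(record)
if HAVE_ORJSON:
    encoded_or = orjson.dumps(record)  # -> bytes
    assert json.loads(encoded_or) == record
```

Because the call signatures line up, a tool can select the decoder at startup and keep the rest of its pipeline unchanged, which is presumably how spyql supports both.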
That was what I was trying to say when I said "the code required to implement the challenges is large enough that they are considered too inconvenient to use". This makes sense to me. Thank you for doing this benchmark! I've been using … There's more to my reply on HN (regarding the title of the shared link on HN), but I don't think that's relevant here, so I've left it out.
Hey @dcmoura! I've actually already made the changes in OctoSQL that I described in the HN thread. Thanks for motivating me :) It was much less work than I anticipated. You can run the new version with … I've run your benchmark notebook and OctoSQL is now just slightly slower than SPyQL (and the RAM usage is stable as well). It'd be great if you could update the notebook with the above change. (btw. I've had to add … The default (without …)
Hey @cube2222 !
Great!! I am happy :-) Nothing like a benchmark to bring some extra motivation ;-)
I was missing this feature in octosql :-)
💪
Of course! I should be able to do it on Monday.
SpyQL does the same with pretty printing. Thank you!!
Hi! Your benchmark interested me and I decided to run your tests with ClickHouse. GitHub repo (if you are interested): https://github.com/ClickHouse/ClickHouse It would be great if you could add clickhouse-local to your benchmark.
Wow! Seems really fast!!
Of course I am! Still trying to figure out how I haven't stumbled onto clickhouse before... I was not aware of the option to run it locally.
Of course! @Avogar I was trying to run clickhouse on colab and I get the following error. I have actually run your installation script. Could you help?
I did try running it on my local machine and I am amazed by the speed; I just needed to adjust the argument … Thanks!
Colab does an LD_PRELOAD of tcmalloc, while we use jemalloc. To run clickhouse in colab we should run it like …
Yep, the default value is 100, but the books.json dataset has a new JSON field …
@Avogar I was unable to get clickhouse to work using your installation method. I had to follow these instructions (From DEB Packages): https://clickhouse.com/docs/en/getting-started/install/amp/ Then the fix of setting the env variable worked. The version installed using this method was: …
Hi again @cube2222. It seems that I cannot do arithmetic expressions in octosql, or am I overlooking something?
@dcmoura Hey, looks like a parser issue affecting division (you can totally write, i.e., …). I'll look into it. In the meantime, you can use the alternative notation …
Thanks for letting me know!
I asked to pin the following comment to the top of the HN thread: https://news.ycombinator.com/item?id=31111863 This was suggested by the HN admins instead of reposting.
Posted the updated benchmark on Reddit r/programming
Hey, thanks for putting the benchmarks together. One issue I notice is that you don't have an ORDER BY with the LIMIT, so you aren't actually comparing like with like in this case. SQL (or at least most implementations of it) does not define any meaningful order by default, and I can see that ClickHouse does in fact produce something different from what spyql produces. This doesn't change that dsq is pretty slow. But I just wanted to point out that, overall, LIMIT without ORDER BY is not a reasonable benchmark, and in general you may want to make sure that each tool produces the same result when you're benchmarking. If all tools don't produce the same result, you're still measuring something, but I don't think it's a reasonable something from a user's perspective.
I think it makes sense to include the use case of taking the first n rows/records/etc. of the file. It's fairly intuitive behavior and a common use case (at least in my own usage, the first thing I do with files I'm working on is …).

That said, as @eatonphil suggests, I would keep the benchmark apples to apples: write a comment for the tools that don't support this behavior and not include them in that benchmark case (and possibly add another case which does the ORDER BY).
Yes, and this was a conscious decision. The goal is simply to understand whether the tool is smart enough to avoid scanning the full dataset when only a sample is required. I know that there are no guarantees about the order of the output in standard SQL if you do not specify an ORDER BY. Still, most (if not all) database engines I know will not do a full table scan; they will stop as soon as the LIMIT clause is satisfied. The request is: give me any N rows (and stop as soon as you have them). Adding an ORDER BY would defeat the purpose of this test.

Why is this test important? Because many times we just want to work with a sample, take a quick look, or simply iterate on a query based on a sample. Tools that load all data into memory fail this test and require some kind of pre-processing (e.g. invoking …).

In conclusion, I don't see a reason to change the benchmark. Some tools will take advantage of parallel processing, cached data, process-as-you-go, etc., and that might result in different outputs, but that's OK in my view.
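The "stop as soon as the LIMIT is satisfied" behavior described above is easy to sketch in plain Python over a JSON Lines input: iterate lazily and stop after N records, so the rest of the file is never read or parsed. This is only an illustration of the early-stop idea, not any particular tool's implementation (the `head_jsonl` helper name is made up):

```python
import io
import json
from itertools import islice

def head_jsonl(fileobj, n):
    """Yield at most n decoded records, reading the input lazily line by line."""
    for line in islice(fileobj, n):
        yield json.loads(line)

# Simulate a large JSON Lines input; only the first 2 lines are ever
# read and parsed, because islice stops consuming the iterator at n.
data = io.StringIO('{"a": 1}\n{"a": 2}\n{"a": 3}\n{"a": 4}\n')
first_two = list(head_jsonl(data, 2))
# first_two == [{"a": 1}, {"a": 2}]
```

A tool that instead materializes the whole dataset before applying LIMIT pays the full scan (and full memory) cost regardless of N, which is exactly what this benchmark case is designed to expose.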
Leave your comments for the JSON benchmark here.