Fasta parsing experiment #5

Open
zachcp opened this issue Jan 2, 2023 · 3 comments

zachcp commented Jan 2, 2023

Hi @fubark,

Thanks again for your awesome language. I played around with Cyber a bit today for FASTA parsing to see how it might fare against some other languages (inspiration here). My results are here if you are interested in taking a look. Right now Python is ahead by about two orders of magnitude. I know Cyber is designed for embedded systems, but I thought I might get lucky with some fast I/O as well :).

This is a really promising language that's been fun to use; thank you.
zach cp

time python3 readfq.py < GCA_013297495.1_ASM1329749v1_genomic.fna
real    0m1.065s

time ./cyber readfq.cy <  GCA_013297495.1_ASM1329749v1_genomic.fna
real    2m24.335s

time ./cyber readfq2.cy <  GCA_013297495.1_ASM1329749v1_genomic.fna
real    2m30.641s
Owner

fubark commented Jan 3, 2023

Thanks for providing readfq2. It helped me narrow down the perf bottleneck quickly. readLine was meant for reading user input from the command line, not for bulk reads from stdin. For that reason, I deprecated readLine in favor of getInput. For bulk reads from stdin, you can now do the following in readfq2:

import os 'os'

--- minimal parse. don't use object or fastq
---  '@+>' is  64 / 43 / 62
func is_fastx(chr) bool:
    if chr == 64:
        return true
    if chr == 62:
        return true
    return false

n     = 0
slen  = 0
qlen  = 0
for os.stdin.streamLines() as line:
    if is_fastx(line.charAt(0)):
        n    += 1
    else:
        slen += line.len()

print 'There are {slen} bases from {n} records in this file.'

On my Linux machine, this is now twice as fast as the python3 version. There is still much room for improvement, but it is now a fairer comparison with regard to reading lines from stdin, although the Python script seems to be doing more work. I'm going to see which functions are missing and also flesh out more of the new File API.
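
For comparison, here is a minimal Python sketch of the same stripped-down count, reading lines as bytes. It is not the readfq.py used in the timings above; the names count_fasta and main are made up for illustration, and it assumes that header lines start with '>' or '@' and that line endings should not be counted as bases.

```python
import sys

def count_fasta(lines):
    """Count header lines and sequence bytes in an iterable of byte lines."""
    n = 0      # records: lines starting with '>' (62) or '@' (64)
    slen = 0   # total sequence length, excluding line endings
    for line in lines:
        if line[:1] in (b">", b"@"):
            n += 1
        else:
            # Assumption: trailing CR/LF is not part of the sequence.
            slen += len(line.rstrip(b"\r\n"))
    return n, slen

def main():
    # sys.stdin.buffer yields raw byte lines and skips text decoding.
    n, slen = count_fasta(sys.stdin.buffer)
    print(f"There are {slen} bases from {n} records in this file.")
```

Calling main() from a script and running it as `python3 script.py < file.fna` gives a like-for-like baseline. Slicing with line[:1] rather than indexing keeps an empty line from raising an error.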

Author

zachcp commented Jan 4, 2023

Boom shakalaka! Amazing work.

Note: if Cyber can compete favorably on these benchmarks, I think you might unlock a bioinformatics market segment...

# Same for me on macOS!

time python3 readfq.py < GCA_013297495.1_ASM1329749v1_genomic.fna
 There are 341540 records and 161512289 bases

real	0m0.898s
user	0m0.794s
sys	0m0.072s

time ./cyber readfq3.cy <  GCA_013297495.1_ASM1329749v1_genomic.fna
There are 163709211 bases from 341540 records in this file.

real	0m0.393s
user	0m0.323s
sys	0m0.062s

Owner

fubark commented Jan 5, 2023

I just made the same script even faster by using SIMD to find the newline character. You can also now pass a read buffer size to streamLines(). It defaults to 4096 bytes, but I've found that 4 MB works well for larger files. Between this and SIMD (mostly SIMD), I'm seeing almost another 2x performance gain.

Also worth mentioning: the same SIMD technique is now available for string.indexChar().
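
To illustrate the general shape of a buffered line stream with a configurable read size, here is a rough Python sketch. It is not Cyber's implementation: iter_lines and bufsize are hypothetical names, and bytes.find stands in for the newline scan (whether it uses SIMD depends on the Python build and platform).

```python
import io

def iter_lines(stream, bufsize=4096):
    """Yield lines without the trailing newline, reading bufsize bytes at a time."""
    pending = b""  # bytes of an incomplete line carried over between reads
    while True:
        chunk = stream.read(bufsize)
        if not chunk:
            break
        buf = pending + chunk
        start = 0
        while True:
            # Scan for the next newline byte from the current position.
            end = buf.find(b"\n", start)
            if end == -1:
                break
            yield buf[start:end]
            start = end + 1
        pending = buf[start:]
    if pending:
        # Final line with no terminating newline.
        yield pending

# Usage: a deliberately tiny buffer to exercise lines that span reads.
lines = list(iter_lines(io.BytesIO(b">a\nACGT\nAC"), bufsize=3))
```

A larger bufsize means fewer read calls and fewer carry-overs of partial lines, which is one plausible reason a multi-megabyte buffer helps on large files.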
