Skip to content

Conversation

@SGSSGene
Copy link
Collaborator

@SGSSGene SGSSGene commented Oct 26, 2022

This updates the fmindex_collection submodule. In that version the loading of the fmindex should be faster.
In my local test, I checked how long it takes to load the bidirectional fmindex (ieprv2) of thehuman genome. I took the fastes run out of a couple of runs.
old version: 4.7s
new version: 2.1s

The index size increased:
old version: 6.2GB
new version: 6.6GB

Attention: a rebuild of the fmindex is required.

@SGSSGene SGSSGene changed the base branch from master to lambda3 October 26, 2022 14:29
@h-2
Copy link
Member

h-2 commented Oct 26, 2022

Why is there no CI?

@h-2 h-2 closed this Oct 26, 2022
@h-2 h-2 reopened this Oct 26, 2022
@h-2
Copy link
Member

h-2 commented Oct 26, 2022

weird, now CI works again

@SGSSGene SGSSGene force-pushed the patch/faster_loading_fmc branch from 776afd1 to 95407f5 Compare October 28, 2022 11:55
@h-2
Copy link
Member

h-2 commented Oct 28, 2022

Why do the Mac-tests pass :o ?

@SGSSGene
Copy link
Collaborator Author

Oh no....I have two fallbacks: 1. If cereal/archive/binary.hpp is not available or 2. if the archive is not a binary Input/Output archive.

So I guess somehow it is using the non accelerated path... :/ I will go over it again.

@SGSSGene
Copy link
Collaborator Author

Why do the Mac-tests pase :o ?

You have very good unittests!

@SGSSGene SGSSGene marked this pull request as draft October 31, 2022 13:48
@SGSSGene SGSSGene force-pushed the patch/faster_loading_fmc branch 3 times, most recently from cbfa544 to e741f4e Compare November 4, 2022 08:52
@SGSSGene
Copy link
Collaborator Author

SGSSGene commented Nov 4, 2022

Turns out MacOs is using md5 which has a different output to md5sum. I added a fix to this PR: e741f4e

@SGSSGene SGSSGene force-pushed the patch/faster_loading_fmc branch 2 times, most recently from 922711f to 01ced33 Compare November 4, 2022 10:10
@h-2
Copy link
Member

h-2 commented Nov 7, 2022

Thanks for fixing the mac tests. Is this PR still in draft status, or do you want a review?

@sarahet
Copy link
Member

sarahet commented Nov 8, 2022

I compared this PR and the current lambda3 branch with a ~6 GB nucleotide database (using /dev/shm). Currently the index loading time for the current lambda3 branch is ~69 seconds vs ~65 seconds for this PR.

@SGSSGene
Copy link
Collaborator Author

SGSSGene commented Nov 8, 2022

This is unexpected. When testing the bidirecional index directly, the speed up was much greater.
To see, why there is no real speed up, I guess I should test with lambda3, instead of only the index itself.

@h-2 I think there is no need to integrate this, if this doesn't bring any real improvements. I will make an extra PR for the mac tests (#196).

Another idea, for improved loading, is to not store the blocks and superblocks, but compute them when loading. This should slowdown loading from cache, but increase speeds when loading from network or disk storage.

Currently, I am under some time limitations, and won't have any time this week to look at it :/

@h-2
Copy link
Member

h-2 commented Nov 9, 2022

@sarahet Maybe you can check the runtimes of (de)serialising the individual members of the index-file struct? This could be done by splitting the call to archive() into multiple calls and printing the time in-between:
https://github.com/seqan/lambda/blob/lambda3/src/shared_definitions.hpp#L337

Unless, we are sure that the index itself is the problem, it doesn't make much sense to have @SGSSGene dig into the Lambda code, I think.

[I won't be able to test things until Wednesday next week.]

@sarahet
Copy link
Member

sarahet commented Nov 9, 2022

Loading Database Index...

Time options: 4.29153e-06
Time ids: 0.00523663
Time seqs: 60.5834
Time sTaxIds: 2.14577e-06
Time taxonParentIDs: 2.38419e-07
Time taxonHeights: 2.38419e-07
Time taxonNames: 2.38419e-07
Time index: 6.91801

done.

Runtime: 67.507s

🙃 That was a good suggestion ..

@sarahet
Copy link
Member

sarahet commented Nov 10, 2022

@SGSSGene would you update this PR and remove the draft status? 🙂

@SGSSGene SGSSGene force-pushed the patch/faster_loading_fmc branch from 01ced33 to f4c78f5 Compare November 11, 2022 12:56
@SGSSGene SGSSGene marked this pull request as ready for review November 11, 2022 12:57
@SGSSGene SGSSGene force-pushed the patch/faster_loading_fmc branch from f4c78f5 to a0d6638 Compare November 11, 2022 13:13
@h-2
Copy link
Member

h-2 commented Nov 11, 2022

The search tests still pass, but the index tests fail (expectedly). Can you update the checksums in the tests so that the tests pass?

@h-2
Copy link
Member

h-2 commented Nov 11, 2022

@sarahet Or shall we wait with updating the index files / checksums until we also pull the default alphabet changes?

@sarahet
Copy link
Member

sarahet commented Nov 11, 2022

I don't think we have to wait but I can also update the checksums in a separate PR if that's easier

@sarahet
Copy link
Member

sarahet commented Nov 11, 2022

I guess depends how fast we are with the modes so that we can finalize the overall alphabet/mode change. I would like to include the nucleotide default in that PR/change and if we can fix that soon, then sure maybe cleaner to wait. Otherwise I would say we should proceed to keep moving forward?

@sarahet
Copy link
Member

sarahet commented Nov 11, 2022

@SGSSGene we decided we will first include the newly defined modes in a separate PR and merge this one afterwards 🙂

@h-2
Copy link
Member

h-2 commented Nov 30, 2022

superseded by #198

@h-2 h-2 closed this Nov 30, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants