Patch/faster loading fmc #191

SGSSGene · 2022-10-26T14:29:10Z

This updates the fmindex_collection submodule. In that version, loading the fmindex should be faster.
In my local test, I measured how long it takes to load the bidirectional fmindex (ieprv2) of the human genome, taking the fastest of several runs.
old version: 4.7s
new version: 2.1s

The index size increased:
old version: 6.2GB
new version: 6.6GB

Attention: a rebuild of the fmindex is required.

h-2 · 2022-10-26T14:36:38Z

Why is there no CI?

h-2 · 2022-10-26T14:55:20Z

weird, now CI works again

h-2 · 2022-10-28T14:23:03Z

Why do the Mac-tests pass :o ?

SGSSGene · 2022-10-28T14:26:57Z

Oh no... I have two fallbacks: 1. if cereal/archive/binary.hpp is not available, or 2. if the archive is not a binary input/output archive.

So I guess somehow it is using the non-accelerated path... :/ I will go over it again.
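
The two fallbacks described above could look roughly like the following. This is only a sketch of the general pattern, not the code in fmindex_collection: `serialize_sketch`, `is_binary_archive_v`, `SKETCH_HAS_CEREAL_BINARY` and `counting_archive_sketch` are hypothetical names, and the bulk path assumes the vector already has its final size.

```cpp
#include <cstddef>
#include <type_traits>
#include <vector>

// Fallback 1: detect at compile time whether the cereal binary archive header exists.
#if __has_include(<cereal/archive/binary.hpp>)
#    include <cereal/archive/binary.hpp>
#    define SKETCH_HAS_CEREAL_BINARY 1
#else
#    define SKETCH_HAS_CEREAL_BINARY 0
#endif

// Fallback 2: true only for cereal's binary input/output archives.
template <typename archive_t>
inline constexpr bool is_binary_archive_v =
#if SKETCH_HAS_CEREAL_BINARY
    std::is_same_v<archive_t, cereal::BinaryInputArchive> ||
    std::is_same_v<archive_t, cereal::BinaryOutputArchive>;
#else
    false;
#endif

// Hypothetical non-binary archive that only counts how often it is called.
struct counting_archive_sketch
{
    int calls = 0;
    void operator()(unsigned &) { ++calls; }
};

// Returns true if the accelerated (bulk) path was taken, false for the fallback.
template <typename archive_t>
bool serialize_sketch(archive_t & archive, std::vector<unsigned> & data)
{
    if constexpr (is_binary_archive_v<archive_t>)
    {
#if SKETCH_HAS_CEREAL_BINARY
        // One bulk read/write of the raw bytes (assumes data is already sized).
        archive(cereal::binary_data(data.data(), data.size() * sizeof(unsigned)));
#endif
        return true;
    }
    else
    {
        // Non-accelerated path: element by element.
        for (auto & value : data)
            archive(value);
        return false;
    }
}
```

Returning which path was taken makes it easy for a test to notice when the fallback is silently used.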

SGSSGene · 2022-10-28T14:27:19Z

Why do the Mac-tests pass :o ?

You have very good unittests!

SGSSGene · 2022-11-04T08:55:57Z

Turns out macOS is using md5, which produces different output than md5sum. I added a fix to this PR: e741f4e

h-2 · 2022-11-07T20:38:45Z

Thanks for fixing the mac tests. Is this PR still in draft status, or do you want a review?

sarahet · 2022-11-08T09:22:44Z

I compared this PR and the current lambda3 branch with a ~6 GB nucleotide database (using /dev/shm). Index loading takes ~69 seconds on the current lambda3 branch vs. ~65 seconds with this PR.

SGSSGene · 2022-11-08T09:31:39Z

This is unexpected. When testing the bidirectional index directly, the speedup was much greater.
To see why there is no real speedup, I guess I should test with lambda3 instead of only the index itself.

@h-2 I think there is no need to integrate this if it doesn't bring any real improvement. I will make an extra PR for the mac tests (#196).

Another idea for faster loading is to not store the blocks and superblocks, but to compute them when loading. This should slow down loading from cache, but speed up loading from network or disk storage.

Currently, I am under some time constraints and won't have any time this week to look at it :/
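
To illustrate the idea of recomputing blocks and superblocks at load time, here is a minimal sketch of a generic two-level rank structure over a bit vector. It is not the layout used by fmindex_collection; `rank_support_sketch`, `rebuild_rank_support`, `rank1` and the choice of 8 words per superblock are assumptions made for this example. Only the raw words would be stored on disk, and the counters are rebuilt in one linear pass.

```cpp
#include <bitset>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical rank support: counters that can be derived from the raw bits.
struct rank_support_sketch
{
    std::vector<std::uint64_t> superblocks; // 1-bits before each group of 8 words
    std::vector<std::uint16_t> blocks;      // 1-bits before each word, relative to its superblock
};

// Rebuilds both counter levels from the 64-bit words in a single pass.
inline rank_support_sketch rebuild_rank_support(std::vector<std::uint64_t> const & words)
{
    constexpr std::size_t words_per_superblock = 8; // assumption: 512-bit superblocks
    rank_support_sketch result;
    result.blocks.reserve(words.size());

    std::uint64_t total  = 0; // 1-bits seen so far
    std::uint16_t within = 0; // 1-bits seen inside the current superblock
    for (std::size_t i = 0; i < words.size(); ++i)
    {
        if (i % words_per_superblock == 0)
        {
            result.superblocks.push_back(total);
            within = 0;
        }
        result.blocks.push_back(within);
        auto const ones = std::bitset<64>(words[i]).count();
        within = static_cast<std::uint16_t>(within + ones); // at most 7 * 64 = 448
        total += ones;
    }
    return result;
}

// Number of 1-bits in positions [0, pos); requires pos < 64 * words.size().
inline std::uint64_t rank1(rank_support_sketch const & rs,
                           std::vector<std::uint64_t> const & words,
                           std::size_t pos)
{
    std::size_t const word = pos / 64;
    std::size_t const bit  = pos % 64;
    std::uint64_t const mask = (bit == 0) ? 0 : (~std::uint64_t{0} >> (64 - bit));
    return rs.superblocks[word / 8] + rs.blocks[word] + std::bitset<64>(words[word] & mask).count();
}
```

The trade-off matches the comment above: the file gets smaller, but every load pays for one popcount pass over the data.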

h-2 · 2022-11-09T00:24:38Z

@sarahet Maybe you can check the runtimes of (de)serialising the individual members of the index-file struct? This could be done by splitting the call to archive() into multiple calls and printing the time in between:
https://github.com/seqan/lambda/blob/lambda3/src/shared_definitions.hpp#L337

Unless we are sure that the index itself is the problem, it doesn't make much sense to have @SGSSGene dig into the Lambda code, I think.

[I won't be able to test things until Wednesday next week.]
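
The per-member timing suggested above could be sketched like this. The struct and archive here are hypothetical stand-ins (`index_file_sketch`, `mock_archive_sketch`), not the actual definitions in shared_definitions.hpp; the only point is the pattern of one archive call per member, each wrapped in a `std::chrono::steady_clock` measurement.

```cpp
#include <chrono>
#include <cstddef>
#include <iostream>
#include <vector>

// Hypothetical stand-in for a cereal archive: only records how many bytes it saw.
struct mock_archive_sketch
{
    std::size_t bytes_seen = 0;

    template <typename T>
    void operator()(std::vector<T> const & v)
    {
        bytes_seen += v.size() * sizeof(T);
    }
};

// Hypothetical, heavily reduced index-file struct.
struct index_file_sketch
{
    std::vector<char>     ids;
    std::vector<char>     seqs;
    std::vector<unsigned> index;
};

// Passes one member to the archive and prints the elapsed wall-clock time in seconds.
template <typename archive_t, typename field_t>
double timed_member(archive_t & archive, char const * label, field_t & field)
{
    auto const start = std::chrono::steady_clock::now();
    archive(field);
    std::chrono::duration<double> const elapsed = std::chrono::steady_clock::now() - start;
    std::cerr << "Time " << label << ": " << elapsed.count() << '\n';
    return elapsed.count();
}

// Instead of a single archive(ids, seqs, index) call, handle each member separately.
template <typename archive_t>
void load_members_timed(archive_t & archive, index_file_sketch & file)
{
    timed_member(archive, "ids", file.ids);
    timed_member(archive, "seqs", file.seqs);
    timed_member(archive, "index", file.index);
}
```

Because the members are still passed in the same order, splitting the call this way should not change the on-disk format for a sequential archive.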

sarahet · 2022-11-09T17:22:17Z

Loading Database Index...

Time options: 4.29153e-06
Time ids: 0.00523663
Time seqs: 60.5834
Time sTaxIds: 2.14577e-06
Time taxonParentIDs: 2.38419e-07
Time taxonHeights: 2.38419e-07
Time taxonNames: 2.38419e-07
Time index: 6.91801

done.

Runtime: 67.507s

🙃 That was a good suggestion ..

sarahet · 2022-11-10T20:40:17Z

@SGSSGene would you update this PR and remove the draft status? 🙂

h-2 · 2022-11-11T19:46:32Z

The search tests still pass, but the index tests fail (expectedly). Can you update the checksums in the tests so that the tests pass?

h-2 · 2022-11-11T19:47:20Z

@sarahet Or shall we wait with updating the index files / checksums until we also pull the default alphabet changes?

sarahet · 2022-11-11T20:24:05Z

I don't think we have to wait, but I can also update the checksums in a separate PR if that's easier.

sarahet · 2022-11-11T20:32:16Z

I guess it depends on how fast we are with the modes, so that we can finalize the overall alphabet/mode change. I would like to include the nucleotide default in that PR/change, and if we can fix that soon, then sure, it is maybe cleaner to wait. Otherwise I would say we should proceed to keep moving forward?

sarahet · 2022-11-11T21:05:00Z

@SGSSGene we decided we will first include the newly defined modes in a separate PR and merge this one afterwards 🙂

h-2 · 2022-11-30T16:45:17Z

superseded by #198

SGSSGene changed the base branch from master to lambda3 October 26, 2022 14:29

h-2 closed this Oct 26, 2022

h-2 reopened this Oct 26, 2022

SGSSGene force-pushed the patch/faster_loading_fmc branch from 776afd1 to 95407f5 Compare October 28, 2022 11:55

SGSSGene marked this pull request as draft October 31, 2022 13:48

SGSSGene force-pushed the patch/faster_loading_fmc branch 3 times, most recently from cbfa544 to e741f4e Compare November 4, 2022 08:52

SGSSGene force-pushed the patch/faster_loading_fmc branch 2 times, most recently from 922711f to 01ced33 Compare November 4, 2022 10:10

sarahet mentioned this pull request Nov 10, 2022

[fix] workaround bug in cereal #197

Merged

SGSSGene force-pushed the patch/faster_loading_fmc branch from 01ced33 to f4c78f5 Compare November 11, 2022 12:56

SGSSGene marked this pull request as ready for review November 11, 2022 12:57

[patch] update fmindex_collection version

a0d6638

SGSSGene force-pushed the patch/faster_loading_fmc branch from f4c78f5 to a0d6638 Compare November 11, 2022 13:13

sarahet mentioned this pull request Nov 18, 2022

Add protein and nucleotide profiles #198

Merged

h-2 closed this Nov 30, 2022
