Mitchish65 #388
Conversation
With the latest fix, I can load an unsharded checkpoint this way. It takes a long time, and it's very inefficient (because every rank loads everything and then discards most of what it just loaded), but it does work. Unfortunately, I seem to have broken how the optimizer state works, so this is not complete.
In fact, with the way this works, we can now load a model that's so big it wouldn't fit into CPU memory. Not that we need to do this. But we could.
For example, suppose you want to keep a checkpoint every 1000 steps, but you also want to save a temporary checkpoint every 100 steps in case your job fails. In that case you would set `save_interval=1000` and `save_interval_ephemeral=100`.
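Concretely, the policy amounts to something like this sketch (`save_checkpoint` and the training loop are illustrative stand-ins, not the trainer's actual API):

```python
save_interval = 1000            # permanent checkpoints, kept around
save_interval_ephemeral = 100   # temporary checkpoints, each replacing the last

def save_checkpoint(step: int, ephemeral: bool) -> None:
    # Stand-in for the real checkpointing call.
    kind = "ephemeral" if ephemeral else "permanent"
    print(f"step {step}: saving {kind} checkpoint")

for step in range(1, 2001):
    # ... one training step ...
    if step % save_interval == 0:
        save_checkpoint(step, ephemeral=False)
    elif step % save_interval_ephemeral == 0:
        save_checkpoint(step, ephemeral=True)
```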
…ints. But also still contains the old code.
LGTM!
One thing to consider is integrating the safetensors conversion into Shane's checkpoint management script. But let's leave that for another day.
```python
fsdp_model.load_state_dict(state_dict_to_load)
del state_dict_to_load
with torch.no_grad():
    # fill everything with NaN, so we can check afterwards that every parameter has been restored
```
💯
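The fill-then-verify idea being praised here, reduced to a standalone toy (a plain `nn.Linear` instead of FSDP, with the restore compressed into a single `load_state_dict` call): poison every parameter with NaN before restoring, then assert that no NaN survived, so any parameter the checkpoint failed to overwrite is caught immediately.

```python
import torch
import torch.nn as nn

model = nn.Linear(4, 4)
state_dict_to_load = {k: torch.randn_like(v) for k, v in model.state_dict().items()}

# Poison the parameters first; anything not overwritten stays NaN.
with torch.no_grad():
    for param in model.parameters():
        param.fill_(float("nan"))

model.load_state_dict(state_dict_to_load)

# Verify that the restore touched every parameter.
for name, param in model.named_parameters():
    assert not torch.isnan(param).any(), f"{name} was not restored"
```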
More checkpointing formats
The problem that's being solved here is: how do we restore the 65B model from an unsharded checkpoint? The existing way works, I think, only by accident, if it works at all. It's possible that this never worked, or maybe only worked with sharding strategies that don't actually shard the model (like `NO_SHARD` or `SHARD_GRAD_OP`, or `wrapping_strategy=null`).
So this new way uses `FSDP.apply()` to copy tensors from a state dict into the model. This part is pretty straightforward; that's what `apply()` is for.
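Roughly, the mechanism looks like the following toy (a plain `nn.Module` here, and the helper names are mine, not the PR's; the point is that on an FSDP-wrapped model the same `apply()` call materializes each unit's full parameters and writes modifications back to the shards):

```python
import torch
import torch.nn as nn

def copy_from_state_dict(model: nn.Module, state_dict: dict) -> None:
    # Precompute each module's fully-qualified name so we can look up
    # its parameters in the flat state dict.
    name_of = {id(m): n for n, m in model.named_modules()}

    def copy_params(module: nn.Module) -> None:
        prefix = name_of[id(module)]
        prefix = prefix + "." if prefix else ""
        with torch.no_grad():
            for pname, param in module.named_parameters(recurse=False):
                param.copy_(state_dict[prefix + pname])

    # On an FSDP-wrapped model, apply() runs copy_params with the full
    # parameters gathered, then writes the changes back to the shards.
    model.apply(copy_params)

model = nn.Sequential(nn.Linear(4, 4), nn.Linear(4, 4))
checkpoint = {k: torch.randn_like(v) for k, v in model.state_dict().items()}
copy_from_state_dict(model, checkpoint)
```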
The part that isn't straightforward is the detour through the safetensors format. Safetensors is brilliant. It lets you create a `Dict[str, Tensor]` where the tensors are memory-mapped files. It loads up a 500GB file in seconds (because, of course, it doesn't actually read the tensor bytes until later). So this PR contains a script that can read an existing unsharded checkpoint (in `.pt` format) and write it to disk in safetensors format (`.safetensors`). This can be done on CPU, though you need a lot of memory to do it. When reading a `.pt` file, we check whether there happens to be a `.safetensors` file with the same name, and if so, we load that instead.
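For reference, the conversion and the lazy read might look roughly like this with the safetensors library (paths and the example key are placeholders, and this assumes the checkpoint is already a flat `Dict[str, Tensor]`; the next paragraph is about what happens when it isn't):

```python
import torch
from safetensors import safe_open
from safetensors.torch import save_file

# One-time conversion: this is the step that needs a lot of CPU memory,
# because torch.load materializes the whole checkpoint at once.
state_dict = torch.load("model.pt", map_location="cpu")
save_file({k: v.contiguous() for k, v in state_dict.items()}, "model.safetensors")

# Reading back is cheap: opening the file memory-maps it, and tensor
# bytes are only touched when a particular key is actually requested.
with safe_open("model.safetensors", framework="pt", device="cpu") as f:
    weight = f.get_tensor("transformer.wte.weight")  # placeholder key
```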
One more problem is that state dicts are not `Dict[str, Tensor]`. State dicts can contain inner dicts, and optimizer state dicts contain even more crazy stuff. So there is a mapper in this PR that maps crazy state dicts to well-formed `Dict[str, Tensor]`s and back. This sacrifices human interpretability of the files, but retains the lazy-loading memmap goodness from safetensors.
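In spirit, the mapper does something like the following (my illustration of the idea, not the PR's actual encoding, and it only handles nested dicts of tensors, not the other odd value types an optimizer state dict carries): encode the path through the nesting into the key, so the state dict becomes flat on the way out and can be rebuilt on the way back in.

```python
import torch

def flatten(tree: dict, prefix: str = "") -> dict:
    # Collapse nested dicts into flat "a/b/c" keys.
    flat = {}
    for key, value in tree.items():
        path = prefix + str(key)
        if isinstance(value, dict):
            flat.update(flatten(value, path + "/"))
        else:
            flat[path] = value
    return flat

def unflatten(flat: dict) -> dict:
    # Rebuild the nesting from the encoded keys.
    tree: dict = {}
    for path, value in flat.items():
        *parents, leaf = path.split("/")
        node = tree
        for part in parents:
            node = node.setdefault(part, {})
        node[leaf] = value
    return tree

nested = {"state": {"0": {"exp_avg": torch.zeros(4)}}}
flat = flatten(nested)  # {"state/0/exp_avg": tensor([0., 0., 0., 0.])}
restored = unflatten(flat)
assert torch.equal(restored["state"]["0"]["exp_avg"], nested["state"]["0"]["exp_avg"])
```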
TODO