Checkpointing
Checkpointing has two use cases: saving the state of the solver so it can be loaded later (for example, in unsteady adjoint calculations), and restarting the solver after a crash. The checkpointing functionality described here serves both. It provides an interface for writing the current state to a file that can be read back later. As long as at least 2 checkpoints are saved, the implementation guarantees that at least one checkpoint is loadable at any time, even if the code is terminated while writing a checkpoint.
Log files require special handling when restarting. They should be opened in append mode when restarting (in fact, the code usually opens log files in append mode even when not restarting). One consequence of restarting is that log files may contain duplicate lines. For example, if a run of 10,000 timesteps that checkpoints every 1,000 timesteps is terminated at timestep 6,500, it can be restarted from the checkpoint at timestep 6,000 and run to the end. Any output written between timesteps 6,000 and 6,500 will then appear twice in the log files. To deal with this problem, the script log_cleaner.jl (located in src/scripts) can be used to post-process delimited data files (space, comma, tab, etc.) and remove duplicate entries in favor of the entry appearing last in the log file. It also checks that removed entries are the same (to floating point tolerance) as the entries that are preserved.
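The last-entry-wins rule can be illustrated with a short sketch. This is not the code of log_cleaner.jl. It assumes that each row has already been parsed into numbers and that the first column is the timestep. The names dedup_rows and tol are hypothetical, and the sketch is written for current Julia (1.x).

```julia
# Hypothetical sketch of last-entry-wins de-duplication for log rows.
# Assumes each row is a Vector{Float64} whose first entry is the timestep.
function dedup_rows(rows::Vector{Vector{Float64}}; tol::Float64=1e-12)
    last_seen = Dict{Float64, Vector{Float64}}()  # timestep -> most recent row
    order = Float64[]                             # timesteps in first-seen order
    consistent = true
    for row in rows
        key = row[1]
        if haskey(last_seen, key)
            old = last_seen[key]
            # compare the superseded row with the row that replaces it
            if length(old) != length(row) || any(abs.(old .- row) .> tol)
                consistent = false
            end
        else
            push!(order, key)
        end
        last_seen[key] = row  # the later entry replaces the earlier one
    end
    return [last_seen[k] for k in order], consistent
end

# rows for timesteps 6000 and 6001 appear twice, as after a restart
rows = [[6000.0, 1.5], [6001.0, 1.6], [6000.0, 1.5], [6001.0, 1.6], [6002.0, 1.7]]
cleaned, consistent = dedup_rows(rows)
```

Here cleaned holds one row per timestep, and consistent reports whether every discarded row matched the row that replaced it.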
API
Types
Utils.Checkpointer
— Type.This type keeps track of which checkpoints are in use and has an API for loading and saving checkpoints
The fields of the type should never be directly accessed. Use the API instead.
Every checkpoint must be in its own directory.
Checkpointing has two uses: loading a previous state and restarting. The first is rather easy: only the solution variables need to be loaded. Restarting is more involved because all the local data of the NonlinearSolver that was running needs to be stored and then loaded again in a new session.
An input file called input_vals_restart is created in the current directory that can be used to restart the code.
Fields
ncheckpoints: total number of checkpoints
paths: absolute paths to the checkpoints (array of strings)
status: status of each checkpoint, used or free (array of Ints)
history: list of checkpoints from most recently used to least recently used, unused entries set to -1
Utils.Checkpointer
— Method.Constructs a Checkpointer with a given number of checkpoints.
The checkpoints are put into directories called "checkpoint*", where the "*" is replaced by the index of the checkpoint (from 1 to ncheckpoints). This constructor does not error if the directories already exist; however, any data in the checkpoints may be overwritten.
Inputs
myrank: MPI rank of this process
ncheckpoints: number of checkpoints, defaults to 2. Using fewer than 2 is not recommended (there will be a moment in time during which the old checkpoint has been partially overwritten but the new checkpoint is not complete yet)
prefix: added to directory names (with an underscore added in-between). Defaults to empty string. This is useful if there are multiple sets of checkpoints (and therefore multiple Checkpointer objects).
Outputs
a Checkpointer object, fully initialized
Utils.Checkpointer
— Method.This constructor loads a Checkpointer object from a file. This should be used for restarting. Do not use the other constructor for restarting. It will load the most recently written checkpoint that is complete (an important distinction if the code was killed in the middle of writing a checkpoint)
Inputs
opts: the options dictionary.
Outputs
a Checkpointer object, fully initialized and aware of which checkpoints are in use and which are free
This function only loads the Checkpointer from the checkpoint. See loadLastCheckpoint and readCheckpointData to load the rest of the checkpoint data. The checkpoint that was loaded can be accessed via getLastCheckpoint.
Implementation notes:
Uses 4 keys in the options dictionary:
"writing_checkpoint": index of checkpoint that might be complete
"writing_checkpoint_path": absolute path of checkpoint
"most_recent_checkpoint": index of checkpoint that is definitely complete, but might be older than the above
"most_recent_checkpoint_path": absolute path of checkpoint
This system ensures it is possible to restart even if the code is killed in the middle of writing a checkpoint.
The Checkpointer object is the same on all processes.
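The way the four keys allow a safe restart can be sketched as follows. This is an illustration only, not the constructor's actual code. The function name choose_restart_checkpoint and the is_complete predicate are hypothetical, the underscore spelling of the first key is assumed, and the paths in the example are made up.

```julia
# Hypothetical sketch: decide which checkpoint to restart from.
# `is_complete` stands in for whatever completeness test the code uses
# (for example, checking a flag file); it is an assumption of this sketch.
function choose_restart_checkpoint(opts::Dict, is_complete::Function)
    if is_complete(opts["writing_checkpoint_path"])
        # the newest checkpoint was fully written, so use it
        return opts["writing_checkpoint"]
    end
    # otherwise fall back to the checkpoint known to be complete
    return opts["most_recent_checkpoint"]
end

opts = Dict{String,Any}(
    "writing_checkpoint" => 2,
    "writing_checkpoint_path" => "/tmp/run/checkpoint2",
    "most_recent_checkpoint" => 1,
    "most_recent_checkpoint_path" => "/tmp/run/checkpoint1")

killed_mid_write = choose_restart_checkpoint(opts, path -> false)
finished_write = choose_restart_checkpoint(opts, path -> true)
```

If the code was killed while writing checkpoint 2, the restart falls back to checkpoint 1. If the write finished, checkpoint 2 is used.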
Base.copy
— Method.Recursively copies all fields to make a new Checkpointer object. Note that changes to one Checkpointer object will not affect the other (i.e. the record of which checkpoints are used and which are not is no longer shared). This could easily lead to corrupting a checkpoint. For this reason, this function should rarely be used.
Base.copy!
— Method.Two-argument version of copy(). See that function for details.
Utils.AbstractCheckpointData
— Type.Every function that wants to checkpoint must implement an AbstractCheckpointData containing all the data that needs to be loaded when the code restarts. The values in the AbstractCheckpointData can be different for each MPI process.
This type must contain only "julia" data, i.e. objects that are managed by Julia and have deterministic values. No pointers, MPI communicators, etc.
File IO is expensive, so include only as much data as needed in the AbstractCheckpointData. Anything that can be recalculated should be.
Generally, only the outermost function (ie. the time stepper or the nonlinear solver for steady problems) will need to checkpoint.
There are no required fields for this type.
Functions
These functions provide the basic operations required for checkpointing.
Utils.saveNextFreeCheckpoint
— Function.This function saves to the next free checkpoint
Inputs
checkpointer: the Checkpointer object
mesh: an AbstractMesh
sbp: an SBP operator
eqn: an AbstractSolutionData object
checkpoint_data: the user's AbstractCheckpointData implementation
Inputs/Outputs
opts: the options dictionary
Outputs
checkpoint: the index of the checkpoint saved.
Utils.loadLastCheckpoint
— Function.This function loads the most recently saved checkpoint.
Inputs
checkpointer: the Checkpointer object
mesh: an AbstractMesh
sbp: an SBP operator
opts: the options dictionary
Inputs/Outputs
eqn: an AbstractSolutionData object
Outputs
the checkpoint index that was loaded
Utils.readLastCheckpointData
— Function.Simple wrapper around readCheckpointData to load the most recently saved checkpoint data
Inputs
chkpointer: a Checkpointer
comm_rank: MPI rank of this process
Utils.readCheckpointData
— Function.This function reads an object of type AbstractCheckpointData from a file and returns the object. This is type unstable, so every AbstractCheckpointData implementation should create a constructor that calls this function and uses a type assertion to specify that the object must be of their concrete type.
Inputs
chkpointer: a Checkpointer
chkpoint: the index of the checkpoint
comm_rank: the MPI rank of this process
Outputs
the AbstractCheckpointData loaded from the checkpoint
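The type-assertion pattern recommended above can be tried in isolation. The sketch below does not call readCheckpointData. It uses Julia's Serialization standard library as a stand-in for the real file format, which is an assumption. The names StepData and read_any are hypothetical, and the abstract type is redeclared locally so the snippet runs on its own. It is written for current Julia (1.x), where mutable struct replaces the older type keyword.

```julia
using Serialization

# Local stand-in for Utils.AbstractCheckpointData, so the sketch is self-contained
abstract type AbstractCheckpointData end

mutable struct StepData <: AbstractCheckpointData
    step::Int
end

# Stand-in for a type-unstable reader: the return type depends on the file contents
read_any(fname::AbstractString) = open(deserialize, fname)

# Constructor that pins down the concrete type with a type assertion.
# It throws a TypeError if the file holds some other kind of object.
function StepData(fname::AbstractString)
    obj = read_any(fname)
    return obj::StepData
end

fname = tempname()
open(io -> serialize(io, StepData(42)), fname, "w")  # write a checkpoint-like file
loaded = StepData(fname)                             # read it back with a known type
```

Because of the assertion, code that uses loaded can be compiled for the concrete type even though the reader itself is type unstable.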
Utils.countFreeCheckpoints
— Function.This function returns the number of free checkpoints
Inputs
checkpointer: the Checkpointer object
Utils.getNextFreeCheckpoint
— Function.This function returns the index of the next free checkpoint, or 0 if there are no free checkpoints
Input
checkpointer: a Checkpointer
Utils.getLastCheckpoint
— Function.This function returns the index of the most recently written checkpoint
Inputs
checkpointer: the Checkpointer object
Outputs
the index of the most recently saved checkpoint
Utils.getOldestCheckpoint
— Function.Get the least recently written checkpoint
Inputs
checkpointer: the Checkpointer
Outputs
the index of the least recently written checkpoint
Utils.freeOldestCheckpoint
— Function.Frees the least recently written checkpoint. Calling this function when all checkpoints are free is allowed.
Inputs
checkpointer: the Checkpointer
Outputs
returns the checkpoint freed (0 if all checkpoints were free on entry)
Utils.freeCheckpoint
— Function.This function frees a checkpoint (marks it as available to be overwritten). Unlike certain algorithms that use free lists (cough, cough, malloc), freeing an already free checkpoint is allowed.
Inputs
checkpointer: the Checkpointer object
checkpoint: the index of the checkpoint to free
The user must explicitly free checkpoints (loading a checkpoint does not free it). Note that a free checkpoint can still be loaded for restarting if it has not been saved to yet.
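The interplay of counting, freeing and saving can be seen in a toy model. This is not the real Checkpointer: ToyCheckpointer and its helper functions are hypothetical, nothing is written to disk, and the history is a growable vector rather than a fixed-length list padded with -1. It is written for current Julia (1.x).

```julia
# Toy model of the used/free bookkeeping; hypothetical, not the real Checkpointer.
mutable struct ToyCheckpointer
    status::Vector{Bool}   # true = used, false = free
    history::Vector{Int}   # checkpoint indices, most recently used first
end
ToyCheckpointer(n::Int) = ToyCheckpointer(fill(false, n), Int[])

count_free(c) = count(!, c.status)
next_free(c) = something(findfirst(!, c.status), 0)  # 0 if nothing is free

function save_next_free!(c)
    idx = next_free(c)
    idx == 0 && error("no free checkpoint")
    c.status[idx] = true
    pushfirst!(c.history, idx)
    return idx
end

function free_oldest!(c)
    isempty(c.history) && return 0  # everything was already free
    idx = pop!(c.history)           # least recently used entry
    c.status[idx] = false
    return idx
end

# Rotate through 2 checkpoints over 4 saves, freeing the oldest when full
c = ToyCheckpointer(2)
saved = Int[]
for step in 1:4
    count_free(c) == 0 && free_oldest!(c)
    push!(saved, save_next_free!(c))
end
```

With two checkpoints the saves alternate between index 1 and index 2, so the checkpoint written just before the one being overwritten is always intact.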
Example
This section provides a simple example of how to use checkpointing for an explicit time marching scheme. The original scheme (without checkpointing) looks like:
function explicit_timemarching_example(mesh, sbp, eqn, opts, func::Function)
  delta_t = opts["delta_t"]
  nsteps = round(Int, opts["tmax"]/delta_t)
  t = 0.0
  for step=1:nsteps
    t = (step - 1)*delta_t
    # q_vec -> q
    array1DTo3D(mesh, sbp, eqn, opts, eqn.q, eqn.q_vec)
    func(mesh, sbp, eqn, opts, t)
    # res -> res_vec
    array3DTo1D(mesh, sbp, eqn, opts, eqn.res, eqn.res_vec)
    for i=1:mesh.numDof
      eqn.q_vec[i] = ... # some function of eqn.res_vec[i]
    end
  end  # end loop over timesteps
  return nothing
end
With checkpointing, it becomes:
type ExampleCheckpointData <: AbstractCheckpointData
  step::Int  # for this method, the only thing that needs to be stored is the
             # current time step, everything else can be recalculated
end

# this constructor is used when restarting, see below
function ExampleCheckpointData(chkpointer::Checkpointer, comm_rank)
  chkpoint_data = readLastCheckpointData(chkpointer, comm_rank)
  return chkpoint_data::ExampleCheckpointData
end

function explicit_timemarching_example(mesh, sbp, eqn, opts, func::Function)
  delta_t = opts["delta_t"]
  nsteps = round(Int, opts["tmax"]/delta_t)
  t = 0.0
  if !opts["is_restart"]
    # regular starting of a run
    stepstart = 1
    chkpointdata = ExampleCheckpointData(stepstart)
    # this is valid even if we don't intend to checkpoint (ncheckpoints = 0)
    chkpointer = Checkpointer(mesh.myrank, opts["ncheckpoints"])
    skip_checkpoint = false
  else
    chkpointer = Checkpointer(opts, mesh.myrank)  # read Checkpointer from most
                                                  # recent checkpoint
    # now get the checkpoint data telling which timestep we are on
    # eqn.q_vec already holds the state saved in the checkpoint. This is handled
    # during the initialization of the equation object
    chkpointdata = ExampleCheckpointData(chkpointer, mesh.myrank)
    stepstart = chkpointdata.step
    skip_checkpoint = true  # skip writing the first checkpoint. Without this, a
                            # checkpoint would be written immediately.
                            # Writing the checkpoint is not harmful, but not
                            # useful either.
  end
  for step=stepstart:nsteps  # start the loop from stepstart
    t = (step - 1)*delta_t  # t is recalculated from step, which was set using
                            # stepstart, thus only stepstart needs to be saved
    # save checkpoint, if needed
    if opts["use_checkpointing"] && step % opts["checkpoint_freq"] == 0
      if skip_checkpoint
        # first checkpoint step after a restart: this state is already on disk,
        # so only clear the flag and let later checkpoints be written
        skip_checkpoint = false
      else
        # save all needed variables to the ExampleCheckpointData object
        # For simple time marching schemes, step is the only needed data
        chkpointdata.step = step
        if countFreeCheckpoints(chkpointer) == 0
          freeOldestCheckpoint(chkpointer)  # make room for a new checkpoint
        end
        saveNextFreeCheckpoint(chkpointer, mesh, sbp, eqn, opts, chkpointdata)
      end
    end
    # q_vec -> q
    array1DTo3D(mesh, sbp, eqn, opts, eqn.q, eqn.q_vec)
    func(mesh, sbp, eqn, opts, t)
    # res -> res_vec
    array3DTo1D(mesh, sbp, eqn, opts, eqn.res, eqn.res_vec)
    for i=1:mesh.numDof
      eqn.q_vec[i] = ... # some function of eqn.res_vec[i]
    end
  end  # end loop over timesteps
  return nothing
end
Internal Functions
The internal functions used for checkpointing are documented here. Users should not call these functions directly. Improper use can cause checkpoint corruption.
Utils.writeCheckpointer
— Function.Call on master process only
Does not check if the checkpoint is already used (for reading the state back after restart, this checkpoint should already be marked as used)
Utils.saveCheckpoint
— Function.Save to a specified checkpoint. Throws an error if the checkpoint is not free. Users should not generally call this function directly. Instead, they should prefer saveNextFreeCheckpoint.
This function automatically saves eqn.q_vec and the checkpointer to a file. Any additional data should be in checkpoint_data.
Note: mesh adaptation is not compatible with checkpointing. TODO: add a field to the mesh to record that it has been modified.
Inputs
checkpointer: the Checkpointer
checkpoint: the index of the checkpoint
mesh: an AbstractMesh object
sbp: SBP operator
eqn: an AbstractSolutionData
checkpoint_data: an AbstractCheckpointData. This is the random bag of data the user needs saved.
Inputs/Outputs
opts: options dictionary
Implementation Notes:
Uses options dictionary keys described by Checkpointer
Note that the checkpoint is eagerly marked as used, before finishing writing the checkpoint. Upon restart the code needs to check if this checkpoint is really finished.
Utils.loadCheckpoint
— Function.Loads a specified checkpoint. Only loads eqn.q_vec. Does not load the checkpointer (because this is loading a checkpoint and not a restart). Also does not load the AbstractCheckpointData. See readCheckpointData for loading of the AbstractCheckpointData. Users should generally not call this function directly. Users should prefer loadLastCheckpoint whenever possible.
Note that loading a checkpoint does not mark it as free. Users must explicitly call freeCheckpoint.
Inputs
checkpointer: the Checkpointer object
checkpoint: the index of the checkpoint
mesh: an AbstractMesh object
sbp: the SBP operator
opts: the options dictionary
Inputs/Outputs
eqn: an AbstractSolutionData, eqn.q_vec is overwritten
Utils.writeFlagFile
— Function.Write the flag file that signifies a checkpoint is complete. Call on master process only!
Inputs
checkpointer: the Checkpointer object
checkpoint: the checkpoint index
Utils.checkFlagFile
— Function.Returns true if the flag file exists and is consistent, returns false otherwise
Inputs
checkpointer: the Checkpointer object
checkpoint: the checkpoint index
It is sometimes necessary to check the flag file before the Checkpointer is available. See the other method for details.
Inputs
fpath: path to the checkpoint directory
Utils.deleteFlagFile
— Function.Deletes the flag file if it exists. Does not error if flag file does not exist. Call on master process only
Inputs
checkpointer: the Checkpointer object
checkpoint: the checkpoint index
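The flag-file idea behind writeFlagFile, checkFlagFile and deleteFlagFile can be sketched as follows. The file name flag.dat, its contents (the checkpoint index), and the reading of "consistent" as "the stored index matches" are all assumptions made for illustration. The helper names are hypothetical and do not reproduce the real functions.

```julia
# Hypothetical flag-file protocol; file name and contents are assumptions.
flag_path(dir) = joinpath(dir, "flag.dat")

# Written last, after all checkpoint data is on disk
function write_flag(dir, checkpoint::Int)
    open(flag_path(dir), "w") do io
        println(io, checkpoint)
    end
end

# true only if the flag exists and records the expected checkpoint index
function check_flag(dir, checkpoint::Int)
    fname = flag_path(dir)
    isfile(fname) || return false
    return tryparse(Int, strip(read(fname, String))) == checkpoint
end

# force=true makes rm a no-op when the file does not exist
delete_flag(dir) = rm(flag_path(dir), force=true)

dir = mktempdir()
before = check_flag(dir, 1)        # no flag yet
write_flag(dir, 1)
after = check_flag(dir, 1)         # flag present and consistent
wrong_index = check_flag(dir, 2)   # flag present but for another checkpoint
delete_flag(dir)
delete_flag(dir)                   # deleting again does not error
deleted = check_flag(dir, 1)
```

A checkpoint whose data was only partially written never gets its flag, so a restart can tell that it must fall back to an older checkpoint.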
Utils.writeCheckpointData
— Function.Writes the AbstractCheckpointData to a file. The file can be read by readCheckpointData.
Inputs
checkpointer: the Checkpointer object
checkpoint: the checkpoint index
obj: the AbstractCheckpointData object
comm_rank: the MPI rank of this process
Utils.markCheckpointUsed
— Function.Marks a checkpoint as used.
Inputs
checkpointer: the Checkpointer object
checkpoint: the checkpoint index