Checkpointing

Checkpointing has two use cases: saving the state of the solver to be loaded later (for example, in unsteady adjoint calculations), and restarting the solver after a crash. The checkpointing functionality described here serves both. Its purpose is to provide an interface for writing the current state to a file that can be read back later. As long as at least 2 checkpoints are saved, the implementation guarantees that at least one checkpoint is loadable at any time, even if the code is terminated while writing a checkpoint.

Log files require special handling when restarting. Log files should be opened in append mode when restarting (in fact, the code usually opens log files in append mode even when not restarting). One of the consequences of restarting is that there may be duplicate lines in log files. For example, if a code is doing a run of 10,000 timesteps, checkpointing every 1,000 timesteps, and gets terminated at timestep 6,500, it can be restarted from the checkpoint at timestep 6,000 and run to the end. This means any output written between timesteps 6,000 and 6,500 will appear twice in the log files. To deal with this problem, the script log_cleaner.jl (located in src/scripts) can be used to post-process delimited data files (space, comma, tab, etc.) and remove duplicate entries in favor of the entry appearing last in the log file. It also checks removed entries to make sure they are the same (to floating point tolerance) as the entries that are preserved.
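The deduplication described above can be sketched as follows. This is only an illustration and not the actual log_cleaner.jl: the function name dedupe_keep_last, the assumption that the first column is the key (for example, the timestep index), and the tolerance are all hypothetical. It is written for current Julia syntax.

```julia
# Hypothetical helper, NOT the actual log_cleaner.jl implementation.
# Each row is a vector of numbers; the first column is assumed to be the key
# (for example, the timestep index).
function dedupe_keep_last(rows::Vector{Vector{Float64}}; rtol=1e-12)
  last_seen = Dict{Float64, Int}()   # key -> index of the last row with that key
  for (i, row) in enumerate(rows)
    last_seen[row[1]] = i
  end

  kept = Vector{Vector{Float64}}()
  for (i, row) in enumerate(rows)
    j = last_seen[row[1]]
    if i == j
      push!(kept, row)               # last occurrence of this key: keep it
    elseif !isapprox(row, rows[j]; rtol=rtol)
      # a removed entry differs from the entry that is preserved
      @warn "duplicate entries differ" key=row[1]
    end
  end
  return kept
end

# steps 2 and 3 were written twice (once before and once after a restart)
rows = [[1.0, 0.5], [2.0, 0.25], [3.0, 0.125],
        [2.0, 0.25], [3.0, 0.125], [4.0, 0.0625]]
cleaned = dedupe_keep_last(rows)
```

In this sketch the surviving rows keep the order in which their last occurrences appear in the file, so the restarted portion of the log replaces the lines written before the crash.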

API

Types

The Checkpointer type keeps track of which checkpoints are in use and provides an API for loading and saving checkpoints.

The fields of the type should never be directly accessed. Use the API instead.

Every checkpoint must be in its own directory.

Checkpointing has two uses: loading a previous state and restarting. The first is rather easy: only the solution variables need to be loaded. Restarting is more involved because all the local data of the NonlinearSolver that was running needs to be stored and then loaded again in a new session.

An input file called input_vals_restart is created in the current directory that can be used to restart the code.

Fields

  • ncheckpoints: total number of checkpoints

  • paths: absolute paths to the checkpoints (array of strings)

  • status: status of each checkpoint, used or free (array of Ints)

  • history: list of checkpoints from most recently used to least recently used, unused entries set to -1

Utils.Checkpointer (Method)

Constructs a Checkpointer with a given number of checkpoints.

The checkpoints are put into directories called "checkpoint*", where the "*" is replaced by the index of the checkpoint (from 1 to ncheckpoints). This constructor does not error if the directories already exist; however, any data in the checkpoints may be overwritten.

Inputs

  • myrank: MPI rank of this process

  • ncheckpoints: number of checkpoints, defaults to 2. Using fewer than 2 is not recommended (there will be a moment in time during which the old checkpoint has been partially overwritten but the new checkpoint is not complete yet)

  • prefix: added to directory names (with an underscore added in-between). Defaults to empty string. This is useful if there are multiple sets of checkpoints (and therefore multiple Checkpointer objects).

Outputs

  • a Checkpointer object, fully initialized

Utils.Checkpointer (Method)

This constructor loads a Checkpointer object from a file. It should be used for restarting; do not use the other constructor for restarting. It loads the most recently written checkpoint that is complete (an important distinction if the code was killed in the middle of writing a checkpoint).

Inputs

  • opts: the options dictionary.

Outputs

  • a Checkpointer object, fully initialized and aware of which checkpoints are in use and which are free

This function only loads the Checkpointer from the checkpoint. See loadLastCheckpoint and readCheckpointData to load the rest of the checkpoint data. The checkpoint that was loaded can be accessed via getLastCheckpoint.

Implementation notes:

Uses 4 keys in the options dictionary:

  • "writing checkpoint": index of checkpoint that might be complete

  • "writing_checkpoint_path": absolute path of checkpoint

  • "most_recent_checkpoint": index of checkpoint that is definitely complete, but might be older than the above

  • "most_recent_checkpoint_path": absolute path of checkpoint

This system ensures it is possible to restart even if the code is killed in the middle of writing a checkpoint.

The Checkpointer object is the same on all processes.
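The restart decision implied by these four keys can be sketched as below. This is an illustration under stated assumptions, not the Utils implementation: the function names and the flag file name "flag.dat" are hypothetical, and the sketch only tests for the existence of the flag file rather than checking its contents. The option keys are spelled as listed above.

```julia
# Hypothetical sketch of the restart decision. The flag file name and the
# function names are assumptions, NOT the Utils implementation.
const TOY_FLAG_NAME = "flag.dat"

toy_has_flag(dir::AbstractString) = isfile(joinpath(dir, TOY_FLAG_NAME))

function choose_restart_checkpoint(opts::Dict)
  # prefer the checkpoint that was being written, if it was finished
  if toy_has_flag(opts["writing_checkpoint_path"])
    return opts["writing checkpoint"], opts["writing_checkpoint_path"]
  end
  # otherwise fall back to the checkpoint that is known to be complete
  return opts["most_recent_checkpoint"], opts["most_recent_checkpoint_path"]
end

# demonstration with temporary directories
root = mktempdir()
dir1 = joinpath(root, "checkpoint1"); mkpath(dir1)
dir2 = joinpath(root, "checkpoint2"); mkpath(dir2)
touch(joinpath(dir1, TOY_FLAG_NAME))  # checkpoint 1 finished, checkpoint 2 did not

opts = Dict{String, Any}(
  "writing checkpoint" => 2,
  "writing_checkpoint_path" => dir2,
  "most_recent_checkpoint" => 1,
  "most_recent_checkpoint_path" => dir1)

idx, path = choose_restart_checkpoint(opts)
```

Here the code was "killed" while writing checkpoint 2, so the sketch falls back to checkpoint 1. Had the flag file been present in the second directory, checkpoint 2 would have been chosen instead.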

Base.copy (Method)

Recursively copies all fields to make a new Checkpointer object. Note that changes to one Checkpointer object will not affect the other (i.e. the record of which checkpoints are used and which are not). This could easily lead to corrupting a checkpoint. For this reason, this function should rarely be used.

Base.copy! (Method)

Two-argument version of copy(). See that function for details.

Every function that wants to checkpoint must implement an AbstractCheckpointData containing all the data that needs to be loaded when the code restarts. The values in the AbstractCheckpointData can be different for each MPI process.

This type must contain only "julia" data, i.e. objects that are managed by Julia and have deterministic values. No pointers, MPI communicators, etc.

File IO is expensive, so include only as much data as needed in the AbstractCheckpointData. Anything that can be recalculated should be.

Generally, only the outermost function (i.e. the time stepper, or the nonlinear solver for steady problems) will need to checkpoint.

There are no required fields for this type.

Functions

These functions provide the basic operations required for checkpointing.

saveNextFreeCheckpoint saves to the next free checkpoint.

Inputs

  • checkpointer: the Checkpointer object

  • mesh: an AbstractMesh

  • sbp: an SBP operator

  • eqn: an AbstractSolutionData object

  • checkpoint_data: the user's AbstractCheckpointData implementation

Inputs/Outputs

  • opts: the options dictionary

Outputs

  • checkpoint: the index of the checkpoint saved.

loadLastCheckpoint loads the most recently saved checkpoint.

Inputs

  • checkpointer: the Checkpointer object

  • mesh: an AbstractMesh

  • sbp: an SBP operator

  • opts: the options dictionary

Inputs/Outputs

  • eqn: an AbstractSolutionData object

Outputs

  • the checkpoint index that was loaded

readLastCheckpointData is a simple wrapper around readCheckpointData for loading the most recently saved checkpoint data.

Inputs

  • chkpointer: a Checkpointer

  • comm_rank: MPI rank of this process

readCheckpointData reads an object of type AbstractCheckpointData from a file and returns the object. This is type unstable, so every AbstractCheckpointData implementation should create a constructor that calls this function and uses a type assertion to specify that the object must be of its concrete type.

Inputs

  • chkpointer: a Checkpointer

  • chkpoint: the index of the checkpoint

  • comm_rank: the MPI rank of this process

Outputs

  • the object read from the file
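The type assertion pattern recommended here can be illustrated with a self-contained sketch. Everything in it is a stand-in: ToyCheckpointData plays the role of AbstractCheckpointData, toy_read_data plays the role of the type-unstable reader, and the Serialization standard library is used only as an assumed file format (the format Utils actually uses is not described here). It is written for current Julia syntax (mutable struct rather than type).

```julia
using Serialization   # standard library; an assumed stand-in for the real file format

abstract type ToyCheckpointData end   # stand-in for AbstractCheckpointData

mutable struct ToyStepData <: ToyCheckpointData
  step::Int
end

# type-unstable reader: the return type depends on the contents of the file
function toy_read_data(fname::AbstractString)
  return open(deserialize, fname)
end

# constructor with a type assertion, so callers receive a concrete type
# (and get an error immediately if the file holds something else)
function ToyStepData(fname::AbstractString)
  obj = toy_read_data(fname)
  return obj::ToyStepData
end

# write an object, then read it back through the asserting constructor
fname = joinpath(mktempdir(), "data.bin")
open(io -> serialize(io, ToyStepData(42)), fname, "w")
data = ToyStepData(fname)
```

The assertion gives the compiler a concrete return type for the constructor, so code that uses data.step afterwards does not inherit the instability of the reader.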

countFreeCheckpoints returns the number of free checkpoints.

Inputs

  • checkpointer: the Checkpointer object

This function returns the index of the next free checkpoint, or 0 if there are no free checkpoints.

Inputs

  • checkpointer: a Checkpointer

getLastCheckpoint returns the index of the most recently written checkpoint.

Inputs

  • checkpointer: the Checkpointer object

Outputs

  • the index of the most recently saved checkpoint

This function returns the least recently written checkpoint.

Inputs

  • checkpointer: the Checkpointer

Outputs

  • the index of the least recently written checkpoint

freeOldestCheckpoint frees the least recently written checkpoint. Calling this function when all checkpoints are free is allowed.

Inputs

  • checkpointer: the Checkpointer

Outputs

  • returns the checkpoint freed (0 if all checkpoints were free on entry)

Utils.freeCheckpoint (Function)

This function frees a checkpoint (marks it as available to be overwritten). Unlike certain algorithms that use free lists (cough, cough, malloc), freeing an already free checkpoint is allowed.

Inputs

  • checkpointer: the Checkpointer object

  • checkpoint: the index of the checkpoint to free

The user must explicitly free checkpoints (loading a checkpoint does not free it). Note that a free checkpoint can still be loaded for restarting if it has not been saved to yet.
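The used/free bookkeeping behind the functions above can be pictured with a toy model. This is a sketch under assumptions, not the Utils implementation: all names are hypothetical, the integer values chosen for "free" and "used" are arbitrary, and whether the real code keeps freed checkpoints in its history is not stated in this document (the toy keeps them). It is written for current Julia syntax.

```julia
# Toy model of the bookkeeping only. All names are hypothetical and this is
# NOT the Utils implementation.
const TOY_FREE = 0
const TOY_USED = 1

mutable struct ToyCheckpointer
  status::Vector{Int}    # TOY_FREE or TOY_USED for each checkpoint
  history::Vector{Int}   # most recently used first, unused entries are -1
end

ToyCheckpointer(n::Integer) = ToyCheckpointer(fill(TOY_FREE, n), fill(-1, n))

# index of the first free checkpoint, or 0 if none are free
function toy_next_free(c::ToyCheckpointer)
  idx = findfirst(==(TOY_FREE), c.status)
  return idx === nothing ? 0 : idx
end

# mark a checkpoint as used and move it to the front of the history
function toy_mark_used!(c::ToyCheckpointer, i::Integer)
  c.status[i] = TOY_USED
  filter!(x -> x != i && x != -1, c.history)
  pushfirst!(c.history, i)
  while length(c.history) < length(c.status)
    push!(c.history, -1)             # pad unused entries with -1
  end
  return i
end

# freeing an already free checkpoint is allowed (it is a no-op)
function toy_free!(c::ToyCheckpointer, i::Integer)
  c.status[i] = TOY_FREE
  return nothing
end

# free the least recently used checkpoint that is still marked used;
# returns its index, or 0 if all checkpoints were already free
function toy_free_oldest!(c::ToyCheckpointer)
  for i in reverse(c.history)
    if i != -1 && c.status[i] == TOY_USED
      toy_free!(c, i)
      return i
    end
  end
  return 0
end

c = ToyCheckpointer(2)
toy_mark_used!(c, toy_next_free(c))   # uses checkpoint 1
toy_mark_used!(c, toy_next_free(c))   # uses checkpoint 2
none_free = toy_next_free(c)          # no free checkpoints remain
freed = toy_free_oldest!(c)           # frees checkpoint 1, the older one
```

With two checkpoints, alternating between "free the oldest" and "save to the next free" means the newer checkpoint is never touched while the older one is being overwritten.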

Example

This section provides a simple example of how to use checkpointing for an explicit time marching scheme. The original scheme (without checkpointing) looks like:

function explicit_timemarching_example(mesh, sbp, eqn, opts, func::Function)

  delta_t = opts["delta_t"]
  nsteps = round(Int, opts["tmax"]/delta_t)
  t = 0.0

  for step=1:nsteps

    t = (step - 1)*delta_t
    # q_vec -> q
    array1DTo3D(mesh, sbp, eqn, opts, eqn.q, eqn.q_vec)
    func(mesh, sbp, eqn, opts, t)

    # res -> res_vec
    array3DTo1D(mesh, sbp, eqn, opts, eqn.res, eqn.res_vec)

    for i=1:mesh.numDof
      eqn.q_vec[i] = ... # some function of eqn.res_vec[i]
    end
  end  # end loop over timesteps

  return nothing
end

With checkpointing, it becomes:


type ExampleCheckpointData <: AbstractCheckpointData
  step::Int  # for this method, the only thing that needs to be stored is the
             # current time step, everything else can be recalculated
end

# this constructor is used when restarting, see below
function ExampleCheckpointData(chkpointer::Checkpointer, comm_rank)

  chkpoint_data = readLastCheckpointData(chkpointer, comm_rank)

  return chkpoint_data::ExampleCheckpointData
end

function explicit_timemarching_example(mesh, sbp, eqn, opts, func::Function)

  delta_t = opts["delta_t"]
  nsteps = round(Int, opts["tmax"]/delta_t)
  t = 0.0

  if !opts["is_restart"]
    # regular starting of a run
    stepstart = 1
    chkpointdata = ExampleCheckpointData(stepstart)
    # this is valid even if we don't intend to checkpoint (ncheckpoints = 0)
    chkpointer = Checkpointer(mesh.myrank, opts["ncheckpoints"])
    skip_checkpoint = false
  else
    chkpointer = Checkpointer(opts, mesh.myrank)  # read Checkpointer from most
                                                  # recent checkpoint
    # now get the checkpoint data telling which timestep we are on
    # eqn.q_vec already holds the state saved in the checkpoint.  This is handled
    # during the initialization of the equation object
    chkpointdata = ExampleCheckpointData(chkpointer, mesh.myrank)
    stepstart = chkpointdata.step
    skip_checkpoint = true  # skip writing the first checkpoint.  Without this, a
                            # checkpoint would be written immediately.
                            # Writing the checkpoint is not harmful, but not useful
                            # either.
  end

  for step=stepstart:nsteps  # start the loop from stepstart

    t = (step - 1)*delta_t   # t is recalculated from step, which was set using
                             # stepstart, thus only stepstart needs to be saved

    # save checkpoint, if needed
    if opts["use_checkpointing"] && step % opts["checkpoint_freq"] == 0
      if skip_checkpoint
        # first checkpoint step after a restart: skip the write once, then
        # resume normal checkpointing
        skip_checkpoint = false
      else
        # save all needed variables to the ExampleCheckpointData object
        # For simple time marching schemes, step is the only needed data
        chkpointdata.step = step

        if countFreeCheckpoints(chkpointer) == 0
          freeOldestCheckpoint(chkpointer)  # make room for a new checkpoint
        end

        saveNextFreeCheckpoint(chkpointer, mesh, sbp, eqn, opts, chkpointdata)
      end
    end

    # q_vec -> q
    array1DTo3D(mesh, sbp, eqn, opts, eqn.q, eqn.q_vec)
    func(mesh, sbp, eqn, opts, t)

    # res -> res_vec
    array3DTo1D(mesh, sbp, eqn, opts, eqn.res, eqn.res_vec)

    for i=1:mesh.numDof
      eqn.q_vec[i] = ... # some function of eqn.res_vec[i]
    end
  end  # end loop over timesteps

  return nothing
end

Internal Functions

The internal functions used for checkpointing are documented here. Users should not call these functions directly. Improper use can cause checkpoint corruption.

Call on master process only

Does not check if the checkpoint is already used (for reading the state back after restart, this checkpoint should already be marked as used).

Utils.saveCheckpoint (Function)

Save to a specified checkpoint. Throw error if checkpoint is not free. Users should not generally call this function directly. Instead, they should prefer saveNextFreeCheckpoint.

This function automatically saves eqn.q_vec and the checkpointer to a file. Any additional data should be in checkpoint_data.

Note: mesh adaptation is not compatible with checkpointing. TODO: add a field to the mesh to record that it has been modified.

Inputs

  • checkpointer: the Checkpointer

  • checkpoint: the index of the checkpoint

  • mesh: an AbstractMesh object

  • sbp: SBP operator

  • eqn: an AbstractSolutionData

  • checkpoint_data: an AbstractCheckpointData. This is the random bag of data the user needs saved.

Inputs/Outputs

  • opts: options dictionary

Implementation Notes:

Uses the options dictionary keys described by Checkpointer. Note that the checkpoint is eagerly marked as used, before the checkpoint has finished writing. Upon restart, the code needs to check whether this checkpoint is really finished.

Utils.loadCheckpoint (Function)

Loads a specified checkpoint. Only loads eqn.q_vec. Does not load the checkpointer (because this is loading a checkpoint and not a restart). Also does not load the AbstractCheckpointData. See readCheckpointData for loading of the AbstractCheckpointData. Users should generally not call this function directly. Users should prefer loadLastCheckpoint whenever possible.

Note that loading a checkpoint does not mark it as free. Users must explicitly call freeCheckpoint.

Inputs

  • checkpointer: the Checkpointer object

  • checkpoint: the index of the checkpoint

  • mesh: an AbstractMesh object

  • sbp: the SBP operator

  • opts: the options dictionary

Inputs/Outputs

  • eqn: an AbstractSolutionData, eqn.q_vec is overwritten

Utils.writeFlagFile (Function)

Write the flag file that signifies a checkpoint is complete. Call on master process only!

Inputs

  • checkpointer: the Checkpointer object

  • checkpoint: the checkpoint index
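One way to picture how a flag file makes an interrupted write detectable is a "flag file last" protocol, sketched below. This is an assumption about the general technique, not a description of what saveCheckpoint and writeFlagFile actually do internally: the file names "flag.dat" and "q.dat" and both function names are hypothetical.

```julia
# Hypothetical sketch of a "flag file last" write protocol. File names and
# function names are assumptions, NOT the Utils implementation.
function toy_write_checkpoint(dir::AbstractString, q::Vector{Float64})
  mkpath(dir)
  flag = joinpath(dir, "flag.dat")
  rm(flag; force=true)               # invalidate the old contents first
  open(joinpath(dir, "q.dat"), "w") do io
    write(io, q)                     # write the (possibly large) state
  end
  touch(flag)                        # the flag appears only after the data is written
  return nothing
end

# a checkpoint directory without a flag file is treated as incomplete
toy_is_complete(dir::AbstractString) = isfile(joinpath(dir, "flag.dat"))

dir = joinpath(mktempdir(), "checkpoint1")
before = toy_is_complete(dir)        # nothing has been written yet
toy_write_checkpoint(dir, [1.0, 2.0])
after = toy_is_complete(dir)
```

If the process dies anywhere between rm and touch, the directory has no flag file, so a restart can recognize the checkpoint as unusable and fall back to an older one.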

Utils.checkFlagFile (Function)

Returns true if the flag file exists and is consistent, returns false otherwise.

Inputs

  • checkpointer: the Checkpointer object

  • checkpoint: the checkpoint index

Sometimes the flag file needs to be checked before the Checkpointer is available. See the other method for details.

Inputs

  • fpath: path to the checkpoint directory

Utils.deleteFlagFile (Function)

Deletes the flag file if it exists. Does not error if the flag file does not exist. Call on master process only.

Inputs

  • checkpointer: the Checkpointer object

  • checkpoint: the checkpoint index

Writes the AbstractCheckpointData to a file. The file can be read by readCheckpointData.

Inputs

  • checkpointer: the Checkpointer object

  • checkpoint: the checkpoint index

  • obj: the AbstractCheckpointData object

  • comm_rank: the MPI rank of this process

Marks a checkpoint as unused.

Inputs

  • checkpointer: the Checkpointer object

  • checkpoint: the checkpoint index
