Recoverability / Repeatability

Can execution resume from halfway through?

Workflow Steps
[Diagram: 📤 Upload App → 🛢️ Create DB → 📥 Download App → 🔌 Link App to DB → 💽 Run App → 📫 Allocate Domain → 📪 Attach Domain → 📣 Notify Completion; 🧑 "OOPS"]

Maturity Levels

🧹 Needs manual clean-up, then a restart from scratch.

Usually takes effort to:

  • Find out what needs cleaning up.
  • Figure out how to clean up, and do it.

This will happen again.

๐Ÿ” Restart from scratch.

Tooling can automatically clean up for you.

Maybe you want to keep the failing environment around for investigation.
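A minimal sketch of automatic clean-up, assuming the tooling records each resource as it creates it. `CleanUpTracker`, `record`, and `clean_up` are hypothetical names; deleting in reverse creation order is one reasonable design choice, not a rule. The `keep_for_investigation` flag covers keeping the failed environment around.

```rust
/// A resource created during a run (hypothetical, for illustration).
#[derive(Debug)]
struct Resource {
    name: String,
}

/// Hypothetical tracker: records every resource as it is created, so
/// nobody has to work out afterwards what needs cleaning up.
#[derive(Debug, Default)]
struct CleanUpTracker {
    created: Vec<Resource>,
}

impl CleanUpTracker {
    fn record(&mut self, name: &str) {
        self.created.push(Resource { name: name.to_string() });
    }

    /// Deletes resources in reverse creation order, since later resources
    /// may depend on earlier ones. With `keep_for_investigation`, nothing
    /// is deleted; the resources are only listed.
    fn clean_up(&mut self, keep_for_investigation: bool) -> Vec<String> {
        if keep_for_investigation {
            return self
                .created
                .iter()
                .map(|r| format!("kept {}", r.name))
                .collect();
        }
        let mut log = Vec::new();
        while let Some(resource) = self.created.pop() {
            // A real tool would call the provider's delete API here.
            log.push(format!("deleted {}", resource.name));
        }
        log
    }
}

fn main() {
    let mut tracker = CleanUpTracker::default();
    tracker.record("db");
    tracker.record("app");
    tracker.record("domain");
    // Pretend a later step failed: clean up everything automatically.
    for line in tracker.clean_up(false) {
        println!("{line}"); // deleted domain, deleted app, deleted db
    }
}
```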

โ™ป๏ธ Reuse and recycle
[Diagram: … → 📪 Attach Domain → 📣 Notify Completion; 🧑 "YAY"]
  • If it's already done, we won't do it again.

  • Replace / update existing resources.

  • Dependency updates must be propagated.

    • Re-download file if it changed.
    • Restart the web application if configuration changed.

This means if we fail 90% (2 hours) into the process, we can resume from that point instead of spending another 2 hours getting back there.
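A sketch of "reuse and recycle", assuming each step remembers a fingerprint of the input it last processed. `Step`, `ensure`, and `fingerprint` are hypothetical names. A step skips its work when its input is unchanged, and a dependent step takes the upstream fingerprint as its own input, so updates propagate.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// Fingerprint of a step's input (file bytes, config, ...).
/// DefaultHasher keeps the sketch dependency-free; it is not guaranteed
/// stable across Rust releases, so persisted state would want a stable
/// content hash instead.
fn fingerprint(bytes: &[u8]) -> u64 {
    let mut hasher = DefaultHasher::new();
    bytes.hash(&mut hasher);
    hasher.finish()
}

/// Hypothetical step state: the fingerprint of the input it last processed.
#[derive(Debug, Default)]
struct Step {
    last_input: Option<u64>,
    runs: u32,
}

impl Step {
    /// Does the work only if the step has never run or its input changed.
    /// Returns whether work was done.
    fn ensure(&mut self, input: &[u8]) -> bool {
        let current = fingerprint(input);
        if self.last_input == Some(current) {
            return false; // already done for this input: skip it
        }
        self.runs += 1; // stand-in for the real work (download, restart, ...)
        self.last_input = Some(current);
        true
    }
}

fn main() {
    let mut download = Step::default();
    let mut restart_app = Step::default();

    // Three runs: the same app file twice, then an updated file.
    for app_zip in [b"v1", b"v1", b"v2"] {
        download.ensure(app_zip);
        // The restart step's input is the download's fingerprint, so a
        // changed file propagates to a restart and an unchanged one does not.
        let upstream = download.last_input.expect("download has run");
        restart_app.ensure(&upstream.to_le_bytes());
    }
    // Prints: downloads: 2, restarts: 2
    println!("downloads: {}, restarts: {}", download.runs, restart_app.runs);
}
```

Running with the same file twice does the work once; the updated file triggers both a re-download and a restart.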

API Implications

  • Implementors should provide a "check" function, and a "do it" function.
  • The check function checks if the task is in the desired state.
  • If not, the do it function is called.
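The check / "do it" split can be sketched as a trait. `Task`, `check`, `do_it`, and `ensure` are hypothetical names chosen for illustration, not choochoo's actual API.

```rust
/// Hypothetical trait an implementor fills in (not choochoo's API).
trait Task {
    /// Returns true if the task is already in the desired state.
    fn check(&self) -> bool;
    /// Does the work to reach the desired state.
    fn do_it(&mut self);
}

/// Calls `do_it` only if `check` reports the desired state is not reached.
/// Returns whether any work was done.
fn ensure<T: Task>(task: &mut T) -> bool {
    if task.check() {
        return false;
    }
    task.do_it();
    true
}

/// Hypothetical "create DB" task, backed by an in-memory flag.
struct CreateDb {
    exists: bool,
}

impl Task for CreateDb {
    fn check(&self) -> bool {
        self.exists
    }

    fn do_it(&mut self) {
        // A real task would call a database provider's API here.
        self.exists = true;
    }
}

fn main() {
    let mut db = CreateDb { exists: false };
    println!("first run did work: {}", ensure(&mut db)); // true
    println!("second run did work: {}", ensure(&mut db)); // false
}
```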

choochoo:

  • Requires a visit_fn ("do it" function).
  • check_fn is run if present; otherwise the visit_fn always runs.
  • Bonus: the check function is also run after the visit function, to detect whether the visit logic actually reached the desired state.
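A sketch of the behaviour described above: `visit_fn` is required, `check_fn` is optional, and when present the check runs again after the visit. `Station`, `CheckStatus`, `Outcome`, and the example functions are illustrative names; this is not choochoo's real signature.

```rust
/// Hypothetical check result (illustrative names, not choochoo's API).
#[derive(Debug, Clone, Copy, PartialEq)]
enum CheckStatus {
    VisitRequired,
    VisitNotRequired,
}

/// What happened when a station was run.
#[derive(Debug, PartialEq)]
enum Outcome {
    /// check_fn reported the work was already done.
    Skipped,
    /// visit_fn ran (and check_fn, if present, confirmed the result).
    Visited,
    /// visit_fn ran, but check_fn still says a visit is required: a logic bug.
    VisitIneffective,
}

/// Hypothetical station: visit_fn is required, check_fn is optional.
struct Station<S> {
    visit_fn: fn(&mut S),
    check_fn: Option<fn(&S) -> CheckStatus>,
}

impl<S> Station<S> {
    fn run(&self, state: &mut S) -> Outcome {
        match self.check_fn {
            // No check_fn: always run visit_fn, nothing to verify afterwards.
            None => {
                (self.visit_fn)(state);
                Outcome::Visited
            }
            Some(check) => {
                if check(state) == CheckStatus::VisitNotRequired {
                    return Outcome::Skipped;
                }
                (self.visit_fn)(state);
                // Bonus: re-check to catch a visit_fn that did not reach
                // the desired state.
                match check(state) {
                    CheckStatus::VisitNotRequired => Outcome::Visited,
                    CheckStatus::VisitRequired => Outcome::VisitIneffective,
                }
            }
        }
    }
}

fn mark_uploaded(uploaded: &mut bool) {
    *uploaded = true;
}

/// A buggy visit function that forgets to do its job.
fn forget_to_upload(_uploaded: &mut bool) {}

fn check_uploaded(uploaded: &bool) -> CheckStatus {
    if *uploaded {
        CheckStatus::VisitNotRequired
    } else {
        CheckStatus::VisitRequired
    }
}

fn main() {
    let upload: Station<bool> = Station {
        visit_fn: mark_uploaded,
        check_fn: Some(check_uploaded),
    };
    let mut uploaded = false;
    println!("{:?}", upload.run(&mut uploaded)); // Visited
    println!("{:?}", upload.run(&mut uploaded)); // Skipped

    let buggy: Station<bool> = Station {
        visit_fn: forget_to_upload,
        check_fn: Some(check_uploaded),
    };
    println!("{:?}", buggy.run(&mut false)); // VisitIneffective
}
```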

Live Demo

  • Run it twice.

    time ./target/release/examples/demo

  • Replay last 2 steps.

    rm -rf /tmp/choochoo/demo/station_{g,h}

  • Update the file.
  • 514 KB, 1 second

    for i in {1..200000}; do printf "application contents $i"; done | gzip -f > app.zip
    
  • 5 MB, 7 seconds

    for i in {1..2000000}; do printf "application contents $i"; done | gzip -f > app.zip