TY - GEN
T1 - Runtime provenance refinement for notebooks
AU - Deo, Nachiket
AU - Glavic, Boris
AU - Kennedy, Oliver
N1 - Publisher Copyright: © 2022 Owner/Author.
PY - 2022/6/12
Y1 - 2022/6/12
N2 - Computational notebooks (e.g., Jupyter or Apache Zeppelin) have become a popular choice for data exploration, preparation, and ETL. Notebooks are more suited for interactive development of data pipelines than classical workflow systems, because they provide immediate feedback for the results of a computation and do not require the full computation to be specified upfront. However, the notebook model suffers from poor reproducibility, does not support automatic incremental re-evaluation of code when the code or inputs change, and does not allow for parallel execution of cells, all symptoms of its kernel-based evaluation strategy. We propose a new "workbook" model that combines the usability of notebooks with the provenance and parallel execution capabilities of workflow systems. This is made possible through a novel approach that refines a static approximation of provenance for Python code at runtime and a scheduler that dynamically adapts the execution order of cells based on data dependencies detected or refuted at runtime. We demonstrate the feasibility of this approach using a prototype implementation in our notebook engine Vizier.
AB - Computational notebooks (e.g., Jupyter or Apache Zeppelin) have become a popular choice for data exploration, preparation, and ETL. Notebooks are more suited for interactive development of data pipelines than classical workflow systems, because they provide immediate feedback for the results of a computation and do not require the full computation to be specified upfront. However, the notebook model suffers from poor reproducibility, does not support automatic incremental re-evaluation of code when the code or inputs change, and does not allow for parallel execution of cells, all symptoms of its kernel-based evaluation strategy. We propose a new "workbook" model that combines the usability of notebooks with the provenance and parallel execution capabilities of workflow systems. This is made possible through a novel approach that refines a static approximation of provenance for Python code at runtime and a scheduler that dynamically adapts the execution order of cells based on data dependencies detected or refuted at runtime. We demonstrate the feasibility of this approach using a prototype implementation in our notebook engine Vizier.
UR - https://www.scopus.com/pages/publications/85133798081
U2 - 10.1145/3530800.3534535
DO - 10.1145/3530800.3534535
M3 - Conference contribution
T3 - Proceedings of 14th International Workshop on the Theory and Practice of Provenance, TaPP 2022
SP - 44
EP - 47
BT - Proceedings of 14th International Workshop on the Theory and Practice of Provenance, TaPP 2022
PB - Association for Computing Machinery, Inc
T2 - 14th International Workshop on the Theory and Practice of Provenance, TaPP 2022, held in conjunction with SIGMOD 2022
Y2 - 17 June 2022 through 17 June 2022
ER -