Exploration and Explanation in Computational Notebooks

ABSTRACT

Computational notebooks combine code, visualizations, and text in a single document. Researchers, data analysts, and even journalists are rapidly adopting this new medium. We present three studies of how they are using notebooks to document and share exploratory data analyses. In the first, we analyzed over 1 million computational notebooks on GitHub, finding that one in four had no explanatory text but consisted entirely of visualizations or code. In a second study, we examined over 200 academic computational notebooks, finding that although the vast majority described methods, only a minority discussed reasoning or results. In a third study, we interviewed 15 academic data analysts, finding that most considered computational notebooks personal, exploratory, and messy. Importantly, they typically used other media to share analyses. These studies demonstrate a tension between exploration and explanation in constructing and sharing computational notebooks. We conclude with opportunities to encourage explanation in computational media without hindering exploration.

