Why are we obsessed with Big Data, when small data is hiding in plain sight?

April 27, 2023

One of the problems in digital technology is that a common language separates different approaches. In a previous article, I discussed the use of the term “small data”:

My use of the term refers to the apps and models, mostly in the realm of domain experts, that gather and derive data for planning, forecasting, and examination that ML may inform, but separate from it.

In the gravity distortion field of AI, Data Fabric/Data Mesh, Native Cloud, Modern Data Stack, and Data Lake/Data Lakehouse, there are dozens of new categories of data tools and hundreds of new companies. Yet amid all the focus on big data and large-scale analytics, small-scale analytics gets little attention.

The small scale deserves attention too. While big data analytics may be necessary for some applications, small data analytics is equally important for insight and informed decision-making by domain experts in various fields.

I checked in with some colleagues to assess their points of view on the big data versus small data issue. Their comments and my response are included below, but first some definitions.

“Small data” is a fungible term with at least three definitions:

1. Causal emergence
2. Transfer learning: addressing smaller volumes of data in machine learning models, and the core issues of working with small data sets
3. Data that is small in the first place, such as observations at a human scale

1. Causal Emergence
The (controversial) theory of causal emergence challenges the prevailing approach in experimental science in general, and data science in particular: that more data is better (reductionism). More data may be the answer to some questions, but others may yield valuable inference at a higher or aggregated level. The industry is enamored with the reductionist/complexity approach.

This quote from some respected but misled researchers summarizes succinctly what I believe is very wrong about our rush to data-driven decisions:

Massive amounts of data and applied mathematics replace every other tool that might be brought to bear. Out with every theory of human behavior, from linguistics to sociology. Forget taxonomy, ontology, and psychology. Who knows why people do what they do? The point is they do it, and we can track and measure it with unprecedented fidelity. With enough data, the numbers speak for themselves. [Katsikopoulos, K., & Canellas, M. (2022). Decoding human behavior with big data? Critical, constructive input from the decision sciences. AI Magazine, 43(1), 126.]

As early as 2008, in The End of Theory: The Data Deluge Makes the Scientific Method Obsolete, Chris Anderson, editor-in-chief of Wired, announced:

It's time to ask: 'What can science learn from Google?' The new availability of vast amounts of data, along with the statistical tools to crunch these numbers, offers a whole new way of understanding the world. Correlation supersedes causation, and science can advance even without coherent models, unified theories, or really any mechanistic explanation at all.

Judea Pearl, Turing Award winner, author of “The Book of Why” and inventor of Bayesian Networks, counters:

Solving probability problems takes exponential space and time. Bayes nets regain polynomial space and time plus they are much more transparent.

While not all data science and AI deal with probability, machine learning is iterative and processes enormous sets of data, which makes it a more expensive and less transparent approach than a Bayesian network. A Bayesian network can itself become quite large, but it remains small relative to what a machine learning model consumes.
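To make the space side of Pearl's point concrete, here is a minimal sketch that counts the probabilities needed to describe a set of variables with and without a network structure. It assumes binary variables and a hypothetical chain-shaped network (the names X0 to X19 are invented for illustration), and it says nothing about inference time, which depends on the structure of the network.

```python
def full_joint_parameters(n_vars: int) -> int:
    """Independent probabilities in a full joint table over n binary variables."""
    return 2 ** n_vars - 1


def bayes_net_parameters(parents: dict[str, list[str]]) -> int:
    """One probability per node for each combination of its parents' values
    (binary variables assumed)."""
    return sum(2 ** len(node_parents) for node_parents in parents.values())


# Hypothetical network: 20 binary variables in a chain, each depending only
# on the one before it. The structure is an assumption for illustration.
chain = {f"X{i}": ([f"X{i - 1}"] if i > 0 else []) for i in range(20)}

print(full_joint_parameters(20))    # 1048575
print(bayes_net_parameters(chain))  # 39
```

The gap comes entirely from the independence assumptions encoded in the graph: the fewer parents each node has, the smaller the tables. A densely connected network loses most of the saving.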

The uproar over the current state of LLMs is beginning to shine a light on the limitations of finding regularities in large data sets. We are all seeing the fallacy in Anderson's thinking on this matter. Reasoning by association is no substitute for reasoning from a model of reality. Conceptual models of reality have to be programmed by humans.

2. Transfer learning
A very technical meaning of small data refers to transfer learning. Transfer learning uses feature weights and representations from a pre-trained model to pose new questions without retraining the whole model. You bring smaller data. Transfer learning is practical when there is little data about the subject to study but tons of it on a related problem. This is not the small data issue I want to cover here.
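For readers who want to see the mechanics, the idea can be sketched in a few lines of plain Python. This is a toy under stated assumptions, not how any particular framework does it: the "pre-trained" weights are invented numbers standing in for a layer learned elsewhere on a large related data set, and only a small logistic-regression head is fitted on four labelled examples.

```python
import math

# Hypothetical "pre-trained" weights, standing in for a feature extractor
# learned elsewhere on a large, related data set. They are never updated here.
PRETRAINED_WEIGHTS = ((1.0, -1.0), (0.5, 0.5))


def frozen_features(x):
    """Apply the frozen pre-trained layer to one input vector."""
    return [math.tanh(sum(w * xi for w, xi in zip(row, x)))
            for row in PRETRAINED_WEIGHTS]


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def train_head(small_data, epochs=200, lr=0.5):
    """Fit only a logistic-regression head on top of the frozen features."""
    weights, bias = [0.0, 0.0], 0.0
    for _ in range(epochs):
        for x, label in small_data:
            feats = frozen_features(x)
            score = sum(w * f for w, f in zip(weights, feats)) + bias
            error = sigmoid(score) - label
            weights = [w - lr * error * f for w, f in zip(weights, feats)]
            bias -= lr * error
    return weights, bias


def predict(head, x):
    """Return 1 or 0 using the frozen features and the trained head."""
    weights, bias = head
    feats = frozen_features(x)
    return int(sum(w * f for w, f in zip(weights, feats)) + bias > 0)


# The "smaller data" you bring: four examples, labelled 1 when x[0] > x[1].
small_data = [((2.0, 0.0), 1), ((0.0, 2.0), 0),
              ((1.0, -1.0), 1), ((-1.0, 1.0), 0)]
head = train_head(small_data)
print(predict(head, (2.0, -2.0)), predict(head, (-3.0, 3.0)))
```

Only three numbers (two head weights and a bias) are learned from the small data set; everything else is reused as-is.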

3. Small in the first place
Big data gets all the hype. My use of the term “small data” refers to the apps and models, mainly in the realm of domain experts, that gather and derive data for planning, forecasting, examination and decision-making. Writeback and actual modeling logic distinguish it from most BI. Small data is perceived as inadequate for today's in-vogue algorithms. But by overlooking small data, are enterprises missing an excellent source of insight?

Don't take my word for it - a small data sanity check

I asked some of my colleagues to give me a sanity check on this notion, asking if they shared my conviction that small data deserves more attention (and funding) than it gets, and here were their comments:

Tom Davenport, a professor, thought leader and prolific author:

There is just as much potential value in small data as in big data. It's not the size that matters but rather the fit between the data and the business problem it is intended to help solve.

The question of which approach fits a particular business problem requires a nuanced answer. Instead of choosing one approach over the other, both big data and small data should be seen as part of a portfolio of tools. For example, a data science or AI model may be used to generate detailed time series and geospatial maps of household formations. The results of such analysis can then be collected in aggregate form by small data models to experiment with new infrastructure construction for an energy company. In this way, big and small data analytics complement each other and can be used together to solve complex business problems.

Small data is the original source of data, models and metrics collected and derived by groups that fill in the blanks left by IT, such as reporting, specialized modeling, and data capture from external sources. It also covers analytics, especially budgets, planning and other aspects of what has come to be called Business Performance Management (BPM) and BI. These areas have been historically neglected by IT and enterprise systems software vendors, who focus more on good architecture and scalability, often to the exclusion of functionality at the knowledge worker level. Organizations are replete with valuable data on mobile devices, workstations, and departmental servers collected outside the realm of enterprise data architecture and enriched by models and ingested external data sources, perhaps in megabytes or gigabytes.

Dave Kellogg, advisor, director, consultant, angel investor, and blogger focused on enterprise software startups, offers some insight from his experience with technology in the small data realm (not exclusively):

They say, in today's world, data beats algorithm, and that's true when it comes to big data. Small data is the dark matter of the data world, it's everywhere but unless you go looking for it, you can't see it. It's tucked away in spreadsheets and local databases that can often be invisible to the enterprise. Rare is the corporate budget that reaches all the way down into this detail … which sadly is where the real optimizations can be done. I sometimes call that the buckets of money problem.

A real-life example: an insurance commissioner bans using FICO scores in auto insurance rating. That sends the actuaries scrambling to build quick models before making formal filings. Or, you learn that a competitor revamps their pricing, and you quickly create scenarios of possible responses. Perhaps your division just reorganized, and you must promptly redraw the departmental responsibility budgets without altering the totals.
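The last of these scenarios is a typical small data model: a few lines of logic over a handful of numbers. As a rough sketch, assuming the new split is driven by some weight per department (the department names, the headcount weights, and the total below are all hypothetical), the budget can be redistributed in whole cents so that the total is preserved exactly:

```python
def reallocate(total_cents: int, weights: dict[str, float]) -> dict[str, int]:
    """Split total_cents across departments in proportion to weights,
    handing leftover cents to the largest fractional remainders so the
    allocations always sum to the original total."""
    weight_sum = sum(weights.values())
    exact = {dept: total_cents * w / weight_sum for dept, w in weights.items()}
    alloc = {dept: int(value) for dept, value in exact.items()}  # round down
    leftover = total_cents - sum(alloc.values())
    by_remainder = sorted(exact, key=lambda d: exact[d] - alloc[d], reverse=True)
    for dept in by_remainder[:leftover]:
        alloc[dept] += 1
    return alloc


# Hypothetical reorganized division: weights are headcounts, total is $10,000.
new_budget = reallocate(1_000_000, {"Underwriting": 3, "Claims": 3, "Actuarial": 1})
print(new_budget, sum(new_budget.values()))
```

Working in integer cents avoids the rounding drift that creeps in when the same calculation is done with floating-point dollars in a spreadsheet.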

David Menninger, SVP & Research Director of Technology Research, Ventana Research, offers an industry view:

By my count there are only five BI vendors that offer these types of capabilities (with varying degrees of integration). I use the term portfolio decision making (looking across a collection of decisions) to contrast it with transactional decision making (making an individual decision such as fraud or not; best next offer, etc.). It seems like all the focus is on the latter and I can't understand how you can have one without the other.

I don't, either.

My take

Our fascination with our punctuated progress with AI coupled with almost limitless computing resources does not portend the end of science. The above quote, “With enough data, the numbers speak for themselves,” is just wrong. Data doesn’t speak for itself.

Over twenty-five years ago, I read something that Tom Davenport wrote in what was the very first issue of Fast Company, and it stuck with me. From The Fad that Forgot People:

When the Next Big Thing in management hits, try to remember the lessons of Re-engineering. Don't drop all your ongoing approaches to change in favor of the handsome newcomer. Don't listen to the new approach's most charismatic advocates, but only to the most reasoned. Talk softly about what you're doing and carry a big ruler to measure real results.
