Industrial Grade Data Science

Managing data science is hard. Data science projects have many opportunities to fail along the way, and the risks of such projects are not widely known, because data science is still a young field compared to software development. In this post, we will explore the major risks of data science projects and see approaches for controlling them.

Exploring risks of an average data science project


Photo by Tobias Tullius on Unsplash

Let’s first see how a data science project can appear in an average organization. It all starts with an enthusiast, either a software vendor or a company employee, who sees a way of changing the world and the businesses around them with data science. There are so many successful use cases on the Internet. Feeling that the company can benefit from this too, the enthusiast advises company management to look into data science. The management buys in and asks the internal IT or analytics team to look into how data science can make the business better. The team starts to look for large data sources inside the organization and generally finds one or several good databases. Then they think hard about how to apply data science and machine learning to improve the company. Most teams discover some dataset to which machine learning algorithms can be applied, so they go on with the task. In the end, such projects often conclude in one of the following ways:

The business does not understand what benefit it gets from the new system or algorithm. The results are bewildering for company management, who blame the data science team for the time and money spent on an unnecessary project.
The system goes into production. The KPI (Key Performance Indicator) of the business process drops suddenly after the team deploys the model. The company’s management is furious and blames the data science team for the company’s losses.

But how can we avoid failure in data science projects? To understand this, let’s dive into the major risks of data science projects.

Knowledge

The first major risk of any data science project is the availability and spread of knowledge. Decision-makers often lack basic expertise in data science. That leads to inflated expectations and incorrect problem statements. People start to talk about AI and how it will magically solve problems by looking into their databases and finding new profits in the data.

Goals

Lack of knowledge of what data science and machine learning can do leads to incorrect and vague problem statements. In reality, stating the problem in a solvable form is a critical step. As a rule of thumb, a correct problem statement makes up 80% of success in a data science project. The reason is that data science uses the scientific method and research processes to measure results. In general, data scientists look to improve some kind of metric, a formula for measuring the project’s performance. Vague goal definitions lead to incorrect solutions from the data science team. Data scientists are mostly data experts, not business consultants. The business should collaborate with the data team to create a problem statement that is worth the investment.
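To make the idea of a metric as a formula concrete, here is a minimal sketch of a business-facing metric for a hypothetical churn-prevention campaign. The scenario, the names (Outcome, campaign_profit) and the cost figures are assumptions made up for illustration, and the sketch simplifies by assuming that every contacted churner is retained.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Outcome:
    """One customer: what the model predicted and what really happened."""
    predicted_churn: bool
    actually_churned: bool


def campaign_profit(outcomes, offer_cost=5.0, saved_customer_value=50.0):
    """Net profit of sending a retention offer to every flagged customer.

    Hypothetical assumptions: each offer costs `offer_cost`, and each
    flagged customer who really would have churned is retained and
    brings `saved_customer_value`.
    """
    profit = 0.0
    for outcome in outcomes:
        if outcome.predicted_churn:
            profit -= offer_cost  # we pay for every offer we send
            if outcome.actually_churned:
                profit += saved_customer_value  # the offer saved a customer
    return profit


if __name__ == "__main__":
    history = [
        Outcome(predicted_churn=True, actually_churned=True),
        Outcome(predicted_churn=True, actually_churned=False),
        Outcome(predicted_churn=False, actually_churned=True),
    ]
    print(campaign_profit(history))
```

A metric like this gives both sides a shared number to discuss: the business can check whether the cost assumptions are realistic, and the data team knows what it is optimizing.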

Before starting a project, make sure that your idea:

Is attached to the business
Can be measured using a set of metrics
Can be presented to the business side in an understandable way
Has related data. No data means that you have a data collection project, not a data science project

Having covered the strategic risks, let’s turn to the execution risks.

Management approach

Project managers often resort to software development management methodologies for data science projects. This seems like a good start: after all, we are developing software. Yes, there may be a machine learning model deep inside, but why should it need a different management approach? In reality, management practices like Agile are a good start for managing data science projects, but they need extra tweaking and adaptation to serve their purpose.

The core problem is that Agile, like many management methodologies, focuses on handling external scope changes. For example, Scrum centers on a product owner collecting all requested changes in a product backlog, which is then ordered and refined. In data science projects, changes may come not only from outside the project team, but from the inside too. For example, the results of an experiment can change the modeling approach used in the project. This may force changes in the internal software architecture and major changes in the entire system. You should adapt the management approach for data science projects to handle such cases, as they are quite common. In particular, think about splitting the project into two integrated parts with separate backlogs. The research subproject should deal with modeling and data preprocessing, while the software subproject should cover the end-to-end solution and integrate the results from the research subproject.

Another interesting aspect of any data science project is testing. In good software projects, testing is an integral part of the process. It starts together with the development stage and continues throughout the project up to the production deployment. In data science projects, testing is even more prominent. Ideally, you should document the model testing approach before writing a single line of code. This testing approach is an integral part of the project goal definition. The problem statement can’t be complete without the set of business and technical metrics that will evaluate the effects of the project.
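One way to write the testing approach down before any modeling starts is to record the agreed minimum metric values and check every candidate model against them. The following is only a sketch: the metric choice (precision and recall), the threshold values and the function names are hypothetical and should come from your own problem statement.

```python
# Minimum metric values agreed with the business before modeling starts.
# The numbers are placeholders for illustration only.
ACCEPTANCE_CRITERIA = {
    "recall": 0.80,
    "precision": 0.60,
}


def precision_recall(y_true, y_pred):
    """Compute precision and recall for binary labels (1 = positive)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return {"precision": precision, "recall": recall}


def failed_criteria(metrics, criteria=ACCEPTANCE_CRITERIA):
    """Return the names of metrics that fall below their agreed minimum."""
    return [name for name, minimum in criteria.items() if metrics[name] < minimum]


if __name__ == "__main__":
    # Toy hold-out labels and predictions, made up for the example.
    y_true = [1, 1, 1, 0, 0]
    y_pred = [1, 1, 0, 1, 0]
    metrics = precision_recall(y_true, y_pred)
    print(metrics, failed_criteria(metrics))
```

A check like this can run as an ordinary automated test, so a model that misses an agreed threshold is stopped before deployment rather than discovered in production.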

Fancy tech

Data science, AI and machine learning are all buzzwords, surrounded by new and attractive technologies. On the other hand, data science solves a wide variety of practical problems, which makes organizations hungry to invest in the field. Often, technical experts are the first people to evangelize data science in the organization. In some cases, people are eager to try out new technologies and play with new algorithms, leaving the problem statement to a later stage of the project. In this setting, business experts will label data science as just another toy for the tech guys.

This tech-first approach to integrating data science is risky. Without support and ideas from the business side, even the best experts in the field will find it hard to advance the company’s business using its data. The motivation for data science and new project ideas should always be a collaborative effort, and not only a technological one.

Team

Data science projects often look like research projects. You need to test new, unexplored approaches and technologies to solve a business problem, and you need a team to produce this solution. Looking at data science projects solely from the research perspective leads you to create R&D-focused teams. However, many organizations overlook the need to develop a production-ready system. In reality, data engineering and the software around your model can take up to 90% of the total time invested in the project. This means that you need to assemble teams that are ready to develop prototypes into reliable, highly available, production-ready software solutions. Cross-functional teams oriented toward practical applications of machine learning are much more valuable for most businesses than an internal research lab that advances state-of-the-art approaches and pushes the science forward. To control this risk, think about the data science team’s goals and adapt your hiring strategy to be in sync with those goals.

Tooling

Another important issue lies on the tech side. Data science projects are often looked upon as research endeavors. However, research is only a part of delivering a solution to a business problem. If you want your team to implement projects efficiently and reliably, you need to look at data science from an engineering standpoint. ModelOps is a close relative of DevOps: a discipline that studies engineering processes around machine learning model development and deployment. Nowadays, ModelOps offers tools for data versioning and constructing reusable pipelines ( https://dvc.org ), experiment tracking and model deployment ( http://mlflow.org ), and fast project setup ( https://github.com/pyscaffold/pyscaffoldext-dsproject ). Traditional CI/CD tools such as GitLab CI can also bring great benefits to your project, so consider using them. To control this risk, start thinking not only about model accuracy, but also about the model delivery process.
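To give a feeling for what experiment tracking records, here is a minimal sketch built only on the Python standard library. It is not the API of MLflow or any other tool mentioned above; the function names (log_run, best_run) and the JSON Lines file layout are assumptions chosen for illustration. Dedicated tools add much more on top, such as artifact storage, a UI and model registries.

```python
import json
import time
from pathlib import Path


def log_run(log_path, params, metrics):
    """Append one experiment run (parameters and metrics) to a JSON Lines file."""
    record = {"timestamp": time.time(), "params": params, "metrics": metrics}
    with Path(log_path).open("a", encoding="utf-8") as log_file:
        log_file.write(json.dumps(record) + "\n")
    return record


def best_run(log_path, metric):
    """Return the logged run with the highest value of the given metric."""
    with Path(log_path).open(encoding="utf-8") as log_file:
        runs = [json.loads(line) for line in log_file if line.strip()]
    return max(runs, key=lambda run: run["metrics"][metric])


if __name__ == "__main__":
    # Hypothetical hyperparameters and scores, made up for the example.
    log_run("runs.jsonl", {"max_depth": 3}, {"auc": 0.81})
    log_run("runs.jsonl", {"max_depth": 5}, {"auc": 0.84})
    print(best_run("runs.jsonl", "auc")["params"])
```

Even a log this simple answers the question that tends to get lost in research-style projects: which settings produced the result we are about to ship.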
