
Industrial Grade Data Science

 4 years ago
source link: https://towardsdatascience.com/industrial-grade-data-science-717c3b3a350b?gi=97acb93bec5a

Managing data science is hard. Data science projects have many opportunities to fail along the way, and the risks of such projects are not widely known, because data science is still a young field compared to software development. In this post, we will explore the major risks of data science projects and look at approaches for controlling them.

Exploring risks of an average data science project


Photo by Tobias Tullius on Unsplash

Let’s first see how a data science project can appear in an average organization. It all starts with an enthusiast, who may be a software vendor or a company employee, and who sees a way of changing the world and the businesses around him with data science. There are so many successful use cases on the Internet. Feeling that his company can benefit from this too, he advises company management to look into data science. The management buys in and asks the internal IT or analytics team to look into how data science can be applied to make the business better. The team starts to look for large data sources inside the organization, and generally finds one or several good databases. Then, they think hard about how they can apply data science and machine learning to make their company better. Most teams discover some kind of dataset where one can apply machine learning algorithms, so they go on with the task. In the end, such projects often conclude in one of the following ways:

  • The business does not understand what benefit it gets from the new system or algorithm. The results are bewildering for company management. As a result, they blame the data science team for the time and money spent on an unnecessary project.
  • The system goes into production. The KPI (Key Performance Indicator) of the business process falls suddenly after the team deploys the model. The company’s management is enraged and blames the data science team for the company’s losses.

But how can we avoid failure in data science projects? To understand this, let’s dive into the major risks of data science projects.

Knowledge

The first and major risk of any data science project is the availability and spread of knowledge. Decision-makers often lack basic expertise in data science. That leads to inflated expectations and incorrect problem statements. People start to talk about AI and how it will magically solve problems by looking into their databases and finding new profits in the data.

Goals

Lack of knowledge of what data science and machine learning can do leads to incorrect and vague problem statements. In reality, stating the problem in a solvable form is a critical step. A correct problem statement makes up 80% of success in a data science project. The reason is that data science uses the scientific method and research processes to measure results. In general, data scientists look to improve some kind of metric, a formula for measuring the project’s performance. Vague goal definitions lead to incorrect solutions from the data science team. Data scientists are mostly data experts, not business consultants. The business should collaborate with the data team to create a problem statement that will be worth the investment.

Before starting a project, make sure that your idea:

  • Is attached to the business
  • Can be measured using a set of metrics
  • Can be presented to the business side in an understandable way
  • Has related data. No data means that you have a data collection project, not a data science project

Having covered the strategic risks, let’s turn to the execution risks.

Management approach

Project managers often resort to software development management methodologies for data science projects. This seems reasonable: after all, we are developing software. Yes, there may be a machine learning model deep inside, but why should it need a different management approach? In reality, management practices like Agile are a good start for managing data science projects, but they need extra tweaking and adaptation to serve their purpose.

The core problem is that Agile, like many management methodologies, focuses on handling external scope changes. For example, Scrum centers on a product owner who collects all requested changes in a project backlog, which is later prioritized. In data science projects, changes may come not only from outside the project team, but from the inside too. For example, the results of an experiment can change the modeling approach used in the project. This may lead to unavoidable changes in the internal software architecture and major changes in the entire system. You should adapt the management approach for data science projects to handle such cases, as they are quite common. In particular, think about splitting the project into two integrated parts with separate backlogs. The research subproject should deal with modeling and data preprocessing, while the software subproject should cover the end-to-end solution and integrate the results from the research subproject.

Another interesting aspect of any data science project is testing. In good software projects, testing is an integral part of the process. It starts together with the development stage and continues throughout the project up until the production deployment. In data science projects, testing is even more prominent. Ideally, you should document the model testing approach before writing a single line of code. This testing approach is an integral part of the project goal definition. The problem statement can’t be complete without the set of business and technical metrics that will evaluate the effects of the project.
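One way to make a documented testing approach concrete is to write the agreed metric thresholds down as code before any modeling starts. The sketch below is only an illustration under assumptions: the `MetricTarget` class, the `TARGETS` values, and the choice of precision and recall are hypothetical and do not come from the article. Real targets would come from the problem statement agreed with the business.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MetricTarget:
    """A metric threshold agreed with the business before modeling starts."""
    name: str
    minimum: float


def precision_recall(y_true, y_pred):
    """Compute precision and recall for binary labels (1 = positive)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return {"precision": precision, "recall": recall}


def failed_targets(metrics, targets):
    """Return the names of the targets that the model does not meet."""
    return [t.name for t in targets if metrics[t.name] < t.minimum]


# Hypothetical thresholds, used here for illustration only.
TARGETS = [MetricTarget("precision", 0.70), MetricTarget("recall", 0.60)]

if __name__ == "__main__":
    # Toy labels and predictions standing in for a held-out test set.
    y_true = [1, 0, 1, 1, 0, 0, 1, 0]
    y_pred = [1, 0, 1, 0, 0, 1, 1, 0]
    metrics = precision_recall(y_true, y_pred)
    print(metrics, failed_targets(metrics, TARGETS))
```

A check like this can run in the same test suite as the rest of the software, so a model that misses an agreed threshold is caught before it reaches production.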

Fancy tech

Data science, AI and machine learning are all buzzwords, surrounded by new and attractive technologies. At the same time, data science solves a wide variety of practical problems, which makes organizations hungry to invest in the field. Often, technical experts are the first people to evangelize data science in the organization. In some cases, people are eager to try out new technologies and play with new algorithms, leaving the problem statement to a later stage of the project. In this setting, business experts will label data science as just another toy for the tech guys.

This tech-first approach for integrating data science is risky. Without support and ideas from the business side, even the best experts in the field will find it hard to advance the company’s business using its data. The motivation for data science and new project ideas should always be a collaborative effort, and not only a technological one.

Team

Data science projects often look like research projects. You need to test new, unexplored approaches and technologies to solve a business problem, and you need a team to produce this solution. Looking at data science projects solely from the research perspective leads you to create R&D-focused teams. However, many organizations overlook the need to develop a production-ready system. In reality, data engineering and the software around your model will take up to 90% of the total time invested in the project. This means that you need to assemble teams that are ready to develop prototypes into reliable, highly available and production-ready software solutions. Cross-functional teams oriented toward practical applications of machine learning are much more valuable for most businesses than an internal research lab that advances state-of-the-art approaches and pushes the science forward. To control this risk, you should think about the data science team’s goals and adapt your hiring strategies to be in sync with those goals.

Tooling

Another important issue lies on the tech side. Data science projects are often looked upon as research endeavors. However, research is only a part of delivering a solution for any business problem. If you want your team to implement projects efficiently and reliably, you need to look at data science from an engineering standpoint. ModelOps is a close relative of DevOps: a discipline that studies engineering processes around machine learning model development and deployment. Nowadays, ModelOps offers tools for data versioning and constructing reusable pipelines (https://dvc.org), experiment tracking and model deployment (http://mlflow.org), and fast project setup (https://github.com/pyscaffold/pyscaffoldext-dsproject). Traditional CI/CD tools such as GitLab CI can also bring great benefits to your project, so consider using them. To control this risk, start thinking not only about model accuracy, but also about the model delivery process.
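To give a feel for what experiment tracking means in practice, here is a minimal hand-rolled sketch that uses only the Python standard library. It is not the API of MLflow or any other tool mentioned above: the function names `log_run` and `best_run` and the JSON Lines log format are assumptions made for illustration. A dedicated tool does the same job with far more features, such as a UI, artifact storage, and a model registry.

```python
import json
import time
from pathlib import Path


def log_run(log_path, params, metrics):
    """Append one experiment run (parameters and metrics) as a JSON line."""
    record = {"timestamp": time.time(), "params": params, "metrics": metrics}
    path = Path(log_path)
    # Create the log directory on first use.
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("a", encoding="utf-8") as f:
        f.write(json.dumps(record) + "\n")
    return record


def best_run(log_path, metric):
    """Return the logged run with the highest value of the given metric."""
    with Path(log_path).open(encoding="utf-8") as f:
        runs = [json.loads(line) for line in f if line.strip()]
    return max(runs, key=lambda run: run["metrics"][metric])
```

With every run recorded alongside its parameters, the team can answer which configuration produced a given result, and the log file itself can be versioned together with the code.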

Where to go next?

