
Recipe for a Data Science Project

 4 years ago
source link: https://towardsdatascience.com/recipe-for-a-data-science-project-69ad5c5ecf29?gi=16ff98dbba15

Is a data science project a piece of cake? Check out my full six-step guide on how to handle one.


Photo by Kateryna Ozler

In this article you will read about the phases and activities of data science projects, from the beginning to production, based on my experience.

While reading the article, please keep the following two points in mind:

  • Each activity below represents a group of similar activities so they can be considered as task groups.
  • No phase is completed in a single pass. There will always be iterations.

It is naive to think that a data science project consists mainly of predictive modelling. It certainly does not, especially in enterprise companies. There are many steps to take and difficulties to face. That is why process frameworks have been created to make data science projects more manageable and to prevent failures. One of the best-known processes for data science projects is CRISP-DM. It breaks a data science project into 6 phases: business understanding, data understanding, data preparation, modeling, evaluation, and deployment.

I prefer something very similar but with slightly different phases to describe the whole process:

  • Evaluating the Value : Understanding the business, evaluating whether the project is worth pursuing, defining the scope of the project
  • Data Preparation : Understanding the data, preparing necessary structures, data transfers, data preprocessing, feature engineering
  • Modeling : Modelling, simulations, evaluating results
  • Software Development : Developing APIs, dashboards, reports, notifications, simulation & management applications, deployment & integration with the existing products
  • Monitoring : A/B testing, monitoring and evaluating the pilot phase, rolling-out
  • Involvement : Convincing all stakeholders of the benefits of the project and involving them in it

Before going through the phases, let me give some brief information about the total completion time. In my experience and observations in an enterprise, a data science project takes between 3 and 6 months. Please keep in mind that this is just a rough average; some projects take more than a year.

Phase-1: Evaluating the Value

  • I believe this is the most important part of a data science project.
  • It may be shorter or longer but I think 2 weeks is the median duration for this phase.
  • Product Owner and Data Translators are the key players.

Main activities of the phase:

  • Understanding the current state with the help of some visualizations.
  • Finding pain points or opportunities in the current state and creating a project scope to gain value from them.
  • Defining the target variable. An example for a churn case: “The customer closes their accounts, or the total balance of the customer’s accounts falls below xxx USD and the number of monthly transactions is lower than yyy.” We mark these customers as 1 and the others as 0.
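
The churn definition above can be written down as a small labelling function. This is only a sketch: the `CustomerSnapshot` fields are hypothetical names, the rule is read as "closed accounts, OR (low balance AND few transactions)", and the two threshold values are arbitrary placeholders standing in for the unspecified xxx and yyy.

```python
from dataclasses import dataclass


@dataclass
class CustomerSnapshot:
    """Hypothetical per-customer summary; field names are illustrative."""
    accounts_closed: bool
    total_balance_usd: float
    monthly_transactions: int


def churn_label(c: CustomerSnapshot, balance_threshold: float, tx_threshold: int) -> int:
    """Return 1 for a churned customer, 0 otherwise.

    Assumed reading of the rule: closed accounts,
    OR (balance below threshold AND transactions below threshold).
    """
    low_activity = (
        c.total_balance_usd < balance_threshold
        and c.monthly_transactions < tx_threshold
    )
    return 1 if (c.accounts_closed or low_activity) else 0


# Placeholder thresholds (assumptions); real values come from the business.
BALANCE_THRESHOLD = 100.0
TX_THRESHOLD = 2

customers = [
    CustomerSnapshot(True, 5000.0, 12),  # closed accounts
    CustomerSnapshot(False, 40.0, 1),    # low balance and few transactions
    CustomerSnapshot(False, 40.0, 9),    # low balance but still active
]
labels = [churn_label(c, BALANCE_THRESHOLD, TX_THRESHOLD) for c in customers]
print(labels)  # [1, 1, 0]
```

Keeping the thresholds as parameters makes it easy to revisit the definition with the business side in a later iteration.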

Phase-2: Data Preparation

  • This phase consumes 60% to 80% of the data-related effort of the project, and the tasks are really not fun.
  • It may take from 4 to 12 weeks.
  • Data Engineers are the key players.

Main activities of the phase:

  • Creating the file structure of the project in source control
  • Deciding which data sources will be used in the project and creating the first version of the data dictionary
  • Transferring data from their sources to the analytics environment
  • Preparing datasets for modelling by removing some fields, joining tables together and creating new features
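
As a rough illustration of the last activity (removing fields, joining tables, creating features), here is a minimal sketch in plain Python. The table contents, the field names and the `build_dataset` helper are all made up for the example; a real project would more likely do this in SQL or a dataframe library.

```python
from collections import defaultdict

# Hypothetical source tables, represented as lists of dicts.
customers = [
    {"customer_id": 1, "segment": "retail", "internal_note": "n/a"},
    {"customer_id": 2, "segment": "sme", "internal_note": "n/a"},
]
transactions = [
    {"customer_id": 1, "amount": 120.0},
    {"customer_id": 1, "amount": 80.0},
    {"customer_id": 2, "amount": 300.0},
]

DROPPED_FIELDS = {"internal_note"}  # fields judged irrelevant for modelling


def build_dataset(customers, transactions):
    """Join transactions onto customers and derive simple features."""
    # Group transaction amounts by the join key.
    amounts = defaultdict(list)
    for t in transactions:
        amounts[t["customer_id"]].append(t["amount"])

    rows = []
    for c in customers:
        # Remove fields that should not reach the model.
        row = {k: v for k, v in c.items() if k not in DROPPED_FIELDS}
        tx = amounts.get(c["customer_id"], [])
        # New features derived from the joined table.
        row["tx_count"] = len(tx)
        row["avg_tx_amount"] = sum(tx) / len(tx) if tx else 0.0
        rows.append(row)
    return rows


dataset = build_dataset(customers, transactions)
for row in dataset:
    print(row)
```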

Phase-3: Modelling

  • The coolest part of data projects. It is a pleasure to create a predictive model and evaluate its results.
  • It may be shorter or longer but I think 2 weeks is the median duration for this phase.
  • Data Scientists are the key players.

Main activities of the phase:

  • Creating a baseline simple model to let everybody start talking about it
  • Choosing a model that balances the trade-off between performance and explainability
  • Deep diving into the data to understand every part of it very clearly
  • Feature selection
  • Cross validation
  • Tuning
  • Evaluating the results
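
To make the "baseline simple model" and "cross validation" activities concrete, here is a minimal sketch that scores a majority-class baseline with k-fold cross validation. It is deliberately library-free, the labels are made up, and the folds are contiguous without shuffling or stratification, which a real project would normally add.

```python
from collections import Counter


def k_fold_indices(n: int, k: int):
    """Yield (train_idx, test_idx) pairs for k contiguous folds."""
    fold_sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test_idx = list(range(start, start + size))
        train_idx = [i for i in range(n) if i < start or i >= start + size]
        yield train_idx, test_idx
        start += size


def majority_baseline_cv(labels, k: int = 5) -> float:
    """Mean accuracy of a 'predict the most common class' baseline."""
    scores = []
    for train_idx, test_idx in k_fold_indices(len(labels), k):
        # "Train": find the most frequent label in the training folds.
        majority = Counter(labels[i] for i in train_idx).most_common(1)[0][0]
        # "Evaluate": accuracy of always predicting that label.
        correct = sum(1 for i in test_idx if labels[i] == majority)
        scores.append(correct / len(test_idx))
    return sum(scores) / len(scores)


# Made-up labels: 1 = churned, 0 = stayed.
y = [0, 0, 1, 0, 0, 0, 1, 0, 0, 0]
baseline_accuracy = majority_baseline_cv(y, k=5)
print(baseline_accuracy)
```

Any candidate model then has to beat this number under the same folds, which gives everybody a shared reference point to talk about.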

Phase-4: Software Development

  • This is the part where people realize that almost every data project is a part of a software product.
  • In the simplest case, where sending some reports is enough, it may take just 1 week. On the other hand, if you develop an API, a dashboard, and simulation and management applications, it can take up to 12 weeks.
  • Software Developers are the key players.

Main activities of the phase:

  • Developing an API to expose the model to the world as a service
  • Developing dashboards to monitor the business with the new model after roll-out
  • Creating reports to observe the roll-out process and evaluate the production life of the new model
  • Implementing simulation & management applications that let business stakeholders enter the parameters the model needs and run simulations for what-if analysis
  • Deploying the components and integrating them with the existing products
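
As a sketch of what "opening the model as a service" can look like, the example below wraps a stand-in model in a tiny HTTP endpoint using only the Python standard library. The `predict` function, the `churn_score` field and the port are assumptions for illustration; a production service would typically use a proper web framework, authentication and input validation.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer
from typing import Tuple


def predict(features: dict) -> float:
    """Stand-in for a trained model; returns a made-up churn score."""
    return 0.9 if features.get("monthly_transactions", 0) < 2 else 0.1


def handle_predict(body: bytes) -> Tuple[int, dict]:
    """Validate a JSON request body and return (HTTP status, payload)."""
    try:
        features = json.loads(body)
    except ValueError:  # covers malformed JSON and undecodable bytes
        return 400, {"error": "body must be valid JSON"}
    if not isinstance(features, dict):
        return 400, {"error": "body must be a JSON object"}
    return 200, {"churn_score": predict(features)}


class PredictHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        length = int(self.headers.get("Content-Length", 0))
        status, payload = handle_predict(self.rfile.read(length))
        data = json.dumps(payload).encode("utf-8")
        self.send_response(status)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(data)))
        self.end_headers()
        self.wfile.write(data)


def serve(port: int = 8000) -> None:
    """Start the service on localhost; blocks until interrupted."""
    HTTPServer(("127.0.0.1", port), PredictHandler).serve_forever()


# The request logic can be exercised without starting the server:
print(handle_predict(b'{"monthly_transactions": 1}'))
```

Calling `serve()` starts the endpoint. Keeping `handle_predict` separate from the HTTP plumbing makes the request logic easy to test on its own.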

Phase-5: Monitoring

  • The most exciting part of the project because all participants start to observe the results of the model in production.
  • The minimum duration I have experienced is 12 weeks; depending on the case, it can extend to 1 year.
  • Everybody is on board.

Main activities of the phase:

  • Observing the production life of the new model by tracking its performance.
  • Executing A/B testing by observing and evaluating the results of control and pilot groups.
  • Making go/no-go decisions for next roll-out steps.
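
One common way to evaluate control and pilot groups is a two-proportion z-test. The article does not prescribe a specific test, so treat the sketch below as one possible choice. The group sizes, success counts and the 0.05 significance level are made-up assumptions.

```python
import math


def two_proportion_z_test(success_a: int, n_a: int, success_b: int, n_b: int):
    """Return (z, two-sided p-value) for H0: both groups share one rate."""
    p_a, p_b = success_a / n_a, success_b / n_b
    pooled = (success_a + success_b) / (n_a + n_b)
    # Standard error under the null; zero if nobody or everybody succeeds.
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    p_value = math.erfc(abs(z) / math.sqrt(2))  # two-sided tail of N(0, 1)
    return z, p_value


# Made-up counts: control (old process) vs. pilot (new model).
z, p_value = two_proportion_z_test(success_a=100, n_a=1000, success_b=130, n_b=1000)

ALPHA = 0.05  # assumed significance level
decision = "go" if (p_value < ALPHA and z > 0) else "no-go"
print(round(z, 2), round(p_value, 3), decision)
```

In practice the go/no-go decision also weighs business impact and operational risk, not just statistical significance.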

Phase-6: Involvement

  • Depending on how conservative stakeholders are, it can be the most challenging part of the project.
  • It starts at day 0 and finishes at the end of the project when the team is released.
  • Product Owner and Data Translator are the key players.

Main activities of the phase:

  • Explaining the road map to all stakeholders.
  • Involving them in as much of the work as possible.
  • Helping them adapt their current business processes to the upgraded AI-based ones.

In Conclusion,

If you asked me to name three key points for minimizing the risk of failure in a data science project, I would say:

  • Find the right case/problem to work on (top priority),
  • Focus on the cases of stakeholders who have a vision about AI or at least who are not against it,
  • Don’t fall into the trap of ignoring software development efforts and deployment complexities.

If you have any further questions, please don’t hesitate to write: [email protected].

