Data Science Project Management Methodology

There are several data science project management methodologies, such as:

  • Scrum
  • CRISP-DM
  • Kanban
  • SEMMA
  • KDD
  • Waterfall
  • Bimodal
  • Domino Lifecycle
  • Microsoft TDSP

But before moving on to the details, let us first look at what data science actually is.

What is Data Science?

Data science is the study of information gathered from large, complex datasets. It combines concepts from statistics and computation to derive useful insights for decision-making purposes.

A Brief History

Data science is not a new field of study; it was introduced around three decades ago. The term "data science" was previously used interchangeably with "computer science", but since the early 2000s it has been treated as an independent discipline. Now, let us dig deeper into the concepts and terminology used in data science.

Stages of Analytics

There are four stages, or levels, of analytics that support the decision-making process:

  • Descriptive analytics
  • Diagnostic analytics
  • Predictive analytics
  • Prescriptive analytics

It is essential to understand each of these four stages in detail.

1. Descriptive Analytics

Descriptive analytics is the first level of analytics and is considered the most time-consuming task. It is also the most important step in decision making because all the other stages build on it. In descriptive analytics, raw data is collected and filtered so that it can be put to useful purposes. This information is then used to draw conclusions and make important decisions. This step is also considered the observation step; from here we move on to insights, which is the second step.

2. Diagnostic Analytics

As the word "diagnostic" suggests, here we move forward to check why the graph is sloping the way it is. Unlike descriptive analytics, where we focus on what is happening, in diagnostic analytics we focus on why it is happening. We diagnose the data and draw conclusions from it. Let us look at an example.

Suppose you visit a doctor and the doctor tells you only that you are sick, nothing more. He has observed you and concluded that you are sick, but he has not diagnosed the problem or told you why you are sick. Do you have a sore throat, or do you have typhoid? No one knows. The same is true of analytics: first you describe the data, and then you diagnose it.

3. Predictive Analytics

In predictive analytics, organizations build on the outcomes of the first two stages to forecast what will happen next. They define different parameters of success and test them to check the reliability of their decisions. Predictive analytics uses machine learning, AI, and statistics to analyze the data and make predictions about the future.

By combining the first three stages of analytics with different ML and AI techniques, organizations can anticipate future success or failure. Predictive analytics helps organizations make better decisions and anticipate the behavior of the data. Unfortunately, many organizations ignore this stage and focus only on the first two. Predictive analytics is applied in many areas such as health care, customer relationship management, risk management, fraud detection, theft detection, underwriting, and direct marketing. The main steps of predictive analytics are:

  • Define project
  • Collect data
  • Analyze data
  • Modeling
  • Deployment
  • Model monitoring

4. Prescriptive Analytics

It is the most powerful and most advanced phase of analytics, in which the data itself prescribes the future course of the organization. It automates decision making and prescribes what should be done next. At the first level you simply observe the behavior of the data; at the next level you diagnose the data and explain why it is behaving that way; and then you predict what will happen.

In the last and most vital step, the data guides you on what to do for future benefit. This is why prescriptive analytics is so important. For an organization to succeed, all four stages must work together; if any one of them is not working well, the decision-making strategy will fail.

Machine Learning Categories

Machine learning is an application of artificial intelligence in which computer systems automatically learn and improve from past experience. Programs are developed so that they access data, for example images, learn from that data, and process it further. There are several categories of learning in machine learning; here we will discuss:

  • Supervised Learning
  • Unsupervised Learning
  • Reinforcement Learning

Supervised Learning

Supervised learning is all about labeled data. As the name suggests, there is a supervisor or teacher who tells you the correct answer. In supervised learning, machines are trained using labels. A label is a tag placed on an object; for example, we provide an image of a dog labeled as "dog". In this way we have told our system that this is an image of a dog.

Afterwards, you train your machine by feeding it thousands of images of dogs. When you then give it an image that contains a dog and an elephant, the machine will identify the dog through its features.

Supervised learning is further divided into two categories of algorithms (a minimal scikit-learn sketch follows the list of types below):

  • Classification: the machine predicts a discrete class label, such as "Black" or "Blue", or "Chair" vs. "no Chair"
  • Regression: the machine predicts a continuous numerical value, such as a price or a temperature

Types of Supervised Learning

  • Decision tree
  • Classification
  • Regression
  • Logistic regression
  • Naïve Bayes classifier
  • Support vector machine
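As a minimal sketch of both categories, the scikit-learn snippet below trains a tiny classifier and a tiny regressor; the feature values and labels are invented purely for illustration.

```python
# Minimal supervised-learning sketch with scikit-learn; the toy data below
# is invented purely for illustration.
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LinearRegression

# Classification: predict a discrete label ("chair" vs "no chair")
X_cls = [[0.5, 1.0], [0.9, 1.2], [3.0, 0.1], [3.2, 0.2]]   # two made-up features
y_cls = ["chair", "chair", "no chair", "no chair"]          # labels (the "supervision")
clf = DecisionTreeClassifier().fit(X_cls, y_cls)
print(clf.predict([[0.7, 1.1]]))   # -> ['chair']

# Regression: predict a continuous value (e.g. a price)
X_reg = [[1], [2], [3], [4]]
y_reg = [10.0, 20.0, 30.0, 40.0]
reg = LinearRegression().fit(X_reg, y_reg)
print(reg.predict([[5]]))          # -> approximately [50.]
```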

Unsupervised Learning

Unsupervised learning means that you train the machine using data that is unknown to it. You give the machine a dataset that is neither labeled nor classified and ask the algorithm to group the items without any prior training. The machine divides the objects on the basis of size, color, and pattern. This is known as clustering, which is a very popular family of unsupervised learning algorithms.

For example, you feed it an image containing both a dog and a man. Since there is no label or tag on the image, the machine has no idea whether it is looking at a dog, a cat, or a man. But it will still distinguish the objects on the basis of their similarities and differences.

  • Clustering: dividing the whole population into subgroups whose members are similar to each other on some basis and different from the members of other groups (a short clustering sketch follows this list).
  • Association: defining rules to categorize large amounts of data, for example "students that study A also study B".
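The following is a minimal clustering sketch with scikit-learn's KMeans; the points are made up, and no labels are given to the algorithm.

```python
# Minimal unsupervised-learning sketch: k-means clustering with scikit-learn.
from sklearn.cluster import KMeans

X = [[1.0, 1.1], [0.9, 1.0], [1.2, 0.8],    # one natural group
     [8.0, 8.2], [7.9, 8.1], [8.3, 7.8]]    # another natural group

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(kmeans.labels_)            # e.g. [0 0 0 1 1 1] -- groups found without labels
print(kmeans.cluster_centers_)   # the centre of each discovered group
```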

Reinforcement Learning

Reinforcement learning is the branch of machine learning in which machine learning or deep learning models are trained to make sequences of complex decisions. A software agent tries to achieve a goal or take actions in a difficult, complex environment. Unlike supervised learning, where labels tell the machine the correct answer, in reinforcement learning the agent learns from its own experience and makes decisions at run time. The output depends on the current input, and the next input depends on the previous output.
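As a rough sketch of this idea (not any particular library's API), the tabular Q-learning loop below lets an agent learn purely from reward on a tiny, invented corridor environment.

```python
# Toy tabular Q-learning sketch on a made-up 5-cell corridor: the agent starts
# at cell 0 and earns a reward only when it reaches cell 4. Everything here
# (environment, parameters) is invented for illustration.
import random

n_states, n_actions = 5, 2          # actions: 0 = move left, 1 = move right
Q = [[0.0] * n_actions for _ in range(n_states)]
alpha, gamma, epsilon = 0.1, 0.9, 0.2

for episode in range(500):
    state = 0
    while state != 4:                                # episode ends at the goal
        if random.random() < epsilon:                # explore sometimes...
            action = random.randrange(n_actions)
        else:                                        # ...otherwise act greedily
            action = max(range(n_actions), key=lambda a: Q[state][a])
        next_state = max(0, state - 1) if action == 0 else min(4, state + 1)
        reward = 1.0 if next_state == 4 else 0.0     # positive reinforcement at goal
        # Q-learning update: learn from experience, not from labelled answers
        Q[state][action] += alpha * (reward + gamma * max(Q[next_state]) - Q[state][action])
        state = next_state

print(Q)   # "move right" should now have the higher value in every cell
```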

Types of reinforcement learning

Reinforcement Learning is of two types:

  • Positive
  • Negative

Positive Reinforcement Learning

It is the type of reinforcement learning in which a reward following a particular action strengthens that behavior.

Negative Reinforcement Learning

It is the type of reinforcement learning in which a behavior is strengthened because a negative condition is stopped or avoided.

Advantages

  • It helps to maximize performance
  • It strengthens the desired behavior in response to an action
  • It helps to sustain the change for a long period of time

Data Science Project Life Cycle

In data science, data is king; if we lack data, or there is no data at all, we cannot do anything. Several questions arise when we start working with data:

  • Why is the data needed?
  • How will we get the data?
  • What type of data is required?
  • How do we make the data useful for us?

All of these questions are answered by the data science project life cycle. We will discuss each step of the life cycle in detail, using the OSEMN framework, which covers the complete process from data gathering to interpretation.

The OSEMN Model

This framework consists of the following steps:

  • Obtain data
  • Scrub data
  • Explore data
  • Model data
  • Interpret data

1. Obtain Data

This is the first step in the data science project life cycle. It is the most basic step, where we gather the required data from the available data sources. In order to retrieve the data you may need to run queries against a database. The kinds of databases you may encounter include relational databases such as PostgreSQL and Oracle, and non-relational (NoSQL) databases such as MongoDB. Another way to obtain data is to scrape it from websites using web scraping tools such as Beautiful Soup. A further popular option is to connect to web APIs: sites such as Facebook and Twitter allow clients to connect to their web servers and access their data through an API.

We can also gather data from existing files and records, for example by downloading datasets from Kaggle or using existing corporate data stored in CSV (Comma Separated Values) or TSV (Tab Separated Values) format. These are flat text files, so you typically need a parser, such as the CSV reader in Python or pandas, to load them into a usable structure.
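The sketch below shows typical ways to obtain data with pandas and requests; the file names and the API URL are placeholders, not real sources.

```python
# Obtaining data -- a sketch with placeholder sources (sales.csv and the API
# URL do not exist; substitute your own). Assumes pandas and requests installed.
import pandas as pd
import requests

# 1. Flat files: pandas parses CSV/TSV directly
df_csv = pd.read_csv("sales.csv")            # comma-separated
df_tsv = pd.read_csv("sales.tsv", sep="\t")  # tab-separated

# 2. Web API: many sites expose JSON over HTTP
response = requests.get("https://api.example.com/v1/records", params={"limit": 100})
df_api = pd.DataFrame(response.json())

# 3. SQL database, e.g. PostgreSQL via SQLAlchemy
# from sqlalchemy import create_engine
# engine = create_engine("postgresql://user:password@host:5432/mydb")
# df_sql = pd.read_sql("SELECT * FROM orders", engine)

print(df_csv.shape, df_api.shape)
```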

2. Scrub Data

After obtaining the data, the next step is to filter, or scrub, the data. Scrubbing is very important because we only need the data that is useful to us; the rest should be discarded as garbage. There is a lot to do when scrubbing your data. If your data is spread across several files in a format such as CSV, you need to organize those files into a single repository so that they are easy to analyze. You may also need to fill in missing information and replace values accordingly. In short, you merge the data, resolve ambiguous records, and delete the data that is of no use to you.
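A small pandas-based scrubbing sketch is shown below; the file and column names are hypothetical.

```python
# Scrubbing a dataset with pandas -- file and column names are hypothetical.
import pandas as pd

df = pd.read_csv("sales.csv")                       # placeholder file

df = df.drop_duplicates()                           # remove duplicate rows
df = df.dropna(subset=["customer_id"])              # drop rows missing a key field
df["revenue"] = df["revenue"].fillna(0.0)           # fill missing values sensibly
df["order_date"] = pd.to_datetime(df["order_date"], errors="coerce")
df = df[df["revenue"] >= 0]                         # discard obviously invalid records

# Merge several scrubbed files into a single repository
regions = pd.read_csv("regions.csv")                # placeholder file
df = df.merge(regions, on="region_id", how="left")
```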

3. Explore Data

After gathering the data from all sources and scrubbing it, you need to examine it before applying machine learning and artificial intelligence techniques. There are several types of data, such as ordinal, descriptive, categorical, numerical, and nominal data.

Moreover, you need to extract different features from your filtered data. The features can be anything relevant to your data; in ML and AI they might be edges, corners, slopes, and so on. You can also visualize the data using data visualization tools such as Tableau.
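A quick exploratory pass might look like the following pandas/matplotlib sketch; again, the file and column names are hypothetical.

```python
# Quick exploratory look at a dataset -- file and column names are hypothetical.
import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv("sales.csv")                      # placeholder file

print(df.dtypes)                                   # which columns are numerical/categorical
print(df.describe())                               # summary statistics of numeric columns
print(df["region"].value_counts())                 # distribution of a categorical column
print(df.select_dtypes("number").corr())           # correlations between numeric features

df["revenue"].hist(bins=30)                        # simple visualization
plt.xlabel("revenue")
plt.ylabel("count")
plt.show()
```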

4. Model Data

Before rushing to the data modeling step, you need to ensure that data gathering, filtering, and feature extraction have been done very carefully. Once you are confident about the first three steps, you are ready to proceed to modeling.

In data modeling, one of the main tasks is to reduce the dimensionality of the filtered dataset. Not all features and properties of the dataset are needed to build the model; rather, you select the important features and make predictions on the basis of those.
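As a sketch of this step, the snippet below reduces a bundled scikit-learn dataset to ten principal components before fitting a simple classifier; it stands in for whatever business dataset you are actually modeling.

```python
# Dimensionality reduction followed by a simple model -- a sketch on a
# bundled scikit-learn dataset, not any particular business dataset.
from sklearn.datasets import load_breast_cancer
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)            # 30 original features
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Keep only the 10 most informative directions, then fit a classifier
model = make_pipeline(StandardScaler(), PCA(n_components=10), LogisticRegression(max_iter=1000))
model.fit(X_train, y_train)
print("test accuracy:", model.score(X_test, y_test))
```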

5. Interpreting Data

Interpreting the data is the most powerful step in the data science project life cycle. At this point, technical skills alone are not enough; you also need to be able to describe your findings to a non-technical, lay audience. If the presented results are understood by a non-technical audience, you have succeeded; otherwise your communication is not effective and you need to work on it.

Frameworks For Building Machine Learning Systems/ Applications

Machine learning frameworks are the tools and libraries that developers use to build machine learning and deep learning models. Since you already know what machine learning is, let us dive straight into the popular tools that can turn a simple AI project into a remarkable product. According to ML and AI experts, the following are among the best frameworks for building machine learning systems.

1. Tensorflow

TensorFlow is an open source framework released in 2015. It was developed by the Google Brain team using Python, C++ and CUDA and is considered one of the best frameworks for developing ML models. It is used for dataflow programming and makes training machine learning models very easy. It is also used for heavy mathematical computation and for applications such as neural networks and deep neural networks. It also comes with TensorFlow Lite for running machine learning models on your smartphone.

It is a comprehensive ML framework that can run on CPUs as well as GPUs. It supports classification and regression models along with various deep neural networks.
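A tiny Keras model built with TensorFlow 2.x might look like the sketch below; the data is randomly generated just to make the example self-contained.

```python
# A tiny TensorFlow/Keras model -- a sketch, with randomly generated data
# standing in for a real dataset. Assumes TensorFlow 2.x is installed.
import numpy as np
import tensorflow as tf

X = np.random.rand(200, 4).astype("float32")          # 200 samples, 4 features
y = (X.sum(axis=1) > 2.0).astype("float32")           # made-up binary target

model = tf.keras.Sequential([
    tf.keras.layers.Dense(16, activation="relu", input_shape=(4,)),
    tf.keras.layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
model.fit(X, y, epochs=5, batch_size=32, verbose=0)
print(model.evaluate(X, y, verbose=0))                 # [loss, accuracy]
```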

2. Google Cloud Machine Learning Engine

Google Cloud ML Engine is one of the best frameworks for data scientists and deep learning enthusiasts to develop machine learning models. It gives data scientists an opportunity to build models using managed training and prediction services. It has even been used in weather forecasting systems, for example to estimate cloud density. The training and prediction services can also be used independently of each other.

3. Apache Mahout

Apache Mahout is an open source framework developed by the Apache Software Foundation. It is a machine learning framework built on top of Apache Hadoop that provides a distributed linear algebra library for implementing machine learning algorithms. It is an extensive and easy-to-use framework used for clustering, regression, and batch-based collaborative filtering.

4. Shogun

Gunnar Raetsch and Soeren Sonnenburg created Shogun in 1999. It is an open source machine learning framework focused on data structures and algorithms for classification and regression, with particular strength in support vector machines. It offers interfaces for C++, Python, R, Java, Octave, Ruby, C#, MATLAB, and Lua. It can be used to build hidden Markov models and can process very large datasets, which is the main reason Shogun is widely used in ML modeling.

5. Sci-Kit Learn

Sci-Kit Learn (scikit-learn) was originally developed by David Cournapeau and is compatible with Windows, Linux, and macOS. It is an open source machine learning library for the Python programming language. It supports many classification, regression, clustering, and support vector algorithms. It is also easy for beginners to learn because its documentation is comprehensive and widely available. It is commonly used in data mining and data analytics.

6. Pytorch By Torch

PyTorch was released back in 2016 and was developed by Facebook's AI Research lab (FAIR). It is written in Python, C++, and CUDA. Since its release, many researchers have adopted PyTorch because of its flexibility and computational power. It can be used to build highly complex neural networks and to run heavy numerical computations, which is made easier by its Python-first design. PyTorch is well suited to hybrid front ends and dynamic computational graphs, which makes it easier to work with than many alternatives.
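A minimal PyTorch training loop, again on randomly generated stand-in data, could look like this sketch.

```python
# A tiny PyTorch training loop -- a sketch with randomly generated data.
# Assumes the torch package is installed.
import torch
import torch.nn as nn

X = torch.rand(200, 4)                          # 200 samples, 4 features
y = (X.sum(dim=1) > 2.0).float().unsqueeze(1)   # made-up binary target

model = nn.Sequential(nn.Linear(4, 16), nn.ReLU(), nn.Linear(16, 1), nn.Sigmoid())
loss_fn = nn.BCELoss()
optimizer = torch.optim.Adam(model.parameters(), lr=0.01)

for epoch in range(100):
    optimizer.zero_grad()
    loss = loss_fn(model(X), y)                 # forward pass
    loss.backward()                             # autograd computes gradients
    optimizer.step()                            # update weights

print("final loss:", loss.item())
```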

7. H2O

H2O is an open source machine learning framework mainly used in data science and analytics. It is used for predictive analytics and helps in making decisions on the basis of the available data. This framework is used for risk analysis, fraud detection, insurance analytics, weather forecasting systems, health care systems, advertising, and predictive modeling systems.

8. Microsoft Cognitive Toolkit (CNTK)

Microsoft Cognitive Toolkit, previously known as CNTK, is an open source deep learning library written in C++ and developed by Microsoft Research. It describes neural networks as a series of computational steps in a directed graph and has been used by Microsoft in products such as Skype, Bing, Cortana, and Xbox. It is used for various model types such as convolutional neural networks (CNNs), recurrent neural networks (RNNs), and deep neural networks (DNNs).

9. Apache MXNet

Apache MXNet is an open source deep learning library used to train and deploy convolutional neural networks and deep neural networks. It is developed under the Apache Software Foundation and can be used with multiple programming languages such as C++, Python, R, Julia, Java, JavaScript, Go, Perl, and Scala. Apache MXNet can be executed on both CPUs and GPUs.

10. Apple’s Core ML

Apple's Core ML is used to integrate machine learning models into apps, supporting tasks such as image classification, regression, natural language processing, sentence classification, and object detection and recognition. It can use the CPU and GPU for better performance. This ML library is easy to use, and beginners and intermediate developers can learn it quickly.

Data Science Project Management Methodologies

There are various data mining process models that are used in data science projects. Some of them are listed below:

  • Knowledge discovery in databases (KDD)
  • Sample, Explore, Modify, Model, Assess (SEMMA)
  • Cross-Industry Standard Process for Data Mining (CRISP-DM)

Before moving on to the details of each methodology, let us have a brief introduction to data mining.

What is Data Mining?

Data mining is the art of finding hidden relationships in large datasets. It uses techniques from machine learning, statistics, artificial intelligence, and database management. Models trained using data mining techniques can be used in fraud detection, theft detection, and marketing strategies. There are several types of data on which data engineers can perform data mining, including:

  • Simple databases
  • Relational databases
  • Advanced databases
  • Information repositories
  • Object oriented databases (OOD)
  • Object relational databases
  • Transactional and spatial databases
  • Heterogeneous databases
  • Text databases
  • Text mining
  • Web mining

Knowledge Discovery In Databases (KDD)

KDD is a methodology in which data engineers extract the required information from raw data to make it useful for their purposes. It has five steps:

Steps involved in KDD

  • Selection
  • Pre-Processing
  • Transformation
  • Data Mining
  • Interpretation

We will discuss each step in detail.

Selection

In this step, the data engineer selects the larger dataset from which the useful information will be extracted for further use.

Pre-processing

The data is filtered, or scrubbed, since the original data contains many unnecessary details.

Transformation

Using transformation methods, the dimensionality of the data is reduced. You need to extract specific features, for example eyes, nose, ears, edges, and corners. You cannot apply ML or AI algorithms effectively before dimensionality reduction.

Data Mining

It is the practice of examining large datasets in order to obtain the information of interest, for example for prediction or prescription.

Interpretation

After mining, engineers evaluate the results against the mined data patterns.

Sample, Explore, Modify, Model, Assess (SEMMA)

SEMMA is the abbreviation for Sample, Explore, Modify, Model, and Assess. It is another data mining approach, somewhat similar to KDD, but better suited to simpler data science projects. Unlike KDD, SEMMA is cyclic in nature and strictly follows this cycle.

Steps involved in SEMMA

There are 5 steps involved in SEMMA.

  • Sample
  • Explore
  • Modify
  • Model
  • Assess

Sample

A sample is a portion of the data extracted from the larger dataset, small enough to be manipulated easily.

Explore

Engineers explore the data and find useful trends and anomalies by understanding the dataset.

Modify

This step is similar to transformation: unnecessary data variables and outliers are removed, and engineers create and select the data for further processing.

Model

In this step, different modeling techniques are used. The techniques are selected on the basis of the goal you want to achieve through data mining.

Assess

It is the final stage, at which the results are evaluated and checked for reliability. Data scientists measure the performance and, if the goal has not been reached, repeat the steps.

Cross-industry Standard Process For Data Mining

CRISP-DM stands for Cross-Industry Standard Process for Data Mining. This process model is used in almost all types of data mining projects, both simple and complex. It is similar to KDD and SEMMA in its cyclic nature, but it differs in that you can also move backwards after moving forward. For example, if you are at the modeling step and realize that the data is not sufficient to proceed, you can go back to the data preparation step without repeating the whole cycle. The model has six steps that describe the data science project life cycle.

This process model was established in the late 1990s, and since then it has become the most widely used and reliable process model for data mining and data science projects. It fits well with agile methodology and streamlines the process.

Steps in CRISP-DM

  • Business Understanding
  • Data Understanding
  • Data Preparation
  • Modeling
  • Evaluation
  • Deployment

KDD vs. SEMMA vs. CRISP-DM

In our view, KDD, CRISP-DM, and SEMMA are very similar, with only minor differences. KDD and SEMMA are almost the same, except that KDD does not follow a strictly cyclic structure while SEMMA does. SEMMA can be used for simpler data science projects rather than heavily data-specific ones. KDD, on the other hand, focuses on data-intensive projects where it has to deal with very large datasets, processing them through all of its steps. CRISP-DM differs in that it combines the Sample and Explore steps of SEMMA and the Selection and Pre-processing steps of KDD into its business and data understanding phases. One of the major differences is that CRISP-DM also includes a deployment phase, which is not present in KDD or SEMMA.

Now we will discuss each step in detail, starting from the initial stage, business understanding.

Business Understanding

This is the first and most important step of the project life cycle. This phase deals with information gathering and requirements elicitation. For data science and data mining projects, as for any other project, success depends strongly on how well the requirements are elicited from the project stakeholders. This step is the foundation for all the later steps and procedures, which is why we must correctly determine the business goals and constraints of the project. It is further divided into a few stages.

  • Define business problem
  • Assess and analyze scenarios
  • Define data mining goals
  • Build the project plan

1. Define Business Problem – Objectives and Constraints

In this step, you sit with the project stakeholders and ask them for all the details related to the project: what is the business problem, and what do they expect to achieve? You elicit every requirement from your stakeholders and note the constraints the customer places on the project. You also need to ensure that the customer's goals and objectives are "SMART":

  • "S" = specific
  • "M" = measurable
  • "A" = attainable
  • "R" = relevant
  • "T" = time bound

This means that the customer should not demand a product that is neither feasible nor possible to develop. The objectives should also be time bound; it often happens that a client demands a complex project in a very limited time.

2. Assess and Analyze Scenarios

In this step, the data engineer checks the resources available to accomplish the project. He analyzes the project requirements, business constraints, and assumptions, and notes down the possible risks and contingencies. Moreover, a cost-benefit analysis is carried out to determine how much we will invest and what revenue we can expect in the future. Let me briefly explain what a cost-benefit analysis is.

Cost-benefit analysis is one of the techniques used in the feasibility study phase of software project management. In this technique, we weigh the costs against the benefits in a project proposal. Every project has a list of expenses and a list of expected benefits after successful completion. Here you can calculate the return on investment (ROI), internal rate of return (IRR), net present value (NPV), and the payback period.

The purpose of cost-benefit analysis in project management is to weigh the pros and cons of various key items, including transactions, tasks, software requirements, and investments. In short, cost-benefit analysis helps you find the approach that achieves your goals with the least possible investment.
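As a back-of-the-envelope illustration of these formulas, the figures below are invented; only the arithmetic matters.

```python
# Back-of-the-envelope cost-benefit figures -- the cash flows below are
# invented purely to illustrate the formulas.
investment = 100_000                            # up-front project cost
cash_flows = [30_000, 40_000, 50_000, 50_000]   # expected yearly benefits
rate = 0.10                                     # discount rate

# Net Present Value: discount each year's benefit back to today
npv = sum(cf / (1 + rate) ** t for t, cf in enumerate(cash_flows, start=1)) - investment

# Return on Investment: total benefit relative to cost
roi = (sum(cash_flows) - investment) / investment

# Payback period: first year in which cumulative benefits cover the investment
cumulative, payback = 0, None
for year, cf in enumerate(cash_flows, start=1):
    cumulative += cf
    if cumulative >= investment and payback is None:
        payback = year

print(f"NPV = {npv:,.0f}, ROI = {roi:.0%}, payback = year {payback}")
```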

3. Define Data Mining Goals and Problem

This is the third step in the business understanding phase of CRISP-DM. In this phase, along with defining the business objectives and success criteria, you should also define the data mining goals. What are the criteria for their success? In other words, how will you be sure that success has been achieved? State what success looks like from the data mining expert's perspective.

4. Build the Project Plan

After gathering all the business and functional requirements along with the constraints, you are ready to proceed to project plan development. In this phase, you develop the project plan by defining all the steps involved in completing the project. You also describe the tools and technologies that will be used throughout the project.

The next stage in CRISP-DM is data understanding. Let us have a look at that.

Data Understanding

This is the second phase of CRISP-DM, and it builds on the first phase, business understanding. In this step, engineers focus on identifying the relevant datasets, collecting the relevant information from them, and analyzing it so that it is sufficient to accomplish the goals. The goal of this phase is to build a high-quality dataset and establish a strong relationship between this dataset and the target variable. This is achieved in the following steps:

  • Data collection
  • Data description
  • Exploratory data analysis
  • Data quality analysis

1. Data Collection

This is the initial step of collecting data, in which you gather data from all available resources such as websites and forums. Afterwards, you combine the data, analyze it, and produce a report from it.

2. Data Description

After collecting all the necessary data from the available resources, the next step is to describe it. Data description means examining the data and creating a data description report. The report should capture all the properties of the data, such as its format, its size, the number of records, and all the fields associated with it. At the end of this phase your data description report will be ready and you can proceed to the next step.

3. Exploratory Data Analysis

The next step is to fully explore your data. Before moving on to quality assessment, you need a good understanding of it. Datasets are usually ambiguous and noisy, with missing entities and values. Engineers use different data visualization and summarization tools to resolve ambiguous data and fill in missing information. In this step, data scientists query the data and check whether it provides the required information. They visualize the data and identify suitable relationships within it. Once they are sure about the data, they produce a data exploration report.

4. Data Quality Analysis

After successfully exploring the data and removing ambiguities, the next step is to verify the quality of the available dataset. The data may not be 100% accurate, but its quality can be maximized by taking certain measures:

  • The data should be complete and should have all the desired attributes.
  • The data should be accurate and credible. It should be taken from credible sources so that the results are trustworthy.
  • You should take up-to-date data from the sources; if it is not available, move on to the next source, as it is very important to have the latest data.
  • Make sure that the data is consistent. Check that the facts and figures are the same across all sources.
  • The data should be validated against authentic sources, and you should be careful about data integrity.

In this way you can obtain quality data. This data will help engineers streamline their tasks and make it easy for them to work with it. Once your data is cleaned, produce a data quality report.

Data Preparation

In this phase you develop the final dataset for modeling. According to data mining experts and data scientists, preparing the data for modeling is the main task in the whole project life cycle, and it is a lengthy one. This step is further divided into the following steps:

  • Data selection
  • Data cleaning
  • Data construction
  • Data integration
  • Data wrangling
  • Feature extraction and evaluation
  • Attribute selection and generation
  • Data formatting

1. Data Selection

In this step, the data is finalized and data engineers select the data to use. They also create a document stating why this particular dataset was selected and why others were rejected.

2. Data Cleaning

After the data has been selected, it is cleaned and data engineers remove all duplicate values.

3. Data Construction

Data construction is another interesting task in the data preparation phase. In this step you combine different attributes of the data to create a new attribute that will help you later. Just as combining red and yellow gives you orange, combining the height and weight attributes gives you the body mass index.
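A small pandas sketch of constructing BMI from height and weight (hypothetical column names and values) is shown below.

```python
# Data construction: deriving a new attribute (BMI) from existing ones.
# The column names and values are hypothetical.
import pandas as pd

df = pd.DataFrame({
    "height_m": [1.65, 1.80, 1.72],
    "weight_kg": [60.0, 95.0, 70.0],
})

# New attribute built from two existing attributes
df["bmi"] = df["weight_kg"] / df["height_m"] ** 2
print(df)
```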

4. Data Integration

Data integration is the technique in which data from different heterogeneous sources, such as databases and websites, is combined into a single dataset. It is a preprocessing technique that gives a unified view of the data. There are two types of data integration:

  • Tight coupling
  • Loose coupling

We will discuss each in brief detail.

Tight Coupling

In this type of data integration, the data warehouse is treated as the authoritative source of the data. Engineers gather data from different sources and combine it into a single physical location for modeling. It involves an extraction, transformation, and loading (ETL) process.

Loose Coupling

In this type of data integration, an interface is developed through which a query is forwarded to the source database, which returns the results matching the user's request. The data is not duplicated and remains only in the source database.
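As a sketch of combining heterogeneous sources into one table, the pandas snippet below merges two hypothetical exports; the file and column names are placeholders.

```python
# Combining data from two heterogeneous sources into one unified table --
# file and column names are hypothetical.
import pandas as pd

customers = pd.read_csv("crm_customers.csv")          # e.g. exported from a CRM
orders = pd.read_csv("webshop_orders.csv")            # e.g. exported from a web shop

# Tight-coupling style: physically combine both sources into one dataset
combined = orders.merge(customers, on="customer_id", how="left")

# Stacking records that share the same schema from two systems
all_orders = pd.concat([orders, pd.read_csv("store_orders.csv")], ignore_index=True)

print(combined.head())
```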

Issues in Data Integration

There are several issues in data integration, such as schema integration, redundancy, and detecting and resolving data value conflicts.

  • Schema Integration: entities from different databases have to be matched on some attribute, which can be difficult when the schemas differ.
  • Redundancy: an attribute may already be available in another database and can be derived from there.
  • Resolving Data Value Conflicts: a value in one database may differ from the corresponding value in another database. This is one of the main problems that can arise when integrating data.

Data Wrangling

Data wrangling is the process of cleaning, structuring, and enriching raw data into a desired format so that better decisions can be made in less time. It is sometimes also known as data munging. Data wrangling has become increasingly common at today's top firms: data has become more diverse and unstructured, demanding more time spent gathering, cleaning, and organizing it ahead of broader analysis. At the same time, with data informing almost every business decision, business users have less time to wait on technical teams for prepared data.

Feature Extraction and Engineering

Feature extraction is the process of extracting the relevant features from the dataset. Data engineers select features from the dataset and extract them for their use. Feature selection is discussed below. The dataset contains many features, but extracting the relevant ones is the goal of this step.

Attribute Generation and Selection

Attribute generation is also known as feature generation. In feature or attribute generation you pick one or two attributes from the available data and combine them, generating a new feature. For example, you can take the average of two variables or calculate a value as a percentage.

In attribute selection, or feature selection, the important and useful features are selected from the many available in the dataset. It is considered an important step in the data mining project life cycle because it lets you judge which features matter for the predictive model and how the attributes are related to each other. Data engineers select the good features and ignore ambiguous data, even after the cleaning process.
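A minimal feature selection sketch with scikit-learn's SelectKBest is shown below, using a bundled dataset in place of a real project dataset.

```python
# Feature selection sketch: keep the k most informative features
# (uses a bundled scikit-learn dataset for illustration).
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import SelectKBest, f_classif

X, y = load_breast_cancer(return_X_y=True)
feature_names = load_breast_cancer().feature_names

selector = SelectKBest(score_func=f_classif, k=5).fit(X, y)
X_selected = selector.transform(X)                      # only the 5 chosen columns

print("kept features:", list(feature_names[selector.get_support()]))
print("shape before/after:", X.shape, X_selected.shape)
```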

Data Formatting

The last step before moving on to modeling is data formatting. In data formatting, engineers re-format the data so that it follows a common standard that everyone can understand and that is easy to operate on, for example converting all character columns to a string data type.
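A small pandas formatting sketch (hypothetical columns) might look like this.

```python
# Data formatting: bringing columns to a common, agreed standard --
# the column names here are hypothetical.
import pandas as pd

df = pd.DataFrame({
    "Customer_ID": [1, 2],
    "Name": ["Ana ", " bo"],
    "Signup": ["2021-03-01", "2021-04-15"],
})

df["Customer_ID"] = df["Customer_ID"].astype("string")      # IDs as strings, not numbers
df["Name"] = df["Name"].str.strip().str.title()             # consistent capitalisation
df["Signup"] = pd.to_datetime(df["Signup"])                 # one date format everywhere
df.columns = [c.lower() for c in df.columns]                # consistent column naming

print(df.dtypes)
```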

Modeling

According to data science experts, modeling is the least time-consuming task in the project life cycle, yet it is a crucial step of CRISP-DM. In this step, the findings and the cleaned data are used to build and evaluate models. You provide input to the model using different data modeling techniques, and at this point you already know which kind of model you have to choose. There are five further steps in this phase.

  • Select modeling method or technique
  • Generate test design
  • Build model
  • Model evaluation and tuning
  • Model assessment

Select Modeling Methods Or Techniques

This is the first step in modeling, where you choose the best possible modeling technique for training your model. You may already have shortlisted a modeling methodology in the earlier business understanding phase of CRISP-DM, but now you select the final modeling technique. Different modeling techniques include:

  • Regression
  • Clustering
  • Decision Tree building
  • Neural network generation
  • Classification
  • Outlier detection
  • Association
  • Prediction
  • Sequential patterns
  • Statistical techniques
  • Machine learning and artificial intelligence

If your experts think that several modeling techniques apply, you should apply each technique separately and then compare the results.

  • Modeling technique: finalize the modeling technique after discussion and document it.
  • Modeling assumptions: if you apply more than one modeling technique, each may give a different result, so record the assumptions behind each.

Generate Test Design

Before building the model, you have to create a test design to check the validity and quality of the model. In this step, engineers split their data into training, validation, and test sets. In supervised data mining projects, for example, error rates are the usual quality measure. The main goal of this phase is to divide your whole dataset into training, validation, and test datasets.
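A common way to produce such a split is sketched below with scikit-learn; the 60/20/20 proportions are a typical choice, not a rule from CRISP-DM.

```python
# Generating a test design: split the data into training, validation and
# test sets (bundled dataset used for illustration).
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)

X_train, X_temp, y_train, y_temp = train_test_split(X, y, test_size=0.4, random_state=0)
X_val, X_test, y_val, y_test = train_test_split(X_temp, y_temp, test_size=0.5, random_state=0)

print(len(X_train), len(X_val), len(X_test))   # roughly 60% / 20% / 20%
```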

Build Model

In this step, engineers run the modeling tool on the prepared dataset to generate one or more models. Before building the model you have to take care of a few important things, such as parameters: all modeling tools have their own set of parameters that need to be adjusted for the target dataset. Set the parameters properly and record their values. After this step your models have been built using the chosen modeling technique.

Model Evaluation and Improvement by Tuning

It is very important to measure the accuracy and efficiency of your data mining model. Although this is a difficult task, you have to do it to get reliable results. These models are used for prediction, so it is important to build a reliable model, and this reliability often depends on the choice of modeling technique. Every modeling technique has its own evaluation criteria; the most commonly evaluated tasks are clustering, classification, and regression, the latter two being typical of supervised data mining projects.
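As a sketch of evaluation and tuning, the snippet below estimates accuracy with cross-validation and tries a small hyper-parameter grid on a bundled scikit-learn dataset.

```python
# Evaluating and tuning a model: cross-validated error estimates and a small
# grid search over hyper-parameters (bundled dataset used for illustration).
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, cross_val_score

X, y = load_breast_cancer(return_X_y=True)

model = RandomForestClassifier(random_state=0)
print("CV accuracy:", cross_val_score(model, X, y, cv=5).mean())

# Tune: try a few parameter settings and keep the best
grid = GridSearchCV(model, {"n_estimators": [50, 100], "max_depth": [3, None]}, cv=5)
grid.fit(X, y)
print("best params:", grid.best_params_, "best CV accuracy:", grid.best_score_)
```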

Model Assessment

As discussed earlier, there are several data mining models, each fitting its own domain and specification. Data scientists interpret the models according to their domain knowledge, their data mining success criteria, and the intended test design. Judge the success of the modeling and discovery techniques technically first, then involve business analysts and domain experts to discuss the data mining results in the business context. This step considers only the models, while the evaluation phase also considers all the other results produced over the course of the project. At this point you should rank the models and assess them against the evaluation criteria, keeping the business objectives and business success criteria in mind as far as possible. In most data mining projects a single technique is applied more than once, and results are generated with several different techniques. After this step, you fine-tune the model until it is good enough to proceed further.

Evaluation

At this stage, the work is approaching the finish line. Data engineers evaluate the obtained results and check whether the business objectives have been achieved. The results are evaluated against the requirements specified by the customer. This phase has three steps.

  • Evaluation of the results
  • Review process
  • Determine the next steps

Evaluation of Results

Once the models are built and tuned, check whether they meet the client's requirements. If they do not meet the client's business requirements, build another one. Also check against the "SMART" objectives.

Review Process

Skim through the whole process and check for any missing work. Check that all the steps were covered properly and correct any minor mistakes. Go through the documentation and finalize the process.

Determine the Next Steps

This is the step where you decide whether to deploy, or to make revisions or changes to the previous work. If you or your customer are not satisfied with the results, you can rebuild the data model. You can also start new data mining projects without deploying the existing one.

Deployment

Once you have decided to bring your model to life, it is time to deploy the project. You deploy it so that the customer or client can access it and enjoy the benefits of the model. This last phase also has four steps.

  • Plan deployment
  • Plan monitoring and maintenance
  • Produce final report
  • Review complete project

Plan Deployment

Everything needs proper documentation for record keeping. In the same way, you have to develop a plan document for deploying the model.

Plan Monitoring and Maintenance

As the name suggests, a monitoring and maintenance plan is made to avoid problems in the future. If the model breaks down, an engineer should be able to restore it; and if the model stops delivering results, another data engineer can read the maintenance document and get it running again.

Produce Final Report

As with every project, it is considered good practice to create a final report for the completed project. This report may include a presentation of how all the tasks were completed and finalized, how the model was built, and what measures were taken to build the data mining model. It may also contain the evaluation results to be presented to the client upon final delivery.

Review Complete Project

This step involves going through the documentation of the whole project. Stakeholders review how the project went: what milestones were achieved, what hurdles came up along the way, how the results could be improved, and which process methodology works better and why. All of this is noted because it is very helpful for future data mining projects.

Conclusion

This article has covered the important areas of data science, including a study of different data mining project life cycles, along with various terminologies and concepts related to data mining and data science. In our opinion, if you focus only on simple data and modeling, you will not be able to move ahead; that is why you should focus on the CRISP-DM methodology, the leading data mining project life cycle methodology. As you have seen, it differs from the other two (KDD and SEMMA) in that it adds deployment and business understanding, the latter being the most crucial part since all the other steps depend on it. We also suggest that you read more about machine learning and artificial intelligence, as they are the defining innovations of this generation and the next.
