Introduction
From stupid data to intelligent solutions
Raw data is useless: it is nothing more than recorded facts. Data can be unstructured, like a picture, a sound, a video, a web page or a piece of text, or it can be more structured, in tables such as Excel sheets and databases. This raw data is basically stupid data. We need techniques to extract information from it: once we extract a pattern from the underlying data, we can call it information. This information can help us to gain more insight, to make a decision or to predict the future.

For instance, we have learned that high blood pressure, having diabetes and a high age are risk factors for a heart attack. The relation between the blood pressure, diabetes and age values on the one hand and the heart attack risk on the other is the pattern. Such a pattern in the underlying data can only be detected with a certain reliability if we use a lot of examples and if we study the relation with advanced statistics. Once we have determined the pattern, we can use it to instruct the computer to extract information from new examples. Most of the time we need a sequence of instructions to tell the computer how to follow the pattern and arrive at a result. This instruction sequence is called an algorithm.

Machine learning techniques automatically find such patterns in data. We call this training a machine learning model; the model contains the algorithm used to predict an outcome from data. We can write computer programs to train these models, and we can write computer programs to evaluate and improve them. Once we are satisfied with the performance of a model, we can write a computer program that uses it in a user or production environment. This computer program that uses the developed model is called an AI system. We now have an intelligent solution.
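To make the idea of an algorithm concrete, here is a minimal sketch in Python. The thresholds below are invented for the example and not medically validated; a real machine learning model would learn the pattern from many examples instead of having it hard-coded.

```python
# A toy "algorithm": a fixed sequence of instructions that applies a
# pattern to new examples. All thresholds are made up for illustration.

def heart_attack_risk(systolic_bp: int, has_diabetes: bool, age: int) -> str:
    """Return a coarse risk label based on three hypothetical risk factors."""
    risk_factors = 0
    if systolic_bp > 140:   # hypothetical high blood pressure cut-off
        risk_factors += 1
    if has_diabetes:
        risk_factors += 1
    if age > 65:            # hypothetical age cut-off
        risk_factors += 1
    return "high risk" if risk_factors >= 2 else "lower risk"

print(heart_attack_risk(systolic_bp=155, has_diabetes=True, age=70))  # high risk
```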
The data scientist's workflow
Before we answer any research question, it is best practice to explore the data, so that we have an idea of its quality and quantity and of the relations within it. Most of the time data is unstructured and needs to be cleaned and structured before we can use any model or analysis tool; this is called data processing. A data scientist's workflow can be broken down into several components:
Define the problem
Gather the data (or ask the data engineer for the data)
Import the data
Clean, transform and structure data
Convert to a tidy dataframe structure (see the sketch after this list)
Explore the data
Practical
Graphical
Analytical
Model and/or analyse the data
Answer the problem
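To make the "tidy dataframe" step concrete, here is a minimal sketch with made-up measurements. In a tidy table, every row is one observation and every column is one variable:

```python
import pandas as pd

# Made-up "wide" data: one row per patient, one column per year.
wide = pd.DataFrame({
    "patient": ["A", "B"],
    "bp_2022": [120, 140],
    "bp_2023": [125, 150],
})

# Tidy form: one row per observation (patient, year, blood pressure).
tidy = wide.melt(id_vars="patient", var_name="year", value_name="blood_pressure")
tidy["year"] = tidy["year"].str.replace("bp_", "", regex=False).astype(int)
print(tidy)
```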
This workflow is not a linear process: in many cases it involves multiple back-and-forths between the different parts of the process (Okahim 2019). The step to gather the data is often the most time-consuming part. Unfortunately, data from real-life cases is often not nicely structured, so we need to manipulate unstructured and/or messy data into a structured, clean form. We may need to drop rows and columns because they are not needed for the analysis or because they contain too many missing values. Maybe we need to relabel columns or reformat characters into numerical values, or we need to combine data from several sources. Cleaning and manipulating data into a structured form is called data processing. Data processing starts with data in its raw form and converts it into a more readable format (tables, graphs, etc.), giving it the form and context necessary to be interpreted by computers and utilized by users.
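A minimal pandas sketch of the cleaning steps described above; the data and column names are hypothetical, invented for illustration (in practice the raw table would come from a file, e.g. `pd.read_csv`):

```python
import pandas as pd

# Hypothetical raw patient data with messy labels and missing values.
raw = pd.DataFrame({
    "AGE": [63, None, 71, 58],
    "Diabetic": ["yes", "no", "yes", "no"],
    "notes": ["", "follow-up", "", ""],   # not needed for the analysis
})

clean = (
    raw
    .drop(columns=["notes"])                                  # drop an unused column
    .dropna(subset=["AGE"])                                   # drop rows with missing age
    .rename(columns={"AGE": "age", "Diabetic": "diabetic"})   # relabel columns
    .assign(diabetic=lambda df: df["diabetic"].map({"yes": 1, "no": 0}))  # text -> numeric
)
print(clean)
```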
The Data Scientist's toolbox
In previous courses, you learned the basics of programming in Python and object-oriented Python. In this course, we use Python together with the libraries NumPy and Pandas. These are high-performance libraries especially suitable for data manipulation and computation. They help to structure and explore the gathered data in order to prepare it in a suitable format for modeling.