Introduction
From stupid data to intelligent solutions
Raw data is useless. Raw data is nothing more than recorded facts. Data can be unstructured, like a picture, a sound, a video, a web page, or a piece of text. Data can also be structured in tables, such as Excel sheets and databases. This raw data is basically stupid data. We need techniques to extract information from it. Once we extract a pattern from the underlying data, we can call it information. This information can help us to gain more insight, to make a decision, or to predict the future. For instance, we may learn that high blood pressure, having diabetes, and a high age are risk factors for a heart attack. The relation between the blood pressure, diabetes, and age values on the one hand and the heart attack risk on the other is the pattern. Such a pattern can only be detected with a certain reliability if we use many examples and study the relation with advanced statistics. Once we have determined the pattern, we can use it to instruct the computer to extract information from new examples. Most of the time we need a sequence of instructions to tell the computer how to follow the pattern and come to a result. This instruction sequence is called an algorithm.
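As a minimal sketch of such an algorithm, the hypothetical rule below flags a patient as high risk based on made-up thresholds for blood pressure, diabetes, and age. The cut-off values are purely illustrative, not a real pattern learned from data:

```python
def heart_attack_risk(systolic_bp, has_diabetes, age):
    """Toy rule-based algorithm: flag a patient as high or low risk.

    The thresholds below are made up for illustration; a real pattern
    would be estimated from many examples with proper statistics.
    """
    risk_factors = 0
    if systolic_bp > 140:   # hypothetical blood pressure cut-off
        risk_factors += 1
    if has_diabetes:
        risk_factors += 1
    if age > 65:            # hypothetical age cut-off
        risk_factors += 1
    return "high risk" if risk_factors >= 2 else "low risk"

print(heart_attack_risk(systolic_bp=155, has_diabetes=True, age=70))  # high risk
```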
Machine learning techniques automatically find patterns in data. We call this training a machine learning model. The model contains the algorithm used to predict the outcome based on data. We can write computer programs to train these models, and we can write computer programs to evaluate and improve them. Once we are satisfied with the performance of a model, we can write a computer program that uses it in a user or production environment. Such a computer program that uses the developed model is called an AI system. We now have an intelligent solution.
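As a sketch of what training, evaluating, and using a model can look like in code, the example below uses the scikit-learn library (which is not part of this module's toolbox) on a tiny made-up dataset; all values are invented for illustration:

```python
# A minimal sketch, assuming the scikit-learn library and made-up data.
from sklearn.linear_model import LogisticRegression

# Each row: [systolic blood pressure, diabetes (0/1), age]; invented values.
X = [[120, 0, 35], [160, 1, 70], [150, 1, 68], [118, 0, 42],
     [145, 0, 66], [130, 1, 55], [170, 1, 75], [110, 0, 30]]
y = [0, 1, 1, 0, 1, 0, 1, 0]  # 1 = had a heart attack, 0 = did not

model = LogisticRegression(max_iter=1000)
model.fit(X, y)                        # train: let the algorithm find the pattern

print(model.score(X, y))               # evaluate: accuracy on the training data
print(model.predict([[158, 1, 72]]))   # use: predict the outcome for a new patient
```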
A data scientist's workflow
Key to modelling is the structure and quality of the data. This Programming 1 module will teach you to structure data into the required format by means of data wrangling into tidy dataframes, and to improve its quality. You will learn to analyse the quantity and quality of the data by means of data inspection and data exploration.
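As a first taste of data inspection with Pandas, the sketch below builds a small made-up dataframe and looks at its dimensions, column types, and missing values; the column names and values are invented:

```python
import pandas as pd

# A small made-up dataframe to illustrate data inspection.
df = pd.DataFrame({
    "age": [35, 70, None, 42],
    "systolic_bp": [120, 160, 150, None],
    "diabetes": ["no", "yes", "yes", "no"],
})

print(df.shape)         # number of rows and columns
print(df.dtypes)        # data type of each column
print(df.isna().sum())  # missing values per column
print(df.describe())    # summary statistics of the numeric columns
```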
A data scientist’s workflow can be broken down into several components:
Define the problem
Gather the data
Explore the data
Model with the data
Evaluate the model
Answer the problem
This workflow is not a linear process: in many cases it involves going back and forth between the different parts of the process (Okahim 2019).
The step of gathering the data is often the most time-consuming part. Unfortunately, data from real-life cases is often not nicely structured. We need to manipulate the unstructured and/or messy data into a structured, clean form. We may need to drop rows and columns because they are not needed for the analysis or because they contain too many missing values. Maybe we need to relabel columns or reformat characters into numerical values, or we need to combine data from several sources. Cleaning and manipulating data into a structured form is called data processing. Data processing starts with data in its raw form and converts it into a more readable format (tables, graphs, etc.), giving it the form and context necessary to be interpreted by computers and utilized by users.
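The sketch below illustrates these wrangling steps with Pandas on a tiny made-up messy table; the column names and values are hypothetical:

```python
import pandas as pd

# Made-up messy data: inconsistent labels, values stored as text,
# and a column that is not needed for the analysis.
raw = pd.DataFrame({
    "AGE": ["35", "70", None, "42"],
    "Diabetes": ["yes", "no", "yes", "no"],
    "notes": ["", "follow-up", "", ""],
})

clean = (
    raw
    .drop(columns=["notes"])                                  # drop an unused column
    .rename(columns={"AGE": "age", "Diabetes": "diabetes"})   # relabel columns
    .dropna(subset=["age"])                                   # drop rows with missing values
    .assign(
        age=lambda d: pd.to_numeric(d["age"]),                # characters -> numerical values
        diabetes=lambda d: d["diabetes"].map({"yes": 1, "no": 0}),
    )
)
print(clean)
```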
The data scientist's toolbox
In previous courses, you learned the basics of programming in Python and object-oriented Python. In this course, we use Python with the libraries NumPy and Pandas. These are high-performance libraries especially suitable for data manipulation and computation. They help to structure and explore the gathered data in order to prepare it in a suitable format for modelling.
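As a brief taste of both libraries, the sketch below computes with a NumPy array and selects rows from a Pandas dataframe; the values are made up:

```python
import numpy as np
import pandas as pd

# NumPy: fast computations on whole arrays at once.
bp = np.array([120, 160, 150, 118])
print(bp.mean())  # 137.0

# Pandas: labelled, table-like data built on top of NumPy.
df = pd.DataFrame({"age": [35, 70, 68, 42], "systolic_bp": bp})
print(df[df["systolic_bp"] > 140])  # select the rows of interest
```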