Data inspection
Get to know your data
Before moving on to the analysis, we need to inspect the structure and format of the data and check its quality and quantity. For each column in a DataFrame, we should understand the content, the meaning of the column label, the datatype used, and the units used. We cannot interpret the results if we do not fully understand the context and the quality of the data. A few standard questions you should always consider are:
What is the structure of the data (datatypes)?
How many features and how many observations are available?
Are there any missing data and is imputation needed?
Is my data balanced?
Are there outliers?
Is normalization needed?
Is data transformation needed?
Is data redundant?
Are the data independent of each other, or are there relations between them?
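Several of these questions can be checked directly with pandas. The following is a minimal sketch, not a complete recipe: the DataFrame and its column names are made up for illustration, and the 1.5 × IQR rule used to flag outliers is one common convention among several.

```python
import numpy as np
import pandas as pd

# Hypothetical example data; all column names and values are made up.
df = pd.DataFrame({
    "age": [34, 45, 29, 51, 45, 38, 120],
    "weight_kg": [70.5, 68.0, np.nan, 64.3, 68.0, np.nan, 77.0],
    "smoker": [True, False, False, True, False, False, False],
    "outcome": ["healthy", "healthy", "sick", "healthy",
                "healthy", "healthy", "healthy"],
})

# Missing data: number of missing values per column.
missing_per_column = df.isna().sum()

# Balance: relative frequency of each class in the (assumed) target column.
class_share = df["outcome"].value_counts(normalize=True)

# Outliers: flag values outside 1.5 * IQR of the quartiles (a rule of thumb).
q1, q3 = df["age"].quantile([0.25, 0.75])
iqr = q3 - q1
outliers = df[(df["age"] < q1 - 1.5 * iqr) | (df["age"] > q3 + 1.5 * iqr)]

# Redundancy: number of rows that exactly repeat an earlier row.
n_duplicates = int(df.duplicated().sum())

# Relations: pairwise Pearson correlation between the numerical columns.
correlations = df[["age", "weight_kg"]].corr()

print(missing_per_column)
print(class_share)
print(outliers)
print(n_duplicates)
print(correlations)
```

Whether a flagged value is a true outlier or a data entry error, and whether missing values should be imputed or dropped, still depends on the context of the data.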
Furthermore, it may be necessary to reshape the data, subset it, or combine it with other sources to bring it into its final format. Some of the questions above can be answered by analyzing the structure and the statistics of the data; others need to be answered by plotting the data. A list of the most commonly used methods and attributes:
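As an illustration, the sketch below runs a handful of pandas methods and attributes that are typically used for a first look at a DataFrame. The selection is an assumption about what is most useful here, and the example data is made up.

```python
import pandas as pd

# Hypothetical example data; all column names and values are made up.
df = pd.DataFrame({
    "city": ["Oslo", "Bergen", "Oslo", "Tromso"],
    "temperature_c": [4.5, 7.1, 3.9, -2.0],
    "rainy": [True, True, False, False],
})

print(df.head(2))            # first rows of the table
print(df.tail(2))            # last rows of the table
print(df.shape)              # (number of observations, number of features)
print(df.columns.tolist())   # column labels
print(df.index)              # row labels
print(df.dtypes)             # datatype of each column
df.info()                    # dtypes, non-null counts, and memory usage
print(df.nunique())          # number of distinct values per column
```

Note that .shape, .columns, .index, and .dtypes are attributes and are used without parentheses, while .head(), .tail(), .info(), and .nunique() are methods.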
The distribution of the data can be obtained with the .describe() method for numerical data and with .value_counts() for categorical data or booleans. A graphical inspection, however, is highly recommended. Please read the Data visualization tutorial for more information. An example of data inspection is given in the Heart Failure Case Study.
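A minimal sketch of both calls, assuming a small made-up DataFrame with one numerical, one categorical, and one boolean column:

```python
import pandas as pd

# Hypothetical example data; all column names and values are made up.
df = pd.DataFrame({
    "height_cm": [160.0, 170.0, 180.0, 190.0],
    "group": ["a", "b", "a", "a"],
    "treated": [True, False, True, True],
})

# Numerical column: count, mean, standard deviation, min, quartiles, max.
numeric_summary = df["height_cm"].describe()

# Categorical column: number of occurrences of each category.
group_counts = df["group"].value_counts()

# Boolean column: share of True and False values instead of raw counts.
treated_share = df["treated"].value_counts(normalize=True)

print(numeric_summary)
print(group_counts)
print(treated_share)
```

Passing normalize=True to .value_counts() returns fractions rather than counts, which makes it easier to compare columns with different numbers of observations.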