Data inspection

Get to know your data

Before moving into the analysis, we need to inspect the structure and format of the data and check its quality and quantity. For each column in a DataFrame, we should understand the content, the meaning of the column label, the datatype, and the units. We cannot interpret the results if we do not fully understand the context and the quality of the data. A few standard questions you should always consider are:

What is the structure of the data (datatypes)?
How many features and how many observations are available?
Are there any missing data, and is imputation needed?
Is the data balanced?
Are there outliers?
Is normalization needed?
Is data transformation needed?
Is any of the data redundant?
Are the data independent of each other, or are there relations?
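Several of these questions can be checked with a few lines of pandas. The following is a minimal sketch, not a complete quality check: the DataFrame and its column names (age, sex, died) are made up for illustration, and the 1.5 * IQR rule is only one common rule of thumb for flagging outliers.

```python
import numpy as np
import pandas as pd

# Hypothetical toy dataset; column names and values are illustrative only.
df = pd.DataFrame({
    "age": [63, 55, 71, np.nan, 48, 55, 120],
    "sex": ["M", "F", "M", "M", "M", "F", "M"],
    "died": [0, 0, 1, 0, 0, 0, 1],
})

# Missing data: number of missing values per column
missing_per_column = df.isnull().sum()

# Balance: relative frequency of each class in the (assumed) target column
class_balance = df["died"].value_counts(normalize=True)

# Outliers: flag values more than 1.5 * IQR below Q1 or above Q3
q1, q3 = df["age"].quantile([0.25, 0.75])
iqr = q3 - q1
is_outlier = (df["age"] < q1 - 1.5 * iqr) | (df["age"] > q3 + 1.5 * iqr)

# Redundancy: number of rows that exactly repeat an earlier row
n_duplicates = df.duplicated().sum()

print(missing_per_column)
print(class_balance)
print(df[is_outlier])
print(n_duplicates)
```

Whether a flagged value is an error or a genuine extreme observation still has to be judged from the context of the data.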

Furthermore, it might be necessary to reshape the data, subset it, or combine it with other sources to bring it into its final format. Some of the above questions can be answered by analyzing the structure and the statistics of the data; others need to be answered by plotting the data. A list of the most commonly used methods and attributes:

df.head() # shows the first records of a dataset (five by default)
list(df.columns) # lists the column names
len(df) # returns the number of rows
df.shape # returns the number of rows and columns of the dataset
df.dtypes # returns the datatypes
df.info() # prints a summary of the structure
df.describe().T # returns descriptive stats (transposed)
df.isnull().sum() # returns the number of missing values per column
df['column'].unique() # returns the unique values in a column
df['column'].value_counts() # returns the unique values and their counts
df.corr(numeric_only=True) # creates a correlation matrix of the numeric columns (pandas 1.5 or later), input for a heatmap
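Reshaping, subsetting, and combining data, as mentioned above, could look like the sketch below. Both tables and all column names (patient_id, visit_1, visit_2, sex, blood_pressure) are hypothetical; melt and merge are just one possible choice for each step.

```python
import pandas as pd

# Hypothetical tables; all names and values are made up for illustration.
measurements = pd.DataFrame({
    "patient_id": [1, 2, 3],
    "visit_1": [120, 135, 110],
    "visit_2": [118, 140, 115],
})
patients = pd.DataFrame({
    "patient_id": [1, 2, 3],
    "sex": ["F", "M", "F"],
})

# Reshape: wide to long, one row per patient and visit
long_df = measurements.melt(
    id_vars="patient_id", var_name="visit", value_name="blood_pressure"
)

# Combine: add patient attributes from the second table
combined = long_df.merge(patients, on="patient_id", how="left")

# Subset: keep only the rows that meet a condition
high = combined[combined["blood_pressure"] > 130]

print(combined)
print(high)
```

After each step it is worth checking df.shape again, since a merge on a non-unique key can silently multiply rows.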

How the data are distributed can be derived with the .describe() method for numerical data and with .value_counts() for categorical data or booleans. A graphical inspection, however, is highly recommended. Please read the Data visualization tutorial for more information. An example of data inspection is given in the Heart Failure Case Study.
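As a small illustration of the two approaches, the sketch below summarizes one numerical, one boolean, and one categorical column. The dataset is invented for the example and is not taken from the Heart Failure Case Study.

```python
import pandas as pd

# Hypothetical toy dataset; column names and values are illustrative only.
df = pd.DataFrame({
    "age": [40, 50, 60, 70],
    "smoker": [True, False, False, False],
    "sex": ["F", "M", "M", "F"],
})

# Numerical column: count, mean, std, min, quartiles, and max
numeric_summary = df["age"].describe()

# Boolean column: absolute counts per value
smoker_counts = df["smoker"].value_counts()

# Categorical column: relative frequencies instead of counts
sex_share = df["sex"].value_counts(normalize=True)

print(numeric_summary)
print(smoker_counts)
print(sex_share)
```

Passing normalize=True makes imbalance easier to judge when the number of observations differs between datasets.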
