Data Inspection

Before moving into the analysis, we need to inspect the structure and format of the data and check its quality and quantity. For each column in a DataFrame, we should understand the content, the meaning of the column label, the datatype, and the units. We cannot interpret the results if we do not fully understand the context and the quality of the data. A few standard questions you should always consider are:

  • What is the structure of the data (datatypes)?

  • How many features and how many observations are available?

  • Are there any missing data and is imputation needed?

  • Is my data balanced?

  • Are there outliers?

  • Is normalization needed?

  • Is data transformation needed?

  • Is data redundant?

  • Are the data independent of each other, or are there relations?
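Several of these questions can be checked with a few lines of pandas. The following is a minimal sketch, assuming a small made-up DataFrame; the column names (`age`, `weight_kg`, `label`) and the 1.5 × IQR outlier rule are illustrative choices, not part of any particular dataset or a fixed standard.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, invented for illustration only
df = pd.DataFrame({
    "age": [23, 35, 41, np.nan, 29, 120],
    "weight_kg": [60.0, 82.5, 77.0, 70.0, np.nan, 68.0],
    "label": ["a", "a", "a", "a", "b", "a"],
})

# Missing data: fraction of missing values per column
missing_fraction = df.isnull().mean()

# Balance: relative frequency of each class in the (assumed) target column
class_balance = df["label"].value_counts(normalize=True)

# Outliers: flag values more than 1.5 * IQR outside the quartiles
# (a common rule of thumb, not the only possible definition)
q1, q3 = df["age"].quantile([0.25, 0.75])
iqr = q3 - q1
is_outlier = (df["age"] < q1 - 1.5 * iqr) | (df["age"] > q3 + 1.5 * iqr)

# Redundancy: number of fully duplicated rows
n_duplicates = df.duplicated().sum()

print(missing_fraction)
print(class_balance)
print(df[is_outlier])
print(n_duplicates)
```

Whether a flagged value is a true outlier or a data-entry error still has to be judged in the context of the data.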

Furthermore, it may be necessary to reshape the data, subset it, or combine it with other sources to bring it into a final format. Some of the above questions can be answered by analyzing the structure and the statistics of the data; others need to be answered by plotting the data. A list of the most commonly used methods and attributes:

df.head() # shows the first records of a dataset
list(df.columns) # lists the column names
len(df) # returns the number of rows
df.shape # returns the number of rows and columns of the dataset
df.dtypes # returns the datatypes
df.info() # prints a summary of the structure
df.describe().T # returns descriptive stats (transposed)
df.isnull().sum() # returns the number of missing values per column
df['column'].unique() # returns the unique values in a column
df['column'].value_counts() # returns the unique values and their counts
df.corr(numeric_only=True) # creates a correlation matrix of the numeric columns, input for a heatmap
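To see what these calls return, they can be tried on a small DataFrame. This is a sketch under stated assumptions: the data and column names (`species`, `weight_kg`, `age_years`) are made up for illustration.

```python
import pandas as pd

# Hypothetical example data; column names are invented for illustration
df = pd.DataFrame({
    "species": ["cat", "dog", "dog", "bird"],
    "weight_kg": [4.2, 12.0, None, 0.3],
    "age_years": [3, 5, 2, 1],
})

n_rows, n_cols = df.shape                    # number of rows and columns
column_names = list(df.columns)              # column labels as a plain list
missing_per_column = df.isnull().sum()       # missing values per column
species_counts = df["species"].value_counts()  # unique values and their counts

# Select the numeric columns first, so the text column does not
# interfere with the correlation matrix
corr = df.select_dtypes("number").corr()

print(df.dtypes)
print(df.describe().T)
print(corr)
```

Selecting the numeric columns explicitly is one way to keep the correlation step working when the DataFrame also contains text columns.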

pandas-profiling

pandas_profiling extends the pandas DataFrame with df.profile_report() for quick data analysis. Note that the package has since been renamed to ydata-profiling.

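# Generate a quick report from our dataset
from pandas_profiling import ProfileReport
profile = ProfileReport(df, minimal=True) # df is the DataFrame to profile
profile.to_file("EDA-Report.html") # write the report to an HTML file
profile # in a notebook, displays the report inline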

It generates a report with the following information:

  • Type inference: detect the types of columns in a dataframe.

  • Essentials: type, unique values, missing values

  • Quantile statistics like minimum value, Q1, median, Q3, maximum, range, interquartile range

  • Descriptive statistics like mean, mode, standard deviation, sum, median absolute deviation, coefficient of variation, kurtosis, skewness

  • Most frequent values

  • Histograms

  • Correlations: highlighting of highly correlated variables; Spearman, Pearson and Kendall matrices

  • Missing values: matrix, count, heatmap and dendrogram of missing values

  • Duplicate rows: lists the most frequently occurring duplicate rows

  • Text analysis: learn about categories (Uppercase, Space), scripts (Latin, Cyrillic) and blocks (ASCII) of text data
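To make a few of these report items concrete, the sketch below computes some of the same statistics directly with pandas. It is not how the profiling package computes them internally; the data and the names `x` and `group` are hypothetical.

```python
import pandas as pd

# Hypothetical data, invented for illustration
df = pd.DataFrame({
    "x": [1.0, 2.0, 2.0, 3.0, 10.0],
    "group": ["a", "b", "b", "a", "b"],
})
x = df["x"]

# Quantile statistics: Q1, median, Q3, range and interquartile range
quantiles = x.quantile([0.25, 0.5, 0.75])
value_range = x.max() - x.min()
iqr = quantiles.loc[0.75] - quantiles.loc[0.25]

# Descriptive statistics
coefficient_of_variation = x.std() / x.mean()
skewness = x.skew()
kurtosis = x.kurt()

# Most frequent values and duplicate rows
most_frequent = df["group"].value_counts().head(3)
duplicate_rows = df[df.duplicated(keep=False)]

print(quantiles)
print(value_range, iqr, coefficient_of_variation, skewness, kurtosis)
print(most_frequent)
print(duplicate_rows)
```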

An example of data inspection can be found at: https://bioinf.nl/~fennaf/DSLS/summerschool/dataprep.html
