Data Inspection
Before we move into the analysis we need to inspect the structure and format of the data. We need to check the quality and quantity of the data. For each column in a DataFrame, we should understand the content, the meaning of the column label, the used datatype, and the used units. We cannot interpret the results if we do not understand fully the context of the data and the quality of the data. A few standard questions you should always consider are
What is the structure of the data (datatypes)
How many features and how many observations are there available
Are there any missing data and is imputation needed?
Is my data balanced?
Are there outliers?
Is normalization needed?
Is data transformation needed?
Is data redundant?
Is data independent from each other or are there any relations?
Furthermore, it might be needed to reshape data, subset the data, or combine it with other sources to structure it in a final format. Some of the above questions can be answered by analyzing the structure and the statistics of the data. Other questions need to be answered by plotting the data. A list of most commonly used methods and attributes:
pandas_profiling
extends the pandas DataFrame with df.profile_report()
for quick data analysis.
It generates a report with the following information:
Type inference: detect the types of columns in a dataframe.
Essentials: type, unique values, missing values
Quantile statistics like minimum value, Q1, median, Q3, maximum, range, interquartile range
Descriptive statistics like mean, mode, standard deviation, sum, median absolute deviation, coefficient of variation, kurtosis, skewness
Most frequent values
Histograms
Correlations highlighting of highly correlated variables, Spearman, Pearson and Kendall matrices
Missing values matrix, count, heatmap and dendrogram of missing values
Duplicate rows Lists the most occurring duplicate rows
Text analysis learn about categories (Uppercase, Space), scripts (Latin, Cyrillic) and blocks (ASCII) of text data
Last updated
Was this helpful?