Tidy data

We have lots and lots of data at our disposal, but having data is only the first step toward meaningful information. The translation from data to information is done by analysing and visualising the data. Many tools have been developed for data analysis in Python: statistical tools, visualisation tools and machine learning tools. Most of these tools expect tidy data. The principles of tidy data provide a standard way to organise data values within a dataset. A standard makes initial data cleaning easier because you don’t need to start from scratch and reinvent the wheel every time. The tidy data standard has been designed to facilitate initial exploration and analysis of the data, and to simplify the development of data analysis tools that work well together [1].

The principles are as follows:

Each variable must have its own column
Each observation must have its own row
Each value must have its own cell
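
As a minimal sketch of what these principles mean in practice, the example below reshapes a small "wide" table into a tidy one with pandas. The table, its column names (country, year, cases) and its values are made up for illustration and are not taken from the dataset used in this tutorial.

```python
import pandas as pd

# Hypothetical "wide" table: the variable "year" is spread across the
# column headers instead of having its own column, so it is not tidy.
wide = pd.DataFrame({
    "country": ["A", "B"],
    "2017": [10, 20],
    "2018": [11, 21],
})

# melt() turns the year columns into rows, giving one row per
# (country, year) observation and one column per variable.
tidy = wide.melt(id_vars="country", var_name="year", value_name="cases")

# The years came from column headers, so they are strings; convert them.
tidy["year"] = tidy["year"].astype(int)

print(tidy)
```

After reshaping, each variable (country, year, cases) has its own column, each observation has its own row, and each value sits in its own cell.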

[1] Source: R for Data Science, Hadley Wickham & Garrett Grolemund, CC-BY-NC-ND 3.0 US, https://r4ds.had.co.nz/tidy-data.html

DataFrame

The tidy principles lead to a table, which we call a DataFrame. The pandas library supports organising data in DataFrames. A DataFrame is a collection of columns, each of which represents a variable. Each column can have a different value type (numeric, string, boolean, etc.). DataFrames are like matrices: they can be sliced by column and row index.

# import the pandas library
import pandas as pd
# read the csv data into a pandas DataFrame
df = pd.read_csv('data/TB_burden_age_sex_2019-04-23.csv')
# show the first rows of the table
df.head()
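
To illustrate per-column value types and slicing by column and row index, here is a short sketch on a small in-memory DataFrame. The name df_small and its columns are hypothetical and chosen only for this example; they do not reflect the columns of the CSV file loaded above.

```python
import pandas as pd

# Hypothetical DataFrame with a different value type in each column.
df_small = pd.DataFrame({
    "country": ["A", "B", "C"],       # strings
    "cases": [10, 20, 30],            # integers
    "reported": [True, False, True],  # booleans
})

# Inspect the value type of each column.
print(df_small.dtypes)

# Select a single column by name (returns a Series).
cases = df_small["cases"]

# Select rows by integer position: the first two rows.
first_two = df_small.iloc[0:2]

# Select rows by a condition and columns by label in one step.
subset = df_small.loc[df_small["cases"] > 15, ["country", "cases"]]
print(subset)
```

Here iloc selects by position and loc selects by label or boolean mask; which one fits better depends on whether you know the position or the name of what you want.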


