Structured data

Taxonomy of data, and why data types and data structure are important

Structured and unstructured data

Data comes from many sources: sensor measurements, laboratory equipment measurements, dataloggers in factories, computer administration systems, cameras, apps, databases, websites, documents, spreadsheets, or even flat files. Most of this data is unstructured. Text, for example, is a sequence of words with little imposed structure, and images, audio recordings, and videos are unstructured as well. More structured data comes in tabular form with rows and columns, for instance tables in a spreadsheet or the results of a database query. We call such structured data tidy data.

A challenge in data science is to turn raw, unstructured data into structured data, since most data science tools and algorithms need a basic structure of rectangular data called a data frame. In a data frame, the rows contain the observations and the columns contain the features.
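As a minimal sketch of this row/column layout, the snippet below builds a small data frame with pandas. The sensor names and readings are made up for illustration; only the structure matters.

```python
import pandas as pd

# Hypothetical example data: each dict is one observation (row),
# each key is one feature (column).
records = [
    {"sensor": "A", "temperature": 21.5, "humidity": 40},
    {"sensor": "B", "temperature": 19.0, "humidity": 55},
    {"sensor": "C", "temperature": 22.3, "humidity": 47},
]

df = pd.DataFrame(records)

# shape is (number of observations, number of features)
n_rows, n_columns = df.shape
print(df)
print(f"{n_rows} observations, {n_columns} features")
```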

Key terms for structured data

data frame: a two-dimensional, rectangular data structure (like a spreadsheet), with rows containing records and columns containing features and, optionally, labels

record: a row within the data frame, containing one observation. Synonyms: sample, event, instance, example, case

feature: a column within the data frame, containing one property of the observations. Synonyms: attribute, input, variable, x, independent variable

label: a column within the data frame containing outcomes to be predicted by a model; not every data frame has one. Synonyms: dependent variable, response, target, outcome, y

   PATIENT                               DAYS    CODE_x   VALUE  UNITS
0  f58bf921-cba1-475a-b4f8-dc6fa3b8f89c  0 days  731-0    1.1    10*3/uL
1  f58bf921-cba1-475a-b4f8-dc6fa3b8f89c  0 days  48065-7  0.4    ug/mL
2  f58bf921-cba1-475a-b4f8-dc6fa3b8f89c  0 days  2276-4   332.4  ug/L
3  f58bf921-cba1-475a-b4f8-dc6fa3b8f89c  0 days  89579-7  2.3    pg/mL
4  f58bf921-cba1-475a-b4f8-dc6fa3b8f89c  0 days  14804-9  223.9  U/L

A typical data frame. Tidy data follows three principles [1]:

Each variable must have its own column
Each observation must have its own row
Each value must have its own cell

[1] Source: R for Data Science, Hadley Wickham & Garrett Grolemund, CC-BY-NC-ND 3.0 US, https://r4ds.had.co.nz/tidy-data.html
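To illustrate the three principles, here is a hedged sketch that reshapes an untidy table into a tidy one with `DataFrame.melt`. The table, its column names, and its values are hypothetical. In the untidy version the column headers "2019" and "2020" are values of a variable (the year) rather than variables themselves.

```python
import pandas as pd

# Hypothetical untidy table: one column per year, so the variable "year"
# is hidden in the column headers.
untidy = pd.DataFrame({
    "sensor": ["A", "B"],
    "2019": [100, 80],
    "2020": [110, 85],
})

# melt moves the year headers into a "year" column and the cell values
# into a "reading" column: one variable per column, one observation per row.
tidy = untidy.melt(id_vars="sensor", var_name="year", value_name="reading")
tidy["year"] = tidy["year"].astype(int)

print(tidy)
```

After reshaping, every row is one observation (one sensor in one year) and every column is one variable.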

Data types

There are two basic types of structured data: numerical and categorical. Numerical data can be continuous or discrete. Categorical data takes its values from a fixed set of categories, like names of universities (RUG, Hanze, Stenden). Categorical data can also be binary (yes/no) or ordered (less, sufficient, too much). The data type determines which type of statistical model, machine learning model, or visual display we should use in our data analysis. This is why software classifies data by type.

Key terms for data types

numeric

data that can be measured on a numeric scale
- continuous: data that can take any value in an interval (interval, float, numeric)
- discrete: data that can only take whole numbers (integer, counts)

categorical

data that can only take values from a specific set of categories (factors, nominals)
- logical: categorical data with only two values (binary, boolean)
- ordinal: categorical data that has an ordering
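The sketch below shows one way these types could map onto pandas dtypes. The column names and values are invented for illustration. The choice of `category` for nominal data and an ordered `pd.Categorical` for ordinal data is one reasonable convention, not the only one.

```python
import pandas as pd

# Hypothetical columns, one per data type discussed above.
df = pd.DataFrame({
    "weight_kg": [70.5, 82.1, 64.0],                # numeric, continuous
    "visits": [3, 1, 4],                            # numeric, discrete
    "university": ["RUG", "Hanze", "Stenden"],      # categorical, nominal
    "smoker": [True, False, False],                 # categorical, logical
    "intake": ["less", "too much", "sufficient"],   # categorical, ordinal
})

# Nominal data: categories without an order.
df["university"] = df["university"].astype("category")

# Ordinal data: categories with an explicit order.
df["intake"] = pd.Categorical(
    df["intake"],
    categories=["less", "sufficient", "too much"],
    ordered=True,
)

print(df.dtypes)

# Ordered categories support comparisons.
below_max = df["intake"] < "too much"
print(below_max)
```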

Inspect for data types

One of the important tasks in data science is checking data types. Sometimes we need to type cast a column (change it to another data type) to be able to analyse our data with the statistical or graphical tool we would like to use. This can be due to a mistake: the data was stored in the wrong data type, for instance numbers stored as text.

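# Type cast the VALUE column to float. astype returns a new Series,
# so assign the result back to keep the change.
df["VALUE"] = df["VALUE"].astype(float)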
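A minimal sketch of the full inspect-then-cast workflow follows, assuming a hypothetical frame in which the VALUE column was read as text and one entry ("n/a") is not a number at all. `astype(float)` would raise an error on such an entry, so this sketch uses `pd.to_numeric` with `errors="coerce"` instead, which turns unparseable entries into NaN.

```python
import pandas as pd

# Hypothetical frame: VALUE was stored as text, and one entry is not a number.
df = pd.DataFrame({
    "CODE": ["731-0", "48065-7", "2276-4"],
    "VALUE": ["1.1", "0.4", "n/a"],
})

# Inspect the data types: VALUE shows up as a text dtype, not a number.
print(df.dtypes)

# errors="coerce" converts entries that cannot be parsed into NaN
# instead of raising an error.
df["VALUE"] = pd.to_numeric(df["VALUE"], errors="coerce")

print(df.dtypes)
print(df)
```

Coercing silently hides bad entries, so it is worth counting the resulting missing values, for example with `df["VALUE"].isna().sum()`.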

Type casting can also be needed because we want to use a specific tool that only accepts numerical data. We can then, for instance, change (yes/no) to (0/1). This is called encoding.
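A small sketch of such an encoding, using `Series.map` with a dictionary. The column name `smoker` and its values are hypothetical.

```python
import pandas as pd

# Hypothetical yes/no column.
df = pd.DataFrame({"smoker": ["yes", "no", "yes", "no"]})

# Map each category to a number; values missing from the dict become NaN.
df["smoker_encoded"] = df["smoker"].map({"yes": 1, "no": 0})

print(df)
```

Keeping the original column next to the encoded one makes it easy to check that the mapping did what was intended.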

Non two-dimensional data structures

Nowadays graphs (networks) are also used to represent data. This gitbook focuses on two-dimensional data structures; graph structures are out of scope.
