Structured data
A taxonomy of data, and why data types and structured data are important
Structured and unstructured data
Data comes from many sources: sensor measurements, laboratory equipment, dataloggers in factories, computer administration systems, cameras, apps, databases, websites, documents, spreadsheets, or even flat files. Most of this data is unstructured. Text, for example, is a sequence of words without a fixed structure; images, audio recordings, and videos are other examples of unstructured data. More structured data is data in tabular form with rows and columns, for instance tables in a spreadsheet or the results of a database query. We call such structured data tidy data.
A challenge in data science is to turn raw, unstructured data into structured data, since most data science tools and algorithms need a basic rectangular structure called a data frame. In a data frame, the rows contain the observations and the columns contain the features.
Key terms for structured data
data frame: rectangular, two-dimensional data (like a spreadsheet), with rows containing records and columns containing features and, optionally, labels
record: a row within the data frame, containing one observation. Synonyms: sample, event, instance, example, case
feature: a column within the data frame, containing one measured property of the observations. Synonyms: attribute, input, variable, x, independent variable
label: a column within the data frame containing outcomes to be modelled by a prediction model; not every data frame has one. Synonyms: dependent variable, response, target, outcome, y
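The key terms above can be made concrete with a small sketch. It assumes Python with the pandas library, which this page does not prescribe; the column names and values are made up purely for illustration.

```python
import pandas as pd

# Hypothetical example data: each row is one record (observation).
df = pd.DataFrame(
    {
        "temperature_c": [20.5, 22.1, 19.8],  # feature
        "humidity_pct": [45, 50, 61],         # feature
        "rain": ["no", "no", "yes"],          # label (outcome to predict)
    }
)

n_records, n_columns = df.shape                   # rows = records, columns = features + label
features = df[["temperature_c", "humidity_pct"]]  # often called X
label = df["rain"]                                # often called y
first_record = df.iloc[0]                         # one row = one observation

print(df)
print(n_records, "records,", n_columns, "columns")
```

Splitting the data frame into `features` and `label` mirrors the x/y synonyms listed in the key terms.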
Figure: a typical data frame.
The principles of a tidy data frame are as follows:
Each variable must have its own column
Each observation must have its own row
Each value must have its own cell
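As an illustration of these three principles, the sketch below reshapes a "wide" table, where the values of one variable (the year) are spread over the column headers, into a tidy one. It assumes Python with pandas; the table and its numbers are invented for the example.

```python
import pandas as pd

# Hypothetical "wide" table: the variable "year" is hidden in the column headers,
# so one row holds two observations. The student counts are made up.
wide = pd.DataFrame(
    {
        "university": ["RUG", "Hanze"],
        "2022": [10, 20],
        "2023": [12, 21],
    }
)

# melt() gives every variable its own column and every observation its own row.
tidy = wide.melt(id_vars="university", var_name="year", value_name="students")

print(tidy)
```

After the reshape, each row describes exactly one observation (one university in one year), and `university`, `year`, and `students` each have their own column.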
Data types
There are two basic types of structured data: numerical and categorical. Numerical data can be continuous or discrete. Categorical data takes its values from a set of categories (like names of universities: RUG, Hanze, Stenden). Categorical data can also be binary (yes/no) or ordered (less, sufficient, too much). The data type determines which type of statistical model, machine learning model, or visual display we should use in our data analysis. This is the reason why we classify data in software by type.
Key terms for data types
numeric
data that can be measured on a numeric scale
- continuous: data that can take any value in an interval (synonyms: interval, float, numeric)
- discrete: data that can only take whole numbers (synonyms: integer, counts)
categorical
data that can only take values from a specific set of categories (synonyms: factors, nominals)
- logical: categorical data with only two values (synonyms: binary, boolean)
- ordinal: categorical data that has an ordering
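The sketch below shows one possible way these data types map onto column types in software. It assumes Python with pandas, and the column names and values are hypothetical; other tools use different type names (for example, factors in R).

```python
import pandas as pd

# Hypothetical columns, one per data type described above.
df = pd.DataFrame(
    {
        "weight_kg": [70.5, 82.0, 64.3],                # numeric, continuous
        "n_visits": [1, 4, 2],                          # numeric, discrete
        "university": ["RUG", "Hanze", "Stenden"],      # categorical (nominal)
        "passed": [True, False, True],                  # categorical, logical
        "amount": ["less", "too much", "sufficient"],   # categorical, ordinal
    }
)

# Nominal data: categories without an order.
df["university"] = df["university"].astype("category")

# Ordinal data: categories with an explicit order, lowest first.
df["amount"] = pd.Categorical(
    df["amount"],
    categories=["less", "sufficient", "too much"],
    ordered=True,
)

print(df.dtypes)

# Because "amount" is ordered, comparisons between categories are meaningful.
more_than_less = df["amount"] > "less"
```

Declaring the order explicitly is what separates an ordinal column from a nominal one: comparisons such as `>` only make sense for the former.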
Inspect for data types
One of the important tasks in data science is checking data types. Sometimes we need to type cast a column (change it to another data type) to be able to analyse our data with the specific statistical or graphical tool we would like to use. This can be due to a mistake: the data was stored in the wrong data type, for instance numbers were stored as text.
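A minimal sketch of such an inspection and type cast is shown below, assuming Python with pandas. The sensor readings are made up, and the `"n/a"` entry stands in for whatever invalid values a real file might contain.

```python
import pandas as pd

# Hypothetical raw data: the readings were stored as text instead of numbers.
raw = pd.DataFrame(
    {
        "sensor_id": ["a", "b", "c"],
        "reading": ["3.2", "4.8", "n/a"],
    }
)

# Inspect the data types: "reading" shows up as text, not as a numeric type.
print(raw.dtypes)

# Type cast to numeric. errors="coerce" turns values that cannot be parsed
# (here "n/a") into NaN instead of raising an error.
raw["reading_num"] = pd.to_numeric(raw["reading"], errors="coerce")

print(raw.dtypes)
print(raw)
```

Whether unparseable values should become missing values, as here, or should stop the analysis with an error depends on the data; `errors="coerce"` is only one of the options.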
It can also be because we want to use a specific tool that can only take numerical data. We can then, for instance, change (yes/no) to (0/1). This is called encoding.
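Changing (yes/no) to (0/1) could look like the sketch below. It assumes Python with pandas, and the answers are invented for the example.

```python
import pandas as pd

# Hypothetical binary categorical data stored as text.
answers = pd.Series(["yes", "no", "yes", "no"], name="answer")

# Encode the two categories as numbers so numeric-only tools can use them.
encoded = answers.map({"yes": 1, "no": 0})

print(encoded.tolist())
```

Any value that is missing from the mapping (for instance `"Yes"` with a capital letter) would become a missing value, so it is worth checking the distinct values before encoding.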
Non two-dimensional data structures
Nowadays graphs (networks) are also used to represent data. This gitbook focuses on two-dimensional data structures; graph structures are out of scope.