Database concepts
Last updated
Was this helpful?
Last updated
Was this helpful?
A database is a file that is organized for storing data. Most databases are organized like a dictionary in the sense that they map from keys to values. The biggest difference is that a database is on permanent storage, so it persists after the program ends. Because a database is stored on permanent storage, it can store far more data than a dictionary, which is limited to the size of the memory in the computer. The advantage of a database is, compared to lists of flat files, that like a dictionary, database software is designed to keep the inserting and accessing of data very fast, even for large amounts of data. Database software maintains its performance by building indexes as data is added to the database to allow the computer to jump quickly to a particular entry.
A database is a structured data storage
There are many different database systems that are used for a wide variety of purposes including Oracle, MySQL, Microsoft SQL Server, and SQLite. SQLite is already built into Python. SQLite is designed to be embedded into other applications to provide database support within the application. SQLite is easy to combine with pandas.
SQL is the abbreviation for Structured Query Language. SQL is the language we use to issue commands to the database to Create data (a.k.a Insert), retrieve data, update data, and delete data. Normally the end-user requests are translated to SQL commands in an application parsing the requests to the database. The developer creates the application with SQL commands to insert, retrieve, update, or delete data, whereas the database administrator is responsible for the design, implementation, maintenance, and repair of a database.
Talking to a database is like making a phone call. Before we can talk to the database we need to make a connection first. We can connect to a database with the Python Database API. Access to the database is made available through connection objects. The constructor is:
The parameters are database system dependable. In the example below, we connect to a sqlite database. Sqlite is a free downloadable database system.
Sometimes we need a user name and password since a relational database can have access control. There can be multiple connections to the same database at the same time. The connection object has several methods: .close()
, .commit
, .rollback
, .cursor()
where the cursor()
method returns a new cursor object.
A cursor is a file handler that we can use to perform operations on the data stored in the database. This object is mostly used to give 'instructions' to the database. Simple instructions we can give with the conn.cursor().execute
, multiline scripts we can execute with conn.cursor().executescript
.The most commonly used methods of the cursor object are execute(), fetchone()
and fetchall()
When you first look at a database it looks like a bunch of tables. The primary data structures in a database are tables, rows, and columns. In technical descriptions of relational databases, the concept of the table is formally referred to as relation. Columns are formally referred to as attributes, and rows are formally referred to as records or tuples.
Relational databases capture related datasets. In a database, you have multiple relations (tables) that are related to each other by specific attributes (keys).
There is tons of database theory - way too much for the scope of this course. The main principles are as follow:
Do not replicate data but reference data by pointing at data
Use integers for keys and for other references, integers process faster.
Add a special “key” column to each table (relation) to which you make references. By convention, many programmers call this column “id”. Basically, this is similar to the index column of a pandas dataframe
These principles are called Normalization
There are generally three kinds of keys used in a database model.
A logical key is a key that the “real world” might use to look up a row. You will often find that it makes sense to add a UNIQUE constraint to a logical key. Since the logical key is how we look up a row from the outside world, it makes little sense to allow multiple rows with the same value in the table.
A primary key is usually a number that is assigned automatically by the database. It generally has no meaning outside the program and is only used to link rows from different tables together. When we want to look up a row in a table, usually searching for the row using the primary key is the fastest way to find the row. Since primary keys are integer numbers, they take up very little storage and can be compared or sorted very quickly.
A foreign key is usually a number that points to the primary key of an associated row in a different table. We are using a naming convention of always calling the primary key field name id and appending the suffix _id to any field name that is a foreign key
A foreign key is when a table has a column that contains a key which points to the primary key of another table. When all primary keys are integers, the foreign keys are integers as well!
SQLite is a very popular database - it is free and fast and small. SQLite is embedded in Python and a number of other languages. In our case, we use python for the application and an SQLite server as a database data server. A common database tool for SQLite is DB Browser ​