First steps
What is a database?
A database is an organized collection of data, generally stored and accessed electronically from a computer system. We can perform queries on them and change their content. For example the following are databases:
- Usernames and passwords of web pages
- Products and storage status of a web store
- Banks' information about their customers and their accounts
- Daily weather measurements in different locations
- Flight schedules and booking status for airlines
There are a vast amount of databases, and most people are in contact with multiple databases during an ordinary day.
Challenges for databases
There are several technical difficulties conserning databases, and it is not easy to build a well functioning database. The most common challenges are:
Amount of information
Many databases hold large quantities of information, into which multiple queries and changes are made constantly. How to make the database in such a way, that the information can be accessed efficiently?
Concurrency
Usually a database has multiple users, who can change and get information at the same time. Why do we have to take this into account, when designing our database?
Surprises
The database should stay coherent in surprising situations, as well. For example, what happens if the power goes out when the data is being changed?
Developing databases
The development of databases took flight in 1970's and there were multiple different solutions, but one became the most popular: relational model and the Structural Query Language, SQL. The relational databases have been popular since then, and most of the databases are still based on relational database models.
In this course we will concentrate on relational database from both database users' and the technical point of view. We aim to answer the question, why relational model is such a good way to create a database?
Even though relational databases are still dominant, recently some powerful challengers have risen. One reason has been the need for different kinds of databases, which are more suitable for distributed systems, such as enormous websites.
A term often used, NoSQL, refers to a database based on something else than the relational model and SQL language. Especially document-oriented databases have been popular lately. Even though we concentrate on the relational model, it is good to keep in mind that alternatives exist.
Simple database
Before we start to look into existing database systems, let's try and make a database and its handling by ourselves in some simple way and see, what kinds of problems we run into.
Let's assume we want to create a database for a bank. The purpose of the database is to store the customer information and the transactions on their accounts.
Database structure
We will store the database into a text file, bank.txt
, whose lines reflect the transactions in the database. There are two different transactions: You can create a new account into the bank (the balance) or the balance of an account changes. The content of the file could be something like this:
CREATE ACCOUNTNUMBER: 131778223OWNER: UoleviCREATE ACCOUNTNUMBER: 175299717OWNER: MaijaMAKE TRANSACTIONACCOUNT: 131778223SUM: 500MAKE TRANSACTIONACCOUNT: 131778223SUM: -100MAKE TRANSACTIONACCOUNT: 175299717SUM: 100
In this case we have created accounts in the bank for Maija and Uolevi. After this 500 euros are transferred into Uolevi's account. Then Uolevi transfers 100 euros into Maija's account (so Uolevi's balance is decreased by 100 and Maija's saldo is increased by 100). In the end, Uolevi's balance is 400 and Maija's balance is 100.
The idea is, that with each transaction we add three lines into to the end of the file, which represent the transaction. When we read through the lines of the file, we know all the accounts in the bank and we can calculate the current balance for them all.
I did it myself and saved?
This kind of a database works fine, if we can assume the bank having only a few customers and transaction, and only one customer uses the bank at a time and the computer is always reliable. Unfortunately, this is usually not the case.
Amount of information
What happens, if the bank has a million customers, and the accounts of said customers have on average five transactions per day? This way we have to add 15 million lines to the file per day, or more than five billion lines in a year.
There is a critical design flaw conserning efficiency in our database: when we want to know the balance of a customer, we have to go through all the lines in the file. When the database grows, this becomes extremely slow.
In practice, problems with efficiency are shown to customers as the software using the database working slowly. For example in this case, Maija would probably wais for a minute, before they could see their balance from the bank.
Concurrency
Many systems work fine as long as they have only one user, but concurrent users can cause problems. What happens for example, if two users try to create accounts at the same time?
Let's assume we want to create accounts for Liisa and Aapeli, and write the following lines into the file:
CREATE ACCOUNTNUMBER: 185421761OWNER: LiisaCREATE ACCOUNTNUMBER: 111562714OWNER: Aapeli
Alas, as two users are writing into the database at the same time, the lines could mix up and we could end up with the following file:
CREATE ACCOUNTNUMBER: 185421761CREATE ACCOUNTNUMBER: 111562714OWNER: AapeliOWNER: Liisa
Now Liisa has put in the first two lines, then Aapeli all three of theirs and finally one more line from Liisa. With this it is impossible to know, which account is whose, and the data in our database is corrupted.
Suprises
One more problem is that computers can go off at any time (for example a power outage, someone plugs it off the socket, etc). What happens, if at exactly that point we are making changes to our database?
Let's have the following scenario, where Maija transfers 50 euros to Uolevi:
MAKE TRANSACTIONACCOUNT: 175299717SUM: -50MAKE TRANSACTIONACCOUNT: 131778223SUM: 50
However the computer dies after three lines, where only the following lines are saved:
MAKE TRANSACTIONACCOUNT: 175299717SUM: -50
Now Maija has lost 50 euros and Uolevi did not get anything, or 50 euros just vanished and this cannot even be seen from the file, as the power comes back.
How do we fix this?
Handling a database is a tricky task, and on this course we will not try to achieve everything ourselves, but rely on existing database systems.
Relational models
Relational models replaced their competitors in the 1970's and has ruled the database world ever since. The model is based on two basic concepts: all the data is stored in the database as rows, which can reference to one another, and the database user handles information with SQL language, which hides the internal details of the database from the user.
Database structure
Relational database consists of tables which have fixed columns. The data is stored in the tables as rows, which have certain values in the columns. Each table contains information related to a certain collection.
In the following tables we have an example of a database, which could be used as a part of a web store. In the tables Products
, Customers
and Purchases
we have information about products, customers and their shopping carts.
Productsid name price-- ------ -----1 radish 72 carrot 53 turnip 44 parsley 85 celery 4Customersid name-- ------1 Uolevi2 Maija3 AapeliPurchasescustomer_id product_id----------- ----------1 21 52 12 42 5
In the tables Products
and Customers
each row has a unique id number, with which they can be referenced at. Because of this, the table Purchases
we can represent in number form, which items which customer has chosen. In this example, Uolevi has carrots and celery in their basket, and Maija har redish, turnip and celery in theirs.
SQL Language
The Structured Query Language or SQL is an established way to handle data in relational databases. The language as commands, with which each database user (for example the programmer or software using the database) can add, search, change and remove information.
SQL commands are formed with keywords such as SELECT
and WHERE
, names of tables and columns and other values. For example, the following command
SELECT price FROM Products WHERE name='radish';
Fetches the price for radish from the database described above. The command end with a semicolon ;
and we can use whitespace and linebreaks the way we want. For example, we could write the command above like this as well:
SELECT priceFROM ProductsWHERE name='radish';
As SQL is from the ages of old, it has plenty of resemblance to the programming of that time. For example
- The keywords are complete English words, and the commands resemble English sentences
- The keywords are not casesensitive. For example,
SELECT
,select
andSeLeCT
are all the same. The keywords are usually written with all capital letters. - The equality sign
=
means both equality and setting a value (and==
is not used at all).
SQL language has a standard, which aims to give common base for the langauge. In practice every implementation does something their own way. We try ton concentrate on the properties which are common for all SQL databases.
Internal structure of a database
Between the user and the database there is a relational database management system, whose function is to handle the SQL commands from the user. For example when the user gives a command which retrieves information from the database, the database system has to find a good way to handle the command and deliver the results back to the user as fast as possible.
The nice idea behind SQL is that the user only has to describe what information they want, and the database system does all the work and fetches the information from the database. This is convenient for the user, because they don't have to know anything about the internal functionality but can rely on the database system to function.
Designing a database system is a complex task, and only those really interested are aware of the internal details of database systems. Most of this is outside the scope of this course, but we'll have a glimpse of that in parts 6 and 7.