Breastcancer NKI data

Create views with SQL and with pandas (performance study)

Performance study case

The SQLlite database breastCancerNKI has three tables

erStatus
probes
expressionData

The goal is to compose one dataframe with expression data and ER status for the PTEN gene his probe/gene mapping and sample ER status information is needed for further vizualisation In this study case we create the dataframe using SQL and using pandas. The SQL appears te be around 240 times faster!

Source: https://bioinformatics-core-shared-training.github.io/shiny-bioinformatics/database-back-end

You can create this database yourself by running an R script https://github.com/fenna/BFVM22PROG1/blob/main/scripts/breastCancerNKI.R

In order to create the dataframe we need to load the tables into pandas dataframes. After that we will merge them with pandas. But we also will merge them directly on the SQL server. These two different operations will result in the same but the performance is different.

First connect to the database and query the data into a dataframes.

import sqlite3
import pandas as pd

conn = sqlite3.connect('data/breastCancerNKI.sqlite')
status = pd.read_sql_query('SELECT * FROM erStatus', conn)
probes = pd.read_sql_query('SELECT * FROM probes', conn)
expression = pd.read_sql_query('SELECT * FROM expressionData', conn)
expression.head()

probe

sample

expression

Contig45645_RC

NKI_4

-0.215

Contig44916_RC

NKI_4

-0.207

D25272

NKI_4

-0.158

J00129

NKI_4

-0.819

Contig29982_RC

NKI_4

-0.267

status.head()

sample

NKI_4

NKI_6

NKI_7

NKI_8

NKI_9

probes.head()

probe

gene

Contig45645_RC

GREM2

Contig44916_RC

SUHW2

D25272

None

J00129

FGB

Contig29982_RC

SCARA5

Next we use SQL statements to merge the data. We use select from to retrieve and where to filter the proper gene and join to merge the data. The execution time is 9 milliseconds

Use SQL to create the required dataframe

import time

start_time = int(round(time.time() * 1000))

S = """select E.probe, E.expression, E.sample, P.gene, S.er
    from expressionData E join probes P on E.probe = P.probe 
    join erStatus as S on E.sample = S.sample WHERE P.gene = 'PTEN'"""
view = pd.read_sql_query(S, conn )

end_time = int(round(time.time() * 1000))
print (f"Execution time {end_time - start_time} millis")
view

Execution time 9 millis

probe

expression

sample

gene

AF019083

-0.062

NKI_4

PTEN

AF019083

-0.072

NKI_6

PTEN

AF019083

0.064

NKI_7

PTEN

AF019083

0.132

NKI_8

PTEN

AF019083

0.018

NKI_9

PTEN

...

663

NM_000314

-0.284

NKI_398

PTEN

664

NM_000314

-0.363

NKI_401

PTEN

665

NM_000314

-0.058

NKI_402

PTEN

666

NM_000314

-0.059

NKI_403

PTEN

667

NM_000314

-0.059

NKI_404

PTEN

668 rows × 5 columns

conn.close()

Using pandas to achieve the same result can be done by the usage of the dataframes loaded from the SQL tables. The execution time now is 1941 seconds.

Use pandas to create the required dataframe¶

start_time = int(round(time.time() * 1000))

view = expression.merge(probes[probes.gene=='PTEN'], on='probe').merge(status, on='sample')

end_time = int(round(time.time() * 1000))
print (f"Execution time {end_time - start_time} millis")

view

Execution time 1941 millis

probe

sample

expression

gene

AF019083

NKI_4

-0.062

PTEN

NM_000314

NKI_4

0.005

PTEN

AF019083

NKI_6

-0.072

PTEN

NM_000314

NKI_6

0.023

PTEN

AF019083

NKI_7

0.064

PTEN

...

663

NM_000314

NKI_402

-0.058

PTEN

664

AF019083

NKI_403

0.090

PTEN

665

NM_000314

NKI_403

-0.059

PTEN

666

AF019083

NKI_404

0.097

PTEN

667

NM_000314

NKI_404

-0.059

PTEN

Conclusion: Operations on the SQL server with SQL statement is much more faster!

668 rows × 5 columns

PreviousUSDA food database (JSON)NextPlotting time data with panel dashboard and bokeh

Last updated 2 years ago