Breastcancer NKI data

Create views with SQL and with pandas (performance study)

Performance study case

The SQLlite database breastCancerNKI has three tables

  • erStatus

  • probes

  • expressionData

The goal is to compose one dataframe with expression data and ER status for the PTEN gene his probe/gene mapping and sample ER status information is needed for further vizualisation In this study case we create the dataframe using SQL and using pandas. The SQL appears te be around 240 times faster!

Source: https://bioinformatics-core-shared-training.github.io/shiny-bioinformatics/database-back-end

You can create this database yourself by running an R script https://github.com/fenna/BFVM22PROG1/blob/main/scripts/breastCancerNKI.R

In order to create the dataframe we need to load the tables into pandas dataframes. After that we will merge them with pandas. But we also will merge them directly on the SQL server. These two different operations will result in the same but the performance is different.

First connect to the database and query the data into a dataframes.

import sqlite3
import pandas as pd

conn = sqlite3.connect('data/breastCancerNKI.sqlite')
status = pd.read_sql_query('SELECT * FROM erStatus', conn)
probes = pd.read_sql_query('SELECT * FROM probes', conn)
expression = pd.read_sql_query('SELECT * FROM expressionData', conn)
expression.head()

probe

sample

expression

0

Contig45645_RC

NKI_4

-0.215

1

Contig44916_RC

NKI_4

-0.207

2

D25272

NKI_4

-0.158

3

J00129

NKI_4

-0.819

4

Contig29982_RC

NKI_4

-0.267

status.head()

sample

er

0

NKI_4

1

1

NKI_6

1

2

NKI_7

0

3

NKI_8

0

4

NKI_9

1

probes.head()

probe

gene

0

Contig45645_RC

GREM2

1

Contig44916_RC

SUHW2

2

D25272

None

3

J00129

FGB

4

Contig29982_RC

SCARA5

Next we use SQL statements to merge the data. We use select from to retrieve and where to filter the proper gene and join to merge the data. The execution time is 9 milliseconds

Use SQL to create the required dataframe

import time
start_time = int(round(time.time() * 1000))

S = """select E.probe, E.expression, E.sample, P.gene, S.er
    from expressionData E join probes P on E.probe = P.probe 
    join erStatus as S on E.sample = S.sample WHERE P.gene = 'PTEN'"""
view = pd.read_sql_query(S, conn )

end_time = int(round(time.time() * 1000))
print (f"Execution time {end_time - start_time} millis")
view
Execution time 9 millis

probe

expression

sample

gene

er

0

AF019083

-0.062

NKI_4

PTEN

1

1

AF019083

-0.072

NKI_6

PTEN

1

2

AF019083

0.064

NKI_7

PTEN

0

3

AF019083

0.132

NKI_8

PTEN

0

4

AF019083

0.018

NKI_9

PTEN

1

...

...

...

...

...

...

663

NM_000314

-0.284

NKI_398

PTEN

0

664

NM_000314

-0.363

NKI_401

PTEN

1

665

NM_000314

-0.058

NKI_402

PTEN

0

666

NM_000314

-0.059

NKI_403

PTEN

1

667

NM_000314

-0.059

NKI_404

PTEN

1

668 rows × 5 columns

conn.close()

Using pandas to achieve the same result can be done by the usage of the dataframes loaded from the SQL tables. The execution time now is 1941 seconds.

Use pandas to create the required dataframe¶

start_time = int(round(time.time() * 1000))

view = expression.merge(probes[probes.gene=='PTEN'], on='probe').merge(status, on='sample')

end_time = int(round(time.time() * 1000))
print (f"Execution time {end_time - start_time} millis")

view
Execution time 1941 millis

probe

sample

expression

gene

er

0

AF019083

NKI_4

-0.062

PTEN

1

1

NM_000314

NKI_4

0.005

PTEN

1

2

AF019083

NKI_6

-0.072

PTEN

1

3

NM_000314

NKI_6

0.023

PTEN

1

4

AF019083

NKI_7

0.064

PTEN

0

...

...

...

...

...

...

663

NM_000314

NKI_402

-0.058

PTEN

0

664

AF019083

NKI_403

0.090

PTEN

1

665

NM_000314

NKI_403

-0.059

PTEN

1

666

AF019083

NKI_404

0.097

PTEN

1

667

NM_000314

NKI_404

-0.059

PTEN

1

Conclusion: Operations on the SQL server with SQL statement is much more faster!

668 rows × 5 columns

Last updated