Breastcancer NKI data
Create views with SQL and with pandas (performance study)
Last updated
Was this helpful?
Create views with SQL and with pandas (performance study)
Last updated
Was this helpful?
The SQLlite database breastCancerNKI has three tables
erStatus
probes
expressionData
The goal is to compose one dataframe with expression data and ER status for the PTEN gene his probe/gene mapping and sample ER status information is needed for further vizualisation In this study case we create the dataframe using SQL and using pandas. The SQL appears te be around 240 times faster!
Source:
In order to create the dataframe we need to load the tables into pandas dataframes. After that we will merge them with pandas. But we also will merge them directly on the SQL server. These two different operations will result in the same but the performance is different.
First connect to the database and query the data into a dataframes.
probe
sample
expression
0
Contig45645_RC
NKI_4
-0.215
1
Contig44916_RC
NKI_4
-0.207
2
D25272
NKI_4
-0.158
3
J00129
NKI_4
-0.819
4
Contig29982_RC
NKI_4
-0.267
sample
er
0
NKI_4
1
1
NKI_6
1
2
NKI_7
0
3
NKI_8
0
4
NKI_9
1
probe
gene
0
Contig45645_RC
GREM2
1
Contig44916_RC
SUHW2
2
D25272
None
3
J00129
FGB
4
Contig29982_RC
SCARA5
Next we use SQL statements to merge the data. We use select from
to retrieve and where
to filter the proper gene and join
to merge the data. The execution time is 9 milliseconds
probe
expression
sample
gene
er
0
AF019083
-0.062
NKI_4
PTEN
1
1
AF019083
-0.072
NKI_6
PTEN
1
2
AF019083
0.064
NKI_7
PTEN
0
3
AF019083
0.132
NKI_8
PTEN
0
4
AF019083
0.018
NKI_9
PTEN
1
...
...
...
...
...
...
663
NM_000314
-0.284
NKI_398
PTEN
0
664
NM_000314
-0.363
NKI_401
PTEN
1
665
NM_000314
-0.058
NKI_402
PTEN
0
666
NM_000314
-0.059
NKI_403
PTEN
1
667
NM_000314
-0.059
NKI_404
PTEN
1
668 rows × 5 columns
Using pandas to achieve the same result can be done by the usage of the dataframes loaded from the SQL tables. The execution time now is 1941 seconds.
probe
sample
expression
gene
er
0
AF019083
NKI_4
-0.062
PTEN
1
1
NM_000314
NKI_4
0.005
PTEN
1
2
AF019083
NKI_6
-0.072
PTEN
1
3
NM_000314
NKI_6
0.023
PTEN
1
4
AF019083
NKI_7
0.064
PTEN
0
...
...
...
...
...
...
663
NM_000314
NKI_402
-0.058
PTEN
0
664
AF019083
NKI_403
0.090
PTEN
1
665
NM_000314
NKI_403
-0.059
PTEN
1
666
AF019083
NKI_404
0.097
PTEN
1
667
NM_000314
NKI_404
-0.059
PTEN
1
Conclusion: Operations on the SQL server with SQL statement is much more faster!
668 rows × 5 columns