Why we love Numpy

Numerical Python

NumPy (http://numpy.org) is a module for the Python programming language, adding support for large, multi-dimensional arrays and matrices, along with a large collection of high-level mathematical functions to operate on these arrays. The name is an acronym for "Numeric Python" or "Numerical Python". It is an extension module for Python, mostly written in C. This makes sure that the precompiled mathematical and numerical functions and functionalities of Numpy guarantee great execution speed.

NumPy enriches the programming language Python with powerful data structures, implementing multi-dimensional arrays and matrices. These data structures guarantee efficient calculations with matrices and arrays. The implementation is even aiming at huge matrices and arrays, better known under the heading of "big data". Besides that, the module supplies a large library of high-level mathematical functions to operate on these matrices and arrays.

import numpy as np

Study case the linear regression model error

In the example below the usage of NumPy and the matrix calculations are demonstrated. In the example the cost $J(\theta)$ (the error) of a linear regression model is computed using the equation:

$J(\theta) = \frac{1}{2m} \sum_{i=1}^{m} ( h_\theta(x^{(i)}) - y^{(i)} ) ^2$

where $h_\theta(x) = \theta_0 + \theta_1x_1 + \theta_2x_2 + \theta_3x_3 + ..\theta_nx_n$

In which $J(\theta)$ is the total cost calculated by the current weight values of $\theta$ ; $h_\theta(x)$ is the hypothesised value, the prediction, and $y$ is the actual value. $h_\theta(x)$ is calculated for each observation $h_\theta(x^{(i)})$ and compared to the actual value $y^{(i)}$ . By adding up and eventually averaging the difference between these two values (hypothesis - actual) for each data observation, we arrive at the predictive value that the formula has with the current weight values of $\theta$

To compute this we can use a naive loop or we can use the matrix computation functions included in Numpy, the so-called vectorized implementation. To demonstrate the difference in performance solutions for both methods are provided. We time the execution time

import time

Use NumPy to generate $m \times n$ matrix and $m \times 1$ vector and $1 \times n$ vector

First a dataset is generated. The dataset contains a number of features (columns in the dataset except for the last one $n-1$ ). The final column contains a class variable. The dataset has a number of observations (the rows $m$ ). For this the numpy function np.random.rand(m, n) is used. Next a vector containing the weights is generated (the $\theta$ vector). The last column, containing the class variable is sliced to a vector $y$ and the features columns are put into a matrix $X$ . For the computational purpose, a column of 1's is added to the feature matrix $X$

$X = \begin{bmatrix} 1 & x_1^{(1)} & x_2^{(1)} & .. & x_n^{(1)}\\ 1 & x_1^{(2)} & x_2^{(2)} & .. & x_n^{(2)}\\ 1 & x_1^{(3)} & x_2^{(3)} & .. & x_n^{(3)} \\ 1 & .. & .. & .. & .. \\ 1 & x_1^{(m)} & x_2^{(m)} & .. & x_n^{(m)} \ \end{bmatrix}$ $y = \begin{bmatrix} \ y^{(1)} \\ y^{(2)} \\ y^{(3)} \\ .. \\ y^{(m)} \\ \end{bmatrix}$ $\theta = \begin{bmatrix} \theta_0 & \theta_1 & .. & \theta_n \end{bmatrix}$

num_features = 150
num_observaties = 50000
data = np.random.rand(num_observaties, num_features) #generate dataset
theta = np.random.rand(1,num_features) #generate vector containing weights

m,n = data.shape
X = data[:, :n-1]    #all the columns except the last one contain the features
y = data[:, [n-1]]   #last column is the class variable
X = np.c_[np.ones(m), X] #add a first column with ones for the theta0 compution

print("y", y.shape, "vector")
print("X", X.shape, "matrix")
print("𝜃", theta.shape, "vector")
print (f"There are {num_features} features,  and {num_observaties} observations")

y (50000, 1) vector
X (50000, 150) matrix
𝜃 (1, 150) vector
There are 150 features,  and 50000 observations

Naive loop implementation

The naive loop implementation of calculating the error $J$ computes for each row $i$ the prediction $h$ which is subtracted with the actual value $h - y^{(i)}$ to get the difference between the actual value and the model value. The prediction is calculated using a for loop to compute the weight times the feature value for each feature according the equation $h = \theta_0 \times 1 + \theta_1 \times x_1 + \theta_2 \times x_2 + ...\theta_n \times x_n$ The difference between the actual value and the model value is squared and averaged to estimate the average error of the model

#naive implementation
print ("Naive implementation")
start_time = int(round(time.time() * 1000))
J_val1 = 0
theta_nav = theta[0] # get rid of the [[]] -> []
for i in range(m):
    xi = X[i]
    prediction = 0
    for j in range(len(theta_nav)):
        prediction += theta_nav[j]*xi[j] # predict value based on weight theta and feature xi
    delta = (prediction - y[i]) ** 2     # square difference of hypothesed value and actual value
    J_val1 += delta                      # sum of squares

J_val_nav = J_val1/ (2 * m)              # take average of sum of squares
end_time = int(round(time.time() * 1000))
print (f"Error: {J_val_nav}")
print (f"Execution time {end_time - start_time} millis")

Naive implementation
Error: [636.06025802]
Execution time 5116 millis

Vectorized implemention

For the hypothesis we can use a vectorized implementation: $h_\theta(x) = \theta^T.X$

#vectorial implementatie
print ("Vectorial implementation")

start_time = int(round(time.time() * 1000))
h = np.dot(X, theta.T)            #matrix calculation features times weights theta resulting in prediction vector
errors = (h - y) ** 2             #vector substraction predictions minus actual values
J_val_vec = np.mean(errors)/2     #vector average

end_time = int(round(time.time() * 1000))
print (f"Error: {J_val_vec}")
print (f"Execution time {end_time - start_time} millis")

Vectorial implementation
Error: 636.060258018252
Execution time 4 millis

Conclusion

With Numpy we can easily generate and manipulate vectors and matrices.
We can transpose vectors and matrices using .T
We can apply vectorized computations using power, division, subtractions, multiplications with np.dot and get mean with np.mean
Vectorized implementation is incredibly faster than an ordinary loop
You should use Numpy arrays, or a library that builds upon Numpy like pandas, for data processing
You are stupid if you use a for loop for dataprocessing

Learn more about Numpy: https://nbviewer.jupyter.org/github/ageron/handson-ml/blob/master/tools_numpy.ipynb

PreviousStatistics in Python NextHeart failure Case study

Last updated 2 years ago

Study case the linear regression model error

Use NumPy to generate m×nm \times nm×n matrix and m×1m \times 1m×1 vector and 1×n1 \times n1×n vector

Naive loop implementation

Vectorized implemention

Conclusion

Next

Use NumPy to generate $m \times n$ matrix and $m \times 1$ vector and $1 \times n$ vector