Why we love NumPy
Numerical Python
NumPy (http://numpy.org) is a module for the Python programming language that adds support for large, multi-dimensional arrays and matrices, along with a large collection of high-level mathematical functions to operate on them. The name is an acronym for "Numeric Python" or "Numerical Python". NumPy is an extension module for Python, mostly written in C, so its precompiled mathematical and numerical routines execute at great speed.
NumPy enriches Python with powerful data structures that implement multi-dimensional arrays and matrices. These data structures allow efficient calculations, even for the very large matrices and arrays associated with "big data". Besides that, the module supplies a large library of high-level mathematical functions that operate on these arrays.
import numpy as np
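As a minimal illustration of these data structures, a NumPy array supports elementwise arithmetic directly (the numbers below are made up for demonstration):

```python
import numpy as np

# Elementwise operations apply to the whole array at once,
# without an explicit Python loop.
a = np.array([1.0, 2.0, 3.0])
b = np.array([10.0, 20.0, 30.0])

print(a + b)       # elementwise sum -> [11. 22. 33.]
print(a * b)       # elementwise product -> [10. 40. 90.]
print(np.sqrt(b))  # universal function, applied per element
```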
Case study: the error of a linear regression model
In the example below the use of NumPy and its matrix calculations is demonstrated. In the example the cost (the error) of a linear regression model is computed using the equation:

J(θ) = (1 / 2m) · Σᵢ (h_θ(x^(i)) − y^(i))², for i = 1..m

where J(θ) is the total cost for the current weight values θ; h_θ(x^(i)) is the hypothesized value, the prediction; and y^(i) is the actual value. h_θ(x^(i)) is calculated for each observation and compared to the actual value y^(i). By summing and then averaging the squared difference between these two values (hypothesis minus actual) over all observations, we arrive at the error the model makes with the current weight values θ.
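As a quick sanity check of the cost formula, it can be evaluated by hand on a toy dataset (the numbers below are made up purely for illustration):

```python
import numpy as np

# Toy example: m = 2 observations, predictions h and actual values y.
h = np.array([3.0, 5.0])  # hypothesized values h_theta(x^(i))
y = np.array([2.0, 7.0])  # actual values y^(i)
m = len(y)

# J = 1/(2m) * sum((h - y)^2) = (1 + 4) / 4 = 1.25
J = np.sum((h - y) ** 2) / (2 * m)
print(J)  # 1.25
```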
To compute this we can use a naive loop, or we can use the matrix-computation functions included in NumPy, the so-called vectorized implementation. To demonstrate the difference in performance, solutions for both methods are provided and their execution times are measured.
import time
Use NumPy to generate matrix X and vectors y and θ
First a dataset is generated. The dataset contains a number of features (all columns except the last one). The final column contains a class variable. The dataset has a number of observations (the rows). For this the NumPy function np.random.rand(m, n)
is used. Next a vector containing the weights (the θ vector) is generated. The last column, containing the class variable, is sliced into the vector y, and the feature columns are put into a matrix X. For computational purposes, a column of ones is added to the feature matrix (for the θ₀ term).
num_features = 150
num_observaties = 50000
data = np.random.rand(num_observaties, num_features) #generate dataset
theta = np.random.rand(1,num_features) #generate vector containing weights
m,n = data.shape
X = data[:, :n-1] #all the columns except the last one contain the features
y = data[:, [n-1]] #last column is the class variable
X = np.c_[np.ones(m), X] #add a first column with ones for the theta0 computation
print("y", y.shape, "vector")
print("X", X.shape, "matrix")
print("𝜃", theta.shape, "vector")
print (f"There are {num_features} features, and {num_observaties} observations")
y (50000, 1) vector
X (50000, 150) matrix
𝜃 (1, 150) vector
There are 150 features, and 50000 observations
Naive loop implementation
The naive loop implementation computes the prediction for each row, which is then compared to the actual value to get the difference between the model value and the actual value. The prediction is calculated with a for loop that multiplies each weight by the corresponding feature value and sums the products, according to the equation h_θ(x) = Σⱼ θⱼ · xⱼ. The difference between the prediction and the actual value is squared and averaged to estimate the average error of the model.
#naive implementation
print("Naive implementation")
start_time = int(round(time.time() * 1000))
J_val1 = 0
theta_nav = theta[0] # flatten [[...]] -> [...]
for i in range(m):
    xi = X[i]
    prediction = 0
    for j in range(len(theta_nav)):
        prediction += theta_nav[j] * xi[j] # predict value based on weight theta and feature xi
    delta = (prediction - y[i]) ** 2 # squared difference of hypothesized value and actual value
    J_val1 += delta # sum of squares
J_val_nav = J_val1 / (2 * m) # average of the sum of squares
end_time = int(round(time.time() * 1000))
print(f"Error: {J_val_nav}")
print(f"Execution time {end_time - start_time} millis")
Naive implementation
Error: [636.06025802]
Execution time 5116 millis
Vectorized implementation
For the hypothesis we can use a vectorized implementation, computing all predictions at once as h = Xθᵀ:
#vectorized implementation
print("Vectorized implementation")
start_time = int(round(time.time() * 1000))
h = np.dot(X, theta.T) #matrix product of features and weights theta, resulting in the prediction vector
errors = (h - y) ** 2 #vector subtraction of predictions minus actual values, squared
J_val_vec = np.mean(errors) / 2 #vector average
end_time = int(round(time.time() * 1000))
print(f"Error: {J_val_vec}")
print(f"Execution time {end_time - start_time} millis")
Vectorized implementation
Error: 636.060258018252
Execution time 4 millis
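The same vectorized cost can also be written with Python's `@` matrix-multiplication operator, and timed more precisely with `time.perf_counter` instead of millisecond-rounded `time.time`. This is a sketch under the same setup as the example above (freshly generated random data, so the error value will differ):

```python
import time
import numpy as np

m, n = 50000, 150
X = np.c_[np.ones(m), np.random.rand(m, n - 1)]  # features plus a leading column of ones
y = np.random.rand(m, 1)                         # class variable
theta = np.random.rand(1, n)                     # weights

start = time.perf_counter()
h = X @ theta.T                   # equivalent to np.dot(X, theta.T)
J = np.mean((h - y) ** 2) / 2     # cost J(theta)
elapsed_ms = (time.perf_counter() - start) * 1000
print(f"Error: {J}, computed in {elapsed_ms:.2f} ms")
```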
Conclusion
With NumPy we can easily generate and manipulate vectors and matrices.
We can transpose vectors and matrices using .T
We can apply vectorized computations such as powers, division, subtraction, and multiplication, use np.dot for matrix products, and compute the mean with np.mean
A vectorized implementation is dramatically faster than an ordinary loop
You should use NumPy arrays, or a library that builds upon NumPy such as pandas, for data processing
Avoid plain Python for loops for data processing
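The operations listed above can be demonstrated in a few lines (a small sketch with made-up numbers):

```python
import numpy as np

A = np.array([[1.0, 2.0],
              [3.0, 4.0]])
v = np.array([[1.0],
              [2.0]])

print(A.T)           # transpose of the matrix
print(A ** 2)        # elementwise power
print(np.dot(A, v))  # matrix-vector product -> [[5.] [11.]]
print(np.mean(A))    # mean of all elements -> 2.5
```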
Next
Learn more about Numpy: https://nbviewer.jupyter.org/github/ageron/handson-ml/blob/master/tools_numpy.ipynb