Why we love Numpy
Numerical Python
Last updated
Numerical Python
Last updated
NumPy () is a module for the Python programming language, adding support for large, multi-dimensional arrays and matrices, along with a large collection of high-level mathematical functions to operate on these arrays. The name is an acronym for "Numeric Python" or "Numerical Python". It is an extension module for Python, mostly written in C. This makes sure that the precompiled mathematical and numerical functions and functionalities of Numpy guarantee great execution speed.
NumPy enriches the programming language Python with powerful data structures, implementing multi-dimensional arrays and matrices. These data structures guarantee efficient calculations with matrices and arrays. The implementation is even aiming at huge matrices and arrays, better known under the heading of "big data". Besides that, the module supplies a large library of high-level mathematical functions to operate on these matrices and arrays.
In the example below the usage of NumPy and the matrix calculations are demonstrated. In the example the cost (the error) of a linear regression model is computed using the equation:
where
In which is the total cost calculated by the current weight values of ; is the hypothesised value, the prediction, and is the actual value. is calculated for each observation and compared to the actual value . By adding up and eventually averaging the difference between these two values (hypothesis - actual) for each data observation, we arrive at the predictive value that the formula has with the current weight values of
To compute this we can use a naive loop or we can use the matrix computation functions included in Numpy, the so-called vectorized implementation. To demonstrate the difference in performance solutions for both methods are provided. We time the execution time
First a dataset is generated. The dataset contains a number of features (columns in the dataset except for the last one ). The final column contains a class variable. The dataset has a number of observations (the rows ). For this the numpy function np.random.rand(m, n)
is used. Next a vector containing the weights is generated (the vector). The last column, containing the class variable is sliced to a vector and the features columns are put into a matrix . For the computational purpose, a column of 1's is added to the feature matrix
The naive loop implementation of calculating the error computes for each row the prediction which is subtracted with the actual value to get the difference between the actual value and the model value. The prediction is calculated using a for loop to compute the weight times the feature value for each feature according the equation The difference between the actual value and the model value is squared and averaged to estimate the average error of the model
For the hypothesis we can use a vectorized implementation:
With Numpy we can easily generate and manipulate vectors and matrices.
We can transpose vectors and matrices using .T
We can apply vectorized computations using power, division, subtractions, multiplications with np.dot
and get mean with np.mean
Vectorized implementation is incredibly faster than an ordinary loop
You should use Numpy arrays, or a library that builds upon Numpy like pandas, for data processing
You are stupid if you use a for loop for dataprocessing
Learn more about Numpy: