NumPy
Introduction
An overview of data processing and the NumPy library.
A. Data Processing
When asked about Google's model for success, Peter Norvig, the director of research at Google, famously stated,
"We don't have better algorithms than anyone else; we just have more data."
Though probably an understatement (given the amount of talent employed at Google), the quote does provide a sense of just how vital data is to having successful outcomes.
People normally discuss the importance of data in the context of machine learning. No matter how sophisticated a machine learning model is, it will not perform well unless it has a reasonable amount of data to train on. On the other hand, given a large and diverse set of training data, a good deep learning model will often significantly outperform non-deep learning algorithms. However, the use of data is not limited to machine learning. Companies use data to identify customer trends, political parties use data to determine which demographics they should target, sports teams use data to analyze players, etc.
Example baseball data used in sabermetrics. The concept was popularized by the 2011 film, Moneyball.
The universal usage of data makes data processing, the act of converting raw data into a meaningful form, an essential skill to have.
B. NumPy
Many scenarios involve mostly numeric datasets. For example, medical data contains many numeric metrics, such as height, weight, and blood pressure. Furthermore, the majority of neural networks use input data that is either numeric or has been converted to a numeric form.
When we deal with numeric data, the best Python library to use is NumPy. The NumPy library allows us to perform many operations on numeric data, and convert the data to more usable forms.
# Import the NumPy library
import numpy as np
# Initializing a NumPy array
arr = np.array([-1, 2, 5], dtype=np.float32)
# Print the representation of the array
print(repr(arr))
NumPy Arrays
A. Arrays
NumPy arrays behave much like Python lists, with added features: every element shares a single type, and arithmetic can be applied to all elements at once.
In fact, you can easily convert a Python list to a NumPy array using the np.array function, which takes in a Python list as its required argument.
The function also has quite a few keyword arguments, but the main one to know is dtype.
The dtype keyword argument takes in a NumPy type and manually casts the array to the specified type.
'''
Array is manually cast to np.float32.
'''
import numpy as np
arr = np.array([[0, 1, 2], [3, 4, 5]], dtype=np.float32)
print(repr(arr)) #=> array([[0., 1., 2.],[3., 4., 5.]], dtype=float32)
When the elements of a NumPy array are mixed types, then the array's type will be upcast to the highest level type. This means that if an array input has mixed int and float elements, all the integers will be cast to their floating-point equivalents. If an array is mixed with int, float, and string elements, everything is cast to strings.
arr = np.array([0, 0.1, 2])
print(repr(arr))
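The upcasting rule described above can be checked by inspecting each array's dtype. The sketch below is only an illustration: the variable names are arbitrary, and because the exact string width of the last dtype can vary, only its kind is printed.

```python
import numpy as np

# Integers mixed with floats are upcast to a floating-point type
mixed_numeric = np.array([0, 0.1, 2])
print(mixed_numeric.dtype)  # float64

# Adding a string element upcasts every element to a string
mixed_with_str = np.array([0, 0.1, 'two'])
print(mixed_with_str.dtype.kind)  # U (a unicode string type)
```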
B. Copying
Similar to Python lists, when we make a reference to a NumPy array it doesn't create a different array. Therefore, if we change a value using the reference variable, it changes the original array as well. We get around this by using an array's inherent copy function. The function has no required arguments, and it returns the copied array.
'''
In the code example below, c is a reference to a while d is a copy of b.
Therefore, changing c leads to the same change in a,
while changing d does not change the value of b.
'''
a = np.array([0, 1])
b = np.array([9, 8])
c = a
print('Array a: {}'.format(repr(a))) #=> Array a: array([0, 1])
c[0] = 5
print('Array a: {}'.format(repr(a))) #=> Array a: array([5, 1])
d = b.copy()
d[0] = 6
print('Array b: {}'.format(repr(b))) #=> Array b: array([9, 8])
C. Casting
We cast NumPy arrays through their inherent astype function. The function's required argument is the new type for the array. It returns the array cast to the new type.
arr = np.array([0, 1, 2])
print(arr.dtype) #=> int64
arr = arr.astype(np.float32)
print(arr.dtype) #=> float32
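Two details of astype are easy to miss: it returns a new array rather than modifying the original, and casting floats to an integer type drops the fractional part. A minimal sketch, with illustrative variable names that are not from the original text:

```python
import numpy as np

floats = np.array([1.7, -1.7, 2.0])

# astype returns a new array; the fractional part is truncated toward zero
ints = floats.astype(np.int32)

print(ints.tolist())  # [1, -1, 2]
print(ints.dtype)     # int32
print(floats.dtype)   # float64 (the original array is unchanged)
```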
D. NaN
When we don't want a NumPy array to contain a value at a particular index, we can use np.nan to act as a placeholder.
A common usage for np.nan is as a filler value for incomplete data.
'''
The code below shows an example usage of np.nan.
Note that np.nan cannot take on an integer type.
'''
arr = np.array([np.nan, 1, 2])
print(repr(arr)) #=> array([nan, 1., 2.])
arr = np.array([np.nan, 'abc'])
print(repr(arr)) #=> array(['nan', 'abc'], dtype='<U32')
# Will result in a ValueError
np.array([np.nan, 1, 2], dtype=np.int32)
E. Infinity
To represent infinity in NumPy, we use the np.inf special value.
We can also represent negative infinity with -np.inf.
print(np.inf > 1000000) #=> True
arr = np.array([np.inf, 5])
print(repr(arr)) #=> array([inf, 5.])
arr = np.array([-np.inf, 1])
print(repr(arr)) #=> array([-inf, 1.])
# Will result in an OverflowError
np.array([np.inf, 3], dtype=np.int32)
NumPy Basics
Perform basic operations to create and modify NumPy arrays.
A. Ranged Data
While np.array can be used to create any array, it is equivalent to hardcoding an array.
This won't work when the array has hundreds of values.
Instead, NumPy provides an option to create ranged data arrays using np.arange.
The function acts very similarly to the range function in Python, and will always return a 1-D array.
arr = np.arange(5)
print(repr(arr)) #=> array([0, 1, 2, 3, 4])
arr = np.arange(5.1)
print(repr(arr)) #=> array([0., 1., 2., 3., 4., 5.])
arr = np.arange(-1, 4)
print(repr(arr)) #=> array([-1, 0, 1, 2, 3])
arr = np.arange(-1.5, 4, 2)
print(repr(arr)) #=> array([-1.5, 0.5, 2.5])
To specify the number of elements in the returned array, rather than the step size, we can use the np.linspace function.
This function takes two required arguments, the start and end of the range, respectively.
The end of the range is inclusive for np.linspace, unless the keyword argument endpoint is set to False.
To specify the number of elements, we set the num keyword argument (its default value is 50).
arr = np.linspace(5, 11, num=4)
print(repr(arr)) #=> array([ 5., 7., 9., 11.])
arr = np.linspace(5, 11, num=4, endpoint=False)
print(repr(arr)) #=> array([5. , 6.5, 8. , 9.5])
arr = np.linspace(5, 11, num=4, dtype=np.int32)
print(repr(arr)) #=> array([ 5, 7, 9, 11], dtype=int32)
B. Reshaping Data
The function we use to reshape data in NumPy is np.reshape. It takes in an array and a new shape as required arguments. The new shape must exactly contain all the elements from the input array. For example, we could reshape an array with 12 elements to (4, 3), but we can't reshape it to (4, 4).
We are allowed to use the special value of -1 in at most one dimension of the new shape. The dimension with -1 will take on the value necessary to allow the new shape to contain all the elements of the array.
arr = np.arange(8)
reshaped_arr = np.reshape(arr, (2, 4))
print(repr(reshaped_arr)) #=> array([[0, 1, 2, 3], [4, 5, 6, 7]])
print('New shape: {}'.format(reshaped_arr.shape)) #=> New shape: (2, 4)
reshaped_arr = np.reshape(arr, (-1, 2, 2))
print(repr(reshaped_arr)) #=> array([[[0, 1], [2, 3]], [[4, 5], [6, 7]]])
print('New shape: {}'.format(reshaped_arr.shape)) #=> New shape: (2, 2, 2)
NumPy provides an inherent function for flattening an array, flatten.
Flattening an array reshapes it into a 1-D array. Since we need to flatten data quite often, it is a useful function.
arr = np.arange(8)
arr = np.reshape(arr, (2, 4))
print(repr(arr)) #=> array([[0, 1, 2, 3], [4, 5, 6, 7]])
flattened = arr.flatten()
print(repr(flattened)) #=> array([0, 1, 2, 3, 4, 5, 6, 7])
print('flattened shape: {}'.format(flattened.shape)) #=> flattened shape: (8,)
C. Transposing
We can transpose the data, using the np.transpose function, to convert it to the proper format that we require.
arr = np.arange(8)
arr = np.reshape(arr, (4, 2)) # arr is now a 4x2 matrix
transposed = np.transpose(arr) # transposed is a 2x4 matrix
It also has a single keyword argument called axes, which represents the new permutation of the dimensions.
The permutation is a tuple/list of integers, with the same length as the number of dimensions in the array.
It tells us how to rearrange the dimensions.
For example, if the permutation had 2 at index 1, it means the old third dimension of the data becomes the new second dimension (dimensions are zero-indexed, so the value 2 refers to the third dimension and index 1 represents the second dimension).
arr = np.arange(24)
arr = np.reshape(arr, (3, 4, 2))
transposed = np.transpose(arr, axes=(1, 2, 0))
print('arr shape: {}'.format(arr.shape))
print('transposed shape: {}'.format(transposed.shape))
In this example, the old first dimension became the new third dimension, the old second dimension became the new first dimension, and the old third dimension became the new second dimension. The default value for axes is a dimension reversal (e.g. for 3-D data the default axes value is [2, 1, 0]).
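The permutation rule and the default reversal can both be verified on the same (3, 4, 2) array. This is a small sketch; names such as default_t and permuted are illustrative only.

```python
import numpy as np

arr = np.reshape(np.arange(24), (3, 4, 2))

# With no axes argument, the dimensions are reversed
default_t = np.transpose(arr)
explicit_t = np.transpose(arr, axes=(2, 1, 0))
print(default_t.shape)                        # (2, 4, 3)
print(np.array_equal(default_t, explicit_t))  # True

# axes=(1, 2, 0): new dim 0 is old dim 1, new dim 1 is old dim 2,
# and new dim 2 is old dim 0
permuted = np.transpose(arr, axes=(1, 2, 0))
print(permuted.shape)                         # (4, 2, 3)
print(arr[2, 3, 1] == permuted[3, 1, 2])      # True
```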
D. Zeros and Ones
Sometimes, we need to create arrays filled solely with 0 or 1. NumPy provides the functions np.zeros and np.ones.
They both take in the same arguments, which includes just one required argument, the array shape.
The functions also allow for manual casting using the dtype keyword argument.
arr = np.zeros(4) #=> array([0., 0., 0., 0.])
arr = np.ones((2, 3), dtype=np.int32) #=> array([[1, 1, 1],[1, 1, 1]], dtype=int32)
If we want to create an array of 0's or 1's with the same shape as another array, we can use np.zeros_like and np.ones_like.
arr = np.array([[1, 2], [3, 4]])
print(repr(np.ones_like(arr, dtype=np.int32)))
#=> array([[1, 1],[1, 1]], dtype=int32)
Math
Understand how arithmetic and linear algebra work in NumPy.
A. Arithmetic
One of the main purposes of NumPy is to perform multi-dimensional arithmetic. Using NumPy arrays, we can apply arithmetic to each element with a single operation.
arr = np.array([[1, 2], [3, 4]])
# Add 1 to element values
print(repr(arr + 1))
# Subtract element values by 1.2
print(repr(arr - 1.2))
# Double element values
print(repr(arr * 2))
# Halve element values
print(repr(arr / 2))
# Integer division (half)
print(repr(arr // 2))
# Square element values
print(repr(arr**2))
# Square root element values
print(repr(arr**0.5))
Using NumPy arithmetic, we can easily modify large amounts of numeric data with only a few operations. For example, we could convert a dataset of Fahrenheit temperatures to their equivalent Celsius form.
def f2c(temps):
    return (5/9)*(temps-32)
fahrenheits = np.array([32, -4, 14, -40])
celsius = f2c(fahrenheits)
print('Celsius: {}'.format(repr(celsius)))
It is important to note that performing arithmetic on NumPy arrays does not change the original array, and instead produces a new array that is the result of the arithmetic operation.
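A quick way to see this is to keep both the original array and the result. The names below are illustrative only.

```python
import numpy as np

original = np.array([1, 2, 3])

# The arithmetic produces a new array
doubled = original * 2

print(repr(original))  # array([1, 2, 3])
print(repr(doubled))   # array([2, 4, 6])

# To keep only the result, reassign the variable explicitly
values = np.array([1, 2, 3])
values = values * 2
print(repr(values))    # array([2, 4, 6])
```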
B. Non-Linear Functions
NumPy also allows you to use non-linear functions such as exponentials and logarithms.
The function np.exp performs a base e exponential on an array, while the function np.exp2 performs a base 2 exponential.
Likewise, np.log, np.log2, and np.log10 all perform logarithms on an input array, using base e, base 2, and base 10, respectively.
Note that np.e and np.pi represent the mathematical constants e and π, respectively.
arr = np.array([[1, 2], [3, 4]])
# e raised to the power of each element
print(repr(np.exp(arr)))
# 2 raised to the power of each element
print(repr(np.exp2(arr)))
arr2 = np.array([[1, 10], [np.e, np.pi]])
# Natural logarithm
print(repr(np.log(arr2)))
# Base 10 logarithm
print(repr(np.log10(arr2)))
To do a regular power operation with any base, we use np.power. The first argument to the function is the base, while the second is the power. If the base or power is an array rather than a single number, the operation is applied to every element in the array.
arr = np.array([[1, 2], [3, 4]])
# Raise 3 to power of each number in arr
print(repr(np.power(3, arr)))
arr2 = np.array([[10.2, 4], [3, 5]])
# Raise arr2 to power of each number in arr
print(repr(np.power(arr2, arr)))
In addition to exponentials and logarithms, NumPy has various other mathematical functions, which are listed in the NumPy documentation on mathematical functions.
C. Matrix Multiplication
Since NumPy arrays are basically vectors and matrices, it makes sense that there are functions for dot products and matrix multiplication.
Specifically, the main function to use is np.matmul, which takes two vector/matrix arrays as input and produces a dot product or matrix multiplication.
Note that the dimensions of the two input matrices must be valid for a matrix multiplication.
Specifically, the second dimension of the first matrix must equal the first dimension of the second matrix, otherwise np.matmul will result in a ValueError.
arr1 = np.array([1, 2, 3])
arr2 = np.array([-3, 0, 10])
print(np.matmul(arr1, arr2))
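The example above covers the vector dot product. For 2-D inputs, the dimension rule can be sketched as follows; the matrices a and b are made up for illustration.

```python
import numpy as np

a = np.array([[1, 2, 3],
              [4, 5, 6]])   # shape (2, 3)
b = np.array([[1, 0],
              [0, 1],
              [2, 2]])      # shape (3, 2)

# (2, 3) times (3, 2) is valid and gives a (2, 2) result
product = np.matmul(a, b)
print(repr(product))

# (2, 3) times (2, 3) is invalid: 3 does not equal 2
mismatch_raised = False
try:
    np.matmul(a, a)
except ValueError as err:
    mismatch_raised = True
    print('ValueError:', err)
```

Here product works out to [[7, 8], [16, 17]], and the second call fails because the inner dimensions do not match.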
Random
Generate numbers and arrays from different random distributions.
A. Random Integers
Similar to the Python random module, NumPy has its own submodule for pseudo-random number generation called np.random.
It provides all the necessary randomized operations and extends them to multi-dimensional arrays.
To generate pseudo-random integers, we use the np.random.randint function.
print(np.random.randint(5))
random_arr = np.random.randint(-3, high=14, size=(2, 2))
print(repr(random_arr)) #=> array([[-1, 4],[ 0, 6]])
The np.random.randint function takes in a single required argument, whose meaning depends on the high keyword argument.
If high=None (which is the default value), then the required argument represents the upper (exclusive) end of the range, with the lower end being 0.
Specifically, if the required argument is n, then the random integer is chosen uniformly from the range [0, n).
If high is not None, then the required argument will represent the lower (inclusive) end of the range, while high represents the upper (exclusive) end.
The size keyword argument specifies the size of the output array, where each integer in the array is randomly drawn from the specified range. As a default, np.random.randint returns a single integer.
B. Utility Functions
Some fundamental utility functions from the np.random module are np.random.seed and np.random.shuffle. We use the np.random.seed function to set the random seed, which allows us to control the outputs of the pseudo-random functions.
The function takes in a single integer as an argument, representing the random seed.
The code below uses np.random.seed with the same random seed. Note how the outputs of the random functions in each subsequent run are identical when we set the same random seed.
np.random.seed(1)
print(np.random.randint(10))
random_arr = np.random.randint(3, high=100, size=(2, 2))
print(repr(random_arr))
# New seed
np.random.seed(2)
print(np.random.randint(10))
random_arr = np.random.randint(3, high=100, size=(2, 2))
print(repr(random_arr))
# Original seed
np.random.seed(1)
print(np.random.randint(10))
random_arr = np.random.randint(3, high=100, size=(2, 2))
print(repr(random_arr))
The np.random.shuffle function allows us to randomly shuffle an array.
Note that the shuffling happens in place (i.e. no return value), and shuffling multi-dimensional arrays only shuffles the first dimension.
The code below shows example usages of np.random.shuffle. Note that only the rows of the matrix are shuffled (i.e. shuffling along the first dimension only).
vec = np.array([1, 2, 3, 4, 5])
np.random.shuffle(vec)
print(repr(vec))
np.random.shuffle(vec)
print(repr(vec))
matrix = np.array([[1, 2, 3],
[4, 5, 6],
[7, 8, 9]])
np.random.shuffle(matrix)
print(repr(matrix))
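Because np.random.shuffle works in place, assigning its return value is a common mistake. The sketch below shows that it returns None. It also uses np.random.permutation, a related function not covered in the text above, which returns a shuffled copy and leaves the input unchanged. The seed value is arbitrary.

```python
import numpy as np

np.random.seed(0)  # arbitrary seed, only for repeatability
vec = np.array([1, 2, 3, 4, 5])

# shuffle modifies its argument and returns None
shuffle_result = np.random.shuffle(vec.copy())
print(shuffle_result)  # None

# permutation returns a shuffled copy and leaves vec untouched
shuffled_copy = np.random.permutation(vec)
print(repr(vec))  # array([1, 2, 3, 4, 5])
print(repr(shuffled_copy))
```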
C. Distributions
Using np.random we can also draw samples from probability distributions.
For example, we can use np.random.uniform to draw pseudo-random real numbers from a uniform distribution.
print(np.random.uniform())
print(np.random.uniform(low=-1.5, high=2.2))
print(repr(np.random.uniform(size=3)))
print(repr(np.random.uniform(low=-3.4, high=5.9, size=(2, 2))))
The function np.random.uniform actually has no required arguments. The keyword arguments, low and high, represent the inclusive lower end and exclusive upper end from which to draw random samples. Since they have default values of 0.0 and 1.0, respectively, the default outputs of np.random.uniform come from the range [0.0, 1.0).
The size keyword argument is the same as the one for np.random.randint, i.e. it represents the output size of the array.
Another popular distribution we can sample from is the normal (Gaussian) distribution. The function we use is np.random.normal.
print(np.random.normal())
print(np.random.normal(loc=1.5, scale=3.5))
print(repr(np.random.normal(loc=-2.4, scale=4.0, size=(2, 2))))
Like np.random.uniform, np.random.normal has no required arguments.
The loc and scale keyword arguments represent the mean and standard deviation, respectively, of the normal distribution we sample from.
NumPy provides quite a few more built-in distributions, which are listed in the np.random documentation.
D. Custom Sampling
While NumPy provides built-in distributions to sample from, we can also sample from a custom distribution with the np.random.choice function.
colors = ['red', 'blue', 'green']
print(np.random.choice(colors))
print(repr(np.random.choice(colors, size=2)))
print(repr(np.random.choice(colors, size=(2, 2), p=[0.8, 0.19, 0.01])))
The required argument for np.random.choice is the custom distribution we sample from. The p keyword argument denotes the probabilities given to each element in the input distribution.
Note that the list of probabilities for p must sum to 1.
In the example, we set p such that 'red' has a probability of 0.8 of being chosen, 'blue' has a probability of 0.19, and 'green' has a probability of 0.01. When p is not set, the probabilities are equal for each element in the distribution (and sum to 1).
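One way to sanity-check the p argument is to draw many samples and compare the observed frequencies with the requested probabilities. This is a rough sketch; the sample size of 10000 and the seed are arbitrary choices, and the observed fractions will only approximate p.

```python
import numpy as np

np.random.seed(0)  # arbitrary seed, only for repeatability
colors = ['red', 'blue', 'green']
samples = np.random.choice(colors, size=10000, p=[0.8, 0.19, 0.01])

# Count how often each color was drawn
values, counts = np.unique(samples, return_counts=True)
for value, count in zip(values, counts):
    print(value, count / samples.size)
```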
Indexing
Index into NumPy arrays to extract data and array slices.
A. Array accessing
Accessing NumPy arrays is identical to accessing Python lists. For multi-dimensional arrays, it is equivalent to accessing Python lists of lists.
arr = np.array([1, 2, 3, 4, 5])
arr2 = np.array([[6, 3], [0, 2]])
print(repr(arr2[0]))
B. Slicing
NumPy arrays also support slicing. Similar to Python, we use the colon operator (i.e. arr[:]) for slicing. We can also use negative indexing to slice in the backwards direction.
arr = np.array([1, 2, 3, 4, 5])
print(repr(arr[:])) #=> array([1, 2, 3, 4, 5])
print(repr(arr[::-1])) #=> array([5, 4, 3, 2, 1])
print(repr(arr[2:4])) #=> array([3, 4])
The code below shows example slices of a 2-D NumPy array.
arr = np.array([[1, 2, 3],
[4, 5, 6],
[7, 8, 9]])
print(repr(arr[:]))
print(repr(arr[1:]))
print(repr(arr[:, -1]))
print(repr(arr[:, 1:]))
print(repr(arr[0:1, 1:])) #=> array([[2, 3]])
print(repr(arr[0, 1:]))
C. Argmin and Argmax
In addition to accessing and slicing arrays, it is useful to figure out the actual indexes of the minimum and maximum elements. To do this, we use the np.argmin and np.argmax functions.
The code below shows example usages of np.argmin and np.argmax. Note that the index of element -6 is index 5 in the flattened version of arr.
arr = np.array([[-2, -1, -3],
[4, 5, -6],
[-3, 9, 1]])
print(np.argmin(arr[0])) #=> 2
print(np.argmax(arr[2])) #=> 1
print(np.argmin(arr)) #=> 5
The np.argmin and np.argmax functions take the same arguments.
The required argument is the input array and the axis keyword argument specifies which dimension to apply the operation on.
The code below shows how the axis keyword argument is used for these functions.
arr = np.array([[-2, -1, -3],
[4, 5, -6],
[-3, 9, 1]])
print(repr(np.argmin(arr, axis=0)))
print(repr(np.argmin(arr, axis=1)))
print(repr(np.argmax(arr, axis=-1)))
In our example, using axis=0 meant the function found the index of the minimum row element for each column. When we used axis=1, the function found the index of the minimum column element for each row.
Setting axis to -1 just means we apply the function across the last dimension. In this case, axis=-1 is equivalent to axis=1.
Filtering
A. Filtering data
Sometimes we have data that contains values we don't want to use.
For example, when tracking the best hitters in baseball, we may want to only use the batting average data above .300.
In this case, we should filter the overall data for only the values that we want.
The key to filtering data is through basic relation operations, e.g. ==, >, etc.
In NumPy, we can apply basic relation operations element-wise on arrays.
The code below shows relation operations on NumPy arrays. The ~ operation represents a boolean negation, i.e. it flips each truth value in the array.
arr = np.array([[0, 2, 3],
[1, 3, -6],
[-3, -2, 1]])
print(repr(arr == 3))
print(repr(arr > 0))
print(repr(arr != 1))
# Negated from the previous step
print(repr(~(arr != 1)))
Something to note is that np.nan can't be located with relation operations: an equality comparison (==) with np.nan always evaluates to False, even np.nan == np.nan.
Instead, we use np.isnan to filter for the location of np.nan.
The code below uses np.isnan to determine which locations of the array contain np.nan values.
arr = np.array([[0, 2, np.nan],
[1, np.nan, -6],
[np.nan, -2, 1]])
print(repr(np.isnan(arr)))
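The reason np.isnan is needed is that NaN is defined to be unequal to everything, including itself, so an == comparison never finds it. A minimal illustration:

```python
import numpy as np

arr = np.array([0, np.nan, 2])

# Equality against np.nan is always False
print(np.nan == np.nan)     # False
print(repr(arr == np.nan))  # array([False, False, False])

# np.isnan correctly locates the missing value
print(repr(np.isnan(arr)))  # array([False,  True, False])
```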
Each boolean array in our examples represents the location of elements we want to filter for. The way we perform the filtering itself is through the np.where function.
B. Filtering in NumPy
The np.where function takes in a required first argument, which is a boolean array where True represents the locations of the elements we want to filter for.
When the function is applied with only the first argument, it returns a tuple of 1-D arrays.
The tuple will have size equal to the number of dimensions in the data, and each array represents the True indices for the corresponding dimension.
Note that the arrays in the tuple will all have the same length, equal to the number of True elements in the input argument. The code below shows how to use np.where with a single argument.
print(repr(np.where([True, False, True])))
arr = np.array([0, 3, 5, 3, 1])
print(repr(np.where(arr == 3)))
arr = np.array([[0, 2, 3],
[1, 0, 0],
[-3, 0, 0]])
x_ind, y_ind = np.where(arr != 0)
print(repr(x_ind)) # x indices of non-zero elements
print(repr(y_ind)) # y indices of non-zero elements
print(repr(arr[x_ind, y_ind]))
The interesting thing about np.where is that it must be applied with exactly 1 or 3 arguments. When we use 3 arguments, the first argument is still the boolean array.
However, the next two arguments represent the True replacement values and the False replacement values, respectively.
The output of the function now becomes an array with the same shape as the first argument.
The code below shows how to use np.where with 3 arguments.
np_filter = np.array([[True, False], [False, True]])
positives = np.array([[1, 2], [3, 4]])
negatives = np.array([[-2, -5], [-1, -8]])
print(repr(np.where(np_filter, positives, negatives)))
np_filter = positives > 2
print(repr(np.where(np_filter, positives, negatives)))
np_filter = negatives > 0
print(repr(np.where(np_filter, positives, negatives)))
Note that our second and third arguments necessarily had the same shape as the first argument.
However, if we wanted to use a constant replacement value, e.g. -1, we could incorporate broadcasting.
Rather than using an entire array of the same value, we can just use the value itself as an argument.
The code below showcases broadcasting with np.where.
np_filter = np.array([[True, False], [False, True]])
positives = np.array([[1, 2], [3, 4]])
print(repr(np.where(np_filter, positives, -1)))
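A practical use of this pattern is replacing np.nan placeholders with a constant. The sketch below combines np.isnan with the 3-argument form of np.where; the fill value of 0 is an arbitrary choice for illustration.

```python
import numpy as np

arr = np.array([[0, 2, np.nan],
                [1, np.nan, -6]])

# Where the value is NaN use 0, otherwise keep the original value
cleaned = np.where(np.isnan(arr), 0, arr)
print(repr(cleaned))
```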
C. Axis-wise filtering
If we wanted to filter based on rows or columns of data, we could use the np.any and np.all functions. Both functions take in the same arguments, and return a single boolean or a boolean array. The required argument for both functions is a boolean array.
The code below shows usage of np.any and np.all with a single argument.
arr = np.array([[-2, -1, -3],
[4, 5, -6],
[3, 9, 1]])
print(repr(arr > 0))
print(np.any(arr > 0))
print(np.all(arr > 0))
The np.any function is equivalent to performing a logical OR (||), while the np.all function is equivalent to a logical AND (&&) on the first argument.
np.any returns True if even one of the elements in the array meets the condition and np.all returns True only if all the elements meet the condition. When only a single argument is passed in, the function is applied across the entire input array, so the returned value is a single boolean.
However, if we use a multi-dimensional input and specify the axis keyword argument, the returned value will be an array. The axis argument has the same meaning as it did for np.argmin and np.argmax in the Indexing chapter. For a 2-D array, using axis=0 applies the function down each column, returning one boolean per column, while axis=1 applies it across each row, returning one boolean per row. Setting axis to -1 just means we apply the function across the last dimension.
The code below shows examples of using np.any and np.all with the axis keyword argument.
arr = np.array([[-2, -1, -3],
[4, 5, -6],
[3, 9, 1]])
print(repr(arr > 0))
print(repr(np.any(arr > 0, axis=0)))
print(repr(np.any(arr > 0, axis=1)))
print(repr(np.all(arr > 0, axis=1)))
We can use np.any and np.all in tandem with np.where to filter for entire rows or columns of data.
In the code example below, we use np.any to obtain a boolean array representing the rows that have at least one positive number.
We then use the boolean array as the input to np.where, which gives us the actual indices of the rows with at least one positive number.
arr = np.array([[-2, -1, -3],
[4, 5, -6],
[3, 9, 1]])
has_positive = np.any(arr > 0, axis=1)
print(has_positive)
print(repr(arr[np.where(has_positive)]))
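As a side note, NumPy also accepts the boolean array itself as an index, which selects the same rows without the intermediate np.where call. A small sketch comparing the two approaches (variable names are illustrative):

```python
import numpy as np

arr = np.array([[-2, -1, -3],
                [4, 5, -6],
                [3, 9, 1]])
has_positive = np.any(arr > 0, axis=1)

# Indexing with the indices returned by np.where
rows_via_where = arr[np.where(has_positive)]
# Indexing directly with the boolean array
rows_via_mask = arr[has_positive]

print(repr(rows_via_mask))
print(np.array_equal(rows_via_where, rows_via_mask))  # True
```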
Statistics
Learn how to apply statistical metrics to NumPy data.
A. Analysis
It is often useful to analyze data for its main characteristics and interesting trends. For example, we can obtain minimum and maximum values of a NumPy array using its inherent min and max functions. This gives us an initial sense of the data's range, and can alert us to extreme outliers in the data.
The code below shows example usages of the min and max functions.
arr = np.array([[0, 72, 3],
[1, 3, -60],
[-3, -2, 4]])
print(arr.min())
print(arr.max())
print(repr(arr.min(axis=0)))
print(repr(arr.max(axis=-1)))
B. Statistical metrics
NumPy also provides basic statistical functions such as np.mean, np.var, and np.median, to calculate the mean, variance, and median of the data, respectively.
The code below shows how to obtain basic statistics with NumPy. Note that np.median applied without axis takes the median of the flattened array.
arr = np.array([[0, 72, 3],
[1, 3, -60],
[-3, -2, 4]])
print(np.mean(arr))
print(np.var(arr))
print(np.median(arr))
print(repr(np.median(arr, axis=-1)))
Each of these functions takes in the data array as a required argument and axis as a keyword argument.
For a more comprehensive list of statistical functions (e.g. calculating percentiles, creating histograms, etc.), check out the NumPy statistics page.
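As a brief taste of those additional functions, the sketch below applies np.percentile and np.histogram to the same array used earlier. The choice of the 50th percentile and 3 bins is arbitrary.

```python
import numpy as np

arr = np.array([[0, 72, 3],
                [1, 3, -60],
                [-3, -2, 4]])

# The 50th percentile of the flattened array equals the median
print(np.percentile(arr, 50))  # 1.0

# Count values in 3 equal-width bins between the min and max
counts, bin_edges = np.histogram(arr, bins=3)
print(counts)     # [1 7 1]
print(bin_edges)  # [-60. -16.  28.  72.]
```

The histogram makes the two outliers (-60 and 72) easy to spot, since each lands alone in its own bin.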
Aggregation
Use aggregation techniques to combine NumPy data and arrays.
A. Summation
In the chapter on Math, we calculated the sum of individual values between multiple arrays.
To sum the values within a single array, we use the np.sum function.
The function takes in a NumPy array as its required argument, and uses the axis keyword argument in the same way as described in previous chapters. If the axis keyword argument is not specified, np.sum returns the overall sum of the array.
The code below shows how to use np.sum.
arr = np.array([[0, 72, 3],
[1, 3, -60],
[-3, -2, 4]])
print(np.sum(arr))
print(repr(np.sum(arr, axis=0)))
print(repr(np.sum(arr, axis=1)))
In addition to regular sums, NumPy can perform cumulative sums using np.cumsum. Like np.sum, np.cumsum takes in a NumPy array as a required argument and uses the axis keyword argument. If axis is not specified, np.cumsum returns the cumulative sums for the flattened array.
The code below shows how to use np.cumsum. For a 2-D NumPy array, setting axis=0 returns an array with cumulative sums down each column, while axis=1 returns an array with cumulative sums across each row.
arr = np.array([[0, 72, 3],
[1, 3, -60],
[-3, -2, 4]])
print(repr(np.cumsum(arr)))
print(repr(np.cumsum(arr, axis=0)))
print(repr(np.cumsum(arr, axis=1)))
B. Concatenation
An important part of aggregation is combining multiple datasets. In NumPy, this equates to combining multiple arrays into one. The function we use to do this is np.concatenate.
Like the summation functions, np.concatenate uses the axis keyword argument.
However, the default value for axis is 0 (i.e. dimension 0). Furthermore, the required argument for np.concatenate is a list of arrays, which the function combines into a single array.
The code below shows how to use np.concatenate, which aggregates arrays by joining them along a specific dimension. For 2-D arrays, not setting the axis argument (defaults to axis=0) concatenates the arrays vertically. When we set axis=1, the arrays are concatenated horizontally.
arr1 = np.array([[0, 72, 3],
[1, 3, -60],
[-3, -2, 4]])
arr2 = np.array([[-15, 6, 1],
[8, 9, -4],
[5, -21, 18]])
print(repr(np.concatenate([arr1, arr2])))
print(repr(np.concatenate([arr1, arr2], axis=1)))
print(repr(np.concatenate([arr2, arr1], axis=1)))
Saving Data
Learn how to save and load NumPy data.
A. Saving
After performing data manipulation with NumPy, it's a good idea to save the data in a file for future use. To do this, we use the np.save function.
The first argument for the function is the name/path of the file we want to save our data to. The file name/path should have a ".npy" extension.
If it does not, then np.save will append the ".npy" extension to it. The second argument for np.save is the NumPy data we want to save.
The function has no return value. Also, the contents of a ".npy" file are largely gibberish when viewed with a text editor, since the data is stored in a binary format. If np.save is called with the name of a file that already exists, it will overwrite the previous file.
The code below shows examples of saving NumPy data.
arr = np.array([1, 2, 3])
# Saves to 'arr.npy'
np.save('arr.npy', arr)
# Also saves to 'arr.npy'
np.save('arr', arr)
B. Loading
After saving our data, we can load it again using np.load.
The function's required argument is the file name/path that contains the saved data.
It returns the NumPy data exactly as it was saved. Note that np.load will not append the ".npy" extension to the file name/path if it is not there.
The code below shows how to use np.load to load NumPy data.
arr = np.array([1, 2, 3])
np.save('arr.npy', arr)
load_arr = np.load('arr.npy')
print(repr(load_arr))
# Will result in FileNotFoundError
load_arr = np.load('arr')
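To confirm that a round trip preserves the data exactly, including its type and shape, we can save to a temporary directory and compare the result. This is a sketch under assumptions: the file name 'example' is made up, and tempfile is used only so the example leaves no files behind.

```python
import os
import tempfile

import numpy as np

arr = np.array([[1.5, 2.5], [3.5, 4.5]], dtype=np.float32)

with tempfile.TemporaryDirectory() as tmp_dir:
    # No extension given, so np.save appends ".npy"
    path = os.path.join(tmp_dir, 'example')
    np.save(path, arr)
    saved_files = os.listdir(tmp_dir)
    # np.load needs the full file name, including the extension
    loaded = np.load(path + '.npy')

print(saved_files)                  # ['example.npy']
print(loaded.dtype, loaded.shape)   # float32 (2, 2)
print(np.array_equal(arr, loaded))  # True
```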