Overview
Teaching: 15 min
Exercises: 15 min
Questions
How do we store large amounts of data?
Objectives
Learn to use lists and Numpy arrays, and explain the difference between each.
At the end of the last lesson, we noticed that sys.argv gave us a new data structure: a list.
A list is a set of objects enclosed by a set of square brackets ([]).
example = [1, 2, 4, 5]
example
[1, 2, 4, 5]
Note that a list can hold any type of item, even other lists!
example = [1, True, None, ["word", 123], "test"]
example
[1, True, None, ['word', 123], 'test']
We can get different pieces of a list via indexing. We add a set of square brackets after the list in question along with the index of the values we want. Note that in Python, all indices start from 0 - the first element is actually the 0th element (this is different from languages like R or Matlab). The best way to think about array indices is that they are the number of offsets from the first position - the first element does not require an offset to get to.
And a few examples of this in action:
# first element
example[0]
# second element
example[1]
# fetch the list inside the list
example[3]
1
True
['word', 123]
Note that we can index a range using the colon (:).
A colon by itself means fetch everything.
[1, True, None, ['word', 123], 'test']
A colon on the right side of an index means everything from the specified index onward, including the element at that index.
[None, ['word', 123], 'test']
A colon on the left side of an index means everything before, but not including, the index.
And if we use a negative index, it means get elements from the end, going backwards.
# last element
example[-1]
# everything except the last two elements
example[:-2]
'test'
[1, True, None]
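To see the slicing forms side by side, here is a small sketch. The list letters and the variable names are made up for this illustration and are not part of the lesson's own example.

```python
# Hypothetical list used only for this illustration
letters = ['a', 'b', 'c', 'd', 'e']

everything = letters[:]          # a colon by itself: every element
from_two = letters[2:]           # index 2 through the end
before_two = letters[:2]         # indices 0 and 1; index 2 is excluded
middle = letters[1:4]            # indices 1, 2 and 3; index 4 is excluded
last = letters[-1]               # negative indices count from the end
all_but_last_two = letters[:-2]  # stop two elements before the end

print(everything)        # ['a', 'b', 'c', 'd', 'e']
print(from_two)          # ['c', 'd', 'e']
print(before_two)        # ['a', 'b']
print(middle)            # ['b', 'c', 'd']
print(last)              # e
print(all_but_last_two)  # ['a', 'b', 'c']
```

Note that the start of a slice is included and the end is excluded, so letters[:2] and letters[2:] together cover the whole list exactly once.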
Note that we can use the index multiple times to retrieve information from nested objects.
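As a minimal sketch of nested indexing, the block below reuses the example list defined earlier. The variable names inner, word and first_letter are our own, chosen only for illustration.

```python
example = [1, True, None, ["word", 123], "test"]

inner = example[3]               # the list stored at index 3
word = example[3][0]             # first element of that inner list
first_letter = example[3][0][0]  # strings can be indexed too

print(inner)         # ['word', 123]
print(word)          # word
print(first_letter)  # w
```

Each pair of square brackets is applied to whatever the previous index returned, reading from left to right.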
If we index out of range, it is an error:
---------------------------------------------------------------------------
IndexError                                Traceback (most recent call last)
<ipython-input-12-98429cb6526b> in <module>()
----> 1 example

IndexError: list index out of range
We can also add two lists together to create a larger list.
[45, 2] + [3]
[45, 2, 3]
Like other objects in Python, lists have a behavior that can catch a lot of people off guard. What happens when we run the following code?
list1 = [1, 2, 3, 4]
list2 = list1
list2 += [5, 6, 7]
print('List 2 is: ', list2)
print('List 1 is: ', list1)
List 2 is: [1, 2, 3, 4, 5, 6, 7]
List 1 is: [1, 2, 3, 4, 5, 6, 7]
Modifying list2 actually modified list1 as well.
In Python, lists are objects.
Unlike in R, objects are not copied when we assign them to a new name.
This is an important optimization, as we won’t accidentally fill up all of our computer’s memory by renaming a variable a couple of times.
When we ran list2 = list1, it just created a second name for the list that list1 refers to.
Both names still point at the same underlying object.
We can verify this with the id() function.
id() returns an object's unique identifier.
Two objects that exist at the same time will not have the same ID unless they are the same object.
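A minimal sketch of that check is shown below. The third list, list3, is a hypothetical addition used to contrast "same object" with "equal contents"; the is operator is a standard shorthand for comparing identities.

```python
list1 = [1, 2, 3, 4]
list2 = list1          # a second name for the same list object
list3 = [1, 2, 3, 4]   # hypothetical: a separate list with equal contents

print(id(list1) == id(list2))  # True
print(list1 is list2)          # True
print(list1 == list3)          # True
print(list1 is list3)          # False
```

The actual numbers returned by id() differ from run to run, so it is the comparison that matters, not the values themselves.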
In order to create list2 as a unique copy of list1, we have to use the .copy() method.
list1 = [1, 2, 3, 4]
list2 = list1.copy()
list2 += [5, 6, 7]
print('List 2 is: ', list2)
print('List 1 is: ', list1)
id(list2)
id(list1)
List 2 is: [1, 2, 3, 4, 5, 6, 7]
List 1 is: [1, 2, 3, 4]
140319554648072
140319554461896
.copy() is a method.
Methods are special functions associated with an object and define what it can do.
They follow the syntax object.method(arg1, arg2) and have a predefined number of arguments, many of which have default values. We may also specify only a subset of the arguments.
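As a rough illustration of default arguments, the sketch below uses two built-in list methods, .sort() and .pop(). The list values is made up, and these particular methods are our choice for illustration, not necessarily the ones the lesson originally showed.

```python
values = [3, 1, 2]          # hypothetical list

values.sort()               # all defaults: sort in ascending order
print(values)               # [1, 2, 3]

values.sort(reverse=True)   # override only the reverse argument
print(values)               # [3, 2, 1]

last = values.pop()         # default index: remove and return the last item
first = values.pop(0)       # explicit index: remove and return item 0
print(last, first, values)  # 1 3 [2]
```

Calling a method with no arguments simply uses every default, while naming an argument (as in reverse=True) changes just that one.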
Other frequently used methods of lists include .append():
[1, 2, 3, 4, 77]
# this adds a one-element list list1.append()
[1, 2, 3, 4, 77, ]
.extend() (combines two lists, instead of adding the second list as an element):
list1.extend([99, 88, 101])
[1, 2, 3, 4, 77, , 99, 88, 101]
And of course, .remove() and .clear() (both do exactly what you think they should do):
list1.remove() print(list1) list1.clear() print(list1)
[1, 2, 3, 4, 77, 99, 88, 101] 
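The four methods above can be compared in one short sketch. The list numbers and the values passed to each method are hypothetical, chosen only to make the difference between .append() and .extend() visible.

```python
numbers = [1, 2, 3, 4]          # hypothetical list

numbers.append(77)              # add one item to the end
print(numbers)                  # [1, 2, 3, 4, 77]

numbers.append([5])             # the whole list [5] becomes a single item
print(numbers)                  # [1, 2, 3, 4, 77, [5]]
after_append = list(numbers)    # snapshot for later comparison

numbers.extend([99, 88])        # each item of the argument is added separately
print(numbers)                  # [1, 2, 3, 4, 77, [5], 99, 88]

numbers.remove([5])             # remove the first item equal to [5]
print(numbers)                  # [1, 2, 3, 4, 77, 99, 88]
after_remove = list(numbers)

numbers.clear()                 # empty the list in place
print(numbers)                  # []
```

All of these methods change the list in place and return None, so there is no need to assign their result back to the variable.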
Dynamic resizing of lists
Python’s lists are an extremely optimized data structure. Unlike R’s vectors, there is no significant time penalty for repeatedly adding elements to a list: appending takes constant time on average. You rarely need to preallocate a list at a certain size for performance reasons.
We’ll very frequently want to iterate over lists and perform an operation with every element. We do this using a for loop.
A for loop generally looks like the following:
for variable in things_to_iterate_over:
    do_stuff_with(variable)
An example of an actually functioning for loop is shown below:
for i in range(10):
    print(i)
0
1
2
3
4
5
6
7
8
9
In this case we are iterating over the values provided by range().
range() is a built-in that produces a sequence of numbers on demand, rather than storing them all in a list.
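A short sketch of how range() behaves outside of a loop, assuming Python 3. The variable name numbers is our own.

```python
numbers = range(10)

print(numbers)                # range(0, 10)
print(list(numbers))          # [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
print(len(numbers))           # 10
print(numbers[3])             # 3

# range(start, stop, step): stop is excluded, just like in slicing
print(list(range(2, 10, 3)))  # [2, 5, 8]
```

Wrapping the range in list() is only needed when we want to see or store all of the values at once; a for loop can consume the range directly.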
We can also iterate over a list, or any collection of elements:
for element in ['a', True, None]:
    print(type(element))
<class 'str'>
<class 'bool'>
<class 'NoneType'>
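Loops and dynamically resized lists are often combined: we start with an empty list and append one result per iteration. The sketch below is one way to do that; the names items and type_names are hypothetical.

```python
items = ['a', True, None]   # the same elements as in the loop above
type_names = []             # start empty; no size is declared up front

for element in items:
    # record the name of each element's type instead of printing it
    type_names.append(type(element).__name__)

print(type_names)           # ['str', 'bool', 'NoneType']
```

The result list grows by one element on every pass through the loop, so it ends up with the same length as the list we iterated over.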
Numpy is a numerical library designed to make working with numbers easier than it would otherwise be.
For example, say we had a list of a thousand numbers. There’s no way to do vector math without iterating through all the elements!
vals = list(range(1000))
new_vals = vals.copy()
print(new_vals[:5])
for idx in range(1000):
    new_vals[idx] += 10
print(new_vals[:5])
[0, 1, 2, 3, 4]
[10, 11, 12, 13, 14]
That was a lot of work.
Numpy lets us do vector math like in R, saving us a lot of effort.
The most basic function is
np.array() which creates a numerical
array from a list.
A numpy array is a collection of numbers that can have any number of dimensions.
In this case, there is only one dimension, since we created the array from a list.
import numpy as np
new_vals = np.array(vals)
new_vals += 10
new_vals[:5]
array([10, 11, 12, 13, 14])
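To make the difference from lists concrete, the sketch below applies the same operators to a list and to a Numpy array. The three-element list is a made-up example kept small so the output is easy to read.

```python
import numpy as np

vals = [1, 2, 3]        # small hypothetical list
arr = np.array(vals)    # the same numbers as a Numpy array

# On lists, + and * join and repeat; they do no arithmetic
print(vals + [10])      # [1, 2, 3, 10]
print(vals * 2)         # [1, 2, 3, 1, 2, 3]

# On arrays, the same operators work element by element
print(arr + 10)         # [11 12 13]
print(arr * 2)          # [2 4 6]
print(arr + arr)        # [2 4 6]
```

This is why adding 10 to every element needed a loop for the list but only a single expression for the array.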
One very nice thing about Numpy is that it’s much more performant than ordinary Python lists.
A nice trick we can use with IPython to measure execution times is the %timeit magic function.
Anything following %timeit on the same line gets measured for speed.
Adding %% to the timeit command instead of % means that timeit is run on the entire cell, not just a single line. Note that %%timeit must be on the first line of an IPython/Jupyter cell for it to work, whereas the %timeit command can be used anywhere.
Using Python’s lists:
%%timeit
for idx in range(1000):
    vals[idx] + 10
10000 loops, best of 3: 165 µs per loop
Using Numpy:
%timeit new_vals + 10
The slowest run took 22.13 times longer than the fastest. This could mean that an intermediate result is being cached. 1000000 loops, best of 3: 1.63 µs per loop
Numpy was about 100x faster, though %timeit warned that an intermediate result may have been cached, so the fastest run could flatter Numpy. Even in Numpy’s worst-case run (about 22 times slower than its fastest), it was still roughly 5x faster than looping over Python’s basic lists.
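The %timeit magic only exists inside IPython and Jupyter. In a plain Python script, a similar comparison can be sketched with the standard library's timeit module. The function names add_with_loop and add_with_numpy and the repeat count are our own choices, and the timings will differ from machine to machine.

```python
import timeit

import numpy as np

vals = list(range(1000))
new_vals = np.array(vals)

def add_with_loop():
    # hypothetical helper: add 10 to each list element, one at a time
    for idx in range(1000):
        vals[idx] + 10

def add_with_numpy():
    # hypothetical helper: add 10 to the whole array in one operation
    new_vals + 10

repeats = 1000
loop_time = timeit.timeit(add_with_loop, number=repeats)    # total seconds
numpy_time = timeit.timeit(add_with_numpy, number=repeats)  # total seconds

print(f"loop:  {loop_time / repeats * 1e6:.2f} microseconds per call")
print(f"numpy: {numpy_time / repeats * 1e6:.2f} microseconds per call")
```

timeit.timeit() returns the total time for all repeats, so we divide by the repeat count to get a per-call figure comparable to what %timeit reports.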
Sometimes, you’ll encounter a dataset with multiple dimensions and will need to be able to retrieve elements from it as such.
arr2d = np.arange(0, 40)  # sequence of numbers from 0 to 39
arr2d = arr2d.reshape([5, 8])  # reshape so it has 5 rows and 8 columns
arr2d
array([[ 0,  1,  2,  3,  4,  5,  6,  7],
       [ 8,  9, 10, 11, 12, 13, 14, 15],
       [16, 17, 18, 19, 20, 21, 22, 23],
       [24, 25, 26, 27, 28, 29, 30, 31],
       [32, 33, 34, 35, 36, 37, 38, 39]])
In this case, we must index using multiple indices, separated by a comma.
To grab the first element, we would use arr2d[0, 0].
The first index corresponds to rows, the second corresponds to columns, and the third to the next dimension…
arr2d[0, :]
arr2d[:, 0]
array([0, 1, 2, 3, 4, 5, 6, 7])
array([ 0,  8, 16, 24, 32])
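Slices can be used on both axes at once. The sketch below rebuilds the same 5 by 8 array and pulls out a single element and a small block; the names element and block, and the particular rows and columns chosen, are only for illustration.

```python
import numpy as np

arr2d = np.arange(0, 40).reshape([5, 8])  # 5 rows, 8 columns

element = arr2d[2, 3]    # row 2, column 3 (both counted from 0)
block = arr2d[1:3, 2:5]  # rows 1-2 and columns 2-4; the ends are excluded

print(arr2d.shape)       # (5, 8)
print(element)           # 19
print(block)
# [[10 11 12]
#  [18 19 20]]
print(block.shape)       # (2, 3)
```

As with lists, the end of each slice is excluded, so 1:3 selects two rows and 2:5 selects three columns.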
Retrieve everything defined in the range of rows 4-5 and columns 1-4.
Lists store a sequence of elements.
Numpy allows vector math in Python.