\documentclass[12pt,letterpaper,twoside]{article}
\usepackage{../cme211}
\usepackage{atbegshi}% http://ctan.org/pkg/atbegshi
\begin{document}
\title{Lecture 7: Data Representation, Numerical Python\vspace{-5ex}}
\date{Fall 2020}
\maketitle
{\footnotesize
\paragraph{Topics Introduced:} Binary vs. Decimal systems, simplified
concepts of computer architecture including hierarchical memory and
arithmetic logic units, representation of integers, strings, and
floating point types, as well as \texttt{numpy}.
}
\vspace{-3ex}
\section{Representation of Data}
\subsection{Computer Representation of Integers}
Computers represent and store everything in
\href{https://en.wikipedia.org/wiki/Binary_number#Binary_counting}{\emph{binary}},
which is a base 2 number system. How is this different from what we
are familiar with in our base-10 (i.e. decimal) number system?
\begin{itemize}
\item Decimal counting uses the symbols 0-9.
\item Counting starts by incrementing the least-significant
(rightmost) digit.
\begin{verbatim}
000, 001, 002, ..., 009
\end{verbatim}
\item When we exhaust the set of available symbols, we realize a type
of overflow, in which the least-significant digit is reset to zero
and the next digit of higher significance is incremented.
\begin{verbatim}
010, 011, 012, ..., 099
\end{verbatim}
\item We repeat this process of allowing for overflow with each digit
of significance.
\begin{verbatim}
100, 101, 102, ...
\end{verbatim}
\end{itemize}
To be formal, you can think of the integer $d_m
d_{m-1} \ldots d_1 d_0$ as being equivalent shorthand for
\[
d_0 \times 10^0 + d_1 \times 10^1 + \ldots + d_{m-1} \times 10^{m-1}
+ d_m \times 10^m,
\]
where each symbol $d_i$ can take on a single value from the set $\{0,
1, \ldots, 9\}$.
\href{https://en.wikipedia.org/wiki/Binary_number#Binary_counting}{Counting
in binary} is no different, sparing the fact that our set of
symbols is reduced to have cardinality two, i.e. the atomic elements are
\href{https://en.wikipedia.org/wiki/Bit}{\emph{binary digits} (bits)}
which can be toggled between zero or one.
\begin{centering}
\begin{table}[h]
\hspace{-6ex}
\begin{tabular}{l r r r r r r r r r r r r r r r r}
Decimal & 0 & 1 & 2 & 3 & 4 & 5 & 6 & 7 & 8 & 9 & 10 & 11 & 12 & 13
& 14 & 15 \\
Binary & 0 & 1 & 10 & 11 & 100 & 101 & 110 & 111 & 1000 & 1001 &
1010
& 1011 & 1100 &
1101
& 1110 & 1111
\end{tabular}
\caption{A table describing how values can be equivalently represented
in either a decimal or a
binary number system.}
\end{table}
\end{centering}
To be formal, we can think of the integer $b_m b_{m-1} \ldots b_1 b_0$
as being equivalent shorthand for
\[
b_0 \times 2^1 + b_1 \times 2^1 + \ldots + b_{m-1} \times 2^{m-1} +
b_m \times 2^m
\]
where here we are using $b_i$'s to remind ourselves that each symbol
is a binary digit taking on only one of two values. Notice that other
than the set of symbols changing, the \emph{base} has changed as well,
from 10 to 2. Once you learn these basics, \href{https://en.wikipedia.org/wiki/Binary_number#Conversion_to_and_from_other_numeral_systems}{converting between numeral
systems} is easy.
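For instance, Python's built-in \texttt{bin} and \texttt{int} functions let us
check such conversions directly (a quick illustration added here, using only
built-ins):
\begin{python}
print(bin(73))            # '0b1001001': 73 written in base 2.
print(int('1001001', 2))  # 73: parse a base-2 string back into an integer.
\end{python}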
There are 8 bits in a
\href{https://en.wikipedia.org/wiki/Byte}{byte}. Depending on the
computer architecture, a
\href{https://en.wikipedia.org/wiki/Word_(computer_architecture)}{word}
could be 4 or 8 bytes.
\subsubsection{Simplified Model of Computer}
Where are our data stored on a computer? How is it operated on?
A typical computer will have a memory hierarchy to facilitate
long-term storage (e.g. hard-disk), medium-term storage (RAM), and
short-term storage (cache). We present a simplified computer
architecture below, which does not picture long-term storage.
\begin{figure}[!h]
\centering
\includegraphics[scale=0.25]{fig/model-computer.png}
\caption{\footnotesize We take a simplified view of a computer architecture. In
particular, we show the relationship between
\href{https://en.wikipedia.org/wiki/Computer_data_storage\#Primary_storage}{primary
memory}
(e.g. \href{https://en.wikipedia.org/wiki/Random-access_memory}{RAM})
and a
\href{https://en.wikipedia.org/wiki/Central_processing_unit}{CPU}. The
CPU contains (i)
\href{https://en.wikipedia.org/wiki/Arithmetic_logic_unit\#Functions}{arithmetic
logic units} which take two inputs and produce a single output
(think of the operators \texttt{+}, \texttt{-}, \texttt{\&}, and \texttt{|}) and (ii)
\href{https://en.wikipedia.org/wiki/Processor_register}{registers}
which can (very quickly) store inputs for ALU operations and hold corresponding outputs.}
\end{figure}
\subsubsection{Common Prefixes}
There can be some confusion here depending on the context of
usage. That is, in the context of memory and storage capacities, the prefixes
\texttt{kilo}, \texttt{mega}, \texttt{giga}, \texttt{tera},
\texttt{peta}, and \texttt{exa} are often used in a base-2 sense. However, many other
contexts speak in base 10, e.g. CPU frequencies and network speeds.
The prefix
\href{https://en.wikipedia.org/wiki/Kilobyte#Definitions_and_usage}{kilo}
always means one-thousand in the prefix of an
\href{https://en.wikipedia.org/wiki/International_System_of_Units#Prefixes}{International
System of Units}. However, since we understand the difference
between a \texttt{bit} and a \texttt{byte}, we can now understand that
there is a difference between a \texttt{kilobyte} (denoted by \textbf{kB}) and
a \emph{kibibyte} (denoted by \textbf{KiB}). The former is 1,000 bytes, i.e. 8,000
bits, whereas the latter is 1,024 bytes, i.e. 8,192 bits.
We ask that you remember that there is a big difference between these
terms, but not necessarily that you remember the exact terminology or
symbols used to denote each. Wikipedia has a nice table demonstrating
how binary and decimal systems use
\href{https://en.wikipedia.org/wiki/Orders_of_magnitude_(data)}{different prefixes across orders of magnitude}.
\subsubsection{Computer Storage of an Integer}
At the hardware level, computers \textbf{don't} use variable length
representations of numbers. I.e. although on paper we may use a variable length
representation, writing $4_{10} = 100_2$ or $73_{10} = 1001001_2$,
computers typically use fixed-width storage for numeric types.
\begin{figure}[!h]
\centering
\includegraphics[scale=0.35]{fig/bits.png}
\caption{\footnotesize We can think of a (fixed-width) sequence of bits as a panel
of lightswitches, each capable of being toggled on or off. }
\end{figure}
At the hardware level computers typically handle integers using 8, 16,
32, or 64 bits, depending on the architecture.
\vspace{-3ex}
\paragraph{Number of Representable Values Given $n$ Bits}
Realize that a sequence of $n$ bits can take on exactly
$2^n$ different values.
This dictates the range of what can be represented with a fixed number
of bits. Some common bounds we run into are.
\begin{verbatim}
2^8 = 256
2^16 = 65536 ~(65 thousand)
2^32 = 4294967296 ~(4 billion)
2^64 = 18446744073709551616 ~(18 followed by 18 zeros)
\end{verbatim}
I.e. when we went from 32-bit machines to 64-bit machines, we didn't
just double their capacity. The latter can represent
values roughly 10 orders of magnitude larger (in each register), akin to allowing 10
more decimal digits to be represented. This follows from the fact that
$$
\log_{10}(2^{64} / 2^{32}) = 64 \log_{10}(2) - 32 \log_{10}(2) = 32
\log_{10}(2) \approx 10.
$$
\vspace{-3ex}
\paragraph{Using a Sign Bit to Store Negative Integers}
The idea is to simply use one bit to store the sign of a value, and
the remaining bits to track magnitude.
This reduces the range of the magnitude from $2^n$ to
$2^{n-1}$. E.g. suppose that we have a
\href{https://en.wikipedia.org/wiki/Nibble}{nibble} to work with,
i.e. four bits. We can represent the values
$\pm 4$ as indicated at the right of each statement.
\begin{align*}
\texttt{4}_{10} &= \hspace{30pt} (\texttt{100})_2 &&\leadsto (\texttt{0100}) \\
-\texttt{4}_{10} &= -1 \times (\texttt{100})_2 &&\leadsto (\texttt{1100})
\end{align*}
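Python's \texttt{format} built-in can produce the underlying bit strings, which
lets us sketch the sign-magnitude idea (an illustration we add here; this is
\emph{not} how Python stores its own integers):
\begin{python}
magnitude  = format(4, '03b')   # '100': three bits of magnitude.
plus_four  = '0' + magnitude    # '0100': a leading 0 marks a non-negative value.
minus_four = '1' + magnitude    # '1100': a leading 1 marks a negative value.
print(plus_four, minus_four)
\end{python}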
\paragraph{Offset: An Unconventional Way to Represent Negative Values}
Another idea instead of using a sign bit is to simply
apply an offset or bias to reinterpret the conversion between binary
and decimal. One example may look as follows.
\begin{figure}[h]
\centering
\includegraphics[scale=0.5]{fig/sign-offset.png}
\caption{\footnotesize Suppose we have a byte or eight bits. There are $2^8
= 256$ values that can be represented by such a sequence of
bits. Centering the number of distinct values around the integer zero,
we find that we can represent the set of values $\{-127, \ldots, 0, 1,
\ldots, 128\}$. If we use an offset method for representing integers,
then we may decide to set the sequence of eight bits as all zero's to
represent the minimum value $-127$, and from here we can apply the
usual rules of carry-over/overflow to count our way up to $128$.}
\end{figure}
Again, whether we use the idiom of a sign-bit or apply an
offset, either way we effectively reduce the range of the magnitude of
integers allowed. Intuitively, recording whether a number is positive or
negative should \emph{require one unit of information}.
Many programming languages support unsigned integers. Although
Python itself does not have \textbf{unsigned integers}, \texttt{numpy} does.
A programmer can use this to their advantage, expanding the effective range available
when negative numbers don't need to be stored.
But with any fixed-width representation of a number, we must beware.
\begin{itemize}
\item Attempting to assign a value greater than what can be represented by
the data type will result in \href{https://en.wikipedia.org/wiki/Integer_overflow}{\emph{overflow}}.
Overflow tends to cause \emph{wraparound}, e.g.~if adding
together two signed numbers causes overflow the result is likely to be
a negative number.
\item Attempting to assign a value less than what can be represented by
the data type will result in
\href{https://en.wikipedia.org/wiki/Arithmetic_underflow}{\emph{underflow}}.\footnote{We
can try to fill the underflow gap using \href{https://en.wikipedia.org/wiki/Denormal_number}{denormal numbers}.}
Underflow can cause a division by zero error, for example, if we
were to scale by the difference between two quantities
$\frac{1}{b-a}$ where $a \neq b$ but $a \approx b$.
\end{itemize}
\subsubsection{Range of Integer Types}
For a fixed number of bits, if we choose to represent sequential
integers centered around zero we must make a decision on boundaries.
E.g. with eight bits, one can store
$2^8 = 256$ unique bit-sequences (or values), where we have the choice of
either being able to represent the values $\{-128, \ldots, 127\}$ or
$\{-127, \ldots, 128\}$. In general, an $n$-bit representation of an
integer stores the values in the set
\[
\{\underbrace{-2^{n-1}, \ldots, -1}_{2^{n-1} \textrm{ elements}}, \underbrace{0,
\ldots, 2^{n-1} - 1}_{2^{n-1} \textrm{ elements}}\}.
\]
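The module \texttt{numpy} exposes these bounds via \texttt{numpy.iinfo}, so we
can confirm the formula above without counting bits by hand (a small check,
assuming \texttt{numpy} is available):
\begin{python}
import numpy as np
print(np.iinfo(np.int8))       # min = -128, max = 127, i.e. -2^7 ... 2^7 - 1.
print(np.iinfo(np.uint8))      # min = 0, max = 255, i.e. 0 ... 2^8 - 1 (unsigned).
print(np.iinfo(np.int64).max)  # 9223372036854775807 = 2^63 - 1.
\end{python}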
\paragraph{Integers in Python}
In Python 3, values of type \texttt{int} may have unlimited
range.\footnote{Python 2 had both fixed size integers (with type \texttt{int}) and
variable width integers (with type \texttt{long}).}
This is quite nice, and perhaps surprising if you come from
other programming languages.
\begin{python}[basicstyle=\small]
i = 52**100
print(type(i)) # <class 'int'>
print(i) # This is beyond the 64-bit integer range.
\end{python}
The module \texttt{numpy} supports \emph{only} fixed-width integers
for reasons that relate to both performance and storage.
\subsection{Strings}
\subsubsection{ASCII}
American Standard Code for Information Interchange (ASCII) is
typically used to encode text information.
Characters, numbers, symbols, etc. are encoded using 7 bits (although
on modern computers they would typically use 8 bits).
E.g. \texttt{A} maps to $\texttt{(1000001)}_2 = (65)_{10}$,
and \texttt{B} maps to $\texttt{(1000010)}_2 = (66)_{10}$.
The default Python 2 string is ASCII; it is possible to get \href{https://en.wikipedia.org/wiki/Unicode}{Unicode} strings with \newline
\texttt{s\ =\ u\textquotesingle{}I\ am\ a\ unicode\
string!\textquotesingle{}}.
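In Python we can inspect these code points with the built-ins \texttt{ord} and
\texttt{chr} (a quick illustration):
\begin{python}
print(ord('A'), bin(ord('A')))  # 65 0b1000001
print(ord('B'), bin(ord('B')))  # 66 0b1000010
print(chr(65))                  # A: maps the code point back to its character.
\end{python}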
\subsubsection{Why Not ASCII All The Way Down? Motivating UTF-8}
\paragraph{For English, A Single Byte Suffices}
The English language has
26 letters (52 if we include casing). If we throw in punctuation and
digits 0-9, we realize that we have more than 64 unique characters to
represent. I.e. we need at least 7 bits. Since computers are more
efficient at processing bit sequences aligned in a particular way, many
implementations use 8 bits per ASCII character.
\paragraph{For Languages with Ideographs, Two Bytes Are Required}
But what about other languages? Chinese,
Japanese, and Korean have somewhere between $2^8$ and $2^{16} - 1$
unique characters, whence we
require a two-byte
sequence to store each character in these languages.
So, different languages (or regions) require different
\href{https://en.wikipedia.org/wiki/Character_encoding#Common_character_encodings}{\emph{character
encodings}} to map symbols (text) to bit-sequences.
\paragraph{For the World of All Languages, (Naively) Four Bytes Are Required}
But what about
the world wide web? The number of unique characters across all
languages exceeds $65,535$, whence we require strictly more than
two-bytes to encode any character coming from any language in the
world. Sticking with powers of two, we'll use four-bytes. But all of a
sudden, we realize how expensive this has become; any given language
only requires one or at most two bytes per character, but we're using
double that.
The advantage of UTF-32 (the 4-byte Unicode encoding), despite it being
costly in terms of storage, is that each character is
represented as a fixed-width sequence of bits, so we can slice into
strings in constant time. I.e. looking up the $n$th character in a
string is easy: we simply look up where the object lives in memory and
jump $n$ characters forward, where each character
corresponds to exactly 32 bits.
\paragraph{Using a Variable Length Encoding $\leadsto$ Efficiency}
Here we follow Mark Pilgrim's motivation of UTF-8
\href{http://histo.ucsf.edu/BMS270/diveintopython3-r802.pdf}
{as per Chapter 4 of ``Dive Into Python 3''}. UTF-8 is a \emph{variable length encoding}.
The implication is that each ASCII character
requires only a single byte. Latin characters such as
{\"a} and {\~n} may require two bytes per character, Chinese
characters may involve three bytes, and rarely used characters require
four bytes. However, since each character requires a different
number of bits, we can \emph{no longer} jump to the $k$th character of a string with
$O(1)$ work; we're relegated to inspecting the first $k-1$ characters
to find out where in the bit-sequence the $k$th character is represented.
\paragraph{Default String Encoding in Python is UTF-8}
Default string encoding in Python 3 is
\href{https://en.wikipedia.org/wiki/UTF-8}{UTF-8}.
The change from ASCII to Unicode between Python 2 and 3 caused major
headaches for the Python community.
The fact that each character of text may take from
1 to 4 bytes is nice in that the 1-byte codes
correspond to ASCII characters, which makes UTF-8 backwards
compatible with ASCII. Currently,
UTF-8 can encode a total of 1,112,064 valid code points, which is enough to represent
the majority of human writing systems.
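We can observe this variable-width behavior directly in Python 3, where
\texttt{str.encode} returns the underlying UTF-8 bytes (the sample characters
below are our own illustration):
\begin{python}
print(len('a'.encode('utf-8')))           # 1 byte: an ASCII character.
print(len('\u00f1'.encode('utf-8')))      # 2 bytes: the Latin character n-with-tilde.
print(len('\u4e2d'.encode('utf-8')))      # 3 bytes: a common Chinese character.
print(len('\U0001F600'.encode('utf-8')))  # 4 bytes: an emoji outside the Basic Multilingual Plane.
\end{python}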
\subsection{Floating Point Representation of Numeric Values}
How do we represent a floating point value using bits?
\begin{figure}[h]
\centering
\includegraphics[scale=0.45]{fig/float.png}
\caption{\footnotesize A representation of a decimal value that is broken down into
three pieces: (i) a sign bit, (ii) a fractional component, and (iii)
an exponent indicating the magnitude of rescaling. This is a generic
recipe for approximating real numbers using a finite set of bits.}
\end{figure}
\subsubsection{Floating Point Standard}
\href{https://en.wikipedia.org/wiki/IEEE_754}
{IEEE (Institute of Electrical and Electronics Engineers) 754} is the
technical standard for floating point used by all modern processors.
The standard also specifies things like rounding modes, handling overflow,
divide by zero, etc.
\begin{figure}[h]
\centering
\includegraphics[scale=0.4]{fig/float-table.png}
\caption{\footnotesize 32 vs. 64-bit floating point formats.
In moving from the 32-bit to the 64-bit format, IEEE
increased the number of bits allocated to the mantissa significantly
more than it did the exponent: with 11 bits for the
exponent, we can represent $2^{11} = 2,048$ different
exponents. We center these exponents around zero using an
\href{https://en.wikipedia.org/wiki/Exponent_bias}{exponent bias},
such that we can store exponents in the range $\pm 2^{10}$. But
these are exponents, i.e. they are used to raise our base to a
specified power. That's an astronomically
large rescaling.}
\end{figure}
\begin{figure}[h]
\centering
\includegraphics[scale=0.1]{fig/IEEE_754}
\caption{\footnotesize A visual layout of the 64 bit representation used to store
a floating point value. We have 1 sign bit, 52 mantissa
(significand, fractional) bits, and 11 exponent bits. Source: \href{https://en.wikipedia.org/wiki/Double-precision_floating-point_format}{Wikipedia}.}
\end{figure}
Symbolically,
\begin{align*}
(-1)^{\textrm{sgn}} \underbrace{\left(1 + \sum_{i=1}^{52} b_{52-i} 2^{-i}
\right)}_{= \left(1.b_{51} b_{50} \ldots b_0\right)_2} \times 2^{\textrm{exp} - 1023}
\end{align*}
where the exponent term $\textrm{exp}$ is simply an integer value
represented using 11 bits in a vanilla fashion, and we're subtracting
1,023 as means of
\href{https://en.wikipedia.org/wiki/Exponent_bias}{exponent bias} such
that we can have both positive and negative exponents without
requiring the use of an \emph{additional} sign-bit for just the
exponent term.
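To peek at these bits from Python, we can pack a double into its raw bytes with
the standard \texttt{struct} module and split off the sign, exponent, and
mantissa fields (a sketch we add purely for illustration):
\begin{python}
import struct

x = -6.25  # Equals -1.5625 * 2^2.
bits = ''.join(format(byte, '08b') for byte in struct.pack('>d', x))  # 64 bits.
sign, exponent, mantissa = bits[0], bits[1:12], bits[12:]
print(sign)                     # '1': the value is negative.
print(int(exponent, 2) - 1023)  # 2: the unbiased exponent.
print('1.' + mantissa[:8])      # '1.10010000': the significand 1.5625 in binary.
\end{python}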
\paragraph{Floating Point and You}
Floating point also has similar potential for overflow and underflow.
In addition, the limited number of bits for the mantissa means that
results often need to be rounded.
We'll spend more time on floating point arithmetic in CME 212.
Canonical resource: ``What Every Computer Scientist Should Know About
Floating-Point Arithmetic'' by Goldberg
(\href{https://ece.uwaterloo.ca/~dwharder/NumericalAnalysis/02Numerics/Double/paper.pdf}{link}).
Floating point numbers in Python are double precision (64-bit).
Additionally, \texttt{numpy} has support for 16-bit and 32-bit
floating point formats.
\paragraph{Floating Point Comparisons: what you \emph{need} to know}
For now, perhaps one essential reminder is that you should never
directly compare floating point data-types for exact equality using
the \texttt{==} operator. Instead, consider using
\href{https://docs.python.org/3/library/math.html#math.isclose}{\texttt{math.isclose}},
which allows us to compare a floating point object with another to
within a certain tolerance of precision.
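A classic instance of why this matters (a quick demonstration):
\begin{python}
import math
print(0.1 + 0.2 == 0.3)              # False: neither side is exactly representable.
print(0.1 + 0.2)                     # 0.30000000000000004
print(math.isclose(0.1 + 0.2, 0.3))  # True: equal to within a relative tolerance.
\end{python}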
\vspace{-2ex}
\subsubsection{Correctness}
{
\small
\begin{quote}
\textbf{Some disasters attributable to bad numerical computing}
By Douglas N. Arnold.
{\footnotesize \url{http://www.math.umn.edu/~arnold/disasters/disasters.html}}
Have you been paying attention in your numerical analysis or scientific
computation courses? If not, it could be a costly mistake. Here are some
real life examples of what can happen when numerical algorithms are not
correctly applied.
The
\href{http://www-users.math.umn.edu/~arnold/disasters/patriot.html}{Patriot
Missile failure},
in Dharan, Saudi Arabia, on February 25,
1991 which resulted in 28 deaths, is ultimately attributable to poor
handling of rounding errors.
The \href{http://www-users.math.umn.edu/~arnold/disasters/ariane.html}{explosion of the Ariane 5 rocket} just after lift-off on its maiden
voyage off French Guiana, on June 4, 1996, was ultimately the
consequence of a simple overflow.
The
\href{http://www-users.math.umn.edu/~arnold/disasters/sleipner.html}{sinking
of the Sleipner A offshore platform}
in Gandsfjorden near
Stavanger, Norway, on August 23, 1991, resulted in a loss of nearly one
billion dollars. It was found to be the result of inaccurate finite
element analysis.
\end{quote}
}
See also: \url{https://people.eecs.berkeley.edu/~wkahan/} for thoughts
from the \href{https://en.wikipedia.org/wiki/William_Kahan}{father of
floating point}, i.e. the primary architect of the IEEE 754 standard for
representing floating point values on computers.
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\section{Python for Numerical Computing}
\subsection{Motivation: Python is (Relatively) Slow}
One of the main disadvantages of a higher level language is that, while
comparatively easy to program, it offers less direct control over
resources; an experienced programmer using C++ or other lower level
languages
may be able to write a more performant program.
\begin{figure}[h]
\centering
\includegraphics[scale=0.45]{fig/python-v-compiled.png}
\caption{\footnotesize The Python interpreter sits between the programmer's code and
the machine. In the case of a lower level language such as C++, the
programmer has direct control over memory management and more granular
control over data types. A knowledgeable programmer can take advantage
of the particular computer architecture they are using, alongside the
problem constraints, in order to more efficiently write code to solve
a task. This is harder with Python, wherein the code that the
programmer writes is first translated by the interpreter so that the
machine can understand it. We remark that of course C++ code must also
be translated to machine code, but that it maps much more directly to the
resulting machine instructions than Python does.}
\end{figure}
\subsubsection{Runtime Performance: An Empirical Example}
Let's compute \(2^i\) for \(i \in [0,n)\). We can use the
\href{https://docs.python.org/3/library/timeit.html#basic-examples}{\texttt{timeit} module}.
Using lists, and from the command line
interface.
\begin{lstlisting}[language=bash,basicstyle=\small]
$ python3 -m timeit --setup='L=range(1000)' '[i**2 for i in L]'
\end{lstlisting}
Equivalently, we can use the \texttt{\%timeit} magic command in an
IPython notebook.
\begin{lstlisting}[language=bash,basicstyle=\small]
$ ipython
In [1]: L = range(1000)
In [2]: %timeit [i**2 for i in L]
\end{lstlisting}
What happens if we try the same operation using \texttt{numpy}? The
analogous operation to \texttt{range} in built-in Python is the
\texttt{arange} method in \texttt{numpy}. For integer arguments, the
two functions behave the same, but \texttt{numpy.arange} extends to
handling non-integer arguments.\footnote{The \texttt{range}
sub-routine is designed to return \emph{indices} that can be used as
subscript arguments when slicing an object, whereas the
\texttt{numpy.arange} is a more generalized function which can take
arbitrary boundary points and step-sizes.}
\begin{python}
import numpy as np
a = np.arange(1000)
%timeit a**2
\end{python}
On my machine, the difference is about 300x.
We'll explore how \texttt{numpy} is able to achieve such
performance gains throughout the lecture. For now, note that the module
not only allows our programs to execute
faster, but also lets us write more
concise code at a higher level of abstraction.
\subsubsection{Programmer Productivity}
Even though Python is a high-level programming language, it is
still a general purpose one, i.e. it is not tailored to numerical or
scientific computing. Without extra libraries, we \emph{do} have to write additional
code to express operations commonly used in mathematics and the
sciences.
E.g. let's add some 2D arrays.
In Python, we could try and create a two-dimensional tabular data
structure using lists of lists. In order to facilitate something as
easy as ``adding'' two 2D-arrays together, we'd have to write our own
sub-routine, since the \texttt{+} operator performs
\emph{concatenation} on lists rather than element wise addition!
\begin{python}
def my_ones(nrows, ncols):
    """
    Create a matrix-like object of dimension nrows x ncols, with each
    element taking unit value.
    """
    A = []
    for r in range(nrows):
        A.append([])
        for c in range(ncols):
            A[r].append(1.0)
    return A

def matrix_add(A,B):
    """
    Create a new matrix-like object to store the result of adding
    input matrices A and B. We assume the matrices have identical
    dimension. The work required by the algorithm is proportional to
    the number of elements stored in the input matrices.
    """
    C = []
    for r in range(len(A)):
        C.append([])
        for c in range(len(A[r])):
            C[r].append(A[r][c] + B[r][c])
    return C

nrows, ncols = 3, 2
A = my_ones(nrows,ncols)
B = my_ones(nrows,ncols)
C = matrix_add(A,B)
\end{python}
What about with \texttt{numpy}? The operations are trivial to carry
out without defining our own data structures or sub-routines.
I.e. we can ``program faster''.
\begin{python}
nrows = 3
ncols = 2
A = np.ones((nrows,ncols))
B = np.ones((nrows,ncols))
C = A + B
\end{python}
Seems a lot easier to write. Let's check if we earned ourselves a performance gain.
Suppose we have placed our definition of \texttt{my\_ones} into a
standalone file in our current working directory, which we call
\texttt{my\_module.py}. We may time the performance as
follows, using
\begin{lstlisting}[language=bash,basicstyle=\small]
$ python3 -m timeit --setup='import my_module' 'my_module.my_ones(1000, 500)'
\end{lstlisting}
Alternatively, in an IPython notebook we may simply
write \texttt{\%timeit A = my\_ones(1000,500)}.
\texttt{numpy} is able to accomplish the analogous operation on my
machine approximately 200x faster. I.e. whereas the hand-rolled variant required
50 \emph{milliseconds} per loop, the \texttt{numpy} analogue below
requires only 200 \emph{microseconds} per loop.\footnote{There are 1,000
microseconds in a millisecond.}
\begin{lstlisting}[language=bash,basicstyle=\small]
$ python3 -m timeit --setup='import numpy as np' 'np.ones((1000, 500))'
\end{lstlisting}
What about our add operation? In an IPython notebook, with \texttt{my\_ones} defined as above.
\begin{python}[basicstyle=\footnotesize]
nrows = 1000
ncols = 500
A = my_ones(nrows,ncols)
B = my_ones(nrows,ncols)
%timeit C = matrix_add(A,B)
A = np.ones((nrows,ncols))
B = np.ones((nrows,ncols))
%timeit C = A + B
\end{python}
Again we see that \texttt{numpy} is orders of magnitude faster.
\paragraph{Object Overhead} The reason that our \texttt{list} of
\texttt{list}s was so much slower than \texttt{numpy} relates in part
to object overhead.
\begin{figure}[h]
\centering
\includegraphics[scale=0.45]{fig/object-overhead.png}
\caption{Recall that objects in Python are references, and that a
reference is simply a location in memory associated with a type and
a value. If we create a \texttt{list} of \texttt{list}s to represent
2D-tabular data, we end up with several layers of indirection. Part
of the practical implication of having to jump around in memory in
order to obtain data that ``should appear next to each other'' is
that we realize many more
\href{https://en.wikipedia.org/wiki/CPU_cache\#Cache_miss}{cache-misses}.}
\end{figure}
\subsection{Options for better performance}
Python is great for quick projects, prototyping new ideas, etc. but
what if you need better performance?
One option is to completely rewrite your program in something like
C or C++.
\subsubsection{Python C API}
Python has a \href{https://docs.python.org/3/c-api/intro.html}{C API}
which allows the use of compiled modules.
\begin{figure}[h]
\centering
\includegraphics[scale=0.45]{fig/python-c-interface.png}
\caption{When we call the \texttt{str.find()} method, we're actually
making a call to native C code. What's happening here?
Within the C code we
include headers from Python's C API, which give us a way to write compiled C modules that
can be loaded and used by the Python interpreter.}
\end{figure}
The actual implementation of \texttt{string.find()} can be viewed at:
\href{https://github.com/python/cpython/blob/master/Objects/stringlib/fastsearch.h}{fastsearch.h}.
\subsubsection{Python compiled modules}
Python code in a \texttt{.py} file is actually executed in a hybrid
approach by a mix of the interpreter and compiled modules that come
with Python. See
\href{https://docs.python.org/3/tutorial/modules.html#compiled-python-files}{compiled
python files}: the compilation doesn't make the code execute faster,
it just means the definitions in a script can be loaded faster into
memory (but once loaded, the complexity of each algorithm is the same).
\begin{figure}[h]
\centering
\includegraphics[scale=0.45]{fig/python-compiled-modules.png}
\caption{Perhaps closer to reality, the Python code we write sits atop
not only an interpreter but also additional compiled modules that we
import.}
\end{figure}
The same Python C API used by the developers of Python itself also
allows other programmers to develop and build their own compiled
extension modules.
These modules extend the functionality of Python with high performance
implementations of common operations.
\paragraph{Commonly Used Modules: NumPy, SciPy, Matplotlib}
\begin{itemize}
\item
NumPy - multidimensional arrays and fundamental operations on them.
\item
SciPy - Various math functionality (linear solvers, FFT, optimization,
etc.) utilizing NumPy arrays.
\item
matplotlib - plotting and data visualization.\footnote{None of these packages seek to clone (all of) MATLAB; if you want that, try
something like GNU Octave.}
\end{itemize}
\begin{figure}[h]
\centering
\includegraphics[scale=0.45]{fig/python-stack.png}
\caption{A schematic of how Python code sits on top of an entire set
of Pythonic tools, which ultimately relay human-readable language instructions into
machine-code.}
\end{figure}
\subsection{NumPy}
At its core, \texttt{numpy} provides an
\href{https://docs.scipy.org/doc/numpy/reference/arrays.html}
{$n$-dimensional numeric array object}:
\begin{python}
import numpy as np
a = np.array([7, 42, -3])
print(a) # Prints: [ 7 42 -3] (its repr is "array([ 7, 42, -3])").
a[1] = 19 # Numeric arrays support item assignment.
a # Corresponding element is modified.
\end{python}
\paragraph{Arrays are \emph{not} Lists}
For performance reasons, \texttt{numpy.array}s do not store
heterogeneous data.
Instead, \texttt{numpy.array} is said to be
\textbf{homogeneous}: each element is of exactly the same type and hence
requires the same amount of storage in memory, and further each block
of memory (i.e. element) can be interpreted in an identical
fashion.\footnote{Note that although the types must be homogeneous, they need
not be atomic, e.g. we can store data structures within an
\texttt{array}.}
\begin{python}
a[0] = "hello" # ValueError: invalid literal for int().
a.append(0) # AttributeError: numpy.ndarray object has no attribute 'append'.
\end{python}
Further, the \textbf{size is fixed}, i.e.~you \textbf{can't append or remove}.
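If we do need a ``grown'' array, functions such as \texttt{numpy.append} or
\texttt{numpy.concatenate} return a \emph{new} array rather than modifying the
original in place (a short illustration, continuing with \texttt{np} as
imported above):
\begin{python}
a = np.array([7, 42, -3])
b = np.append(a, 0)  # Allocates and returns a new array; 'a' is unchanged.
print(a)             # [ 7 42 -3]
print(b)             # [ 7 42 -3  0]
\end{python}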
\subsubsection{Data Types}
Numerical analysts must pay close attention to their data types, and
be wary of an algorithm's
\href{https://en.wikipedia.org/wiki/Numerical_stability}{numerical
stability}. To that end, \texttt{numpy} allows the programmer to
specify exactly what data type they are choosing to store in
underlying elements of the array.
\paragraph{Integers}
Can be stored in 8, 16, 32, and 64 bit variants, and additionally they
can be either signed or unsigned. We specify these by
e.g. \texttt{np.int8}, \texttt{np.uint8}. For more information, see
\href{https://docs.scipy.org/doc/numpy/reference/arrays.dtypes.html}
{\texttt{dtype} documentation}.
\paragraph{Floating Point}
Can be stored in either 32, 64, or 128 bit variants (all signed). We
specify these by e.g. \texttt{np.float32}, \texttt{np.float64}, etc.
Complex numbers,
strings, and Python object \emph{references} are also supported by \texttt{numpy}.
\paragraph{Data Type Examples}
\begin{python}
a = np.array([ 7, 19, -3], dtype=np.float32)
a[0] = a[0]/0. # RuntimeWarning: divide by zero encountered in true_divide. Returns inf.
b = np.array([4, 7, 19], dtype=np.int8)
b[0] = 437 # Example of overflow and wraparound. Range is [-128, 127]
\end{python}
Note that we can witness the wraparound behavior for ourselves quite
clearly as follows.
\begin{python}
b[0] = 127 # Print and inspect data; stored as expected.
b[0] = b[0] + 1 # Attempt to increment by unit value.
print(b) # Note that first element now -128.
\end{python}
\subsubsection{Multidimensional Arrays}
Arrays can have multiple dimensions called \texttt{axes}.
The number of \emph{axes} is called the \texttt{rank}.
These terms come from the NumPy community and should \emph{not be confused
with linear algebra terms} for \emph{rank}, etc.
\begin{python}
# Create a two-dimensional array. I.e. two rows, each with three elements.
a = np.array([(7, 19, -3), (4, 8, 17)], dtype=np.float64)
# Array 'a' has a slew of attributes (and methods) associated with it.
a.ndim # Returns the number of dimensions or axes: 2
a.dtype # Returns the data type stored: dtype('float64')
a.shape # Returns the extents of each dimension: (2, 3)
a.size # Returns the number of elements: 6
\end{python}
\subsubsection{Internal Representation}
How can we think about the internal representation of a
\texttt{numpy.array} object? The object itself contains
\emph{attributes} (some shown above), alongside a reference to a
fixed-width sequence of bits in memory which describe the underlying data.
\begin{figure}[h]
\centering
\includegraphics[scale=0.45]{fig/numpy-representation.png}
\caption{\footnotesize A \texttt{numpy.array} object stores a handful of attributes
(shape, data type, etc.) alongside a reference to a contiguous block of memory
holding the fixed-width, homogeneous elements.}
\end{figure}
Since each element of the array is of the same type, and each requires
the same amount of storage, we can know how to \emph{slice} into a
particular element in our array in constant time.
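Continuing with \texttt{np} as imported above, the attributes
\texttt{itemsize} and \texttt{strides} expose exactly this bookkeeping (a brief
illustration we add here):
\begin{python}
a = np.arange(6, dtype=np.float64).reshape(2, 3)
print(a.itemsize)  # 8: each float64 element occupies 8 bytes.
print(a.strides)   # (24, 8): step 24 bytes to the next row, 8 bytes to the next column.
# The element a[i, j] therefore lives at byte offset 24*i + 8*j from the start
# of the data buffer, which is what makes constant-time indexing possible.
\end{python}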
\subsubsection{Creating Arrays}
\begin{python}
a = np.empty((3,3)) # A 3x3 matrix of *un-initialized* entries, not necessarily zeros!
a = np.zeros((3,3)) # A 3x3 matrix of *zero-initialized* entries.
a = np.ones((3,3)) # A 3x3 matrix of *one-initialized* entries.
a = np.eye(3) # A 3x3 identity matrix.
a = np.arange(9, dtype=np.float64) # A 1D sequence of the values 0-8, stored as floats.
a = np.arange(9, dtype=np.float64).reshape(3,3) # The same values 0-8...
# ... arranged in a 3x3 layout.
\end{python}
\subsubsection{Reading data from a file}
\begin{lstlisting}[language=bash]
$ cat numbers.txt
7. 19. -3.
4. 8. 17.
\end{lstlisting}
\begin{python}
a = np.loadtxt('numbers.txt', dtype=np.float64)
a = a + 1 # Element-wise addition!
np.savetxt('numbers2.txt', a)
\end{python}
\subsubsection{Remove single dimension entry}
\begin{python}
a = np.arange(3) # Think of this like a 1D row-vector.
a.shape # (3,)
b = np.arange(3).reshape(3,1) # We now realize a 2D-array: 3x1.
print(b.shape) # Returns (3, 1)
b = np.squeeze(b) # Removes 1D entries from the shape of an array.
print(b) # Now a 1D row-vector, just like 'a'.
print(b.shape) # Returns (3,)
\end{python}
Perhaps another informal way of thinking about this function: it removes any
``nested, superfluous'' dimensions. Suppose we create a 3D array with
six elements as follows.
\begin{python}
# Recall: arange simply returns an 'array' with 'range' evenly-spaced values.
a = np.arange(6).reshape((3, 1, 2))
# a[i] peels off the first dimension (i.e. returns an ndarray with shape (1, 2)),
# but that middle axis has only a single element, i.e. a[i][j] is INVALID
# for any j > 0...might as well drop this dimension via squeeze.
\end{python}
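Calling \texttt{squeeze} on this array confirms the point: the singleton axis
is dropped while the underlying data are untouched.
\begin{python}
print(a.shape)              # (3, 1, 2)
print(np.squeeze(a).shape)  # (3, 2): the axis of length one has been removed.
\end{python}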
\subsubsection{Array Operations}
\begin{python}
a = np.arange(9, dtype=np.float64) # Store values 0-8 inclusive.
a[3:7] # Returns values 3-6 inclusive.
a[3:7] = 0 # Assign zero to the elements at indices 3
# up to (but excluding) 7, i.e. the fourth
# through seventh elements.
# Elementwise rescaling.
2*a
a*a
# Vector operations are supported "right out of the box".
sum(a)
min(a)
max(a)
\end{python}
\paragraph{Don't Reinvent the Wheel} It can be tempting for newer
programmers to implement everything from scratch. This leads to not
only more code to maintain but programs which are not as
performant. Consider implementing a procedure for calculating the
\href{https://en.wikipedia.org/wiki/Norm_(mathematics)\#Euclidean_norm}
{Euclidean norm} of a vector.
\begin{python}
a = np.arange(9, dtype=np.float64) # Store values 0-8 inclusive.
# Bad idea.
import math
total = 0.
for n in range(len(a)):
    total += a[n]*a[n]
math.sqrt(total)
math.sqrt(np.dot(a,a)) # Better idea: 1-liner and more performant than interpreted loop.
np.linalg.norm(a) # Best idea! (Not necessarily faster still, just more concise).
\end{python}
The last expression is both the most concise and at least as
performant as any other.
\subsubsection{Speed of Array Operations}
Some of Python's built-in functions (e.g. \texttt{min}, \texttt{max},
\texttt{sum}, etc.) work on arrays, albeit slowly.
In an IPython notebook:
\begin{python}
%timeit total = sum(np.ones(1000000,dtype=np.int32))
%timeit total = np.sum(np.ones(1000000,dtype=np.int32))
\end{python}
Same data, same input. Using built-in \texttt{sum} requires (on my
machine) an average of 700 microseconds per loop, whereas the
corresponding \texttt{numpy.sum} is about 35x faster, requiring only
20 microseconds per loop. Why is there such a big difference?
Loops you write in Python will be executed by the interpreter.
Calling NumPy functions or methods of the array object will invoke high
performance implementations of these operations, which can be
specialized to operate on homogeneous data.
\subsubsection{Matrix Operations}
\begin{python}
a = np.arange(9, dtype=np.float64).reshape(3,3) # 3x3 matrix, row-major order.
# Basic operations on a single matrix.
a.transpose()
np.trace(a)
# Be careful with "matrix" multiplication!
a*a # Naive element wise multiplication.
np.dot(a,a) # True matrix-matrix multiplication.
a @ a # Equivalently: new matrix multiply operator in Python 3.5.
\end{python}
Note that if we use \texttt{ndarray}, then the \texttt{*} operator is
applied elementwise, whereas if we have a \texttt{matrix} we get the
benefit of being able to use a true matrix-multiply.
\subsubsection{Array vs Matrix}
NumPy has a dedicated matrix class.
However, the matrix class is not as widely used and there are subtle
differences between a 2D array and a matrix.
It is highly recommended that you use 2D arrays for maximum
compatibility with other NumPy functions, SciPy, matplotlib, etc.
See here for more details:
\url{http://www.scipy.org/NumPy_for_Matlab_Users}.
\subsubsection{References to an Array}
\begin{python}
a = np.arange(9, dtype=np.float64).reshape(3,3)
b = a # Recall: assignment sets up a reference.
b[0,0] = 42 # For the ndarray referred by b, modify "first" element.
b # Of course, the ndarray referred to by 'b' has changed.
a # Hopefully not a surprise: 'a' referred to the same object...
\end{python}
\paragraph{Array slices and references}
\begin{python}
a = np.arange(9, dtype=np.float64)
b = a[2:7] # Create a reference (a view) onto the underlying data: elements at indices 2 through 6.
b[2] = -1 # When we modify this data... (third elem in 'b', i.e. fifth elem in 'a')
a # ... 'a' also refers to the same data. Whence 'a' mutated.
\end{python}
\hypertarget{array-copies}{%
\subsubsection{Array copies}\label{array-copies}}