6.2: Linear Maps and Functionals. Matrices

For an adequate definition of differentiability, we need the notion of a linear map. Below, \(E^{\prime}, E^{\prime \prime},\) and \(E\) denote normed spaces over the same scalar field, \(E^{1}\) or \(C.\)

Definition 1

A function \(f : E^{\prime} \rightarrow E\) is a linear map if and only if for all \(\vec{x}, \vec{y} \in E^{\prime}\) and scalars \(a, b\)

\[f(a \vec{x}+b \vec{y})=a f(\vec{x})+b f(\vec{y}); \tag{1}\]

equivalently, iff for all such \(\vec{x}, \vec{y},\) and \(a\)

\[f(\vec{x}+\vec{y})=f(\vec{x})+f(\vec{y}) \text { and } f(a \vec{x})=a f(\vec{x}). \text { (Verify!)}\]

If \(E=E^{\prime},\) such a map is also called a linear operator.

If the range space \(E\) is the scalar field of \(E^{\prime}\) (i.e., \(E^{1}\) or \(C\)), the linear \(f\) is also called a (real or complex) linear functional on \(E^{\prime}.\)

Note 1. Induction extends formula (1) to any "linear combinations":

\[f\left(\sum_{i=1}^{m} a_{i} \vec{x}_{i}\right)=\sum_{i=1}^{m} a_{i} f\left(\vec{x}_{i}\right)\]

for all \(\vec{x}_{i} \in E^{\prime}\) and scalars \(a_{i}\).

Briefly: A linear map \(f\) preserves linear combinations.

Note 2. Taking \(a=b=0\) in (1), we obtain \(f(\overrightarrow{0})=0\) if \(f\) is linear.


Examples

(a) Let \(E^{\prime}=E^{n}\left(C^{n}\right).\) Fix a vector \(\vec{v}=\left(v_{1}, \ldots, v_{n}\right)\) in \(E^{\prime}\) and set

\[\left(\forall \vec{x} \in E^{\prime}\right) \quad f(\vec{x})=\vec{x} \cdot \vec{v}\]

(inner product; see Chapter 3, §§1-3 and §9).

Then

\[\begin{aligned} f(a \vec{x}+b \vec{y}) &=(a \vec{x}) \cdot \vec{v}+(b \vec{y}) \cdot \vec{v} \\ &=a(\vec{x} \cdot \vec{v})+b(\vec{y} \cdot \vec{v}) \\ &=a f(\vec{x})+b f(\vec{y}); \end{aligned}\]

so \(f\) is linear. Note that if \(E^{\prime}=E^{n},\) then by definition,

\[f(\vec{x})=\vec{x} \cdot \vec{v}=\sum_{k=1}^{n} x_{k} v_{k}=\sum_{k=1}^{n} v_{k} x_{k}.\]

If, however, \(E^{\prime}=C^{n},\) then

\[f(\vec{x})=\vec{x} \cdot \vec{v}=\sum_{k=1}^{n} x_{k} \overline{v}_{k}=\sum_{k=1}^{n} \overline{v}_{k} x_{k},\]

where \(\overline{v}_{k}\) is the conjugate of the complex number \(v_{k}\).

By Theorem 3 in Chapter 4, §3, \(f\) is continuous (a polynomial!).

Moreover, \(f(\vec{x})=\vec{x} \cdot \vec{v}\) is a scalar (in \(E^{1}\) or \(C\)). Thus the range of \(f\) lies in the scalar field of \(E^{\prime};\) so \(f\) is a linear functional on \(E^{\prime}.\)
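The linearity computation in Example (a) can also be checked numerically. The following is a minimal sketch in plain Python, assuming vectors are stored as lists of numbers; the helper names `dot` and `combine` and all sample values are illustrative choices, not part of the text. A numerical check on one sample is of course no substitute for the proof above.

```python
# Check on sample data that f(x) = x . v satisfies formula (1),
# using the inner product sum_k x_k * conj(v_k).
# For real vectors the conjugation changes nothing.

def dot(x, v):
    """Inner product of two equal-length sequences of numbers."""
    return sum(xk * vk.conjugate() for xk, vk in zip(x, v))

def combine(a, x, b, y):
    """Linear combination a*x + b*y, coordinate by coordinate."""
    return [a * xk + b * yk for xk, yk in zip(x, y)]

# Arbitrary sample data in C^3 (illustrative values only).
v = [2 - 1j, 0.5j, 3 + 0j]
x = [1 + 1j, 2 + 0j, -1j]
y = [0j, 1 - 2j, 4 + 0j]
a, b = 2 - 3j, 0.5 + 1j

def f(u):
    """The functional f(u) = u . v for the fixed vector v."""
    return dot(u, v)

lhs = f(combine(a, x, b, y))      # f(a x + b y)
rhs = a * f(x) + b * f(y)         # a f(x) + b f(y)
is_linear_here = abs(lhs - rhs) < 1e-9
print(lhs, rhs, is_linear_here)
```

Note that the conjugate is applied to the fixed vector \(\vec{v}\), not to the argument; this is why \(f\) stays linear in \(\vec{x}\) even over \(C^{n}\).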

(b) Let \(I=[0,1].\) Let \(E^{\prime}\) be the set of all functions \(u : I \rightarrow E\) that are of class \(CD^{\infty}\) (Chapter 5, §6) on \(I\), hence bounded there (Theorem 2 of Chapter 4, §8).

As in Example (C) in Chapter 3, §10, \(E^{\prime}\) is a normed linear space, with norm

\[\|u\|=\sup _{x \in I}|u(x)|.\]

Here each function \(u \in E^{\prime}\) is treated as a single "point" in \(E^{\prime}.\) The distance between two such points, \(u\) and \(v,\) equals \(\|u-v\|,\) by definition.

Now define a map \(D\) on \(E^{\prime}\) by setting \(D(u)=u^{\prime}\) (derivative of \(u\) on \(I\)). As every \(u \in E^{\prime}\) is of class \(CD^{\infty},\) so is \(u^{\prime}.\)

Thus \(D(u)=u^{\prime} \in E^{\prime},\) and so \(D : E^{\prime} \rightarrow E^{\prime}\) is a linear operator. (Its linearity follows from Theorem 4 in Chapter 5, §1.)

(c) Let again \(I=[0,1].\) Let \(E^{\prime}\) be the set of all functions \(u : I \rightarrow E\) that are bounded and have antiderivatives (Chapter 5, §5) on \(I.\) With norm \(\|u\|\) as in Example (b), \(E^{\prime}\) is a normed linear space.

Now define \(\phi : E^{\prime} \rightarrow E\) by

\[\phi(u)=\int_{0}^{1} u,\]

with \(\int u\) as in Chapter 5, §5. (Recall that \(\int_{0}^{1} u\) is an element of \(E\) if \(u : I \rightarrow E.\)) By Corollary 1 in Chapter 5, §5, \(\phi\) is a linear map of \(E^{\prime}\) into \(E\). (Why?)

(d) The zero map \(f=0\) on \(E^{\prime}\) is always linear. (Why?)

Theorem 1

A linear map \(f : E^{\prime} \rightarrow E\) is continuous (even uniformly so) on all of \(E^{\prime}\) iff it is continuous at \(\overrightarrow{0};\) equivalently, iff there is a real \(c>0\) such that

\[\left(\forall \vec{x} \in E^{\prime}\right) \quad|f(\vec{x})| \leq c|\vec{x}|. \tag{3}\]

(We call this property linear boundedness.)


Proof

Assume that \(f\) is continuous at \(\overrightarrow{0}.\) Then, given \(\varepsilon>0,\) there is \(\delta>0\) such that

\[|f(\vec{x})-f(\overrightarrow{0})|=|f(\vec{x})| \leq \varepsilon\]

whenever \(|\vec{x}-\overrightarrow{0}|=|\vec{x}| < \delta.\)

Now, for any \(\vec{x} \neq \overrightarrow{0},\) we surely have

\[\left|\frac{\delta \vec{x}}{2|\vec{x}|}\right|=\frac{\delta}{2} < \delta.\]

Hence

\[(\forall \vec{x} \neq \overrightarrow{0}) \quad\left|f\left(\frac{\delta \vec{x}}{2|\vec{x}|}\right)\right| \leq \varepsilon,\]

or, by linearity,

\[\frac{\delta}{2|\vec{x}|}|f(\vec{x})| \leq \varepsilon,\]

that is,

\[|f(\vec{x})| \leq \frac{2 \varepsilon}{\delta}|\vec{x}|.\]

By Note 2, this also holds if \(\vec{x}=\overrightarrow{0}\).

Thus, taking \(c=2 \varepsilon / \delta,\) we obtain

\[\left(\forall \vec{x} \in E^{\prime}\right) \quad |f(\vec{x})| \leq c|\vec{x}| \quad \text {(linear boundedness).}\]

Now assume (3). Then

\[\left(\forall \vec{x}, \vec{y} \in E^{\prime}\right) \quad|f(\vec{x}-\vec{y})| \leq c|\vec{x}-\vec{y}|;\]

or, by linearity,

\[\left(\forall \vec{x}, \vec{y} \in E^{\prime}\right) \quad|f(\vec{x})-f(\vec{y})| \leq c|\vec{x}-\vec{y}|.\]

Hence \(f\) is uniformly continuous (given \(\varepsilon>0,\) take \(\delta=\varepsilon / c\)). This, in turn, implies continuity at \(\overrightarrow{0};\) so all conditions are equivalent, as claimed. \(\quad \square\)

A linear map need not be continuous. But, for \(E^{n}\) and \(C^{n},\) we have the following result.

Theorem 2

(i) Any linear map on \(E^{n}\) or \(C^{n}\) is uniformly continuous.

(ii) Every linear functional on \(E^{n}\left(C^{n}\right)\) has the form

\[f(\vec{x})=\vec{x} \cdot \vec{v} \quad \text {(dot product)}\]

for some unique vector \(\vec{v} \in E^{n}\left(C^{n}\right),\) dependent on \(f\) only.


Proof

Suppose \(f : E^{n} \rightarrow E\) is linear; so \(f\) preserves linear combinations.

But every \(\vec{x} \in E^{n}\) is such a combination,

\[\vec{x}=\sum_{k=1}^{n} x_{k} \vec{e}_{k} \quad \text {(Theorem 2 in Chapter 3, §§1-3).}\]

Thus, by Note 1,

\[f(\vec{x})=f\left(\sum_{k=1}^{n} x_{k} \vec{e}_{k}\right)=\sum_{k=1}^{n} x_{k} f\left(\vec{e}_{k}\right).\]

Here the function values \(f\left(\vec{e}_{k}\right)\) are fixed vectors in the range space \(E,\) say,

\[f\left(\vec{e}_{k}\right)=v_{k} \in E,\]

so that

\[f(\vec{x})=\sum_{k=1}^{n} x_{k} f\left(\vec{e}_{k}\right)=\sum_{k=1}^{n} x_{k} v_{k}, \quad v_{k} \in E. \tag{5}\]

Thus \(f\) is a polynomial in \(n\) real variables \(x_{k},\) hence continuous (even uniformly so, by Theorem 1).

In particular, if \(E=E^{1}\) (i.e., \(f\) is a linear functional) then all \(v_{k}\) in (5) are real numbers; so they form a vector

\[\vec{v}=\left(v_{1}, \ldots, v_{n}\right) \text { in } E^{n},\]

and (5) can be written as

\[f(\vec{x})=\vec{x} \cdot \vec{v}.\]

The vector \(\vec{v}\) is unique. For suppose there are two vectors, \(\vec{u}\) and \(\vec{v},\) such that

\[\left(\forall \vec{x} \in E^{n}\right) \quad f(\vec{x})=\vec{x} \cdot \vec{v}=\vec{x} \cdot \vec{u}.\]

Then

\[\left(\forall \vec{x} \in E^{n}\right) \quad \vec{x} \cdot(\vec{v}-\vec{u})=0.\]

By Problem 10 of Chapter 3, §§1-3, this yields \(\vec{v}-\vec{u}=\overrightarrow{0},\) or \(\vec{v}=\vec{u}.\) This completes the proof for \(E=E^{n}.\)

It is analogous for \(C^{n};\) only in (ii) the \(v_{k}\) are complex and one has to replace them by their conjugates \(\overline{v}_{k}\) when forming the vector \(\vec{v}\) to obtain \(f(\vec{x})=\vec{x} \cdot \vec{v}\). Thus all is proved. \(\quad \square\)

Note 3. Formula (5) shows that a linear map \(f : E^{n}\left(C^{n}\right) \rightarrow E\) is uniquely determined by the \(n\) function values \(v_{k}=f\left(\vec{e}_{k}\right)\).

If further \(E=E^{m}\left(C^{m}\right),\) the vectors \(v_{k}\) are \(m\)-tuples of scalars,

\[v_{k}=\left(v_{1 k}, \ldots, v_{m k}\right).\]

We often write such vectors vertically, as the \(n\) "columns" in an array of \(m\) "rows" and \(n\) "columns":

\[\left(\begin{array}{cccc}{v_{11}} & {v_{12}} & {\dots} & {v_{1 n}} \\ {v_{21}} & {v_{22}} & {\dots} & {v_{2 n}} \\ {\vdots} & {\vdots} & {\ddots} & {\vdots} \\ {v_{m 1}} & {v_{m 2}} & {\dots} & {v_{m n}}\end{array}\right). \tag{6}\]

Formally, (6) is a double sequence of \(m n\) terms, called an \(m \times n\) matrix. We denote it by \([f]=\left(v_{i k}\right),\) where for \(k=1,2, \ldots, n\),

\[f\left(\vec{e}_{k}\right)=v_{k}=\left(v_{1 k}, \ldots, v_{m k}\right).\]

Thus linear maps \(f : E^{n} \rightarrow E^{m}\) (or \(f : C^{n} \rightarrow C^{m}\)) correspond one-to-one to their matrices \([f].\)
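Note 3 translates directly into a procedure: evaluate \(f\) on the basic unit vectors and store the results as columns. Below is a minimal sketch in plain Python, assuming vectors are lists and a matrix is a list of rows; the map `f` and the helper names `unit_vector`, `matrix_of`, and `apply_matrix` are hypothetical illustrations.

```python
def unit_vector(k, n):
    """Basic unit vector e_k of length n (k counted from 0 here)."""
    return [1 if i == k else 0 for i in range(n)]

def matrix_of(f, n):
    """Matrix [f] of a linear map on n-tuples: column k is f(e_k)."""
    columns = [f(unit_vector(k, n)) for k in range(n)]
    m = len(columns[0])
    # Transpose the list of columns into a list of m rows.
    return [[columns[k][i] for k in range(n)] for i in range(m)]

def apply_matrix(matrix, x):
    """Compute sum_k x_k v_k, i.e. formula (5), one coordinate per row."""
    return [sum(row[k] * x[k] for k in range(len(x))) for row in matrix]

# A sample linear map f : E^3 -> E^2 (illustrative only).
def f(x):
    x1, x2, x3 = x
    return [2 * x1 - x3, x1 + 4 * x2 + 3 * x3]

F = matrix_of(f, 3)
print(F)                                # [[2, 0, -1], [1, 4, 3]]
print(apply_matrix(F, [1, 2, 3]))       # [-1, 18]
print(f([1, 2, 3]))                     # [-1, 18]
```

The last two lines agree because, by formula (5), the \(n\) columns determine \(f\) completely.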

The easy proof of Corollaries 1 to 3 below is left to the reader.

Corollary 1

If \(f, g : E^{\prime} \rightarrow E\) are linear, so is

\[h=a f+b g\]

for any scalars \(a, b\).

If further \(E^{\prime}=E^{n}\left(C^{n}\right)\) and \(E=E^{m}\left(C^{m}\right),\) with \([f]=\left(v_{i k}\right)\) and \([g]=\left(w_{i k}\right)\), then

\[[h]=\left(a v_{i k}+b w_{i k}\right).\]

Corollary 2

A map \(f : E^{n}\left(C^{n}\right) \rightarrow E\) is linear iff

\[f(\vec{x})=\sum_{k=1}^{n} v_{k} x_{k},\]

where \(v_{k}=f\left(\vec{e}_{k}\right)\).

Hint: For the "if," use Corollary 1. For the "only if," use formula (5) above.

Corollary 3

If \(f : E^{\prime} \rightarrow E^{\prime \prime}\) and \(g : E^{\prime \prime} \rightarrow E\) are linear, so is the composite \(h=g \circ f.\)

Our next theorem deals with the matrix of the composite linear map \(g \circ f.\)

Theorem 3

Let \(f : E^{\prime} \rightarrow E^{\prime \prime}\) and \(g : E^{\prime \prime} \rightarrow E\) be linear, with

\[E^{\prime}=E^{n}\left(C^{n}\right), E^{\prime \prime}=E^{m}\left(C^{m}\right), \text { and } E=E^{r}\left(C^{r}\right).\]

If \([f]=\left(v_{i k}\right)\) and \([g]=\left(w_{j i}\right),\) then

\[[h]=[g \circ f]=\left(z_{j k}\right),\]

where

\[z_{j k}=\sum_{i=1}^{m} w_{j i} v_{i k}, \quad j=1,2, \ldots, r, \quad k=1,2, \ldots, n. \tag{7}\]

Proof

Denote the basic unit vectors in \(E^{\prime}\) by

\[e_{1}^{\prime}, \ldots, e_{n}^{\prime},\]

those in \(E^{\prime \prime}\) by

\[e_{1}^{\prime \prime}, \ldots, e_{m}^{\prime \prime},\]

and those in \(E\) by

\[e_{1}, \ldots, e_{r}.\]

Then for \(k=1,2, \ldots, n\),

\[f\left(e_{k}^{\prime}\right)=v_{k}=\sum_{i=1}^{m} v_{i k} e_{i}^{\prime \prime} \text { and } h\left(e_{k}^{\prime}\right)=\sum_{j=1}^{r} z_{j k} e_{j},\]

and for \(i=1, \ldots, m\),

\[g\left(e_{i}^{\prime \prime}\right)=\sum_{j=1}^{r} w_{j i} e_{j}.\]

Also,

\[h\left(e_{k}^{\prime}\right)=g\left(f\left(e_{k}^{\prime}\right)\right)=g\left(\sum_{i=1}^{m} v_{i k} e_{i}^{\prime \prime}\right)=\sum_{i=1}^{m} v_{i k} g\left(e_{i}^{\prime \prime}\right)=\sum_{i=1}^{m} v_{i k}\left(\sum_{j=1}^{r} w_{j i} e_{j}\right).\]

Thus

\[h\left(e_{k}^{\prime}\right)=\sum_{j=1}^{r} z_{j k} e_{j}=\sum_{j=1}^{r}\left(\sum_{i=1}^{m} w_{j i} v_{i k}\right) e_{j}.\]

But the representation in terms of the \(e_{j}\) is unique (Theorem 2 in Chapter 3, §§1-3), so, equating coefficients, we get (7). \(\quad \square\)

Note 4. Observe that \(z_{j k}\) is obtained, so to say, by "dot-multiplying" the \(j\)th row of \([g]\) (an \(r \times m\) matrix) by the \(k\)th column of \([f]\) (an \(m \times n\) matrix).

It is natural to set

\[[g][f]=[g \circ f],\]

or

\[\left(w_{j i}\right)\left(v_{i k}\right)=\left(z_{j k}\right),\]

with \(z_{j k}\) as in (7).

Caution. Matrix multiplication, so defined, is not commutative.
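Formula (7) and the caution about commutativity can be illustrated with a short computation. This is a minimal sketch in plain Python, assuming matrices are lists of rows; the function names and the two sample matrices are illustrative assumptions.

```python
def matmul(G, F):
    """Product [g][f]: z_jk = sum_i w_ji * v_ik, as in formula (7).

    G is r x m, F is m x n; the result is r x n.
    """
    r, m, n = len(G), len(F), len(F[0])
    return [[sum(G[j][i] * F[i][k] for i in range(m)) for k in range(n)]
            for j in range(r)]

def apply_matrix(M, x):
    """Apply the linear map with matrix M to the vector x."""
    return [sum(row[k] * x[k] for k in range(len(x))) for row in M]

F = [[1, 2],
     [0, 1]]        # matrix of a sample map f
G = [[0, 1],
     [1, 0]]        # matrix of a sample map g (swaps the coordinates)

GF = matmul(G, F)   # matrix of g o f
FG = matmul(F, G)   # matrix of f o g

x = [3, -1]
print(GF, FG)                                   # [[0, 1], [1, 2]] [[2, 1], [1, 0]]
print(apply_matrix(GF, x))                      # [-1, 1]
print(apply_matrix(G, apply_matrix(F, x)))      # [-1, 1], i.e. g(f(x))
```

Here `GF` and `FG` differ, so the two composites \(g \circ f\) and \(f \circ g\) are different maps even though both products are defined.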

Definition 2

The set of all continuous linear maps \(f : E^{\prime} \rightarrow E\) (for fixed \(E^{\prime}\) and \(E\)) is denoted \(L(E^{\prime}, E).\)

If \(E=E^{\prime},\) we write \(L(E)\) instead.

For each \(f\) in \(L\left(E^{\prime}, E\right),\) we define its norm by

\[\|f\|=\sup _{|\vec{x}| \leq 1}|f(\vec{x})|.\]

Note that \(\|f\| < +\infty,\) by Theorem 1.

Theorem 4

\(L(E^{\prime}, E)\) is a normed linear space under the norm defined above and under the usual operations on functions, as in Corollary 1.

Proof

Corollary 1 easily implies that \(L(E^{\prime}, E)\) is a vector space. We now show that \(\|\cdot\|\) is a genuine norm.

The triangle law,

\[\|f+g\| \leq \|f\|+\|g\|,\]

follows exactly as in Example (C) of Chapter 3, §10. (Verify!)

Also, by Problem 5 in Chapter 2, §§8-9, \(\sup |a f(\vec{x})|=|a| \sup |f(\vec{x})|.\) Hence \(\|a f\|=|a| \|f\|\) for any scalar \(a.\)

As noted above, \(0 \leq \|f\| < +\infty\).

It remains to show that \(\|f\|=0\) iff \(f\) is the zero map. If

\[\|f\|=\sup _{|\vec{x}| \leq 1}|f(\vec{x})|=0,\]

then \(|f(\vec{x})|=0\) when \(|\vec{x}| \leq 1.\) Hence, if \(\vec{x} \neq \overrightarrow{0}\),

\[f\left(\frac{\vec{x}}{|\vec{x}|}\right)=\frac{1}{|\vec{x}|} f(\vec{x})=0.\]

As \(f(\overrightarrow{0})=0,\) we have \(f(\vec{x})=0\) for all \(\vec{x} \in E^{\prime}\).

Thus \(\|f\|=0\) implies \(f=0,\) and the converse is clear. \(\quad \square\)

Note 5. A similar proof, via \(f\left(\frac{\vec{x}}{|\vec{x}|}\right)\) and properties of lub, shows that

\[\|f\|=\sup _{\vec{x} \neq \overrightarrow{0}}\left|\frac{f(\vec{x})}{|\vec{x}|}\right|\]

and

\[(\forall \vec{x} \in E^{\prime}) \quad|f(\vec{x})| \leq \|f\| |\vec{x}|.\]

It also follows that \(\|f\|\) is the least real \(c\) such that

\[(\forall \vec{x} \in E^{\prime}) \quad|f(\vec{x})| \leq c|\vec{x}|.\]

Verify. (See Problem 3'.)

As in any normed space, we define distances in \(L(E^{\prime}, E)\) by

\[\rho(f, g)=\|f-g\|,\]

making it a metric space; so we may speak of convergence, limits, etc. in it.

Corollary 4

If \(f \in L(E^{\prime}, E^{\prime \prime})\) and \(g \in L(E^{\prime \prime}, E),\) then

\[\|g \circ f\| \leq \|g\| \|f\|.\]

Proof

By Note 5,

\[\left(\forall \vec{x} \in E^{\prime}\right) \quad|g(f(\vec{x}))| \leq \|g\| |f(\vec{x})| \leq \|g\| \|f\| |\vec{x}|.\]

Hence

\[(\forall \vec{x} \neq \overrightarrow{0}) \quad\left|\frac{(g \circ f)(\vec{x})}{|\vec{x}|}\right| \leq \|g\| \|f\|,\]

and so

\[\|g\| \|f\| \geq \sup _{\vec{x} \neq \overrightarrow{0}} \frac{|(g \circ f)(\vec{x})|}{|\vec{x}|}=\|g \circ f\|. \quad \square\]

Matrix Operations

At the core of linear algebra are linear operations on vectors. The simplest of these is the linear combination of two vectors, \(a \mathbf{v}_1 + b \mathbf{v}_2\):

Another important operation is the inner (or dot) product (i.e., the sum of the element-wise products). The inner product is usually denoted for two (column) vectors by \(\mathbf{v}_1 \cdot \mathbf{v}_2\) or \(\mathbf{v}_1^T \mathbf{v}_2\). In SymPy, the inner product can be computed in two ways:

Similarly, the outer product \(\mathbf{v}_1 \mathbf{v}_2^T\) of two column vectors can be computed via

Probably the most important operation in all of scientific computing is the product of a matrix and a vector. The inner and outer products just observed are special cases of matrix-vector multiplication. More general matrix-matrix multiplication can be considered a sequence of matrix-vector multiplications. SymPy handles matrix-vector multiplication with ease:

Fundamentally, matrix-vector multiplication can be deconstructed into a sequence of simpler vector operations. You have most likely learned the “sequence of dot products” definition, in which the inner product of each row of \(\mathbf{A}\) with the vector \(\mathbf{v}\) defines one element of the matrix-vector product. For our example, this would be defined as

This column-oriented view is incredibly useful and has long been promoted by Gilbert Strang at MIT (whose book and free videos on linear algebra are quite good!).
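The two ways of reading a matrix-vector product can be compared side by side. The sketch below uses plain Python lists rather than SymPy so that it runs without extra packages; the matrix `A`, the vector `v`, and both function names are illustrative assumptions and are not the example the text refers to.

```python
def matvec_by_rows(A, v):
    """Row view: entry i is the dot product of row i of A with v."""
    return [sum(a * x for a, x in zip(row, v)) for row in A]

def matvec_by_columns(A, v):
    """Column view: the product is a linear combination of the columns
    of A, with the entries of v as the coefficients."""
    result = [0] * len(A)
    for k, vk in enumerate(v):          # one column per entry of v
        for i in range(len(A)):
            result[i] += vk * A[i][k]   # add vk times column k
    return result

A = [[1, 2],
     [3, 4],
     [5, 6]]
v = [10, -1]

print(matvec_by_rows(A, v))      # [8, 26, 44]
print(matvec_by_columns(A, v))   # [8, 26, 44]
```

Both functions return the same vector; they differ only in the order in which the same products are accumulated.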

5.3. Tuples and Sequences

We saw that lists and strings have many common properties, such as indexing and slicing operations. They are two examples of sequence data types (see Sequence Types — list, tuple, range ). Since Python is an evolving language, other sequence data types may be added. There is also another standard sequence data type: the tuple.

A tuple consists of a number of values separated by commas, for instance:
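A minimal sketch of such a tuple, built around the statement `t = 12345, 54321, 'hello!'` that this section discusses; the names `u` and `v` and the value `88888` are illustrative choices.

```python
# Values separated by commas form a tuple; parentheses are optional here.
t = 12345, 54321, 'hello!'
first = t[0]                     # indexing works as for lists
print(first)                     # 12345
print(t)                         # (12345, 54321, 'hello!')

# Tuples may be nested.
u = t, (1, 2, 3, 4, 5)
print(u)                         # ((12345, 54321, 'hello!'), (1, 2, 3, 4, 5))

# Tuples are immutable: assigning to an item raises TypeError.
try:
    t[0] = 88888
except TypeError as exc:
    error_message = str(exc)
    print("TypeError:", error_message)

# A tuple can still contain mutable objects, which can be changed in place.
v = ([1, 2, 3], [3, 2, 1])
v[0].append(4)
print(v)                         # ([1, 2, 3, 4], [3, 2, 1])
```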

As you see, on output tuples are always enclosed in parentheses, so that nested tuples are interpreted correctly; they may be input with or without surrounding parentheses, although often parentheses are necessary anyway (if the tuple is part of a larger expression). It is not possible to assign to the individual items of a tuple; however, it is possible to create tuples which contain mutable objects, such as lists.

Though tuples may seem similar to lists, they are often used in different situations and for different purposes. Tuples are immutable , and usually contain a heterogeneous sequence of elements that are accessed via unpacking (see later in this section) or indexing (or even by attribute in the case of namedtuples ). Lists are mutable , and their elements are usually homogeneous and are accessed by iterating over the list.

A special problem is the construction of tuples containing 0 or 1 items: the syntax has some extra quirks to accommodate these. Empty tuples are constructed by an empty pair of parentheses; a tuple with one item is constructed by following a value with a comma (it is not sufficient to enclose a single value in parentheses). Ugly, but effective. For example:
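A short sketch of the two special cases; the variable names are illustrative.

```python
empty = ()                  # empty tuple: an empty pair of parentheses
singleton = 'hello',        # one-item tuple: note the trailing comma
not_a_tuple = ('hello')     # parentheses alone do NOT make a tuple

print(len(empty))           # 0
print(len(singleton))       # 1
print(singleton)            # ('hello',)
print(type(not_a_tuple))    # <class 'str'>
```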

The statement t = 12345, 54321, 'hello!' is an example of tuple packing: the values 12345 , 54321 and 'hello!' are packed together in a tuple. The reverse operation is also possible:

This is called, appropriately enough, sequence unpacking and works for any sequence on the right-hand side. Sequence unpacking requires that there are as many variables on the left side of the equals sign as there are elements in the sequence. Note that multiple assignment is really just a combination of tuple packing and sequence unpacking.
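Packing and unpacking can be sketched as follows, again using the tuple `t` from the text; the names `x`, `y`, `z`, `a`, and `b` are illustrative.

```python
t = 12345, 54321, 'hello!'       # tuple packing
x, y, z = t                      # sequence unpacking: three names, three items
print(x, y, z)                   # 12345 54321 hello!

# The number of names on the left must match the number of items.
try:
    a, b = t
except ValueError as exc:
    mismatch_message = str(exc)
    print("ValueError:", mismatch_message)

# Multiple assignment is packing followed by unpacking, so a swap
# needs no temporary variable.
x, y = y, x
print(x, y)                      # 54321 12345
```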


In the last section we introduced the problem of Image Classification, which is the task of assigning a single label to an image from a fixed set of categories. Moreover, we described the k-Nearest Neighbor (kNN) classifier which labels images by comparing them to (annotated) images from the training set. As we saw, kNN has a number of disadvantages:

  • The classifier must remember all of the training data and store it for future comparisons with the test data. This is space inefficient because datasets may easily be gigabytes in size.
  • Classifying a test image is expensive since it requires a comparison to all training images.

Overview. We are now going to develop a more powerful approach to image classification that we will eventually naturally extend to entire Neural Networks and Convolutional Neural Networks. The approach will have two major components: a score function that maps the raw data to class scores, and a loss function that quantifies the agreement between the predicted scores and the ground truth labels. We will then cast this as an optimization problem in which we will minimize the loss function with respect to the parameters of the score function.

Parameterized mapping from images to label scores

The first component of this approach is to define the score function that maps the pixel values of an image to confidence scores for each class. We will develop the approach with a concrete example. As before, let’s assume a training dataset of images \( x_i \in R^D \), each associated with a label \( y_i \). Here \( i = 1 \dots N \) and \( y_i \in \{ 1 \dots K \} \). That is, we have N examples (each with a dimensionality D) and K distinct categories. For example, in CIFAR-10 we have a training set of N = 50,000 images, each with D = 32 x 32 x 3 = 3072 pixels, and K = 10, since there are 10 distinct classes (dog, cat, car, etc). We will now define the score function \(f: R^D \mapsto R^K\) that maps the raw image pixels to class scores.

Linear classifier. In this module we will start out with arguably the simplest possible function, a linear mapping:

\[f(x_i, W, b) = W x_i + b\]

In the above equation, we are assuming that the image \(x_i\) has all of its pixels flattened out to a single column vector of shape [D x 1]. The matrix W (of size [K x D]), and the vector b (of size [K x 1]) are the parameters of the function. In CIFAR-10, \(x_i\) contains all pixels in the i-th image flattened into a single [3072 x 1] column, W is [10 x 3072] and b is [10 x 1], so 3072 numbers come into the function (the raw pixel values) and 10 numbers come out (the class scores). The parameters in W are often called the weights, and b is called the bias vector because it influences the output scores, but without interacting with the actual data \(x_i\). However, you will often hear people use the terms weights and parameters interchangeably.

There are a few things to note:

  • First, note that the single matrix multiplication (W x_i) is effectively evaluating 10 separate classifiers in parallel (one for each class), where each classifier is a row of W.
  • Notice also that we think of the input data ( (x_i, y_i) ) as given and fixed, but we have control over the setting of the parameters W,b. Our goal will be to set these in such way that the computed scores match the ground truth labels across the whole training set. We will go into much more detail about how this is done, but intuitively we wish that the correct class has a score that is higher than the scores of incorrect classes.
  • An advantage of this approach is that the training data is used to learn the parameters W,b, but once the learning is complete we can discard the entire training set and only keep the learned parameters. That is because a new test image can be simply forwarded through the function and classified based on the computed scores.
  • Lastly, note that classifying the test image involves a single matrix multiplication and addition, which is significantly faster than comparing a test image to all training images.
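The points above can be made concrete with a tiny score function. This is a minimal sketch in plain Python, assuming a toy problem with D = 4 pixels and K = 3 classes; the weights, biases, and pixel values are made-up numbers, not learned parameters or CIFAR-10 data.

```python
def linear_scores(W, x, b):
    """Class scores W x + b: one dot product (one row of W) per class."""
    return [sum(w * xi for w, xi in zip(row, x)) + bk
            for row, bk in zip(W, b)]

# Toy parameters: K = 3 classes, D = 4 pixels (illustrative values only).
W = [[1, -2, 0, 3],
     [0, 1, 1, -1],
     [2, 0, -1, 0]]
b = [1, 2, -3]

x = [4, 1, 2, 0]                       # a flattened "image" with 4 pixels

scores = linear_scores(W, x, b)
predicted_class = scores.index(max(scores))
print(scores)                          # [3, 5, 3]
print(predicted_class)                 # 1
```

Classifying `x` needs only `W` and `b`: no training image is consulted at prediction time.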

Foreshadowing: Convolutional Neural Networks will map image pixels to scores exactly as shown above, but the mapping ( f ) will be more complex and will contain more parameters.

Interpreting a linear classifier

Notice that a linear classifier computes the score of a class as a weighted sum of all of its pixel values across all 3 of its color channels. Depending on precisely what values we set for these weights, the function has the capacity to like or dislike (depending on the sign of each weight) certain colors at certain positions in the image. For instance, you can imagine that the “ship” class might be more likely if there is a lot of blue on the sides of an image (which could likely correspond to water). You might expect that the “ship” classifier would then have a lot of positive weights across its blue channel weights (presence of blue increases score of ship), and negative weights in the red/green channels (presence of red/green decreases the score of ship).

Analogy of images as high-dimensional points. Since the images are stretched into high-dimensional column vectors, we can interpret each image as a single point in this space (e.g. each image in CIFAR-10 is a point in 3072-dimensional space of 32x32x3 pixels). Analogously, the entire dataset is a (labeled) set of points.

Since we defined the score of each class as a weighted sum of all image pixels, each class score is a linear function over this space. We cannot visualize 3072-dimensional spaces, but if we imagine squashing all those dimensions into only two dimensions, then we can try to visualize what the classifier might be doing:

As we saw above, every row of (W) is a classifier for one of the classes. The geometric interpretation of these numbers is that as we change one of the rows of (W), the corresponding line in the pixel space will rotate in different directions. The biases (b), on the other hand, allow our classifiers to translate the lines. In particular, note that without the bias terms, plugging in ( x_i = 0 ) would always give score of zero regardless of the weights, so all lines would be forced to cross the origin.

Interpretation of linear classifiers as template matching. Another interpretation for the weights (W) is that each row of (W) corresponds to a template (or sometimes also called a prototype) for one of the classes. The score of each class for an image is then obtained by comparing each template with the image using an inner product (or dot product) one by one to find the one that “fits” best. With this terminology, the linear classifier is doing template matching, where the templates are learned. Another way to think of it is that we are still effectively doing Nearest Neighbor, but instead of having thousands of training images we are only using a single image per class (although we will learn it, and it does not necessarily have to be one of the images in the training set), and we use the (negative) inner product as the distance instead of the L1 or L2 distance.

Additionally, note that the horse template seems to contain a two-headed horse, which is due to both left and right facing horses in the dataset. The linear classifier merges these two modes of horses in the data into a single template. Similarly, the car classifier seems to have merged several modes into a single template which has to identify cars from all sides, and of all colors. In particular, this template ended up being red, which hints that there are more red cars in the CIFAR-10 dataset than of any other color. The linear classifier is too weak to properly account for different-colored cars, but as we will see later neural networks will allow us to perform this task. Looking ahead a bit, a neural network will be able to develop intermediate neurons in its hidden layers that could detect specific car types (e.g. green car facing left, blue car facing front, etc.), and neurons on the next layer could combine these into a more accurate car score through a weighted sum of the individual car detectors.

Bias trick. Before moving on we want to mention a common simplifying trick for representing the two parameters \(W, b\) as one. Recall that we defined the score function as:

\[f(x_i, W, b) = W x_i + b\]

As we proceed through the material it is a little cumbersome to keep track of two sets of parameters (the biases \(b\) and weights \(W\)) separately. A commonly used trick is to combine the two sets of parameters into a single matrix that holds both of them by extending the vector \(x_i\) with one additional dimension that always holds the constant \(1\), a default bias dimension. With the extra dimension, the new score function will simplify to a single matrix multiply:

\[f(x_i, W) = W x_i\]

With our CIFAR-10 example, \(x_i\) is now [3073 x 1] instead of [3072 x 1] (with the extra dimension holding the constant 1), and \(W\) is now [10 x 3073] instead of [10 x 3072]. The extra column that \(W\) now has corresponds to the bias \(b\). An illustration might help clarify:
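The bias trick can be verified on a small case. The sketch below, in plain Python, appends the bias as an extra column of the weight matrix and a constant 1 to the input, then checks that the scores are unchanged; all numbers and names are illustrative assumptions.

```python
def matvec(M, x):
    """Matrix-vector product with M stored as a list of rows."""
    return [sum(m * xi for m, xi in zip(row, x)) for row in M]

W = [[1, -2, 0, 3],
     [0, 1, 1, -1],
     [2, 0, -1, 0]]          # [3 x 4] weights (toy values)
b = [1, 2, -3]               # [3 x 1] biases
x = [4, 1, 2, 0]             # [4 x 1] input

# Original form: W x + b.
separate = [s + bk for s, bk in zip(matvec(W, x), b)]

# Bias trick: append b as an extra column of W and a constant 1 to x.
W_ext = [row + [bk] for row, bk in zip(W, b)]   # now [3 x 5]
x_ext = x + [1]                                 # now [5 x 1]
combined = matvec(W_ext, x_ext)

print(separate)   # [3, 5, 3]
print(combined)   # [3, 5, 3]
```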

Image data preprocessing. As a quick note, in the examples above we used the raw pixel values (which range from [0…255]). In Machine Learning, it is a very common practice to always perform normalization of your input features (in the case of images, every pixel is thought of as a feature). In particular, it is important to center your data by subtracting the mean from every feature. In the case of images, this corresponds to computing a mean image across the training images and subtracting it from every image to get images where the pixels range from approximately [-127 … 127]. Further common preprocessing is to scale each input feature so that its values range from [-1, 1]. Of these, zero mean centering is arguably more important but we will have to wait for its justification until we understand the dynamics of gradient descent.
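Mean centering can be sketched in a few lines. This minimal example, in plain Python, assumes each image is already flattened into a list of pixel values; the two three-pixel "images" are made-up data.

```python
def mean_image(images):
    """Per-pixel mean across all training images."""
    n = len(images)
    return [sum(pixels) / n for pixels in zip(*images)]

def center(images, mean):
    """Subtract the mean image from every image."""
    return [[p - m for p, m in zip(img, mean)] for img in images]

train = [[0, 100, 255],
         [50, 100, 205]]               # two toy images, raw values in [0, 255]

mean = mean_image(train)
centered = center(train, mean)
print(mean)       # [25.0, 100.0, 230.0]
print(centered)   # [[-25.0, 0.0, 25.0], [25.0, 0.0, -25.0]]
```

The mean is computed on the training images only; the same `mean` would then be subtracted from any test image.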

Loss function

In the previous section we defined a function from the pixel values to class scores, which was parameterized by a set of weights (W). Moreover, we saw that we don’t have control over the data ( (x_i,y_i) ) (it is fixed and given), but we do have control over these weights and we want to set them so that the predicted class scores are consistent with the ground truth labels in the training data.

For example, going back to the example image of a cat and its scores for the classes “cat”, “dog” and “ship”, we saw that the particular set of weights in that example was not very good at all: We fed in the pixels that depict a cat but the cat score came out very low (-96.8) compared to the other classes (dog score 437.9 and ship score 61.95). We are going to measure our unhappiness with outcomes such as this one with a loss function (or sometimes also referred to as the cost function or the objective). Intuitively, the loss will be high if we’re doing a poor job of classifying the training data, and it will be low if we’re doing well.

Multiclass Support Vector Machine loss

There are several ways to define the details of the loss function. As a first example we will develop a commonly used loss called the Multiclass Support Vector Machine (SVM) loss. The SVM loss is set up so that the SVM “wants” the correct class for each image to have a score higher than the incorrect classes by some fixed margin \(\Delta\). Notice that it’s sometimes helpful to anthropomorphise the loss functions as we did above: The SVM “wants” a certain outcome in the sense that the outcome would yield a lower loss (which is good).

Let’s now get more precise. Recall that for the i-th example we are given the pixels of image \( x_i \) and the label \( y_i \) that specifies the index of the correct class. The score function takes the pixels and computes the vector \( f(x_i, W) \) of class scores, which we will abbreviate to \(s\) (short for scores). For example, the score for the j-th class is the j-th element: \( s_j = f(x_i, W)_j \). The Multiclass SVM loss for the i-th example is then formalized as follows:

\[L_i = \sum_{j \neq y_i} \max(0, s_j - s_{y_i} + \Delta)\]

Example. Let’s unpack this with an example to see how it works. Suppose that we have three classes that receive the scores \( s = [13, -7, 11]\), and that the first class is the true class (i.e. \(y_i = 0\)). Also assume that \(\Delta\) (a hyperparameter we will go into more detail about soon) is 10. The expression above sums over all incorrect classes (\(j \neq y_i\)), so we get two terms:

\[L_i = \max(0, -7 - 13 + 10) + \max(0, 11 - 13 + 10)\]

You can see that the first term gives zero since \(-7 - 13 + 10\) gives a negative number, which is then thresholded to zero with the \(\max(0,-)\) function. We get zero loss for this pair because the correct class score (13) was greater than the incorrect class score (-7) by at least the margin 10. In fact the difference was 20, which is much greater than 10, but the SVM only cares that the difference is at least 10. Any additional difference above the margin is clamped at zero with the max operation. The second term computes \(11 - 13 + 10\), which gives 8. That is, even though the correct class had a higher score than the incorrect class (13 > 11), it was not greater by the desired margin of 10. The difference was only 2, which is why the loss comes out to 8 (i.e. how much higher the difference would have to be to meet the margin). In summary, the SVM loss function wants the score of the correct class \(y_i\) to be larger than the incorrect class scores by at least \(\Delta\) (delta). If this is not the case, we will accumulate loss.
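The worked example can be reproduced with a few lines of code. This is a minimal sketch in plain Python of the per-example loss; the function name `svm_loss_i` is an illustrative choice.

```python
def svm_loss_i(scores, y, delta):
    """Multiclass SVM loss for one example.

    scores: list of class scores s_j
    y:      index of the correct class
    delta:  required margin
    """
    correct = scores[y]
    return sum(max(0, s - correct + delta)
               for j, s in enumerate(scores) if j != y)

s = [13, -7, 11]
loss = svm_loss_i(s, y=0, delta=10)
print(loss)   # 8  (= max(0, -10) + max(0, 8))
```

Only the third class contributes: its score is within the margin of the correct class score, so it adds 8, while the second class is far enough below and adds 0.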

Note that in this particular module we are working with linear score functions ( \( f(x_i; W) = W x_i \) ), so we can also rewrite the loss function in this equivalent form:

\[L_i = \sum_{j \neq y_i} \max(0, w_j^T x_i - w_{y_i}^T x_i + \Delta)\]

where \(w_j\) is the j-th row of \(W\) reshaped as a column. However, this will not necessarily be the case once we start to consider more complex forms of the score function \(f\).

A last piece of terminology we’ll mention before we finish with this section is that the threshold at zero \(\max(0,-)\) function is often called the hinge loss. You’ll sometimes hear about people instead using the squared hinge loss SVM (or L2-SVM), which uses the form \(\max(0,-)^2\) that penalizes violated margins more strongly (quadratically instead of linearly). The unsquared version is more standard, but in some datasets the squared hinge loss can work better. This can be determined during cross-validation.

The loss function quantifies our unhappiness with predictions on the training set

Regularization. There is one bug with the loss function we presented above. Suppose that we have a dataset and a set of parameters W that correctly classify every example (i.e. all scores are such that all the margins are met, and \(L_i = 0\) for all i). The issue is that this set of W is not necessarily unique: there might be many similar W that correctly classify the examples. One easy way to see this is that if some parameters W correctly classify all examples (so loss is zero for each example), then any multiple of these parameters \( \lambda W \) where \( \lambda > 1 \) will also give zero loss because this transformation uniformly stretches all score magnitudes and hence also their absolute differences. For example, if the difference in scores between a correct class and a nearest incorrect class was 15, then multiplying all elements of W by 2 would make the new difference 30.

In other words, we wish to encode some preference for a certain set of weights W over others to remove this ambiguity. We can do so by extending the loss function with a regularization penalty (R(W)). The most common regularization penalty is the squared L2 norm that discourages large weights through an elementwise quadratic penalty over all parameters:

In the expression above, we are summing up all the squared elements of \(W\). Notice that the regularization function is not a function of the data; it is based only on the weights. Including the regularization penalty completes the full Multiclass Support Vector Machine loss, which is made up of two components: the data loss (which is the average loss \(L_i\) over all examples) and the regularization loss. That is, the full Multiclass SVM loss becomes:

\[ L = \frac{1}{N} \sum_i L_i + \lambda R(W) \]

Or expanding this out in its full form:

\[ L = \frac{1}{N} \sum_i \sum_{j \neq y_i} \left[ \max(0, f(x_i; W)_j - f(x_i; W)_{y_i} + \Delta) \right] + \lambda \sum_k \sum_l W_{k,l}^2 \]

Where (N) is the number of training examples. As you can see, we append the regularization penalty to the loss objective, weighted by a hyperparameter (lambda). There is no simple way of setting this hyperparameter and it is usually determined by cross-validation.

In addition to the motivation we provided above there are many desirable properties to include the regularization penalty, many of which we will come back to in later sections. For example, it turns out that including the L2 penalty leads to the appealing max margin property in SVMs (See CS229 lecture notes for full details if you are interested).

The most appealing property is that penalizing large weights tends to improve generalization, because it means that no input dimension can have a very large influence on the scores all by itself. For example, suppose that we have some input vector \(x = [1,1,1,1]\) and two weight vectors \(w_1 = [1,0,0,0]\), \(w_2 = [0.25,0.25,0.25,0.25]\). Then \(w_1^T x = w_2^T x = 1\), so both weight vectors lead to the same dot product, but the L2 penalty (the sum of squared elements) of \(w_1\) is 1.0 while the L2 penalty of \(w_2\) is only 0.25. Therefore, according to the L2 penalty the weight vector \(w_2\) would be preferred since it achieves a lower regularization loss. Intuitively, this is because the weights in \(w_2\) are smaller and more diffuse. Since the L2 penalty prefers smaller and more diffuse weight vectors, the final classifier is encouraged to take all input dimensions into account by small amounts rather than a few input dimensions very strongly. As we will see later in the class, this effect can improve the generalization performance of the classifiers on test images and lead to less overfitting.
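The arithmetic in this example can be checked with a few lines of Python. This is a small sketch; the helper names `dot` and `l2_penalty` are made up for illustration, and the penalty is taken to be the sum of squared elements as defined earlier (under that definition \(w_2\) scores 0.25, while its unsquared norm is 0.5).

```python
def dot(w, x):
    """Plain dot product of two equal-length lists."""
    return sum(wi * xi for wi, xi in zip(w, x))

def l2_penalty(w):
    """Squared L2 norm: the sum of squared elements."""
    return sum(wi * wi for wi in w)

x = [1, 1, 1, 1]
w1 = [1, 0, 0, 0]
w2 = [0.25, 0.25, 0.25, 0.25]

# Both weight vectors give the same score, but w2 has the smaller penalty.
print(dot(w1, x), dot(w2, x))
print(l2_penalty(w1), l2_penalty(w2))
```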

Note that biases do not have the same effect since, unlike the weights, they do not control the strength of influence of an input dimension. Therefore, it is common to only regularize the weights (W) but not the biases (b). However, in practice this often turns out to have a negligible effect. Lastly, note that due to the regularization penalty we can never achieve loss of exactly 0.0 on all examples, because this would only be possible in the pathological setting of (W = 0).

Code. Here is the loss function (without regularization) implemented in Python, in both unvectorized and half-vectorized form:
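A minimal sketch of the per-example loss, written with plain Python lists rather than an array library so that it runs without dependencies. The function names and the list-of-rows layout of `W` are assumptions for illustration. The second version computes all margins in one expression, which is the step an array library would vectorize.

```python
def class_scores(x, W):
    """Scores W x, with W given as a list of rows (one row per class)."""
    return [sum(w_k * x_k for w_k, x_k in zip(row, x)) for row in W]

def L_i(x, y, W, delta=1.0):
    """Unvectorized multiclass SVM loss for one example (x, y)."""
    scores = class_scores(x, W)
    correct_class_score = scores[y]
    loss_i = 0.0
    for j, score in enumerate(scores):
        if j == y:
            continue  # skip the true class, loop only over incorrect classes
        loss_i += max(0.0, score - correct_class_score + delta)
    return loss_i

def L_i_compact(x, y, W, delta=1.0):
    """Same loss, with all margins computed in a single expression."""
    scores = class_scores(x, W)
    margins = [max(0.0, s - scores[y] + delta) for s in scores]
    margins[y] = 0.0  # the true class contributes nothing
    return sum(margins)

W = [[1, 0], [0, 1], [1, 1]]   # 3 classes, 2 input dimensions
x = [2, 1]                     # scores come out as [2, 1, 3]
print(L_i(x, 0, W), L_i_compact(x, 0, W))
```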

The takeaway from this section is that the SVM loss takes one particular approach to measuring how consistent the predictions on training data are with the ground truth labels. Additionally, making good predictions on the training set is equivalent to minimizing the loss.

All we have to do now is to come up with a way to find the weights that minimize the loss.

Practical Considerations

Setting Delta. Note that we brushed over the hyperparameter (Delta) and its setting. What value should it be set to, and do we have to cross-validate it? It turns out that this hyperparameter can safely be set to (Delta = 1.0) in all cases. The hyperparameters (Delta) and (lambda) seem like two different hyperparameters, but in fact they both control the same tradeoff: The tradeoff between the data loss and the regularization loss in the objective. The key to understanding this is that the magnitude of the weights (W) has direct effect on the scores (and hence also their differences): As we shrink all values inside (W) the score differences will become lower, and as we scale up the weights the score differences will all become higher. Therefore, the exact value of the margin between the scores (e.g. (Delta = 1), or (Delta = 100)) is in some sense meaningless because the weights can shrink or stretch the differences arbitrarily. Hence, the only real tradeoff is how large we allow the weights to grow (through the regularization strength (lambda)).

Relation to Binary Support Vector Machine. You may be coming to this class with previous experience with Binary Support Vector Machines, where the loss for the i-th example can be written as:

\[ L_i = C \max(0, 1 - y_i w^T x_i) + R(W) \]

where \(C\) is a hyperparameter, and \(y_i \in \{-1, 1\}\). You can convince yourself that the formulation we presented in this section contains the binary SVM as a special case when there are only two classes. That is, if we only had two classes then the loss reduces to the binary SVM shown above. Also, \(C\) in this formulation and \(\lambda\) in our formulation control the same tradeoff and are related through the reciprocal relation \(C \propto \frac{1}{\lambda}\).

Aside: Optimization in primal. If you’re coming to this class with previous knowledge of SVMs, you may have also heard of kernels, duals, the SMO algorithm, etc. In this class (as is the case with Neural Networks in general) we will always work with the optimization objectives in their unconstrained primal form. Many of these objectives are technically not differentiable (e.g. the max(x,y) function isn’t because it has a kink when x=y), but in practice this is not a problem and it is common to use a subgradient.

Aside: Other Multiclass SVM formulations. It is worth noting that the Multiclass SVM presented in this section is one of a few ways of formulating the SVM over multiple classes. Another commonly used form is the One-Vs-All (OVA) SVM, which trains an independent binary SVM for each class vs. all other classes. Related, but less common in practice, is the All-vs-All (AVA) strategy. Our formulation follows the Weston and Watkins (1999) version, which is more powerful than OVA in the sense that you can construct multiclass datasets where this version can achieve zero data loss but OVA cannot (see the paper for details if interested). The last formulation you may see is a Structured SVM, which maximizes the margin between the score of the correct class and the score of the highest-scoring incorrect runner-up class. Understanding the differences between these formulations is outside the scope of the class. The version presented in these notes is a safe bet to use in practice, but the arguably simplest OVA strategy is likely to work just as well (as also argued by Rifkin et al. 2004 in In Defense of One-Vs-All Classification).

Softmax classifier

It turns out that the SVM is one of two commonly seen classifiers. The other popular choice is the Softmax classifier, which has a different loss function. If you’ve heard of the binary Logistic Regression classifier before, the Softmax classifier is its generalization to multiple classes. Unlike the SVM, which treats the outputs \(f(x_i; W)\) as (uncalibrated and possibly difficult to interpret) scores for each class, the Softmax classifier gives a slightly more intuitive output (normalized class probabilities) and also has a probabilistic interpretation that we will describe shortly. In the Softmax classifier, the function mapping \(f(x_i; W) = W x_i\) stays unchanged, but we now interpret these scores as the unnormalized log probabilities for each class and replace the hinge loss with a cross-entropy loss that has the form:

\[ L_i = -\log\left( \frac{e^{f_{y_i}}}{\sum_j e^{f_j}} \right) \]

where we are using the notation \(f_j\) to mean the j-th element of the vector of class scores \(f\). As before, the full loss for the dataset is the mean of \(L_i\) over all training examples together with a regularization term \(R(W)\). The function \(f_j(z) = \frac{e^{z_j}}{\sum_k e^{z_k}}\) is called the softmax function: It takes a vector of arbitrary real-valued scores (in \(z\)) and squashes it to a vector of values between zero and one that sum to one. The full cross-entropy loss that involves the softmax function might look scary if you’re seeing it for the first time but it is relatively easy to motivate.
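As a rough illustration, the softmax function and the per-example cross-entropy loss can be written directly from their definitions. This is a sketch only: the function names are assumptions, and this direct form ignores numerical stability, which matters for large scores.

```python
import math

def softmax(z):
    """Exponentiate each score and normalize so the outputs sum to one."""
    exps = [math.exp(v) for v in z]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy_loss(scores, y):
    """Negative log of the softmax probability assigned to the true class y."""
    return -math.log(softmax(scores)[y])

scores = [1.0, -2.0, 0.0]
print(softmax(scores))
print(cross_entropy_loss(scores, 0))
```

With all scores equal, every class gets probability 1/K, so the loss is log K regardless of which class is correct.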

Information theory view. The cross-entropy between a “true” distribution \(p\) and an estimated distribution \(q\) is defined as:

\[ H(p, q) = -\sum_x p(x) \log q(x) \]

The Softmax classifier is hence minimizing the cross-entropy between the estimated class probabilities (\(q = e^{f_{y_i}} / \sum_j e^{f_j}\) as seen above) and the “true” distribution, which in this interpretation is the distribution where all probability mass is on the correct class (i.e. \(p = [0, \ldots, 1, \ldots, 0]\) contains a single 1 at the \(y_i\)-th position). Moreover, since the cross-entropy can be written in terms of entropy and the Kullback-Leibler divergence as \(H(p,q) = H(p) + D_{KL}(p\|q)\), and the entropy of the delta function \(p\) is zero, this is also equivalent to minimizing the KL divergence between the two distributions (a measure of distance). In other words, the cross-entropy objective wants the predicted distribution to have all of its mass on the correct answer.

Probabilistic interpretation. Looking at the expression, we see that

\[ P(y_i \mid x_i; W) = \frac{e^{f_{y_i}}}{\sum_j e^{f_j}} \]

can be interpreted as the (normalized) probability assigned to the correct label (y_i) given the image (x_i) and parameterized by (W). To see this, remember that the Softmax classifier interprets the scores inside the output vector (f) as the unnormalized log probabilities. Exponentiating these quantities therefore gives the (unnormalized) probabilities, and the division performs the normalization so that the probabilities sum to one. In the probabilistic interpretation, we are therefore minimizing the negative log likelihood of the correct class, which can be interpreted as performing Maximum Likelihood Estimation (MLE). A nice feature of this view is that we can now also interpret the regularization term (R(W)) in the full loss function as coming from a Gaussian prior over the weight matrix (W), where instead of MLE we are performing the Maximum a posteriori (MAP) estimation. We mention these interpretations to help your intuitions, but the full details of this derivation are beyond the scope of this class.

Practical issues: Numeric stability. When you’re writing code for computing the Softmax function in practice, the intermediate terms \(e^{f_{y_i}}\) and \(\sum_j e^{f_j}\) may be very large due to the exponentials. Dividing large numbers can be numerically unstable, so it is important to use a normalization trick. Notice that if we multiply the top and bottom of the fraction by a constant \(C\) and push it into the sum, we get the following (mathematically equivalent) expression:

\[ \frac{e^{f_{y_i}}}{\sum_j e^{f_j}} = \frac{C e^{f_{y_i}}}{C \sum_j e^{f_j}} = \frac{e^{f_{y_i} + \log C}}{\sum_j e^{f_j + \log C}} \]

We are free to choose the value of (C). This will not change any of the results, but we can use this value to improve the numerical stability of the computation. A common choice for (C) is to set (log C = -max_j f_j ). This simply states that we should shift the values inside the vector (f) so that the highest value is zero. In code:
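A minimal sketch of the shift, using only the standard library. The example scores and the function name are illustrative assumptions, not values from the course.

```python
import math

def stable_softmax(f):
    """Softmax with the scores shifted so that the largest one is zero."""
    shift = max(f)
    exps = [math.exp(v - shift) for v in f]  # largest exponent is exp(0) = 1
    total = sum(exps)
    return [e / total for e in exps]

f = [123.0, 456.0, 789.0]  # three classes with large scores

# The direct form overflows: math.exp(789.0) exceeds the float range.
try:
    naive = [math.exp(v) for v in f]
    overflowed = False
except OverflowError:
    overflowed = True

p = stable_softmax(f)  # works, and gives the mathematically identical result
print(overflowed, p)
```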

Possibly confusing naming conventions. To be precise, the SVM classifier uses the hinge loss, sometimes also called the max-margin loss. The Softmax classifier uses the cross-entropy loss. The Softmax classifier gets its name from the softmax function, which is used to squash the raw class scores into normalized positive values that sum to one, so that the cross-entropy loss can be applied. In particular, note that technically it doesn’t make sense to talk about the “softmax loss”, since softmax is just the squashing function, but it is a relatively commonly used shorthand.

SVM vs. Softmax

A picture might help clarify the distinction between the Softmax and SVM classifiers:

Softmax classifier provides “probabilities” for each class. Unlike the SVM which computes uncalibrated and not easy to interpret scores for all classes, the Softmax classifier allows us to compute “probabilities” for all labels. For example, given an image the SVM classifier might give you scores [12.5, 0.6, -23.0] for the classes “cat”, “dog” and “ship”. The softmax classifier can instead compute the probabilities of the three labels as [0.9, 0.09, 0.01], which allows you to interpret its confidence in each class. The reason we put the word “probabilities” in quotes, however, is that how peaky or diffuse these probabilities are depends directly on the regularization strength (lambda) - which you are in charge of as input to the system. For example, suppose that the unnormalized log-probabilities for some three classes come out to be [1, -2, 0]. The softmax function would then compute:

\[ [1, -2, 0] \rightarrow [e^{1}, e^{-2}, e^{0}] = [2.71, 0.14, 1] \rightarrow [0.7, 0.04, 0.26] \]

Where the steps taken are to exponentiate and normalize to sum to one. Now, if the regularization strength (lambda) was higher, the weights (W) would be penalized more and this would lead to smaller weights. For example, suppose that the weights became one half smaller ([0.5, -1, 0]). The softmax would now compute:

\[ [0.5, -1, 0] \rightarrow [e^{0.5}, e^{-1}, e^{0}] = [1.65, 0.37, 1] \rightarrow [0.55, 0.12, 0.33] \]

where the probabilities are now more diffuse. Moreover, in the limit where the weights go towards tiny numbers due to very strong regularization strength \(\lambda\), the output probabilities would be near uniform. Hence, the probabilities computed by the Softmax classifier are better thought of as confidences where, similar to the SVM, the ordering of the scores is interpretable, but the absolute numbers (or their differences) technically are not.
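The effect of shrinking the scores can be reproduced with a short sketch. The scale factors are chosen here for illustration; in training the shrinkage would come from the regularization strength rather than from an explicit multiplier.

```python
import math

def softmax(z):
    """Shifted softmax: exponentiate, then normalize to sum to one."""
    shift = max(z)
    exps = [math.exp(v - shift) for v in z]
    total = sum(exps)
    return [e / total for e in exps]

scores = [1.0, -2.0, 0.0]

for scale in (1.0, 0.5, 0.01):
    scaled = [scale * s for s in scores]
    probs = [round(p, 2) for p in softmax(scaled)]
    # Smaller scores give flatter (more diffuse) probabilities.
    print(scale, probs)
```

The ordering of the classes is the same at every scale; only how peaked the distribution is changes.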

In practice, SVM and Softmax are usually comparable. The performance difference between the SVM and Softmax is usually very small, and different people will have different opinions on which classifier works better. Compared to the Softmax classifier, the SVM is a more local objective, which could be thought of either as a bug or a feature. Consider an example that achieves the scores [10, -2, 3] and where the first class is correct. An SVM (e.g. with desired margin of \(\Delta = 1\)) will see that the correct class already has a score higher than the margin compared to the other classes and it will compute a loss of zero. The SVM does not care about the details of the individual scores: if they were instead [10, -100, -100] or [10, 9, 9] the SVM would be indifferent since the margin of 1 is satisfied and hence the loss is zero. However, these scenarios are not equivalent to a Softmax classifier, which would accumulate a much higher loss for the scores [10, 9, 9] than for [10, -100, -100]. In other words, the Softmax classifier is never fully happy with the scores it produces: the correct class could always have a higher probability and the incorrect classes always a lower probability, and the loss would always get better. However, the SVM is happy once the margins are satisfied and it does not micromanage the exact scores beyond this constraint. This can intuitively be thought of as a feature: For example, a car classifier which is likely spending most of its “effort” on the difficult problem of separating cars from trucks should not be influenced by the frog examples, which it already assigns very low scores to, and which likely cluster around a completely different side of the data cloud.
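The three score vectors from this paragraph can be compared numerically. The sketch below uses illustrative function names and the margin \(\Delta = 1\) from the text.

```python
import math

def svm_loss(scores, y, delta=1.0):
    """Multiclass hinge loss for one example with true class y."""
    return sum(max(0.0, s - scores[y] + delta)
               for j, s in enumerate(scores) if j != y)

def softmax_loss(scores, y):
    """Cross-entropy loss for one example, computed in a stable way."""
    shift = max(scores)
    log_sum = math.log(sum(math.exp(s - shift) for s in scores))
    return -(scores[y] - shift - log_sum)

for scores in ([10, -2, 3], [10, -100, -100], [10, 9, 9]):
    # The hinge loss is zero in all three cases; the cross-entropy is not.
    print(scores, svm_loss(scores, 0), round(softmax_loss(scores, 0), 4))
```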



  • We defined a score function from image pixels to class scores (in this section, a linear function that depends on weights W and biases b).
  • Unlike the kNN classifier, the advantage of this parametric approach is that once we learn the parameters we can discard the training data. Additionally, the prediction for a new test image is fast since it requires a single matrix multiplication with W, not an exhaustive comparison to every single training example.
  • We introduced the bias trick, which allows us to fold the bias vector into the weight matrix for convenience of only having to keep track of one parameter matrix.
  • We defined a loss function (we introduced two commonly used losses for linear classifiers: the SVM and the Softmax) that measures how compatible a given set of parameters is with respect to the ground truth labels in the training dataset. We also saw that the loss function was defined in such a way that making good predictions on the training data is equivalent to having a small loss.

We have now seen one way to take a dataset of images and map each one to class scores based on a set of parameters, and we have seen two examples of loss functions that we can use to measure the quality of the predictions. But how do we efficiently determine the parameters that give the best (lowest) loss? This process is optimization, and it is the topic of the next section.

6.2.2. Feature hashing

The class FeatureHasher is a high-speed, low-memory vectorizer that uses a technique known as feature hashing, or the “hashing trick”. Instead of building a hash table of the features encountered in training, as the vectorizers do, instances of FeatureHasher apply a hash function to the features to determine their column index in sample matrices directly. The result is increased speed and reduced memory usage, at the expense of inspectability; the hasher does not remember what the input features looked like and has no inverse_transform method.

Since the hash function might cause collisions between (unrelated) features, a signed hash function is used and the sign of the hash value determines the sign of the value stored in the output matrix for a feature. This way, collisions are likely to cancel out rather than accumulate error, and the expected mean of any output feature’s value is zero. This mechanism is enabled by default with alternate_sign=True and is particularly useful for small hash table sizes ( n_features < 10000 ). For large hash table sizes, it can be disabled, to allow the output to be passed to estimators like MultinomialNB or chi2 feature selectors that expect non-negative inputs.
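The idea of a signed hash can be illustrated without the library. The sketch below is not FeatureHasher's implementation: it uses MD5 from the standard library as a stand-in hash function, and the function name `hash_features` and its parameters are hypothetical. It only shows how one hash value can supply both a column index and a sign.

```python
import hashlib

def hash_features(pairs, n_features=16, alternate_sign=True):
    """Map (feature, value) pairs into a fixed-length dense vector."""
    vec = [0.0] * n_features
    for name, value in pairs:
        digest = hashlib.md5(name.encode("utf-8")).digest()
        # Interpret the first 4 bytes as a signed 32-bit integer.
        h = int.from_bytes(digest[:4], "little", signed=True)
        index = abs(h) % n_features          # column index from the magnitude
        sign = -1.0 if (alternate_sign and h < 0) else 1.0
        vec[index] += sign * value           # repeated features are summed
    return vec

sample = [("feat", 2), ("feat", 3.5), ("other", 1)]
print(hash_features(sample))
print(hash_features(sample, alternate_sign=False))
```

With `alternate_sign=False` every stored value keeps its original sign, which is what estimators expecting non-negative inputs need.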

FeatureHasher accepts either mappings (like Python’s dict and its variants in the collections module), (feature, value) pairs, or strings, depending on the constructor parameter input_type. Mappings are treated as lists of (feature, value) pairs, while single strings have an implicit value of 1, so ['feat1', 'feat2', 'feat3'] is interpreted as [('feat1', 1), ('feat2', 1), ('feat3', 1)]. If a single feature occurs multiple times in a sample, the associated values will be summed (so ('feat', 2) and ('feat', 3.5) become ('feat', 5.5)). The output from FeatureHasher is always a scipy.sparse matrix in the CSR format.

Feature hashing can be employed in document classification, but unlike CountVectorizer, FeatureHasher does not do word splitting or any other preprocessing except Unicode-to-UTF-8 encoding; see Vectorizing a large text corpus with the hashing trick, below, for a combined tokenizer/hasher.

As an example, consider a word-level natural language processing task that needs features extracted from (token, part_of_speech) pairs. One could use a Python generator function to extract features:
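One possible shape for such a generator is sketched below. The feature names it emits (such as "uppercase_initial") are illustrative choices, not a fixed vocabulary.

```python
def token_features(token, part_of_speech):
    """Yield string features for one (token, part_of_speech) pair."""
    if token.isdigit():
        yield "numeric"
    else:
        yield "token={}".format(token.lower())
        yield "token,pos={},{}".format(token, part_of_speech)
    if token[0].isupper():
        yield "uppercase_initial"
    if token.isupper():
        yield "all_uppercase"
    yield "pos={}".format(part_of_speech)

print(list(token_features("Paris", "NNP")))
print(list(token_features("42", "CD")))
```

Because each feature is a plain string with an implicit value of 1, a stream of such lists is the kind of input a hasher constructed with input_type='string' accepts.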

How to find a basis of an image of a linear transformation?

I apologize for asking this even though there are quite a few questions on math.stackexchange with the same title, but the answers to them are still not clear to me.

I have this linear operator:

$ Ax = (2x_1-x_2-x_3, x_1-2x_2+x_3, x_1+x_2-2x_3) $

And I need to find the basis of the kernel and the basis of the image of this transformation.

First, I wrote the matrix of this transformation, which is:

$ A = \begin{pmatrix} 2 & -1 & -1 \\ 1 & -2 & 1 \\ 1 & 1 & -2 \end{pmatrix} $

I found the basis of the kernel by solving a system of 3 linear equations:

But how can I find the basis of the image? What I have found so far is that I need to complement a basis of a kernel up to a basis of an original space. But I do not have an idea of how to do this correctly. I thought that I can use any two linear independent vectors for this purpose, like

because the image here is $\mathbb{R}^2$

But the correct answer from my textbook is:

And by the way I cannot be sure that there is no error in the textbook's answer.

So could anyone help me with this. I will be very grateful, thank you in advance.
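For reference, one standard way to check both bases by machine is row reduction: the free columns of the reduced matrix give kernel vectors, and the columns of the original matrix at the pivot positions span the image. The sketch below applies this to the operator from the question, with the matrix read off from the formula for $Ax$. The helper names are made up, and exact fractions are used to avoid rounding. A valid basis of the image is not unique, so a textbook answer may list different vectors spanning the same plane.

```python
from fractions import Fraction

def rref(rows):
    """Return the reduced row echelon form and the list of pivot columns."""
    m = [[Fraction(v) for v in row] for row in rows]
    pivots, r = [], 0
    for c in range(len(m[0])):
        pivot_row = next((i for i in range(r, len(m)) if m[i][c] != 0), None)
        if pivot_row is None:
            continue  # no pivot in this column: it is a free column
        m[r], m[pivot_row] = m[pivot_row], m[r]
        pivot_value = m[r][c]
        m[r] = [v / pivot_value for v in m[r]]
        for i in range(len(m)):
            if i != r and m[i][c] != 0:
                factor = m[i][c]
                m[i] = [a - factor * b for a, b in zip(m[i], m[r])]
        pivots.append(c)
        r += 1
        if r == len(m):
            break
    return m, pivots

def kernel_basis(rows):
    """One basis vector per free column of the reduced matrix."""
    reduced, pivots = rref(rows)
    n_cols = len(rows[0])
    basis = []
    for free in (c for c in range(n_cols) if c not in pivots):
        v = [Fraction(0)] * n_cols
        v[free] = Fraction(1)
        for row_idx, p in enumerate(pivots):
            v[p] = -reduced[row_idx][free]
        basis.append(v)
    return basis

def image_basis(rows):
    """Columns of the original matrix at the pivot positions."""
    _, pivots = rref(rows)
    return [[row[c] for row in rows] for c in pivots]

A = [[2, -1, -1], [1, -2, 1], [1, 1, -2]]
print(kernel_basis(A))
print(image_basis(A))
```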

The functools module

The functools module in Python 2.5 contains some higher-order functions. A higher-order function takes one or more functions as input and returns a new function. The most useful tool in this module is the functools.partial() function.

For programs written in a functional style, you’ll sometimes want to construct variants of existing functions that have some of the parameters filled in. Consider a Python function f(a, b, c); you may wish to create a new function g(b, c) that’s equivalent to f(1, b, c); you’re filling in a value for one of f()’s parameters. This is called “partial function application”.

The constructor for partial() takes the arguments (function, arg1, arg2, ..., kwarg1=value1, kwarg2=value2). The resulting object is callable, so you can just call it to invoke function with the filled-in arguments.

Here’s a small but realistic example:
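A short sketch of this kind of use: a logging helper whose subsystem argument is filled in once. The function and variable names are illustrative.

```python
import functools

def log(message, subsystem):
    """Format a message tagged with the subsystem that produced it."""
    return '%s: %s' % (subsystem, message)

# Fill in the keyword argument once; the result takes only the message.
server_log = functools.partial(log, subsystem='server')
print(server_log('Unable to open socket'))
```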

functools.reduce(func, iter, [initial_value]) cumulatively performs an operation on all the iterable’s elements and, therefore, can’t be applied to infinite iterables. func must be a function that takes two elements and returns a single value. functools.reduce() takes the first two elements A and B returned by the iterator and calculates func(A, B) . It then requests the third element, C, calculates func(func(A, B), C) , combines this result with the fourth element returned, and continues until the iterable is exhausted. If the iterable returns no values at all, a TypeError exception is raised. If the initial value is supplied, it’s used as a starting point and func(initial_value, A) is the first calculation.

If you use operator.add() with functools.reduce() , you’ll add up all the elements of the iterable. This case is so common that there’s a special built-in called sum() to compute it:
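The behaviour described above, including the empty-iterable cases, can be seen in a short sketch:

```python
import functools
import operator

numbers = [1, 2, 3, 4]

total_reduce = functools.reduce(operator.add, numbers)     # ((1 + 2) + 3) + 4
total_initial = functools.reduce(operator.add, numbers, 0) # starts from 0
total_sum = sum(numbers)                                   # the built-in shortcut
concatenated = functools.reduce(operator.concat, ['A', 'BB', 'C'])

# With no elements and no initial value there is nothing to return.
try:
    functools.reduce(operator.add, [])
    raised = False
except TypeError:
    raised = True

empty_with_initial = functools.reduce(operator.add, [], 0)
print(total_reduce, total_initial, total_sum, concatenated, raised, empty_with_initial)
```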

For many uses of functools.reduce() , though, it can be clearer to just write the obvious for loop:
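For instance, a product of a few numbers can be written either way. The snippet is a small illustration.

```python
import functools
import operator

# Using reduce with an explicit initial value:
product_reduce = functools.reduce(operator.mul, [1, 2, 3], 1)

# The same calculation as a plain loop:
product_loop = 1
for i in [1, 2, 3]:
    product_loop *= i

print(product_reduce, product_loop)
```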

A related function is itertools.accumulate(iterable, func=operator.add) . It performs the same calculation, but instead of returning only the final result, accumulate() returns an iterator that also yields each partial result:
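A brief sketch of accumulate() with the default addition and with multiplication:

```python
import itertools
import operator

running_sums = list(itertools.accumulate([1, 2, 3, 4, 5]))
running_products = list(itertools.accumulate([1, 2, 3, 4, 5], operator.mul))

# Each list ends with the value that reduce() would have returned.
print(running_sums)
print(running_products)
```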

The operator module

The operator module was mentioned earlier. It contains a set of functions corresponding to Python’s operators. These functions are often useful in functional-style code because they save you from writing trivial functions that perform a single operation.

Some of the functions in this module are:

Math operations: add() , sub() , mul() , floordiv() , abs() , …

Logical operations: not_() , truth() .

Bitwise operations: and_() , or_() , invert() .

Comparisons: eq() , ne() , lt() , le() , gt() , and ge() .

Object identity: is_() , is_not() .

Consult the operator module’s documentation for a complete list.
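A few of these functions in use, as a small sketch. Each call replaces a one-line lambda such as `lambda a, b: a * b`.

```python
import functools
import operator

pairwise_products = list(map(operator.mul, [1, 2, 3], [4, 5, 6]))
negated_flags = list(map(operator.not_, [True, False, 0, 1]))
common_bits = functools.reduce(operator.and_, [0b1110, 0b0111])
same_object = operator.is_(None, None)
less_or_equal = operator.le(2, 3)

print(pairwise_products, negated_flags, common_bits, same_object, less_or_equal)
```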


This is the documentation of the ComplexHeatmap package. Examples in the book are generated under version 2.7.7.

You can get a stable version from Bioconductor, but the most up-to-date version is always on GitHub, and you can install it by:

The development branch on Bioconductor is basically synchronized to Github repository.

The ComplexHeatmap package is inspired by the pheatmap package. You will find that many arguments in ComplexHeatmap have the same names as in pheatmap. You can also find this old package that I tried to develop by modifying pheatmap.

Please note that this documentation is not completely compatible with older versions (< 1.99.0, before Oct, 2018), but the major functionality remains the same.

If you use ComplexHeatmap in your publications, I would appreciate it if you could cite:

Gu, Z. (2016) Complex heatmaps reveal patterns and correlations in multidimensional genomic data. DOI: 10.1093/bioinformatics/btw313

Graph Representation – Adjacency Matrix and Adjacency List

A graph is a collection of nodes or vertices (V) and edges (E) between them. We can traverse these nodes using the edges. The edges may be weighted or unweighted.

There can be two kinds of Graphs

  • Un-directed Graph – when you can traverse either direction between two nodes.
  • Directed Graph – when you can traverse only in the specified direction between two nodes.

Now, how do we represent a graph? There are two common ways to represent it:

Adjacency Matrix:

An adjacency matrix is a 2-dimensional array of size V x V, where V is the number of vertices in the graph. See the example below: the adjacency matrix for the graph shown above.

adjMatrix[i][j] = 1 when there is an edge between Vertex i and Vertex j, else 0.

It’s easy to implement because removing and adding an edge takes only O(1) time.

But the drawback is that it takes O(V^2) space even when there are very few edges in the graph.
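A minimal Python sketch of this representation follows. The class and method names are hypothetical, and the `directed` flag is an added convenience for illustration.

```python
class AdjacencyMatrixGraph:
    """Graph stored as a V x V matrix of 0/1 entries."""

    def __init__(self, num_vertices, directed=False):
        self.num_vertices = num_vertices
        self.directed = directed
        # O(V^2) space, regardless of how many edges exist.
        self.matrix = [[0] * num_vertices for _ in range(num_vertices)]

    def add_edge(self, i, j):
        """O(1): set the entry (and its mirror for undirected graphs)."""
        self.matrix[i][j] = 1
        if not self.directed:
            self.matrix[j][i] = 1

    def remove_edge(self, i, j):
        """O(1): clear the entry (and its mirror for undirected graphs)."""
        self.matrix[i][j] = 0
        if not self.directed:
            self.matrix[j][i] = 0

    def has_edge(self, i, j):
        return self.matrix[i][j] == 1


g = AdjacencyMatrixGraph(4)
g.add_edge(0, 1)
g.add_edge(1, 2)
g.remove_edge(0, 1)
print(g.matrix)
```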

Adjacency List:

An adjacency list is an array of linked lists, where the array size is the same as the number of vertices in the graph. Every vertex has a linked list. Each node in this linked list holds a reference to another vertex that shares an edge with the current vertex. The weights can also be stored in the linked list nodes.
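A compact Python sketch of the same idea follows. As a simplification, built-in Python lists stand in for the hand-written linked lists, and each entry is a (neighbor, weight) tuple. The class and method names are hypothetical.

```python
class AdjacencyListGraph:
    """Graph stored as one neighbor list per vertex."""

    def __init__(self, num_vertices, directed=False):
        self.directed = directed
        # Space grows with V + E rather than V^2.
        self.adj = [[] for _ in range(num_vertices)]

    def add_edge(self, src, dst, weight=1):
        """Append (dst, weight) to src's list; mirror it if undirected."""
        self.adj[src].append((dst, weight))
        if not self.directed:
            self.adj[dst].append((src, weight))

    def neighbors(self, v):
        """Vertices that share an edge with v, in insertion order."""
        return [dst for dst, _ in self.adj[v]]


g = AdjacencyListGraph(4)
g.add_edge(0, 1)
g.add_edge(0, 2, weight=5)
print(g.neighbors(0), g.adj[2])
```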

The code below might look complex since we are implementing everything from scratch like linked list, for better understanding. Read the articles below for easier implementations (Adjacency Matrix and Adjacency List)