## Functions of Independent Random Variables

Is it true that functions of independent random variables are themselves independent?

I have often seen that result used implicitly in proofs, for example in the proof of independence between the sample mean and the sample variance of a normal distribution, but I have not been able to find a justification for it. Some authors seem to take it as given, but I am not certain that it always holds.
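For reference, here is a sketch of the standard argument, under the assumption that $X$ and $Y$ are independent and that $f$ and $g$ are measurable functions applied to $X$ and $Y$ separately (the symbols $f$, $g$, $A$, $B$ are introduced here for illustration). For any measurable sets $A$ and $B$:

```latex
\begin{aligned}
P\bigl(f(X)\in A,\ g(Y)\in B\bigr)
  &= P\bigl(X\in f^{-1}(A),\ Y\in g^{-1}(B)\bigr) \\
  &= P\bigl(X\in f^{-1}(A)\bigr)\,P\bigl(Y\in g^{-1}(B)\bigr) \\
  &= P\bigl(f(X)\in A\bigr)\,P\bigl(g(Y)\in B\bigr).
\end{aligned}
```

The middle step is exactly the independence of $X$ and $Y$. Note that this covers $f(X)$ and $g(Y)$ only; two functions that each depend on both $X$ and $Y$ need not be independent.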

## Discussion

A frequent situation in machine learning is having a huge amount of data in which most of the elements are zeros. For example, imagine a matrix where the columns are every movie on Netflix, the rows are every Netflix user, and the values are how many times a user has watched that particular movie. This matrix would have tens of thousands of columns and millions of rows! However, since most users do not watch most movies, the vast majority of elements would be zero.

Sparse matrices only store nonzero elements and assume all other values will be zero, leading to significant computational savings. In our solution, we created a NumPy array with two nonzero values, then converted it into a sparse matrix. If we view the sparse matrix we can see that only the nonzero values are stored:
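The construction described above can be sketched as follows. This is a minimal illustration assuming SciPy's `scipy.sparse.csr_matrix`; the variable names and the 3×2 shape are chosen here for illustration, with the two nonzero values 1 and 3 placed at positions (1, 1) and (2, 0).

```python
import numpy as np
from scipy import sparse

# Dense array with only two nonzero entries (shape is an illustrative assumption)
matrix = np.array([[0, 0],
                   [0, 1],
                   [3, 0]])

# Convert to compressed sparse row (CSR) format
matrix_sparse = sparse.csr_matrix(matrix)

# Printing lists only the stored positions and their values
print(matrix_sparse)
```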

There are a number of types of sparse matrices. In *compressed sparse row* (CSR) matrices, the type used here, (1, 1) and (2, 0) represent the (zero-indexed) indices of the nonzero values 1 and 3, respectively. For example, the element 1 is in the second row and second column. We can see the advantage of sparse matrices if we create a much larger matrix with many more zero elements and then compare this larger matrix with our original sparse matrix:

As we can see, despite the fact that we added many more zero elements in the larger matrix, its sparse representation is exactly the same as our original sparse matrix. That is, the addition of zero elements did not change the size of the sparse matrix.
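The comparison can be sketched like this, again assuming SciPy's CSR format. The matrix sizes and variable names are illustrative choices, not taken from the original solution; the larger matrix simply pads the same two nonzero values with extra zero columns.

```python
import numpy as np
from scipy import sparse

# Original small matrix: nonzero values 1 and 3 at (1, 1) and (2, 0)
small = sparse.csr_matrix(np.array([[0, 0],
                                    [0, 1],
                                    [3, 0]]))

# Larger matrix with the same nonzero values and many more zero columns
large_dense = np.zeros((3, 10), dtype=int)
large_dense[1, 1] = 1
large_dense[2, 0] = 3
large = sparse.csr_matrix(large_dense)

# Both store the same number of values, the same values, and the same column indices
print(small.nnz, large.nnz)
print(small.data, large.data)
print(small.indices, large.indices)
```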

As mentioned, there are many different types of sparse matrices, such as compressed sparse column, list of lists, and dictionary of keys. While an explanation of the different types and their implications is outside the scope of this book, it is worth noting that while there is no “best” sparse matrix type, there are meaningful differences between them and we should be conscious about why we are choosing one type over another.

## 1. Introduction

Measuring and testing dependence between two random vectors …

Testing independence has important applications. Two examples from genomics research are testing whether two groups of genes are associated and examining whether certain phenotypes are determined by particular genotypes. In social science research, scientists are interested in understanding potential associations between psychological and physiological characteristics. Wilks (1935) introduced a parametric test based on …

The distance correlation (Székely et al., 2007) can be used to measure and test dependence between two random vectors. Throughout, $\|a\| = (a^{\mathrm{T}} a)^{1/2}$ for a vector $a$. Székely & Rizzo (2013) observed that the distance correlation may be adversely affected by the dimensions of the two random vectors.

We propose using projection correlation to characterize dependence between two random vectors.

## Covariance of the sum of two random vectors

This is the situation. I have an estimation of the position $(x_t,y_t)$ of an object with its covariance $\Sigma_p$ and an estimation of its speed $(v_x, v_y)$ with its covariance $\Sigma_v$. Actually, the two estimations should be correlated since the speed is computed from the position estimation, but for simplicity, we can assume they are independent.

Now, I want to estimate the covariance of the future position of the object $(x_

$$\begin{aligned}
\operatorname{cov}(X+Y) &= E\big((X+Y)(X+Y)^T\big) - E(X+Y)E(X+Y)^T \\
&= E\big((X+Y)(X^T+Y^T)\big) - E(X+Y)E(X^T+Y^T) \\
&= E(XX^T + YY^T + XY^T + YX^T) - \big(E(X)+E(Y)\big)\big(E(X^T)+E(Y^T)\big) \\
&= E(XX^T)-E(X)E(X^T) + E(YY^T)-E(Y)E(Y^T) + E(XY^T)-E(X)E(Y^T) + E(YX^T)-E(Y)E(X^T) \\
&= \operatorname{cov}(X) + \operatorname{cov}(Y) + E(XY^T)-E(X)E(Y^T) + E(YX^T)-E(Y)E(X^T)
\end{aligned}$$

Is it correct? Now, can I rewrite this so that $\operatorname{cov}(X+Y)$ is expressed only as a function of $\operatorname{cov}(X)$ and $\operatorname{cov}(Y)$, since I do not have the PDF of $X$ and $Y$?
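Under the independence assumption stated in the question, the two cross terms $E(XY^T)-E(X)E(Y^T)$ and $E(YX^T)-E(Y)E(X^T)$ vanish, which leaves $\operatorname{cov}(X+Y) = \operatorname{cov}(X) + \operatorname{cov}(Y)$. The following is a small numerical check of that identity. The covariance matrices, means, sample size, and the choice of Gaussian samples are all made-up values for illustration, not part of the question.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical covariances for position (sigma_p) and speed (sigma_v)
sigma_p = np.array([[2.0, 0.3],
                    [0.3, 1.0]])
sigma_v = np.array([[0.5, -0.1],
                    [-0.1, 0.8]])

n = 200_000
# X and Y are drawn independently of each other
X = rng.multivariate_normal([1.0, 2.0], sigma_p, size=n)
Y = rng.multivariate_normal([0.5, -0.5], sigma_v, size=n)

# Sample covariance of the sum (rows are observations, columns are variables)
cov_sum = np.cov(X + Y, rowvar=False)

print(cov_sum)
print(sigma_p + sigma_v)
```

If the position and speed estimates are in fact correlated, the cross terms do not vanish and the cross-covariance between $X$ and $Y$ is needed as well.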

## Probability with binomial distribution and random vectors

In a city, $20$% of men have blue eyes, $5$% have green eyes, $10$% have black eyes, and the remaining $65$% have brown eyes. Susan decides to commute from the center of the city to a small town. In order to get to the town, she has to take two buses which only take people from the city. She decides to run a little experiment which consists of drinking a beer with every man with blue or green eyes in each of the buses. Suppose every man accepts drinking a beer with her and that in the first bus there are $10$ men and in the second, $8$ men.

1) Calculate the probability that in the first part of the trip she has drunk less than $4$ beers.

2) Calculate the probability that she has drunk more than $3$ beers in total (in the two buses).

My attempt at a solution using user137481 hint:

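2) If I define $Y$ as the number of men with blue or green eyes in the second bus, $X$ as the corresponding number in the first bus, and $Z = X + Y$, then $Z \sim \text{Bin}(18, \tfrac{1}{4})$ and $P(Z>3) = 1-\sum_{i=0}^{3} P(Z=i) = 1-\sum_{i=0}^{3}\binom{18}{i}\left(\dfrac{1}{4}\right)^i\left(\dfrac{3}{4}\right)^{18-i}$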

I would appreciate if someone could take a look at my solution and correct, if necessary, any mistakes. Thanks in advance.
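As a cross-check, both probabilities can be evaluated numerically. This is a minimal sketch that assumes each man independently has blue or green eyes with probability $0.20 + 0.05 = 0.25$, so the counts are binomial with $n = 10$ for the first bus and $n = 18$ for both buses together. The helper name `binom_cdf` is introduced here for illustration.

```python
from math import comb

def binom_cdf(k, n, p):
    """P(X <= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k + 1))

p = 0.20 + 0.05  # probability of blue or green eyes

# 1) Fewer than 4 beers on the first bus: P(X <= 3), X ~ Bin(10, 0.25)
p_first_bus = binom_cdf(3, 10, p)

# 2) More than 3 beers in total: 1 - P(Z <= 3), Z ~ Bin(18, 0.25)
p_total = 1 - binom_cdf(3, 18, p)

print(round(p_first_bus, 4), round(p_total, 4))
```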

## Example 5-7 Section

One ball is drawn randomly from a bowl containing four balls numbered 1, 2, 3, and 4. Define the following three events:

- Let $A$ be the event that a 1 or 2 is drawn. That is, $A = \{1, 2\}$.
- Let $B$ be the event that a 1 or 3 is drawn. That is, $B = \{1, 3\}$.
- Let $C$ be the event that a 1 or 4 is drawn. That is, $C = \{1, 4\}$.

Are events $A$, $B$, and $C$ pairwise independent? Are they mutually independent?

This example illustrates, as does the previous example, that pairwise independence among three events does not necessarily imply that the three events are mutually independent.
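The conclusion can be checked by direct enumeration. The sketch below assumes each of the four balls is equally likely. Every pair of events intersects in the single outcome 1, so each pairwise intersection has probability 1/4 = (1/2)(1/2), while the triple intersection also has probability 1/4 rather than (1/2)(1/2)(1/2) = 1/8.

```python
from fractions import Fraction
from itertools import combinations

sample_space = {1, 2, 3, 4}
events = {"A": {1, 2}, "B": {1, 3}, "C": {1, 4}}

def prob(event):
    # Equally likely outcomes: probability is the share of outcomes in the event
    return Fraction(len(event), len(sample_space))

# Pairwise independence: P(E1 and E2) == P(E1) * P(E2) for every pair
pairwise = all(
    prob(events[a] & events[b]) == prob(events[a]) * prob(events[b])
    for a, b in combinations(events, 2)
)

# Mutual independence additionally needs the product rule for all three events
triple = events["A"] & events["B"] & events["C"]
mutual = pairwise and (
    prob(triple) == prob(events["A"]) * prob(events["B"]) * prob(events["C"])
)

print(pairwise, mutual)
```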

## Examples

It might seem odd to train a random forest model on a dataset and then use that random forest as a kernel, instead of using the random forest directly. However, this can be useful in a number of circumstances.

For example, consider the MNIST digit recognition dataset. A large random forest can obtain acceptable accuracy (greater than 90%) if it is trained on all ten digit classes. In other words, if the random forest algorithm is allowed to see many examples of each digit class, it can learn to classify each of the ten digits. But what if the random forest algorithm is only allowed to see examples of 7s and 9s? Clearly, such a random forest would not be very useful on a dataset of 3s and 4s. However, using the random forest kernel, we can take advantage of a random forest trained on only 7s and 9s to help us understand the differences between 3s and 4s (this is often called transfer learning).

Consider two different PCA projections of the 3s and 4s from the MNIST dataset:

The left-hand side image shows the results of using Kernel PCA to find the two most significant (potentially *non-linear*) components of the 3s and 4s data using the random forest kernel. The right-hand side shows the two most significant (linear) components as determined by PCA.

For the most part, the components found using the random forest kernel provide a much better separation of the data. Remember, the random forest here only got to see 7s and 9s – it has never seen a 3 or a 4, yet the partitions learned by the random forest still carry semantic meaning about digits that can be *transferred* to 3s and 4s.

In fact, if one trains an SVM model to find a linear boundary on the first 5 principal components of the data, the accuracy for the random forest kernel assisted PCA + SVM is 94%, compared to the 87% of the linear PCA + SVM scheme.

The random forest kernel can be a quick and easy way to transfer knowledge from a random forest model to a related domain when techniques like deep belief nets are too expensive or simply overkill. The kernel can be evaluated quickly, and does not require special hardware or a significant amount of memory to compute. It isn’t frequently useful, but it is a nice trick to keep in your back pocket.

## Independent Events

Although typically we expect the conditional probability P ( A | B ) to be different from the probability P ( A ) of *A*, it does not have to be different from P ( A ) . When P ( A | B ) = P ( A ) , the occurrence of *B* has no effect on the likelihood of *A*. Whether or not the event *A* has occurred is *independent* of the event *B*.

Using algebra it can be shown that the equality P ( A | B ) = P ( A ) holds if and only if the equality P ( A ∩ B ) = P ( A ) · P ( B ) holds, which in turn is true if and only if P ( B | A ) = P ( B ) . This is the basis for the following definition.

### Definition

*Events* *A* *and* *B* *are* **independent** *if*

P ( A ∩ B ) = P ( A ) · P ( B )

*If* *A* *and* *B* *are not independent then they are* **dependent**.

The formula in the definition has two practical but exactly opposite uses:

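In a situation in which we can compute all three probabilities P ( A ) , P ( B ) , and P ( A ∩ B ) , it is used to check whether or not the events *A* and *B* are independent:

- If P ( A ∩ B ) = P ( A ) · P ( B ) , then *A* and *B* are independent.
- If P ( A ∩ B ) ≠ P ( A ) · P ( B ) , then *A* and *B* are not independent.

In a situation in which P ( A ) and P ( B ) can each be computed and *A* and *B* are known to be independent, it is used to compute P ( A ∩ B ) as the product P ( A ) · P ( B ) .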

### Example 23

A single fair die is rolled. Let A = { 3 } and B = { 1, 3, 5 }. Are *A* and *B* independent?

In this example we can compute all three probabilities P ( A ) = 1 ∕ 6 , P ( B ) = 1 ∕ 2 , and P ( A ∩ B ) = P ( { 3 } ) = 1 ∕ 6 . Since the product P ( A ) · P ( B ) = ( 1 ∕ 6 ) ( 1 ∕ 2 ) = 1 ∕ 12 is not the same number as P ( A ∩ B ) = 1 ∕ 6 , the events *A* and *B* are not independent.

### Example 24

The two-way classification of married or previously married adults under 40 according to gender and age at first marriage in Note 3.48 "Example 21" produced the table

Determine whether or not the events *F*: “female” and *E*: “was a teenager at first marriage” are independent.

The table shows that in the sample of 902 such adults, 452 were female, 125 were teenagers at their first marriage, and 82 were females who were teenagers at their first marriage, so that

P ( F ) = 452 ∕ 902 , P ( E ) = 125 ∕ 902 , P ( F ∩ E ) = 82 ∕ 902 ≈ 0.091

Since

P ( F ) · P ( E ) = ( 452 ∕ 902 ) · ( 125 ∕ 902 ) ≈ 0.069

is not the same number as P ( F ∩ E ) , we conclude that the two events are not independent.
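The check used in Examples 23 and 24 can be sketched in a few lines. The function name `are_independent` is introduced here for illustration; exact fractions are used so that the comparison is not affected by floating-point rounding.

```python
from fractions import Fraction

def are_independent(p_a, p_b, p_a_and_b):
    """Return True when P(A and B) equals P(A) * P(B)."""
    return p_a_and_b == p_a * p_b

# Example 23: A = {3}, B = {1, 3, 5} on a fair die
ex23 = are_independent(Fraction(1, 6), Fraction(1, 2), Fraction(1, 6))

# Example 24: F = female, E = teenager at first marriage, sample of 902 adults
p_f, p_e, p_fe = Fraction(452, 902), Fraction(125, 902), Fraction(82, 902)
ex24 = are_independent(p_f, p_e, p_fe)

print(ex23, ex24)
print(round(float(p_f * p_e), 3), round(float(p_fe), 3))
```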

### Example 25

Many diagnostic tests for detecting diseases do not test for the disease directly but for a chemical or biological product of the disease, hence are not perfectly reliable. The *sensitivity* of a test is the probability that the test will be positive when administered to a person who has the disease. The higher the sensitivity, the greater the detection rate and the lower the false negative rate.

Suppose the sensitivity of a diagnostic procedure to test whether a person has a particular disease is 92%. A person who actually has the disease is tested for it using this procedure by two independent laboratories.

- What is the probability that both test results will be positive?
- What is the probability that at least one of the two test results will be positive?

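Let *A*_{1} denote the event “the test by the first laboratory is positive” and let *A*_{2} denote the event “the test by the second laboratory is positive.” Since *A*_{1} and *A*_{2} are independent,

P ( A_{1} ∩ A_{2} ) = P ( A_{1} ) · P ( A_{2} ) = 0.92 × 0.92 = 0.8464

Using the Additive Rule for Probability and the probability just computed,

P ( A_{1} ∪ A_{2} ) = P ( A_{1} ) + P ( A_{2} ) − P ( A_{1} ∩ A_{2} ) = 0.92 + 0.92 − 0.8464 = 0.9936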

### Example 26

The *specificity* of a diagnostic test for a disease is the probability that the test will be negative when administered to a person who does not have the disease. The higher the specificity, the lower the false positive rate.

Suppose the specificity of a diagnostic procedure to test whether a person has a particular disease is 89%.

- A person who does not have the disease is tested for it using this procedure. What is the probability that the test result will be positive?
- A person who does not have the disease is tested for it by two independent laboratories using this procedure. What is the probability that both test results will be positive?

Let *B* denote the event “the test result is positive.” The complement of *B* is that the test result is negative, and has probability the specificity of the test, 0.89. Thus

P ( B ) = 1 − P ( B^{c} ) = 1 − 0.89 = 0.11

Let *B*_{1} denote the event “the test by the first laboratory is positive” and let *B*_{2} denote the event “the test by the second laboratory is positive.” Since *B*_{1} and *B*_{2} are independent, by part (a) of the example

P ( B_{1} ∩ B_{2} ) = P ( B_{1} ) · P ( B_{2} ) = 0.11 × 0.11 = 0.0121

The concept of independence applies to any number of events. For example, three events *A*, *B*, and *C* are independent if P ( A ∩ B ∩ C ) = P ( A ) · P ( B ) · P ( C ) . Note carefully that, as is the case with just two events, this is not a formula that is always valid, but holds precisely when the events in question are independent.

### Example 27

The reliability of a system can be enhanced by redundancy, which means building two or more independent devices to do the same job, such as two independent braking systems in an automobile.

Suppose a particular species of trained dogs has a 90% chance of detecting contraband in airline luggage. If the luggage is checked three times by three different dogs independently of one another, what is the probability that contraband will be detected?

Let *D*_{1} denote the event that the contraband is detected by the first dog, *D*_{2} the event that it is detected by the second dog, and *D*_{3} the event that it is detected by the third. Since each dog has a 90% chance of detecting the contraband, by the Probability Rule for Complements it has a 10% chance of failing. In symbols, P ( D_{1}^{c} ) = 0.10 , P ( D_{2}^{c} ) = 0.10 , and P ( D_{3}^{c} ) = 0.10 .

Let *D* denote the event that the contraband is detected. We seek P ( D ) . It is easier to find P ( D^{c} ) , because although there are several ways for the contraband to be detected, there is only one way for it to go undetected: all three dogs must fail. Thus D^{c} = D_{1}^{c} ∩ D_{2}^{c} ∩ D_{3}^{c} , and

P ( D ) = 1 − P ( D^{c} ) = 1 − P ( D_{1}^{c} ∩ D_{2}^{c} ∩ D_{3}^{c} )

But the events *D*_{1}, *D*_{2}, and *D*_{3} are independent, which implies that their complements are independent, so

P ( D_{1}^{c} ∩ D_{2}^{c} ∩ D_{3}^{c} ) = P ( D_{1}^{c} ) · P ( D_{2}^{c} ) · P ( D_{3}^{c} ) = 0.10 × 0.10 × 0.10 = 0.001

Using this number in the previous display we obtain

P ( D ) = 1 − 0.001 = 0.999

That is, although any one dog has only a 90% chance of detecting the contraband, three dogs working independently have a 99.9% chance of detecting it.
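The complement argument generalizes to any number of independent detectors that each succeed with the same probability. The sketch below uses a hypothetical helper, `detection_probability`, to compute the probability that at least one succeeds. The same formula reproduces the "at least one positive" result for the two laboratories in Example 25.

```python
def detection_probability(p_single, n_detectors):
    """P(at least one of n independent detectors succeeds)."""
    # All detectors must fail for the event to be missed
    p_all_fail = (1 - p_single) ** n_detectors
    return 1 - p_all_fail

# Example 27: three dogs, each with a 90% chance of detection
p_three_dogs = detection_probability(0.90, 3)

# Example 25: two laboratories, each with sensitivity 92%
p_two_labs = detection_probability(0.92, 2)

print(round(p_three_dogs, 4), round(p_two_labs, 4))
```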

## Subspace Spanned By Cosine and Sine Functions

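Let $\mathcal{F}[0, 2\pi]$ be the vector space of all real valued functions defined on the interval $[0, 2\pi]$.

Define the map $f:\mathbb{R}^2 \to \mathcal{F}[0, 2\pi]$ by

\[\left(\, f\left(\, \begin{bmatrix} \alpha \\ \beta \end{bmatrix} \,\right) \,\right)(x) := \alpha \cos x + \beta \sin x.\]

We put

\[V := \operatorname{im} f = \{\alpha \cos x + \beta \sin x \in \mathcal{F}[0, 2\pi] \mid \alpha, \beta \in \mathbb{R}\}.\]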

**(a)** Prove that the map $f$ is a linear transformation.

**(b)** Prove that the set $\{\cos x, \sin x\}$ is a basis of the vector space $V$.

**(c)** Prove that the kernel is trivial, that is, $\ker f = \{\mathbf{0}\}$.

(This yields an isomorphism of $\mathbb{R}^2$ and $V$.)

**(d)** Define a map $g:V \to V$ by

\[g(\alpha \cos x + \beta \sin x) := \frac{d}{dx}(\alpha \cos x + \beta \sin x).\]

**(e)** Find the matrix representation of the linear transformation $g$ with respect to the basis $\{\cos x, \sin x\}$.
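A numerical sanity check for part (e) can be sketched as follows. It rests on two assumptions that should be read as such: that $g$ is differentiation with respect to $x$, and that coordinates are taken with respect to the ordered basis $\{\cos x, \sin x\}$. Under those assumptions $g(\cos x) = -\sin x$ and $g(\sin x) = \cos x$, so the columns of the candidate matrix `G` below are $(0, -1)$ and $(1, 0)$. The test coefficients and sample points are arbitrary.

```python
import numpy as np

# Candidate matrix of g in the basis {cos x, sin x}
# (assumption: g is differentiation with respect to x)
G = np.array([[0.0, 1.0],
              [-1.0, 0.0]])

def evaluate(coords, x):
    """Evaluate alpha*cos(x) + beta*sin(x) for coords = (alpha, beta)."""
    alpha, beta = coords
    return alpha * np.cos(x) + beta * np.sin(x)

coords = np.array([2.0, -3.0])          # arbitrary test coefficients
x = np.linspace(0.0, 2.0 * np.pi, 7)    # sample points in [0, 2*pi]

# Central finite difference as an independent estimate of the derivative
h = 1e-6
numeric_derivative = (evaluate(coords, x + h) - evaluate(coords, x - h)) / (2 * h)

# Derivative predicted by the matrix acting on the coordinate vector
matrix_derivative = evaluate(G @ coords, x)

print(np.allclose(numeric_derivative, matrix_derivative, atol=1e-6))
```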