Probability

DinS          Written on 2017/12/14

This article covers the basic concepts of probability.

1. Sample space

Sample space Ω = all possible outcomes.

Ω can either be finite or infinite.

Each outcome i ∈ Ω has a probability p(i) >= 0, and Σ_{i∈Ω} p(i) = 1.

2. Event

An event is a subset S ⊆ Ω. The probability of an event S is Pr[S] = Σ_{i∈S} p(i).
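
To make this concrete, here is a minimal Python sketch (the die example and the names omega and pr are my own) that stores a finite sample space as a dictionary from outcomes to probabilities and sums p(i) over an event:

    # A finite sample space: each outcome i maps to its probability p(i).
    # Example: one roll of a fair six-sided die.
    omega = {i: 1/6 for i in range(1, 7)}
    assert abs(sum(omega.values()) - 1.0) < 1e-9  # the p(i) sum to 1

    def pr(event, space):
        """Pr[S] = sum of p(i) over outcomes i in the event S."""
        return sum(p for i, p in space.items() if i in event)

    print(pr({2, 4, 6}, omega))  # the event "the roll is even" -> 0.5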

3. Random variable

A random variable X is a real-valued function X: Ω → R.

In other words, we assign each outcome a real number. This makes notation convenient.

For example, we toss a coin three times. Depending on heads or tails, we have:

X(HHH) = 3, X(HHT) = X(HTH) = X(THH) = 2, X(HTT) = X(THT) = X(TTH) = 1, X(TTT) = 0.

This number is simply the number of heads that appeared.

Now that we have this number, we can say that the subset of outcomes with X = 2 is A = {HHT, HTH, THH}.

Furthermore, we can say that Pr[X = 2] = Pr[A] = 3/8.

This notation is compact and elegant.
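
As a quick check of this example, here is a short Python sketch (variable names are mine) that enumerates the 8 equally likely outcomes and verifies Pr[X = 2] = 3/8:

    from itertools import product

    # Sample space: all sequences of three coin tosses, 8 outcomes, each 1/8.
    outcomes = [''.join(t) for t in product('HT', repeat=3)]

    def X(outcome):
        """Random variable: the number of heads that appeared."""
        return outcome.count('H')

    A = [o for o in outcomes if X(o) == 2]  # the event {X = 2}
    print(A)                                # ['HHT', 'HTH', 'THH']
    print(len(A) / len(outcomes))           # 0.375 = 3/8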

4. Expectation

E[X] is the expectation of X, i.e. the average value of X, which is defined as E[X] = Σ_{i∈Ω} p(i)·X(i).

For example, what is the expectation of the sum if we roll two dice?

Let’s take the pain of writing down all the outcomes.

Clearly, the random variable is just the sum of the numbers on the two dice. This is straightforward.

Now we’ll just plug in the definition of expectation to calculate it.

Notice that there are 36 outcomes altogether, and each outcome has probability 1/36. In other words, p(i) = 1/36 regardless of i. This makes things easy. We can just write E[X] = Σ_{i∈Ω} p(i)·X(i) = (1/36)·Σ_{i∈Ω} X(i) = 252/36 = 7.

As a result, the expectation of the sum is 7.
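
The same computation as a Python sketch, plugging the 36 outcomes straight into the definition:

    from itertools import product

    # All 36 equally likely outcomes of rolling two dice.
    outcomes = list(product(range(1, 7), repeat=2))
    p = 1 / len(outcomes)  # p(i) = 1/36 for every outcome

    # E[X] = Σ p(i)·X(i), where X is the sum of the two dice.
    print(sum(p * (a + b) for a, b in outcomes))  # 7.0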

5. Linearity of expectation

Let X1, X2, … Xn be random variables defined on Ω.

Then E[X1 + X2 + … + Xn] = E[X1] + E[X2] + … + E[Xn].

Let’s take the last example for demonstration.

The expectation for a single die is E[X1] = (1 + 2 + 3 + 4 + 5 + 6)/6 = 3.5. This is as clear as day.

By linearity of expectation, the expectation for two dice is E[X1+X2] = E[X1] + E[X2] = 2*3.5 = 7, the same as in the last example. However, this solution is much easier.
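
A quick numerical check of this shortcut (a sketch, continuing the dice example):

    from itertools import product

    # E[X1] for a single fair die: 3.5.
    e_single = sum(v / 6 for v in range(1, 7))

    # E[X1 + X2] by brute force over all 36 outcomes.
    e_sum = sum((a + b) / 36 for a, b in product(range(1, 7), repeat=2))

    print(2 * e_single, e_sum)  # 7.0 7.0 -- linearity holds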

 

Let’s look at a more complicated example to demonstrate the magic of linearity of expectation.

Problem: need to assign n processes to n servers.

Proposed solution: assign each process to a random server.

Question: what is the expected number of processes assigned to a given server?

At first glance, this seems more complicated than rolling dice. We need a way to break the problem down.

Let Y = total number of processes assigned to the first server. Since all servers are equally likely to be chosen for a process, we just need to compute E[Y], and that’ll be the answer.

Now we want to use a trick. We define indicator random variables: for j = 1, …, n, let Xj = 1 if process j is assigned to the first server, and Xj = 0 otherwise.

What’s the benefit? Using Xj we can write Y = X1 + X2 + … + Xn. This is just the definition of Y.

When you see a sum inside an expectation, you should automatically think of linearity of expectation.

As a result, E[Y] = E[X1 + X2 + … + Xn] = E[X1] + E[X2] + … + E[Xn] = n·E[X1], since by symmetry every E[Xj] is the same.

Now we have simplified the problem from n processes to one process.

By the definition of expectation, E[X1] = 0·Pr[X1 = 0] + 1·Pr[X1 = 1]. The first term is multiplied by 0, so we won’t bother with it. As for the second term, since the assignment is totally random, each server is equally likely. In other words, the probability that the first server gets this process is 1/n. Plugging back in, E[Y] = n·(1/n) = 1: on average each server gets exactly one process.

From this example you can see that the trick with linearity of expectation is to construct random variables that only take the values 0 and 1. When they are summed up, the 0 terms automatically vanish, which makes the calculation much easier.
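
A quick Monte Carlo sketch of the server example (n and the trial count are my own choices) that confirms E[Y] = 1 empirically:

    import random

    def simulate(n, trials=100_000):
        """Assign n processes to n servers uniformly at random and
        return the average number of processes landing on the first server."""
        total = 0
        for _ in range(trials):
            total += sum(1 for _ in range(n) if random.randrange(n) == 0)
        return total / trials

    print(simulate(10))  # close to 1.0, for any n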

 

6. Markov’s Inequality

Theorem: suppose f is a non-negative random variable. Then for any number a > 0 we have Pr[f >= a] <= E[f]/a.

Markov’s Inequality allows us to use expectation to bound probability of certain events.

For example, Bob is holding a party. He invited 30 people, and each comes independently with probability 0.4. Question: what upper bound can we put on the probability that at least 18 come?

To answer this question, we need to work out the expectation first. Let X be the number of people who come. By linearity of expectation, E[X] = 30 × 0.4 = 12 people. Then Markov’s Inequality gives the answer: Pr[X >= 18] <= 12/18 = 2/3.
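
Since attendance is a sum of 30 independent 0/1 variables, it follows a Binomial(30, 0.4) distribution, so we can also compare the Markov bound with the exact tail probability. A Python sketch:

    from math import comb

    n, p, a = 30, 0.4, 18

    # Markov's Inequality: Pr[X >= a] <= E[X] / a.
    print(n * p / a)  # 0.666... = 2/3

    # Exact tail Pr[X >= 18] for X ~ Binomial(30, 0.4).
    exact = sum(comb(n, k) * p**k * (1 - p)**(n - k) for k in range(a, n + 1))
    print(exact)  # roughly 0.02 -- Markov's bound is valid but quite loose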

7. Conditional probability

Pr[X|Y] = Pr[X ∩ Y] / Pr[Y]. The “X|Y” means “X given Y”.

This needs some explanation. For events X, Y ⊆ Ω, each can either happen or not happen, and they may or may not overlap.

We want to know: given that Y has happened, how likely is X? Only the part of X that lies inside Y still matters.

For example, let’s roll two dice. What’s the probability that at least one die shows 1, given that the sum of the two dice is 7?

We can write out the outcomes for X and Y. Let’s start with Y since it’s simple.

Y = {(1,6), (2,5), (3,4), (4,3), (5,2), (6,1)}

Now we don’t have to write out X. We only care about X ∩ Y = {(1,6), (6,1)}.

Thus, Pr[X|Y] = Pr[X ∩ Y] / Pr[Y] = (2/36) / (6/36) = 1/3.
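
The same computation as a Python sketch, enumerating the 36 outcomes directly:

    from itertools import product

    outcomes = list(product(range(1, 7), repeat=2))  # 36 equally likely pairs

    Y = [o for o in outcomes if sum(o) == 7]         # the sum is 7
    X_and_Y = [o for o in Y if 1 in o]               # ...and some die shows 1

    # With equally likely outcomes, Pr[X|Y] = |X ∩ Y| / |Y|.
    print(len(X_and_Y) / len(Y))  # 0.333... = 1/3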

 

8. Bayes’ Theorem

One of the most famous theorems in probability:

Pr[H|E] = Pr[E|H]·Pr[H] / Pr[E]

H stands for hypothesis, E for evidence.

Nothing special, just applying the definition of conditional probability twice. The uses, however, are vast. Bayes’ Theorem tells us to what degree seeing the evidence E improves the credibility of the hypothesis H.

Here is a pitfall exploited by many commercials and reports. Notice that if the prior Pr[H] is small, then no matter how effective E is, the posterior Pr[H|E] will be small nonetheless.
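
A concrete sketch of this pitfall (the numbers here are my own, picked for illustration): even evidence that is right 99% of the time cannot rescue a 0.1% prior.

    def posterior(p_h, p_e_given_h, p_e_given_not_h):
        """Bayes' Theorem: Pr[H|E] = Pr[E|H]·Pr[H] / Pr[E]."""
        p_e = p_e_given_h * p_h + p_e_given_not_h * (1 - p_h)
        return p_e_given_h * p_h / p_e

    # Tiny prior Pr[H] = 0.001, strong evidence: the posterior is still small.
    print(posterior(0.001, 0.99, 0.01))  # about 0.09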

9. Independence of events

Events X, Y ⊆ Ω are independent if and only if Pr[X ∩ Y] = Pr[X]*Pr[Y].

This can be understood as Pr[X|Y] = Pr[X], i.e. whether Y happens or not has no effect on X.

Same goes for Pr[Y|X] = Pr[Y].

Warning: independence can be a very subtle concept. Intuition is often incorrect!

Rule of thumb: you may treat X and Y as independent at first, but flag that assumption until it is verified.

10. Independence of random variables

A, B are independent ⇔ the events {A = a} and {B = b} are independent for all a, b.

Claim: if A, B are independent, then E[A*B] = E[A] * E[B].

Notice: although this looks like linearity of expectation, it requires A and B to be independent, while linearity of expectation does not.
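
A final sketch contrasting the two: for two independent dice E[A*B] = E[A]*E[B] holds, but for the fully dependent pair B = A it fails:

    from itertools import product

    faces = range(1, 7)
    e = sum(v / 6 for v in faces)  # E[A] = E[B] = 3.5

    # Independent: A, B are two separate fair dice.
    e_product = sum(a * b / 36 for a, b in product(faces, repeat=2))
    print(e_product, e * e)        # 12.25 12.25 -- equal

    # Dependent: B = A, so E[A*B] = E[A^2] = 91/6 ≈ 15.17 != 12.25.
    print(sum(a * a / 6 for a in faces))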