This is a notebook to learn about the central limit theorem. The basic idea is to draw $N$ random numbers $\{x_i\}$ (for $i=1\ldots N$) from some probability distribution $p(x)$ and calculate the sum $y=\sum_{i=1}^N x_i$. Note that, in general, $y$ is a *random variable*: if I draw a different set of $N$ numbers, I will get a slightly different value for $y$.

In statistical physics, we are often interested in the behavior of such *extensive* variables (variables that scale with $N$). We would like to understand their *average value*, their *fluctuations*, and how these scale with $N$.

In this notebook, we will try to build intuition for this by repeatedly calculating $y$ for different draws of $N$ random numbers. Let $y_\alpha$ (with $\alpha=1\ldots M$) be the sum for the $\alpha$'th time I draw $N$ numbers. Then we can make a histogram of these $y_\alpha$. This histogram tells us about the probability of observing a given value of $y$.

We now perform this when the $x_i$ are binary variables with $x_i\in\{0,1\}$ and $$ p(x_i=1)=q\\ p(x_i=0)=1-q $$

- Please play around with the code below. How do $N$, $M$, and $q$ affect the mean and the fluctuations of the distribution?
- How would I identify the center of this distribution and the "width" of the distribution from theory? Derive expressions for these.
- Can you relate these theoretically-derived expressions to the empirical mean and standard deviation? For what $M$ do I get $10\%$ error? How about $0.01\%$ error?
- Is there something special about binary variables?
- Make plots of the empirically observed mean and "width" as a function of $N$.
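If you want to check your theoretical derivation against the code, one standard route (assuming the $x_i$ are drawn independently) uses the fact that means and variances of independent variables add:

$$ \langle y \rangle = \sum_{i=1}^N \langle x_i \rangle = Nq, \qquad \mathrm{Var}(y) = \sum_{i=1}^N \mathrm{Var}(x_i) = Nq(1-q), $$

so the center grows like $N$ while the width $\sigma_y = \sqrt{Nq(1-q)}$ grows only like $\sqrt{N}$.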

In [8]:

```
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
#Draw M sets of N random numbers
N=100
M=100
q=0.9
Data=np.random.binomial(1,q,(M,N))
#Draw from normal distribution
#mean=10;
#sigma=5;
#Data=np.random.normal(mean,sigma,(M,N))
#Draw from Gamma distribution
#shape=2
#scale=2
#Data=np.random.gamma(shape,scale,(M,N))
y_vector=np.sum(Data, axis=1)
plt.clf()
sns.histplot(y_vector);  #histogram of the M sums (distplot is deprecated in recent seaborn)
plt.show()
#Calculate mean value
mean_y=np.mean(y_vector)
print("The empirical mean is", mean_y)
std_y=np.std(y_vector)
print("The empirical std is", std_y)
#Print theoretical std for the Bernoulli case
print("The theoretical std for Bernoulli is:", np.sqrt(N*q*(1-q)))
```
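One possible sketch for the last exercise above (not part of the original notebook; the choice of $N$ values is arbitrary): sweep $N$, record the empirical mean and standard deviation of the $y_\alpha$, and overlay the Bernoulli predictions $Nq$ and $\sqrt{Nq(1-q)}$.

```
import numpy as np
import matplotlib.pyplot as plt

M = 5000
q = 0.9
N_values = np.array([10, 30, 100, 300, 1000])

#Empirical mean and std of y for each N
emp_means, emp_stds = [], []
for N in N_values:
    y = np.random.binomial(1, q, (M, N)).sum(axis=1)
    emp_means.append(y.mean())
    emp_stds.append(y.std())

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))
ax1.plot(N_values, emp_means, 'o', label='empirical mean')
ax1.plot(N_values, N_values * q, '-', label=r'theory: $Nq$')
ax1.set_xlabel('N'); ax1.set_ylabel('mean of y'); ax1.legend()
#Log-log axes make the sqrt(N) scaling of the width visible as slope 1/2
ax2.loglog(N_values, emp_stds, 'o', label='empirical std')
ax2.loglog(N_values, np.sqrt(N_values * q * (1 - q)), '-',
           label=r'theory: $\sqrt{Nq(1-q)}$')
ax2.set_xlabel('N'); ax2.set_ylabel('std of y'); ax2.legend()
plt.tight_layout()
plt.show()
```

On the log-log plot the empirical width should fall on a line of slope $1/2$, which is the $\sqrt{N}$ scaling asked about above.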

We now perform a similar simulation when the $x_i$ are continuous variables drawn from some other distributions: the Normal distribution or even the Gamma distribution (look them up on Wikipedia). Here, fix $M=5000$.

- Please play around with the code below. How does $N$ affect the mean and the fluctuations of the distribution? Make a plot of the width and mean as a function of $N$.
- Is there something special about binary variables or the probability distribution we draw from (as far as scaling with $N$)?

In [ ]:

```
```
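Here is one possible sketch for the empty cell above (the distribution parameters are just the commented-out values from the earlier cell): sweep $N$ for both the Normal and Gamma cases at $M=5000$ and compare the empirical width with $\sigma_x\sqrt{N}$, where $\sigma_x$ is the standard deviation of a single draw.

```
import numpy as np
import matplotlib.pyplot as plt

M = 5000
N_values = np.array([10, 30, 100, 300, 1000])

mean, sigma = 10.0, 5.0   #normal distribution parameters
shape, scale = 2.0, 2.0   #gamma distribution: single-draw std = sqrt(shape)*scale

draws = {
    "normal": lambda N: np.random.normal(mean, sigma, (M, N)),
    "gamma": lambda N: np.random.gamma(shape, scale, (M, N)),
}
single_std = {"normal": sigma, "gamma": np.sqrt(shape) * scale}

plt.figure()
for name, draw in draws.items():
    #Empirical std of the sum y for each N
    stds = [draw(N).sum(axis=1).std() for N in N_values]
    plt.loglog(N_values, stds, 'o', label=name + ' (empirical)')
    plt.loglog(N_values, single_std[name] * np.sqrt(N_values), '-',
               label=name + ': ' + r'$\sigma_x\sqrt{N}$')
plt.xlabel('N')
plt.ylabel('std of y')
plt.legend()
plt.show()
```

Both curves should follow the same $\sqrt{N}$ scaling, suggesting there is nothing special about binary variables as far as the scaling of the width with $N$ is concerned.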