Statistics in Python

In this post I point to sites that discuss statistics with hands-on examples in Python. I like learning theoretical concepts by coding, and Python is a great language for experimenting: it is easy to learn, free, and has great support for statistical computation.

The online book An introduction to statistics in Python, written by Thomas Haslwanter, aims to introduce basic statistical procedures to researchers who are not proficient in statistics. The author approaches this task by providing lots of plain Python code as well as IPython notebooks that readers can apply to their own data. The author stresses that for more serious statistical work, readers will need to dig into dedicated statistics books and literature.

The book assumes that readers have a reasonable knowledge of Python but are not statistics experts. If you are relatively new to scientific computing in Python, I recommend downloading the Anaconda distribution, which has all the packages needed for trying out the code in this book as well as many more useful packages, a convenient package manager and good support for virtual environments.

The site “Computational Statistics in Python” focuses on presenting Python tools for computational statistics. Beginning with a crash course on Python and proceeding to present many related tools, the site adopts a hands-on approach by providing lots of relevant Python code. It promotes using the IPython notebook and discusses integrating Python code with code in other languages such as R, Matlab, Octave, Julia and C. The site demonstrates using libraries such as numpy, scipy, NLTK, Pandas, h5py (for working with the HDF5 format), pystan and pymc. A large part of the site is devoted to different approaches for increasing computational efficiency, including interfacing to C/C++ code, converting Python to C with Cython, just-in-time compilation with numba, writing parallel code, utilizing the GPU, distributed processing using Hadoop and Spark, and more. Specific computational areas discussed include linear algebra, linear systems, matrix decomposition, PCA, optimization techniques, Expectation Maximization (EM), Monte Carlo methods, resampling methods and Markov chain Monte Carlo (MCMC). I must stress that this site is not meant to replace textbooks but rather to provide examples in Python for doing computations, assuming the reader knows the basic theory.
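To give a flavour of the resampling methods mentioned above, here is a minimal sketch of a percentile bootstrap confidence interval written with numpy. It is my own illustration and is not taken from the site; the function name bootstrap_ci, its parameters and the simulated data are all assumptions made for the example.

```python
import numpy as np


def bootstrap_ci(data, stat=np.mean, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for a statistic.

    `stat` must accept an `axis` keyword (np.mean, np.median, np.std do).
    All names here are illustrative, not part of any library API.
    """
    rng = np.random.default_rng(seed)
    data = np.asarray(data)
    # Each row of idx selects one resample of the data, drawn with replacement.
    idx = rng.integers(0, len(data), size=(n_resamples, len(data)))
    # Evaluate the statistic once per resample.
    estimates = stat(data[idx], axis=1)
    # The interval is given by the empirical quantiles of the estimates.
    lower, upper = np.quantile(estimates, [alpha / 2, 1 - alpha / 2])
    return lower, upper


# Simulated data: 200 draws from a normal distribution (assumed for the demo).
sample = np.random.default_rng(42).normal(loc=10.0, scale=2.0, size=200)
low, high = bootstrap_ci(sample)
print(f"sample mean = {sample.mean():.3f}, 95% CI = ({low:.3f}, {high:.3f})")
```

Drawing all the resampling indices in one call keeps the loop inside numpy, which is the kind of vectorization that matters before reaching for Cython or numba.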

The book Think Stats: Probability and Statistics for Programmers by Allen B. Downey is an introductory book on probability and statistics, accompanied by data files and source code with examples and solutions to problems. The author provides Python code implementing a wide range of concepts, from the most basic procedures like computing the mean or variance to more complicated ones such as representing the CDF and using it to generate random numbers with a given distribution, hypothesis testing, etc.
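The idea of representing a CDF and using it to generate random numbers can be sketched in a few lines of plain Python. This is my own simplified version and not the code that ships with the book; the class name EmpiricalCdf and its methods are hypothetical names chosen for the example. It stores the sorted sample, evaluates the empirical CDF with a binary search, and draws new values by passing uniform random numbers through the inverse CDF.

```python
import bisect
import math
import random


class EmpiricalCdf:
    """Empirical CDF of a sample (illustrative class, not from Think Stats)."""

    def __init__(self, values):
        self.xs = sorted(values)
        if not self.xs:
            raise ValueError("need at least one value")

    def prob(self, x):
        """Fraction of sample values less than or equal to x."""
        return bisect.bisect_right(self.xs, x) / len(self.xs)

    def value(self, p):
        """Inverse CDF: smallest sample value whose cumulative probability is >= p."""
        if not 0 <= p <= 1:
            raise ValueError("p must be in [0, 1]")
        index = max(0, math.ceil(p * len(self.xs)) - 1)
        return self.xs[index]

    def sample(self, n, rng=random):
        """Draw n values by inverse transform sampling."""
        return [self.value(rng.random()) for _ in range(n)]


cdf = EmpiricalCdf([1, 2, 2, 3, 5])
print(cdf.prob(2))    # fraction of values <= 2
print(cdf.value(0.5)) # median of the sample
draws = cdf.sample(10000, rng=random.Random(1))
print(draws.count(2) / len(draws))  # should be close to 2/5
```

The generated values follow the distribution of the original sample, so the value 2, which makes up two of the five observations, should appear in roughly 40% of the draws.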