Getting Started with Deep Learning

New posts on machine learning with a special focus on deep learning

Libraries for Boolean operations on polygons, mainly in C++

I needed a good C++ library for Boolean operations on polygons, so I investigated several libraries in search of one that fit my needs: computing Boolean intersection and difference for large sets of polygons. I also needed the ability to efficiently compute the union of a large set of polygons at once, an operation called unary (cascaded) union. In this post I present the libraries that I found useful for my needs. The interested reader is encouraged to also consider this benchmark from 2011 and this newer benchmark from 2013. The latter benchmark, however, seems to be biased towards VLSI applications operating on large collections of rectangular shapes.

Libraries with functions to compute Boolean operations on polygons:

  • Boost.Polygon is a library developed at Intel in 1997 under the name GTL and accepted into Boost in 2008. The core philosophy of Boost.Polygon is to give prime consideration to the correctness and geometrical robustness of its algorithms. A byproduct of this focus was the decision to allow only polygons with vertices at integral coordinates. In our tests Boost.Polygon always produced correct results. The library can compute unary (cascaded) union, which efficiently generates the union of a set of possibly overlapping polygons in one operation. This is far more efficient than accumulating the union by adding polygons sequentially (see here for a more complete discussion). Note that unary union can also be performed by other libraries such as GEOS or Clipper; however, GEOS was considerably slower than Boost.Polygon and Clipper produced incorrect results in some of my tests. I am quite happy working with Boost.Polygon, but it is a little worrying that the library does not seem to be actively maintained. Perhaps that is because it just works. The algorithm used for Boolean operations is based on Vatti's algorithm.
  • Boost.Geometry is a library developed from 2006 under the name GGL and accepted into Boost in 2010. The library tries to closely follow the Simple Feature Access (SFA) data model that is widely used by the Geographic Information System community for describing two-dimensional geometries. The library allows coordinates to be of either integral or floating-point types. It is thus more flexible than Boost.Polygon but also seems to be less robust (see, for example, this discussion). The library is actively developed and has a helpful mailing list. According to the information in this post, Boost.Geometry implements an adaptation of the Weiler-Atherton algorithm for Boolean operations.
  • Another library that closely follows the SFA model is GEOS, which is heavily used in GIS applications. GEOS has a C interface but was considerably slower than Boost.Geometry in my experiments with Boolean operations. On the other hand, GEOS does provide a unary union operation.
  • Clipper is a library developed in 2010 that uses the Vatti clipping algorithm to perform Boolean operations. It is very fast and allows defining the polygon fill rule. In the same spirit as Boost.Polygon, the Clipper library supports only points with integral coordinates, for reasons of numerical robustness. Clipper can properly handle unary union and is very straightforward to use (no fancy template meta-programming tricks), but I found several cases where it did not produce correct results in my tests, so I did not explore the library any further. My tests were made at the end of 2013, so these conclusions may no longer be valid if the algorithm has since been fixed.
  • Other C/C++ libraries that perform Boolean operations which I did not have time to test: CGAL, GPC, PolyBoolean, QISLib.

In addition to C++, I also worked quite extensively with the shapely library, a Python package based on GEOS. This library is a joy to use for small experimental projects, but its performance does not seem to scale well to large data sets.
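To make the discussion concrete, here is a minimal, self-contained sketch of what a Boolean intersection computes. It implements the classic Sutherland-Hodgman algorithm for clipping a polygon against a convex polygon. This is not code from any of the libraries above, which handle the much harder general case (non-convex clip polygons, multi-polygon results, numerical robustness); it only illustrates the kind of result an intersection operation produces.

```python
def clip_polygon(subject, clip):
    """Return the intersection of `subject` with the *convex* polygon `clip`.
    Both are lists of (x, y) vertices in counter-clockwise order."""

    def inside(p, a, b):
        # p lies on the left of (or on) the directed edge a -> b.
        return (b[0] - a[0]) * (p[1] - a[1]) - (b[1] - a[1]) * (p[0] - a[0]) >= 0

    def intersect(p, q, a, b):
        # Intersection of segment p-q with the infinite line through a and b.
        den = (p[0] - q[0]) * (a[1] - b[1]) - (p[1] - q[1]) * (a[0] - b[0])
        t = ((p[0] - a[0]) * (a[1] - b[1]) - (p[1] - a[1]) * (a[0] - b[0])) / den
        return (p[0] + t * (q[0] - p[0]), p[1] + t * (q[1] - p[1]))

    output = subject
    for i in range(len(clip)):
        a, b = clip[i], clip[(i + 1) % len(clip)]
        input_list, output = output, []
        for j in range(len(input_list)):
            p, q = input_list[j], input_list[(j + 1) % len(input_list)]
            if inside(q, a, b):
                if not inside(p, a, b):
                    output.append(intersect(p, q, a, b))
                output.append(q)
            elif inside(p, a, b):
                output.append(intersect(p, q, a, b))
        if not output:          # polygons are disjoint
            break
    return output

# Unit square clipped by a copy of itself shifted by (0.5, 0.5);
# the result is the overlapping square with corners (0.5, 0.5) and (1, 1).
square = [(0, 0), (1, 0), (1, 1), (0, 1)]
shifted = [(0.5, 0.5), (1.5, 0.5), (1.5, 1.5), (0.5, 1.5)]
print(clip_polygon(square, shifted))
```

Note how even this toy version needs an exact line-intersection computation; the integer-only coordinate policy of Boost.Polygon and Clipper exists precisely because such computations are hard to make robust in floating point.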

Statistics in Python

In this post I refer to sites that discuss statistics with hands-on examples in Python. I like learning theoretical concepts by coding, and Python is a great language for experimenting: it is easy to learn, free, and has great support for statistical computation.

The online book An introduction to statistics in Python, written by Thomas Haslwanter, aims to introduce basic statistical procedures to researchers who are not proficient in statistics. The author approaches this task by providing lots of plain Python code as well as IPython notebooks that readers can apply to their own data. The author stresses that for more serious statistical work, readers will need to dig into dedicated statistics books and literature.

The book assumes that readers have a reasonable knowledge of Python but are not statistics experts. If you are relatively new to scientific computing in Python, I recommend downloading the Anaconda distribution, which has all the packages needed for trying out the code in this book as well as many more useful packages, a convenient package manager and good support for virtual environments.

The site “Computational Statistics in Python” focuses on presenting Python tools for computational statistics. Beginning with a crash course on Python and proceeding to present many related tools, the site adopts a hands-on approach by providing lots of relevant Python code. The site promotes using the IPython notebook and discusses integrating Python code with code in other languages such as R, MATLAB, Octave, Julia and C. The site demonstrates libraries such as numpy, scipy, NLTK, Pandas, h5py (for working with the HDF5 format), pystan and pymc. A great part of the site is devoted to different approaches for increasing computational efficiency, including interfacing with C/C++ code, converting Python to C with Cython and numba, writing parallel code and utilizing the GPU, distributed processing using Hadoop and Spark, and more. Specific computational areas discussed include linear algebra, linear systems, matrix decomposition, PCA, optimization techniques, Expectation Maximization (EM), Monte Carlo methods, re-sampling methods and Markov chain Monte Carlo. I must stress that this site is not meant to replace textbooks but rather to provide examples in Python for doing computations, assuming the reader knows the basic theory.
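As a taste of the re-sampling topic mentioned above, here is a sketch of a percentile bootstrap confidence interval for the mean. It uses only the standard library (the site itself works with numpy/scipy); the data and parameter values are illustrative.

```python
import random
import statistics


def bootstrap_ci(data, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for the mean of `data`."""
    rng = random.Random(seed)
    # Resample the data with replacement n_boot times and record each mean.
    means = sorted(
        statistics.mean(rng.choices(data, k=len(data)))
        for _ in range(n_boot)
    )
    # Take the empirical alpha/2 and 1 - alpha/2 percentiles.
    lo = means[int(n_boot * alpha / 2)]
    hi = means[int(n_boot * (1 - alpha / 2)) - 1]
    return lo, hi


sample = [2.1, 2.5, 2.2, 3.0, 2.8, 2.4, 2.9, 2.6, 2.3, 2.7]
lo, hi = bootstrap_ci(sample)
print(f"95% CI for the mean: [{lo:.2f}, {hi:.2f}]")
```

The same idea scales to statistics with no closed-form standard error (medians, correlations), which is where the re-sampling approach really pays off.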

The book Think Stats: Probability and Statistics for Programmers by Allen B. Downey is an introductory book on probability and statistics accompanied by data files and source code with examples and solutions to problems. The author provides Python code implementing a wide range of concepts, from the most basic procedures such as the mean or variance to more complicated ones such as representing the CDF and using it to generate random numbers with a given distribution, hypothesis testing, and so on.
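The CDF-based random number generation mentioned above can be sketched in a few lines of standard-library Python (this is my own toy illustration of the inverse-transform idea, not the book's code): build an empirical CDF from data, then push uniform random numbers through its inverse to draw new samples with approximately the same distribution.

```python
import bisect
import random


def empirical_cdf(data):
    """Return the sorted data and a function computing the empirical CDF."""
    xs = sorted(data)
    n = len(xs)

    def cdf(x):
        # Fraction of data points <= x.
        return bisect.bisect_right(xs, x) / n

    return xs, cdf


def sample_from_cdf(xs, rng):
    # Inverse transform: a uniform u in [0, 1) picks the u-quantile of xs.
    u = rng.random()
    return xs[int(u * len(xs))]


data = [1, 1, 2, 2, 2, 3, 5, 8]
xs, cdf = empirical_cdf(data)
rng = random.Random(42)
draws = [sample_from_cdf(xs, rng) for _ in range(1000)]
print("P(x <= 2) from CDF:", cdf(2))                     # 5/8 = 0.625
print("fraction of draws <= 2:", sum(d <= 2 for d in draws) / 1000)
```

The fraction of generated draws below any threshold converges to the CDF value at that threshold, which is exactly the property the book exploits.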

Bayesian methods for hackers

The online book Probabilistic Programming & Bayesian methods for hackers, written by Cam Davidson-Pilon and many others, describes itself as “An intro into Bayesian methods and probabilistic programming from a computation/understanding-first, mathematics-second point of view”. It is written in a friendly, easy-to-follow manner as a collection of IPython notebooks, one per chapter. The interested reader is free to download the notebooks and play with the provided methods.

Hyper-parameter optimization

Many learning procedures require the specification of hyper-parameters. The common practice is to use a validation set to measure the performance of the learning method and then invoke an optimization procedure to select good values for the hyper-parameters. A traditional approach is to cover the range of the hyper-parameters with a grid, fit a learning model at each point of the grid, and select the value for which the error on the validation set is minimal. Here are some recent interesting posts on hyper-parameter optimization:
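The grid-search procedure just described can be sketched in a few lines. Here a toy quadratic stands in for the real train-then-validate loop (the function, the optimum at 0.3, and the grid are illustrative assumptions, not from any of the posts below):

```python
def validation_error(lam):
    # Stand-in for: train a model with hyper-parameter `lam`,
    # then measure its error on a held-out validation set.
    return (lam - 0.3) ** 2 + 0.05


def grid_search(error_fn, grid):
    """Evaluate the validation error at every grid point, keep the best."""
    best = min(grid, key=error_fn)
    return best, error_fn(best)


grid = [i / 10 for i in range(11)]          # 0.0, 0.1, ..., 1.0
best_lam, best_err = grid_search(validation_error, grid)
print(best_lam, best_err)                   # picks the grid point nearest 0.3
```

With several hyper-parameters the grid becomes a Cartesian product and the number of model fits grows exponentially, which motivates the alternatives discussed below.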

  • James Bergstra and Yoshua Bengio suggested using random search for tuning the hyper-parameters. This procedure is capable of finding models that are as good as models computed by a grid search, at a fraction of the computational cost. A simple explanation of this good performance is given in this nice post by Alice Zheng, which also describes other tuning strategies.
  • Daniel Saltiel argues that random search is no better than grid search and proposes using Bayesian optimization. The author presents the method at a conceptual level and points to two open-source libraries with an IPython notebook example.
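Random search is an even smaller change to the grid-search recipe: sample hyper-parameter values uniformly from their range instead of on a grid. The sketch below uses the same illustrative toy validation-error function as above (my own stand-in, with an assumed optimum at 0.3); with the same budget of evaluations, the best random point is typically about as good as the best grid point.

```python
import random


def validation_error(lam):
    # Stand-in for training a model with hyper-parameter `lam` and
    # measuring its error on a held-out validation set.
    return (lam - 0.3) ** 2 + 0.05


def random_search(error_fn, n_trials, rng):
    """Sample n_trials candidate values uniformly, keep the best."""
    candidates = [rng.uniform(0.0, 1.0) for _ in range(n_trials)]
    best = min(candidates, key=error_fn)
    return best, error_fn(best)


best_lam, best_err = random_search(validation_error, 11, random.Random(0))
print(best_lam, best_err)
```

The intuition from the Bergstra-Bengio paper is that when only a few of the hyper-parameters really matter, random points cover each individual dimension much more densely than a grid with the same number of trials does.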