MATH4SL Data Mining, Inference and Prediction
Recent years have witnessed an explosion in the quantity and variety of data available, creating computational, mathematical and statistical challenges. The tools being developed to understand these data blur the boundaries between different disciplines. The goal of this paper is to provide an introduction to these tools, or more correctly, to the concepts and mathematics underlying them.
The paper will be built on a selection of topics from chapters 3-7 & 11 of
Hastie, T., Tibshirani, R. and Friedman, J. (2001) The Elements of Statistical Learning: Data Mining, Inference, and Prediction. Springer.
which can be downloaded for free from here.
We will also discuss neural networks, PAC learning and dimension reduction (material taken from various sources). A provisional list of topics:
- Introduction and background
- Linear models
- Linear classifiers
- Basis expansions and regularisation
- Kernel Methods
- Neural networks
- Model assessment and selection (statistical learning)
- PAC learning
- Dimension reduction and concentration inequalities
This paper is designed primarily for mathematicians who may or may not have much formal training in statistics or probability beyond what you might find in a first-year paper. Some experience with programming, particularly in R, will help you make use of the techniques introduced in the paper, but it is not required.
Prof David Bryant, Mathematics and Statistics, Room 514.
Two written assignments (which I plan on making available as soon as possible), each worth 20% of the grade. One three-hour exam, worth 60% of the grade.
There will be (a maximum of) 20 one-hour lectures, starting in week two of the semester, and finishing in week 39. There will be no lectures in week 34 (the week before the break).
Some of these lectures will be held in a computer lab, and may need to be rescheduled because of that (we’ll let you know).
- The text by Hastie et al., and a companion (more statistical, less mathematical) book, An Introduction to Statistical Learning by James, Witten, Hastie and Tibshirani. Google can help you find full solutions for the exercises.
- There are slides and videos on this material available from the book website.
- There are buckets of online texts and books available.
THE BIG PROBLEM
You have an unknown function f that takes inputs X and returns outputs Y. (e.g. X = email, Y = spam or not spam)
You are given a training sample (X1,Y1), …, (Xn,Yn), where each Yi = f(Xi) with possible noise/error.
Can you predict f(X) for future values of X?
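As a rough illustration of this setup, here is a minimal Python sketch. Everything in it is an assumption made for the example: the "unknown" function f is taken to be a straight line, the noise is Gaussian, and the predictor is an ordinary least-squares line fit. The names make_training_sample, fit_line and predict are hypothetical helpers, not part of any library or of the course material.

```python
import random

def f(x):
    # Hypothetical "unknown" function. In a real problem we never see f,
    # only the noisy training pairs it produces.
    return 2.0 * x + 1.0

def make_training_sample(n, noise_sd, rng):
    # Draw inputs X_i uniformly from [0, 1] and set Y_i = f(X_i) + noise.
    xs = [rng.uniform(0.0, 1.0) for _ in range(n)]
    ys = [f(x) + rng.gauss(0.0, noise_sd) for x in xs]
    return xs, ys

def fit_line(xs, ys):
    # Least-squares fit of y = intercept + slope * x (closed form).
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return intercept, slope

def predict(model, x):
    # Predicted output for a new input x under the fitted line.
    intercept, slope = model
    return intercept + slope * x

rng = random.Random(0)                      # fixed seed for reproducibility
xs, ys = make_training_sample(200, 0.1, rng)
model = fit_line(xs, ys)

x_new = 0.5                                 # a "future" input
print("prediction:", predict(model, x_new))
print("true value:", f(x_new))
```

Here the fitted line happens to have the same form as f, so the prediction lands close to the true value. When the form of f is not known, choosing and assessing the model becomes the hard part of the problem.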