Machine Learning My Way (MLMW) and Models, Data, Tools and Productisation (MDTP) notes
MLMW and MDTP are two collections of notes that describe the world of machine learning from the following angles:
- modelling concepts and techniques (the “formulas”),
- data analysis and control systems workflows (the “pipelines”),
- data acquisition, storage and operations (the “data”),
- computer science and software tools (the “code”),
- model and data productisation (the “money”),
- investor sentiment, society impacts and regulation (the “markets”).
Both collections are still incomplete, yet equipped with links to textbooks, reports, summary articles, industry cases and exercises. Only a limited number of passive learning resources, like videos or tutorials, made it to the lists.
Get access and download
The collections are available as a topic list, a longer guide and a website.
| Artifact | Intent | Link |
|---|---|---|
| MDTP topic list | A slim list of topics I wish I knew well (models, tools and productisation) | Access granted upon request. |
| The MLMW guide | A collection of topics with links and quotes. | Earlier public PDF or view online upon request. |
| Website | Best of MLMW | Browse at https://trics.me. |
Changelog
Subscribe for MLMW guide updates:
Beginner track for machine learning
Two books and a video course
| MML for math (2020) | ISLP for classic ML (2023) | DLS for deep learning (2024) |
|---|---|---|
Two free textbooks for math and classic machine learning (ML) and a video course on deep learning (DL) make a solid entry track for beginners:
- Mathematics for Machine Learning (MML) by Marc Peter Deisenroth, A. Aldo Faisal and Cheng Soon Ong;
- Introduction to Statistical Learning with Python (ISLP) by Gareth James, Daniela Witten, Trevor Hastie and Rob Tibshirani;
- Deep Learning Specialization by Andrew Ng.
Links to other textbooks and supplementary materials are provided below.
Roadmap
Math ML DL Subfields and data types
===== ======= ================= ============================
+------------------------> Tabular data and time series
|
MML -> ISLP -> deeplearning.ai -> Text and speech (NLP)
(free) (free) Deep Learning Transformers (the T in ChatGPT)
| Specialisation Computer vision (CV)
| + any of Reinforcement learning (RL)
| 3 free textbooks
|
Practical manuals:
- scipy lectures (free)
- Muller (paid), Geron (paid) or Burkov (free preview)
Python packages
Math:       ML:              DL:
- numpy     - scikit-learn   - torch
- scipy                      - tf
                             - keras
Prerequisites
You will need a working knowledge of Python and the ability to work with mathematical concepts and notation from linear algebra and calculus.
Core path
- Check your math knowledge with Mathematics for Machine Learning (MML) Part 1.
- Read chapter 8, “When Models Meet Data”, in MML for an introduction to machine learning.
- Proceed to the Introduction to Statistical Learning with Python (ISLP) textbook.
- Read the scikit-learn documentation about neural network models.
- Start the Andrew Ng Deep Learning Specialization.
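As a bridge between the scikit-learn step and the deep learning step, here is a minimal sketch (not from the guide; dataset and hyperparameters are illustrative) of scikit-learn's small neural network model, `MLPClassifier`, on a toy two-class dataset:

```python
# Fit scikit-learn's small neural network on the "two moons" toy dataset.
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

# A noisy, non-linearly separable toy dataset.
X, y = make_moons(n_samples=500, noise=0.2, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# One hidden layer of 16 units; max_iter raised so training converges.
clf = MLPClassifier(hidden_layer_sizes=(16,), max_iter=2000, random_state=0)
clf.fit(X_train, y_train)
print("test accuracy:", round(clf.score(X_test, y_test), 2))
```

Frameworks like PyTorch express the same idea with far more control over layers and training, which is what the Deep Learning Specialization then covers in depth.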
Reference texts
There are textbooks denser than ISLP or the Andrew Ng course; you can use them as references.
For classic machine learning they are Bishop (2006) and Murphy (2022).
For deep learning there are several open textbooks:
- Ian Goodfellow, Yoshua Bengio and Aaron Courville (2016). Deep Learning Book (DLB)
- Aston Zhang, Zack Lipton, Mu Li and Alex Smola (2021). Dive into Deep Learning (d2l).1
- Simon Prince (2023). Understanding Deep Learning (UDL).
DLB (2016) is a reference text that enjoys a continuous stream of citations, while d2l and UDL are newer and keep updating their code and content.
> This open-source book represents our attempt to make deep learning approachable, teaching readers the concepts, the context, and the code. The entire book is drafted in Jupyter notebooks, seamlessly integrating exposition figures, math, and interactive examples with self-contained code. Our goal is to offer a resource that could (i) be freely available for everyone; (ii) offer sufficient technical depth to provide a starting point on the path to actually becoming an applied machine learning scientist; (iii) include runnable code, showing readers how to solve problems in practice; (iv) allow for rapid updates, both by us and also by the community at large; (v) be complemented by a forum for interactive discussion of technical details and to answer questions. [arxiv abstract]
What else?
You can supplement the core path above with the following:
- introductory courses on probability (like P4D or Seeing Theory) and statistics;
- scipy lectures, an underappreciated resource by the authors of the foundational scikit-learn package themselves;
- Python Data Science Handbook, including a chapter on machine learning, and the DSML textbook (popular in Asia);
- practical books that combine machine learning concepts and programming practice: either of Müller (paid edition), Géron (paid edition) or Burkov (free preview);
- code collections like ML-From-Scratch or Kaggle competitions (note that Kaggle is owned by Google, hence the emphasis on TensorFlow, not PyTorch);
- major textbooks by subfield of machine learning: Jurafsky and Martin for natural language processing (NLP), several texts for computer vision (CV), and Sutton and Barto for reinforcement learning (RL);
- for the popular attention and transformer architectures, check out talks by Andrej Karpathy and his NanoGPT sample code;
- approaches to data modelling in econometrics (e.g. chapters 1, 2, 4 and 5 from Econ 1630);
- data analysis vocabulary as endorsed by industry leaders like Google, Mathworks, H2O and NVIDIA;
- last but not least, watch StatQuest videos by Josh Starmer and 3Blue1Brown videos by Grant Sanderson.
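In the spirit of from-scratch code collections like ML-From-Scratch (this sketch is illustrative, not taken from that repo), even linear regression fitted by plain gradient descent is a useful exercise:

```python
# Linear regression by gradient descent on the mean squared error,
# using numpy only. True parameters: w = 3.0, b = 0.5.
import numpy as np

rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(200, 1))
y = 3.0 * X[:, 0] + 0.5 + rng.normal(scale=0.05, size=200)

w, b, lr = 0.0, 0.0, 0.1
for _ in range(500):
    err = w * X[:, 0] + b - y          # residuals of current fit
    w -= lr * 2 * np.mean(err * X[:, 0])  # dMSE/dw
    b -= lr * 2 * np.mean(err)            # dMSE/db

print(round(w, 1), round(b, 1))  # recovers values close to 3.0 and 0.5
```

Writing the gradient updates by hand like this is exactly what the from-scratch repositories drill, before libraries hide the derivatives behind `.fit()` or autograd.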
Python packages
Tabular data:
- pandas.
Numeric computation:
- numpy and scipy.
Visualisation:
- matplotlib, but also Plotly, Bokeh and others.
Machine learning:
- scikit-learn – see also Gaël Varoquaux 2023 interview.
Deep learning:
- PyTorch (most advanced),
- TensorFlow (most cited), and
- Keras (easier interface to learn).
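To give a feel for the PyTorch style (a minimal sketch, assuming PyTorch is installed; the layer sizes are arbitrary), here is the kind of small two-layer network that Keras would express with a similarly short `Sequential` model:

```python
# A tiny feed-forward network and one forward pass in PyTorch.
import torch
import torch.nn as nn

model = nn.Sequential(
    nn.Linear(4, 16),  # 4 input features -> 16 hidden units
    nn.ReLU(),
    nn.Linear(16, 2),  # 2 output classes (logits)
)

x = torch.randn(8, 4)   # a batch of 8 samples with 4 features each
logits = model(x)
print(logits.shape)     # torch.Size([8, 2])
```

The three frameworks differ mostly in how much of the training loop they manage for you: Keras hides it behind `fit()`, while PyTorch expects you to write it out.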
Does reading these materials make you a machine learning engineer?
Not until you deliver projects for real tasks on real data with real constraints (which are quite different from textbook examples).
Not in scope
This page makes no recommendations on several other skills that are also important for a quantitative modeller or an engineer:
- Python programming, Linux and cloud computing;
- data processing, pipelines and model productisation;
- experiment design and iterative workflows;
- advanced topics in statistics and machine learning;
- modelling methods outside machine learning;
- domain knowledge, business sense and outcomes of ML adoption.
Please refer to the larger MLMW guide for coverage of these topics.
Interviews
randomlyCoding on production pipelines, engineering skills and job roles
randomlyCoding, a head of AI at a startup who has been working in the field for over a decade: “I certainly don’t know everything, but I like to get my feet wet and touch on anything I find interesting. I’ve trained ML models to do all sorts of tasks and will likely have at least heard of most things.”
MLMW: Can one summarize a production pipeline as the following: choosing a business and then a modelling hypothesis – dataset – model selection – training – validation – model rollout, followed by business metrics? What are the weak links in this process, and where may a pipeline break?
randomlyCoding: In general that’s about on point. I’d say there’s certainly a lot more recursion. For example, you might pick a dataset, build a model and train it, only to realize you’re massively overfitting because you don’t have enough data – thus you go looking for a bigger additional dataset. Weak links often occur at either end of the process: you pick a dataset that isn’t suited to your problem and thus end up with a solution that solves a problem you weren’t trying to solve, or the model is 100% perfect but the business case requires inference to happen in real time and it takes 20 minutes given the size of the model. I’ve also seen cases of trying to extend a model to do more than it was initially designed for. This isn’t always a bad idea, but if the person leading this doesn’t understand the underlying model, there can often be misalignment between their expectations and reality.
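The overfitting weak link described above is typically caught by a held-out validation set. A hedged sketch (dataset and model are purely illustrative, not from the interview): with too few samples, training accuracy looks perfect while validation accuracy reveals the problem.

```python
# Train a flexible model on too little data and compare train vs
# validation accuracy: a large gap signals overfitting.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Only 60 samples with 20 features: easy to memorise, hard to generalise.
X, y = make_classification(n_samples=60, n_features=20, n_informative=3,
                           random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.5,
                                                  random_state=0)

model = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
print("train:", model.score(X_train, y_train))            # perfect fit
print("val:  ", round(model.score(X_val, y_val), 2))      # noticeably lower
```

Seeing this gap is exactly the point at which, as randomlyCoding says, you go looking for more data (or a simpler model).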
MLMW: What skills would you expect an ML engineer (MLE) to know? How can a decent econometrician upgrade to an MLE?
randomlyCoding: I would expect any ML engineer to know one of the three Python packages that are the core of most ML processes (either pytorch, tensorflow or keras), but on top of that I’d expect familiarity with some domain-specific packages: that might be NLTK if you’re working on natural language processing; it might be scikit-learn if you’re looking at random forests. One thing I would say is usually a must is familiarity with Linux and a cloud provider (AWS, GCP, Azure). You don’t need to know all three cloud providers (pick AWS if you don’t know any yet – it has 50% market share), but if you don’t know any of them it’ll be harder to onboard you and your first few weeks would be a lot more overwhelming. Even knowing a different one from the one you use at a specific job will help, as they all have similar functionality.
MLMW: Who puts an ML model into production? You got the weights after training, the validation stage passed OK – does it then always become a small API? Who wraps a notebook into an API, a designated engineer?
randomlyCoding: Who puts the ML model into production can vary depending on the system in use; it’s often an API, but not always. I would expect any ML engineer to at least be able to put together a notebook (or similar) that can be used to run inference on the model. In some cases, if the organisation is small enough, it will be someone who has directly worked on the model; in other cases they may be using a specific orchestration package that abstracts away this process; in yet other cases it could be hidden behind a message broker. Obviously not all ML models need to be hosted all the time; some are run periodically and might not require anything more than ingesting a CSV file into a single Python script.
MLMW: Does a full-stack data scientist role still exist?
randomlyCoding: I think the full-stack data scientist role does still exist; it will always exist as long as there are start-ups that have limited budgets and big ideas. If you’re in a larger team, your remit will often be constrained to a specific task, but depending on the organisation you’re within, that task could change regularly (e.g. today you’re handling data ingestion because the model we’re working on is a transformer and you don’t have much experience with transformers, but tomorrow we’re building a reinforcement learning system and you’re the team’s expert in RL). In most teams I’d expect the architect of the model to also do a fair amount of the modelling itself; anyone doing modelling will have to work closely with the data engineers, etc. I think this means the roles aren’t as well defined as in SWE, and I think this is because there’s a lot of trial and error in ML, so it’s not as simple as, for example, ingest the data and pass the process on.
MLMW: What is the most unexpected case of a model you thought would not work, but it did?
randomlyCoding: Diffusion feels like it shouldn’t work. In general it’s a multi-step process of removing noise from an image until you end up with the image without any noise; but to do that you start with a completely noisy image, predict a small percentage of the noise that was added (the previous step of noise added), and then subtract that noise from the image. The maths behind it is reasonably simple, but it just feels like it shouldn’t work!
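The intuition above can be seen in a toy numpy sketch (this is not a real diffusion model: noise is added additively rather than variance-preserving, and the “predicted” noise is exact rather than a neural network’s approximation):

```python
# Toy reverse diffusion: pile noise onto a "clean image", then remove
# the noise one step at a time, as if a model predicted each step.
import numpy as np

rng = np.random.default_rng(0)
clean = rng.uniform(size=(8, 8))   # stand-in for a clean 8x8 image

steps = 5
noises = [rng.normal(scale=0.3, size=clean.shape) for _ in range(steps)]
noisy = clean + sum(noises)        # forward process: fully noised image

# Reverse process: subtract the per-step noise in reverse order.
# A real model only estimates each step; here the prediction is perfect.
for eps in reversed(noises):
    noisy = noisy - eps

print(np.allclose(noisy, clean))   # True: the clean image is recovered
```

The surprise of real diffusion models is that a network can learn to predict each noise increment well enough for this subtraction loop to produce coherent images from pure noise.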
Video series
MLMW is inherently a text-format guide (books, articles, code) with an exception for these quality videos and podcasts.
- StatQuest with Josh Starmer – this man is a genius, look for ‘BAM’ moments.
- 3Blue1Brown by Grant Sanderson – very high quality and pedagogically sound content.
- Machine Learning Street Talk – as suggested by an MLMW reader: “Sometimes a bit too dense for absolute beginners but really good. They list resources, papers and books.”
Probability and statistics
Beginner
- Seeing Theory visual textbook.
- Probability distributions.
Courses
Reference
- Chapter 6 “Probability and Distributions” in the MML book.
- P4D.
Advanced
- Aubrey Clayton reading of The Logic of Science by E.T. Jaynes.