Wednesday, May 11, 2022

Deep Learning (DL)

Deep Learning requires much more of an ARCHITECT mindset than traditional Machine Learning.


In a sense, the feature engineering work has been moved to the design of very specialized computational blocks built from smaller units (LSTM, convolutional, embedding, fully connected, …).

I always advise starting with a simple net when architecting a model so that you can build your intuition. Jumping right away into a Transformer model may not be the best way to start.


DL is very powerful for multi-modal input data: time series, tabular data, text data, and image data.

One approach is to encode each of those different data types into a vector and feed the result into a logistic regression (LogReg) or a linear regression (LR) head (or add more fully connected layers for non-linearity), depending on whether you need to perform classification or regression.

When developing a simple model, start with a low-capacity network and increase the complexity little by little to reduce the bias, while adding regularization to keep the variance low.
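To make that concrete, here is a minimal sketch, assuming PyTorch and an already-encoded 128-dimensional feature vector (the dimension, layer sizes, and dropout rate are placeholders, not prescriptions): a single linear layer is your LR/LogReg head, and a couple of fully connected layers with dropout is one way to add capacity while keeping the variance under control.

import torch
import torch.nn as nn

feature_dim = 128  # assumed size of the already-encoded feature vector

# Regression head: plain linear regression on top of the encoded vector.
regression_head = nn.Linear(feature_dim, 1)

# Classification head: logistic regression = linear layer producing logits
# (pair it with BCEWithLogitsLoss, or apply a sigmoid at inference).
classification_head = nn.Linear(feature_dim, 1)

# Higher-capacity version: fully connected layers for non-linearity,
# plus dropout as regularization to keep the variance in check.
deeper_head = nn.Sequential(
    nn.Linear(feature_dim, 64),
    nn.ReLU(),
    nn.Dropout(p=0.2),
    nn.Linear(64, 1),
)

x = torch.randn(32, feature_dim)        # batch of 32 encoded samples
y_reg = regression_head(x)              # regression output
y_clf_logits = classification_head(x)   # classification logits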

A conv layer is meant to learn local correlations. 

Multiple successive blocks of conv and pooling layers allow the network to learn correlations at multiple scales, and they can be used on image data (conv2d), text data (text is just a time series of categorical variables), or time series (conv1d).
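Here is a minimal sketch of such a stack, assuming PyTorch and a 1-channel time series (kernel sizes and channel counts are arbitrary placeholders): each pooling step halves the temporal resolution, so the next conv block sees correlations over a longer span.

import torch
import torch.nn as nn

# Two conv + pooling blocks on a 1-channel time series.
# Each MaxPool1d halves the temporal resolution, so the second conv block
# captures correlations over roughly twice the span of the first one.
encoder = nn.Sequential(
    nn.Conv1d(in_channels=1, out_channels=16, kernel_size=3, padding=1),   # local correlations
    nn.ReLU(),
    nn.MaxPool1d(kernel_size=2),                                           # coarser time scale
    nn.Conv1d(in_channels=16, out_channels=32, kernel_size=3, padding=1),  # longer-range correlations
    nn.ReLU(),
    nn.MaxPool1d(kernel_size=2),
)

x = torch.randn(8, 1, 128)   # (batch, channels, time steps)
print(encoder(x).shape)      # torch.Size([8, 32, 32])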

For example, you can encode an image using a series of conv2d and pooling layers like in VGG (https://lnkd.in/g6Jp6NmD, https://lnkd.in/gDjUGWFE).
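A VGG-flavored sketch of that idea, assuming PyTorch (much smaller than the real VGG-16, and the channel widths are placeholders), looks like this:

import torch
import torch.nn as nn

# The VGG pattern: stacks of 3x3 convs followed by 2x2 max pooling,
# then a collapse of the spatial dimensions into a fixed-size image vector.
image_encoder = nn.Sequential(
    nn.Conv2d(3, 32, kernel_size=3, padding=1), nn.ReLU(),
    nn.Conv2d(32, 32, kernel_size=3, padding=1), nn.ReLU(),
    nn.MaxPool2d(2),                       # 64x64 -> 32x32
    nn.Conv2d(32, 64, kernel_size=3, padding=1), nn.ReLU(),
    nn.Conv2d(64, 64, kernel_size=3, padding=1), nn.ReLU(),
    nn.MaxPool2d(2),                       # 32x32 -> 16x16
    nn.AdaptiveAvgPool2d(1),               # collapse spatial dims
    nn.Flatten(),                          # -> image vector of length 64
)

img = torch.randn(8, 3, 64, 64)            # (batch, channels, height, width)
print(image_encoder(img).shape)            # torch.Size([8, 64])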


You can encode text data using an embedding (pretrained, obviously: https://lnkd.in/gt5N-i6R) followed by a couple of conv1d layers, and you can encode a time series using a series of conv1d and pooling layers.
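A sketch of the text encoder, assuming PyTorch; pretrained_vectors stands in for a real pretrained embedding matrix (GloVe, word2vec, …) and is random here only so the code runs. The time series encoder is the same idea without the embedding layer.

import torch
import torch.nn as nn

vocab_size, embed_dim = 10_000, 50
pretrained_vectors = torch.randn(vocab_size, embed_dim)  # placeholder for real pretrained vectors

class TextEncoder(nn.Module):
    def __init__(self):
        super().__init__()
        self.embedding = nn.Embedding.from_pretrained(pretrained_vectors, freeze=True)
        self.conv = nn.Sequential(
            nn.Conv1d(embed_dim, 64, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool1d(2),
            nn.Conv1d(64, 64, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveMaxPool1d(1),        # collapse the sequence dimension
        )

    def forward(self, token_ids):           # (batch, seq_len) of token indices
        x = self.embedding(token_ids)        # (batch, seq_len, embed_dim)
        x = x.transpose(1, 2)                # Conv1d wants (batch, channels, seq_len)
        return self.conv(x).squeeze(-1)      # (batch, 64) text vector

tokens = torch.randint(0, vocab_size, (8, 40))   # batch of 8 sequences of 40 token ids
print(TextEncoder()(tokens).shape)               # torch.Size([8, 64])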

I advise avoiding LSTM layers whenever possible.

Their iterative, sequential computation doesn't allow for good parallelism, leading to very slow training (even with the cuDNN-optimized LSTM).

For text and time series, ConvNets are much faster to train, as they make use of matrix-computation parallelism, and they tend to perform on par with LSTM networks (https://lnkd.in/g-6Z6qCN).

One reason Transformers became the leading building block for text learning tasks is their superior parallelism compared to LSTMs, which makes training on realistically much bigger data sets possible.

In general, it is not too hard to train on multi-modal data. As a simple example:

- time series vector = Pool1d(Conv1d(Pool1d(Conv1d(time series))))
- image vector = Pool2d(Conv2d(Pool2d(Conv2d(image data))))
- text vector = Pool1d(Conv1d(Pool1d(Conv1d(Embedding(text data)))))
- tabular data vector = FC(FC(tabular data))
=> X = FC(FC(concat(time series vector, tabular data vector, text vector, image vector)))
The nice thing with DL is that you can train on multiple targets at once using multiple target heads: Y_1 ~ LR(X), Y_2 ~ LogReg(X). A sketch of this wiring is shown below.
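Here is a hedged PyTorch-style sketch of that wiring, assuming the per-modality encoders above already produce fixed-size vectors (all dimensions below are placeholders): torch.cat plays the role of the implicit concatenation, and both target heads share the fused representation X.

import torch
import torch.nn as nn

class MultiModalNet(nn.Module):
    def __init__(self, ts_dim=32, img_dim=64, txt_dim=64, tab_dim=16, fused_dim=64):
        super().__init__()
        self.tabular_encoder = nn.Sequential(        # FC(FC(tabular data))
            nn.Linear(10, 32), nn.ReLU(), nn.Linear(32, tab_dim), nn.ReLU()
        )
        self.fuse = nn.Sequential(                   # X = FC(FC(concat(...)))
            nn.Linear(ts_dim + img_dim + txt_dim + tab_dim, 128), nn.ReLU(),
            nn.Linear(128, fused_dim), nn.ReLU(),
        )
        self.head_regression = nn.Linear(fused_dim, 1)      # Y_1 ~ LR(X)
        self.head_classification = nn.Linear(fused_dim, 1)  # Y_2 ~ LogReg(X), as logits

    def forward(self, ts_vec, img_vec, txt_vec, tabular):
        tab_vec = self.tabular_encoder(tabular)
        x = self.fuse(torch.cat([ts_vec, tab_vec, txt_vec, img_vec], dim=1))
        return self.head_regression(x), self.head_classification(x)

# The two heads can be trained jointly by summing their losses, e.g.
# loss = mse(y1_pred, y1) + bce_with_logits(y2_pred, y2).
model = MultiModalNet()
y1, y2 = model(torch.randn(8, 32), torch.randn(8, 64), torch.randn(8, 64), torch.randn(8, 10))
print(y1.shape, y2.shape)   # torch.Size([8, 1]) torch.Size([8, 1])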
