January, 2022: The beginning
2022-01-28
Goals for the remaining 3 days of the month:
- compile list of papers to read in resource list & order chronologically
- get google colab set up with github
- finish 4.1 and 4.2 practicum on RNNs and CNNs from the NYU course
2022-01-29
- watched NYU 4.1. Good intro to thinking about images, audio, etc. as natural signals. The neural-network experiment at the end also reinforced the importance of locality to convolutions extremely well.
- First pass through the [Bengio ‘94] paper on vanishing gradient problem in RNNs (day 1/3)
- the proof seems to be based on the size of an open ball in some space growing with every time step
- didn’t quite grok the usefulness of defining “basins” and the pseudo-basins
- re-read tomorrow: focus on understanding “latching” and how some non-gradient-descent methods attempt to overcome the vanishing/exploding gradient
- In one of the NYU lectures, Yann mentioned the fundamental premise of this paper was disproven. Need to figure out why this is.
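- To convince myself of the mechanics, a quick numeric sketch (my own, not from the paper): for a linear recurrence \(h_t = W h_{t-1}\), backprop multiplies the gradient by \(W^\top\) once per step, so its norm scales roughly like the spectral radius of \(W\) raised to the sequence length: below 1 it vanishes, above 1 it explodes.

```python
import numpy as np

def grad_norm(rho, n=20, T=50):
    """Norm of a gradient backpropagated through T steps of the linear
    recurrence h_t = W h_{t-1}; each step multiplies the gradient by W.T."""
    rng = np.random.default_rng(0)
    W = rng.standard_normal((n, n)) / np.sqrt(n)
    W *= rho / np.abs(np.linalg.eigvals(W)).max()  # rescale spectral radius to rho
    g = np.ones(n)  # gradient of some loss w.r.t. h_T
    for _ in range(T):
        g = W.T @ g
    return np.linalg.norm(g)

print(grad_norm(0.9))  # spectral radius < 1: the norm decays geometrically
print(grad_norm(1.1))  # spectral radius > 1: the norm grows geometrically
```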
2022-01-30
- watched NYU 4.2. Introduced different types of RNN setups for problems, e.g. neural translation, image captioning, etc.
- Second pass through the [Bengio ‘94] paper on vanishing gradient problem in RNNs (day 2/3)
- Supplementary resource: Yann talks about this paper in the middle of NYU 3L
- Understood more today than I did during the first pass, but still confused about some of the arguments in the paper
- Main thesis: training recurrent neural networks is hard, because over long sequences gradients either explode or vanish to 0. When they vanish, we can’t learn long-term dependencies; when they explode, we can’t store stable memory.
- Yann’s remark: this is wrong; we now deal with gradient explosion by gating the gradient flow, e.g. in GRUs and LSTMs. Also, there doesn’t have to be a fixed stable state for us to have “memory”.
- Parts that are still unclear to me:
    - Most of the proofs for the theoretical parts of the paper. For example, several of the definitions reference a differentiable map \(M\) (e.g. invariance, hyperbolic attractor). In the definition of a hyperbolic attractor, Bengio writes “all eigenvalues of \(M'(a)\) are less than 1”, which makes it seem like \(M: \mathbb{R}^n \rightarrow \mathbb{R}^m\) with \(m > 1\). However, in theorem one, Bengio assumes \(|M'(z)| > 1\). Is he assuming \(M'(z)\) produces a scalar and taking the absolute value? Or do the vertical bars here represent a matrix determinant? Or is this a miswritten norm operation?
    - The paper mentions “trainable inputs” many times. I’m guessing this refers to the output of a black-boxed/oracle network that transforms the input sequence, to be then fed into our recurrent neuron, but I’m not 100% sure on this.
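- To see Yann’s point about gating concretely, a toy sketch (my own construction, a fixed-gate stand-in for a real GRU/LSTM, not anything from the paper): adding an update gate \(z\) to a tanh RNN puts a \(z \cdot I\) term in every step’s Jacobian, so with \(z\) near 1 the backpropagated gradient decays far more slowly than in the vanilla network.

```python
import numpy as np

def bptt_norm(z, n=20, T=50):
    """Gradient norm after backpropagating T steps through the leaky/gated
    recurrence h_t = z*h_{t-1} + (1-z)*tanh(W h_{t-1}); z=0 recovers a
    vanilla tanh RNN. The scalar gate z is fixed, unlike a learned gate."""
    rng = np.random.default_rng(1)
    W = 0.5 * rng.standard_normal((n, n)) / np.sqrt(n)  # contractive weights
    h = rng.standard_normal(n)
    hs = []
    for _ in range(T):
        hs.append(h)
        h = z * h + (1 - z) * np.tanh(W @ h)
    g = np.ones(n)
    for h_prev in reversed(hs):
        d = 1.0 - np.tanh(W @ h_prev) ** 2      # tanh'
        g = z * g + (1 - z) * (W.T @ (d * g))   # J^T g, with J = z*I + (1-z)*diag(d) W
    return np.linalg.norm(g)

print(bptt_norm(0.0))   # vanilla tanh RNN: gradient all but vanishes
print(bptt_norm(0.95))  # gated: the z*I "highway" slows the decay dramatically
```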
2022-01-31
- Took a break to watch 3blue1brown’s Essence of Linear Algebra series on YouTube. It was very eye-opening to see all the things I learned (and should have learned) during my undergrad linear algebra class
- Going to put off a 3rd rereading of the Bengio paper until I either find more resources or get support, since I couldn’t think through some of the questions I posed
- TODO: replicate experiments described in Bengio ‘94 with a simple RNN in a colab notebook
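- A starting sketch for that TODO (my own construction, only loosely modeled on the paper’s latching experiment; `latch_batch` and all of its parameters are made up): the class label is carried only by the first few inputs, followed by a long stretch of noise that the RNN has to learn to ignore.

```python
import numpy as np

def latch_batch(batch=32, T=100, signal_len=3, noise=0.2, seed=None):
    """Toy version of the Bengio '94 latching task: the class (+1/-1) is
    encoded only in the first `signal_len` inputs; the remaining steps are
    pure noise, so a network must latch the early information to classify."""
    rng = np.random.default_rng(seed)
    y = rng.choice([-1.0, 1.0], size=batch)        # targets
    x = noise * rng.standard_normal((batch, T))    # background noise
    x[:, :signal_len] += y[:, None]                # class signal up front
    return x, y

x, y = latch_batch(seed=0)
print(x.shape, y.shape)  # (32, 100) (32,)
```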