January, 2022: The beginning
2022-01-28
Goals for the remaining 3 days of the month:
- compile list of papers to read in resource list & order chronologically
- get google colab set up with github
- finish 4.1 and 4.2 practicum on RNNs and CNNs from the NYU course
2022-01-29
- watched NYU 4.1. Good intro to thinking about images, audio, etc. as natural signals. The neural-network experiment at the end also reinforced the importance of locality to convolutions extremely well.
- First pass through the [Bengio ‘94] paper on vanishing gradient problem in RNNs (day 1/3)
- the proof seems to be based on the size of an open ball in some space growing with every time step
- didn’t quite grok the usefulness of defining “basins” and the pseudo-basins
- re-read tomorrow: focus on understanding “latching” and how some non-gradient-descent methods attempt to overcome the vanishing/exploding gradient
- In one of the NYU lectures, Yann mentioned the fundamental premise of this paper was disproven. Need to figure out why this is.
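- To convince myself of the mechanics, a quick numeric sketch (my own, not from the paper): for a linear recurrence \(h_t = W h_{t-1}\), backprop multiplies the gradient by \(W^\top\) once per step, so its norm scales roughly like the spectral radius of \(W\) raised to the sequence length: below 1 it vanishes, above 1 it explodes.

```python
import numpy as np

def grad_norm(rho, n=20, T=50):
    """Norm of a gradient backpropagated through T steps of the linear
    recurrence h_t = W h_{t-1}; each step multiplies the gradient by W.T."""
    rng = np.random.default_rng(0)
    W = rng.standard_normal((n, n)) / np.sqrt(n)
    W *= rho / np.abs(np.linalg.eigvals(W)).max()  # rescale spectral radius to rho
    g = np.ones(n)  # gradient of some loss w.r.t. h_T
    for _ in range(T):
        g = W.T @ g
    return np.linalg.norm(g)

print(grad_norm(0.9))  # spectral radius < 1: the norm decays geometrically
print(grad_norm(1.1))  # spectral radius > 1: the norm grows geometrically
```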
2022-01-30
- watched NYU 4.2. Introduced different types of RNN setups for problems, e.g. neural translation, image captioning, etc.
- Second pass through the [Bengio ‘94] paper on vanishing gradient problem in RNNs (day 2/3)
- Supplementary resource: Yann talks about this paper in the middle of NYU 3L
- Understood more today than I did during the first pass, but still confused about some of the arguments in the paper
- Main thesis: training recurrent neural networks is hard, because over long sequences gradients either explode or vanish to 0. When they vanish, we can’t learn long-term dependencies; when they explode, we can’t store stable memory.
- Yann’s remark: this is wrong; we now deal with gradient explosion by gating the gradient flow, e.g. in GRUs and LSTMs. Also, there doesn’t have to be a fixed stable state for us to have “memory”.
- Parts that are still unclear to me:
    - Most of the proofs for the theoretical parts of the paper. For example, several of the definitions reference a differentiable map \(M\) (e.g. invariance, hyperbolic attractor). In the definition of a hyperbolic attractor, Bengio writes “all eigenvalues of \(M'(a)\) are less than 1”, which makes it seem like \(M: \mathbb{R}^n \rightarrow \mathbb{R}^m\) with \(m > 1\). However, in theorem one, Bengio assumes \(|M'(z)| > 1\). Is he assuming \(M'(z)\) produces a scalar and taking the absolute value? Or do the vertical bars here represent a matrix determinant? Or is this a miswritten norm operation?
    - The paper mentions “trainable inputs” many times. I’m guessing this refers to the output of a black-boxed/oracle network that transforms the input sequence, to be then fed into our recurrent neuron, but I’m not 100% sure on this.
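- To see Yann’s point about gating concretely, a toy sketch (my own construction, a fixed-gate stand-in for a real GRU/LSTM, not anything from the paper): adding an update gate \(z\) to a tanh RNN puts a \(z \cdot I\) term in every step’s Jacobian, so with \(z\) near 1 the backpropagated gradient decays far more slowly than in the vanilla network.

```python
import numpy as np

def bptt_norm(z, n=20, T=50):
    """Gradient norm after backpropagating T steps through the leaky/gated
    recurrence h_t = z*h_{t-1} + (1-z)*tanh(W h_{t-1}); z=0 recovers a
    vanilla tanh RNN. The scalar gate z is fixed, unlike a learned gate."""
    rng = np.random.default_rng(1)
    W = 0.5 * rng.standard_normal((n, n)) / np.sqrt(n)  # contractive weights
    h = rng.standard_normal(n)
    hs = []
    for _ in range(T):
        hs.append(h)
        h = z * h + (1 - z) * np.tanh(W @ h)
    g = np.ones(n)
    for h_prev in reversed(hs):
        d = 1.0 - np.tanh(W @ h_prev) ** 2      # tanh'
        g = z * g + (1 - z) * (W.T @ (d * g))   # J^T g, with J = z*I + (1-z)*diag(d) W
    return np.linalg.norm(g)

print(bptt_norm(0.0))   # vanilla tanh RNN: gradient all but vanishes
print(bptt_norm(0.95))  # gated: the z*I "highway" slows the decay dramatically
```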
2022-01-31
- Took a break to watch 3blue1brown’s Essence of Linear Algebra series on YouTube. It was very eye-opening to see all the things I learned (and should have learned) during my undergrad linear algebra class
- Going to put off a 3rd rereading of the Bengio paper until I either find more resources or get support, since I couldn’t think through some of the questions I posed
- TODO: replicate experiments described in Bengio ‘94 with a simple RNN in a colab notebook
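- A starting sketch for that TODO (my own construction, only loosely modeled on the paper’s latching experiment; `latch_batch` and all of its parameters are made up): the class label is carried only by the first few inputs, followed by a long stretch of noise that the RNN has to learn to ignore.

```python
import numpy as np

def latch_batch(batch=32, T=100, signal_len=3, noise=0.2, seed=None):
    """Toy version of the Bengio '94 latching task: the class (+1/-1) is
    encoded only in the first `signal_len` inputs; the remaining steps are
    pure noise, so a network must latch the early information to classify."""
    rng = np.random.default_rng(seed)
    y = rng.choice([-1.0, 1.0], size=batch)        # targets
    x = noise * rng.standard_normal((batch, T))    # background noise
    x[:, :signal_len] += y[:, None]                # class signal up front
    return x, y

x, y = latch_batch(seed=0)
print(x.shape, y.shape)  # (32, 100) (32,)
```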