2022-01-28

Goals for the rest of the 3 days in the month:

  • compile list of papers to read in resource list & order chronologically
  • get google colab set up with github
  • finish 4.1 and 4.2 practicum on RNNs and CNNs from the NYU course

2022-01-29

  • watched NYU 4.1. Good intro to thinking about images, audio, etc. as natural signals. Experiment in neural networks at the end also reinforced importance of locality to convolutions extremely well.
  • First pass through the [Bengio ‘94] paper on vanishing gradient problem in RNNs (day 1/3)
    • proof seems to be based on concept of size of open ball in some space becoming bigger with every time step
    • didn’t quite grok usefulness of defining “basins” and the psuedo-basins
    • re-read tomorrow: focus on understanding “latching” and some of the non-gradient descent methods attempt to overcome vanishing/exploding gradient
    • In one of the NYU lectures, Yann mentioned the fundamental premise of this paper was disproven. Need to figure out why this is.

2022-01-30

  • watched NYU 4.2. Introduced different types of RNN setups for problems, e.g. neural translation, image captioning, etc.
  • Second pass through the [Bengio ‘94] paper on vanishing gradient problem in RNNs (day 2/3)
    • Supplementary resource: Yan talks about this paper in the middle of NYU 3L
    • Understood more today than I did during the first pass, but still confused about some of the arguments in the paper
    • Main thesis: training recurrent neural networks is hard, because gradients either explode over time in long sequences, or they vanish to 0. In the latter case, we can’t learn long-term dependencies. In the former, we can’t store stable memory.
    • Yann’s remark: this is wrong - we now deal with gradient explosion via gating gradient flow, e.g. GRU and LSTM. Also, there doesn’t have to be a fixed stable state for us to have “memory”.
    • Parts that are still unclear to me:
      • Most of the proofs for the theoretical parts of the paper. For example, several of the definitions reference a differentiable map \(M\) (e.g. invariance, hyperbolic attractor). In the definition of hyperbolic attractor, Bengio writes ‘all eigenvalues of \(M'(a)\) are less than 1’, which makes it seem like \(M: \mathbb{R}^n \rightarrow \mathbb{R}^m, m > 1\). However, in theorem one, Bengio assumes $$ M’(z) >1\(. Is he assuming\)M’(z)$$ produces a scalar and taking the absolute value? Or do his lines here represent a matrix determinant? Or is this a miswritten norm operation?
      • The paper mentions “trainable inputs” many times. I’m guessing this refers to the output of a black-boxed/oracle network that transforms the input sequence, to be then fed into our recurrent neuron, but not 100% sure on this.

2022-01-31

  • Took a break to watch 3blue1brown’s essence of linear algebra series on youtube. It was very eye-opening to see all the things I learned & should have learned during my linear algebra class in undergrad
  • Going to put off a 3rd rereading of the Bengio paper until I either have resources or support, since I couldn’t think through some of the questions I posed
  • TODO: replicate experiments described in Bengio ‘94 with a simple RNN in a colab notebook