2022-03-01

  • watched NYU 5L on truck-backing-up

2022-03-02

Busy day at work :(

  • First half of last lecture by Yann NYU 14L.
  • Seems like SSL tasks to train non-contrastive networks would be a good area of exploration
  • Also, continual learning/multi-task learning

2022-03-03

  • Second half of last lecture by Yann (link above), first half of last NYU practicum [Prediction and Planning under uncertainty]
  • Found it interesting that Yann had never met somebody that regretted getting a PhD

Start of Speech unit

2022-03-04

  • Graves ‘13 Speech Recognition with Deep Recurrent Neural Networks
    • Short paper, seems to be a follow-up to the previous RNN-T paper.
    • Tried out CTC and RNN-T architectures with multilayer BiRNNs
    • Evaluated on speech recognition on the TIMIT dataset, should be worth reproducing
    • Modifications to RNN-T: compared to the last paper, the outputs of the language model and the CTC model now get fed into another network
    • Graves: Better results if LM and CTC are both pretrained, fine-tuned
    • Inference done using beam search, not based on previous time steps (?). Not sure I understand this, seems like just greedy selection of next token at every time step?
  • Mikolov ‘13 Efficient Estimation of Word Representations in Vector Space
    • Evaluated two methods to calculate continuous word embeddings: continuous bag of words and skip-gram
    • CBOW task: given context around a word, predict word
    • Skipgram task: given word, predict all words in a window around the word
    • pretty successful evaluation results, with the well-known king - man + woman ≈ queen analogy results, etc. (rough skip-gram sketch below)
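
To make the CBOW/skip-gram distinction concrete, here is a toy skip-gram sketch in PyTorch - my own full-softmax version, not the paper's hierarchical softmax or negative sampling, and all names/sizes are made up:

import torch
import torch.nn as nn

class SkipGram(nn.Module):
  def __init__(self, vocab_size: int, dim: int = 64):
    super().__init__()
    self.in_embed = nn.Embedding(vocab_size, dim)           # the word vectors we actually keep
    self.out_proj = nn.Linear(dim, vocab_size, bias=False)

  def forward(self, center: torch.Tensor) -> torch.Tensor:
    # center: (batch,) word ids -> (batch, vocab) logits over context words
    return self.out_proj(self.in_embed(center))

# toy corpus of word ids, window of +/- 2 around each center word
corpus = torch.randint(0, 100, (1000,))
pairs = [(corpus[i], corpus[j])
         for i in range(len(corpus))
         for j in range(max(0, i - 2), min(len(corpus), i + 3)) if j != i]
centers = torch.stack([c for c, _ in pairs])
contexts = torch.stack([c for _, c in pairs])

model = SkipGram(vocab_size=100)
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
loss = nn.functional.cross_entropy(model(centers), contexts)   # one (not very mini) batch step
loss.backward()
opt.step()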

2022-03-05

  • Not much done today :(
  • Started speech chapter of Jurafsky and Martin textbook on NLP
    • Speech -> text still relies on featurization! (i.e., how to convert the raw signal into input features for a neural network - quick sketch below)
    • Do I need to figure out the basics of DSP and Fourier series/transforms? (what even are these?)
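
A minimal sketch of what that featurization looks like with torchaudio (the window/hop/n_mels values are just common choices, not anything from the book):

import torch
import torchaudio

waveform = torch.randn(1, 16000)   # 1 second of fake 16 kHz audio standing in for a real clip
mel = torchaudio.transforms.MelSpectrogram(
  sample_rate=16000,
  n_fft=400,        # 25 ms window at 16 kHz
  hop_length=160,   # 10 ms hop
  n_mels=80,
)
features = torch.log(mel(waveform) + 1e-6)   # (1, 80, num_frames) log-Mel features for the network
print(features.shape)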

2022-03-06

  • Random research ideas on BT (Barlow Twins) - loss sketch after this list for reference:
    • differentiable image transformations + barlow twins = adversarial barlow twins
    • (questions from Jure’s talk) relax cross-correlation constraint on “nearby” output vectors
    • barlow quadruplets
  • Chapter 26 of Jurafsky and Martin
    • The TTS section seemed a little complicated, need to read the Tacotron paper
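
For reference, so the cross-correlation ideas above have something concrete attached - the Barlow Twins objective is roughly the following (the lam weight and the per-dimension normalization details are from memory):

import torch

def barlow_twins_loss(z_a: torch.Tensor, z_b: torch.Tensor, lam: float = 5e-3) -> torch.Tensor:
  '''z_a, z_b: (batch, dim) embeddings of two augmented views of the same images.'''
  n, d = z_a.shape
  # standardize each embedding dimension across the batch
  z_a = (z_a - z_a.mean(0)) / (z_a.std(0) + 1e-6)
  z_b = (z_b - z_b.mean(0)) / (z_b.std(0) + 1e-6)
  c = (z_a.T @ z_b) / n                                         # (dim, dim) cross-correlation matrix
  on_diag = (torch.diagonal(c) - 1).pow(2).sum()                # push diagonal towards 1
  off_diag = (c - torch.diag(torch.diagonal(c))).pow(2).sum()   # push off-diagonal towards 0
  return on_diag + lam * off_diag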

2022-03-07

  • Seq2seq
    • feeding in the input sentences backwards seemed like an interesting idea, but it seems like attention is strictly better
  • Listen Attend and Spell
    • interesting that the decoding score again has an explicit term accounting for the shortness of the transcription, after RNN-T paper #2 by Graves removed it (rough formula below)
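
If I'm reading the LAS decoding section right, the rescoring is roughly a character-length-normalized log-probability plus a language-model term, something like

  s(y \mid x) = \frac{\log P(y \mid x)}{|y|_c} + \lambda \log P_{\mathrm{LM}}(y)

with |y|_c the number of characters - the division by length is what counteracts the bias towards short outputs.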

2022-03-08

  • Planned to get started implementing speech tasks, but got distracted by SLT lectures :(
  • First and second lectures of SLT from the UW summer school
    • the learnability theorems feel quite similar to material covered in the 21st-century algorithms class I took at Dartmouth - cool!
    • Seems like the core idea of VC dimension is to give a tighter bound on the “complexity” of a function class (definitions below)
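
Writing the standard definitions down so I remember them - the growth function counts the maximum number of distinct labelings of n points, and the VC dimension is the largest n that can be fully shattered:

  \Pi_{\mathcal{F}}(n) = \max_{x_1, \dots, x_n} \bigl| \{ (f(x_1), \dots, f(x_n)) : f \in \mathcal{F} \} \bigr|, \qquad \mathrm{VCdim}(\mathcal{F}) = \max \{\, n : \Pi_{\mathcal{F}}(n) = 2^n \,\}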

2022-03-09

  • Lectures 3 and 4 of SLT from the UW summer school
    • topics covered: Rademacher complexity, and bounding it via the growth function & VC dimension (definitions below)
  • Learning about SLT has been the most fun of everything I’ve seen so far IMO. Will follow up on SLT after this lecture series, probably with Percy Liang’s notes from his Stanford course.
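
Definitions so they stick: the empirical Rademacher complexity of a class F on a sample of n points, and (via Massart's finite-class lemma) the bound through the growth function for binary-valued classes:

  \hat{\mathcal{R}}_S(\mathcal{F}) = \mathbb{E}_{\sigma}\left[ \sup_{f \in \mathcal{F}} \frac{1}{n} \sum_{i=1}^{n} \sigma_i f(x_i) \right], \quad \sigma_i \text{ i.i.d. uniform on } \{-1, +1\}

  \mathcal{R}_n(\mathcal{F}) \le \sqrt{\frac{2 \ln \Pi_{\mathcal{F}}(n)}{n}} \quad \text{for } \mathcal{F} \text{ taking values in } \{-1, +1\}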

2022-03-11

  • Spent 2-3 hours trying to implement CTC loss in pure PyTorch, did not succeed. Code vomit (of the forward pass, not exactly CTCLoss):
import torch

# assumes letter_to_number maps characters to vocab indices, with '\0' used as the CTC blank
def calculate_normalized_prob_matrix(y: torch.Tensor, target: str) -> torch.Tensor:
  '''
  y - float tensor of size (time steps x vocab_size)
    - t x v, rows are per-frame probabilities
  target - str target (assumed non-empty)
  '''
  if len(target) == 0:
    return None

  T = y.shape[0]

  # blank-interleaved target: blank, l1, blank, l2, ..., blank
  new_target = '\0' + '\0'.join(target) + '\0'
  S = len(new_target)

  a = torch.zeros((T, S), device=y.device)
  # initial conditions: start in the leading blank or in the first label
  a[0,0] = y[0, letter_to_number['\0']]
  a[0,1] = y[0, letter_to_number[new_target[1]]]

  for t in range(1, T):
    # rescale by the previous row's total to avoid underflow (Graves '06)
    c_t = torch.sum(a[t-1,:])
    for s in range(max(0, S - 2*(T-t)), S):
      prob_S_s = y[t, letter_to_number[new_target[s]]]

      a_ts = a[t-1, s]/c_t * prob_S_s
      if s-1 >= 0:
        a_ts += a[t-1, s-1]/c_t * prob_S_s
      # skip transition only for a non-blank label that differs from the one two steps back
      if s-2 >= 0 and new_target[s] != '\0' and new_target[s-2] != new_target[s]:
        a_ts += a[t-1, s-2]/c_t * prob_S_s

      # this in-place write into the preallocated matrix is what autograd complains about below
      a[t,s] = a_ts

  # total probability: last time step, ending in the final label or the trailing blank
  return a[-1, -2:].sum(dim=0, keepdim=True)

Unfortunately, when trying to backprop through an input probability vector y, PyTorch complains that one of the variables needed for gradient computation has been modified by an in-place operation. One fix to get the code actually running is to clone the previous time-step values of the a matrix when filling it in according to the forward-backward algorithm, but that causes the gradient given by PyTorch’s autograd to be incorrect :( (verified numerically).
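
An idea to try next: restructure the recursion so nothing is ever written into a preallocated tensor - build each time step's alpha row out of slices of the previous row and torch.stack it, which avoids the in-place error. A sketch (probability space with no rescaling, so it would underflow on long inputs, and I haven't checked its gradients against nn.CTCLoss yet):

import torch

def ctc_forward_prob(y: torch.Tensor, labels: list, blank: int = 0) -> torch.Tensor:
  '''
  y      - (T, vocab) per-frame probabilities (rows sum to 1)
  labels - non-empty list of target label ids (no blanks)
  Returns the total CTC probability of the labeling, differentiable w.r.t. y.
  '''
  T = y.shape[0]
  ext = [blank]
  for l in labels:
    ext += [l, blank]                       # blank-interleaved target, S = 2*len(labels) + 1
  S = len(ext)

  zero = y.new_zeros(())
  alpha = torch.stack([y[0, ext[0]], y[0, ext[1]]] + [zero] * (S - 2))   # alpha at t = 0

  for t in range(1, T):
    new_row = []
    for s in range(S):
      a = alpha[s]
      if s - 1 >= 0:
        a = a + alpha[s - 1]
      if s - 2 >= 0 and ext[s] != blank and ext[s] != ext[s - 2]:
        a = a + alpha[s - 2]                # skip transition for distinct non-blank labels
      new_row.append(a * y[t, ext[s]])
    alpha = torch.stack(new_row)            # fresh tensor every step, no in-place writes

  return alpha[-1] + alpha[-2]              # end in the last label or the trailing blank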

2022-03-12

2022-03-13

  • Spent ~1.5 hours wrangling with PyTorch audio datasets to try to set up the model
    • Finally learned what collate_fn in the DataLoader does, and now I know how to get more than just data and labels when iterating over batches in an epoch (sketch below)
  • Pytorch Lightning looks like a cool framework - going to poke around with it for my first basic model mimicking Graves ‘06 CTC paper (hopefully tomorrow)
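
What I mean by getting more than just data and labels out of a batch - a sketch of the kind of collate_fn I have in mind (names are mine; dataset items assumed to be (spectrogram, transcript_ids) pairs), using pad_sequence so every batch carries padded features, padded targets, and both sets of true lengths:

import torch
from torch.nn.utils.rnn import pad_sequence

def collate_speech(batch):
  '''batch: list of (features, target) with features (time, n_mels) float and target a 1-D LongTensor.'''
  feats, targets = zip(*batch)
  feat_lens = torch.tensor([f.shape[0] for f in feats])
  target_lens = torch.tensor([t.shape[0] for t in targets])
  feats = pad_sequence(list(feats), batch_first=True)        # (batch, max_time, n_mels), zero-padded
  targets = pad_sequence(list(targets), batch_first=True)    # (batch, max_target_len), zero-padded
  return feats, targets, feat_lens, target_lens

# toy check with two utterances of different lengths; pass collate_fn=collate_speech to the DataLoader
batch = [(torch.randn(50, 80), torch.tensor([3, 1, 4])),
         (torch.randn(75, 80), torch.tensor([1, 5]))]
feats, targets, feat_lens, target_lens = collate_speech(batch)
print(feats.shape, targets.shape, feat_lens, target_lens)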

2022-03-15

  • Got a version of CTC training running at last; data loading seems very slow though :( over 30 percent of training time is taken up by building batch data, need to debug tomorrow

2022-03-16

  • Two things that were really slow are now fixed:
    • Mel spectrogram transform - instead of computing it on the fly, pre-compute it for the whole dataset and build the dataloader from the cached features
    • Padding batches with zeros - now using the PyTorch rnn utils padding function instead of manual copying
    • These two sped up batch load time, which now only accounts for ~9 percent of training time :)
  • At inference time, it seems like a model I trained to convergence for 200 epochs learned absolutely nothing (using greedy decoding/beam search width = 1 - sketched below). Will debug tomorrow. Could be related to the model saving incorrectly, or could just be a straight-up training or processing error.
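
For reference, the greedy (beam width 1) decode here is just argmax per frame, collapse repeats, drop blanks - roughly:

import torch

def ctc_greedy_decode(log_probs: torch.Tensor, blank: int = 0) -> list:
  '''log_probs: (time, vocab). Argmax per frame, merge consecutive repeats, remove blanks.'''
  best = log_probs.argmax(dim=-1).tolist()
  out, prev = [], None
  for idx in best:
    if idx != prev and idx != blank:
      out.append(idx)
    prev = idx
  return out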

2022-03-18

  • Started reading SLT paper introduction, should be a bit more palatable than Percy Liang’s notes from his Stanford course (a bit out of reach for me right now). Goal: 1 page of the paper per day

2022-03-19

  • Continued reading & taking notes on the SLT paper introduction
  • Watched the 1st half of the NeurIPS 2021 tutorial on SSL
  • Debugged the CTC model a bit + corrected the beta function (as referred to in Graves ‘06), still couldn’t figure out why it’s performing extremely poorly… Things I checked (plus one more sanity check sketched below):
    • Log probs output from the model make sense (i.e. are of the right dimension and exponentiate to distributions summing to 1)
    • passing in correct order of speech/sentence data and their lengths
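
One more check to try: compare my loss value (and gradients) against torch.nn.CTCLoss on a tiny random input. Sketch of just the reference side (shapes/names are mine; log_probs is (time, batch, vocab) and label 0 is the blank):

import torch
import torch.nn as nn

T, vocab, target_len = 12, 6, 4
log_probs = torch.randn(T, 1, vocab).log_softmax(-1).requires_grad_()
targets = torch.randint(1, vocab, (1, target_len))            # 0 is reserved for the blank
loss = nn.CTCLoss(blank=0)(log_probs, targets,
                           torch.tensor([T]), torch.tensor([target_len]))
loss.backward()
print(loss.item(), log_probs.grad.abs().sum().item())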

2022-03-20

2022-03-21

  • Notes on inequality/concentration bounds up to but not hitting Chernoff bounds (written out below)
  • CTC model: overfit massively within 1 epoch, but performance is still not great. TODO tomorrow: investigate WER and try out better architectures (RNN-T?)
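
The two standard ones leading up to Chernoff, written out so they stick - Markov (for non-negative X) and Chebyshev:

  P(X \ge a) \le \frac{\mathbb{E}[X]}{a} \quad (X \ge 0,\; a > 0), \qquad P\bigl(|X - \mathbb{E}[X]| \ge t\bigr) \le \frac{\mathrm{Var}(X)}{t^2} \quad (t > 0)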