Sequence problems are everywhere.
Sequence problems are best solved by recurrent models.
"For most deep learning practitioners, sequence modeling is synonymous with recurrent networks."
— Bai, Kolter, Koltun (2018)
Maintain a hidden state h_t = f(h_{t-1}, x_t)
System specified by learnable parameters
Update model parameters with gradient descent:
Get derivatives using back-propagation through time
Need to store all intermediate hidden states
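As a concrete illustration (a minimal numpy sketch, not code from any paper; the dimensions and the quadratic loss are arbitrary choices), here is a vanilla tanh RNN differentiated by back-propagation through time. Note that every intermediate hidden state must be stored for the backward pass:

```python
import numpy as np

rng = np.random.default_rng(0)
d, T = 4, 10                            # state dimension, sequence length
W = 0.5 * rng.standard_normal((d, d))   # recurrent weights (illustrative)
U = rng.standard_normal((d, d))         # input weights
xs = rng.standard_normal((T, d))        # input sequence

# Forward pass: every hidden state is stored for use in backprop.
hs = [np.zeros(d)]
for x in xs:
    hs.append(np.tanh(W @ hs[-1] + U @ x))

# Backprop through time for the loss L = 0.5 * ||h_T||^2.
grad_W = np.zeros_like(W)
delta = hs[-1]                                # dL/dh_T
for t in range(T, 0, -1):
    pre = W @ hs[t - 1] + U @ xs[t - 1]       # pre-activation at step t
    dpre = delta * (1 - np.tanh(pre) ** 2)    # back through tanh
    grad_W += np.outer(dpre, hs[t - 1])       # needs the stored h_{t-1}
    delta = W.T @ dpre                        # propagate to h_{t-1}
```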
Language Modeling (Gated-Conv by Dauphin et al.),
Machine Translation (Transformer by Vaswani et al.),
Speech Synthesis (WaveNet by van den Oord et al.),
everything else (TCN by Bai et al.)
Are all trainable recurrent models effectively feed-forward?
Sounds plausible to me, but it's hard
to characterize trainability
Stable recurrent models can be replaced by equivalent feed-forward models for both training and inference.
Put differently, either a recurrent model is
inherently feed-forward, or it is unstable.
There are ways of making models stable (RNNs, LSTMs), including during training
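One simple way to enforce stability during training (a hedged sketch; the helper name and the margin 0.95 are illustrative choices, not a prescribed recipe) is to project the recurrent weight matrix back to spectral norm at most λ < 1 after each gradient step, by clipping its singular values:

```python
import numpy as np

def project_spectral_norm(W, lam=0.95):
    """Clip singular values of W so that ||W||_2 <= lam (< 1 => stable)."""
    U, s, Vt = np.linalg.svd(W, full_matrices=False)
    return U @ np.diag(np.minimum(s, lam)) @ Vt

rng = np.random.default_rng(1)
W = rng.standard_normal((8, 8))               # e.g. weights after a gradient step
W_stable = project_spectral_norm(W, lam=0.95)  # now ||W_stable||_2 <= 0.95
```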
Stable recurrent models can achieve
competitive performance on various tasks.
Linear dynamical system: h_t = A h_{t-1} + B x_t is stable if ||A|| < 1
Same for a recurrent neural network:
h_t = tanh(W h_{t-1} + U x_t) is stable if ||W|| < 1 (tanh is 1-Lipschitz)
And, since you're asking, the LSTM is stable
if an analogous (more involved) condition on its weight matrices holds.
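The contraction property behind these conditions can be sanity-checked numerically (an illustrative sketch; rescaling to spectral norm 0.9 is an arbitrary choice): with ||A|| < 1 and a 1-Lipschitz nonlinearity, one transition step shrinks the distance between any two hidden states by at least that factor:

```python
import numpy as np

rng = np.random.default_rng(2)
d = 6
A = rng.standard_normal((d, d))
A *= 0.9 / np.linalg.norm(A, 2)   # rescale so ||A||_2 = 0.9 < 1

def step(h, x):
    # tanh RNN transition; tanh is 1-Lipschitz, so the map contracts
    return np.tanh(A @ h + x)

h1, h2 = rng.standard_normal(d), rng.standard_normal(d)
x = rng.standard_normal(d)
ratio = np.linalg.norm(step(h1, x) - step(h2, x)) / np.linalg.norm(h1 - h2)
# ratio <= 0.9: distinct states are pulled together at every step
```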
Truncation affects neither training nor inference.
Inference: follows by direct calculation using the contractiveness assumption.
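The inference claim can be checked numerically (a hedged sketch with arbitrary dimensions): for a stable system, replacing the full history with a window of the last k inputs — i.e., a feed-forward model — changes the final state by at most about λ^k times the state norm:

```python
import numpy as np

rng = np.random.default_rng(3)
d, T, k = 6, 200, 40
A = rng.standard_normal((d, d))
A *= 0.9 / np.linalg.norm(A, 2)   # stable: ||A||_2 = 0.9
xs = rng.standard_normal((T, d))

def run(inputs):
    h = np.zeros(d)
    for x in inputs:
        h = np.tanh(A @ h + x)
    return h

full = run(xs)         # recurrent: uses the entire history
trunc = run(xs[-k:])   # feed-forward: only a window of length k
err = np.linalg.norm(full - trunc)   # bounded by ~0.9**k * ||h_{T-k}||
```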
Uses stability properties of gradient descent.
Inspired by "train faster, generalize better" analysis.
Show that stable models have vanishing gradients (over time).
I.e., argue that ||dh_T/dh_t|| decays geometrically in T - t:
a (not so obvious) extension of the inference result.
Show that gradient descent is insensitive to such small gradient differences.
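The geometric decay can be seen directly by multiplying Jacobians of the state-transition map (a numpy sketch with illustrative sizes): each factor has spectral norm at most λ = 0.9 here, so the chained gradient dh_T/dh_{T-k} shrinks like λ^k:

```python
import numpy as np

rng = np.random.default_rng(4)
d, T = 6, 30
A = rng.standard_normal((d, d))
A *= 0.9 / np.linalg.norm(A, 2)   # stable: ||A||_2 = 0.9
xs = rng.standard_normal((T, d))

# Forward pass, recording pre-activations for the Jacobians.
h, pres = np.zeros(d), []
for x in xs:
    pre = A @ h + x
    pres.append(pre)
    h = np.tanh(pre)

# J_t = diag(tanh'(pre_t)) @ A; the chain dh_T/dh_{T-k} is their product.
J = np.eye(d)
norms = []
for pre in reversed(pres):
    J = J @ (np.diag(1 - np.tanh(pre) ** 2) @ A)
    norms.append(np.linalg.norm(J, 2))
# norms[k-1] <= 0.9**k: influence of distant steps vanishes geometrically
```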
Gradient descent is insensitive to perturbations in the sample (algorithmic stability) [Hardt-Recht-Singer, 2015]
Similar analysis applies here
A decaying learning rate is necessary for the analysis in both cases.
Here, it is even necessary in experiments.
Stable RNN slightly better than unstable RNN; the opposite for the LSTM.
Stable and unstable LSTMs and RNNs have very similar performance.
Stable slightly better.
Stable worse than unstable; no difference between LSTM and RNN.
Stable recurrent models do not provide long-term memory and can be replaced by feed-forward models without loss.
Empirical evidence suggests that stable models can achieve good performance.
Are all efficiently trainable recurrent models
inherently subsumed by feed-forward models?
In this work: Stability as a proxy for trainability.
When are recurrent models really needed (if they ever are)?
What is the price of stability (if there is any)?
What memory mechanisms are efficiently learnable?
What kind of memory do natural problems require?
What classes of dynamical systems are learnable?
(Tricky even for linear dynamical systems)