Transformers (such as GenerativePretrainingTransformer) are increasingly being adopted beyond text classification and generation.
So I'm currently reading up on transformers. (This book) I don't claim to understand them yet, let alone to be an "expert".
But here's a first cut of an understanding. Transformers are neural networks with two sorts of layers: normal feed-forward layers, trained with something like BackProp, which learn mappings between inputs and outputs.
But these are interleaved with "attention mechanism" layers, i.e. layers in which every position in the input is connected to every other position, and which perform a kind of "term weighting" function. That is, the attention mechanism says "given that this token is present in the input, this is how much weight to give to each of the other tokens present".
So, for example, if we're processing language, it might assess that in a sentence containing "chocolate" and "cooking" the word "recipe" should be treated as much more significant than it would be in a sentence with no culinary words but containing the word "disaster".
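To make the "term weighting" idea concrete for myself, here is a minimal sketch of attention in plain Python. It is an illustration under loud assumptions, not how any real model is implemented: the 2-d "embeddings" are made up by hand, and I use the raw token vectors directly where a real transformer would first pass them through learned query, key and value projections. The function names are my own.

```python
import math

def softmax(scores):
    # Subtract the max before exponentiating, for numerical stability.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def attention_weights(vectors):
    """For each token, how much weight it gives to every token in the input."""
    d = len(vectors[0])
    return [
        softmax([dot(q, k) / math.sqrt(d) for k in vectors])
        for q in vectors
    ]

def attend(vectors):
    """Replace each token vector with a weighted mix of all the token vectors."""
    weights = attention_weights(vectors)
    d = len(vectors[0])
    return [
        [sum(w * v[i] for w, v in zip(row, vectors)) for i in range(d)]
        for row in weights
    ]

# Hand-made vectors (an assumption for illustration only):
# first axis is roughly "culinary", second is roughly "catastrophe".
tokens = ["chocolate", "cooking", "recipe", "disaster"]
vectors = [[2.0, 0.0], [1.8, 0.2], [1.5, 0.1], [0.0, 2.0]]

weights = attention_weights(vectors)
for tok, row in zip(tokens, weights):
    print(tok, [round(w, 2) for w in row])
```

Each printed row is one token's view of the sentence: the weights in a row sum to 1, and with these made-up vectors "chocolate" gives more weight to "recipe" than to "disaster".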
Transformers are called "industrial" in the sense that they contain this standardized pattern of a layer with both attention and normal learning sublayers, repeated many times. (So it's a kind of "software complexity"; see TypesOfComplexSystem.)
Today's powerful transformers have on the order of 90-100 such layers, each with its own attention sublayer and a "normal" feed-forward sublayer.
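The repeated-layer pattern can be sketched in a few lines of plain Python. Again, this is a toy under stated assumptions: the weights are random rather than trained, attention uses the raw vectors with no learned projections, layer normalisation is left out, and the "add the sublayer output back onto its input" step (a residual connection) is something I've put in because real transformers use it, not something described above. All the names and sizes here are hypothetical, and the demo uses 6 layers rather than 90-100 to keep the numbers small.

```python
import math
import random

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def attention_sublayer(xs):
    # Every position looks at every position and takes a weighted mix.
    d = len(xs[0])
    out = []
    for q in xs:
        w = softmax([dot(q, k) / math.sqrt(d) for k in xs])
        out.append([sum(wi * v[i] for wi, v in zip(w, xs)) for i in range(d)])
    return out

def feed_forward_sublayer(xs, w1, w2):
    # Applied to each position independently: linear, ReLU, linear.
    out = []
    for x in xs:
        hidden = [max(0.0, dot(x, row)) for row in w1]
        out.append([dot(hidden, row) for row in w2])
    return out

def add(xs, ys):
    return [[a + b for a, b in zip(x, y)] for x, y in zip(xs, ys)]

def make_layer(d, hidden, rng):
    # Random, untrained weights: w1 maps d -> hidden, w2 maps hidden -> d.
    w1 = [[rng.gauss(0, 0.1) for _ in range(d)] for _ in range(hidden)]
    w2 = [[rng.gauss(0, 0.1) for _ in range(hidden)] for _ in range(d)]
    return w1, w2

def transformer_stack(xs, layers):
    # The same two-sublayer pattern, repeated once per layer.
    for w1, w2 in layers:
        xs = add(xs, attention_sublayer(xs))
        xs = add(xs, feed_forward_sublayer(xs, w1, w2))
    return xs

D, HIDDEN, NUM_LAYERS = 4, 8, 6
rng = random.Random(0)
layers = [make_layer(D, HIDDEN, rng) for _ in range(NUM_LAYERS)]
xs = [[rng.gauss(0, 1) for _ in range(D)] for _ in range(3)]  # 3 token vectors
out = transformer_stack(xs, layers)
print(len(out), "tokens in,", len(out[0]), "numbers per token out")
```

The point of the sketch is only the shape of `transformer_stack`: one standard block, looped over, with nothing special about any individual layer except its own weights.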