An embedding layer is usually the first layer of the model. It encodes a SINGLE TOKEN into a numerical representation, independently of the surrounding tokens.
A Transformer encoder, on the other hand, encodes the ENTIRE SEQUENCE into a numerical representation.
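A minimal PyTorch sketch of the contrast (the dimensions and layer sizes are just illustrative assumptions, not from the original post):

```python
import torch
import torch.nn as nn

vocab_size, d_model = 10_000, 64

# Embedding layer: one lookup per token id, no context involved
embedding = nn.Embedding(vocab_size, d_model)

# Transformer encoder: attends over the WHOLE sequence at once
encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=d_model, nhead=8, batch_first=True),
    num_layers=2,
)

token_ids = torch.tensor([[5, 42, 7, 999]])  # (batch=1, seq_len=4)
token_vecs = embedding(token_ids)             # (1, 4, 64): each vector depends only on its own token
seq_vecs = encoder(token_vecs)                # (1, 4, 64): each vector now depends on the full sequence
```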
Finally, there are contextualized embeddings (e.g. from BERT), which also encode the entire sequence: each token's vector depends on its context.
But contextualized embeddings are more general-purpose (trained in an unsupervised/self-supervised way), while an encoder you train yourself is more task-specific (trained with supervision on your task).
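A hedged sketch of pulling general-purpose contextualized embeddings from a pretrained model, assuming the Hugging Face `transformers` library and the `bert-base-uncased` checkpoint (both are my assumptions, not part of the original answer):

```python
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

inputs = tokenizer("bank of the river", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# last_hidden_state: one contextualized vector per token, shaped by the whole sentence
print(outputs.last_hidden_state.shape)  # e.g. torch.Size([1, 6, 768])
```

You could then fine-tune such a model (supervised) for your task, or add a task-specific encoder head on top of the frozen embeddings.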