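Transformers have largely replaced RNNs and CNNs. Their main advantage is parallel computation: all objects in a sequence are processed at once. But unlike an RNN or a CNN, they do not by themselves take an object's position in the sequence into account.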
Idea: inject information about the object's position into its input embedding.
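
The note does not say how the position is injected. One common choice, assumed here, is the fixed sinusoidal encoding from the original Transformer paper, added element-wise to each embedding. The sketch below is a minimal pure-Python illustration; the function names `sinusoidal_position` and `add_positions` and the toy 4-dimensional embeddings are hypothetical.

```python
import math


def sinusoidal_position(pos, d_model):
    """Return a d_model-dimensional sinusoidal encoding of position `pos`."""
    vec = []
    for i in range(d_model):
        # Dimensions 2k and 2k+1 share the frequency 1 / 10000^(2k / d_model).
        angle = pos / (10000 ** ((i - i % 2) / d_model))
        # Even dimensions use sine, odd dimensions use cosine.
        vec.append(math.sin(angle) if i % 2 == 0 else math.cos(angle))
    return vec


def add_positions(embeddings):
    """Add a position vector to each input embedding (assumed: list of equal-length lists)."""
    d_model = len(embeddings[0])
    return [
        [e + p for e, p in zip(emb, sinusoidal_position(pos, d_model))]
        for pos, emb in enumerate(embeddings)
    ]


# Two identical embeddings placed at positions 0 and 1.
tokens = [[0.0, 0.0, 0.0, 0.0], [0.0, 0.0, 0.0, 0.0]]
encoded = add_positions(tokens)
```

After the addition, the two identical embeddings differ, so the model can tell the objects apart by position. Learned position embeddings are another option; the sinusoidal form is used here only because it needs no training.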