SeNSe: Embedding Alignment via Semantic Anchors Selection

Lorenzo Malandri, Fabio Mercorio, Mario Mezzanzanica & Filippo Pallucchini

International Journal of Data Science and Analytics

Abstract

Word embeddings have proven extremely useful across many NLP applications in recent years. Several key linguistic tasks, such as machine translation and transfer learning, require that distributed representations of words belonging to different vector spaces, within or across domains and languages, be aligned; this process is known as embedding alignment. To this end, several existing methods exploit words that are supposed to have the same meaning in the two corpora, called the seed lexicon or anchors, as reference points to map one embedding into the other. To choose anchors, all these methods consider only whether a word is supposed to have the same meaning in the two spaces, while its neighbours and similar words are neglected. We propose SeNSe, an unsupervised method for aligning monolingual embeddings that generates a bilingual dictionary composed of words with the most similar meaning across word vector spaces. Our approach selects a seed lexicon of words used in the same context in both corpora, without assuming semantic similarities a priori. We compare our method with well-established benchmarks, showing that SeNSe outperforms state-of-the-art (SOTA) methods for embedding alignment on bilingual lexicon extraction in most cases.
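
As a point of reference, the sketch below illustrates the general anchor-based alignment scheme the abstract refers to: given a seed lexicon of anchor pairs, an orthogonal map is fitted (here via orthogonal Procrustes) to carry one embedding space into the other. This is not the SeNSe anchor-selection procedure itself, and the function and variable names are illustrative assumptions.

```python
# Minimal sketch of anchor-based embedding alignment via orthogonal
# Procrustes (the generic technique, not SeNSe's anchor selection).
import numpy as np

def align_with_anchors(src_vecs, tgt_vecs, anchors):
    """Learn an orthogonal map W so that W @ src ~= tgt on the anchor pairs.

    src_vecs, tgt_vecs : dicts mapping words to 1-D numpy vectors
    anchors            : list of (source_word, target_word) seed pairs
    """
    # Stack anchor vectors as columns: X (source) and Y (target), shape (d, n)
    X = np.column_stack([src_vecs[s] for s, _ in anchors])
    Y = np.column_stack([tgt_vecs[t] for _, t in anchors])
    # Orthogonal Procrustes solution: W = U V^T from the SVD of Y X^T
    U, _, Vt = np.linalg.svd(Y @ X.T)
    return U @ Vt

# Usage (hypothetical data): map every source-space vector into the target
# space, then retrieve translations by nearest-neighbour (cosine) search.
# W = align_with_anchors(src_vecs, tgt_vecs, anchors)
# aligned = {w: W @ v for w, v in src_vecs.items()}
```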