Languages spoken just a few thousand years ago have since been lost and remain undeciphered: their words, grammar and syntax are a mystery.
These languages are more than an academic interest; they provide a window into the civilisations that spoke them. However, because few records of these languages survive, because there is no related language to compare them with, and because they lack traditional dividers such as punctuation and spaces, machine-learning tools such as Google Translate cannot help either.
But a new innovation may finally help resurrect these 'dead' languages. Researchers at MIT’s Computer Science and Artificial Intelligence Laboratory (CSAIL) have created a system that has shown promise in automatically deciphering these ancient languages without any advance knowledge of related languages. The system can also assess relationships between languages; in a recent example, the team used it to corroborate recent scholarship suggesting that Iberian is not actually related to Basque.
The team, led by MIT Professor Regina Barzilay, ultimately aims to create an algorithm that can translate ancient languages using only a few thousand words. The project builds on a paper that Barzilay and MIT PhD student Jiaming Luo wrote in 2019, which deciphered Ugaritic and Linear B; the latter took humans decades to decode. In those experiments, however, the team knew the related languages in advance: Ugaritic and Linear B are related to early forms of Hebrew and Greek, respectively.
The system created by the MIT researchers relies on principles grounded in historical linguistics, which describe the predictable ways in which a language evolves. For example, a given language rarely adds or deletes an entire sound, but certain sound substitutions are likely to occur. A word with a “p” in the parent language may change into a “b” in the descendant language, but changing to a “k” is less likely due to the significant pronunciation gap.
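One way to picture these constraints is as a weighted edit distance between a parent word and a descendant word. The sketch below is only an illustration of the idea, not the researchers' model: the cost values, the `SUB_COST` table and the function names are all assumptions chosen to mirror the “p”, “b” and “k” example.

```python
# Illustrative costs only: these numbers are assumptions, not values from the MIT system.
INDEL_COST = 1.0  # adding or deleting a whole sound is rare, so it is expensive

# Hypothetical substitution costs: small for similar sounds, large for distant ones.
SUB_COST = {
    frozenset(("p", "b")): 0.2,  # small pronunciation gap
    frozenset(("p", "k")): 0.8,  # large pronunciation gap
}
DEFAULT_SUB_COST = 1.0  # any pair not listed above


def sub_cost(a: str, b: str) -> float:
    """Cost of one sound changing into another (symmetric)."""
    if a == b:
        return 0.0
    return SUB_COST.get(frozenset((a, b)), DEFAULT_SUB_COST)


def weighted_edit_distance(parent: str, child: str) -> float:
    """Cheapest way to turn `parent` into `child`, one character per sound."""
    rows, cols = len(parent) + 1, len(child) + 1
    dist = [[0.0] * cols for _ in range(rows)]
    for i in range(1, rows):
        dist[i][0] = i * INDEL_COST
    for j in range(1, cols):
        dist[0][j] = j * INDEL_COST
    for i in range(1, rows):
        for j in range(1, cols):
            dist[i][j] = min(
                dist[i - 1][j] + INDEL_COST,  # delete a sound
                dist[i][j - 1] + INDEL_COST,  # insert a sound
                dist[i - 1][j - 1] + sub_cost(parent[i - 1], child[j - 1]),
            )
    return dist[-1][-1]
```

With these made-up costs, a hypothetical word "pata" is much closer to "bata" than to "kata", and dropping the first sound entirely is the most expensive change of the three.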
Barzilay and Luo have developed an algorithm that can weigh the various possible transformations even when the input offers little guiding signal. The algorithm embeds language sounds into a multidimensional space where differences in pronunciation are reflected in the distance between corresponding vectors. This approach helps the algorithm capture patterns of language evolution and express them as computational constraints. The resulting model can segment words in an ancient language and map them to counterparts in a related language.
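The embedding idea can be shown with a toy example. The coordinates below are hand-written assumptions (a voicing axis, a place-of-articulation axis and a manner axis), and the names `SOUND_VECTORS`, `sound_distance` and `nearest_sounds` are hypothetical; the article does not describe how the actual system builds its vectors.

```python
import math

# Hand-written coordinates, assumed purely for illustration:
#   axis 0, voicing: 0.0 = voiceless, 0.5 = voiced
#   axis 1, place:   0.0 = lips, 1.0 = tongue tip, 2.0 = back of the mouth
#   axis 2, manner:  0.0 = stop, 1.0 = fricative
SOUND_VECTORS = {
    "p": (0.0, 0.0, 0.0),
    "b": (0.5, 0.0, 0.0),
    "t": (0.0, 1.0, 0.0),
    "d": (0.5, 1.0, 0.0),
    "k": (0.0, 2.0, 0.0),
    "g": (0.5, 2.0, 0.0),
    "s": (0.0, 1.0, 1.0),
}


def sound_distance(a: str, b: str) -> float:
    """Euclidean distance between two sounds in the toy space."""
    return math.dist(SOUND_VECTORS[a], SOUND_VECTORS[b])


def nearest_sounds(sound: str) -> list[str]:
    """All other sounds, ordered from most to least similar."""
    others = [s for s in SOUND_VECTORS if s != sound]
    return sorted(others, key=lambda s: sound_distance(sound, s))
```

In this toy space "b" is the nearest neighbour of "p", while "k" sits much further away, which matches the intuition that a change from "p" to "b" is more plausible than a change from "p" to "k".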
For example, the question of which language Iberian is related to has puzzled scholars: some argue that it is closest to Basque, while others reject this hypothesis and claim that Iberian is not related to any known language.
The proposed algorithm can compute the proximity between two languages. It has been shown to accurately identify the correct language families for known languages. To investigate the origin of Iberian, the team used the algorithm to compare it with Basque, as well as with less likely candidates from the Romance, Germanic, Turkic and Uralic families. While Basque and Latin were closer to Iberian than other languages, they were still too different to be considered related.
In future work, the team hopes to expand beyond connecting texts to related words in a known language, an approach referred to as “cognate-based decipherment.” This paradigm assumes that such a known language exists, but the example of Iberian shows that this is not always the case. The team’s new approach would involve identifying the semantic meaning of words, even if the researchers don’t know how to read them.
“For instance, we may identify all the references to people or locations in the document which can then be further investigated in light of the known historical evidence,” says Barzilay. “These methods of ‘entity recognition’ are commonly used in various text processing applications today and are highly accurate, but the key research question is whether the task is feasible without any training data in the ancient language.”
The project was supported, in part, by the Intelligence Advanced Research Projects Activity (IARPA).