What makes software a working product is its source code. What the software is for, what it can do, and how it works are all written in its source code.
Yandex is a Russian multinational technology company that provides internet-related products and services. It runs an internet search engine, as well as an e-commerce platform, navigation, online advertising, and more.
The company operates an empire on the internet, and because of that, it's often referred to as the Google of Russia.
This time, a Yandex source code repository, allegedly stolen by a former employee of the Russian technology company, has been leaked as a torrent on a popular hacking forum.
The leaker posted a magnet link to what they claim are "Yandex git sources," consisting of a massive 44.7 GB of stolen source code, which the leaker claims was taken from the company in July 2022.
These code repositories allegedly contain all of the company's source code besides anti-spam rules.
According to software engineer Arseniy Shestakov in his website post, the leaked Yandex Git repository contains technical data and code about the following products:
- Yandex search engine and indexing bot.
- Yandex Maps.
- Alice (AI assistant).
- Yandex Taxi.
- Yandex Direct (ads service).
- Yandex Mail.
- Yandex Disk (cloud storage service).
- Yandex Market.
- Yandex Travel (travel booking platform).
- Yandex360 (workspaces service).
- Yandex Cloud.
- Yandex Pay (payment processing service).
- Yandex Metrika (internet analytics).
Shestakov also published a directory listing of the alleged source code on GitHub.
"There are at least some API keys, but they are likely only been used for testing deployment only," Shestakov said about the leaked data.
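Hard-coded credentials like these are a common finding in any large codebase, which is why teams often scan their own repositories for them. As a rough illustration only, here is a minimal sketch of such a check, assuming a simple regex heuristic. The pattern and the function name `find_candidate_secrets` are hypothetical and are not taken from the leaked data or from any Yandex tooling.

```python
import re

# Hypothetical heuristic: an assignment such as API_KEY = "abc123..." where
# the quoted value is at least 16 characters long. Real scanners use far
# richer rule sets; this pattern is for illustration only.
SECRET_PATTERN = re.compile(
    r"""(?i)(api[_-]?key|secret|token)\b\s*[:=]\s*['"]([A-Za-z0-9_\-]{16,})['"]"""
)


def find_candidate_secrets(text: str) -> list[tuple[int, str]]:
    """Return (line number, matched variable name) for each suspicious line."""
    hits = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        match = SECRET_PATTERN.search(line)
        if match:
            hits.append((lineno, match.group(1)))
    return hits
```

A hit from a heuristic like this only flags a line for review. Whether a key is live or, as Shestakov suggests, used only for testing still has to be checked by hand.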
In response to the alleged leak, Yandex made a statement, which said that its systems were not hacked, and confirmed the authenticity of the data.
"A repository is a tool for storing and working with code. Code is used in this way internally by most companies."
"Repositories are needed to work with code and are not intended for the storage of personal user data. We are conducting an internal investigation into the reasons for the release of source code fragments to the public, but we do not see any threat to user data or platform performance."
Speaking to BleepingComputer, Grigory Bakunov, a former senior systems administrator, deputy chief of development, and director of spreading technologies at Yandex, said that the leak was politically motivated, likely by Russia’s invasion of Ukraine. Bakunov also said that the rogue Yandex employee responsible for the leak had not tried to sell the code to competitors.
Bakunov added that the leak does not contain any customer data, so it does not pose a direct risk to the privacy or security of Yandex users, nor does it directly threaten to leak proprietary technology.
Like other big tech companies, Yandex is aware that data leaks are possible and has precautions in place in case the worst happens.
In this case, Yandex is said to use a monorepo structure called 'Arcadia', and turning the leaked source code into working software requires internal tools and specialized knowledge.
This means that standard procedures for compiling the code do not apply.
What's more, the leaked repository contains only code, with no databases. So even if the source code can be compiled and run, key information is still missing, which makes the software almost useless.
However, the leaked data does show some of the inner workings of Yandex.
First off, according to securitylab.ru, the leaked source code contains lots of code written in Python 2.7, and all files and folders have the same date: "2022-02-24."
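A uniform date like this is easy to confirm on any directory tree. The sketch below is a generic illustration, not a tool used by securitylab.ru. The function name `modification_dates` is made up for this example, and it assumes that the "date" in question is each file's modification time.

```python
import os
from collections import Counter
from datetime import datetime, timezone


def modification_dates(root: str) -> Counter:
    """Count files under root by modification date (UTC, YYYY-MM-DD)."""
    counts = Counter()
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            # lstat avoids following symlinks out of the tree.
            mtime = os.lstat(path).st_mtime
            day = datetime.fromtimestamp(mtime, tz=timezone.utc).date().isoformat()
            counts[day] += 1
    return counts
```

If every file was stamped with the same date, the returned counter has a single key.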
Then, there are files with names like blacklist.txt, which could potentially expose working services. The leak also reveals that Yandex's search engine used 1,922 ranking factors in its search algorithm, at least as of July 2022.
The code shows that developers at Yandex have used racist words, and that this was tolerated at the company.
This was revealed by Canadian hacker Aubrey Cottle, who noticed that the leaked code contains ethnic insults.
real code snippets from Yandex’s leaked git repos, woweeeee pic.twitter.com/p4dBibyQdS
— (@Kirtaner) January 27, 2023
The leaked code also suggests that Yandex has tweaked its pornography filter so that users who search for Russia's President Vladimir Putin won't see anything bad about him.
And if a user searches for the “z symbol,” the search engine adds a lot of negative prompts to hide possible parallels with Nazi Germany.
similarly, for image requests, if you search for "z symbol", it adds a lot of negative prompts to conceal the possible nazi germany parallels.
this feature is also used to sneakily promote yandex-owned kinopoisk when you look for something to watch. it adds "host:kinopoisk\.ru"
— banteg (@bantg) January 27, 2023
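The query rewriting described in these tweets can be pictured with a small sketch. This is not Yandex's code. The rule table, the trigger phrases, and the excluded term are hypothetical placeholders, and only the `host:kinopoisk\.ru` restriction is quoted from the tweet above.

```python
# Hypothetical rewrite rules: trigger phrase -> terms appended to the query.
# Only the kinopoisk host restriction comes from the tweet; the triggers and
# the excluded term are placeholders for illustration.
REWRITE_RULES = {
    "what to watch": [r"host:kinopoisk\.ru"],
    "z symbol": ["-placeholder_term"],
}


def rewrite_query(query: str) -> str:
    """Append the extra terms of every rule whose trigger appears in the query."""
    cleaned = query.strip()
    normalized = cleaned.lower()
    extra = []
    for trigger, additions in REWRITE_RULES.items():
        if trigger in normalized:
            extra.extend(additions)
    return " ".join([cleaned, *extra])
```

Under these assumptions, a query that matches no trigger passes through unchanged, while a matching query silently gains the extra terms before it reaches the search backend.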
Then there is the data containing phrases that Yandex users say when they try to turn off smart speakers running the Alice voice assistant. These phrases contain obscene language, threats to kill the virtual assistant, and accusations that it does not allow Russians to go to the toilet.
In the Yandex leak they found user data, and that user data is the recorded phrases Russians use to shut off the Alice smart speaker, and of course there's someone in there who fucking called out a SMART SPEAKER for never having fought in a war, daaaaamn pic.twitter.com/M5YWZz2ZQC
— Nuts) (@OMFGNuts) January 26, 2023
Responding to the racial slurs in its code, the Russian tech giant apologized.
"We deeply regret that this word ever appeared in our internal codes," the Yandex press office said.
For what it's worth, Bakunov confirmed that the leaked code could help hackers identify security holes.
Knowledgeable and experienced hackers who analyze the source code may be able to find weaknesses in the systems and create exploits for them.
While Yandex said that the leaked code is not identical to the current version being used by the company, the former executive said that the leaked code might be up to 90% similar.