If anything on the internet comes close to containing universal knowledge, it is Wikipedia. In the pre-digital era, the comparable knowledge accessible from home was in encyclopedias. The difference in size, however, is astronomical. Jaime Crespo, a computer engineer and member of the Wikimedia Foundation team responsible for the persistence of its systems, has made an approximate calculation of the size in terabytes of all his organization’s projects, of which Wikipedia is the main one.
At the end of September, the total came to between 500 and 600 terabytes, including the images. A 10-terabyte hard drive weighs 850 grams, and one terabyte equates to roughly 500 hours of HD video. The complete content of Wikipedia would therefore fit on about 50 kilos of hard drives. “If we put the content in a single file, it would be that, in 300 languages and all the projects,” says Crespo, 38, who works remotely from Logroño, where he was born.
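The 50-kilo figure can be checked with a few lines of Python. This is only a rough sketch: the drive capacity and weight are the ones quoted above, while the function name and the assumption that only whole drives are used are illustrative.

```python
import math

DRIVE_CAPACITY_TB = 10   # capacity of one drive, as quoted in the article
DRIVE_WEIGHT_KG = 0.85   # 850 grams per drive, as quoted in the article


def drives_weight_kg(total_tb: float) -> float:
    """Weight of the whole drives needed to hold total_tb terabytes."""
    # Assumption: data is spread over whole drives, so round the count up.
    drives = math.ceil(total_tb / DRIVE_CAPACITY_TB)
    return drives * DRIVE_WEIGHT_KG


# Lower and upper ends of Crespo's 500 to 600 terabyte estimate.
for total_tb in (500, 600):
    print(f"{total_tb} TB -> {drives_weight_kg(total_tb):.1f} kg")
```

The two ends of the estimate give roughly 42.5 and 51 kilos, which is where "about 50 kilos" comes from.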
This calculation is for illustration purposes and glosses over several technical problems. That figure would not include Wikipedia in an organized or easy-to-search format: “It seems little because in plain text it is not that much, even along with the images,” says Crespo by videoconference to EL PAÍS. “But to serve that information, you need a lot more space,” he adds. If we had only those disks at home in text format and wanted to search for a word, it would take hours to return the result. “You would not have the same functions as on the internet. Maybe it would take two hours to find what you are looking for. You would have to search everything from top to bottom. The databases organize information so that you ask for an article and in milliseconds you have it,” explains Crespo, who did his calculation for a conference of the Spanish Python Association (Python is a programming language).
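The difference Crespo describes, between reading everything from top to bottom and asking a database, can be illustrated with a toy example in Python. This is a sketch under stated assumptions, not how Wikipedia's databases actually work: the three-article corpus and the function names are hypothetical, and the index is a plain dictionary from words to article titles.

```python
from collections import defaultdict

# Hypothetical toy corpus standing in for article texts.
articles = {
    "Cleopatra": "Cleopatra was the last active ruler of Ptolemaic Egypt",
    "Periodic table": "The periodic table arranges the chemical elements",
    "Logroño": "Logroño is a city in northern Spain",
}


def scan_search(word: str) -> list[str]:
    """Read every article from top to bottom; cost grows with the total text."""
    word = word.lower()
    return sorted(
        title for title, text in articles.items() if word in text.lower().split()
    )


def build_index(docs: dict[str, str]) -> dict[str, set[str]]:
    """Map each word to the titles that contain it. Built once, up front."""
    index: dict[str, set[str]] = defaultdict(set)
    for title, text in docs.items():
        for token in text.lower().split():
            index[token].add(title)
    return index


index = build_index(articles)


def indexed_search(word: str) -> list[str]:
    """A single dictionary lookup, regardless of how much text there is."""
    return sorted(index.get(word.lower(), set()))


print(scan_search("egypt"))
print(indexed_search("egypt"))
```

Both functions return the same titles, but the index has to be stored alongside the text, which is one reason serving the information takes more space than the raw content.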
Despite these technical shortcomings, the comparison helps convey the unimaginable difference between the Larousse encyclopedias of 20th-century homes and Wikipedia. How could we have thought that those encyclopedias were “universal”? “In addition, we are the first to say that we have almost nothing of human knowledge,” explains Crespo. “It often annoys us that a small town in Spain has only four paragraphs when it could have many more.”
To complete the analogy, an American artist set out to print only the English Wikipedia. It came to 7,473 volumes.
Wikipedia today is much more than the encyclopedia of yesteryear, but its use is not so different. On the Spanish Wikipedia in September, two of the three most searched terms were “Cleopatra” and “periodic table of elements”, which have all the hallmarks of being linked to students. The second was “Squid Game”, the Netflix series. Wikipedia combines traditional searches with current affairs.
A few decades ago, an especially dedicated person could aspire to read “all” universal knowledge. Even Crespo made his attempts: “When I was little I loved to take out the encyclopedia and read pieces to myself and learn things, maybe that’s why I ended up working here,” he says. But today that would be impossible; it is overwhelming: “Wikipedia is a black hole of knowledge because it attracts you and it never ends, there is always something else, it would be impossible for a human to read it because of the speed at which the information is added,” he adds.
Crespo’s experience also helps explain a little more about how the cloud works. The Wikimedia Foundation has its own data storage, independent of big tech: “We are a bit special but it is in line with the philosophy we have of privacy and transparency,” says Crespo. “Companies work with other clouds, but we want to have control over the data because we don’t want anyone to access private data or to be able to compile statistics. That means managing our own machines. We have a room inside larger data centers that is locked, and only we can get in there,” he explains.
Wikipedia is the 14th most visited site in the world, according to data from Alexa. That means, according to Crespo, “half a million queries per second”, of which approximately a third are for the English Wikipedia. The rest of the top sites are much larger than Wikipedia.
A central part of Crespo’s job is keeping Wikipedia’s backups alive. One problem for engineers in charge of preserving data or knowledge is thinking ahead, which is hard in technology. Crespo must make sure that his backups will still be accessible in 2027 or beyond. “I am very careful to use technologies that will still have a future five years from now. It could be that five years later the manual for the way something was encrypted no longer exists,” he says. “Languages and applications can also play a role, but I always use very portable formats. If the database technology we use disappeared today, it would cost us little to migrate to another because we have it in compressed text, which is a standard format.”
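The idea of keeping backups as compressed text can be sketched with Python's standard library. This is an illustration only: Crespo does not say which text or compression format Wikimedia uses, so the choice of gzip, one JSON object per line, and the function names and sample rows below are all assumptions.

```python
import gzip
import json


def dump_rows(rows: list[dict]) -> bytes:
    """Write rows as plain text (one JSON object per line), then gzip it."""
    # Assumption: JSON lines + gzip stand in for "compressed text".
    text = "\n".join(json.dumps(row) for row in rows)
    return gzip.compress(text.encode("utf-8"))


def load_rows(blob: bytes) -> list[dict]:
    """Restore the rows using nothing but gzip and a JSON parser."""
    text = gzip.decompress(blob).decode("utf-8")
    return [json.loads(line) for line in text.split("\n") if line]


# Hypothetical rows standing in for a database table.
rows = [
    {"title": "Cleopatra", "language": "es"},
    {"title": "Logroño", "language": "es"},
]

backup = dump_rows(rows)
print(load_rows(backup) == rows)
```

Because the backup is ordinary text inside a widely supported compression format, it can be read back without the database software that produced it, which is the portability Crespo is describing.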
Wikipedia is 99% open information, says Crespo. The foundation also tries to hold little private information, because not having it is the best way to avoid leaks or losses. Yet data on the editors who monitor page changes, or on those who vandalize pages, is sensitive. Hence the encryption. “Most of the data is public. Aside from our backups, which contain user activity, we publish an exports page with an archive of all our articles for people to download. If our organization were to disappear, the public has a copy to rebuild it. There is even a copy on the Moon,” he says.
Eddie is an Australian news reporter with over nine years in the industry who has been published in Forbes and TechCrunch.