Humans cannot be the sole guardians of scientific knowledge

There is an old joke that physicists like to tell: everything has already been discovered and published in some Russian journal in the 1960s; we just don’t know about it. Although hyperbolic, the joke captures the current state of affairs. The volume of knowledge is vast and growing rapidly: the number of scientific papers posted on arXiv (the largest and most popular preprint server) in 2021 is expected to reach 190,000, and that is only a subset of the scientific literature produced this year.

It is clear that we do not really know what we know, because no one can read the entire literature, even in their own narrow field (which includes, in addition to journal articles, doctoral theses, laboratory notebooks, slides, white papers, technical notes and reports). Indeed, it is quite possible that the answers to many questions lie buried in this mountain of papers, that important discoveries have been overlooked or forgotten, and that connections remain unnoticed.

Artificial intelligence is a potential solution. Algorithms can already analyze text without human supervision to find relationships between words that help uncover knowledge. But much more can be accomplished if we move away from writing traditional scientific papers, the style and structure of which have changed little over the past hundred years.
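The principle behind such unsupervised text mining can be sketched in a few lines: words that appear in similar contexts receive similar vectors, so their relatedness can be measured without any human labelling. The toy corpus, window size and word choices below are illustrative assumptions; real systems (for example, word embeddings trained on millions of abstracts) apply the same idea at vastly larger scale.

```python
# Toy sketch of unsupervised relation-finding between words:
# build co-occurrence vectors from a tiny corpus and compare
# them with cosine similarity. Corpus and window size are
# arbitrary choices for illustration only.
from collections import defaultdict
from math import sqrt

corpus = [
    "thermoelectric materials convert heat into electricity",
    "photovoltaic materials convert light into electricity",
    "heat flows from hot to cold bodies",
]

window = 2  # how many neighbouring words count as "context"
vectors = defaultdict(lambda: defaultdict(int))

for sentence in corpus:
    words = sentence.split()
    for i, w in enumerate(words):
        for j in range(max(0, i - window), min(len(words), i + window + 1)):
            if i != j:
                vectors[w][words[j]] += 1  # count each context word

def cosine(u, v):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(u.get(k, 0) * v.get(k, 0) for k in set(u) | set(v))
    nu = sqrt(sum(x * x for x in u.values()))
    nv = sqrt(sum(x * x for x in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

# "thermoelectric" and "photovoltaic" share the context
# "materials convert", so their similarity is high; a word from
# an unrelated sentence scores zero.
print(cosine(vectors["thermoelectric"], vectors["photovoltaic"]))
print(cosine(vectors["thermoelectric"], vectors["cold"]))
```

No model of meaning is involved here, which is precisely the limitation discussed next: the machine sees co-occurrence statistics, not concepts.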

Text mining comes with a number of limitations, including access to full-text articles and legal issues. Most importantly, the AI does not really understand the concepts or the relationships between them, and it is susceptible to biases in the dataset, such as the selection of articles it analyzes. Scientific articles are difficult for AI, and indeed even for a non-expert human reader, to understand, partly because the use of jargon varies from discipline to discipline and the same term can carry completely different meanings in different fields. The increasing interdisciplinarity of research often makes it difficult to define a subject precisely with a combination of keywords and thereby discover all the relevant articles. Making connections and (re)discovering similar concepts is hard for even the brightest minds.

As long as this is the case, AI cannot be trusted, and humans will have to recheck everything an AI produces after text mining, a tedious task that defeats the very purpose of using AI. To solve this problem, we need to make scientific papers not only machine-readable but also machine-understandable, by (re)writing them in a particular type of programming language. In other words: teach science to machines in a language they understand.

Writing scientific knowledge in a programming-like language will make for dry reading, but it will be durable, as new concepts will be added directly to the library of science that machines understand. Moreover, as machines learn more scientific facts, they can help scientists streamline their logical arguments; spot errors, inconsistencies, plagiarism and duplication; and highlight connections. An AI that understands physical laws is more powerful than one trained on data alone, so such state-of-the-art machines will be able to aid future discoveries. Machines with deep scientific knowledge could assist rather than replace human scientists.

Mathematicians have already started this process of translation. They are teaching mathematics to computers by writing theorems and proofs in languages such as Lean. Lean is a proof assistant and programming language in which one can introduce mathematical concepts as objects. Using known objects, Lean can reason about whether a statement is true or false, helping mathematicians check their proofs and identify where their logic is insufficiently rigorous. The more mathematics Lean knows, the more it can do. Imperial College London’s Xena project aims to encode the entire undergraduate mathematics curriculum in Lean. One day, proof assistants may help mathematicians do research by checking their reasoning and searching the vast body of mathematical knowledge they hold.
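To give a flavour of what this translation looks like, here is a minimal sketch in Lean 4 (the theorem name is illustrative; `Nat.add_comm` is part of Lean’s core library):

```lean
-- State and prove that addition of natural numbers is commutative.
-- Once Lean accepts this statement, the fact becomes a piece of
-- machine-checkable knowledge that later proofs can build on.
theorem add_comm_example (a b : Nat) : a + b = b + a :=
  Nat.add_comm a b
```

Each such statement, once checked, joins a growing library of facts the machine can reuse, which is what "the more mathematics Lean knows, the more it can do" means in practice.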

Donald E. Patel