Meta launches Sphere, an AI knowledge tool based on open web content, initially used to check citations on Wikipedia – TechCrunch

Facebook may be infamous for helping to usher in the era of “fake news,” but it has also been trying to carve out a place for itself in the never-ending battle to fight it. In the latest development on that front, Facebook parent Meta today announced a new tool called Sphere, an AI system built around the concept of tapping the vast repository of information on the open web to provide a knowledge base for AI and other systems to work with. According to Meta, the first user of Sphere is Wikipedia, which is using it in a production phase (not on live entries) to automatically scan entries and identify when the citations in those entries are strongly or weakly supported.

The research team has open-sourced Sphere, which is currently based on 134 million public web pages.


The idea behind using Sphere for Wikipedia is simple: the online encyclopedia has 6.5 million entries and sees some 17,000 new entries added on average each month. The wiki concept means that adding and editing content is effectively crowdsourced, and although there is a team of editors overseeing it all, the task is daunting and grows by the day, not just because of the site’s size but because of its mandate, given the number of people, and increasingly educators and other institutions, who rely on it as a reference repository.

At the same time, the Wikimedia Foundation, which oversees Wikipedia, has been thinking about new ways to put all of this data to use. Last month, it announced an Enterprise tier and its first two commercial customers, Google and the Internet Archive, which use Wikipedia-based data for their own purposes and will now have broader, more formal service agreements around that.

To be clear, today’s announcement about Meta’s work with Wikipedia is not related to Wikimedia Enterprise, but the addition of tools that help ensure Wikipedia’s content is verified and accurate will be something potential Enterprise customers will want to know about when considering paying for the service.

Meta has confirmed to me that there is no financial arrangement in this deal: Wikipedia is not becoming a paying customer of Meta, nor vice versa. But Meta notes that to train the Sphere model, it created “a new dataset (WAFER) of 4 million Wikipedia citations, much more complex than ever used for this type of research.” And only five days ago, Meta announced that Wikipedia editors are also using a new AI-based language translation tool it built, so there is clearly a deeper relationship here.

On Meta’s part, the company continues to be weighed down by poor public perception, stemming in part from accusations that it lets misinformation and toxic ideas run wild, or, if you’re someone who has ended up in “Facebook jail,” the belief that you shared something harmless but still ran afoul of overzealous content police. In that respect, launching something like Sphere feels a bit like a public relations exercise for Meta, as well as a potentially useful tool: if it works, it shows that there are people at the organization trying to work in good faith.

A few more details on today’s news and what may come next:

— Meta believes that the “white box” knowledge base Sphere represents contains significantly more data (and, therefore, more sources to match against for verification) than a typical “black box” knowledge source built on results from, say, proprietary search engines. “Because Sphere can access much more public information than current standard models, it could provide useful information that they cannot,” the company noted in a blog post. The 134 million documents Meta gathered to build Sphere were split into 906 million passages of 100 tokens each.
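To make the passage figure concrete, here is a minimal sketch of how a corpus can be split into fixed-length passages. It uses naive whitespace tokenization purely for illustration; the actual tokenizer and chunking rules behind Sphere’s 906 million passages are not described in the announcement.

```python
def split_into_passages(text, passage_len=100):
    """Split a document into consecutive passages of passage_len tokens.

    Whitespace tokenization is a stand-in here; Sphere's real tokenizer
    is not specified in the announcement.
    """
    tokens = text.split()
    return [
        " ".join(tokens[i:i + passage_len])
        for i in range(0, len(tokens), passage_len)
    ]

# A 250-token toy document yields three passages (100 + 100 + 50 tokens).
doc = " ".join(f"tok{i}" for i in range(250))
passages = split_into_passages(doc)
print(len(passages))  # 3
```

At this ratio (134M documents into 906M passages), the average document comes out to roughly seven passages, i.e. on the order of 700 tokens.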

— In open-sourcing this tool, Meta’s argument is that it makes a stronger base for training AI models and other work than any proprietary base. Still, the company concedes that knowledge bases themselves are potentially fragile, especially in these early days. What if a “truth” is simply not disseminated as widely as a piece of disinformation? This is where Meta wants to focus its future efforts with Sphere. “Our next step is to train models to assess the quality of retrieved documents, detect potential contradictions, prioritize more reliable sources, and, if there is no convincing evidence, admit that they can still, like us, be stumped,” it noted.

— In that sense, it raises some interesting questions about what Sphere’s hierarchy of truth will be based on compared to those of other knowledge bases. Because it is open source, users may have the ability to modify these algorithms in ways that better suit their own needs. (For example, a user deploying Sphere to verify legal citations might give more weight to court documents and case-law databases, while one verifying fashion or sports citations would emphasize other sources.)
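One way such per-user source weighting could work is a simple re-ranking pass over retrieved passages. The sketch below is hypothetical: the domain names, weights, and scoring scheme are invented for illustration, and Sphere’s actual ranking is not public.

```python
from urllib.parse import urlparse

# Invented trust weights for this example; a legal-citation checker might
# boost case-law databases and down-weight personal blogs.
DOMAIN_WEIGHTS = {
    "courtlistener.com": 2.0,
    "supremecourt.gov": 2.0,
    "example-blog.com": 0.5,
}

def rerank(passages, default_weight=1.0):
    """Re-rank (url, retrieval_score) pairs by domain-weighted score."""
    def weighted(item):
        url, score = item
        domain = urlparse(url).netloc
        return score * DOMAIN_WEIGHTS.get(domain, default_weight)
    return sorted(passages, key=weighted, reverse=True)

hits = [
    ("https://example-blog.com/post", 0.9),
    ("https://courtlistener.com/opinion/123", 0.6),
]
# The court document outranks the blog despite a lower raw score:
print(rerank(hits)[0][0])  # https://courtlistener.com/opinion/123
```

Swapping in a different weight table is all it would take to tune the same pipeline for a different domain of claims.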

— Meta has confirmed that it does not use Sphere, or any version of it, on its own platforms like Facebook, Instagram and Messenger, which have themselves long struggled with misinformation and bad-actor toxicity. (We have also asked whether there are any other customers lined up for Sphere.) The company said it uses separate tools to manage and moderate its own content.

— Above all, it seems that something like this is designed for mega scale. The current size of Wikipedia is arguably beyond what any single team of humans could check for accuracy, so the idea here is that Sphere is used to automatically scan hundreds of thousands of citations simultaneously and spot when a citation doesn’t have much support on the web: “If a citation seems irrelevant, our model will suggest a more applicable source, even pointing to the specific passage that supports the claim,” Meta noted.
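The verification step described above can be sketched in miniature: score how well candidate passages support a claim, flag the claim if even the best passage scores low, and otherwise point at the best-supporting passage. Token-overlap (Jaccard) scoring stands in here for Sphere’s learned dense-retrieval models, and the threshold is an invented placeholder.

```python
def support_score(claim, passage):
    """Jaccard overlap between token sets; a crude stand-in for a
    learned relevance model."""
    a, b = set(claim.lower().split()), set(passage.lower().split())
    return len(a & b) / len(a | b) if a | b else 0.0

def check_citation(claim, passages, threshold=0.2):
    """Return a verdict plus the best-supporting passage."""
    best = max(passages, key=lambda p: support_score(claim, p))
    verdict = "supported" if support_score(claim, best) >= threshold else "weakly supported"
    return verdict, best

claim = "the Eiffel Tower was completed in 1889"
passages = [
    "construction of the Eiffel Tower was completed in March 1889",
    "the Louvre is a museum in Paris",
]
verdict, evidence = check_citation(claim, passages)
print(verdict)  # supported
```

The real system works at a vastly different scale, matching claims against 906 million passages rather than a two-item list, which is why it needs an indexed retrieval model rather than a linear scan.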

While it’s in a trial phase at the moment, it also sounds like editors are the ones picking out which citations might need checking for now. “Ultimately, our goal is to create a platform to help Wikipedia editors systematically spot citation issues and quickly fix the citation or correct the corresponding article content at scale.”

Updated with additional comment from Meta.

Donald E. Patel