By Joanna Gerber
In June, the IAB Tech Lab proposed a new initiative to create guardrails around how AI bots are permitted to access content, with an emphasis on publisher monetization.
It’s hoping that its new solution will get publishers back on their feet – and keep them there.
Publishers are like “the plankton of the digital media ecosystem,” said IAB Tech Lab CEO Anthony Katsur.
Every living thing in an aquatic environment depends on plankton. If they die out, the rest of the ocean goes down with them. And if publishers collapse, that would be an “extinction-level event” for digital media, Katsur said.
Many publishers are still managing to stay afloat, but the water is choppy, with traffic falling off the metaphorical cliff and no metaphorical harness in sight.
A life raft for publishers
The IAB Tech Lab’s initiative, currently called the LLM Content Ingest API Initiative (“which we need to rename,” Katsur joked; it’s “a mouthful”), can be broken down into four major components.
The first is access controls, which determine who is allowed to access a publisher’s content in the first place.
Once controls are established, access terms come into place, such as licensing models and content tiers. Under the IAB Tech Lab’s guidelines, content will be segregated into tiers based on relevance and value.
“Your archival content from 10 years ago is not worth as much as your late-breaking news or your interview with Taylor Swift,” Katsur said.
The guidelines would also mandate logging the use of content, which Katsur defines as “tracking and recording when and how publisher content is accessed or used by an LLM or AI system,” so publishers can accurately invoice and track usage of their data.
Content logging ties into the final part of the initiative, which Katsur believes is the most important facet: tokenization. Tokenization involves breaking content down into smaller units made up of words, parts of words, punctuation or metadata, Katsur said. These units, called tokens, are used to train LLMs and generate their responses. Publisher content gets tokenized and uniquely assigned to each publisher.
Then, “using the logging and reporting functions that we are proposing,” he explained, publishers can see exactly how the information scraped from their sites is being used.
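For readers who want to see the idea in miniature, here is a rough sketch of tokenizing a piece of content and logging its use per publisher. It is not the IAB Tech Lab's specification, which the article doesn't detail. The function names, the log fields and the simple word-and-punctuation split are all assumptions made for illustration. Production LLM tokenizers generally use subword schemes instead.

```python
import re
from collections import Counter

def tokenize(text):
    """Split text into word and punctuation tokens (a simplification;
    real LLM tokenizers typically use subword units)."""
    return re.findall(r"\w+|[^\w\s]", text)

# Hypothetical in-memory usage log; a real system would persist this.
usage_log = []

def log_access(publisher, bot, text):
    """Record which bot accessed which publisher's content, and how many tokens."""
    entry = {
        "publisher": publisher,
        "bot": bot,
        "token_count": len(tokenize(text)),
    }
    usage_log.append(entry)
    return entry

def tokens_by_publisher(log):
    """Total the logged tokens per publisher, e.g. as a basis for invoicing."""
    totals = Counter()
    for entry in log:
        totals[entry["publisher"]] += entry["token_count"]
    return dict(totals)
```

Calling `log_access("example-news", "example-bot", "Hello, world!")` would log four tokens for the made-up publisher "example-news," and `tokens_by_publisher(usage_log)` would roll those counts up by publisher.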
Tokenization is useful for brands, too, so they can see what is being said about their products and by whom. Many LLMs scrape sites like Reddit, for example, and parrot back what they find as fact – despite the information often being outdated, if not outright incorrect.
As AI continues to make a name for itself in search, a set of guidelines like the LLM Content Ingest API Initiative (looking forward to that new name) is the best way to ensure that query responses are accurate, Katsur said, and that publishers – and with them, the rest of the ad tech ecosystem – continue to thrive.
The big picture
But let’s zoom out.
What actually happens when a bot scrapes a website?
First, it’s important to note that AI isn’t born with limitless knowledge. It has to get that knowledge from somewhere. That’s why AI bots mine websites, which are vast troves of information.
Sometimes, scraping is one-and-done. When a query is for something straightforward, like a chocolate chip cookie recipe, a bot typically won’t need to continue scraping a site for more updated information, Katsur explained, since a cookie recipe doesn’t generally update or evolve. And once an AI model has a good recipe, it can feed it (no pun intended) to the hundreds of thousands of people requesting it.
That said, there’s no guarantee that a page scraped once will never be scraped again. There is a common misconception “that once an LLM crawls, it stores all the data and never crawls again,” said Katsur. The IAB Tech Lab’s research has shown that crawlers will recrawl content they have already accessed.
Still, scraping the same page a handful of additional times doesn’t scale against the pay-per-visit model that publishers are used to.
With a pay-per-crawl model, a publisher gets paid when a bot pulls information from its site – and that’s basically the end of the story. No matter how many of a generative AI search engine’s users benefit from that information down the line, the publisher only gets paid once per scrape.
Pay per query, on the other hand, is more similar to the way publishers currently drive revenue, and is the model favored by the IAB Tech Lab. “Now you’re getting paid per use,” said Katsur, “which is similar to getting paid per visit.”
“Pay per query scales,” he said. “Pay per crawl does not.”
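A back-of-the-envelope comparison shows why the two models diverge. The fees and volumes below are invented purely for illustration and do not come from the IAB Tech Lab or any vendor. Amounts are kept as whole micro-dollars (millionths of a dollar) to avoid rounding surprises.

```python
def pay_per_crawl_revenue(crawls, fee_per_crawl_micro_usd):
    """Revenue when the publisher is paid once each time a bot scrapes the page."""
    return crawls * fee_per_crawl_micro_usd

def pay_per_query_revenue(queries, fee_per_query_micro_usd):
    """Revenue when the publisher is paid each time its content is used in an answer."""
    return queries * fee_per_query_micro_usd

# Hypothetical scenario: a page is crawled 5 times, but the answer built
# from it is served to 200,000 users.
crawl_total = pay_per_crawl_revenue(crawls=5, fee_per_crawl_micro_usd=10_000)        # $0.01 per crawl
query_total = pay_per_query_revenue(queries=200_000, fee_per_query_micro_usd=1_000)  # $0.001 per query

print(f"Pay per crawl: ${crawl_total / 1_000_000:.2f}")
print(f"Pay per query: ${query_total / 1_000_000:.2f}")
```

Under these made-up numbers, the per-query fee is a tenth of the per-crawl fee, yet per-query revenue is far larger because it grows with the audience and not with the number of scrapes.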
Problem is, even pay per crawl isn’t guaranteed. Plenty of bots are scraping sites without providing any compensation and, technically, that’s allowed – for now.
But that seems to be changing, as more companies develop models that put publisher monetization at the forefront.
Earlier this month, Cloudflare implemented a new pay-per-crawl model that gives publishers full rein over the access they provide to bots. Publishers can give full access, block all scraping or opt into the new pay-per-crawl model, which requires bots to share payment information so they can be charged for each scrape.
That’s something – although, until this sort of model is widely adopted, publisher traffic is still in serious danger.
But, hey, along with the LLM Content Ingest API Initiative, it’s definitely a start.