This seems like a (potential) solution looking for a nail-shaped problem.
Yes, there is a huge problem with AI content flooding the field, and being able to identify/exclude it would be nice (for a variety of purposes).
However, the issue isn't that content was "AI generated"; as long as the content is correct, and is what the user was looking for, they don't really care.
The issue is content that was generated en masse, is largely not correct/trustworthy, and serves only to game SEO/clicks/screentime/etc.
A system where the content you are actually trying to avoid has to opt in is doomed to failure. Is the purpose/expectation here that search/CDN companies attempt to classify, and identify, "AI content"?
It's the evil bit, but unironically.
For today's lucky 10k:
https://www.ietf.org/rfc/rfc3514.txt
Note the date it was published.
>Attack applications may use a suitable API to request that [the evil bit] be set. Systems that do not have other mechanisms MUST provide such an API; attack programs MUST use it.
Potential flaw: I'm concerned that attackers may be slow to update their malware to achieve compliance with this RFC. I suggest a transitional API: Intrusion detection systems respond to suspected-evil packets that have the evil bit set to 0 with a depreciation notice.
deprecation notice
It says in the first paragraph it’s for crawlers and bots. How many humans are inspecting the headers of every page they casually browse? An immediate problem that could potentially be addressed by this is the “AI training on AI content” loop.
How many of the makers of these trash SEO sites are going to voluntarily identify their content as AI generated?
Moreover, I find it ironic that website owners would graciously give AI companies the power to identify what is "good" data and what is not. I mean, why would I do the work for them and label my data as AI, so that they would ignore it? "Yes please, take all my work, this is quality content, train on it, it's free!" That's what it sounds like.
It would still require the content producer (i.e., the content spam farm) to label their content as such.
The current approach is that the content served is the same for humans and agents (i.e., a site serves consistent content regardless of the client), so who a specific header is "meant for" is a moot point here.
I believe this is why Google did SynthID: https://deepmind.google/science/synthid/
Can we have a sponsored-content disclosure header instead?
I'd love to browse without that.
It does not bother me that someone used a tool to help them write if the content is not meant to manipulate me.
Let's solve the actual problem.
We already have those legally mandated disclosures per the FTC.
Maybe we should avoid training AI with AI-generated content: that's a use case I would defend.
Still, I believe MIME would be the right place to say something about the media, rather than the transport protocol.
On a lighter note: we should consider second-order consequences. The EU Commission will demand its own EU-AI-Disclosure header be sent to EU citizens, and will require consent from the user before showing them AI-generated stuff. The UK will require age validation before showing AI stuff, to protect the children's brains. France will use the header to compute a new tax on AI-generated content, payable by any online platform that wants to show AI-generated content to French citizens.
That's a Pandora's box I wouldn't even talk about, much less open...
> The EU Commission will demand its own EU-AI-Disclosure header be sent to EU citizens, and will require consent from the user before showing them AI-generated stuff. The UK will require age validation before showing AI stuff, to protect the children's brains. France will use the header to compute a new tax on AI-generated content, payable by any online platform that wants to show AI-generated content to French citizens.
I think the recent drama related to the UK's Online Safety Act has shown that people are getting sick of country-specific laws simply for serving content. The most likely outcome is sites either block those regions or ignore the laws, realizing there is no practical enforcement avenue.
> Maybe we should avoid training AI with AI-generated content: that's a use case I would defend.
if this takes off I'll:
- tag my actual content (so they won't train on it)
- not tag my infinite spider web of automatically generated slop output (so it'll poison the models)
win win!
then they'll start ignoring the header and it'll be useless
(of course, it was never going to be useful)
It depends, but for example, if I wanted to train a LoRA that outputs a certain art style from a specific model, I have no issue with this being done. It's not like you are making a model from scratch.
Content-Type/MIME type is for the format.
There are dedicated headers for other properties, e.g. language.
Actually you're 100% correct.
Feels weird to me that encoding is part of MIME, but language isn't, although I understand why.
Yeah. The reason is that charset is specific to text types. Language can apply to many media.
Though FWIW, I think the Content-Encoding header is basically a mistake; it should have been Content-Transform.
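For what it's worth, these axes really are orthogonal in HTTP today. A minimal WSGI sketch in Python (standard library only); the AI-Disclosure field name and value are assumptions for illustration, not the draft's normative syntax:

```python
import gzip
from wsgiref.simple_server import make_server

def app(environ, start_response):
    body = gzip.compress(b"<p>hello</p>")  # actually gzipped, to match Content-Encoding below
    start_response("200 OK", [
        ("Content-Type", "text/html; charset=utf-8"),  # format; charset only makes sense for text/*
        ("Content-Language", "en"),                    # natural language of the body
        ("Content-Encoding", "gzip"),                  # transform applied for transfer ("Content-Transform")
        ("AI-Disclosure", "mode=ai-modified"),         # hypothetical provenance axis
        ("Content-Length", str(len(body))),
    ])
    return [body]

if __name__ == "__main__":
    make_server("127.0.0.1", 8000, app).serve_forever()
```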
It feels like a header is the wrong tool for this. Even if you hypothetically wanted to disclose that, would you expect a blog CMS to offer the feature? Or a web browser to surface it?
Approximately as useless as "do not track".
Seems like someone just trying to get their name on a published IETF standard for the bragging/resume rights
This is like asking the fox to announce itself before entering the henhouse
Completely the wrong way around. We are heading into a future where everything will be touched by AI in some way, be it things like Photoshop Generative Fill, spell check, subtitles, face filters, upscaling, translation, or just good old algorithmic recommendations. Even many smartphones already run AI over every photo they take.
Doing it in an HTTP header is furthermore extremely lossy; files get copied around and that header ain't coming with them. It's not a practical place to put that info, especially when we have Exif inside the images themselves.
The proper way to handle this is to mark authentic content and keep a trail of how it was edited, since that's the rare thing you might want to highlight in a sea of slop. https://contentauthenticity.org/ is trying to do that.
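On the Exif point: in-band metadata rides inside the file bytes and survives copying in a way a response header never will. A quick sketch of dumping it with Pillow (the path is a placeholder); fields like Software and DateTime are where editors, and some generators, leave a trace:

```python
# Requires Pillow (pip install Pillow); "photo.jpg" is a placeholder path.
from PIL import Image, ExifTags

exif = Image.open("photo.jpg").getexif()
for tag_id, value in exif.items():
    name = ExifTags.TAGS.get(tag_id, hex(tag_id))  # map numeric tag ids to readable names
    print(f"{name}: {value}")  # e.g. Software, DateTime, Make, Model
```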
The authors do seem to be conflating AI as a marketing term with the ChatGPT types. AI encompasses a broad suite of technologies, including the spell check you've mentioned, and given the number of tools used today that would technically constitute AI, this header makes no sense.
Yup, this is the way. Assume everything is AI unless proven otherwise.
Interesting initiative but I wonder if the mode provides sufficient granularity. For example, what about an original human-generated text that is entirely translated by an AI?
> what about an original human-generated text that is entirely translated by an AI?
Probably ai-modified -- the core content was first created by humans, then modified (translated into another language). Translating back would hopefully return the original human-generated content (or at least something as close as possible to the original).

| class             | author | modifier/reviewer |
| ----------------- | ------ | ----------------- |
| none              | human  | human/none        |
| ai-modified       | human  | ai                | <-- this case
| ai-originated     | ai     | human             |
| machine-generated | ai     | ai/none           |
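The table reads naturally as a lookup on (author, modifier). A throwaway sketch of that reading in Python; the class names come from the spec text quoted later in the thread, but the mapping itself is just this subthread's interpretation, not normative:

```python
def disclosure_class(author: str, modifier: str | None) -> str:
    # (author, modifier) -> class, per the commenter's table above
    table = {
        ("human", "human"): "none",
        ("human", None):    "none",
        ("human", "ai"):    "ai-modified",        # the translation case discussed here
        ("ai", "human"):    "ai-originated",
        ("ai", "ai"):       "machine-generated",
        ("ai", None):       "machine-generated",
    }
    return table[(author, modifier)]

assert disclosure_class("human", "ai") == "ai-modified"
```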
It certainly doesn't cover the case of mixed-origin content. Say, for example, a dialog between a human and an AI, or even mixed-model content.
For those, my instinct is to fall back to markup, which would seem to work quite well. There is the pesky issue of AI content in non-markup formats - think JSON, which doesn't have the same orthogonal flexibility for annotating metadata.
The bigger challenge here is that we already struggle with basic metadata integrity. Sites routinely manipulate creation dates for SEO - I regularly see 5-year-old content timestamped as "published yesterday" to game Google's freshness signals.
While this doesn't invalidate the proposal, it does suggest we'd see similar abuse patterns emerge once this header becomes a ranking factor.
Does that work? There’s no way…
Most web servers use mtime for the Last-Modified header.
It would be crazy for Google to treat that as authorship date, and I cannot believe that they do.
> It would be crazy for Google to treat that as authorship date, and I cannot believe that they do.
I'm not sure what Google uses for authorship date, but if you do date-range-based web searches, the actual dates of the content no longer have any meaningful relationship to what was set in the search criteria (news seems mostly better, with some problems, but actual web search is hopeless). It goes in both directions -- searching for recent stuff gets plenty of very old stuff mixed in, but searching for stuff from a period well in the past gets lots of stuff from yesterday, too.
On platforms like WordPress, these headers are settable via SEO plugins. Many sites will roll these headers forward.
Hack: only present this header to AI crawlers, so they don't index your content, lol.
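A sketch of that hack as WSGI middleware. The user-agent substrings are an illustrative, incomplete list, and the header syntax is an assumption; anything determined to train on your content will of course just spoof its User-Agent:

```python
AI_CRAWLER_TOKENS = ("GPTBot", "ClaudeBot", "CCBot", "PerplexityBot")  # illustrative, incomplete

def poison_pill(app):
    """Wrap a WSGI app; claim machine-generated, but only to AI-crawler-looking clients."""
    def wrapped(environ, start_response):
        ua = environ.get("HTTP_USER_AGENT", "")
        def sr(status, headers, exc_info=None):
            if any(token in ua for token in AI_CRAWLER_TOKENS):
                headers = headers + [("AI-Disclosure", "mode=machine-generated")]
            return start_response(status, headers, exc_info)
        return app(environ, sr)
    return wrapped
```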
I'm genuinely torn. On one hand, transparency is good. But on the other, I can totally see this header becoming a lazy filter for platforms to just automatically demote or even block any AI-assisted content. What happens to artists using AI tools, or writers using it for brainstorming?
They can adapt or get left behind
Ha
This feels like the Security Flag proposal (https://www.ietf.org/rfc/rfc3514.txt)
Or end up like California Prop 65 warnings: https://en.wikipedia.org/wiki/1986_California_Proposition_65
Why only for HTTP? This would be appropriate for MIME multipart/mixed part headers as well. ;)
Maybe better define an RDF vocabulary for that instead, so that individual DIVs and IMGs can be correctly annotated in HTML. ;)
Hoping I don't need to click on something, or have something obstructing my view.
The cookie banner just got 200px taller.
This is a gentlemen's agreement humans will not keep. That's not how our species works.
Maybe an ignorant question, but at the dictionary level, how would one indicate that multiple providers/models went into the resulting work (based on the example given)? Is there a standard for nested lists?
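If the header is an RFC 8941 Structured Fields dictionary (which is what "at the dictionary level" suggests), there is a standard answer, with a limit: dictionary members may hold inner lists of items, written in parentheses, but inner lists cannot nest any deeper. A sketch, with the field's member keys and value shapes assumed for illustration rather than taken from the draft:

```python
providers = [("openai", "gpt-4o"), ("anthropic", "claude-3")]  # hypothetical example data

# Flatten provider/model pairs into sf-tokens inside one inner list; sf-tokens
# may contain "/", so this stays parseable. Anything needing deeper nesting
# would have to use multiple dictionary members or a different encoding.
inner = " ".join(f"{provider}/{model}" for provider, model in providers)
header_value = f"mode=ai-originated, model=({inner})"
print(header_value)  # mode=ai-originated, model=(openai/gpt-4o anthropic/claude-3)
```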
Years ago people were arguing that fashion magazines should have to disclose if they photoshopped pictures of women to make them look skinnier. France implemented this law, and I believe other countries have as well. I believe that we should have similar laws for AI-generated content.
I'm all for some kind of disclosure, but where do we draw the line? I use a pretty smart grammar and spell checker, one that's got more "AI" in it to analyze the sentence structure. Is that AI content?
According to the spec, yes a grammar checker would be subject to disclosure:
> ai-modified Indicates AI was used to assist with or modify content primarily created by humans. The source material was not AI-generated. Examples include AI-based grammar checking, style suggestions, or generating highlights or summaries of human-written text.