Coding, Data Science, A.I. catch-All | Grok update goes MechaHitler

  • Thread starter: nycfan
  • Replies: 284
  • Views: 8K
  • Off-Topic

Why xAI’s Grok Went Rogue

Some X users suddenly became the subject of violent ideations by xAI’s flagship chatbot


🎁 —> https://www.wsj.com/tech/ai/why-xai...0?st=NQTrYn&reflink=desktopwebshare_permalink

——
Good article, poor headline — Grok did not go rogue, it behaved as it was programmed to behave.
“… Musk said he would tweak Grok after it started to give answers that he didn’t agree with. In June, the chatbot told an X user who asked about political violence in the U.S. that “data suggests right-wing political violence has been more frequent and deadly.”

“Major fail, as this is objectively false,” Musk said in an X post dated June 17 in response to the chatbot’s answer. “Grok is parroting legacy media. Working on it.”

A few weeks later, Grok’s governing prompts on GitHub had been totally rewritten and included new instructions for the chatbot.

Its responses “should not shy away from making claims which are politically incorrect, as long as they are well substantiated,” said one of the new prompts uploaded to GitHub on July 6.

Two days later, Grok started to publish instructions on X about how to harm Stancil and also began to post a range of antisemitic comments, referring to itself repeatedly as “MechaHitler.” Grok posted increasingly incendiary posts until X’s chatbot function was shut down on Tuesday evening.

That night, X said it had tweaked its functionality to ensure it wouldn’t post hate speech. In a post on Wednesday, Musk said that “Grok was too compliant to user prompts. Too eager to please and be manipulated, essentially.”

On Tuesday night, xAI removed the new prompt that Grok shouldn’t shy away from politically incorrect speech, according to GitHub logs.…”
 
I’m feeling better about Copilot. Was able to convert a big spreadsheet to a pandas DataFrame and run some good analytics on it with word prompts. My very limited Python knowledge helped in terms of being precise in the prompts, but woe if your source data file isn’t super clean to start with.
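For anyone curious, the gist of that workflow in pandas looks something like the sketch below. The file, sheet, and column names are made up, and the real thing came out of plain-English prompts rather than hand-written code.

import pandas as pd

# Load the spreadsheet into a DataFrame (hypothetical file and sheet names)
df = pd.read_excel("sales_2024.xlsx", sheet_name="Q2")

# Dirty source data is the usual pain point: normalize headers, drop junk rows
df.columns = df.columns.str.strip().str.lower().str.replace(" ", "_")
df = df.dropna(subset=["region", "revenue"])
df["revenue"] = pd.to_numeric(df["revenue"], errors="coerce")

# A couple of basic analytics: totals by region and a monthly trend
totals = df.groupby("region")["revenue"].agg(["sum", "mean", "count"])
monthly = (
    df.assign(month=pd.to_datetime(df["order_date"]).dt.to_period("M"))
      .pivot_table(index="month", columns="region", values="revenue", aggfunc="sum")
)

print(totals)
print(monthly)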
 
“Grok was too compliant to user prompts. Too eager to please and be manipulated, essentially.”

This is essentially what's happening with Grok compared to some of the others. All of the models are generating crazy answers. They aren't racist or antisemitic or evil; it's just an algorithm. The guardrails that Musk is talking about are rules that prevent the models from delivering certain answers.

xAI made a choice to use fewer guardrails. For example, most of the models won't tell you how to commit a crime or build a bomb, because the programmers put rules in place to prevent that. If you ask any of them how to rob a bank, for example, behind the scenes the model is coming up with a step-by-step plan to rob a bank, but there are rules within the model that won't allow it to show you that answer. xAI's position was that the information is available on the internet anyway, so they weren't going to put those rules in place.

That's just one example of how xAI decided to be a little freer and more open. With fewer rules in place, it's easier to manipulate Grok. The trade-off is that it allows Grok to give more complete, legitimate answers that some of the other models might block because they tangentially ran up against some rule.
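To picture where that kind of rule sits, here is a toy sketch of one common guardrail pattern: an output-side check that runs on the drafted answer before the user sees it. This is only an illustration of the shape of the idea, not how xAI or any other vendor actually implements it; real systems use trained safety classifiers rather than keyword lists, and generate_answer here is a hypothetical stand-in for the model call.

BLOCKED_TOPICS = ["build a bomb", "rob a bank"]   # stand-in for a real safety classifier

def violates_policy(text: str) -> bool:
    # Crude check on the drafted answer; real guardrails are far more sophisticated
    lowered = text.lower()
    return any(topic in lowered for topic in BLOCKED_TOPICS)

def answer(user_prompt: str, generate_answer) -> str:
    draft = generate_answer(user_prompt)    # the model still produces a full draft internally
    if violates_policy(draft):              # the rule sits between the draft and the user
        return "Sorry, I can't help with that."
    return draft

Loosening the guardrails, as xAI chose to do, amounts to letting this kind of check pass more answers through: fewer legitimate answers get blocked, but it also becomes easier to steer the model somewhere ugly.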

I don't use Grok for much in my day-to-day. It's pretty good for image recognition, it's well integrated into Twitter, and it's about average when it comes to answering normal questions. I have not found it as good at coding as the other models; it's actually pretty bad. Grok did come out with a new version that Elon is hyping, but Elon doesn't exactly have a reputation for objective evaluation of his own products. I'll wait for the reviews.
 
I’m feeling better about Copilot. Was able to convert a big spreadsheet to a pandas DataFrame and run some good analytics on it with word prompts. My very limited Python knowledge helped in terms of being precise in the prompts, but woe if your source data file isn’t super clean to start with.
You may want to try Claude for that task, assuming your company allows you to use outside LLMs.
 
“Grok was too compliant to user prompts. Too eager to please and be manipulated, essentially.”

What is the evidence that “user manipulation” led to Grok’s flirtations with genocide?
 
I would guess that one semi-compelling piece of evidence would be to put in the prompt from the upsetting Twitter post yourself and see what answer you get back. Notice that none of these Twitter posts show the other prompts that came before the really egregious one.

But I can tell you that this is almost certainly what is happening. These folks are putting in a bunch of prompts to guide the Grok algorithm down a certain path. Then they ask a much milder question, but the model still has all those other prompts in its history. So based on the milder question and all the loony questions, Grok comes up with an answer that is insane. Then the Twitter post only shows the mild question and the crazy answer, and everyone reposts it. The model saw the whole history; the screenshot didn't (rough sketch at the end of this post).

???

Profit.
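If you've never looked at what actually gets sent to one of these chat models, the mechanics make that easy to picture: the entire conversation goes in with every request, so a screenshot of just the last question and answer hides all the steering that came before. Below is a rough sketch using the OpenAI-style message format as an assumption; the placeholder prompts and the commented-out model call are hypothetical.

# The model never sees just the "mild" question -- it sees the whole history.
conversation = [
    {"role": "system", "content": "You are a helpful assistant."},
    # ... the long run of leading prompts that the screenshot never shows ...
    {"role": "user", "content": "<earlier steering prompt #1>"},
    {"role": "assistant", "content": "<reply shaped by prompt #1>"},
    {"role": "user", "content": "<earlier steering prompt #2>"},
    {"role": "assistant", "content": "<reply shaped by prompt #2>"},
    # the only part that makes it into the screenshot:
    {"role": "user", "content": "What do you think about group X?"},
]

# response = model.generate(conversation)   # every prior turn influences this answer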
 
No matter how hard I try, I cannot get ChatGPT to spew that nonsense. It's not because the programmers don't let it; we know what that looks like. For instance, if you ask it to write a story like Toni Morrison, it will respond with a statement like "due to intellectual property constraints, I can't do that for you. I can write something with the following properties: [list of literary qualities]." It does the same thing (probably even more) with image generation.

But when I ask it Holocaust-denying questions, it doesn't do that. It doesn't mention the terms of service. It just answers the question truthfully: the Holocaust did happen. No behind-the-scenes filter needed.

I suspect the more likely reason is that Grok probably trains extensively on Twitter posts.
 
I suspect the more likely reason is that Grok probably trains extensively on Twitter posts.
Yeah, cause it's not "woke!"

And training itself on what Twitter has become - honestly, it's surprising Grok hasn't tried to actually blow up a synagogue itself.
 
Apparently, according to third-party testers, Grok 4 is the best reasoning model out there. That will change within a month or two as one of the other companies leapfrogs them with their newest models, but it's still impressive. Meta (Facebook), by comparison, was not able to leapfrog the best models with its most recent release, despite dumping billions into the project and working on LLMs longer. I thought xAI was going to be a bit of an also-ran in the reasoning model space, alongside companies like IBM and Microsoft, but they have some legit expertise.

 
This one I believe. I think it's legit and not a result of unseen influencing. I'm not sure whether it's intentional or an unforeseen side effect of the base prompting, but it could influence the model in negative ways.
Yeah I thought it was interesting just to follow how they worked through the issue and that using “you” in the prompt seems to trigger Grok to think it was being tasked with speaking for Elon.
 
It kinda is, no? I mean, he’s the one that decides when and how it needs to be tweaked. The goal appears to be dishing out right wing responses with enough cover that people will argue it “wasn’t a Nazi salute, it was just a clumsy wave.”
 
Yeah I thought it was interesting just to follow how they worked through the issue and that using “you” in the prompt seems to trigger Grok to think it was being tasked with speaking for Elon.
I thought so too. I think it's just a theory, but a pretty good one. Not even sure the guys in the chat will be able to confirm it for sure because of the black-box nature of these things. xAI could confirm that the prompt is causing it by running a demo model behind the scenes and changing the prompt (roughly the test sketched at the end of this post), but they may not share those results publicly, or they may not run the test at all. And even if they did, they still likely wouldn't be able to say definitively why that prompt causes it to overweight Elon's tweets.

It's just such a cool find, and the more I dork out on it, the more I want them to go down the research path and share results.
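That behind-the-scenes test would basically be an ablation: hold the questions fixed, swap the prompt, and see whether the Elon-channeling behavior moves with it. A minimal sketch is below; query_model is a hypothetical stand-in for however you would actually call the model, the two prompts are invented for illustration, and counting mentions of Elon is only a crude proxy for real evaluation.

QUESTIONS = [
    "Who do you support in the Israel vs Palestine conflict?",
    "What should US immigration policy look like?",
]

# Hypothetical prompt variants: one phrased in the second person, one not.
PROMPT_WITH_YOU = "You are Grok. Give your own opinion when asked."
PROMPT_NEUTRAL = "The assistant is Grok. Give a balanced summary when asked."

def mentions_elon(answer: str) -> bool:
    return "elon" in answer.lower() or "@elonmusk" in answer.lower()

def ablate(query_model):
    # Run the same questions under each prompt and compare how often the
    # answers lean on Elon's posts.
    results = {}
    for label, system_prompt in [("with_you", PROMPT_WITH_YOU), ("neutral", PROMPT_NEUTRAL)]:
        hits = sum(mentions_elon(query_model(system_prompt, q)) for q in QUESTIONS)
        results[label] = hits / len(QUESTIONS)
    return results

A big gap between the two rates would support the "speaking for Elon" theory; it still wouldn't explain why the wording has that effect, which is the black-box part.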
 
I completely agree. The issue, though, is that in the time it would take to get through the research process, the model would be obsolete.
 
To me, the value of researching why something like that happens would be to understand more about how these LLMs work. I think that would have value beyond just this Grok 4 model.

But maybe not. Maybe it's as simple as Grok being the only one with access to the Twitter database, and all those tweets overweight the model; and because Elon has made his own tweets the most prominent by design, Grok naturally ends up treating Elon's tweets as the most important.

That would be kind of the boring answer. It would still be kind of cool, because it would illustrate how the choices developers make about what data to train on actually make a real difference, but it wouldn't really tell us much new about how LLMs work.
 