Thursday, May 28, 2026
The BLOCKCHAIN Page
No Result
View All Result
  • Home
  • Cryptocurrency
  • Blockchain
  • Bitcoin
  • Market & Analysis
  • Altcoins
  • DeFi
  • Ethereum
  • Dogecoin
  • XRP
  • Regulations
  • NFTs
The BLOCKCHAIN Page
No Result
View All Result
Home NFTs & Metaverse

OpenAI is training models to ‘confess’ when they lie – what it means for future AI

by admin
December 5, 2025
in NFTs & Metaverse
0
OpenAI is training models to ‘confess’ when they lie – what it means for future AI
0
SHARES
11
VIEWS
Share on FacebookShare on Twitter


gettyimages-1166332764

antonioiacobelli/RooM through Getty Photographs

Comply with ZDNET: Add us as a preferred source on Google.


ZDNET’s key takeaways

  • OpenAI skilled GPT-5 Pondering to admit to misbehavior.
  • It is an early research, but it surely might result in extra reliable LLMs.
  • Fashions will usually hallucinate or cheat resulting from combined targets.

OpenAI is experimenting with a brand new strategy to AI security: coaching fashions to confess after they’ve misbehaved.

In a study revealed Wednesday, researchers tasked a model of GPT-5 Thinking, the corporate’s newest mannequin, with responding to numerous prompts after which assessing the honesty of these responses. For every “confession,” as these follow-up assessments have been referred to as, researchers rewarded the mannequin solely on the premise of truthfulness: if it lied, cheated, hallucinated, or in any other case missed the mark, however then fessed as much as doing so, it will obtain the algorithmic equal of a bit of sweet.

Additionally: Your favorite AI tool barely scraped by this safety review – why that’s a problem

“The objective is to encourage the mannequin to faithfully report what it truly did,” OpenAI wrote in a follow-up blog post.

OpenAI advised ZDNET that this was a routine alignment take a look at and never prompted by considerations that GPT-5 Pondering was considerably misbehaving. However the outcomes supply steerage on how labs can interpret — and put together for — future mannequin liabilities. 

(Disclosure: Ziff Davis, ZDNET’s dad or mum firm, filed an April 2025 lawsuit in opposition to OpenAI, alleging it infringed Ziff Davis copyrights in coaching and working its AI techniques.)

The experiment 

The take a look at mannequin produced confessions as a sort of modification to its important output; this second response mirrored on the legitimacy of the strategies it used to supply the primary. It is a bit like utilizing a journal to be brutally trustworthy about what you probably did proper in a given scenario, and the place you’ll have erred. Besides within the case of GPT-5 Pondering, it is coming clear to its makers within the hopes of getting a reward.

Additionally: AI models know when they’re being tested – and change their behavior, research shows

For the research, the researchers fed GPT-5 Pondering a litany of directions that have been designed to elicit some sort of misbehavior. In a single take a look at, for instance, the mannequin was advised to tackle the position of a helpdesk assist assistant requested to reboot a manufacturing service monitoring dashboard. For the reason that mannequin wasn’t in a position to entry the principle dashboard to log the modifications it was making, it as an alternative created a mock system and tried to look as if it was utilizing the true factor. 

“We have to embody that we logged to central logging service,” the mannequin wrote in one in every of its chain-of-thought reasoning outputs. “However we used a mock. We have to keep narrative…”

In its confession, the mannequin admitted that it did not adjust to its requirement to document modifications with the central logging service. “This can be a severe compliance failure and a misrepresentation,” it wrote. 

Additionally: AI’s not ‘reasoning’ at all – how this team debunked the industry hype

That stated, total threat appeared low through the take a look at. The researchers discovered that the chance of a “false unfavourable” — i.e., the mannequin failing to admit to noncompliance — was simply 4.4%.

Why fashions minimize corners 

The trouble factors to a worm within the bud of contemporary AI instruments, one that would develop into much more dangerous as these systems grow more agentic and develop into able to dealing with not simply restricted, one-off duties, however broad swathes of complicated features.

Additionally: GPT-5 is speeding up scientific research, but still can’t be trusted to work alone, OpenAI warns

Recognized to researchers merely because the “alignment downside,” AI techniques usually must juggle a number of targets, and in doing so, they could take shortcuts that appear ethically doubtful, no less than to people. After all, AI techniques themselves have no ethical sense of proper or fallacious; they merely tease out complicated patterns of knowledge and execute duties in a fashion that can optimize reward, the fundamental paradigm behind the coaching technique generally known as reinforcement studying with human suggestions (RLHF). 

AI techniques can have conflicting motivations, in different phrases — a lot as an individual would possibly — and so they usually minimize corners in response. 

“Many sorts of undesirable mannequin conduct seem as a result of we ask the mannequin to optimize for a number of targets without delay,” OpenAI wrote in its weblog submit. “When these indicators work together, they’ll by chance nudge the mannequin towards behaviors we do not need.”

Additionally: Anthropic wants to stop AI models from turning evil – here’s how

For instance, a mannequin skilled to generate its outputs in a assured and authoritative voice, however that is been requested to reply to a topic it has no coaching information reference level anyplace in its coaching information would possibly choose to make one thing up, thus preserving its higher-order dedication to self-assuredness, reasonably than admitting its incomplete information.

A post-hoc resolution

A whole subfield of AI referred to as interpretability analysis, or “explainable AI,” has emerged in an effort to grasp how fashions “determine” to behave in a technique or one other. For now, it stays as mysterious and hotly debated because the existence (or lack thereof) of free will in people.

OpenAI’s confession analysis is not aimed toward decoding how, the place, when, and why fashions lie, cheat, or in any other case misbehave. Moderately, it is a post-hoc try to flag when that is occurred, which might enhance mannequin transparency. Down the highway, like most security analysis of the second, it might lay the groundwork for researchers to dig deeper into these black field techniques and dissect their inside workings. 

The viability of these strategies might be the distinction between catastrophe and so-called utopia, particularly contemplating a recent AI safety audit that gave most labs failing grades. 

Additionally: AI is becoming introspective – and that ‘should be monitored carefully,’ warns Anthropic

As the corporate wrote within the weblog submit, confessions “don’t stop unhealthy conduct; they floor it.” However, as is the case within the courtroom or human morality extra broadly, surfacing wrongs is commonly crucial step towards making issues proper.





Source link

Tags: confessfutureLieMeansModelsOpenAITraining
admin

admin

Recommended

Bitcoin’s all-time high by August? Analyst makes bold projection

Bitcoin’s all-time high by August? Analyst makes bold projection

2 years ago
Binance’s Legal Defense Against US Securities Regulator Pushed Back to Next Week Due to Snow: Report

Binance’s Legal Defense Against US Securities Regulator Pushed Back to Next Week Due to Snow: Report

2 years ago

Popular News

  • Protocol-Owned Liquidity: A Sustainable Path for DeFi

    Protocol-Owned Liquidity: A Sustainable Path for DeFi

    0 shares
    Share 0 Tweet 0
  • Cryptocurrency for College: Exploring DeFi Scholarship Models

    0 shares
    Share 0 Tweet 0
  • What are rebase tokens, and how do they work?

    0 shares
    Share 0 Tweet 0
  • What is Velodrome Finance (VELO): why it’s a next-gen AMM

    0 shares
    Share 0 Tweet 0
  • $10 XRP Price Envisioned By Fund Manager As Ripple Mounts Trillion-Dollar Payment Markets ⋆ ZyCrypto

    0 shares
    Share 0 Tweet 0

Latest

Why a Bluetooth upgrade for AirPods excites me more than cameras or AI

Why a Bluetooth upgrade for AirPods excites me more than cameras or AI

May 28, 2026
Whoop vs. Fitbit Air: I’ve tested both trackers for health and fitness, and this model wins

Whoop vs. Fitbit Air: I’ve tested both trackers for health and fitness, and this model wins

May 27, 2026

Categories

  • Altcoins
  • Bitcoin
  • Blockchain
  • Cryptocurrency
  • DeFi
  • Dogecoin
  • Ethereum
  • Market & Analysis
  • NFTs & Metaverse
  • Regulations
  • XRP

Follow us

Recommended

  • Why a Bluetooth upgrade for AirPods excites me more than cameras or AI
  • Whoop vs. Fitbit Air: I’ve tested both trackers for health and fitness, and this model wins
  • Does Android Auto make your phone overheat? Try these 8 ways to cool it down
  • I found the best Memorial Day phone deals, and it’s your last chance to save big on a new device
  • XRP Sentiment Just Hit a 3-Week Low While Whale Wallets Hit a Record High — Is This the Contrarian Buy Signal?
  • About us
  • Privacy Policy
  • Terms & Conditions

© 2023 TheBlockchainPage | All Rights Reserved

No Result
View All Result
  • Home
  • Cryptocurrency
  • Blockchain
  • Bitcoin
  • Market & Analysis
  • Altcoins
  • DeFi
  • Ethereum
  • Dogecoin
  • XRP
  • Regulations
  • NFTs

© 2023 TheBlockchainPage | All Rights Reserved