Developers Unveil a New GPT-4-Based Method for Self-Assessing LLMs, Achieving 80% Agreement with Human Evaluations

By Altszn.com | July 4, 2023 | Metaverse, Web3


In a recent series of articles on LLM evaluation, scalability and cost-effectiveness were cited as the reasons for adopting a GPT-4-based comparison approach: a single model evaluates different answers to the same question and selects the best response, producing a ranking. As previously noted, this method had notable limitations. The creators of the LMSYS.org rating, who introduced the approach a few months ago, have now decided to replace it with a new evaluation method.

Credit: Metaverse Post (mpost.io)

Published: 4 July 2023, 9:14 am | Updated: 4 July 2023, 9:19 am

Over the course of their work, the team gathered tens of thousands of real human judgments comparing preferences between different answers. This extensive dataset gave them a more accurate picture of the strengths and weaknesses of each response. The new evaluation method still relies on GPT-4 for automation and scalability, and it remains accessible to everyone at an affordable price.

To ensure fairness in the evaluation process using GPT-4, the following challenges were addressed:

  1. Position bias: the judge's estimate depends on where an answer appears in the prompt.
  2. Verbosity bias: longer answers are favored regardless of their quality.
  3. Self-enhancement bias: preferences lean towards the model's own answers, or towards models trained on them.
  4. Limited reasoning ability when grading mathematical and logical questions.
Here are some illustrations of the 80 assessed questions; for each of the three example groups, the question has two turns on the same topic. All questions, all model responses, and pairwise comparisons between more than 20 models can be viewed on a dedicated website (https://huggingface.co/spaces/lmsys/mt-bench). As usual, the Reasoning and Coding sections contain the most fascinating examples.

After implementing various solutions to mitigate these issues, the authors discovered that powerful language models like GPT-4 align well with human preferences, achieving over 80% agreement in evaluations. This means that the model’s assessment coincides with human ratings in 80% of cases, a level of agreement comparable to two different human evaluators working on the same task. OpenAI has also reported that even co-authors of an article, who closely collaborate, agree in 82-86% of cases.
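The agreement figure above is simply the fraction of pairwise verdicts (A wins, B wins, or tie) on which the GPT-4 judge and human raters coincide. A minimal sketch with hypothetical verdicts:

```python
def agreement_rate(judge_verdicts, human_verdicts):
    """Fraction of pairwise comparisons on which the GPT-4 judge
    and the human raters reach the same verdict ('A', 'B', or 'tie')."""
    assert len(judge_verdicts) == len(human_verdicts)
    matches = sum(j == h for j, h in zip(judge_verdicts, human_verdicts))
    return matches / len(judge_verdicts)

# Hypothetical verdicts for five question pairs:
judge = ["A", "B", "tie", "A", "B"]
human = ["A", "B", "A",   "A", "B"]
print(agreement_rate(judge, human))  # 0.8
```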

This benchmark demonstrates how starkly the models differ across question categories. The biggest gap is in reasoning and coding, where other models fall far behind GPT-4; in roleplay and everyday writing, however, the gap is much smaller and open models remain usable. The authors have published new Vicuna v1.3 models with sizes ranging from 7 to 33 billion parameters (https://github.com/lm-sys/FastChat/tree/main#vicuna-weights).

It is important to note that while this is not a "perfect" way of evaluating, it represents a significant improvement over previous methods. The authors now aim to expand the dataset from 80 to 1,000 questions, and they are actively refining prompts to reduce biases in GPT-4's estimates. They are also considering two more objective assessments: one based on votes from real people (the "arena," where models compete head-to-head) scored with Elo points, and another based on the MMLU benchmark.
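The "arena" mentioned above ranks models with Elo points updated after each human vote. A minimal sketch of a standard Elo update (the K-factor and starting ratings here are illustrative assumptions, not the arena's actual parameters):

```python
def expected_score(r_a, r_b):
    """Probability that A beats B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400))

def elo_update(r_a, r_b, score_a, k=32):
    """Update both ratings after one battle; score_a is
    1.0 (A wins), 0.0 (B wins), or 0.5 (tie)."""
    e_a = expected_score(r_a, r_b)
    new_a = r_a + k * (score_a - e_a)
    new_b = r_b + k * ((1 - score_a) - (1 - e_a))
    return new_a, new_b

# Two models start level at 1000; model A wins one human vote:
ra, rb = elo_update(1000, 1000, 1.0)
print(round(ra), round(rb))  # 1016 984
```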

Another intriguing finding is that GPT-4 is the only model that maintains answer quality on the second turn of a dialog. This result should be read with some caution for two reasons: 1) the model is still, in part, assessing itself; 2) even where the measured difference is small, it illustrates how poorly other models follow multi-turn dialogs and instructions.

Enhancing Model Comparison with GPT-4

With the recent emergence of various language models such as Vicuna, Koala, and Dolly, comparing models with GPT-4 has become a popular practice. A special prompt is constructed containing two answers to the same question, one from model A and one from model B. The judge is then asked to rate the pair on a scale from 1 to 8, where 1 means model A is significantly better, 8 means model B is significantly better, 4-5 represents a draw, and scores of 2-3 or 6-7 mean one model is somewhat better.
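The scheme can be sketched as follows; the prompt wording and score-to-verdict mapping are illustrative assumptions, not the exact LMSYS template:

```python
# Hypothetical judge prompt following the 1-8 pairwise scale described above.
JUDGE_PROMPT = """You are shown a question and two answers.
Question: {question}
Answer A: {answer_a}
Answer B: {answer_b}
Rate the pair on a scale of 1-8:
1 = A is significantly better, 2-3 = A is somewhat better,
4-5 = a draw, 6-7 = B is somewhat better, 8 = B is significantly better.
Reply with a single number."""

def verdict_from_score(score):
    """Map a 1-8 judge score onto a pairwise verdict."""
    if score <= 3:
        return "A"
    if score >= 6:
        return "B"
    return "tie"

print(verdict_from_score(2), verdict_from_score(5), verdict_from_score(8))
# A tie B
```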

It may seem logical that swapping models A and B should simply mirror the scores (e.g., 7 becomes 2, 8 becomes 1), so that a consistently superior model still wins. In practice, however, "positional bias" arises: the judge tends to assign higher scores to the answer shown first (model A). Because the prompt orderings are shuffled randomly, an unbiased judge's scores should be symmetric around the 4-5 midpoint; human evaluation accounts for this bias to ensure fairness.
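One common mitigation, consistent with the swap described above, is to judge every pair twice with the answer order reversed: an order-free judge's second score should mirror the first (s' = 9 - s), and inconsistent pairs can be demoted to ties. A sketch under that assumption:

```python
def verdict(score):
    """Map a 1-8 pairwise score onto a verdict."""
    return "A" if score <= 3 else "B" if score >= 6 else "tie"

def combined_verdict(score_ab, score_ba):
    """Judge each pair twice with the answer order swapped.
    score_ab: 1-8 score with A shown first; score_ba: score with B first.
    Mirror the swapped score (s -> 9 - s) and require both runs to agree;
    otherwise fall back to a tie, which absorbs positional bias."""
    v1 = verdict(score_ab)
    v2 = verdict(9 - score_ba)
    return v1 if v1 == v2 else "tie"

print(combined_verdict(2, 7))  # both orderings favour A -> A
print(combined_verdict(2, 2))  # the two runs disagree   -> tie
```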

In an insightful study, the team at HuggingFace assessed the answers of four models to 329 different questions. Among the interesting findings:

  1. The ranking of the four models based on pairwise comparisons was consistent between human assessment and GPT-4, although different Elo rating gaps were observed. This indicates that the model can distinguish between good and bad answers but struggles with borderline cases that are less aligned with human evaluations.
  2. Interestingly, the model rated answers from other models, particularly those trained on GPT-4 answers, higher than real human answers.
  3. There is a high correlation (Pearson's r = 0.96) between the GPT-4 score and the number of unique tokens in a response. This suggests the judge rewards length rather than answer quality, emphasizing the need for cautious interpretation.

These findings underscore the importance of careful evaluation when utilizing GPT-4 for model comparison. While the model can differentiate between answers to some extent, its assessments may not always align perfectly with human judgments, especially in nuanced scenarios. It is crucial to exercise caution and consider additional factors when relying solely on GPT-4 scores. By refining prompts and incorporating diverse assessments, researchers aim to enhance the reliability and accuracy of GPT-4 estimates.

This article was written with the support of the Telegram channel community.

Read More: mpost.io

