Thursday, October 17, 2024

Apple Researchers Suggest Artificial Intelligence Is Still Mostly An Illusion

Researchers at Apple have found evidence, via testing, that the seemingly intelligent responses given by AI-based LLMs are little more than an illusion. In their paper posted on the arXiv preprint server, the researchers argue that after testing several LLMs, they found that the models are not capable of performing genuine logical reasoning… Continue reading…

By: Bob Yirka

Source: Techxplore


Critics:

A large language model (LLM) is a type of computational model designed for tasks related to natural language processing, including language generation. As language models, LLMs acquire these abilities by learning statistical relationships from vast amounts of text during a self-supervised and semi-supervised training process.
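
To make "learning statistical relationships from text" a bit more concrete, the sketch below shows the standard self-supervised next-token-prediction objective that such training minimizes. It is a minimal illustration using PyTorch with a toy embedding-plus-linear "model" and random token IDs, not the architecture of any real LLM.

```python
# Minimal sketch of the self-supervised next-token objective.
# Assumption: a toy embedding + linear layer stands in for a deep transformer.
import torch
import torch.nn.functional as F

vocab_size, d_model = 100, 32
embed = torch.nn.Embedding(vocab_size, d_model)
to_logits = torch.nn.Linear(d_model, vocab_size)

tokens = torch.randint(0, vocab_size, (1, 16))    # a toy token sequence
inputs, targets = tokens[:, :-1], tokens[:, 1:]   # each position predicts the next token

logits = to_logits(embed(inputs))                 # shape (1, 15, vocab_size)
loss = F.cross_entropy(logits.reshape(-1, vocab_size), targets.reshape(-1))
loss.backward()                                   # gradients for one training step
print(float(loss))
```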

The largest and most capable LLMs, as of August 2024, are artificial neural networks built with a decoder-only transformer-based architecture, which enables efficient processing and generation of large-scale text data. Modern models can be fine-tuned for specific tasks or can be guided by prompt engineering.
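
The defining trait of a decoder-only model is causal masking: every position can be trained in parallel while attending only to earlier tokens. The snippet below is a minimal sketch of such a mask under assumed toy dimensions, not code from any particular model.

```python
import torch

seq_len = 6
# Lower-triangular causal mask: position i may attend only to positions <= i.
mask = torch.tril(torch.ones(seq_len, seq_len, dtype=torch.bool))
scores = torch.randn(seq_len, seq_len)             # stand-in attention scores
scores = scores.masked_fill(~mask, float("-inf"))  # block attention to future tokens
weights = torch.softmax(scores, dim=-1)            # each row sums to 1 over allowed positions
print(weights)
```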

These models acquire predictive power regarding the syntax, semantics, and ontologies inherent in human language corpora, but they also inherit the inaccuracies and biases present in the data they are trained on. Before 2017, there were only a few language models that were large relative to the capacities then available. In the 1990s, the IBM alignment models pioneered statistical language modelling.

A smoothed n-gram model trained in 2001 on 0.3 billion words achieved then-state-of-the-art (SOTA) perplexity. In the 2000s, as Internet use became prevalent, some researchers constructed Internet-scale language datasets ("web as corpus") on which they trained statistical language models. By 2009, statistical language models dominated over symbolic language models in most language-processing tasks, because they can usefully ingest large datasets.
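
For context on what such models measure, the sketch below computes perplexity for a toy add-one (Laplace) smoothed bigram model. The corpus, test sentence, and smoothing choice are illustrative placeholders, far simpler than the 2001 system described above.

```python
import math
from collections import Counter

corpus = "the cat sat on the mat the cat ate".split()
unigrams = Counter(corpus)
bigrams = Counter(zip(corpus, corpus[1:]))
V = len(unigrams)                      # vocabulary size for smoothing

def prob(w_prev, w):
    # Add-one (Laplace) smoothed bigram probability P(w | w_prev).
    return (bigrams[(w_prev, w)] + 1) / (unigrams[w_prev] + V)

test = "the cat sat".split()
log_prob = sum(math.log(prob(a, b)) for a, b in zip(test, test[1:]))
perplexity = math.exp(-log_prob / (len(test) - 1))
print(perplexity)                      # lower is better
```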

After neural networks became dominant in image processing around 2012, they were applied to language modelling as well. Google converted its translation service to Neural Machine Translation in 2016; because this was before transformers, it was done with seq2seq deep LSTM networks. At the 2017 NeurIPS conference, Google researchers introduced the transformer architecture in their landmark paper "Attention Is All You Need".

The paper's goal was to improve upon the 2014 seq2seq technology, and it was based mainly on the attention mechanism developed by Bahdanau et al. in 2014. The following year, in 2018, BERT was introduced and quickly became "ubiquitous". Although the original transformer has both encoder and decoder blocks, BERT is an encoder-only model.

Although the decoder-only GPT-1 was introduced in 2018, it was GPT-2 in 2019 that caught widespread attention, because OpenAI at first deemed it too powerful to release publicly out of fear of malicious use. GPT-3 in 2020 went a step further, and as of 2024 it is available only via API, with no option to download the model and run it locally.

But it was the consumer-facing, browser-based ChatGPT, released in 2022, that captured the imagination of the general public and generated considerable media hype and online buzz. The 2023 GPT-4 was praised for its increased accuracy and hailed as a "holy grail" for its multimodal capabilities. OpenAI did not reveal the high-level architecture or the number of parameters of GPT-4.

Competing language models have for the most part been attempting to equal the GPT series, at least in terms of the number of parameters. Since 2022, source-available models have been gaining popularity, especially at first with BLOOM and LLaMA, though both have restrictions on their field of use.

Mistral AI's models Mistral 7B and Mixtral 8x7B carry the more permissive Apache License. As of June 2024, the instruction-fine-tuned variant of the 70-billion-parameter Llama 3 model is the most powerful open LLM according to the LMSYS Chatbot Arena Leaderboard, being more powerful than GPT-3.5 but not as powerful as GPT-4.

As of 2024, the largest and most capable models are all based on the transformer architecture. Some recent implementations are based on other architectures, such as recurrent neural network variants and Mamba (a state-space model). In the context of training LLMs, datasets are typically cleaned by removing toxic passages, discarding low-quality data, and de-duplicating.

Cleaned datasets can increase training efficiency and lead to improved downstream performance. A trained LLM can be used to clean datasets for training a further LLM. With the increasing proportion of LLM-generated content on the web, data cleaning in the future may include filtering out such content. LLM-generated content can pose a problem if the content is similar to human text (making filtering difficult) but of lower quality (degrading performance of models trained on it).
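
As one concrete (and deliberately simplified) example of the de-duplication step, the sketch below drops exact duplicates by hashing normalized documents. Production pipelines typically also use near-duplicate detection such as MinHash, which is not shown here; the sample documents are placeholders.

```python
import hashlib

def deduplicate(documents):
    # Keep only the first occurrence of each exactly-duplicated document,
    # after normalizing case and whitespace.
    seen, unique = set(), []
    for doc in documents:
        key = hashlib.sha256(" ".join(doc.lower().split()).encode()).hexdigest()
        if key not in seen:
            seen.add(key)
            unique.append(doc)
    return unique

docs = ["The cat sat.", "the  cat sat.", "A different document."]
print(deduplicate(docs))  # the near-identical second document is removed
```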

Training the largest language models may require more linguistic data than is naturally available, or the naturally occurring data may be of insufficient quality. In these cases, synthetic data can be used; Microsoft's Phi series of LLMs, for example, is trained on textbook-like data generated by another LLM. Many results previously achievable only by (costly) fine-tuning can now be achieved through prompt engineering, although only within the scope of a single conversation (more precisely, within the scope of a context window).
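
To illustrate how prompt engineering can substitute for fine-tuning within a single context window, here is a hypothetical few-shot prompt for sentiment classification. The task, examples, and wording are assumptions for illustration only; no specific vendor's API is shown.

```python
# Hypothetical few-shot prompt: the "training signal" lives entirely in the
# context window rather than in updated model weights.
examples = [
    ("I loved this film!", "positive"),
    ("Terrible service, never again.", "negative"),
]
query = "The battery life is amazing."

prompt = "Classify the sentiment of each review as positive or negative.\n\n"
for text, label in examples:
    prompt += f"Review: {text}\nSentiment: {label}\n\n"
prompt += f"Review: {query}\nSentiment:"

print(prompt)  # this string can be sent to any instruction-following LLM
```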

To determine which tokens are relevant to each other within the scope of the context window, the attention mechanism calculates "soft" weights for each token (more precisely, for its embedding) using multiple attention heads, each computing its own notion of "relevance" and therefore its own soft weights. For example, the small (117M-parameter) GPT-2 model has twelve attention heads and a context window of only 1k tokens.
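
The sketch below shows the "soft" weights referred to above, using scaled dot-product attention split across several heads. It is a minimal illustration with toy dimensions and no learned per-head projection matrices (real models use separate W_Q, W_K, W_V weights per head).

```python
import torch

def multi_head_attention(x, num_heads):
    # x: (seq_len, d_model). A bare-bones sketch: queries, keys, and values
    # are the embeddings themselves, split across heads.
    seq_len, d_model = x.shape
    d_head = d_model // num_heads
    q = k = v = x.view(seq_len, num_heads, d_head).transpose(0, 1)  # (heads, seq, d_head)
    scores = q @ k.transpose(-2, -1) / d_head ** 0.5   # per-head "relevance" scores
    weights = torch.softmax(scores, dim=-1)            # the "soft" attention weights
    out = weights @ v                                  # weighted mix of value vectors
    return out.transpose(0, 1).reshape(seq_len, d_model), weights

x = torch.randn(8, 48)                   # 8 tokens, toy embedding size 48
out, w = multi_head_attention(x, num_heads=12)
print(out.shape, w.shape)                # (8, 48) and (12, 8, 8): one weight matrix per head
```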

In its medium version, GPT-2 has 345M parameters and contains 24 layers, each with 12 attention heads. A batch size of 512 was used for training with gradient descent. The largest models, such as Google's Gemini 1.5, presented in February 2024, can have a context window of up to 1 million tokens (a context window of 10 million was also "successfully tested").

Other models with large context windows include Anthropic's Claude 2.1, with a context window of up to 200k tokens. Note that this maximum refers to the number of input tokens, and that the maximum number of output tokens differs from the input maximum and is often smaller. For example, the GPT-4 Turbo model has a maximum output of 4096 tokens.
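
As a small illustration of why these limits matter in practice, the sketch below counts tokens with the tiktoken library (assuming it is installed). The encoding name is the tokenizer family used by GPT-4-era models; the output limit shown is simply the figure quoted in the text, and nothing here calls a model API.

```python
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")  # tokenizer used by GPT-4-era OpenAI models

prompt = "Note that input and output token limits are counted separately."
print(len(enc.encode(prompt)), "input tokens")

MAX_OUTPUT_TOKENS = 4096  # e.g. GPT-4 Turbo's output ceiling, per the text above
# A request must keep its prompt within the model's context window and
# separately respect the (often smaller) output-token limit.
```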

"Better Language Models and Their Implications". OpenAI. 2019-02-14. Archived from the original on 2020-12-19. Retrieved 2019-08-25.

"Language Models are Few-Shot Learners" (PDF). Advances in Neural Information Processing Systems. 33. Curran Associates, Inc.: 1877–1901. Archived (PDF) from the original on 2023-11-17. Retrieved 2023-03-14.

"NeOn-GPT: A Large Language Model-Powered Pipeline for Ontology Learning" (PDF). Extended Semantic Web Conference 2024. Hersonissos, Greece.

"Human Language Understanding & Reasoning". Daedalus. 151 (2): 127–138. doi:10.1162/daed_a_01905. S2CID 248377870. Archived from the original on 2023-11-17. Retrieved 2023-03-09.

"Introduction to the Special Issue on the Web as Corpus". Computational Linguistics. 29 (3): 333–347. doi:10.1162/089120103322711569. ISSN 0891-2017.

"Scaling to very very large corpora for natural language disambiguation". Proceedings of the 39th Annual Meeting on Association for Computational Linguistics.

"The Web as a Parallel Corpus". Computational Linguistics. 29 (3): 349–380. doi:10.1162/089120103322711578. ISSN 0891-2017.

"The Unreasonable Effectiveness of Data". IEEE Intelligent Systems. 24 (2): 8–12. doi:10.1109/MIS.2009.36. ISSN 1541-1672.

"Attention is All you Need" (PDF). Advances in Neural Information Processing Systems. 30. Curran Associates, Inc. Archived (PDF) from the original on 2024-02-21. Retrieved 2024-01-21.

"A Primer in BERTology: What We Know About How BERT Works". Transactions of the Association for Computational Linguistics. 8: 842–866. arXiv:2002.12327. doi:10.1162/tacl_a_00349. S2CID 211532403. Archived from the original on 2022-04-03. Retrieved 2024-01-21.

"New AI fake text generator may be too dangerous to release, say creators". The Guardian. Archived from the original on 14 February 2019. Retrieved 20 January 2024.

"ChatGPT a year on: 3 ways the AI chatbot has completely changed the world in 12 months". Euronews. November 30, 2023. Archived from the original on January 14, 2024. Retrieved January 20, 2024.

"GPT-4 is bigger and better than ChatGPT—but OpenAI won't say why". MIT Technology Review. Archived from the original on March 17, 2023. Retrieved January 20, 2024.

"Parameters in notable artificial intelligence systems". ourworldindata.org. November 30, 2023. Retrieved January 20, 2024.

"LMSYS Chatbot Arena Leaderboard". huggingface.co. Archived from the original on June 10, 2024. Retrieved June 12, 2024.

"What Is a Transformer Model?". NVIDIA Blog. Archived from the original on 2023-11-17. Retrieved 2023-07-25.

"All languages are NOT created (tokenized) equal". Language models cost much more in some languages than others. Archived from the original on 2023-08-17. Retrieved 2023-08-17. "In other words, to express the same sentiment, some languages require up to 10 times more tokens."

"Language Model Tokenizers Introduce Unfairness Between Languages". NeurIPS. arXiv:2305.15425. Archived from the original on December 15, 2023. Retrieved September 16, 2023 – via openreview.net.

"OpenAI API". platform.openai.com. Archived from the original on April 23, 2023. Retrieved 2023-04-30.

"Pre-trained Language Models". Foundation Models for Natural Language Processing. Artificial Intelligence:

"The Art of Prompt Design: Prompt Boundaries and Token Healing". Medium. Retrieved 2024-08-05.

"Deduplicating Training Data Makes Language Models Better" (PDF). Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics.

"More Efficient In-Context Learning with GLaM". ai.googleblog.com. Archived from the original on 2023-03-12. Retrieved 2023-03-09.

"Emergent Abilities of Large Language Models". Transactions on Machine Learning Research. ISSN 2835-8856. Archived from the original on 22 March 2023. Retrieved 19 March 2023.

"Illustrated transformer". Archived from the original on 2023-07-25. Retrieved 2023-07-29.

"The Illustrated GPT-2 (Visualizing Transformer Language Models)". Retrieved 2023-08-01.

"Our next-generation model: Gemini 1.5". Google. 15 February 2024. Archived from the original on 18 February 2024. Retrieved 18 February 2024.

"Long context prompting for Claude 2.1". December 6, 2023. Archived from the original on August 27, 2024. Retrieved January 20, 2024.

"A Short Survey of Pre-trained Language Models for Conversational AI-A New Age in NLP". Proceedings of the Australasian Computer Science Week Multiconference. pp. 1–4.

Speech and Language Processing (PDF) (3rd edition draft). Archived (PDF) from the original on 23 March 2023. Retrieved 24 May 2022.

"From bare metal to a 70B model: infrastructure set-up and scripts". imbue.com. Archived from the original on 2024-07-26. Retrieved 2024-07-24.

"metaseq/projects/OPT/chronicles at main · facebookresearch/metaseq". GitHub.

"State of the Art: Training >70B LLMs on 10,000 H100 clusters". http://www.latent.space. Retrieved 2024-07-24.

"The emerging types of language models and why they matter". TechCrunch. Archived from the original on 16 March 2023. Retrieved 9 March 2023.
