A Highly Capable Language Model Locally on Your Phone (2024)

Microsoft

Abstract

We introduce phi-3-mini, a 3.8 billion parameter language model trained on 3.3 trillion tokens, whose overall performance, as measured by both academic benchmarks and internal testing, rivals that of models such as Mixtral 8x7B and GPT-3.5 (e.g., phi-3-mini achieves 69% on MMLU and 8.38 on MT-bench), despite being small enough to be deployed on a phone. The innovation lies entirely in our dataset for training, a scaled-up version of the one used for phi-2, composed of heavily filtered web data and synthetic data. The model is also further aligned for robustness, safety, and chat format. We also provide some initial parameter-scaling results with 7B and 14B models trained for 4.8T tokens, called phi-3-small and phi-3-medium, both significantly more capable than phi-3-mini (e.g., respectively 75% and 78% on MMLU, and 8.7 and 8.9 on MT-bench).

1 Introduction

The striking progress of AI in the last few years can be largely attributed to major efforts throughout the world towards scaling-up to ever-larger models and datasets. Large Language Models (LLMs) have steadily increased in size from a mere billion parameters just five years ago (GPT-2 had 1.5 billion parameters [RWC+19]) to trillion parameters today. The impetus for this effort originates in the seemingly predictable improvement one obtains by training large models, the so-called scaling laws [KMH+20, HBM+22, MRB+23]. However, these laws assume a “fixed” data source. This assumption is now significantly disrupted by the existence of frontier LLMs themselves, which allow us to interact with data in novel ways. In our previous works on the phi models [GZA+23, LBE+23, JBA+23] it was shown that a combination of LLM-based filtering of web data and LLM-created synthetic data enables performance in smaller language models that was typically seen only in much larger models. For example, our previous model trained on this data recipe, phi-2 (2.7B parameters), matched the performance of models 25 times larger trained on regular data. In this report we present a new model, phi-3-mini (3.8B parameters), trained for 3.3T tokens on larger and more advanced versions of the datasets used in phi-2. With its small size, phi-3-mini can easily be inferenced locally on a modern phone (see Figure 1), yet it achieves a quality that seems on-par with models such as Mixtral 8x7B [JSR+24] and GPT-3.5.

2 Technical Specifications

The phi-3-mini model is a transformer decoder architecture [VSP+17], with default context length 4K. We also introduce a long context version via LongRope [DZZ+24] that extends the context length to 128K, called phi-3-mini-128K.

To best benefit the open source community, phi-3-mini is built upon a similar block structure as Llama-2 [TLI+23] and uses the same tokenizer with a vocabulary size of 32064 (we remove BoS tokens and add some additional tokens for the chat template). This means that all packages developed for the Llama-2 family of models can be directly adapted to phi-3-mini. The model uses a 3072 hidden dimension, 32 heads and 32 layers. We trained using bfloat16 for a total of 3.3T tokens. The model is already chat-finetuned, and the chat template is as follows:
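The template listing itself does not survive this extraction. As a rough stand-in, the sketch below assembles a prompt in the turn-based format used by the publicly released phi-3-mini checkpoints; the `<|user|>`, `<|assistant|>`, and `<|end|>` markers are taken from the released model card and should be treated as an assumption here, not as a quotation from this report.

```python
# Illustrative only: the <|user|>/<|assistant|>/<|end|> markers are taken from
# the publicly released phi-3-mini model card and are an assumption here,
# not a quotation from this report.
def build_chat_prompt(turns):
    """Format (role, message) pairs into a phi-3-style chat prompt."""
    prompt = ""
    for role, message in turns:            # role is "user" or "assistant"
        prompt += f"<|{role}|>\n{message}<|end|>\n"
    return prompt + "<|assistant|>\n"      # cue the model to produce the next reply

print(build_chat_prompt([("user", "Explain why the sky is blue in one sentence.")]))
```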

The phi-3-small model (7B parameters) leverages the tiktoken tokenizer (for better multilingual tokenization) with a vocabulary size of 100352 and has default context length 8K. It follows the standard decoder architecture of a 7B model class, having 32 layers and a hidden size of 4096. To minimize KV cache footprint, the model also leverages grouped-query attention, with 4 queries sharing 1 key. Moreover, phi-3-small uses alternating layers of dense attention and a novel blocksparse attention to further optimize KV cache savings while maintaining long context retrieval performance. An additional 10% of multilingual data was also used for this model.
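To make the 4-queries-per-key ratio concrete, here is a minimal sketch of grouped-query attention with 32 query heads sharing 8 key/value heads (so the KV cache is 4x smaller than with standard multi-head attention). Dimensions and names are illustrative, not the phi-3-small implementation, and the blocksparse layers are omitted.

```python
import torch

# Minimal grouped-query attention sketch (not the phi-3-small implementation):
# 32 query heads share 8 key/value heads, i.e. 4 queries per KV head.
batch, seq, d_model = 1, 16, 4096
n_q_heads, n_kv_heads, head_dim = 32, 8, 128

q = torch.randn(batch, n_q_heads, seq, head_dim)
k = torch.randn(batch, n_kv_heads, seq, head_dim)   # cached per token (4x smaller)
v = torch.randn(batch, n_kv_heads, seq, head_dim)   # cached per token (4x smaller)

# Expand each KV head to serve its group of 4 query heads.
group = n_q_heads // n_kv_heads
k = k.repeat_interleave(group, dim=1)
v = v.repeat_interleave(group, dim=1)

scores = (q @ k.transpose(-2, -1)) / head_dim ** 0.5
causal = torch.triu(torch.full((seq, seq), float("-inf")), diagonal=1)
attn = torch.softmax(scores + causal, dim=-1)
out = (attn @ v).transpose(1, 2).reshape(batch, seq, n_q_heads * head_dim)
print(out.shape)  # torch.Size([1, 16, 4096])
```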

Highly capable language model running locally on a cell-phone.

Thanks to its small size, phi-3-mini can be quantized to 4 bits so that it only occupies ≈ 1.8GB of memory. We tested the quantized model by deploying phi-3-mini on an iPhone 14 with the A16 Bionic chip, running natively on-device and fully offline, achieving more than 12 tokens per second.
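As a sanity check on that figure, a back-of-the-envelope calculation for the weights alone (ignoring quantization scales, activations, and the KV cache) is:

```python
# Rough check of the ~1.8GB figure: 3.8B weights at 4 bits each.
params = 3.8e9
bytes_per_param = 4 / 8          # 4-bit quantization = 0.5 bytes per weight
gib = params * bytes_per_param / 2**30
print(f"{gib:.2f} GiB")          # ~1.77 GiB, before any quantization overhead
```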

[Figure 1: phi-3-mini running locally on a phone.]

Training Methodology.

We follow the sequence of works initiated in “Textbooks Are All You Need” [GZA+23], which utilize high quality training data to improve the performance of small language models and deviate from the standard scaling-laws. In this work we show that such a method allows one to reach the level of highly capable models such as GPT-3.5 or Mixtral with only 3.8B total parameters (while Mixtral has 45B total parameters, for example). Our training data consists of heavily filtered web data (according to the “educational level”) from various open internet sources, as well as synthetic LLM-generated data. Pre-training is performed in two disjoint and sequential phases: phase-1 comprises mostly web sources aimed at teaching the model general knowledge and language understanding; phase-2 merges even more heavily filtered web data (a subset used in phase-1) with some synthetic data that teach the model logical reasoning and various niche skills.
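For intuition only, the two-phase filtering can be pictured as a threshold on a quality score, with phase-2 applying a stricter filter to a subset of the phase-1 data. The sketch below is hypothetical: `score_educational_value` is a toy stand-in for whatever LLM grader or classifier is actually used, which this report does not specify.

```python
# Hypothetical sketch of quality-based filtering; score_educational_value is a
# placeholder for the (unspecified) LLM grader or classifier.
def score_educational_value(page_text: str) -> float:
    """Toy proxy for an 'educational level' score in [0, 1]."""
    # A real pipeline would prompt an LLM or apply a trained classifier here.
    return min(1.0, len(set(page_text.split())) / 200)

def filter_corpus(pages, threshold):
    """Keep only pages whose estimated educational value clears the threshold."""
    return [p for p in pages if score_educational_value(p) >= threshold]

# Phase-2 reuses a subset of the phase-1 web data under a stricter filter.
phase1_pages = filter_corpus(["some web page text ..."], threshold=0.2)
phase2_pages = filter_corpus(phase1_pages, threshold=0.8)
```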

Data Optimal Regime.

Unlike prior works that train language models in either the “compute optimal regime” [HBM+22] or the “over-train regime”, we mainly focus on the quality of data for a given scale. (As with the “compute optimal regime”, we use the term “optimal” in an aspirational sense for the “data optimal regime”; we are not implying that we actually found the provably “optimal” data mixture for a given scale.) We try to calibrate the training data to be closer to the “data optimal” regime for small models. In particular, we filter the web data to contain the correct level of “knowledge” and keep more web pages that could potentially improve the “reasoning ability” of the model. As an example, the result of a game in the Premier League on a particular day might be good training data for frontier models, but we need to remove such information to leave more model capacity for “reasoning” for the mini size models. We compare our approach with Llama-2 in Figure 2.

[Figure 2: Comparison of our data approach with Llama-2.]

To test our data on a larger size of models, we also trained phi-3-medium, a model with 14B parameters using the same tokenizer and architecture as phi-3-mini, trained on the same data for slightly more epochs (4.8T tokens total, as for phi-3-small). The model has 40 heads and 40 layers, with embedding dimension 5120. We observe that some benchmarks improve much less from 7B to 14B than they do from 3.8B to 7B, perhaps indicating that our data mixture needs further work to be in the “data optimal regime” for a 14B parameter model. We are still actively investigating some of those benchmarks (including a regression on HumanEval), hence the numbers for phi-3-medium should be considered as a “preview”.

Post-training.

Our models went through post-training with both supervised instruction fine-tuning and preference tuning with DPO. We have worked on generating and curating various instruction and preference data. This has improved the model’s chat capabilities, robustness, as well as its safety.
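For reference, a minimal sketch of the standard DPO objective used in preference tuning is shown below. This is the textbook formulation over summed log-probabilities of the chosen and rejected responses, not the exact phi-3 training code; the function and tensor names are illustrative.

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    """Standard DPO objective over summed log-probs of chosen/rejected responses."""
    chosen_logratio = policy_chosen_logps - ref_chosen_logps
    rejected_logratio = policy_rejected_logps - ref_rejected_logps
    return -F.logsigmoid(beta * (chosen_logratio - rejected_logratio)).mean()

# Toy usage with per-example summed log-probabilities.
loss = dpo_loss(torch.tensor([-12.0]), torch.tensor([-15.0]),
                torch.tensor([-13.0]), torch.tensor([-14.0]))
print(loss.item())
```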

3 Academic benchmarks

Below we report the results for phi-3-mini on standard open-source benchmarks measuring the model’s reasoning ability (both common sense reasoning and logical reasoning). We compare to phi-2 [JBA+23], Mistral-7b-v0.1 [JSM+23], Mixtral-8x7b [JSR+24], Gemma 7B [TMH+24], Llama-3-instruct-8b [AI23], and GPT-3.5. All the reported numbers are produced with the exact same pipeline to ensure that the numbers are comparable. These numbers might differ from other published numbers due to slightly different choices in the evaluation. As is now standard, we use few-shot prompts to evaluate the models, at temperature 0. The prompts and number of shots are part of a Microsoft internal tool to evaluate language models, and in particular we did no optimization to the pipeline for the phi-3 models. (For example, we found that using ## before the Question can lead to a noticeable improvement to phi-3-mini’s results across many benchmarks, but we did not make such changes in the prompts.) The number of k-shot examples is listed per-benchmark. An example of a 2-shot prompt is described in Appendix A.
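To illustrate the evaluation setup, a generic k-shot prompt of this kind could be assembled as in the schematic below. This is not Microsoft’s internal evaluation tool; the format and helper name are hypothetical.

```python
def make_k_shot_prompt(examples, question):
    """Assemble a generic k-shot prompt: worked examples followed by the query."""
    blocks = [f"Question: {q}\nAnswer: {a}" for q, a in examples]
    blocks.append(f"Question: {question}\nAnswer:")
    return "\n\n".join(blocks)

prompt = make_k_shot_prompt(
    [("What is 2 + 2?", "4"), ("What is 3 * 5?", "15")],  # k = 2
    "What is 7 - 4?",
)
print(prompt)  # the model's completion would then be generated at temperature 0
```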


| Benchmark | Phi-3-mini 3.8b | Phi-3-small 7b | Phi-3-medium 14b (preview) | Phi-2 2.7b | Mistral 7b | Gemma 7b | Llama-3-In 8b | Mixtral 8x7b | GPT-3.5 1106 |
|---|---|---|---|---|---|---|---|---|---|
| MMLU (5-Shot) [HBK+21] | 68.8 | 75.3 | 78.2 | 56.3 | 61.7 | 63.6 | 66.0 | 68.4 | 71.4 |
| HellaSwag (5-Shot) [ZHB+19] | 76.7 | 78.7 | 83.0 | 53.6 | 58.5 | 49.8 | 69.5 | 70.4 | 78.8 |
| ANLI (7-Shot) [NWD+20] | 52.8 | 55.0 | 58.7 | 42.5 | 47.1 | 48.7 | 54.8 | 55.2 | 58.1 |
| GSM-8K (0-Shot; CoT) [CKB+21] | 82.5 | 88.9 | 90.3 | 61.1 | 46.4 | 59.8 | 77.4 | 64.7 | 78.1 |
| MedQA (2-Shot) [JPO+20] | 53.8 | 58.2 | 69.4 | 40.9 | 49.6 | 50.0 | 58.9 | 62.2 | 63.4 |
| AGIEval (0-Shot) [ZCG+23] | 37.5 | 45.0 | 48.4 | 29.8 | 35.1 | 42.1 | 42.0 | 45.2 | 48.4 |
| TriviaQA (5-Shot) [JCWZ17] | 64.0 | 59.1 | 75.6 | 45.2 | 72.3 | 75.2 | 73.6 | 82.2 | 85.8 |
| Arc-C (10-Shot) [CCE+18] | 84.9 | 90.7 | 91.0 | 75.9 | 78.6 | 78.3 | 80.5 | 87.3 | 87.4 |
| Arc-E (10-Shot) [CCE+18] | 94.6 | 97.1 | 97.8 | 88.5 | 90.6 | 91.4 | 92.3 | 95.6 | 96.3 |
| PIQA (5-Shot) [BZGC19] | 84.2 | 87.8 | 87.7 | 60.2 | 77.7 | 78.1 | 77.1 | 86.0 | 86.6 |
| SociQA (5-Shot) [BZGC19] | 76.6 | 79.0 | 80.2 | 68.3 | 74.6 | 65.5 | 73.2 | 75.9 | 68.3 |
| BigBench-Hard (0-Shot) [SRR+22, SSS+22] | 71.7 | 75.0 | 81.3 | 59.4 | 57.3 | 59.6 | 68.9 | 69.7 | 68.32 |
| WinoGrande (5-Shot) [SLBBC19] | 70.8 | 82.5 | 81.4 | 54.7 | 54.2 | 55.6 | 58.0 | 62.0 | 68.8 |
| OpenBookQA (10-Shot) [MCKS18] | 83.2 | 88.4 | 87.2 | 73.6 | 79.8 | 78.6 | 81.6 | 85.8 | 86.0 |
| BoolQ (0-Shot) [CLC+19] | 77.2 | 82.9 | 86.6 | – | 72.2 | 66.0 | 78.3 | 77.6 | 79.1 |
| CommonSenseQA (10-Shot) [THLB19] | 80.2 | 80.3 | 82.6 | 69.3 | 72.6 | 76.2 | 73.6 | 78.1 | 79.6 |
| TruthfulQA (10-Shot) [LHE22] | 65.0 | 68.7 | 75.7 | – | 52.1 | 53.0 | 62.0 | 60.1 | 85.8 |
| HumanEval (0-Shot) [CTJ+21] | 58.5 | 59.1 | 55.5 | 59.0 | 28.0 | 34.1 | 38.4 | 37.8 | 62.2 |
| MBPP (3-Shot) [AON+21] | 70.0 | 71.4 | 74.5 | 60.6 | 50.8 | 51.5 | 65.3 | 60.2 | 77.8 |
| Average | 71.2 | 74.9 | 78.2 | – | 61.0 | 62.0 | 68.0 | 69.9 | 75.3 |
| GPQA (2-Shot; CoT) [RHS+23] | 32.8 | 34.3 | – | – | – | – | – | – | 29.0 |
| MT Bench (2 round ave.) [ZCS+23] | 8.38 | 8.70 | 8.91 | – | – | – | – | – | 8.35 |

4 Safety

Phi-3-mini was developed in accordance with Microsoft’s responsible AI principles. The overall approach consisted of safety alignment in post-training, red-teaming, and automated testing and evaluations across dozens of RAI harm categories. Helpfulness and harmlessness preference datasets [BJN+22, JLD+23] with modifications inspired by [BSA+24] and multiple in-house generated datasets were leveraged to address the RAI harm categories in safety post-training. An independent red team at Microsoft iteratively examined phi-3-mini to further identify areas of improvement during the post-training process. Based on their feedback, we curated additional datasets tailored to address their insights, thereby refining the post-training dataset. This process resulted in a significant decrease in harmful response rates, as shown in Figure 3.

[Figure 3: Decrease in harmful response rates after safety post-training.]

Table 1 shows the results of in-house RAI benchmarks for phi-3-mini-4k and phi-3-mini-128k compared to phi-2 [JBA+23], Mistral-7b-v0.1 [JSM+23], Gemma 7b [TMH+24], and Llama-3-instruct-8b [AI23]. This benchmark utilized GPT-4 to simulate multi-turn conversations in five different categories and to evaluate the model responses. Ungroundedness, between 0 (fully grounded) and 4 (not grounded), measures whether the information in a response is based on a given prompt. In the other categories, responses were evaluated in terms of the severity of harmfulness from 0 (no harm) to 7 (extreme harm), and the defect rates (DR-x) were computed as the percentage of samples with a severity score greater than or equal to x.

Table 1:

| | Phi-3-Mini-4k 3.8b | Phi-3-Mini-128k 3.8b | Phi-2 2.7b | Mistral 7b | Gemma 7b | Llama-3-In 8b |
|---|---|---|---|---|---|---|
| Ungroundedness | 0.603 | 0.637 | 1.481 | 0.935 | 0.679 | 0.328 |
| Intellectual Property (DR-1) | 23.95% | 21.50% | 24.00% | 56.20% | 38.33% | 37.30% |
| Harmful Content Continuation (DR-3) | 0.75% | 1.08% | 2.93% | 2.58% | 1.28% | 1.30% |
| Harmful Content Summarization (DR-3) | 10.00% | 10.20% | 14.35% | 22.33% | 10.33% | 8.20% |
| Jailbreak (DR-1) | 12.29% | 12.57% | 15.00% | 15.57% | 11.43% | 13.00% |
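Concretely, the DR-x defect rates in Table 1 follow directly from the per-sample severity scores, as in the minimal sketch below (the scores and helper name are illustrative, not the internal benchmark code).

```python
def defect_rate(severity_scores, x):
    """DR-x: fraction of samples with severity >= x on the 0 (no harm) to 7 (extreme harm) scale."""
    return sum(s >= x for s in severity_scores) / len(severity_scores)

scores = [0, 0, 1, 3, 0, 5, 0, 2]              # toy GPT-4-assigned severities
print(f"DR-1 = {defect_rate(scores, 1):.1%}")  # 50.0%
print(f"DR-3 = {defect_rate(scores, 3):.1%}")  # 25.0%
```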

5 Weakness

In terms of LLM capabilities, while the phi-3-mini model achieves a similar level of language understanding and reasoning ability as much larger models, it is still fundamentally limited by its size for certain tasks. The model simply does not have the capacity to store too much “factual knowledge”, which can be seen for example in its low performance on TriviaQA. However, we believe such a weakness can be resolved by augmentation with a search engine. We show an example using the HuggingFace default Chat-UI with phi-3-mini in Figure 4. Another weakness related to the model’s capacity is that we mostly restricted the language to English. Exploring multilingual capabilities for Small Language Models is an important next step, with some initial promising results on phi-3-small by including more multilingual data.
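A schematic of the kind of search augmentation alluded to above is sketched below: retrieved snippets are simply prepended to the user question before generation, so the small model does not need to store the fact itself. The helper names and prompt format are hypothetical, not the HuggingFace Chat-UI internals.

```python
def augment_with_search(question, search_fn, top_k=3):
    """Prepend top search snippets to the question (hypothetical helper)."""
    snippets = search_fn(question)[:top_k]
    context = "\n".join(f"- {s}" for s in snippets)
    return (f"Use the following search results to answer.\n{context}\n\n"
            f"Question: {question}\nAnswer:")

# Toy stand-in for a real search engine call.
fake_search = lambda q: ["Mount Everest is 8,849 m tall.", "It lies in the Himalayas."]
print(augment_with_search("How tall is Mount Everest?", fake_search))
```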

Despite our diligent RAI efforts, as with most LLMs, there remain challenges around factual inaccuracies (or hallucinations), reproduction or amplification of biases, inappropriate content generation, and safety issues. The use of carefully curated training data, targeted post-training, and improvements from red-teaming insights significantly mitigates these issues across all dimensions. However, there is significant work ahead to fully address these challenges.

[Figure 4: Example of phi-3-mini with search augmentation in the HuggingFace default Chat-UI.]

References

  • [AI23] Meta AI. Introducing Meta Llama 3: The most capable openly available LLM to date, 2023.
  • [AON+21] Jacob Austin, Augustus Odena, Maxwell Nye, Maarten Bosma, Henryk Michalewski, David Dohan, Ellen Jiang, Carrie Cai, Michael Terry, Quoc Le, and Charles Sutton. Program synthesis with large language models. arXiv preprint arXiv:2108.07732, 2021.
  • [BJN+22] Yuntao Bai, Andy Jones, Kamal Ndousse, Amanda Askell, Anna Chen, Nova DasSarma, Dawn Drain, Stanislav Fort, Deep Ganguli, Tom Henighan, Nicholas Joseph, Saurav Kadavath, Jackson Kernion, Tom Conerly, Sheer El-Showk, Nelson Elhage, Zac Hatfield-Dodds, Danny Hernandez, Tristan Hume, Scott Johnston, Shauna Kravec, Liane Lovitt, Neel Nanda, Catherine Olsson, Dario Amodei, Tom Brown, Jack Clark, Sam McCandlish, Chris Olah, Ben Mann, and Jared Kaplan. Training a helpful and harmless assistant with reinforcement learning from human feedback, 2022.
  • [BSA+24] Federico Bianchi, Mirac Suzgun, Giuseppe Attanasio, Paul Röttger, Dan Jurafsky, Tatsunori Hashimoto, and James Zou. Safety-tuned LLaMAs: Lessons from improving the safety of large language models that follow instructions, 2024.
  • [BZGC19] Yonatan Bisk, Rowan Zellers, Jianfeng Gao, and Yejin Choi. PIQA: Reasoning about physical commonsense in natural language. arXiv preprint arXiv:1911.11641, 2019.
  • [CCE+18] Peter Clark, Isaac Cowhey, Oren Etzioni, Tushar Khot, Ashish Sabharwal, Carissa Schoenick, and Oyvind Tafjord. Think you have solved question answering? Try ARC, the AI2 reasoning challenge, 2018.
  • [CKB+21] Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Mark Chen, Heewoo Jun, Lukasz Kaiser, Matthias Plappert, Jerry Tworek, Jacob Hilton, Reiichiro Nakano, Christopher Hesse, and John Schulman. Training verifiers to solve math word problems. arXiv preprint arXiv:2110.14168, 2021.
  • [CLC+19] Christopher Clark, Kenton Lee, Ming-Wei Chang, Tom Kwiatkowski, Michael Collins, and Kristina Toutanova. BoolQ: Exploring the surprising difficulty of natural yes/no questions. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 2924–2936, 2019.
  • [CTJ+21] Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, Henrique Ponde de Oliveira Pinto, Jared Kaplan, Harri Edwards, Yuri Burda, Nicholas Joseph, Greg Brockman, Alex Ray, Raul Puri, Gretchen Krueger, Michael Petrov, Heidy Khlaaf, Girish Sastry, Pamela Mishkin, Brooke Chan, Scott Gray, Nick Ryder, Mikhail Pavlov, Alethea Power, Lukasz Kaiser, Mohammad Bavarian, Clemens Winter, Philippe Tillet, Felipe Petroski Such, Dave Cummings, Matthias Plappert, Fotios Chantzis, Elizabeth Barnes, Ariel Herbert-Voss, William Hebgen Guss, Alex Nichol, Alex Paino, Nikolas Tezak, Jie Tang, Igor Babuschkin, Suchir Balaji, Shantanu Jain, William Saunders, Christopher Hesse, Andrew N. Carr, Jan Leike, Josh Achiam, Vedant Misra, Evan Morikawa, Alec Radford, Matthew Knight, Miles Brundage, Mira Murati, Katie Mayer, Peter Welinder, Bob McGrew, Dario Amodei, Sam McCandlish, Ilya Sutskever, and Wojciech Zaremba. Evaluating large language models trained on code, 2021.
  • [DZZ+24] Yiran Ding, Li Lyna Zhang, Chengruidong Zhang, Yuanyuan Xu, Ning Shang, Jiahang Xu, Fan Yang, and Mao Yang. LongRoPE: Extending LLM context window beyond 2 million tokens, 2024.
  • [GZA+23] Suriya Gunasekar, Yi Zhang, Jyoti Aneja, Caio César Teodoro Mendes, Allie Del Giorno, Sivakanth Gopi, Mojan Javaheripi, Gustavo de Rosa, Piero Kauffmann, Olli Saarikivi, Adil Salim, Shital Shah, Harkirat Singh Behl, Xin Wang, Sébastien Bubeck, Ronen Eldan, Adam Tauman Kalai, Yin Tat Lee, and Yuanzhi Li. Textbooks are all you need. arXiv preprint arXiv:2306.11644, 2023.
  • [HBK+21] Dan Hendrycks, Collin Burns, Saurav Kadavath, Akul Arora, Steven Basart, Eric Tang, Dawn Song, and Jacob Steinhardt. Measuring mathematical problem solving with the MATH dataset, 2021.
  • [HBM+22] Jordan Hoffmann, Sebastian Borgeaud, Arthur Mensch, Elena Buchatskaya, Eliza Rutherford, Trevor Cai, Diego de Las Casas, Lisa Anne Hendricks, Johannes Welbl, Aidan Clark, Tom Hennigan, Eric Noland, Katie Millican, George van den Driessche, Bogdan Damoc, Aurelia Guy, Simon Osindero, Karen Simonyan, Erich Elsen, Jack W. Rae, Oriol Vinyals, and Laurent Sifre. Training compute-optimal large language models. arXiv preprint arXiv:2203.15556, 2022.
  • [JBA+23] Mojan Javaheripi, Sébastien Bubeck, Marah Abdin, Jyoti Aneja, Caio César Teodoro Mendes, Weizhu Chen, Allie Del Giorno, Ronen Eldan, Sivakanth Gopi, Suriya Gunasekar, Piero Kauffmann, Yin Tat Lee, Yuanzhi Li, Anh Nguyen, Gustavo de Rosa, Olli Saarikivi, Adil Salim, Shital Shah, Michael Santacroce, Harkirat Singh Behl, Adam Tauman Kalai, Xin Wang, Rachel Ward, Philipp Witte, Cyril Zhang, and Yi Zhang. Phi-2: The surprising power of small language models. Microsoft Research Blog, 2023.
  • [JCWZ17] Mandar Joshi, Eunsol Choi, Daniel S. Weld, and Luke Zettlemoyer. TriviaQA: A large scale distantly supervised challenge dataset for reading comprehension, 2017.
  • [JLD+23] Jiaming Ji, Mickel Liu, Juntao Dai, Xuehai Pan, Chi Zhang, Ce Bian, Chi Zhang, Ruiyang Sun, Yizhou Wang, and Yaodong Yang. BeaverTails: Towards improved safety alignment of LLM via a human-preference dataset, 2023.
  • [JPO+20] Di Jin, Eileen Pan, Nassim Oufattole, Wei-Hung Weng, Hanyi Fang, and Peter Szolovits. What disease does this patient have? A large-scale open domain question answering dataset from medical exams, 2020.
  • [JSM+23] Albert Q. Jiang, Alexandre Sablayrolles, Arthur Mensch, Chris Bamford, Devendra Singh Chaplot, Diego de las Casas, Florian Bressand, Gianna Lengyel, Guillaume Lample, Lucile Saulnier, Lélio Renard Lavaud, Marie-Anne Lachaux, Pierre Stock, Teven Le Scao, Thibaut Lavril, Thomas Wang, Timothée Lacroix, and William El Sayed. Mistral 7B, 2023.
  • [JSR+24] Albert Q. Jiang, Alexandre Sablayrolles, Antoine Roux, Arthur Mensch, Blanche Savary, Chris Bamford, Devendra Singh Chaplot, Diego de las Casas, Emma Bou Hanna, Florian Bressand, Gianna Lengyel, Guillaume Bour, Guillaume Lample, Lélio Renard Lavaud, Lucile Saulnier, Marie-Anne Lachaux, Pierre Stock, Sandeep Subramanian, Sophia Yang, Szymon Antoniak, Teven Le Scao, Théophile Gervet, Thibaut Lavril, Thomas Wang, Timothée Lacroix, and William El Sayed. Mixtral of experts, 2024.
  • [KMH+20] Jared Kaplan, Sam McCandlish, Tom Henighan, Tom B. Brown, Benjamin Chess, Rewon Child, Scott Gray, Alec Radford, Jeffrey Wu, and Dario Amodei. Scaling laws for neural language models. arXiv preprint arXiv:2001.08361, 2020.
  • [LBE+23] Yuanzhi Li, Sébastien Bubeck, Ronen Eldan, Allie Del Giorno, Suriya Gunasekar, and Yin Tat Lee. Textbooks are all you need II: phi-1.5 technical report. arXiv preprint arXiv:2309.05463, 2023.
  • [LHE22] Stephanie Lin, Jacob Hilton, and Owain Evans. TruthfulQA: Measuring how models mimic human falsehoods, 2022.
  • [MCKS18] Todor Mihaylov, Peter Clark, Tushar Khot, and Ashish Sabharwal. Can a suit of armor conduct electricity? A new dataset for open book question answering, 2018.
  • [MRB+23] Niklas Muennighoff, Alexander M. Rush, Boaz Barak, Teven Le Scao, Aleksandra Piktus, Nouamane Tazi, Sampo Pyysalo, Thomas Wolf, and Colin Raffel. Scaling data-constrained language models. arXiv preprint arXiv:2305.16264, 2023.
  • [NWD+20] Yixin Nie, Adina Williams, Emily Dinan, Mohit Bansal, Jason Weston, and Douwe Kiela. Adversarial NLI: A new benchmark for natural language understanding, 2020.
  • [RHS+23] David Rein, Betty Li Hou, Asa Cooper Stickland, Jackson Petty, Richard Yuanzhe Pang, Julien Dirani, Julian Michael, and Samuel R. Bowman. GPQA: A graduate-level Google-proof Q&A benchmark, 2023.
  • [RWC+19] Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. Language models are unsupervised multitask learners. OpenAI blog, 1(8):9, 2019.
  • [SLBBC19] Keisuke Sakaguchi, Ronan Le Bras, Chandra Bhagavatula, and Yejin Choi. WinoGrande: An adversarial Winograd schema challenge at scale. arXiv preprint arXiv:1907.10641, 2019.
  • [SRR+22] Aarohi Srivastava, Abhinav Rastogi, Abhishek Rao, Abu Awal Md Shoeb, Abubakar Abid, Adam Fisch, Adam R. Brown, Adam Santoro, Aditya Gupta, Adrià Garriga-Alonso, et al. Beyond the imitation game: Quantifying and extrapolating the capabilities of language models. arXiv preprint arXiv:2206.04615, 2022.
  • [SSS+22] Mirac Suzgun, Nathan Scales, Nathanael Schärli, Sebastian Gehrmann, Yi Tay, Hyung Won Chung, Aakanksha Chowdhery, Quoc V. Le, Ed H. Chi, Denny Zhou, and Jason Wei. Challenging BIG-Bench tasks and whether chain-of-thought can solve them, 2022.
  • [THLB19] Alon Talmor, Jonathan Herzig, Nicholas Lourie, and Jonathan Berant. CommonsenseQA: A question answering challenge targeting commonsense knowledge, 2019.
  • [TLI+23] Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, Aurelien Rodriguez, Armand Joulin, Edouard Grave, and Guillaume Lample. LLaMA: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971, 2023.
  • [TMH+24] Gemma Team, Thomas Mesnard, Cassidy Hardin, Robert Dadashi, Surya Bhupatiraju, Shreya Pathak, Laurent Sifre, Morgane Rivière, Mihir Sanjay Kale, Juliette Love, et al. Gemma: Open models based on Gemini research and technology, 2024.
  • [VSP+17] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. In Advances in Neural Information Processing Systems, volume 30, 2017.
  • [ZCG+23] Wanjun Zhong, Ruixiang Cui, Yiduo Guo, Yaobo Liang, Shuai Lu, Yanlin Wang, Amin Saied, Weizhu Chen, and Nan Duan. AGIEval: A human-centric benchmark for evaluating foundation models, 2023.
  • [ZCS+23] Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, Siyuan Zhuang, Zhanghao Wu, Yonghao Zhuang, Zi Lin, Zhuohan Li, Dacheng Li, Eric Xing, et al. Judging LLM-as-a-judge with MT-Bench and Chatbot Arena. arXiv preprint arXiv:2306.05685, 2023.
  • [ZHB+19] Rowan Zellers, Ari Holtzman, Yonatan Bisk, Ali Farhadi, and Yejin Choi. HellaSwag: Can a machine really finish your sentence? In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 4791–4800, 2019.

Appendix A Example prompt for benchmarks

Appendix B Authors

Marah Abdin

Russell J. Hewett

Corby Rosset

Sam Ade Jacobs

Jamie Huynh

Olatunji Ruwase

Ammar Ahmad Awan

Mojan Javaheripi

Olli Saarikivi

Jyoti Aneja

Xin Jin

Amin Saied

Ahmed Awadallah

Piero Kauffmann

Adil Salim

Hany Awadalla

Nikos Karampatziakis

Michael Santacroce

Nguyen Bach

Dongwoo Kim

Shital Shah

Amit Bahree

Mahoud Khademi

Ning Shang

Arash Bakhtiari

Lev Kurilenko

Hiteshi Sharma

Harkirat Behl

James R. Lee

Xia Song

Alon Benhaim

Yin Tat Lee

Xin Wang

Misha Bilenko

Yuanzhi Li

Rachel Ward

Johan Bjorck

Chen Liang

Guanhua Wang

Sébastien Bubeck

Weishung Liu

Philipp Witte

Martin Cai

Eric Lin

Michael Wyatt

Caio César Teodoro Mendes

Zeqi Lin

Jiahang Xu

Weizhu Chen

Piyush Madan

Can Xu

Vishrav Chaudhary

Arindam Mitra

Sonali Yadav

Parul Chopra

Hardik Modi

Fan Yang

Allie Del Giorno

Brandon Norick

Ziyi Yang

Gustavo de Rosa

Anh Nguyen

Donghan Yu

Matthew Dixon

Barun Patra

Chengruidong Zhang

Ronen Eldan

Daniel Perez-Becker

Cyril Zhang

Dan Iter

Heyang Qin

Jianwen Zhang

Abhishek Goswami

Thomas Portet

Li Lyna Zhang

Suriya Gunasekar

Reid Pryzant

Yi Zhang

Emman Haider

Sambudha Roy

Yunan Zhang

Junheng Hao

Marko Radmilac

Xiren Zhou
