A few weeks before the end of last year, NeurIPS wrapped up its week-long 2023 programme in New Orleans. It was the biggest NeurIPS yet in terms of in-person attendees (13,307) and accepted papers (3,540), and possibly the largest academic AI conference ever1.
Given its scale, it’s an impossible conference to summarise. For some fragments from the invited talks, as well as some of the orals, the exhibit hall, poster sessions, tutorials, workshops, and competitions, see our daily blogs.
Having said that, in this post we attempt to take a step back and highlight themes from the conference that stood out to us — as well as what they might suggest about AI trends in 2024.
Plenty of room at the bottom
One of the main themes throughout the conference sessions was that many current cutting-edge models are too big. Not that they’re cumbersome to manage, expensive to run, difficult to train, or take up a lot of memory — but that they’re bigger than they need to be, and that equivalent performance can be achieved with smaller models.
Throughout this NeurIPS, there were researchers presenting significant leaps forward on the efficiency front — whether through mathematically equivalent algorithmic improvements to the implementation of attention, alternatives to attention which improve asymptotic scaling, clever quantisation techniques which reduce memory usage, or more thoughtful data filtering which improves performance. For just a few highlights of these, see our summary of the Efficiency Oral Session, Chris Ré’s Invited Talk, and the Beyond Scaling panel.
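To make the quantisation idea concrete — this is a generic illustration, not the method of any particular paper from the session — here’s a minimal sketch of symmetric int8 weight quantisation, where float32 weights are stored as int8 values plus a single scale factor, cutting memory use by roughly 4x:

```python
import numpy as np

def quantise_int8(weights: np.ndarray):
    """Symmetric per-tensor quantisation: map floats to int8 plus one scale."""
    scale = np.abs(weights).max() / 127.0
    q = np.clip(np.round(weights / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantise(q: np.ndarray, scale: float) -> np.ndarray:
    """Recover approximate float32 weights from the int8 representation."""
    return q.astype(np.float32) * scale

w = np.random.randn(64, 64).astype(np.float32)
q, s = quantise_int8(w)
w_hat = dequantise(q, s)

# int8 storage is 4x smaller than float32; the price is a small
# reconstruction error, bounded by half a quantisation step.
max_err = np.abs(w - w_hat).max()
```

Real systems are more sophisticated (per-channel scales, outlier handling, quantisation-aware fine-tuning), but the core trade-off — fewer bits per weight in exchange for bounded error — is the same.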
This thinking was also validated by models released during NeurIPS, such as Mixtral and Phi. Both of these small-ish models show benchmark performance that’s equivalent and sometimes superior to larger models.
To quote Björn Ommer (quoting Richard Feynman) during his invited talk on Scaling and Generative AI — “There’s plenty of room at the bottom”.2
Flavour of the week: LLMs and Diffusion Models
This was the first NeurIPS with submission deadlines after ChatGPT and Stable Diffusion’s release3, and as expected there was a lot of attention on both LLMs and on Diffusion Models. Many of the best-attended sessions focused on topics related to these — such as the tutorial on Latent Diffusion Models, and several of the Invited Talks.
Fittingly, the Test of Time award went to a paper which set up a lot of the ingredients for the LLM revolution (Jeff Dean and Greg Corrado presented; Ilya Sutskever and the other co-authors weren’t there to collect it in person).
The exhibition hall featured many companies with specialised solutions for effectively pre-training, fine-tuning, and serving LLMs — alongside the usual large tech firms, quant traders, and MLOps solutions.
Better data, please
Alongside the growth of the relatively young datasets and benchmarks track, data continues to be a focus at NeurIPS. Many of the speakers referenced the importance of a deep understanding of training and evaluation data, with the emphasis shifting from quantity to quality.
One of the runners-up for the outstanding paper award, Scaling Data-Constrained Language Models, examined the effects of multi-epoch training in LLMs, as well as presenting several other interesting empirical results around training data for LLMs.
In one of the conference competitions, the LLM Efficiency Challenge (where participants maximised fine-tuned model performance given only 24 hours and a single GPU), the winners attributed much of their edge over others to selecting the right subset of training data.
The tutorial on Data-Centric AI made a compelling case for data-centric learning (as opposed to model-centric learning), and presented several useful resources to help use this approach in building more reliable and responsible AI, including a tool for monitoring performance on subsets of data during model training.
Degrees of openness
In the panel on Beyond Scaling, Percy Liang pointed out that thinking of a foundation model4 as “open” or “not open” isn’t a very useful distinction, and that it’s more useful to think about properties such as a model being open-weights, open-training-data, or open-training-code.
Many recent models, like Meta’s Llama/Llama2, Microsoft’s Phi, and Mistral’s models, are open-weights — in the sense that anyone can download the model weights for their own inference or fine-tuning. But this doesn’t tell us how the model was trained, or on what data5. And without knowing those two things, it’s hard to really know how good a model is, or how to get the most out of it.
Organisations that the panel highlighted for releasing models which are open in more respects than just weights were Eleuther, HuggingFace, BigScience, AI2, and LLM360.
Benchmarking and Goodhart’s Law
As the community shifts to using more foundation models with varying degrees of openness, the benchmarking norms that were designed for open models (or closed models fully developed within the organisation using them) are no longer sufficient.
One of the key difficulties is: how can we know that a model wasn’t trained on the benchmark dataset it’s being evaluated on?
Even if a model wasn’t trained directly on a benchmark dataset, over time any publicly available benchmark dataset will leak into other data, especially when web-scraped training data is so pervasive. Without access to the training data, evaluators are unable to examine similarity between the eval/benchmark samples and the training corpus. This problem is exacerbated by the fact that models are marketed on their benchmark performance, creating incentives that aren’t conducive to thorough cleaning of training data — a clear example of Goodhart’s Law.
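As a sketch of the kind of similarity check evaluators would like to run — assuming training-data access, which closed models don’t grant, and using simple word-level n-gram overlap as a stand-in for more sophisticated decontamination methods — flagging benchmark samples that share n-grams with the training corpus might look like this:

```python
def ngrams(text: str, n: int = 8) -> set:
    """Set of word-level n-grams, lowercased for loose matching."""
    words = text.lower().split()
    return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}

def contaminated(benchmark_samples, training_docs, n: int = 8) -> list:
    """Return benchmark samples sharing at least one n-gram with the corpus."""
    train_grams = set()
    for doc in training_docs:
        train_grams |= ngrams(doc, n)
    return [s for s in benchmark_samples if ngrams(s, n) & train_grams]
```

This only catches near-verbatim overlap; paraphrased leakage is much harder to detect, which is part of why training-data openness matters for trustworthy evaluation.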
This is an open challenge, though the competitions track has been dealing with these considerations for some time.
For occasional email updates from ML Contests with more content like this conference coverage and insights into competitive ML, subscribe to our mailing list.
It was a great NeurIPS, and left us with the feeling that there’s much more to come soon — especially in terms of democratising access to powerful and fast models. We look forward to another year of groundbreaking research!
For more on NeurIPS 2023, read our daily blogs: expo day, tutorials, day 1, day 2, day 3, and the competition track days.
Our World in Data shows recent data for some of the top conferences, aggregating both virtual and in-person attendees. NeurIPS 2020 and 2021 were fully virtual, and NeurIPS 2022 had 9,835 attendees (source: NeurIPS fact sheet). The only other conferences listed there with more than 13,000 attendees are IROS 2020 and ICML 2021, which were both fully virtual. It’s possible that there were larger AI conferences a few decades ago; data for those is not as readily available.↩︎
Richard Feynman used this phrase as the title of a lecture which some see as the origin of nanotechnology. He was referring specifically to smaller-scale mechanical manipulation down to the level of individual atoms; in the machine-learning context it refers to parameter counts or memory usage rather than physical dimensions. More on Wikipedia.↩︎
Stable Diffusion was released in August 2022, and ChatGPT in November 2022. The NeurIPS 2022 conference took place after this, in December 2022, but much of the agenda for that conference had been set much earlier — with abstract and paper submission deadlines in May 2022.↩︎
Foundation model: “any model that is trained on broad data (generally using self-supervision at scale) that can be adapted (e.g., fine-tuned) to a wide range of downstream tasks” (arXiv)↩︎
There is a bit more info on Phi-2 training data — “Dataset size: 250B tokens, combination of NLP synthetic data created by AOAI GPT-3.5 and filtered web data from Falcon RefinedWeb and SlimPajama, which was assessed by AOAI GPT-4” — than Llama2 — “Llama 2 was pretrained on 2 trillion tokens of data from publicly available sources. The fine-tuning data includes publicly available instruction datasets, as well as over one million new human-annotated examples. Neither the pretraining nor the fine-tuning datasets include Meta user data”. (source: HuggingFace model cards)↩︎