ML-Bench: Large Language Models Leverage Open-source Libraries for Machine Learning Tasks

Yale University Nanjing University Peking University
* Contributed equally.

ML-Bench aims to evaluate the effectiveness of LLMs in utilizing existing libraries to perform machine learning tasks.

Our contribution:

1. We propose a novel task that requires LLMs to comprehend long-context documents, navigate codebases, understand instructions, and generate executable code.

2. We provide multiple carefully designed settings to accommodate different kinds of LLMs (i.e., closed-source LLMs, open-source LLMs, and agents).

3. We conduct comprehensive evaluations across these settings and popular LLMs. Experiments show that GPT-4 markedly outperforms other LLMs, yet it still accomplishes only 39.73% of the tasks; other popular LLMs suffer from hallucinations and perform poorly.


Large language models have shown promising performance on code generation benchmarks. However, a considerable divide exists between these benchmark achievements and practical applicability, primarily because real-world programming relies on pre-existing libraries. Instead of evaluating LLMs on coding from scratch, this work proposes a new evaluation setup in which LLMs use open-source libraries to finish machine learning tasks. We therefore propose ML-Bench, an expansive benchmark developed to assess the effectiveness of LLMs in leveraging existing functions in open-source libraries. ML-Bench consists of 10,040 samples spanning 130 tasks over 14 notable machine learning GitHub repositories. In this setting, given a specific machine learning task instruction and the accompanying README in a codebase, an LLM is tasked with generating code to accomplish the task. This necessitates comprehending long, language-code interleaved documents as well as complex cross-file code structures, introducing new challenges. Notably, while GPT-4 exhibits a remarkable improvement over other LLMs, it manages to accomplish only 39.73% of the tasks, leaving huge room for improvement. We address these challenges by proposing ML-Agent, designed to effectively navigate the codebase, locate documentation, retrieve code, and generate executable code. Empirical results demonstrate that ML-Agent, built upon GPT-4, yields further improvements.
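The evaluation loop the abstract describes (prompt the model with a README plus a task instruction, extract the generated code, and check whether it executes) can be sketched as follows. This is a minimal illustrative sketch, not the paper's actual harness: the function names (`build_prompt`, `extract_code`, `execute`) and the prompt template are assumptions, and a canned string stands in for a real LLM call.

```python
import re
import subprocess
import sys
import tempfile

def build_prompt(readme: str, instruction: str) -> str:
    """Assemble the model input: repository README followed by the task instruction."""
    return (
        "You are given the README of a machine learning repository.\n\n"
        f"README:\n{readme}\n\n"
        f"Task: {instruction}\n"
        "Respond with a single fenced code block that accomplishes the task."
    )

def extract_code(response: str) -> str:
    """Pull the first fenced code block out of a model response."""
    match = re.search(r"```(?:\w+)?\n(.*?)```", response, re.DOTALL)
    return match.group(1) if match else response

def execute(code: str, timeout: int = 60) -> bool:
    """Run the generated code in a subprocess; success = exit code 0."""
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(code)
        path = f.name
    result = subprocess.run(
        [sys.executable, path], capture_output=True, timeout=timeout
    )
    return result.returncode == 0

# A canned "model response" standing in for a real LLM call.
response = "Here you go:\n```python\nprint('hello')\n```"
print(execute(extract_code(response)))  # prints True
```

A real harness would additionally set up each repository's environment and compare the script's behavior against task-specific success criteria, but the extract-then-execute skeleton above captures the pass/fail signal the benchmark's execution-based metric relies on.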

Benchmark Construction