This article walks you through the process of running Llama 2 on CPUs.
In the world of technology, innovation is the name of the game. But what if I told you that there’s a way to run Large Language Models (LLMs) like Llama 2 on CPUs, removing GPU constraints, and integrating smoothly with Apache Spark? That’s right, no third-party endpoints, no fuss. Just pure, unadulterated innovation.
The How-To Guide
The llama.cpp project has done the heavy lifting for us, enabling simplified LLMs to run on CPUs by quantizing their weights, that is, reducing the numeric precision they are stored at. Add in the llama-cpp-python bindings, and you've got simple access to llama.cpp from within Python. Spark's applyInPandas() takes care of the rest, splitting a large data source into manageable chunks that are processed in parallel.
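To make that concrete, here is a minimal sketch of calling a quantized model through llama-cpp-python. The model path, prompt, and generation parameters are illustrative assumptions, not the article's exact code.

```python
from llama_cpp import Llama

# Load a quantized GGUF model from local disk (path is illustrative).
llm = Llama(model_path="models/llama-2-7b-chat.Q4_K_M.gguf", n_ctx=4096)

chapter_text = "It was in July, 1805 ..."  # placeholder for one chapter's text

# Ask the model for a short summary; the parameters are just reasonable defaults.
output = llm(
    "Summarize the following text in two sentences:\n\n" + chapter_text,
    max_tokens=256,
    temperature=0.1,
)
print(output["choices"][0]["text"])
```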
The Plan
As a test, we'll use Llama 2 to summarize Leo Tolstoy's War and Peace. That's over 360 chapters, each treated as a document. We'll install a 7B quantized chat model and llama-cpp-python, download the novel, split it by chapter, and create a Spark DataFrame. Then, we'll partition by chapter and generate summaries.
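A rough sketch of the setup step might look like the following, assuming the quantized weights are pulled from the Hugging Face Hub. The repo id and filename here are assumptions; any llama.cpp-compatible quantization of the 7B chat model works.

```python
# pip install llama-cpp-python huggingface_hub
from huggingface_hub import hf_hub_download

# Fetch a quantized 7B chat model in GGUF format (repo and filename are assumptions).
model_path = hf_hub_download(
    repo_id="TheBloke/Llama-2-7B-Chat-GGUF",
    filename="llama-2-7b-chat.Q4_K_M.gguf",
    local_dir="models",
)
print(model_path)
```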
Processing the Text
With a bit of Bash and Python magic, we'll download the novel, strip the boilerplate header and footer, and split it into chapters. The result? A DataFrame with 365 rows, each containing a chapter number and its full text.
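Here is one possible Python-only version of that step. The Project Gutenberg URL and the chapter-splitting regex are assumptions that depend on which edition you download.

```python
import re
import urllib.request

import pandas as pd

# Plain-text edition of War and Peace (URL is an assumption).
url = "https://www.gutenberg.org/cache/epub/2600/pg2600.txt"
raw = urllib.request.urlopen(url).read().decode("utf-8").replace("\r\n", "\n")

# Keep only the text between the standard Project Gutenberg start/end markers.
start = raw.index("*** START OF")
start = raw.index("\n", start) + 1  # skip the marker line itself
end = raw.index("*** END OF")
body = raw[start:end]

# Split on "CHAPTER I", "CHAPTER II", ... headings and drop the front matter
# that precedes the first heading.
chapters = [c.strip() for c in re.split(r"\nCHAPTER [IVXLC]+\n", body) if c.strip()]
chapters = chapters[1:]

df = pd.DataFrame({"chapter": range(1, len(chapters) + 1), "text": chapters})
print(len(df), "chapters")
```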
Spark Processing
The Python code for generating a single chapter summary is a breeze. We'll use Spark to groupby() chapter and call applyInPandas(), with an output schema containing the chapter number and its summary. The result? A summary of each chapter in a matter of minutes.
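A hedged sketch of how the Spark side might be wired up, reusing the pandas DataFrame and model path from the earlier snippets. The prompt, schema names, and generation parameters are illustrative, not the article's exact code.

```python
import pandas as pd
from pyspark.sql import SparkSession
from llama_cpp import Llama

spark = SparkSession.builder.getOrCreate()

# `df` is the pandas DataFrame of chapters built above.
chapters_sdf = spark.createDataFrame(df)

def summarize(pdf: pd.DataFrame) -> pd.DataFrame:
    # Each group holds exactly one chapter; load the model inside the worker.
    llm = Llama(
        model_path="models/llama-2-7b-chat.Q4_K_M.gguf", n_ctx=4096, verbose=False
    )
    # Long chapters may need truncation to fit the context window.
    text = pdf["text"].iloc[0]
    out = llm(
        "Summarize this chapter of War and Peace in a few sentences:\n\n" + text,
        max_tokens=256,
        temperature=0.1,
    )
    return pd.DataFrame(
        {
            "chapter": [int(pdf["chapter"].iloc[0])],
            "summary": [out["choices"][0]["text"]],
        }
    )

summaries = (
    chapters_sdf.groupby("chapter")
    .applyInPandas(summarize, schema="chapter long, summary string")
)
summaries.show(5, truncate=False)
```

Note that loading the model inside the grouped function means every task pays the load cost; caching the model per executor is exactly the kind of follow-up mentioned in the next section.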
Summary
We’ve shown how to distribute LLM workloads using Spark with minimal effort. Next steps might include more efficient load/caching of models, parameter optimization, and custom prompts. The possibilities are endless.