Ars AI headline experiment finale—we came, we saw, we used a lot of compute time


We may have bitten off more than we could chew, folks.

An Amazon engineer told me that when he heard what I was trying to do with Ars headlines, the first thing he thought was that we had chosen a deceptively hard problem. He warned that I needed to be careful about properly setting my expectations. If this was a real business problem… well, the best thing he could do was suggest reframing the problem from “good or bad headline” to something less concrete.

That statement was the most family-friendly and concise way of framing the outcome of my four-week, part-time crash course in machine learning. As of this moment, my PyTorch kernels aren’t so much torches as they are dumpster fires. The accuracy has improved slightly, thanks to professional intervention, but I am nowhere near deploying a working solution. Today, as I am allegedly on vacation visiting my parents for the first time in over a year, I sat on a couch in their living room working on this project and accidentally launched a model training job locally on the Dell laptop I brought—with a 2.4 GHz Intel Core i3 7100U CPU—instead of in the SageMaker copy of the same Jupyter notebook. The Dell locked up so hard I had to pull the battery out to reboot it.

But hey, if the machine isn’t necessarily learning, at least I am. We’re almost at the end, but if this were a classroom assignment, my grade on the transcript would probably be an “Incomplete.”

The gang tries some machine learning

To recap: I was given the pairs of headlines used for Ars articles over the past five years with data on the A/B test winners and their relative click rates. Then I was asked to use Amazon Web Services’ SageMaker to create a machine-learning algorithm to predict the winner in future pairs of headlines. I ended up going down some ML blind alleys before consulting various Amazon sources for some much-needed help.
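For the curious, here is roughly how pairwise A/B data like this gets posed to a text classifier as a plain binary problem. The file and column names below are my own stand-ins for illustration, not the actual schema of the Ars data set.

import pandas as pd

# Hypothetical framing: one row per A/B test, with both headline
# variants and a flag recording which one won. (Names are stand-ins.)
df = pd.read_csv("ab_headlines.csv")  # stand-in file name

# Concatenate the two variants into a single input string and turn
# "which headline won" into a 0/1 label the model can predict.
df["text"] = df["headline_a"] + " [SEP] " + df["headline_b"]
df["label"] = (df["winner"] == "a").astype(int)

print(df[["text", "label"]].head())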

Most of the pieces are in place to finish this project. We (more accurately, my “call a friend at AWS” lifeline) had some success with different modeling approaches, though the accuracy rating (just north of 70 percent) was not as definitive as one would like. If I crib their notes and use the algorithms they created, I have enough to work with to produce (with some additional elbow grease) a deployed model and code to run predictions on pairs of headlines.

But I’ve got to be honest: my efforts to reproduce that work both on my own local server and on SageMaker have fallen flat. In the process of fumbling my way through the intricacies of SageMaker (including forgetting to shut down notebooks, running automated learning processes that I was later advised were for “enterprise customers,” and other miscues), I’ve burned through more AWS budget than I would be comfortable spending on an unfunded adventure. And while I understand intellectually how to deploy the models that have resulted from all this futzing around, I am still debugging the actual execution of that deployment.

If nothing else, this project has become a very interesting lesson in all the ways machine-learning projects (and the people behind them) can fail. And failure this time began with the data itself—or even with the question we chose to ask with it.

I may still get a working solution out of this effort. But in the meantime, I’m going to share the data set on my GitHub that I worked with to provide a more interactive component to this adventure. If you’re able to get better results, be sure to join us next week to taunt me in the live wrap-up to this series. (More details on that at the end.)

Modeler’s glue

After several iterations of tuning the SqueezeBert model we used in our redirected attempt to train for headlines, the resulting model was consistently hitting 66 percent accuracy in testing, somewhat short of the above-70-percent accuracy that earlier results had promised.

This included efforts to shrink the size of the steps the model takes between learning cycles to adjust its weights, governed by the “learning rate” hyperparameter used to avoid overfitting or underfitting the model. We reduced the learning rate substantially because, with a small data set like ours, too high a learning rate makes the model take sweeping leaps of inference about the structure and syntax of the data set; lowering it forces the model to adjust in little baby steps. Our original learning rate was set to 2×10⁻⁵ (2e-5); we ratcheted that down to 1×10⁻⁵ (1e-5).
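In Hugging Face terms, that is a one-line change. Here is a minimal sketch, assuming the Trainer API is driving the SqueezeBert fine-tuning; the output directory, epoch count, and batch size are illustrative values, not our exact settings.

from transformers import TrainingArguments

# Training settings for the SqueezeBert fine-tuning run; everything
# except the learning rate is a placeholder value.
training_args = TrainingArguments(
    output_dir="./squeezebert-headlines",
    num_train_epochs=3,
    per_device_train_batch_size=16,
    learning_rate=1e-5,  # down from 2e-5: smaller weight updates per step
)

These arguments then get handed to a Trainer along with the model and the tokenized headline pairs.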

We also tried a much larger model that had been pre-trained on a vast amount of text, called DeBERTa (Decoding-enhanced BERT with Disentangled Attention). DeBERTa is a very sophisticated model: 48 Transformer layers with 1.5 billion parameters.

DeBERTa is so fancy, it has outperformed humans on natural-language understanding tasks in the SuperGLUE benchmark—the first model to do so.

The resulting deployment package is also pretty hefty: 2.9 gigabytes. With all that additional machine-learning heft, we got back up to 72 percent accuracy. Considering that DeBERTa is supposedly better than a human when it comes to spotting meaning within text, this accuracy is, as a famous nuclear power plant operator once said, “not great, not terrible.”
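Swapping the model in is the easy part. Here is a minimal sketch, assuming the 1.5-billion-parameter checkpoint is the microsoft/deberta-v2-xxlarge release on the Hugging Face hub, and assuming you have the disk space and patience for the download.

from transformers import AutoModelForSequenceClassification, AutoTokenizer

model_name = "microsoft/deberta-v2-xxlarge"  # assumed checkpoint name
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=2)

# Count the parameters to see what we're up against.
total_params = sum(p.numel() for p in model.parameters())
print(f"{total_params / 1e9:.2f} billion parameters")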

Deployment death spiral

On top of that, the clock was ticking. I needed to try to get a version of my own up and running to test out with real data.

An attempt at a local deployment did not go well, particularly from a performance perspective. Without a good GPU available, the PyTorch jobs running the model and the endpoint literally brought my system to a halt.
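A cheap sanity check before launching anything heavy locally is to ask PyTorch what hardware it actually sees; this is a minimal sketch.

import torch

# Confirm whether a usable GPU is present before kicking off training
# or inference; on a laptop-class CPU, also keep PyTorch from grabbing
# every core and freezing the machine.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(f"Running on: {device}")

if device.type == "cpu":
    torch.set_num_threads(2)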

So, I returned to trying to deploy on SageMaker. I attempted to run the smaller SqueezeBert modeling job on SageMaker on my own, but it quickly got more complicated. Training requires PyTorch, the Python machine-learning framework, as well as a collection of other modules. But when I imported the required Python modules into my SageMaker PyTorch kernel, their versions didn’t line up cleanly with what my code expected, even after updates.

As a result, parts of the code that worked on my local server failed, and my efforts became mired in a morass of dependency entanglement. It turned out to be a problem with a version of the NumPy library, except when I forced a reinstall (pip uninstall numpy, then pip install numpy --no-cache-dir), the version was the same and the error persisted. I finally got it fixed, but then I was met with another error that hard-stopped me from running the training job and instructed me to contact customer service:

ResourceLimitExceeded: An error occurred (ResourceLimitExceeded) when calling the CreateTrainingJob operation: The account-level service limit 'ml.p3.2xlarge for training job usage' is 0 Instances, with current utilization of 0 Instances and a request delta of 1 Instances. Please contact AWS support to request an increase for this limit.
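For reference, that limit is checked against the instance type requested when the training job is created. In the SageMaker Python SDK, the call looks roughly like this; the entry-point script, framework versions, hyperparameters, and S3 path are my own placeholders, not the actual training code.

import sagemaker
from sagemaker.pytorch import PyTorch

role = sagemaker.get_execution_role()

estimator = PyTorch(
    entry_point="train.py",          # placeholder training script
    role=role,
    instance_count=1,
    instance_type="ml.p3.2xlarge",   # the GPU instance type the quota error refers to
    framework_version="1.8.1",       # assumed PyTorch/Python versions
    py_version="py36",
    hyperparameters={"epochs": 3, "learning_rate": 1e-5},
)

# This is the call that triggers CreateTrainingJob and fails with
# ResourceLimitExceeded when the account's quota for the instance type is zero.
estimator.fit({"train": "s3://my-bucket/headlines/train"})  # placeholder S3 path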

In order to fully complete this effort, I needed to get Amazon to up my quota, which is not something I had anticipated when I started plugging away. It’s an easy fix, but troubleshooting the module conflicts ate up most of a day. And the clock ran out on me as I attempted to sidestep the quota issue by taking the pre-built model my expert help had provided and deploying it as a SageMaker endpoint.
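When (if) I get there, deploying a packaged Hugging Face model as a SageMaker endpoint looks roughly like the sketch below; the S3 path, container versions, and instance type are my assumptions, not the exact configuration my AWS helpers used.

import sagemaker
from sagemaker.huggingface import HuggingFaceModel

role = sagemaker.get_execution_role()

# The trained model is assumed to be packaged as a model.tar.gz on S3.
huggingface_model = HuggingFaceModel(
    model_data="s3://my-bucket/headline-model/model.tar.gz",  # placeholder path
    role=role,
    transformers_version="4.6.1",  # assumed container versions
    pytorch_version="1.7.1",
    py_version="py36",
)

predictor = huggingface_model.deploy(
    initial_instance_count=1,
    instance_type="ml.m5.xlarge",
)

# Score a candidate pair framed the same way as the training data.
print(predictor.predict({"inputs": "Headline A [SEP] Headline B"}))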

This effort is now in extra time. This is where I would have been discussing how the model did in testing against recent headline pairs—if I ever got the model to that point. If I can ultimately make it, I’ll put the outcome in the comments and in a note on my GitHub page.
