Here’s an interesting prompt: turn a mockup into an application. I’ve been getting some dumbed-down outputs, so it’s no wonder that the Program Bench has such low scores.

The amount of hand-holding required is absurd, and the model can still get it wrong when presented with the answer, which is nuts. I do wonder what the future holds. Right now there’s just no way that these (large language) models are worth any money, since outputs vary wildly.

While I was working on WebVG, it became apparent that the (large language) models will often ignore part of the prompt. Cutting corners to “save tokens” is a ridiculous concept, as it neither gets anything done right nor actually saves tokens.