What is wrong with LLM benchmarks, and why are we still using them? - sh.itjust.works

micheal65536@lemmy.micheal65536.duckdns.org · 11 months ago

In that case ChatGPT is correct, it cannot work with links. You will need to download the video transcript (subtitles) yourself and ask it to summarise that. This definitely works, people have been doing it for months.
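For what it's worth, the "download the subtitles and paste them in" step can be sketched roughly like this. It assumes the subtitles are in SRT format; `srt_to_text` and `build_summary_prompt` are hypothetical helper names for illustration, not part of any tool mentioned here, and you would still paste the resulting prompt into ChatGPT yourself.

```python
import re

# Matches SRT timing lines such as "00:00:01,000 --> 00:00:04,000".
TIMING = re.compile(
    r"^\d{2}:\d{2}:\d{2}[,.]\d{3}\s+-->\s+\d{2}:\d{2}:\d{2}[,.]\d{3}"
)


def srt_to_text(srt: str) -> str:
    """Strip cue numbers, timing lines and blank lines from SRT subtitles.

    Caveat: a spoken line consisting only of digits (e.g. "2023") is
    indistinguishable from a cue number here and will be dropped.
    """
    kept = []
    for raw in srt.splitlines():
        # Drop simple formatting tags such as <i>...</i>.
        line = re.sub(r"<[^>]+>", "", raw).strip()
        if not line or line.isdigit() or TIMING.match(line):
            continue
        # Skip immediate repeats, which are common in auto-generated subtitles.
        if kept and kept[-1] == line:
            continue
        kept.append(line)
    return " ".join(kept)


def build_summary_prompt(transcript: str) -> str:
    """Wrap the plain transcript in a summarisation request."""
    return "Summarise the following video transcript:\n\n" + transcript
```

If the transcript is longer than the model's context window, it would need to be split into chunks and summarised piece by piece.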

micheal65536@lemmy.micheal65536.duckdns.org · 11 months ago

Probably another case of “I don’t want people training AI on my posts/images so I’m nuking my entire online existence”.

micheal65536@lemmy.micheal65536.duckdns.org · 1 year ago

Without knowing anything about this model or what it was trained on or how it was trained, it’s impossible to say exactly why it displays this behavior. But there is no “hidden layer” in llama.cpp that allows for “hardcoded”/“built-in” content.

It is absolutely possible for the model to “override pretty much anything in the system context”. Consider any regular “censored” model, and how any attempt at adding system instructions to change/disable this behavior is mostly ignored. This model is probably doing much the same thing except with a “built-in story” rather than a message that says “As an AI assistant, I am not able to …”.

As I say, without knowing anything more about what model this is or what the training data looked like, it’s impossible to say exactly why/how it has learned this behavior or even if it’s intentional (this could just be a side-effect of the model being trained on a small selection of specific stories, or perhaps those stories were over-represented in the training data).

micheal65536@lemmy.micheal65536.duckdns.org · 1 year ago

There doesn’t appear to be a model anywhere, unless that has been published completely separately and not mentioned anywhere in the code documentation.

micheal65536@lemmy.micheal65536.duckdns.org · 1 year ago

Can someone explain to me why there are so many frameworks focused on LLM-based “agents” (LangChain, {{guidance}}, and now whatever this is) and how these are practically useful? I have yet to find a model that can successfully perform even a simple database query to answer an easy question (searching for one or two items by keyword, retrieving their quantities, and adding the quantities together if applicable), regardless of the model, prompt template, or function API used.
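To make it concrete, the kind of task I mean is roughly the following when written by hand. This is only a sketch: the `items` table, its `name` and `quantity` columns, and the `total_quantity` function are made up for illustration and are not taken from any of the frameworks mentioned.

```python
import sqlite3


def total_quantity(conn: sqlite3.Connection, keywords: list[str]) -> int:
    """Sum the quantity of every item whose name contains any keyword.

    A single query with OR-ed conditions is used so that an item matching
    two keywords is only counted once. Wildcard characters (% and _) inside
    a keyword are not escaped in this sketch.
    """
    if not keywords:
        return 0
    where = " OR ".join("name LIKE ?" for _ in keywords)
    params = [f"%{kw}%" for kw in keywords]
    row = conn.execute(
        f"SELECT COALESCE(SUM(quantity), 0) FROM items WHERE {where}", params
    ).fetchone()
    return row[0]


def make_demo_db() -> sqlite3.Connection:
    """Build a small in-memory database with hypothetical inventory rows."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE items (name TEXT, quantity INTEGER)")
    conn.executemany(
        "INSERT INTO items VALUES (?, ?)",
        [("red apple", 3), ("green apple", 4), ("banana", 5), ("pear", 2)],
    )
    return conn
```

An “agent” answering “how many apples and bananas do we have?” would have to pick the keywords, call something like `total_quantity(conn, ["apple", "banana"])`, and report the result.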

micheal65536@lemmy.micheal65536.duckdns.org · 1 year ago

There are only a few popular LLMs. A few more if you count variations such as “uncensored” etc. Most of the others tend not to perform well or don’t differ much from the more popular ones.

I would think that the difference is likely for two reasons:

1. LLMs require more effort in curating the dataset for training. Whereas a Stable Diffusion model can be trained by grabbing a bunch of pictures of a particular subject or style and throwing them in a directory, an LLM requires careful gathering and reformatting of text. If you want an LLM to write dialog for a particular character, for example, you would need to try to find or write a lot of existing dialog for that character, which is generally harder than just searching for images on the internet.
2. LLMs are already more versatile. For example, most of the popular LLMs will already write dialog for a particular character (or at least attempt to) just by being given a description of the character and possibly a short snippet of sample dialog. Fine-tuning doesn’t give any significant performance improvement in that regard. If you want the LLM to write in a specific style, such as Old English, it is usually sufficient to just instruct it to do so and perhaps prime the conversation with a sentence or two written in that style.
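As a rough illustration of the second point, priming a model with a character description and a short snippet of sample dialog could look something like this. The function name, the prompt wording, and the example character are all made up for the sketch; real prompt templates vary from model to model.

```python
def build_character_prompt(
    name: str,
    description: str,
    sample_lines: list[str],
    user_message: str,
) -> str:
    """Assemble a plain-text prompt that primes a model to speak as a character."""
    lines = [
        f"Write dialog for the character {name}.",
        f"Character description: {description}",
    ]
    if sample_lines:
        # A sentence or two in the target voice is often enough to set the style.
        lines.append("Sample dialog:")
        lines.extend(f"{name}: {sample}" for sample in sample_lines)
    lines.append(f"User: {user_message}")
    # End on the character's name so the model continues in that voice.
    lines.append(f"{name}:")
    return "\n".join(lines)


prompt = build_character_prompt(
    name="Captain Mara",
    description="a gruff but kind starship captain",
    sample_lines=["Stow that gear before I stow it for you."],
    user_message="Are we ready to launch?",
)
```

No training data is gathered here at all, which is the contrast with the Stable Diffusion case.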

micheal65536@lemmy.micheal65536.duckdns.org · edit-2 1 year ago

WizardLM 13B (I didn’t notice any significant improvement with the 30B version): it tends to be a bit confined to a standard output format at the expense of accuracy (e.g. it will always try to give both sides of an argument even if there isn’t another side or the question isn’t an argument at all), but it is good for simple questions.

LLaMa 2 13B (not the chat-tuned version): this one takes some practice with prompting, as it doesn’t really understand conversation and won’t know what it’s supposed to do unless you make it clear from contextual clues. It feels refreshing to use, though, as the model is (as far as is practical) unbiased/uncensored, so you don’t get all the annoying lectures and stuff.
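By “contextual clues” I mean something like framing the prompt as a document that the base model can simply continue. A minimal sketch, assuming a question-and-answer layout; the header text, the `Q:`/`A:` labels, and the function name are my own choices for illustration, not anything the model requires.

```python
def build_completion_prompt(
    examples: list[tuple[str, str]],
    question: str,
    header: str = "The following is a list of questions with short, accurate answers.",
) -> str:
    """Frame a question as the next entry in a Q/A document for a base model."""
    lines = [header, ""]
    for example_question, example_answer in examples:
        # Worked examples show the model the pattern it is expected to continue.
        lines.append(f"Q: {example_question}")
        lines.append(f"A: {example_answer}")
        lines.append("")
    lines.append(f"Q: {question}")
    # Stop right after "A:" so the most natural continuation is the answer.
    lines.append("A:")
    return "\n".join(lines)


prompt = build_completion_prompt(
    examples=[("What is the capital of France?", "Paris.")],
    question="What is the capital of Japan?",
)
```

Since a base model will happily keep writing further questions and answers, generation usually has to be cut off at the next `Q:` line by whatever is driving the model.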
