NHacker Next
How I run LLMs locally (abishekmuthian.com)
upghost 5 days ago [-]
> Before I begin I would like to credit the thousands or millions of unknown artists, coders and writers upon whose work the Large Language Models (LLMs) are trained, often without due credit or compensation

I like this. If we insist on pushing forward with GenAI we should probably at least make some digital or physical monument like "The Tomb of the Unknown Creator".

Cause they sure as sh*t ain't gettin paid. RIP.

qntmfred 5 days ago [-]
I was once a top 1% stack overflow user. I didn't start contributing to make money, just to share information and learn from each other.

that can and should also be good enough for the AI era.

o11c 5 days ago [-]
There's a difference between "I'm willing to exert myself to help people" and "I'm willing to exert myself so that somebody else can slap their name on my effort and make money, then deny me the right to do the same to them".

Also, at a social level - the worst kind of user has always been a help vampire, and LLMs are really good at increasing the number of help vampires.

Workaccount2 5 days ago [-]
An ungodly amount of paid software is built on the back of free Stack Exchange answers. Complaining about lack of compensation for answers that ultimately lead to revenue was just as valid 20 years ago as it is today.
lmz 5 days ago [-]
> Also, at a social level - the worst kind of user has always been a help vampire, and LLMs are really good at increasing the number of help vampires.

Maybe we should hype up LLMs more so the help vampires and LLMs can keep each other busy?

sdesol 5 days ago [-]
> I didn't start contributing to make money

Your intentions might not have been to make money, but you were creating social credit that could be redeemed for a higher and better paying job. With GitHub, Stack Overflow, etc., you are adding to your resume, but with AI, you literally get nothing in return for contributing.

TheDong 5 days ago [-]
This isn't universally true.

My contributions to Stack Overflow have all been done anonymously, and I haven't ever felt even the slightest bit of desire to link that identity to my real one, add it to my resume, or brag to friends.

Having and sharing knowledge to me is its own reward, and I have no intention of profiting from it in any way.

Your own personal way of thinking about the world shines through when you search for ulterior motivations and state as fact that those factors, like a desire for fame or money, must be present.

LVB 5 days ago [-]
I understand the perspective and generally have the same stance. But, a subtlety is that you know that "user425712" (made up) is you, and you can see your contributions being upvoted, quoted, discussed, your overall karma increase, etc.

Given that, as a thought experiment, would you be OK with your answers/comments being attributed to others, or that once you've submitted them, there is no linkage to you at all (e.g. you can't even know what you submitted)? Would it be as satisfying of a process if your contributions are just dissolved into a soup of other data irreversibly?

That doesn't sound like a system I'd be as keen to contribute to. Maybe the ulterior motive is at least being able to find my body of work as a source of personal fulfillment. Where is my work in the various LLMs? I have no idea, and will likely never know.

TheDong 5 days ago [-]
> Given that, as a thought experiment, would you be OK with your answers/comments being attributed to others, or that once you've submitted them, there is no linkage to you at all? Would it be as satisfying of a process if your contributions are just dissolved into a soup of other data irreversibly?

Yes. Wikipedia _almost_ operates like this. I have no expectations of anyone digging into who wrote what, it turns into a soup of information. I still do know I contributed, but I don't care if what I wrote gets rewritten, replaced, improved.

4chan does operate like this, and back in the days that /prog/ had meaningful discussions, I enjoyed participating in threads there.

bruce511 5 days ago [-]
Let me thank you for your contribution.

I have spent the last year in a new area (SQL) and I've written a lot of questions to LLMs, which they have been able to answer well enough for me to make speedy progress.

I'm a big fan of StackOverflow, and Google, and before that reference books, as ways to gather information and learn.

Each technology builds on the layer before. Information is disseminated, repackaged, reauthored.

I get that some people feel like their contribution should be the end of the line, despite the fact that they themselves got that knowledge from somewhere. Do they credit their college professor when posting on Reddit?

So again, thank you for your contribution. Your willingness to answer questions, and the answers you provided, will exist long after you and I do not.

I tip my hat.

EvanAnderson 5 days ago [-]
>Your intentions might not have been to make money, but you were creating social credit that could be redeemed for a higher and better paying job. With GitHub, Stack Overflow, etc., you are adding to your resume, but with AI, you literally get nothing in return for contributing.

I guess I never really thought about having a dog in this race (effectively being a tradesman as I am, and not a "content creator"). I did write a ton on ServerFault.com. I guess I am in this unwittingly.

I'm a little salty about the LLM training on my Stack Exchange answers but I knew what I was getting into when I signed up. I don't really subscribe to notions of "intellectual property" so I don't feel strongly on that front.

It just feels impolite and rude. More like plagiarism and less like copyright infringement. A matter of tact between people, versus a legal matter.

The way LLMs turn the collective human expression into "slop" that "they" then "speak" with a tone of authority about feels scummy. It feels like a person who has read a few books and picked up the vernacular and idiom of a trade confidently lying about being an expert.

I can't attribute that scumminess to the LLM itself, since it's just a pile of numbers. I absolutely attribute that scumminess to the companies making money from them.

re: Stack Exchange social credit and redeeming it - I'm not a good self-promoter, and admittedly ServerFault.com is a much smaller traffic Stack Exchange site than Stack Overflow, but being the top-ranked user on the site for 5+ years didn't confer much in the way of real-world benefits. I had a ton of fun though.

(I got a tiny bit of name recognition from some IRL people and a free trip to the Stack Overflow offices in NYC one time. I definitely got a boost of happiness every time a friend related a story to the effect of: "I ran into an issue, search-engined it, and came up with something you wrote on Server Fault that solved my problem.")

chii 5 days ago [-]
> I absolutely attribute that scumminess to the companies making money from them.

so if they weren't making money (or weren't planning on making any), then would it still be "scummy"?

In other words, do you feel that they're only scummy because they're able to profit off the work (whereas you didn't or couldn't)? Why isn't this sour grapes?

TheDong 5 days ago [-]
"sour grapes" assumes that both I, and the person using this information, intended to make money and they were better at it.

The reality is many wikipedia, stack overflow, etc contributors want information to be free and correct, and don't want money, so it's not sour grapes, it's rather annoyance at a perversion of the intent and vision.

I contributed to wikipedia because I want a free reservoir of human knowledge to benefit all, I want the commons to be rich with information. Anyone making money off it is scummy not because I couldn't figure out how to, but because they are perverting the intention of information to be free.

Instead, we've ended up with one of the main interfaces to wikipedia being a paid often inaccurate chatbot for a for-profit company which doesn't attribute wikipedia and burns down forests as a side-effect.

This isn't sour grapes, this is recognizing exploitation of the commons.

bruce511 5 days ago [-]
I might suggest that exploiting the commons does not diminish the value or accessibility of the commons. Indeed, it spreads knowledge faster.

Equally I'd suggest that the commons is not free. It has to be paid for by someone. Wikipedia exists by begging for donations. Google sells advertising (as does StackOverflow as job listings) etc.

I mean, the first carpenter who took "common knowledge" and wrote a (paid for) book did the same thing. Knowledge is definitely not free, and it costs money to spread it.

(As an aside, I've been using LLMs for free all year.)

All through history people have exploited the commons. The printing press, books, universities, education, radio, television, through computers, Google, sites like SO. LLMs are just the latest step in a long long line of history.

intended 5 days ago [-]
If you know of the commons, you almost certainly know of the “tragedy of the commons”. It is VERY clear that exploiting the commons diminishes its value.

There is no such thing as a free lunch. Overgrazing common pasture land results in its decimation.

There are national and international level bodies required to ensure we don't kill all the rhinos. Hell - that we don't kill all the people.

The printing press, universities, education - these are NOT commons in many places, nor do they function as commons. Let alone function as LLMs.

Common knowledge is not the same as the commons.

chii 5 days ago [-]
> It is VERY clear that exploiting the commons diminishes its value.

Does it diminish if the commons is knowledge-based, such as online sources? Those sources do not truly disappear after the information is extracted and placed into an LLM.

Unlike a physical commons, which has limitations on use, informational commons don't.

So the fact that someone else is able to gain more value out of the knowledge than others is not a reason to make them scummy - as if they alone don't deserve access to the knowledge that you claim should be free.

If contributors, after seeing how someone else is able to make profits off previously freely available knowledge, feel that they somehow now suddenly deserve to be paid after the fact, then I don't know what to call it but sour grapes.

8note 5 days ago [-]
that social credit remains though? you can still point at your history of answers for how well you understand the thing, and how well you can communicate that knowledge. a portfolio is still a portfolio
thornton 5 days ago [-]
These systems will collapse over time because the incentives are being removed for them to exist. So you won’t be able to point to your answers in quora or whatever but they’ll live in the training records and data and in some shape in neural nets being monetized.

I’m not anti what’s happening, or for it; it’s just that social credit depends on those institutions surviving.

sdesol 5 days ago [-]
With StackOverflow, GitHub, etc. you would likely have people reach out to you for opportunities. With AI, if you contribute to StackOverflow and if it gets picked up by AI, people may or may not know.
CamperBob2 5 days ago [-]
That's a very small price to pay for what we all stand to gain.

I pay for o1-pro cheerfully, but I wouldn't pay anything at all for Stack Overflow. ChatGPT certainly generates its share of BS, but I have yet to have a question rejected because somebody who was using a different language or OS asked about something vaguely similar 8 years ago.

sdesol 5 days ago [-]
> That's a very small price to pay for what we all stand to gain.

Sure, if AI were made free for everybody (or only charged at the cost to run it).

With Stack Overflow, GitHub and others, there is a mutual understanding that contributing can benefit the contributor. What is the incentive to continue contributing if the social agreement is, you get to help define a statistical weight for the next token and nobody will know?

I think the future business model may require AI companies to pay people to contribute, or it might not be a technology roadblock, but rather a data roadblock that prevents further advancement.

bruce511 5 days ago [-]
>> With Stack Overflow, GitHub and others, there is a mutual understanding that contributing can benefit the contributor.

I think lots of people contribute everywhere without getting any benefit at all.

I don't doubt that some do contribute hoping for, or expecting, some ancillary benefit.

I'd suggest that pretty much the only tangible benefit I can see is those searching for a job. Contributing in public spaces is a good technique for self-promotion as being skilled in an area.

Then again I'd suggest that the majority of people participating on those sites are already employed, so they're not doing it for that benefit. I'd even argue that their day job accomplishments are likely to be more impressive than their github account when it comes to their next interview.

So perhaps I can reassure you. I'm pretty sure people will continue to absorb information, and skills, and will continue to share that with others. This has been the way for thousands of years. It has survived the inventions of writing, printing, radio, television and the internet. It will survive LLMs.

intended 5 days ago [-]
> This has been the way for thousands of years. It has survived the inventions of writing, printing, radio, television and the internet. It will survive LLM

Guilds used to jealously guard their secrets. Metallurgy techniques were lost when their creators died, or were silenced.

And most crucially - the audience has always been primarily humans. There has never been an audience composition, where authors have to worry about plagiarism as the default.

The idea of free exchange of ideas is something that we enjoyed only recently.

This isn’t naysaying or doom and gloom - this is simply reality. Placing our hopes on the wrong things leads to disappointment, anger and resentment when reality decides our hopes are an insufficient argument to change its ways.

sdesol 5 days ago [-]
> Then again I'd suggest that the majority of people participating on those sites are already employed, so they're not doing it for that benefit.

But the company benefits from less confusion and a better user experience. Companies are literally paying employees to provide content as it benefits the company.

I do believe there are people who freely choose to contribute with no strings attached, and I guess we'll learn in the coming years if people will contribute their time and effort for benevolent reasons.

intended 5 days ago [-]
This assumes that everything is a single move game; that people will not adapt to the new GenAI internet, nor the new behaviors of corporations.

Heck - how many people will go to stack overflow when they can get pseudo good answers from GenAI in the first place?

And stack overflow is filled with bots, and not humans?

Why would they contribute, when every action they take will simply mean OpenAI or someone will benefit, and some random bot will answer?

Signal vs Noise is what the internet is all about. Plastic was a godsend when it was invented. It’s a plague found at the bottom of the Mariana Trench today.

8n4vidtmkvmk 5 days ago [-]
Just FYI, I'm in the top 250 users on stack overflow and I think I've been contacted like 3 times in over 10 years. I'm not exactly getting a lot of opportunities from it, not that I've advertised as looking either.
fennecs 5 days ago [-]
Not sure that’s quite equivalent;

Megacorp scraping the entire internet vs individuals sharing knowledge on a platform are a completely different scale.

deathanatos 5 days ago [-]
> I was once a top 1% stack overflow user.

So am I.

> I didn't start contributing to make money, just to share information and learn from each other. that can and should also be good enough for the AI era.

Sure; my content was contributed under CC-BY-SA, and if AI honors the rather simple terms of that license, then it's also good enough for the AI era, just as I had the same expectations of human consumers.

formerlurker 5 days ago [-]
I wonder what would happen if we purposely started polluting the training data? It may be too late for older technology, but technology is always changing. If this is a way for me to protect my job and increase my value, I'd actually consider doing it.
AtlasBarfed 5 days ago [-]
The billionaires thank you for your generosity
Yehoshaphat 5 days ago [-]
That quote reads like a land acknowledgement.
mvdtnz 5 days ago [-]
It certainly has the same insincere vibe about it.
swyx 5 days ago [-]
performative wokeness. the main factor is whether you are using it or not. let's not claim OP is a better person because they said grace before sinning like the rest of us
Abishek_Muthian 5 days ago [-]
I agree, I'm not claiming to be a better person for giving credit where it's due; It's just my habit.

I stayed away from using LLMs for a long time but then came a point when it became a disadvantage to not use one; especially as a person belonging to a disadvantaged section of society, I'm forced to use any & all forms of technology which help me & others like me to have some equity.

upghost 5 days ago [-]
Hey Abishek I really did not mean to derail the comments with my offhand observation. I actually really liked the attribution and it did not strike me as a (as I've come to learn the term) "land acknowledgement". It came across as sincere.

Besides I don't want a land acknowledgement, I want a statue!!

Abishek_Muthian 5 days ago [-]
Not at all. I really like the constructive criticism on HN; maybe the way I wrote it came off as insincere to some, but I credit even the memes I share on social media.
WillAdams 5 days ago [-]
Why can't we train a model only on public domain materials and, for anything under copyright, only on materials where the rights-owners have granted permission?
slyall 5 days ago [-]
Because copyright lasts for 75+ years, so there is relatively little that is public domain.

Why should training be subject to copyright? (And at what stages of the process?) People learn most of what they know from copyrighted media; giving copyright owners more control over that might be a bad idea.

AI companies have recently been paying content owners for access (partially to reduce legal risk, partially to get access to material not publicly available), but then giving deep-pocketed incumbents a monopoly might be a bad idea.

WillAdams 5 days ago [-]
Because training intrinsically involves making a copy.

Perhaps requiring that deep-pocket companies actually compensate copyright holders would be a starting point for a fairer system?

ticulatedspline 5 days ago [-]
Mostly I see two outcomes. Either the holders "lose" and it basically follows "human" rules meaning you can train a model on any (legally obtained) material but there's restrictions on use and regurgitation. I say "human" rules because that's basically how people work. Any artist or writer worth their salt has been trained on gobs of copyrighted material, and you can totally hire people to use their knowledge to violate copyright now.

the other option is the holders "win" and these models must only be trained on owned material, in which case the market will collapse into a handful of models controlled by companies that already own huge swaths of intellectual property. basically think DisneyDiffusion or RandomHouse-LLM. Nobody is getting paid more but it's all above board since it's been trained on all the data they have rights to. You might see some holders benefit if they have a particularly large and useful dataset, like Reddit or the Wall Street Journal.

intended 5 days ago [-]
Both no?

People with power and money can get paid. Artists who have no reach and recognition get exploited. Especially those from countries which aren't in North America and Europe.

bruce511 5 days ago [-]
Should we extend that model to what university professors can teach?

Should we extend that model to text books? If I learn about a topic from a book, can I never write a book of my own on that topic?

Should we extend that model to the web? If I learned CSS and JavaScript reading StackOverflow, am I banned from writing a book, giving classes, or indeed even answering questions on those topics?

I ask this in seriousness. I get that LLM training is new, and it's causing concerns. But those concerns have existed forever - the dissemination of information has been going on a long time.

I'm sure the same moral panic existed the first time someone started making marks in clay to describe the best time of year to plant the crops.

WillAdams 5 days ago [-]
No, because none of those things involve creating a copy on a computer which will then be regurgitated w/o acknowledgement of what went before, and w/o any sort of compensation to the previous rights bearer.

Time was if a person read multiple books to write a new text, they either purchased them, or borrowed them from a library which had purchased them, and then acknowledged them in a footnote or reference section.

At least one author noted that there was a concern that writing would lead to a diminishment of human memory/loss of oral tradition (Louis L'Amour in _The Walking Drum_).

wittjeff 5 days ago [-]
> At least one author noted that there was a concern that writing would lead to a diminishment of human memory/loss of oral tradition (Louis L'Amour in _The Walking Drum_).

Really? I can't tell if you're joking, so I'll take it at face value.

See, I associate the earliest famous (I thought) expression of that concern with Plato, and before today I couldn't remember any other associated details enough to articulate them with confidence. ChatGPT tells me, using the above quote without the citation as a prompt, that it was in Plato in his dialogue Phaedrus, and offers additional succinct contextual information and a better quote from that work. I probably first learned to associate that complaint about writing with Plato in college, and probably got it from C.D.C. Reeve, who was a philosophy professor and expert on Plato at the college I attended. But I feel no need to cite any of Reeve's works when dropping that vague reference. If I were to use any of Reeve's original thoughts related to analysis of Plato, then a reference would be merited.

It seems to me that there are different layers of abstraction of knowledge and memory, and LLMs mostly capture and very effectively synthesize knowledge at layers of abstraction that are above that of grammar checkers and below that of plagiarism in most cases. It's true that it is the nature of many of today's biggest transformers that they do in some cases produce output that qualifies as plagiarism by conventional standards. Every instance of that plagiarism is problematic, and should be a primary focus of innovation going forward. But in this conversation no one seems to acknowledge that the bar has been moved. The machine looked upon the library, and produced some output, therefore we should assume it is all theft? I am not persuaded.

weitendorf 5 days ago [-]
I don't know if I would ever be able to get over the absurdity of a Statement of Country for internet posters
upghost 5 days ago [-]
Okay maybe maybe but hear me out -- it would be hilarious if this started happening often enough that the LLMs started echoing it XD. That would certainly cause some legal headaches.
Modified3019 5 days ago [-]
Huh. Well today I learned that was a thing: https://en.m.wikipedia.org/wiki/Land_acknowledgement
nameless_me 5 days ago [-]
Land acknowledgements are common in Canada especially in provinces that did not sign treaties with indigenous people prior to taking their land.
model-15-DAV 5 days ago [-]
This seems as useless as land-acknowledgements; "Hey look we took your stuff and are not paying you for it, and we are still profiting from it!!"
8note 5 days ago [-]
both this and land acknowledgements are making sure people are keeping those people in mind. that makes it harder to keep justifying screwing over those people.
mvdtnz 5 days ago [-]
I've seen lots of land acknowledgements from people who own land. Never once seen one put their money where their mouth is.
chb 5 days ago [-]
I’m surprised to see no mention of AnythingLLM (https://github.com/Mintplex-Labs/anything-llm). I use it with an Anthropic API key, but am giving thought to extending it with local LLM. It’s a great app: good file management for RAG, agents with web search, cross platform desktop client, but can also easily be run as a server using docker compose.

Nb: if you’re still paying $20/mo for a feature-poor chat experience that’s locked to a single provider, you should consider using any of the many wonderful chat clients that take a variety of API keys instead. You might find that your LLM utilization doesn’t quite fit a flat rate model, and that the feature set of the third-party client is comparable (or surpasses) that of the LLM provider’s.

edit: included repo link; note on API keys as alternative to subscription
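
For readers weighing the subscription question: a minimal sketch of what pay-per-use API access looks like, using the openai Python package against an OpenAI-compatible endpoint. The base_url, model name, and environment variable are placeholders, not a specific provider's values; Anthropic's own SDK is similar in spirit.

    # Minimal pay-per-use chat call against an OpenAI-compatible endpoint.
    # base_url and model are placeholders; point them at whichever provider
    # (or local server) your chat client supports.
    import os
    from openai import OpenAI

    client = OpenAI(
        api_key=os.environ["PROVIDER_API_KEY"],   # hypothetical env var name
        base_url="https://api.example.com/v1",    # placeholder endpoint
    )

    resp = client.chat.completions.create(
        model="some-model-name",                  # placeholder model id
        messages=[{"role": "user", "content": "Summarize this repo's README."}],
    )
    print(resp.choices[0].message.content)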

doodlesdev 5 days ago [-]
How does it compare to OpenWebUI? It seems pretty solid, no idea how I never heard of it before!
alfredgg 5 days ago [-]
Totally agree! Other useful alternative: Lobe Chat (https://github.com/lobehub/lobe-chat)
stogot 5 days ago [-]
Haven’t heard of this. What does it provide that the Anthropic UI does not?
chown 5 days ago [-]
If anyone is looking for a one click solution without having to have a Docker running, try Msty - something that I have been working on for almost a year. Has RAG and Web Search built in among others and can connect to your Obsidian vaults as well.

https://msty.app

amrrs 5 days ago [-]
How is it different from LM Studio or GPT4All?
ein0p 5 days ago [-]
Interesting that you show Mistral Large in your screenshots. That model is only licensed for "research" use. While it's not terribly clear how "research" is defined, selling a product with it featured is clearly not "research".
navbaker 5 days ago [-]
They’re not selling Mistral, they’re selling a product that lets you use Mistral.
jswny 5 days ago [-]
I would try this but I want to be able to use it on mobile too which requires hosting. I don’t see how valuable an LLM interface can be if I can only use it on desktop
rgovostes 5 days ago [-]
Your top menu has an "As" button, listing how you compare to alternative products. Is that the label you wanted?
ukuina 5 days ago [-]
Might be better renamed to "Compare"
chown 5 days ago [-]
I wanted to present it as - “Msty As x Alternative”. But I think maybe I like Compare better. I will think about changing it. Thank you for the suggestion.
remir 5 days ago [-]
Woah! Dude, I'm blown away by the app! The UX is wonderful! Just what I was looking for!
rspoerri 5 days ago [-]
I run a pretty similar setup on an m2-max - 96gb.

For AI image generation specifically, I would rather recommend Krita with the https://github.com/Acly/krita-ai-diffusion plugin.

donquichotte 5 days ago [-]
Are you using an external provider for image generation or running something locally?
throwaway314155 5 days ago [-]
Open WebUI sure does pull in a lot of dependencies... Do I really need all of langchain, pytorch, and plenty others for what is advertised as a _frontend_?

Does anyone know of a lighter/minimalist version?

lolinder 5 days ago [-]
Some of the features (RAG retrieval) now use embeddings that are calculated in Open WebUI rather than in Ollama or another backend. It does seem like it'd be nice for them to refactor to make things like that optional for those who want a simpler interface, but then again, there are plenty of other lighter-weight options.
noman-land 5 days ago [-]
throwaway314155 5 days ago [-]
I love what llamafile is doing, but I'm primarily interested in a frontend for ollama, as I prefer their method of model/weights distribution. Unless I'm wrong, llamafile serves as both the frontend and backend.
d3VwsX 4 days ago [-]
If I understand the distinction correctly, I run llamafile as a backend. I start it with the filename of a model on the command-line (might need a -M flag or something) and it will start up a chat-prompt for interaction in the terminal but also opens a port that speaks some protocol that I can connect to using a frontend (in my case usually gptel in emacs).
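To make the backend/frontend split concrete: a minimal sketch of a client talking to a llamafile that was started as a server. The port and the OpenAI-compatible route are the usual llama.cpp-server defaults, but they are assumptions worth verifying against your llamafile version.

    # Talking to a llamafile run as a backend server (the model filename and
    # start-up flags vary by version). Assumes the default port 8080 and the
    # OpenAI-compatible /v1/chat/completions route of llama.cpp-based servers.
    import requests

    resp = requests.post(
        "http://localhost:8080/v1/chat/completions",
        json={
            "model": "local",  # most local servers ignore this field
            "messages": [{"role": "user", "content": "Hello from a frontend"}],
        },
        timeout=120,
    )
    print(resp.json()["choices"][0]["message"]["content"])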
DrBenCarson 5 days ago [-]
LM Studio?
vunderba 5 days ago [-]
Given how much data (personal and otherwise) that you're likely going to feeding into it, I would HIGHLY recommend using an open-source chat interface like Jan.

It's cross platform, works with most LLMs, and has extension support.

https://github.com/janhq/jan

Another alternative is LibreChat, it's more fully featured but its heavier weight and spins up quite a few docker instances.

https://github.com/danny-avila/LibreChat

If you're going through the trouble of running models locally, it doesn't make sense to couple that with a proprietary closed source interface (like LM Studio).

synapsomorphy 5 days ago [-]
Did you misread? No one mentioned LM studio, open-webui (frontend) is open source and usually uses ollama (backend) which is also open source.

Jan and Librechat are both less popular than either open-webui or ollama, how do their features compare?

throwaway314155 5 days ago [-]
No one had suggested LM Studio at this point in the thread.
foldl2022 5 days ago [-]
Off the topic: I am also in favor of minimalism.

In the case of AlphaGeometry, I made AlphaGeometryRE to get rid of tensorflow/jax/flax, and 100+ Python packages.

xyc 4 days ago [-]
ein0p 5 days ago [-]
Do you really care? Open WebUI has excellent UI and features, and its complexity is well hidden, IMO.
throwaway314155 5 days ago [-]
I agree it's a fine interface, clearly a lot of work has gone into making it polished and feature-rich. I'm simply seeking alternatives. If there aren't any with the same feature set, I'll probably just use Open WebUI.
figmert 5 days ago [-]
I ran it temporarily in Docker the other day. It was an 8GB image. I'm unsure why a web UI is 8GB.
halyconWays 6 days ago [-]
Super basic intro but perhaps useful. Doesn't mention quant sizes, which is important when you're GPU poor. Lots of other client-side things you can do too, like KoboldAI, TavernAI, Jan, LangFuse for observability, CogVLM2 for a vision model.

One of the best places to get the latest info on what people are doing with local models is /lmg/ on 4chan's /g/

ashleyn 5 days ago [-]
anyone got a guide on setting up and running the business-class stuff (70B models over multiple A100, etc)? i'd be willing to spend the money but only if i could get a good guide on how to set everything up, what hardware goes with what motherboard/ram/cpu, etc.
nickpsecurity 5 days ago [-]
People shared them regularly on r/LocalLlama:

https://www.reddit.com/r/LocalLLaMA/

Search for terms like hardware build, running large models, multiple GPUs, etc. Many people there have multiple consumer GPUs. There’s probably something about running multiple A100s.

HuggingFace might have tutorials, too.

Warning: If it’s A100s, most people say to just rent them from the cloud as needed cuz they’re very costly upfront. If they’re normally idle, then it’s not as cost effective to own them. Some were using services like vast.ai to get rentals cheaper.

bubaumba 5 days ago [-]
it may be even cheaper to use big models through an API. Claude, GPT, whatever. Rental is efficient only for big batches, while API is priced per call/size and is cheap for small models.
talldayo 5 days ago [-]
I don't think you're going to find much of a guide out there because there isn't really a need for one. You just need a Linux client with the Nvidia drivers installed and some form of CUDA runtime present. You could make that happen in a mini PC, a jailbroken Nintendo Switch, a gaming laptop or a 3600W 4U rackmount. The "happy path" is complicated because you truly have so many functional options.

You don't want an A100 unless you've already got datacenter provisioning at your house and an empty 1U rack. I genuinely cannot stress this enough - these are datacenter cards for a reason. The best bang-for-your buck will be consumer-grade cards like the 3060 and 3090, as well as the bigger devkits like the Jetson Orin.

Aurornis 5 days ago [-]
> anyone got a guide on setting up and running the business-class stuff (70B models over multiple A100, etc)?

If you need a guide, the A100 class cards are certainly not what you want. Running data center level clusters in the home is no walk in the park. You wouldn’t even be able to power a decent sized cluster without getting an electrician involved for proper power. Networking is expensive. Even second-hand it all adds up fast. Don’t underestimate how much it would cost.

Getting some 3090s in a case with a big gaming PSU is the way to go for home use. If you really want A100 level, rent it from a cloud provider. Trying to build your own cluster would be like lighting money on fire for a rapidly depreciating asset.

jwrallie 5 days ago [-]
Question: what can you do with the A100s that you cannot do with multiple 3090s?

Or rephrasing for somebody that never tried to experiment with models: which one do you need to have something like ChatGPT-4o (text only) at home?

foundry27 5 days ago [-]
Aye, there’s the kicker. The correct configuration of hardware resources to run and multiplex large models is just as much of a trade secret as model weights themselves when it comes to non-hobbyist usage, and I wouldn’t be surprised if optimal setups are in many ways deliberately obfuscated or hidden to keep a competitive advantage

Edit: outside the HPC community specifically, I mean

codybontecou 5 days ago [-]
The economic barrier to entry probably has a lot to do with it. I'd happily dig into this problem and share my findings but it's simply too expensive for a hobbyist that isn't specialized in it.
Salgat 6 days ago [-]
There is a lot I want to do with LLMs locally, but it seems like we're still not quite there hardware-wise (well, within reasonable cost). For example, Llama's smaller models take upwards of 20 seconds to generate a brief response on a 4090; at that point I'd rather just use an API to a service that can generate it in a couple seconds.
segmondy 5 days ago [-]
This is very wrong. Smaller models generate responses almost instantly on a 4090. I run 3090s and easily get 30-50 tokens/second with small models. Folks with a 4090 will easily see 80 tokens/sec for a 7-8B model in Q8, and probably 120-160 for 3B models. Faster than most public APIs.
hedgehog 5 days ago [-]
8B models are pretty fast even on something like a 3060 depending on deployment method (for example Q4 on Ollama).
r-w 5 days ago [-]
They're fast enough for me on CPU even.
deskamess 5 days ago [-]
What CPU and memory specs do you have?
imiric 5 days ago [-]
Sure, except those smaller models are only useful for some novelty and creative tasks. Give them a programming or logic problem and they fall flat on their face.

As I mentioned in a comment upthread, I find ~30B models to be the minimum for getting somewhat reliable output. Though even ~70B models pale in comparison with popular cloud LLMs. Local LLMs just can't compete with the quality of cloud services, so they're not worth using for most professional tasks.

kolbe 5 days ago [-]
They might be talking about 70B
nobodyandproud 5 days ago [-]
Wait, is 70b considered on the smaller side these days?
kolbe 5 days ago [-]
Relative to deepseek's latest, yeah
do_not_redeem 6 days ago [-]
You should absolutely be getting faster responses with a 4090. But that points to another advantage of cloud services—you don't have to debug your own driver issues.
imiric 5 days ago [-]
To me it's not even about performance (speed). It's just that the quality gap between cloud LLM services and local LLMs is still quite large, and seems to be increasing. Local LLMs have gotten better in the past year, but cloud LLMs have even more so. This is partly because large companies can afford to keep throwing more compute at the problem, while quality at smaller scale deployments is not increasing at the same pace.

I have a couple of 3090s and have tested most of the popular local LLMs (Llama3, DeepSeek, Qwen, etc.) at the highest possible settings I can run them comfortably (~30B@q8, or ~70B@q4), and they can't keep up with something like Claude 3.5 Sonnet. So I find myself just using Sonnet most of the time, instead of fighting with hallucinated output. Sonnet still hallucinates and gets things wrong a lot, but not as often as local LLMs do.

Maybe if I had more hardware I could run larger models at higher quants, but frankly, I'm not sure it would make a difference. At the end of the day, I want these tools to be as helpful as possible and not waste my time, and local LLMs are just not there yet.

jhatemyjob 4 days ago [-]
This is the right way of looking at it. Unfortunately (or fortunately for us?) most people don't realize how precious their time is.
zh3 5 days ago [-]
Depends on the model, if it doesn't fit into VRAM performance will suffer. Response here is immediate (at ~15 tokens/sec) on a pair of ebay RTX 3090s in an ancient i3770 box.

Even if your model does fit into VRAM, if it's getting ejected there will be a startup pause. Try setting OLLAMA_KEEP_ALIVE to 1 (see https://github.com/ollama/ollama/blob/main/docs/faq.md#how-d...).
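
If you'd rather not set the environment variable on the server, the same knob exists per request; a minimal sketch against Ollama's REST API (the model tag is just an example):

    # Keep a model resident between requests by passing keep_alive per call,
    # per the Ollama FAQ; -1 means "keep loaded indefinitely".
    import requests

    resp = requests.post(
        "http://localhost:11434/api/generate",
        json={
            "model": "llama3.1:8b",        # example model tag
            "prompt": "Why is the sky blue?",
            "stream": False,
            "keep_alive": -1,              # keep the weights in (V)RAM
        },
        timeout=300,
    )
    print(resp.json()["response"])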

e12e 5 days ago [-]
>> within reasonable cost

> pair of ebay RTX 3090s

So... 1700 USD?

zh3 5 days ago [-]
£1200 UKP, so a little less. Targeted at having 48GB (2x24GB) VRAM for running the larger models; having said that, a single 12GB RTX 3060 in another box seems pretty close in local testing (with smaller models).
drillsteps5 5 days ago [-]
If you're looking for most bang for the buck 2x3060(12Gb) might be the best bet. GPUs will be around $400-$600.
drillsteps5 5 days ago [-]
Have been trying forever to find a coherent guide on building dual-GPU box for this purpose, do you know of any? Like selecting the MB, the case, cooling, power supply and cables, any special voodoo required to pair the GPUs etc.
zh3 5 days ago [-]
I'm not aware of any particular guides, the setup here was straightforward - an old motherboard with two PCIe x16 slots (Asus P8Z77V or P8Z77WS), a big enough power supply (Seasonic 850W) and the stock Linux Nvidia drivers. The RTX 3090s are basic Dell models (i.e. not OC'ed gamer versions), and worth noting they only get hot if used continuously - if you're the only one using them, the fans spin up during a query and back down between. A good 'smoke test' is something like: while true; do ollama run llama3.3 "Explain cosmology"; done

With llama3.3 70B, two RTX3090s gives you 48GB of VRAM and the model uses about 44Gb; so the first start is slow (loading the model into VRAM) but after that response is fast (subject to comment above about KEEP_ALIVE).

mtkd 5 days ago [-]
The consumer hardware channel just hasn't caught up yet -- we'll see a lot more desktop kit appear in 2025 on retail sites (there is a small site in UK selling Nvidia A100 workstations for £100K+ each on a Shopify store)

Seem to remember a similar point in late 90s and having to build boxes to run NT/SQL7.0 for local dev

Expect there will be a swing back to on-prem once enterprise starts moving faster and the legal teams begin to understand what is happening data-side with RAG, agents etc.

fooker 5 days ago [-]
> Nvidia A100 workstations for £100K+

This seems extremely overpriced.

mtkd 5 days ago [-]
I've not looked properly but I think it's 4 x 80gb A100s
ojbyrne 5 days ago [-]
5 days ago [-]
o11c 5 days ago [-]
Even on CPU, you should get the start of a response within 5 seconds for Q4 8B-or-smaller Llama models (proportionally faster for smaller ones), which then stream at several tokens per second.

There are a lot of things to criticize about LLMs (the answer is quite likely to ignore what you're actually asking, for example) but your speed problem looks like a config issue instead. Are you calling the API in streaming mode?
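
On the streaming point: a minimal sketch of consuming a streaming response so the first tokens appear within seconds, assuming an Ollama backend (the model tag is an example; /api/generate streams newline-delimited JSON by default):

    # Stream tokens from a local Ollama server as they are generated,
    # instead of waiting for the full completion.
    import json
    import requests

    with requests.post(
        "http://localhost:11434/api/generate",
        json={"model": "llama3.1:8b", "prompt": "Explain NUMA briefly."},
        stream=True,
        timeout=300,
    ) as resp:
        for line in resp.iter_lines():
            if not line:
                continue
            chunk = json.loads(line)
            print(chunk.get("response", ""), end="", flush=True)
            if chunk.get("done"):
                break
    print()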

pickettd 6 days ago [-]
My gut feeling is that there may be optimization you can do for faster performance (but I could be wrong since I don't know your setup or requirements). In general on a 4090 running between Q6-Q8 quants my tokens/sec have been similar to what I see on cloud providers (for open/local models). The fastest local configuration I've tested is Exllama/TabbyAPI with speculative-decoding (and quantized cache to be able to fit more context)
nobodyandproud 5 days ago [-]
You may have been running with CPU inference, or running models that don’t fit your VRAM.

I was running a 5 bit quantized model of codestral 22b with a Radeon RX 7900 (20 gb), compiled with Vulkan only.

Eyeball only, but the prompt responses were maybe 2x or 3x slower than OpenAI's gpt-4o (maybe 2-4 seconds for most paragraph-long responses).

jhatemyjob 5 days ago [-]
yeah many people don't understand how cheap it is to use the chatgpt API

not to mention all of the other benefits of delegating all the work of setting up the GPUs, public HTTP server, designing the API, security, keeping the model up-to-date with the state of the art, etc

reminds me of the people in the 2000s / early 2010s who would build their own linux boxes back when the platform was super unstable, constantly fighting driver issues etc instead of just getting a mac.

roll-your-own-LLM makes even less sense. at least for those early 2000s linux guys, even if you spent an ungodly amount of time going through the arch wiki or compiling gentoo or whatever, at least those skills are somewhat transferable to sysadmin/SRE. I don't see how setting up your own instance of ollama has any transferable skills

the only way i could see it making sense is if you're doing super cutting edge stuff that necessitates owning a tinybox, or if you're trying to get a job at openAI or anysphere

eikenberry 5 days ago [-]
Depends on what you value. Many people value keeping general purpose computing free/libre and available to as many as possible. This means using free systems and helping those systems mature.
jhatemyjob 5 days ago [-]
If you're in a position like mitchellh or antirez then more power to you
grahamj 5 days ago [-]
I would agree with that if I didn't mind handing over my prompts/data to big tech companies.

But I do.

jhatemyjob 4 days ago [-]
There are similar superstitions around proprietary software. I hope one day you are able to overcome them before you waste too much of your precious time.
grahamj 4 days ago [-]
In what way is wanting to keep my prompts and data private superstition?
pvo50555 5 days ago [-]
There was a post a few weeks back (or a reply to a post) showing an app entirely made using an LLM. It was like a 3D globe made with three.js, and I believe the poster had created it locally on his M4 MacBook with 96 GB RAM? I can't recall which model it was or what else the app did, but maybe someone knows what I'm talking about?
jckahn 5 days ago [-]
That was with Qwen 32B, which is still the best coder model for the size. You could run that just fine with a 36 GB RAM Mac.
pvo50555 4 days ago [-]
Is it actually able to generate all the files and directory structure, etc? Or did the author just take all the responses he got from multi-shot prompting and eventually assemble it into the final product? I believe I used LlamaCoder some time ago to build the scaffolding of an app, but I'm not sure about the state of that project now.
jckahn 1 days ago [-]
I imagine _a lot_ of iteration was involved regardless of the model.
dividefuel 5 days ago [-]
What GPU offers a good balance between cost and performance for running LLMs locally? I'd like to do more experimenting, and am due for a GPU upgrade from my 1080 anyway, but would like to spend less than $1600...
adam_arthur 5 days ago [-]
Inferencing does not require Nvidia GPUs at all, and it's almost criminal to be recommending dedicated GPUs with only 12GB of RAM.

Buy a MacMini or MacbookPro with RAM maxed out.

I just bought an M4 mac mini for exactly this use case that has 64GB for ~2k. You can get 128GB on the MBP for ~5k. These will run much larger (and more useful) models.

EDIT: Since the request was for < $1600, you can still get a 32GB mac mini for $1200 or 24GB for $800

talldayo 5 days ago [-]
> its almost criminal to be recommending dedicated GPUs with only 12GB of RAM.

If you already own a PC, it makes a hell of a lot more sense to spend $900 on a 3090 than it does to spec out a Mac Mini with 24gb of RAM. Plus, the Nvidia setup can scale to as many GPUs as you own which gives you options for upgrading that Apple wouldn't be caught dead offering.

Oh, and native Linux support that doesn't suck balls is a plus. I haven't benchmarked a Mac since the M2 generation, but the figures I can find put the M4 Max's compute somewhere near the desktop 3060 Ti: https://browser.geekbench.com/opencl-benchmarks

adam_arthur 5 days ago [-]
A Mac Mini with 24GB is ~$800 at the cheapest configuration. I can respect wanting to do a single part upgrade, but if you're using these LLMs for serious work, the price/perf for inferencing is far in favor of using Macs at the moment.

You can easily use the MacMini as a hub for running the LLM while you do work on your main computer (and it won't eat up your system resources or turn your primary computer into a heater)

I hope that more non-mac PCs come out optimized for high RAM SoC, I'm personally not a huge Apple fan but use them begrudgingly.

Also your $900 quote is a used/refurbished GPU. I've had plenty of GPUs burn out on me in the old days, not sure how it is nowadays, but that's a lot to pay for a used part IMO

fragmede 5 days ago [-]
if you're doing serious work, performance is more important than getting a good price/perf ratio, and a pair of 3090s is gonna be faster. It depends on your budget, though, as that configuration is a bit more expensive.
adam_arthur 5 days ago [-]
Whether performance or cost is more important depends on your use case. Some tasks that an LLM can do very well may not need to be done often, or even particularly quickly (as in my case).

e.g. LLM as one step of an ETL-style pipeline

Latency of the response really only matters if that response is user facing and is being actively awaited by the user
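
As an illustration of that kind of non-interactive pipeline step: a rough sketch that batch-extracts structured fields from free-form documents via a local Ollama endpoint. The model tag, paths, prompt, and field names are all made up for illustration, not taken from the commenter's actual pipeline.

    # Batch ETL step: pull structured fields out of loosely organized text
    # documents with a local model. Nothing here is user-facing, so latency
    # per document barely matters.
    import json
    import pathlib
    import requests

    PROMPT = ("Extract the vendor, date, and total amount from the document "
              "below. Reply with JSON only, using keys vendor, date, total.\n\n")

    rows = []
    for doc in pathlib.Path("documents").glob("*.txt"):
        resp = requests.post(
            "http://localhost:11434/api/generate",
            json={"model": "llama3.1:8b", "prompt": PROMPT + doc.read_text(),
                  "stream": False, "format": "json"},
            timeout=600,
        )
        rows.append({"file": doc.name, **json.loads(resp.json()["response"])})

    print(json.dumps(rows, indent=2))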

seanmcdirmid 5 days ago [-]
> M4 Max's compute somewhere near the desktop 3060 Ti

The only advantage is the M4 Max's ability to have way more VRAM than a 3060 Ti. You won't find many M4 Maxes with just 8 or 16 GB of RAM, and I don't think you can do much except use really small models with a 3060 Ti.

talldayo 5 days ago [-]
It's a bit of a moot point when CUDA will run 4 3060Tis in parallel, with further options for paging out to system memory. Since most models (particularly bigger/MOE ones) are sparsely decoded, you can get quite a lot of mileage out of multiple PCIe slots fed with enough bandwidth.

There's no doubt in my mind that the PC is the better performer if raw power is your concern. It's far-and-away the better value if you don't need to buy new hardware and only need a GPU. $2,000 of Nvidia GPUs will buy you halfway to an enterprise cluster, $2,000 of Apple hardware will get you a laptop chip with HBM.

seanmcdirmid 5 days ago [-]
You need a lot of space for that, cooling, and a good fuse that won't trip when you turn it on. I would totally just pay the money for an M4 Ultra MacStudio with 128 GB of RAM (or an M4 Max with 64 GB). It is a much cleaner setup, especially if you aren't interested in image generation (which the Macs are not good at yet).

If I could spend $4k on a non-Apple turn key solution that I could reasonably manage in my house, I would totally consider it.

talldayo 5 days ago [-]
Well, that's your call. If you're the sort of person that's willing to spend $2,000 on a M4 Ultra (which doesn't quite exist yet but we can pretend it does), then I honest to god do not understand why you'd refuse to spend that same money on a Jetson Orin with the same amount of memory in a smaller footprint with better performance and lower power consumption.

Unless you're specifically speccing out a computer for mobile use, the price premium you spend on a Mac isn't for better software or faster hardware. If you can tolerate Linux or Windows, I don't see why you'd even consider Mac hardware for your desktop. In the OP's position, suggesting Apple hardware literally makes no sense. They're not asking for the best hardware that runs MacOS, they're asking for the best hardware for AI.

> If I could spend $4k on a non-Apple turn key solution that I could reasonably manage in my house, I would totally consider it.

You can't pay Apple $4k for a turnkey solution, either. MacOS is borderline useless for headless inference; Vulkan compute and OpenCL are both MIA, package managers break on regular system updates and don't support rollback, LTS support barely exists, most coreutils are outdated and unmaintained, Asahi features things that MacOS doesn't support and vice-versa... you can't fool me into thinking that's a "turn key solution" any day of the week. If your car requires you to pick a package manager after you turn the engine over, then I really feel sorry for you. The state of MacOS for AI inference is truly no better than what Microsoft did with DirectML. By some accounts it's quite a bit worse.

seanmcdirmid 5 days ago [-]
M4 Ultra with enough RAM will cost more than $2000. An M2 Ultra Mac Studio with 64GB is $3999, and you probably want more RAM than that to run the bigger models that the Ultra can handle (it is basically 2X as powerful as the Max with more memory bandwidth). An M2 Max with 64GB of RAM, which is more reasonable, will run you $2,499. I have no idea if those prices will hold when the M4 Mac Studios finally come out (an M4 Max MBP with 64 GB of RAM starts at $3900 ATM).

> You can't pay Apple $4k for a turnkey solution, either.

I've seen/read plenty of success stories of Metal ports of models being used via LM Studio without much configuration/setup/hardware scavenging, so we can just disagree there.

europeanplug09 5 days ago [-]
>You need a lot of space for that, cooling, and a good fuse

Or live in Europe, where any wall socket can give you closer to 3kW. For crazier setups like charging your EV you can have three-phase plugs with ~22kW to play with. 1 m² of floor space isn't that substantial either, unless you already live in a closet in the middle of the most crowded city.

dotancohen 4 days ago [-]
3 phase 240v at 16amps is just about 11kW. You're not going to find anything above that residential unless it was purpose-built.

That's still a lot of power, though, and does not invalidate your point.

natch 5 days ago [-]
Reasonable? $7,000 for a laptop is pretty up there.

[Edit: OK I see I am adding cost when checking due to choosing a larger SSD drive, so $5,000 is more of a fair bottom price, with 1TB of storage.]

Responding specifically to this very specific claim: "Can get 128GB of ram for a reasonable price."

I'm open to your explanation of how this is reasonable — I mean, you didn't say cheap, to be fair. Maybe 128GB of ram on GPUs would be way more (that's like 6 x 4090s), is what you're saying.

For anyone who wants to reply with other amounts of memory, that's not what I'm talking about here.

But on another point, do you think the ram really buys you the equivalent of GPU memory? Is Apple's melding of CPU/GPU really that good?

I'm not just coming from a point of skepticism, I'm actually kind of hoping to be convinced you're right, so wanting to hear the argument in more detail.

adam_arthur 5 days ago [-]
It's reasonable in a "working professional who gets substantial value from" or "building an LLM driven startup project" kind of way.

It's not for the casual user, but for somebody who derives significant value from running it locally.

Personally I use the MacMini as a hub for a project I'm working on as it gives me full control and is simply much cheaper operationally. A one time ~$2000 cost isn't so bad for replacing tasks that a human would have to do. e.g. In my case I'm parsing loosely organized financial documents where structured data isn't available.

I suspect the hardware costs will continue to decline rapidly as they have in the past though, so that $5k for 128GB will likely be $5k for 256GB in a year or two, and so on.

We're almost at the inflection point where really powerful models are able to be inferenced locally for cheap

seanmcdirmid 5 days ago [-]
For a coding setup, should I go with a Mac Mini M4 pro with 64GB of RAM? Or is it better to go with a M4 max (only available for the MBP right now, maybe in the Studio in a few months)? I'm not really interested in the 4090/3090 approach, but it is hard to make a decision on Apple hardware ATM.

I don't see prices falling much in the near term, a Mac Studio M2 Max or Ultra has been keeping its value surprisingly well as of late (mainly because of AI?). Just like 3090s/4090s are holding their value really well also.

ein0p 5 days ago [-]
It's reasonable when the alternative is 2-4x4090 at $2.2K each (or 2xA6000 at 4.5K each) + server grade hardware to host them. Realistically, the vast majority of people should just buy a subscription or API access if they need to run grotesquely large models. While large LLMs (up to about 200B params) work on an MBP, they aren't super fast, and you do have to be plugged in - they chew through your battery like it's nothing. I know this because I have a 128GB M3 MBP.
natch 4 days ago [-]
How large of a model can you use with your 128GB M3? Anything you can tell would be great to hear. Number of parameters, quantization, which model, etc.
Abishek_Muthian 5 days ago [-]
OP here, I almost got a decked-out Mac Studio before I returned it for an Asus ROG, as the native Linux support, upgradability & CUDA support are much more important to me.

Meagre VRAM in these Nvidia consumer GPUs is indeed painful, but with the increasing performance of smaller LLMs & fine-tuned models, I don't think 12GB, 14GB, 16GB Nvidia GPUs offering much better performance than a Mac can be easily dismissed.

2-3-7-43-1807 5 days ago [-]
How about heat dissipation, where I assume an MBP is at a disadvantage compared to a PC?
Aurornis 5 days ago [-]
A MacBook Pro has lower peak thermal output but proportionally lower performance. For a given task you’d be dissipating somewhat similar heat, the MacBook Pro would just be spreading it out over a longer period of time.

nVidia GPUs actually have similar efficiency, despite all of Apple’s marketing. The difference is that the nVidia GPUs have a much higher ceiling.

elorant 5 days ago [-]
I consider the RTX 4060 Ti as the best entry level GPU for running small models. It has 16GBs of RAM which gives you plenty of space for running large context windows and Tensor Cores which are crucial for inference. For larger models probably multiple RTX 3090s since you can buy them on the cheap on the second hand market.

I don’t have experience with AMD cards so I can’t vouch for them.

fnqi8ckfek 5 days ago [-]
I know nothing about gpus. Should I be assuming that when people say "ram" in the context of gpus they always mean vram?
elorant 5 days ago [-]
Not always, because system RAM also has to be equally adequate, but mostly yes, it's about the total VRAM of the GPU(s).
layer8 5 days ago [-]
“GPU with xx RAM” means VRAM, yes.
kolbe 5 days ago [-]
If you want to wait until the 5090s come out, you should see a drop in the price of the 30xx and 40xx series. Right now, shopping used, you can get two 3090s or two 4080s in your price range. Conventional wisdom says two 3090s would be better, but this is all highly dependent on what models you want to run. Basically the first requirement is to have enough VRAM to host all of your model on it, and secondarily, the quality of the GPU.

Have a look through Hugging Face to see which models interest you. A rough estimate for the amount of VRAM you need is half the model size plus a couple gigs. So, if using the 70B models interests you, two 4080s wouldn't fit it, but two 3090s would. If you're just interested in the 1B, 3B and 7B models (llama 3B is fantastic), you really don't need much at all. A single 3060 can handle that, and those are not expensive.
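
That rule of thumb is easy to write down; a tiny sketch follows (it roughly corresponds to a ~4-bit quant, context/KV cache adds more on top, and the card configurations listed are only illustrative):

    # Rule-of-thumb VRAM estimate from the comment above: roughly half the
    # parameter count (in billions) as GB, plus a couple of GB of overhead.
    def vram_estimate_gb(params_billion: float, overhead_gb: float = 2.0) -> float:
        return params_billion / 2 + overhead_gb

    # Illustrative setups with their nominal VRAM totals.
    setups = {"1x 3060 (12GB)": 12, "2x 4080 (32GB)": 32, "2x 3090 (48GB)": 48}
    for params in (3, 8, 32, 70):
        need = vram_estimate_gb(params)
        fits = [name for name, vram in setups.items() if vram >= need]
        print(f"{params}B model: ~{need:.0f}GB -> fits: {fits or 'none of these'}")

Running it reproduces the claim above: a 70B model comes out around 37GB, which fits on two 3090s but not on two 4080s.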

thijson 5 days ago [-]
I've been running some of the larger models (like Llama 405B) on CPU on a Dell R820. It's got 32 Xeon cores (4 chips) and 256 GB of RAM; I bought it used for around $400. The memory is NUMA, so it performs best when the computation is done on node-local data; I'm not sure if Ollama supports that.

The tokens-per-second rate is very slow, but at least it runs.

I think the future will be increasingly powerful NPUs built into CPUs. That will need to be paired with higher-bandwidth memory, maybe HBM, or silicon photonics for off-chip memory.
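
On the NUMA point: recent llama.cpp builds expose a --numa option, and numactl can force interleaved allocation; a hedged sketch, where the binary name, model file and thread count are placeholders:

    # Interleave memory across NUMA nodes on a multi-socket box,
    # then run a CPU-only llama.cpp chat (names/paths are examples)
    numactl --interleave=all ./llama-cli \
      -m llama-3.1-405b-q4.gguf \
      -t 32 \
      -p "Hello"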

redacted 5 days ago [-]
Nvidia for compatibility, and as much VRAM as you can afford. Shouldn't be hard to find a 3090 / Ti in your price range. I have had decent success with a base 3080 but the 10GB really limits the models you can run
zitterbewegung 5 days ago [-]
Get a new or used 3090; it has 24GB of VRAM and it's below $1600.
christianqchung 5 days ago [-]
A lot of moderate power users are running an undervolted pair of used 3090s on a 1000-1200W PSU. 48 GB of VRAM lets you run 70B models at Q4 with 16k context.
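
Side note: on Linux the quick stand-in for undervolting is a power cap via nvidia-smi; the flags are standard, but the wattage is only an example, and true undervolting needs separate curve-editing tools:

    # Cap two 3090s well below their ~350W stock limit (example value)
    sudo nvidia-smi -pm 1           # persistence mode
    sudo nvidia-smi -i 0 -pl 280    # GPU 0
    sudo nvidia-smi -i 1 -pl 280    # GPU 1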

If you use speculative decoding (a small draft model generates tokens that the larger model verifies; I'm not sure on the specifics), you can get past 20 tokens per second, it seems. You can also fit 32B models like Qwen/Qwen Coder at Q6 with lots of context this way; with spec decoding, closer to 40+ tok/s.
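
For reference, a hedged sketch of speculative decoding with llama.cpp's example binary; flag names move around between versions, and the model files here are placeholders:

    # A small draft model proposes tokens; the big model verifies them in
    # one batched pass, so accepted tokens come out much faster
    ./llama-speculative \
      -m qwen2.5-coder-32b-q6.gguf \
      -md qwen2.5-coder-1.5b-q4.gguf \
      -ngl 99 -c 16384 \
      -p "Write a function that parses RFC 3339 timestamps."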

yk 5 days ago [-]
Second the other comment, as much vram as possible. A 3060 has 12 GB at a reasonable price point. (And is not too limiting.)
grobbyy 5 days ago [-]
There's a huge step up in capability at 16GB and 24GB, for not too much more. The 4060 has a 16GB version, for example. On the cheap end, the Intel Arc does too.

The next major step up is 48GB and then hundreds of GB. But a lot of ML models target 16-24GB since that's in the grad-student price range.

navbaker 5 days ago [-]
At the 48GB level, L40S are great cards and very cost effective. If you aren’t aiming for constant uptime on several >70B models at once, they’re for sure the way to go!
bubaumba 5 days ago [-]
> L40S are great cards and very cost effective

from https://www.asacomputers.com/nvidia-l40s-48gb-graphics-card....

nvidia l40s 48gb graphics card Our price: $7,569.10*

Not arguing against 'great', but the cost efficiency is questionable: for 10% of that you can get two used 3090s. The good thing about LLMs is that they run layer by layer, so they're easy to split: the model can be divided into several sub-models, one per GPU. Then 2, 3, 4... GPUs should improve throughput roughly proportionally on big batches, and make it possible to run bigger models on low-end hardware.
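
A hedged sketch of that kind of split with llama.cpp-style flags; the flag names follow recent builds, and the model file and split ratio are placeholders:

    # Spread a 70B Q4 model across two GPUs, layer-wise
    ./llama-cli -m llama-3-70b-q4.gguf \
      -ngl 99 \
      --split-mode layer \
      --tensor-split 1,1 \
      -p "Hello"

Worth noting that a layer-wise split mainly buys capacity: a single request still walks the layers in order, so per-stream speed doesn't scale linearly with GPU count even if batch throughput does.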

nickthegreek 5 days ago [-]
Dual 3090s are way cheaper than the l40s though. You can even buy a few backups.
navbaker 5 days ago [-]
Yeah, I’m specifically responding to the parent’s comment about the 48GB tier. When you’re looking in that range, it’s usually because you want to pack in as much vram as possible into your rack space, so consumer level cards are off the table. I definitely agree multiple 3090 is the way to go if you aren’t trying to host models for smaller scale enterprise use, which is where 48GB cards shine.
bloomingkales 5 days ago [-]
Honestly, if you just want to do inference, the 7600 XT and RX 6800 have 16GB at $300 and $400 on Amazon. It's gonna be my stopgap until whatever comes next. The RX 6800 has better memory bandwidth than the 4060 Ti (I think it matches the 4070).
kolbe 5 days ago [-]
AMD GPUs are a fantastic deal until you hit a problem. Some models/frameworks it works great. Others, not so much.
bloomingkales 5 days ago [-]
For sure, but I think people on the fine-tuning/training/Stable Diffusion side are more concerned with that. They make a big fuss about it and basically talk people out of a perfectly good, well-priced 16GB VRAM card that works out of the box with ollama or LM Studio for text inference.

Kind of one of the reasons AMD is a sleeper stock for me. If people only knew.
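
For anyone who wants to try the ROCm route, a minimal sketch with ollama's Docker image, following (I believe) the flags in the current ollama docs; check your card's support status first:

    # ollama with AMD ROCm support via the official container image
    docker run -d --device /dev/kfd --device /dev/dri \
      -v ollama:/root/.ollama -p 11434:11434 \
      --name ollama ollama/ollama:rocm
    docker exec -it ollama ollama run llama3.2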

throawayonthe 5 days ago [-]
[dead]
Der_Einzige 5 days ago [-]
Still nothing better than oobabooga (https://github.com/oobabooga/text-generation-webui) in terms of a maximalist/"pro"/"prosumer" LLM UI/UX à la Blender, Photoshop, Final Cut Pro, etc.

Embarrassing, and any VCs reading this can contact me to talk about how to fix that. LM Studio is today the closest competition (but not close enough), and Adobe or Microsoft could do it if they fired the current folks who prevent this from happening.

If you're not using Oobabooga, you're likely not playing with the settings on models, and if you're not playing with your models' settings, you're hardly even scratching the surface of their total capabilities.
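
To make the settings point concrete, a hedged example of the sampling knobs involved, written with llamafile/llama.cpp-style flags; the model path and values are illustrative, not recommendations:

    # Common sampling parameters most chat UIs hide behind defaults
    ./llamafile -m llm.gguf \
      --temp 0.7 \
      --top-k 40 \
      --top-p 0.9 \
      --repeat-penalty 1.1 \
      -p "Continue the story:"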

skarlso 5 days ago [-]
Hi, I'm a noob in this plane, but I'm learning slowly. Could you please tell me how this is different from running something like [Msty](https://msty.app/)?

That also supports running local models that you already installed using ollama, and things like adding PDFs and preserving chat history. So what am I looking at here with ooba?

Thanks!

Der_Einzige 5 days ago [-]
Oobabooga has a focus on, and support for, dozens more settings and parameters than any other UI does. You only get more by using the base HF codebase directly in code.

Msty is like Obsidian in that it's not a true prosumer or pro-level tool. It's not Photoshop for text; it's Lightroom for text.

skarlso 4 days ago [-]
Thanks for the clarification! I see. I guess I have to decide where and to what extent I would like to level up my skills, and where to start.

Not going to lie, getting Oobabooga to work on my laptop was no small feat. Conda didn't work, Python modules didn't work, cmd_macos failed with an error... But eventually I got there by hand. :D So the "Getting Started" was quite steep.

theropost 5 days ago [-]
this.
gulan28 5 days ago [-]
You can try out https://wiz.chat (my project) if you want to run Llama in your web browser. It still needs a GPU and the latest version of Chrome, but it's fast enough for my usage.
coding123 5 days ago [-]
We will at some point have a JS API to run a preliminary LLM for local decisions, with the server as the final arbiter. For example, a comment rage moderator could help an end user revise their post while they write it, so the comment doesn't turn into rage bait. This is best done locally in the user's browser; then, when they're ready to post, the server does one final check. It would be like today's React front ends doing all the state and UI computation, relieving servers from having to render HTML.
5 days ago [-]
jokethrowaway 6 days ago [-]
I have a similar pc and I use text-generation-webui and mostly exllama quantized models.

I also deploy text-generation-webui for clients on k8s with gpu for similar reasons.

Last I checked, llamafile / ollama are not as optimised for GPU use.

For image generation I moved from Automatic webui to ComfyUI a few months ago. They're different beasts: for some workflows Automatic is easier to use, but for most tasks you can create a better workflow with enough Comfy extensions.

Facefusion warrants a mention for faceswapping

novok 4 days ago [-]
As a piece of writing feedback, I would convert your citation links into normal links. Clicking on the citation doesn't jump to the link or the citation entry, and you are basically using hyperlinks anyway.
mikestaub 5 days ago [-]
I just use MLC with WebGPU: https://codepen.io/mikestaub/pen/WNqpNGg
prettyblocks 5 days ago [-]
> I have a laptop running Linux with core i9 (32threads) CPU, 4090 GPU (16GB VRAM) and 96 GB of RAM.

Is there somewhere I can find a computer like this pre-built?

hasperdi 5 days ago [-]
Yes, you can buy something like this prebuilt; Lenovo workstations, for example. Quite expensive new, but there are many second-hand units on eBay.
5 days ago [-]
erickguan 5 days ago [-]
How much memory can models take? I would assume the dGPU setup stops performing better past a certain point.
koinedad 6 days ago [-]
Helpful summary, short but useful
drillsteps5 6 days ago [-]
I did not find it useful at all. It's just a set of links to various places that can be helpful in finding useful information.

The LocalLLaMA subreddit, for example, is extremely helpful. You have to figure out what you want and what tools you want to use (by reading posts and asking questions), and then find some guides on setting up those tools. As you proceed you will run into issues, and you can easily find most of your answers in the same subreddit (or the subreddits dedicated to the particular tools you're setting up), because hundreds of people have had the same issues before.

Not to say that the article is useless; it's just a semi-organized set of links.

emmelaich 5 days ago [-]
It should list Simon Willison's https://llm.datasette.io/en/stable/
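
For reference, a hedged sketch of what that tool looks like pointed at a local backend; the plugin and model names are examples, so check the docs for what's current:

    # Simon Willison's llm CLI driving a local model through a plugin
    pip install llm
    llm install llm-ollama
    llm -m llama3.2 "Give me three reasons to run LLMs locally"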
masteruvpuppetz 5 days ago [-]
David Bombal interviews a mysterious man who shows how he uses AI/LLMs for automated LinkedIn posts and other tasks. https://www.youtube.com/watch?v=vF-MQmVxnCs
sturza 6 days ago [-]
4090 has 24gb vram, not 16.
vidyesh 6 days ago [-]
dumbfounder 5 days ago [-]
Updation. That’s a new word for me. I like it.
sccomps 5 days ago [-]
It's quite common in India but I guess not widely accepted internationally. If there can be deletion, then why not updation?
dumbfounder 4 days ago [-]
What’s next, addition?
sccomps 4 days ago [-]
There is already a word, Insertion!
exe34 5 days ago [-]
that's double plus good.
rcdwealth 4 days ago [-]
[dead]
thangalin 5 days ago [-]
run.sh:

    #!/usr/bin/env bash

    set -eu
    set -o errexit
    set -o nounset
    set -o pipefail

    readonly SCRIPT_SRC="$(dirname "${BASH_SOURCE[${#BASH_SOURCE[@]} - 1]}")"
    readonly SCRIPT_DIR="$(cd "${SCRIPT_SRC}" >/dev/null 2>&1 && pwd)"
    readonly SCRIPT_NAME=$(basename "$0")

    # Avoid issues when wine is installed.
    sudo su -c 'echo 0 > /proc/sys/fs/binfmt_misc/status'

    # Graceful exit to perform any clean up, if needed.
    trap terminate INT

    # Exits the script with a given error level.
    function terminate() {
      level=10

      if [ $# -ge 1 ] && [ -n "$1" ]; then level="$1"; fi

      exit $level
    }

    # Concatenates multiple files.
    join() {
      local -r prefix="$1"
      local -r content="$2"
      local -r suffix="$3"

      printf "%s%s%s" "$(cat ${prefix})" "$(cat ${content})" "$(cat ${suffix})"
    }

    # Swapping this symbolic link allows swapping the LLM without script changes.
    readonly LINK_MODEL="${SCRIPT_DIR}/llm.gguf"

    # Dereference the model's symbolic link to its path relative to the script.
    readonly PATH_MODEL="$(realpath --relative-to="${SCRIPT_DIR}" "${LINK_MODEL}")"

    # Extract the file name for the model.
    readonly FILE_MODEL=$(basename "${PATH_MODEL}")

    # Look up the prompt format based on the model being used.
    readonly PROMPT_FORMAT=$(grep -m1 ${FILE_MODEL} map.txt | sed 's/.*: //')

    # Guard against missing prompt templates.
    if [ -z "${PROMPT_FORMAT}" ]; then
      echo "Add prompt template for '${FILE_MODEL}'."
      terminate 11
    fi

    readonly FILE_MODEL_NAME=$(basename $FILE_MODEL)

    if [ -z "${1:-}" ]; then
      # Write the output to a name corresponding to the model being used.
      PATH_OUTPUT="output/${FILE_MODEL_NAME%.*}.txt"
    else
      PATH_OUTPUT="$1"
    fi

    # The system file defines the parameters of the interaction.
    readonly PATH_PROMPT_SYSTEM="system.txt"

    # The user file prompts the model as to what we want to generate.
    readonly PATH_PROMPT_USER="user.txt"

    readonly PATH_PREFIX_SYSTEM="templates/${PROMPT_FORMAT}/prefix-system.txt"
    readonly PATH_PREFIX_USER="templates/${PROMPT_FORMAT}/prefix-user.txt"
    readonly PATH_PREFIX_ASSIST="templates/${PROMPT_FORMAT}/prefix-assistant.txt"

    readonly PATH_SUFFIX_SYSTEM="templates/${PROMPT_FORMAT}/suffix-system.txt"
    readonly PATH_SUFFIX_USER="templates/${PROMPT_FORMAT}/suffix-user.txt"
    readonly PATH_SUFFIX_ASSIST="templates/${PROMPT_FORMAT}/suffix-assistant.txt"

    echo "Running: ${PATH_MODEL}"
    echo "Reading: ${PATH_PREFIX_SYSTEM}"
    echo "Reading: ${PATH_PREFIX_USER}"
    echo "Reading: ${PATH_PREFIX_ASSIST}"
    echo "Writing: ${PATH_OUTPUT}"

    # Capture the entirety of the instructions to obtain the input length.
    readonly INSTRUCT=$(
      join ${PATH_PREFIX_SYSTEM} ${PATH_PROMPT_SYSTEM} ${PATH_SUFFIX_SYSTEM}
      join ${PATH_PREFIX_USER} ${PATH_PROMPT_USER} ${PATH_SUFFIX_USER}
      join ${PATH_PREFIX_ASSIST} "/dev/null" ${PATH_SUFFIX_ASSIST}
    )

    (
      echo "${INSTRUCT}"
    ) | ./llamafile \
      -m "${LINK_MODEL}" \
      -e \
      -f /dev/stdin \
      -n 1000 \
      -c ${#INSTRUCT} \
      --repeat-penalty 1.0 \
      --temp 0.3 \
      --silent-prompt > ${PATH_OUTPUT}

      #--log-disable \

    echo "Outputs: ${PATH_OUTPUT}"

    terminate 0
map.txt:

    c4ai-command-r-plus-q4.gguf: cmdr
    dare-34b-200k-q6.gguf: orca-vicuna
    gemma-2-27b-q4.gguf: gemma
    gemma-2-7b-q5.gguf: gemma
    gemma-2-Ifable-9B.Q5_K_M.gguf: gemma
    llama-3-64k-q4.gguf: llama3
    llama-3-1048k-q4.gguf: llama3
    llama-3-1048k-q8.gguf: llama3
    llama-3-8b-q4.gguf: llama3
    llama-3-8b-q8.gguf: llama3
    llama-3-8b-1048k-q6.gguf: llama3
    llama-3-70b-q4.gguf: llama3
    llama-3-70b-64k-q4.gguf: llama3
    llama-3-smaug-70b-q4.gguf: llama3
    llama-3-giraffe-128k-q4.gguf: llama3
    lzlv-q4.gguf: alpaca
    mistral-nemo-12b-q4.gguf: mistral
    openorca-q4.gguf: chatml
    openorca-q8.gguf: chatml
    quill-72b-q4.gguf: none
    qwen2-72b-q4.gguf: none
    tess-yi-q4.gguf: vicuna
    tess-yi-q8.gguf: vicuna
    tess-yarn-q4.gguf: vicuna
    tess-yarn-q8.gguf: vicuna
    wizard-q4.gguf: vicuna-short
    wizard-q8.gguf: vicuna-short
Templates (all the template directories contain the same set of file names, but differ in content):

    templates/
    ├── alpaca
    ├── chatml
    ├── cmdr
    ├── gemma
    ├── llama3
    ├── mistral
    ├── none
    ├── orca-vicuna
    ├── vicuna
    └── vicuna-short
        ├── prefix-assistant.txt
        ├── prefix-system.txt
        ├── prefix-user.txt
        ├── suffix-assistant.txt
        ├── suffix-system.txt
        └── suffix-user.txt
If there's interest, I'll make a repo.
amazingamazing 5 days ago [-]
I have never seen the point of running locally. Not cost-effective, worse models, etc.
splintercell 5 days ago [-]
Even if it's not cost-effective, or you're just running worse models, you're learning an important skill.

Take self-hosting your website, for instance: it has all the same considerations. But here you're getting information from the LLMs, and it's helpful to know that the LLM is under your control.

Oras 5 days ago [-]
Self-hosting a website on a local server running in your room? What's the point?

Same with LLMs: you can use providers who don't log requests and are SOC2 compliant.

Small models that run locally are a waste of time, as they don't offer adequate value compared to larger models.

KetoManx64 5 days ago [-]
My self-hosted website, with a bunch of articles about my self-hosted setup and documentation, running from a server in my living room, got me my six-figure DevOps job.
troyvit 5 days ago [-]
> Self hosting website as local server running in your room? what’s the point?

It depends on how important the web site is and what the purpose is. Personal blog (Ghost, WordPress)? File sharing (Nextcloud)? Document collaboration (Etherpad)? Media (Jellyfin)? Why _not_ run it from your room with a reverse proxy? You're paying for internet anyway.

> Same with LLMs, you can use providers who don’t log requests and SOC2 compliant.

Sure. Until they change their minds and decide they want to, or until they go belly-up because they didn't monetize you.

> Small models that run locally is a waste of time as they don’t have adequate value compared to larger models.

A small model won't get you GPT-4o value, but for coding, simple image generation, story prompts, "how long should I boil an egg" questions, etc. it'll do just fine, and it's yours. As a bonus, while a lot of energy went into creating the models you'd use, you're saving a lot of energy using them compared to asking the giant models simple questions.

IOT_Apprentice 5 days ago [-]
Local is private. You are not handing over your data to an AI for training.
jiggawatts 5 days ago [-]
Most major providers have EULAs that specify that they don’t keep your data and don’t use it for training.
troyvit 5 days ago [-]
Considering how most major providers got their data to begin with, these are EULAs that are probably impossible to enforce, and none of the major players have given much reason to believe they would stick to them anyway. And why bother, when local models don't have nearly the same complications?
jiggawatts 5 days ago [-]
This is FUD.

Millions of businesses trust Google and Microsoft with their spreadsheets, emails, documents, etc… The EULAs are very similar.

If any major provider was caught violating the terms of these agreements they’d lose a massive amount of business! It’s just not worth it.

slowmovintarget 5 days ago [-]
Right. And the former CEO of Google basically explained their operating procedure. You go and steal everything, you break the rules, "then you have your lawyers clean all that up" when you get caught.

https://youtu.be/pvcK9X4XEF8?si=9j2ELkbXxHGoZEnZ&t=1157

> ...if you're a Silicon Valley entrepreneur, which hopefully all of you will be, is if it took off then you'd hire a whole bunch of lawyers to go clean the mess up, right. But if nobody uses your product it doesn't matter that you stole all the content.

troyvit 4 days ago [-]
> This is FUD.

Yeah. Because I have tons and tons of fear, uncertainty and doubt about how these giga-entities control so many aspects of my life. Maybe I shouldn't be spreading it but on the other hand they are so massively huge that any FUD I spread will have absolutely no effect on how they operate. If you want to smell real FUD go sniff Microsoft[1].

[1] https://en.wikipedia.org/wiki/Fear,_uncertainty,_and_doubt#M...

interloxia 4 days ago [-]
And my car doesn't keep full-precision logs of its location, according to the terms that I agree to every time I drive somewhere.

But it turns out they did/do.

ogogmad 5 days ago [-]
Privacy?
babyshake 5 days ago [-]
Can you elaborate on the not cost effective part? That seems surprising unless the API providers are running at a loss.
amazingamazing 5 days ago [-]
A 4090 is $1000 at least. That’s years of subscription for the latest models, which don’t run locally anyway.
wongarsu 5 days ago [-]
I run LLMs locally on a 2080TI I bought used years ago for another deep learning project. It's not the fastest thing in the world, but adequate for running 8B models. 70B models technically work but are too slow to realistically use them.
homarp 5 days ago [-]
unless you have one already for playing games
5 days ago [-]
farceSpherule 5 days ago [-]
I pay for ChatGPT Teams. Much easier and better than this.
5 days ago [-]
deadbabe 6 days ago [-]
My understanding is that local LLMs are mostly just toys that output basic responses and simply can't compete with full LLMs trained with $60 million+ worth of compute time, and that no matter how good hardware gets, larger companies will always have even better hardware and resources to produce even better results, so this is basically pointless for anything competitive or serious. Is this accurate?
pavlov 5 days ago [-]
The leading provider of open models used for local LLM setups is Meta (with the Llama series). They have enormous resources to train these models.

They’re giving away these expensive models because it doesn’t hurt Meta’s ad business but reduces risk that competitors could grow moats using closed models. The old “commoditize your complements” play.

philjohn 5 days ago [-]
Not entirely.

TRAINING an LLM requires a lot of compute. Running inference on a pre-trained LLM is less computationally expensive, to the point where you can run LLAMA (cost $$$ to train on Meta's GPU cluster) with CPU-based inference.
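
As a concrete (hedged) example: a small model on CPU is a one-liner with ollama; the model tag is just an example and downloads a couple of GB on first run:

    # CPU-only inference with a small model; no GPU required
    ollama run llama3.2:3b "Explain the difference between training and inference."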

drillsteps5 5 days ago [-]
This is not my experience, but it depends on what you consider "competitive" and "serious".

While it does appear that a large number of people are using these for RP and... um... similar stuff, I do find code generation to be fairly good, esp with some recent Qwens (from Alibaba). Full disclaimer: I do use this sparingly, either to get the boilerplate, or to generate a complete function/module with clearly defined specs, and I sometimes have to re-factor the output to fit my style and preferences.

I also use various general models (mostly Meta's) fairly regularly to come up with and discuss various business and other ideas, and get general understanding in the areas I want to get more knowledge of (both IT and non-IT), which helps when I need to start digging deeper into the details.

I usually run quantized versions (I have older GPU with lower VRAM).

Wordsmithing and generating illustrations for my blog articles (I prefer Plex, it's actually fairly good with adding various captions to illustrations; the built-in image generator in ChatGPT was still horrible at it when I tried it a few months ago).

Some resume tweaking as part of manual workflow (so multiple iterations with checking back and forth between what LLM gave me and my version).

So it's mostly stuff for my own personal consumption, that I don't necessarily trust the cloud with.

If you have a SaaS idea with an LLM or other generative AI at its core, processing the requests locally is probably not the best choice. Unless you're prototyping, in which case it can help.

meta_x_ai 5 days ago [-]
Why would you waste time on good models when there are great models?
drillsteps5 5 days ago [-]
Good models are good enough for me, meta_x_ai. I gain experience by setting them up and keeping up with industry trends, and I don't trust OpenAI (or MSFT, or Google, or whatever) with my information. No, I don't do anything illegal or unethical, but that's not the point.
KetoManx64 5 days ago [-]
The good local model isn't creating a profile of me, my preferences, my health issues, my political leanings, and other info, like the "great" Google and OpenAI models most likely are, based on the questions you ask them. Just imagine if one day there's a data breach and your profile ends up on the dark web for future employers to find.
tsltrad 5 days ago [-]
I understand your concerns.

For me though, this would be all upside because I have largely explored technical topics with language models that would only be impressive to an employer.

At this point, it is like asking what does someone use a computer for? The use cases are so varied.

I can see how it would be interesting to set up a local model just for the fun of setting it up. But when it comes down to it, for me it's just so much easier to pay $20 a month for Sonnet that it isn't even close, or really a decision point.

pickettd 6 days ago [-]
Depends on what benchmarks/reports you trust I guess (and how much hardware you have for local models either in-person or in-cloud). https://aider.chat/docs/leaderboards/ has Deepseek v3 scoring higher than most closed LLMs on coding (but it is a huge local model). And https://livebench.ai has QwQ scoring quite high in the reasoning category (and that is relatively easy to run locally but it doesn't score super high in other categories).
johnklos 5 days ago [-]
There's a huge difference between training and running models.

The multimillion (or billion) dollar collections of hardware assemble the datasets and train the models that we (people who run LLMs locally) then run. The we-host-LLMs-for-money companies do the same with non-open datasets, and their data isn't all that much fancier. Open LLMs are catching up to closed ones, and the competition means everyone wins except the victims of Nvidia's price gouging.

This is a bit like confusing the resources that go into making a game with the resources needed to run and play that game. They're worlds apart.

prettyStandard 5 days ago [-]
Even Nvidia's victims are better off... (O)llama runs on AMD GPUs
kolbe 5 days ago [-]
I think the bigger issue is hardware utilization. Meta's and Qwen's medium-sized models are fantastic and competitive with 6-month-old offerings from OpenAI and Claude, but you need $2500 of hardware to run them. If you're going to be spinning this 24/7, sure, the API costs for OpenAI and Anthropic would be insane, but as far as normal personal usage patterns go, you'd probably spend $20/month. So: either spend $2500 that will depreciate to $1000 in two years, or $580 in API costs?
alkonaut 6 days ago [-]
If big models require big hardware how come there are so many free LLMs? How much compute is one ChatGPT interaction with the default model? Is it a massive loss leader for OpenAI?
drillsteps5 5 days ago [-]
A big part of it is that Meta is trying very hard not to lose relevance against OpenAI, so they keep building and releasing a lot of their models for free. I would imagine they're taking huge losses on these, but it's a price they seem willing to pay for being late to the party.

The same applies (to smaller extent) to Google, and more recently to Alibaba.

A lot of free (not open-source, mind you) LLMs are either from Meta or built by other smaller outfits by tweaking Meta's LLMs (there are exceptions, like Mistral).

One thing to keep in mind is that while training a model requires A LOT of data, manual intervention, hardware, power, and time, using these LLMs (inference, the forward pass through the neural network) really doesn't. Right now it typically requires an Nvidia GPU with lots of VRAM, but I have no doubt that in a few months or maybe a year someone will find a much easier way to do it without sacrificing speed.

alkonaut 5 days ago [-]
Yes I get that using the models costs a lot more than training. But using them isn’t free. Even using a large nVidia chip for a fraction of a second is expensive. So where does this money come from? Gains from paid tiers, investor money?
drillsteps5 4 days ago [-]
You got it backwards. Training is extremely costly in many different ways, while running inference is much less expensive. Meta, Google, Alibaba, and some others train their models and put them into the open for free, burning through their cash reserves while trying hard not to lose it all to OpenAI (and MSFT).
alkonaut 3 days ago [-]
Yes I phrased it backwards in the first sentence (of course). So the answer is ”investor money to operate at a loss?”
dboreham 5 days ago [-]
This isn't accurate.
quest88 6 days ago [-]
That's a strange question. It sounds like you're saying the author isn't serious in their use of LLMs. Can you describe what you mean by competitive and serious?

You're commenting on a post about how the author runs LLMs locally because they find them useful. Do you think they would run them and write an entire post on how they use it if they didn't find it useful? The author is seriously using them.

prophesi 6 days ago [-]
It's the missing piece of the article. How is their experience running LLMs locally compared to GPT-4o / Sonnet 3.5 / etc., outside of data privacy?
th0ma5 5 days ago [-]
In my experience, if you can make a reasonable guess as to how much information about your topic is contained within a model of a given size, then you can be very specific with your context and your ask, and it is generally as good as naive use of any of the larger models. Honestly, Llama 3.2 is as accurate as anything and runs reasonably quickly on a CPU. Larger models, to me, mostly increase the surface area for potential error and get exhausting to read, even when you ask them to get on with it.
momo_O 5 days ago [-]
> if you can make a reasonable guess as to the total contained information about your topic within the given size of model

Can you expand a bit on how you might do this?

th0ma5 5 days ago [-]
There's something you don't know that the model may know, and you want to see what it knows. That's usually just a sentence or maybe a few, both input and output. All the other talk about this model vs. that model vs. agents vs. RAG vs. prompt engineering is about practitioner worries. Keep in mind the thing is probably wrong, as you would with any of them; or that it's subliminally telling you something wrong that you may accidentally repeat in front of someone when it really matters, and let everyone and yourself down. That's the current state of all of these things, so if you're not building them, or aren't an NLP specialist working with multidisciplinary researchers on a specific goal of pushing research forward, then these things all have the same utility at the end of the day. Some of the most short-sighted systems advice seems to just spill out of Claude unsolicited, so whatever the big models are up to isn't entirely helpful, in my opinion. Hopefully they'll be pressured to reveal their prompts and other safety measures.