Running Llama 3.1 405B on a Mac Studio: Complete Guide
finally got llama 3.1 405b running on my mac studio m2 ultra. here's the full setup, quantization choices, and actual performance numbers
curious what everyone else thinks. gguf format basically won the local model format war
curious if others have had similar results
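for anyone wondering why quantization is non-negotiable at this size, here's a rough back-of-envelope sketch of weight memory only. it assumes memory ≈ parameters × bits per weight / 8 and ignores the KV cache, activations, and the per-block scale overhead that real GGUF quant types add (so their effective bits per weight run a bit above the nominal number). `weight_gib` is just a hypothetical helper for illustration:

```python
# Rough weight-only memory estimate for a dense model at a given quantization level.
# Assumption: bytes ~= params * bits_per_weight / 8. Ignores KV cache, activations,
# and per-block quantization scales, so real usage will be higher.

PARAMS_405B = 405e9  # nominal parameter count for Llama 3.1 405B


def weight_gib(params: float, bits_per_weight: float) -> float:
    """Return approximate weight storage in GiB (2**30 bytes)."""
    return params * bits_per_weight / 8 / 2**30


if __name__ == "__main__":
    for bits in (16, 8, 4, 3, 2):
        print(f"{bits:>2}-bit: ~{weight_gib(PARAMS_405B, bits):.0f} GiB")
```

compare the output against your machine's unified memory (the m2 ultra mac studio tops out at 192GB) and leave headroom for context and the OS.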
huh i never thought about it that way. my m3 max handles 70b models surprisingly well
honestly i've been going back and forth on this. the latency improvement from running locally is massive for real-time apps
anyone know if this still works with the latest version, or has the api changed?
can someone explain local ai a bit more? i'm not sure i fully get it
ok so I actually tested this pretty extensively last week and here's what I found: quantized models have gotten insanely good. I barely notice the quality drop. for some cases I had to modify the technique a bit, happy to share details if anyone's interested
ok real talk: gguf format basically won the local model format war. i know that's not the popular opinion here but someone had to say it
honestly yeah
sharing my experience here because i think it could help: gpu prices are still painful but it's getting better. took me way too long to figure this out on my own