stego-tech 1 days ago [-]
The Unified Memory pool is what will continue to be the “game changer” in systems architecture, especially outside of data centers.
The reality is even cutting edge games and consumer workloads don’t actually take full use of the PCIe bandwidth of the GPU or the bandwidth of its GDDR memory. Even local AI use cases don’t substantially or meaningfully benefit from faster memory, at least to average consumers.
A unified memory pool does two things:
1) Lets systems optimize utilization based on need, rather than be confined to specific pools
2) Reduce overall memory cost, by letting system builders purchase a single type of memory in bulk instead of having to figure out GDDR vs DDR memory placement (important for SFF/portable machines)
So at a time when memory is expensive, unified pools make more sense. Even when memory becomes cheap and plentiful again, it’s just practical at this point to allocate a larger overall pool instead of managing discrete sets.
The one big drawback is security. A shared memory pool means side-channel attacks against memory from the GPU or CPU could potentially compromise the other as well, meaning memory-safe designs are going to be critical to security going forward (which is good for Rust adherents, I figure).
AnthonyMouse 22 hours ago [-]
> Lets systems optimize utilization based on need, rather than be confined to specific pools
The trouble with this is that the different types of memory have different characteristics. Latency for ordinary system memory is actually better than it is for GDDR, because GDDR is optimized for bandwidth. RTX 5090 has 1.8TB/s of memory bandwidth with a 512-bit memory bus. The same bus width for DDR5-9600 would have better latency but only a third of the bandwidth.
CPU workloads are generally bounded by latency and GPU workloads are generally bounded by bandwidth, which is why they use two different types.
> Reduce overall memory cost, by letting system builders purchase a single type of memory in bulk instead of having to figure out GDDR vs DDR memory placement (important for SFF/portable machines)
The trouble with this is cost. In principle you could get the same 1.8TB/s of memory bandwidth as the RTX 5090 has, with the better latency of DDR5, by using DDR5 with a 1536-bit bus. This is indeed what multi-socket servers do, two sockets with 768 bits of memory channels per socket, but now check how much those system boards cost.
But the remaining alternatives are both worse. If you use GDDR for the unified memory then GDDR costs more than DDR and you're going to have significantly worse latency for the CPU. If you use DDR without a 3-4 times wider bus than the already-wide GPU then the GPU gets starved for bandwidth.
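A rough sketch of the arithmetic behind these figures, assuming theoretical peak bandwidth is simply bus width in bytes times transfer rate (real sustained bandwidth is lower). The function names are illustrative and the 1.8TB/s target is the figure quoted above.

```python
def peak_bandwidth_gbs(bus_width_bits: float, transfer_rate_mts: float) -> float:
    """Theoretical peak in GB/s: bytes per transfer times transfers per second."""
    return bus_width_bits / 8 * transfer_rate_mts * 1e6 / 1e9


def bus_width_for(target_gbs: float, transfer_rate_mts: float) -> float:
    """Bus width in bits needed to hit a target peak bandwidth at a given rate."""
    return target_gbs * 1e9 * 8 / (transfer_rate_mts * 1e6)


GDDR_TARGET_GBS = 1800.0  # the 1.8TB/s quoted for the RTX 5090
DDR5_RATE_MTS = 9600.0    # DDR5-9600, as in the comment

ddr5_512 = peak_bandwidth_gbs(512, DDR5_RATE_MTS)
ddr5_1536 = peak_bandwidth_gbs(1536, DDR5_RATE_MTS)
needed_bits = bus_width_for(GDDR_TARGET_GBS, DDR5_RATE_MTS)

print(f"512-bit DDR5-9600:  {ddr5_512:.1f} GB/s ({ddr5_512 / GDDR_TARGET_GBS:.0%} of target)")
print(f"1536-bit DDR5-9600: {ddr5_1536:.1f} GB/s")
print(f"Width needed for target: {needed_bits:.0f} bits")
```

Under these assumptions the 512-bit case comes out at about 614 GB/s, roughly a third of the target, and a 1536-bit bus lands at about 1843 GB/s.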
Melatonic 3 hours ago [-]
Isn't GDDR also based on a much earlier DDR implementation than DDR5 ?
It also has way better throughput because it's physically surrounding the chip itself and wired in a way that maximises this.
The real problem is interconnect speed and latency. We have made tons of progress elsewhere but AI is exposing that the interconnect in many systems is just not great. Even future PCIE 6.0 is fairly bandwidth constrained compared to 8 channels of DDR memory or the way we solder GDDR next to the chip.
We moved on from AGP and older formats to PCI-E and I think it's time to do that again. And maybe even "slot" based implementations in general for both RAM (system and graphics) and GPUs.
In summary, we need consumer machines and workstations to use pin-based stuff like LPCAMM RAM. And the interconnect on the motherboard itself needs to be both wider (more bandwidth) and lower latency. This might require moving on from the motherboard being two-dimensional only (a flat board) to something like an L shape to gain more physical board space.
ssivark 11 hours ago [-]
How about having a large pool of unified memory and expanding the next layer (L3?) of cache to accommodate more of the CPU's low-latency RAM usage?
marcosdumay 6 hours ago [-]
As a rule, increasing the size of cache increases its latency, and how much of it you can use is capped by the quality of your cache management algorithms and the latency of the level above it.
Since CPUs are highly optimized, both increasing the latency of the main memory and increasing the size of L3 will probably lead to larger L3 latency.
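One way to reason about this tradeoff is the textbook average memory access time (AMAT) formula: hit time plus miss rate times miss penalty. The sketch below uses entirely hypothetical latencies and miss rates, chosen only to show how a larger but slower L3 in front of slower main memory can end up worse overall; none of the numbers come from real hardware.

```python
def amat_ns(hit_ns: float, miss_rate: float, miss_penalty_ns: float) -> float:
    """Average memory access time for one cache level in front of memory."""
    return hit_ns + miss_rate * miss_penalty_ns


# Hypothetical baseline: smaller, faster L3 in front of low-latency DRAM.
baseline = amat_ns(hit_ns=10.0, miss_rate=0.20, miss_penalty_ns=80.0)

# Hypothetical alternative: bigger L3 (fewer misses, slower hits)
# in front of higher-latency unified memory.
bigger_l3 = amat_ns(hit_ns=14.0, miss_rate=0.15, miss_penalty_ns=120.0)

print(f"baseline:  {baseline:.1f} ns")
print(f"bigger L3: {bigger_l3:.1f} ns")
```

With these made-up inputs the lower miss rate does not pay for the slower hits and the slower memory behind them. Different inputs can flip the result, which is the point about cache management quality above.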
trumpdong 6 hours ago [-]
We might even decide to put 32GB of high-latency cache on the system board and then 12GB of throughput-optimized main memory close to the GPU. ;)
marcosdumay 5 hours ago [-]
You meant a 128GB (instead of 12GB)?
And yes, a L4 cache can be one way out of that problem. Another way is making the L3 cache lines wider and working the hell out of improving your management algorithm.
It's not a theoretically impossible problem. It's also not something you can solve automatically with a bit more money or some simple decisions. It's possible this is the best architecture available, but it's not certain by any means.
trumpdong 10 minutes ago [-]
No I mean 12GB, an amount that is typical in such a system today because GPU and graphics cards vendors seem to limit it a lot.
Melatonic 3 hours ago [-]
I think that's basically what Cerebras is doing?
stego-tech 10 hours ago [-]
I get all of that already, but stand by my original points: for most consumer, non-data center workloads, the compromises aren’t likely to be noticeable to the end user. We’re not talking about edge cases like local-AI or AAA gaming enthusiasts who want to run software at bleeding-edge capabilities and who will dissect performance deltas between driver versions or overclock their kit for maximum performance, because we’re the edge cases in the marketplace.
Everything is ultimately a compromise of some sort, and modern Unified Memory feels like one of the better compromises out there given the current plateauing of hardware scaling, the growing costs associated with memory and NAND, and the shifting complexity from hardware (more instruction sets, more accelerators, more cores) to software (more abstraction layers, more machine learning).
fc417fc802 21 hours ago [-]
These are all good points that I agree with but rather than seeing an intractable problem I predict we'll see the role that GDDR would otherwise fill in this scenario replaced by a small block of HBM on the APU die. I don't know if it will ultimately end up unified or not but either way I don't think memory segmentation is the core problem here. Simply not needing to send transfers across the narrow and slow PCIe bus would fix most of the practical problems (at least AFAIK but I'm not an expert).
Transitioning over to wild speculation here, I think that most likely this will be treated as part of an absurdly large L3 (ala 3D V-Cache) or as an additional L4. In either case I expect the latency and power tradeoffs introduced to be tolerated as "good enough" even for the highest end consumer gear. (Actually I wonder if some sort of special case cache would be feasible, with memory addresses flagged by the graphics driver and regular CPU related stuff skipping over it entirely. But by then we've squarely entered the territory of vaguely unhinged rambling on my part.)
Alternatively if the performance caveats are deemed to be important enough to justify the added complexity it wouldn't surprise me to see the HBM treated as an independent memory pool analogous to that of a dGPU. That wouldn't change the current status quo with respect to the GPU APIs but it would significantly ameliorate the memory bandwidth bottleneck for inference workloads and from a software perspective is a drop in replacement. You'd still write the code targeting the dGPU with explicit swapping to RAM but when run on an appropriate APU it would get a massive speedup for free instead of suddenly being starved for bandwidth while also performing unnecessary copy operations.
maccard 1 days ago [-]
> The reality is even cutting edge games and consumer workloads don’t actually take full use of the PCIe bandwidth of the GPU or the bandwidth of its GDDR memory
Game dev here. For anyone reading this - it’s not because we’re lazy, it’s because _it’s really hard to do_.
One of the biggest differences between the current generation consoles and the current gen PCs is unified memory.
stego-tech 21 hours ago [-]
I live with a game dev myself, so I get it. Hell, it's hard even for PC developers who want to do things without leaning on abstraction layers or existing engines. Managing multiple discrete memory pools, asset swaps or calls between them, getting the respective subsystems to exchange data at just the right time so as not to impact other code and drag down performance - it's fucking hard in general.
A unified pool of memory suddenly makes that simultaneously easier, but also far more flexible, which frees up developer time and bandwidth to focus on other, more important tasks.
BuyMyBitcoins 1 days ago [-]
How much of that difficulty comes from the chosen game engine? I assume the engine is the primary factor in how resources are allocated.
maccard 1 days ago [-]
Both lots and none at the same time. The engines definitely make decisions for you but with unreal (for example) you can modify the RDG any way you see fit.
The problem is that when you need something in gpu you have to go through RAM first (unless you have DMA which is a more recent addition). That doesn’t just add latency it also adds an extra step of cache invalidation, so you have to plan for that from the highest level of gameplay. If you need to prepare for a GPU memory miss _and_ a CPU memory miss as a worst case all the time, it’s very hard to make good use of the bandwidth in the best case
keyringlight 1 days ago [-]
One related question that you need to follow that with is the associated cost of switching the whole studio to another engine that's technically better, or, if proposing that each studio tailor-make their own engine, the cost of that engineering, presuming they have or learn the expertise to surpass whatever they're using currently.
I'm not a game developer, but there would also seem to be a link between resource usage by the engine and whatever content the production side is making. For all the commentary about how brilliant the id software engines are, if you examine the levels you pass through they're also very efficient with what they demand out of the engine - it's like an orchestra playing well together, not one instrument that means you can do anything.
nearbuy 24 hours ago [-]
I think much of the difficulty is just that, for example, the 1.8 TB/s of an RTX 5090 is a lot of bandwidth for a game to use. That's over 50,000 4k textures per second at 32bpp.
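A quick sketch of that estimate. It assumes "4k" means an uncompressed 3840x2160 texture at 4 bytes per pixel; a 4096x4096 texture is also shown, since "4k" is ambiguous. Real games mostly use block-compressed textures, so this is an upper bound on bytes per texture.

```python
def texture_bytes(width: int, height: int, bits_per_pixel: int = 32) -> int:
    """Size of one uncompressed texture with no mipmaps."""
    return width * height * bits_per_pixel // 8


def textures_per_second(bandwidth_bytes_per_s: float, tex_bytes: int) -> float:
    """How many whole-texture reads fit into one second of peak bandwidth."""
    return bandwidth_bytes_per_s / tex_bytes


BANDWIDTH = 1.8e12  # 1.8 TB/s, the quoted RTX 5090 figure

uhd_rate = textures_per_second(BANDWIDTH, texture_bytes(3840, 2160))
square_rate = textures_per_second(BANDWIDTH, texture_bytes(4096, 4096))

print(f"3840x2160 @ 32bpp: {uhd_rate:,.0f} textures/s")
print(f"4096x4096 @ 32bpp: {square_rate:,.0f} textures/s")
```

The 3840x2160 reading gives a little over 54,000 textures per second, consistent with "over 50,000"; the 4096x4096 reading gives about half that.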
maccard 14 hours ago [-]
I agree with you in theory. A couple of points - that’s currently the most expensive and highest-performing card on the market. Most people on Steam are using an RTX 3060, which has more like 360GB/s. That’s a factor of 5. How do you design resource usage that scales across that wide a range? (We try to, fwiw).
That spec is also a throughput measured per second, whereas our frame rates are much higher than 1/s. At 60Hz, that’s now between roughly 170 and 800 textures a frame. If you miss _one_ you don’t get that back.
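To make the per-frame view concrete, here is a small sketch that divides peak bandwidth by frame rate. It assumes an uncompressed 3840x2160 texture at 4 bytes per pixel and the two bandwidth figures mentioned in this thread; the exact counts shift with rounding, but they land in the same order of magnitude as the range quoted above.

```python
TEXTURE_BYTES = 3840 * 2160 * 4  # assumed uncompressed 4K texture, 32bpp


def bytes_per_frame(bandwidth_gbs: float, fps: float) -> float:
    """Peak bytes that can move in one frame at the given frame rate."""
    return bandwidth_gbs * 1e9 / fps


def textures_per_frame(bandwidth_gbs: float, fps: float) -> float:
    """Whole-texture reads that fit in one frame of peak bandwidth."""
    return bytes_per_frame(bandwidth_gbs, fps) / TEXTURE_BYTES


for name, gbs in [("RTX 5090", 1800.0), ("RTX 3060", 360.0)]:
    budget_gb = bytes_per_frame(gbs, 60) / 1e9
    count = textures_per_frame(gbs, 60)
    print(f"{name}: {budget_gb:.0f} GB per frame, about {count:.0f} textures")
```

At 60 frames per second this works out to about 30 GB per frame on the faster card and 6 GB on the slower one, before any other memory traffic is counted.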
A single main character in a game can be 2-5 regular textures, plus all of the extra mapping textures we have these days. Now do landscapes, environments, props, background videos, and it all adds up. 4k textures are pretty universally used. If you look at a tiny object up close we need a higher res texture to be able to show it neatly.
You also have memory pressure - raytracing makes heavy use of VRAM so you have to make the tradeoff of how much do you want to allocate to caching lighting, vs how much you want to keep textures and geo around.
Lastly, as you say, actually keeping up with 360GB/s from the CPU side is tough. If you require any transformation or CPU operations that’s just not going to happen. If you need to pull from disk, even on an NVMe drive reading synchronously, the max throughput is < 10% of that, and that assumes you are actually reading 360GB from disk. If you pause to do anything else, you’ll significantly slow that down. Players also generally don’t like it if we thrash their NVMe disks :)
gmueckl 22 hours ago [-]
That sounds like a lot, but: modern renderers do between 20 to 40 passes, many of them in screen space. And each screen space pass typically reads from at least two input images, sometimes 3 or 4 even with optimally packed inputs. At 60fps that can quickly get up to way over 2000 full screen buffer reads per second and more for less than optimal access patterns in some algorithms. That also doesn't account for texture access during shading passes, which are somewhat random memory accesses.
nearbuy 18 hours ago [-]
Very true, but I'll point out that even those 2000 full screen reads per second at 4k are only 4% of the 5090's bandwidth. Sacrificing some of that speed for a unified memory architecture seems like a good trade.
Plus, DLSS can greatly reduce the bandwidth requirements for 4K gaming.
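The 4% figure can be checked with a short sketch. It assumes each full-screen read touches a 3840x2160 buffer at 4 bytes per pixel and that the 1.8 TB/s peak is fully available; an 8-byte-per-pixel case (for example an RGBA16F buffer) is included as a second assumption.

```python
def fullscreen_read_share(reads_per_second: float,
                          bandwidth_bytes_per_s: float,
                          width: int = 3840,
                          height: int = 2160,
                          bytes_per_pixel: int = 4) -> float:
    """Fraction of peak bandwidth consumed by full-screen buffer reads."""
    bytes_read = reads_per_second * width * height * bytes_per_pixel
    return bytes_read / bandwidth_bytes_per_s


BANDWIDTH = 1.8e12  # quoted RTX 5090 figure

share_32bpp = fullscreen_read_share(2000, BANDWIDTH)
share_64bpp = fullscreen_read_share(2000, BANDWIDTH, bytes_per_pixel=8)

print(f"2000 reads/s at 4 bytes/pixel: {share_32bpp:.1%}")
print(f"2000 reads/s at 8 bytes/pixel: {share_64bpp:.1%}")
```

With 4-byte pixels the share is just under 4%; wider render targets or more passes scale it linearly.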
gmueckl 16 hours ago [-]
I'm being very, very conservative with my estimates here. Based on the renderers I know, I could have easily tweaked the numbers to go up to 8000 full screen texture reads per second. That doesn't include texture or geometry or BVH reads or any memory writes. That is all in addition to those operations.
nearbuy 5 hours ago [-]
But do you think you'll reach 1.8 TB/s?
gmueckl 4 hours ago [-]
Quite likely, but the transfer throughput is required in bursts, not necessarily continuously.
Let me put it this way: what I care about is how quickly data arrives after a bunch of shader threads request it. Throughput is one way for hardware to reduce that time. The other way is to hide the latency (GPUs do a lot to keep themselves busy while waiting for memory), but those strategies can only do so much.
Lower memory throughput almost always leads to a longer runtime of GPU calls in practice, and thus lower update rates.
Stevvo 23 hours ago [-]
What? It's incredibly easy to take full use of memory bandwidth. For example, put proper volumetric smoke/fire/explosion sim in your game. But game developers don't do that because they are lazy.
maccard 12 hours ago [-]
No, we don’t do it because the tradeoff isn’t worth it. A gpu based particle sim is very difficult to do well - it’s easy (but computationally expensive) to do a volumetric sim, but when you want that simulation to interact with world geometry correctly it comes with an explosion in complexity and performance.
I promise you we want our games to look as good as you want them to look.
Stevvo 6 hours ago [-]
How does interaction with world geometry come with an explosion in complexity and performance? Advection has almost the same cost regardless of whether some cells are solid or not. It's one extra line in your shader + 1 bit per cell. JFA to build the solid mask.
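A minimal CPU-side sketch of the masking idea described here, written as 1D semi-Lagrangian advection in plain Python rather than as a shader. The function name and the 1D simplification are assumptions for illustration. It covers only the per-cell solid mask, not boundary velocities, the pressure solve, or building the mask with JFA, which is where the disagreement in this thread lies.

```python
def advect_1d(density, velocity, solid, dt=1.0):
    """Semi-Lagrangian advection on a 1D grid with a per-cell solid mask."""
    n = len(density)
    out = [0.0] * n
    for i in range(n):
        if solid[i]:
            # The "one extra line": solid cells never hold smoke.
            continue
        # Trace backwards along the velocity to find the source position.
        src = min(max(i - velocity[i] * dt, 0.0), n - 1.0)
        lo = int(src)
        hi = min(lo + 1, n - 1)
        t = src - lo
        # Treat solid neighbours as empty when interpolating.
        a = 0.0 if solid[lo] else density[lo]
        b = 0.0 if solid[hi] else density[hi]
        out[i] = (1.0 - t) * a + t * b
    return out


density = [0.0, 1.0, 0.0, 0.0]
velocity = [1.0] * 4

open_grid = advect_1d(density, velocity, [False] * 4)
blocked = advect_1d(density, velocity, [False, False, True, False])
print(open_grid)
print(blocked)
```

In this naive version smoke that would move into a solid cell simply disappears, so mass is not conserved; a real solver would also have to adjust the velocity field around obstacles.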
cm2187 1 days ago [-]
And conveniently, by making your machine non upgradeable, it allows the manufacturer to enforce market segmentation / charge a huge premium for small RAM upgrade (a la Apple)
to11mtm 1 days ago [-]
It doesn't -have- to be that way necessarily...
LPCAMM2/SOCAMM2 exist, heck I think Framework is using LPCAMM2 in one of their new laptops.
Heck, I'm willing to bet that a lot of manufacturers would rather go that route than soldered in, if for no other reason than the relative cost of warranty work between the two.
However, people probably need to stop being obsessed with ultrathin laptops for that to happen.
fc417fc802 20 hours ago [-]
> However, people probably need to stop being obsessed with ultrathin laptops for that to happen.
I've never been able to understand this. Once we made it down to ~20 mm (which for the record still accommodates dual-stacked SO-DIMMs, a 2.5 inch bay, and a user replaceable battery but not an RJ45 jack) I don't understand what the practical impact of any further reduction is supposed to be. Regardless of how thin you make it the thing will still be a massive rectangle that you can't flex or press on.
wtallis 17 hours ago [-]
> Regardless of how thin you make it the thing will still be a massive rectangle that you can't flex or press on.
There's very wide variation between laptops in how noticeably they'll flex or yield or creak when pressed. Laptops with a build quality that actually feels solid are far from being ubiquitous or even a majority.
Doubling the thickness of my MacBook Air would probably make it regress on that solid feeling, unless the weight was also significantly increased.
And regardless of whether current laptop form factors could accommodate a 2.5" drive, there's no use in doing so. That drive form factor is entirely obsolete for laptops and is just a waste of space and materials, and has been for about a decade.
fc417fc802 17 hours ago [-]
I wasn't saying that I want a 2.5 inch drive, I was merely listing off a number of rather large things that fit just fine within a 20 mm budget.
I'm not sure why you seem to think that making something thicker would reduce the stiffness or strength. It's generally the opposite - see the concept of a torsion box. Anyway that wasn't the point. The point was that regardless of how thin you make the thing it will forever remain a cumbersome and delicate item that you have to treat with care when packing so what meaningful positive impact does shaving off those last few mm have? It's never made any sense to me.
bloqs 14 hours ago [-]
They aren't, that was a push from manufacturers and PR. Find me one person that asked for a thinner phone after the iPhone 4
stego-tech 21 hours ago [-]
I came here to say just this myself! Modern DIMM formats make SFF/portable builds with unified memory pools far more plausible than prior designs. There's absolutely no reason desktop machines couldn't implement similar DIMM formats or design a new board standard around something similar.
Unified memory doesn't have to be soldered on or unserviceable. That's a choice Apple made because it fit their product vision, but it's not mandatory in the slightest.
Melatonic 3 hours ago [-]
Yup - we need pin based memory. Period. It's a physics thing.
CPUs don't slot in for a reason
ForOldHack 1 days ago [-]
Sir! I am typing this on a Lenovo Carbon X1, with soldered on ram, and you are EXACTLY CORRECT!
I would much prefer two SODIMM sockets with the option to go to 32MB shared video memory, or DDR4/DDR5. Give me OPTIONS!
arka2147483647 1 days ago [-]
There is LPCAMM2, if manufacturers want to use it.
So, it does not have to be soldered.
sroussey 20 hours ago [-]
LPCAMM2 is available in real systems at 7467MT/s and 120ns latency, vs apple (and intel) at 9600MT/s (and apple soldered memory at 100ns latency).
I don't know how linear or sensitive CPU and GPU benchmarks are to such a 20% slowdown, but i don't think Apple wants to pay it. And it looks like the next generation will be even closer to the SOC.
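For reference, the size of that gap under the figures quoted above (7467 vs 9600 MT/s, 120 vs 100 ns), as a quick sketch:

```python
def relative_shortfall(slower: float, faster: float) -> float:
    """How far the slower value falls below the faster one, as a fraction."""
    return 1.0 - slower / faster


rate_shortfall = relative_shortfall(7467, 9600)  # transfer rate gap
latency_penalty = 120 / 100 - 1.0                # extra latency as a fraction

print(f"transfer rate: {rate_shortfall:.1%} lower")
print(f"latency:       {latency_penalty:.0%} higher")
```

So the transfer rate is about 22% lower and the latency about 20% higher. How much of that shows up in benchmarks depends on the workload, as noted above.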
Melatonic 3 hours ago [-]
LPCAMM2 is also brand new. It likely will improve a lot.
We're also hitting the limit of DDR5 here (before moving to multiplexed)
I would guess if you had LPCAMM2 located physically around the CPU (one or two on each of the 4 CPU edges) you could also reduce that latency.
Lplololopo 13 hours ago [-]
It's still further away than the RAM on a packaged CPU, and latency is limited by the speed of light/electrons on that scale.
bpavuk 1 days ago [-]
how about the LPCAMM route? Framework uses LPCAMM2 in 13 Pro laptop mainboards and claims that it satisfies the iGPU and NPU hardware without needing soldered RAM
GeekyBear 1 days ago [-]
Until LPCAMM2 came along, using low power LPDDR RAM meant soldering RAM to the motherboard.
If you wanted to get sleep right and improve battery life, that was the trade off.
nottorp 15 hours ago [-]
> to get sleep right
Thought getting sleep right was something that happened before MS decided they need to be able to wake your PC any time they want and not hardware related much.
GeekyBear 9 hours ago [-]
Macs were known for far longer standby times while sleeping long before MS completely screwed the pooch with their "modern" standby.
nine_k 18 hours ago [-]
Maybe I won't care about upgradeability right now. The architecture is clearly in flux, the roles of traditional "CPU" and "GPU" are rapidly evolving. Maybe in 5 years, or even 3 years, a brand-new machine from 2026 won't be worth upgrading for a new role due to a seriously different architecture, but would only be relegated to do something "traditional".
MBCook 1 days ago [-]
Is that required or just a choice Apple made?
cm2187 1 days ago [-]
What do you mean by required? Apple's prices are notoriously disconnected from the cost of manufacturing.
MBCook 1 days ago [-]
I mean is it possible to make unified memory systems with good performance or is it not really feasible due to memory timing/trace length issues?
It’s possible if you’re willing to go with much slower RAM than GPUs like, the kind CPUs often use. That's what integrated graphics laptops have done for a long time, right?
But can you get high-end CPU and GPU performance with unified memory and maintain user-upgradable memory in a reasonable way? That's what I don’t know.
wtallis 1 days ago [-]
> I mean is it possible to make unified memory systems with good performance or is it not really feasible due to memory timing/trace length issues?
LPCAMM and similar solutions exist, but have never been demonstrated running at speeds that match what the leading soldered memory systems are using; there's always been some speed penalty. I'm not sure we've ever seen a system demonstrated using LPCAMM or similar for a 512-bit bus to match Apple's Max tier SoCs, so it's somewhat of an open question whether those solutions can offer upgradability at the high end of the market for unified memory systems.
AnthonyMouse 22 hours ago [-]
> LPCAMM and similar solutions exist, but have never been demonstrated running at speeds that match what the leading soldered memory systems are using; there's always been some speed penalty.
LPCAMM2 supports up to 9600MT/s, which appears to be the same speed Apple is using.
> I'm not sure we've ever seen a system demonstrated using LPCAMM or similar for a 512-bit bus
Servers commonly use a 768-bit DDR5 memory bus per socket even without LPCAMM and LPCAMM allows shorter traces than traditional DIMMs. It's basically down to most existing DDR5 system boards/sockets having been designed before anyone was trying to run LLMs on consumer hardware, e.g. AM5 has a 128-bit memory bus and you're not changing that without a new socket. But every memory generation gets a new socket anyway, and the existing Threadripper Pro socket has a 512-bit memory bus as well.
Moreover, making the bus wider is "easy" -- the main problem with it is that it adds cost. Apple's least expensive machines use the same 128-bit memory bus as most PCs and the ones with the 512-bit bus cost as much as Threadripper if not more.
wtallis 20 hours ago [-]
> LPCAMM2 supports up to 9600MT/s, which appears to be the same speed Apple is using.
The difference here is in what the standard defines on paper vs what is actually shipping in products and readily available off the shelf. Who's selling a whole system with LPCAMM2 certified for 9600MT/s? Intel's current-gen Panther Lake top of the line laptop chips are rated for 9600MT/s when using soldered LPDDR5x but only 7467MT/s when using LPCAMM2, according to their current datasheet: https://www.intel.com/content/www/us/en/content-details/8721...
That puts the current Intel-with-LPCAMM2 supported memory speed at 1.5 years and counting lag behind Apple's shipping memory speeds. Intel's own shipping memory speed moved past 7467MT/s a few months earlier than even Apple's.
> Servers commonly use a 768-bit DDR5 memory bus per socket even without LPCAMM and LPCAMM allows shorter traces than traditional DIMMs.
> Moreover, making the bus wider is "easy"
Citations needed. Servers aren't anywhere close to 9600MT/s yet; Intel and AMD are at 6400MT/s. The trace length advantages offered by LPCAMM2 don't necessarily mean the traces for the sixth or eighth channel would be short enough for 9600MT/s (which again, is not yet available even in a 128-bit configuration in shipping hardware). Adding more channels to even a LPCAMM2 configuration means adding more trace length, because only two modules can actually be adjacent to the CPU socket. (Maybe you could get to 512-bit with modules on the front and back of the board while maintaining trace lengths short enough to reach meaningfully higher speeds than regular DDR5, but so far nobody is doing that or even talking about it.)
Melatonic 3 hours ago [-]
Multiplexed DDR (MRDIMM) can go faster.
But for throughput, servers with 12 channels have a pretty high theoretical maximum even with slower memory.
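A sketch of what 12 channels buys in theoretical terms, assuming 64 data bits per DDR5 channel (so 12 channels is the 768-bit bus mentioned earlier) and the 6400 MT/s server speed quoted above. The other rows use bus widths and speeds from this thread for comparison; these are peak numbers, not measured throughput.

```python
def peak_bandwidth_gbs(bus_width_bits: int, transfer_rate_mts: float) -> float:
    """Theoretical peak in GB/s: bytes per transfer times transfers per second."""
    return bus_width_bits / 8 * transfer_rate_mts * 1e6 / 1e9


CHANNEL_BITS = 64  # assumed data width of one DDR5 channel

configs = {
    "12-channel server DDR5-6400": (12 * CHANNEL_BITS, 6400),
    "128-bit LPCAMM2 at 7467 MT/s": (128, 7467),
    "128-bit soldered at 9600 MT/s": (128, 9600),
    "512-bit soldered at 9600 MT/s": (512, 9600),
}

results = {name: peak_bandwidth_gbs(bits, rate) for name, (bits, rate) in configs.items()}
for name, gbs in results.items():
    print(f"{name}: {gbs:.1f} GB/s")
```

Under these assumptions the 12-channel server at 6400 MT/s has the same theoretical peak as a 512-bit bus at 9600 MT/s, about 614 GB/s.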
lelanthran 12 hours ago [-]
> LPCAMM and similar solutions exist, but have never been demonstrated running at speeds that match what the leading soldered memory systems are using;
Does it need to be leading, though? Being median is just fine for what high-RAM systems are intended to be used for.
ForOldHack 1 days ago [-]
You mean Apple prices are notoriously over priced, over hyped, under powered, and
"Abdul Jabar, couldn't have made these prices, with a sky hook."
QQ00 1 days ago [-]
Both. Soldered RAM is faster. Also, Apple doesn't want to offer upgradability after purchase.
ForOldHack 1 days ago [-]
Don't I/you wish. The mechanical junction adds no delay, only manufacturing expense, and the delay of purchasing new systems to keep up with OS bloat.
Actually the opposite is true. Socketed RAM can be overclocked and have its timings adjusted, while soldered RAM cannot. Two Lenovos: one soldered (Carbon X1), and one T590 with one slot: Crucial 16GB, 260-pin SODIMM, DDR4 PC4-19200. Exact same processor; the X1 has soldered-on DDR3 at 532.0 MHz, PC3-1066. The T590 has DDR4, PC4-19200, 1200 MHz.
Both have a Core i7 8665U... and the T590 is much faster, with socketed ram.
lmz 1 days ago [-]
I think you'll find that in the current day, high speed LP(?)DDR5 requires a better signal path than what the SODIMM can provide. Which is why laptop makers initially moved to soldered RAM before moving to CAMM (probably only for the high end ones).
ValentineC 20 hours ago [-]
I wish manufacturers could consider a hybrid approach. There should be no reason an architecture can't support both unified memory (effectively L4(?) cache), and cheaper, upgradeable system memory on sticks for old-school application use.
wtallis 17 hours ago [-]
Upgradable memory and unified memory aren't entirely mutually exclusive. You can design a chip that uses DDR5 and has a decently-powerful iGPU that can use that whole memory pool. But you'll be starving that GPU of bandwidth relative to what you'd achieve with soldered LPDDR, and it's not really worth the trouble of building a large iGPU unless you're also going to feed it with the fastest memory you can reasonably put down.
If you look at eg. an Intel laptop chip, you'll see they design and build a memory PHY that can interface with either DDR5 or LPDDR5x. They don't support splitting it to have one controller operating with DDR5 and the other with LPDDR5x, for fairly obvious reasons: more complex hardware, harder for software/operating systems to manage optimally, and not a lot of benefits to drive demand and justify the expenses. The speed difference between LPDDR5x and DDR5 isn't really large enough to use LPDDR5x as an L4 cache; it would be more like two different NUMA nodes, with complications for laptop power management.
If you want somebody to build a chip with more than the usual 128-bit bus and make some of the memory controllers use LPDDR and some DDR5, then you're asking for a significant increase in chip cost due to the extra memory PHYs and pin count. That cost is only justified if almost all products using the bigger chips are going to actually take advantage of the full complement of memory controllers.
Onavo 1 days ago [-]
Are there no PCIe standards that are sufficient to support both use cases?
What happened to PCIe 8 and CXL?
to11mtm 1 days ago [-]
AFAIK PCIe6 just started getting implemented in hardware last year... PCIe7 Spec was just released last year too...
PCIe6 is a much larger change than 'just bump up the transfer rate', the encoding changed too (on top of the new code length, it's no longer NRZ,) so everyone needed to design and validate both the new encoding block, negotiation, etc etc.
That said, I'm guessing PCIe7 will be a 'smoother' transition from PCIE6, i.e. we might see 7.0 products in 2027. That will theoretically get you ~240GB/sec, on an x16 link, or hypothetically a little less than the hypothetical max of a current Strix Halo. (I'm guessing however, that PCIe protocol overhead will make the difference larger.)
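A sketch of the raw numbers behind this comparison. It assumes 64 GT/s per lane for PCIe 6.0 and 128 GT/s for PCIe 7.0, one bit per transfer per lane, counted in one direction, and a 256-bit LPDDR5X-8000 memory bus for Strix Halo. Encoding and protocol overhead are ignored, which is why the ~240GB/sec estimate above sits below the raw figure.

```python
def pcie_raw_gbs(gt_per_s_per_lane: float, lanes: int) -> float:
    """Raw one-direction link bandwidth in GB/s, before any overhead."""
    return gt_per_s_per_lane * lanes / 8


def memory_peak_gbs(bus_width_bits: int, transfer_rate_mts: float) -> float:
    """Theoretical peak memory bandwidth in GB/s."""
    return bus_width_bits / 8 * transfer_rate_mts * 1e6 / 1e9


pcie6_x16 = pcie_raw_gbs(64, 16)
pcie7_x16 = pcie_raw_gbs(128, 16)
strix_halo = memory_peak_gbs(256, 8000)  # assumed 256-bit LPDDR5X-8000

print(f"PCIe 6.0 x16 raw: {pcie6_x16:.0f} GB/s")
print(f"PCIe 7.0 x16 raw: {pcie7_x16:.0f} GB/s")
print(f"Strix Halo memory peak: {strix_halo:.0f} GB/s")
```

On raw numbers a PCIe 7.0 x16 link and the assumed Strix Halo memory bus both come out at 256 GB/s, so after overhead the link ends up somewhat below the memory peak.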
tjoff 1 days ago [-]
Don't really buy the economic argument. For 99% of all workloads you need at least an order of magnitude more system memory than gpu memory.
Most systems barely need more gpu memory than what is required for video, browsing etc.
Just because we found a new usecase doesn't flip that on its head.
Besides, I want to keep doing what I'm doing today. So if I need 128GB today and my local AI needs 128 GB then I'd need 256 GB to keep doing the same work.
The argument rather seems to be that we shouldn't use such expensive memory on the GPU. Which might be true if you only want to do inference on it.
Joel_Mckay 14 hours ago [-]
Jensen Huang has publicly stated he wants a future where "AI" agents use more PC computers than people.
It is ambitious, and absurd... like all CEOs that eventually go loopy. =3
david-gpu 1 days ago [-]
DRAM optimized for CPU usage looks very different from DRAM optimized for GPU usage. You are leaving a lot of performance on the table when you have a unified memory architecture. It makes sense in some situations, but it is not a silver bullet.
jayd16 22 hours ago [-]
>[..] take full use of the PCIe bandwidth of the GPU or the bandwidth of its GDDR memory.
I'm honestly a little confused by what you mean here. Why would we want to maximize those things? Games are about consistent output under the frame deadline, not full saturation of the hardware.
Why would anyone try to saturate a 5090 with their game? The addressable market is tiny and you'd have to hope their full spec runs as well as or better than your test rig or they'll still not hit framerate.
simonbw 22 hours ago [-]
You could do some sort of adaptive quality where you spend time incrementally improving fidelity until your frame budget is up. In practice I think that might be trickier than it sounds, but I feel like theoretically there's something there that could get you the best graphics your rig can handle without dropping frames. I've been considering doing something like this when I've been building a game/engine lately.
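A minimal sketch of that idea, assuming fidelity can be split into an ordered list of optional refinement steps. The function name, the `safety_s` margin, and the injectable `clock` are all hypothetical; a real renderer would have to time GPU work rather than a CPU loop, which is part of why this is trickier than it sounds.

```python
import time


def refine_until_deadline(refine_steps, budget_s, safety_s=0.0, clock=time.perf_counter):
    """Run optional refinement steps in order until the frame budget is spent.

    safety_s is a guess at how long one step takes, so a step is not started
    when it would likely overrun the budget.
    """
    start = clock()
    completed = 0
    for step in refine_steps:
        if clock() - start + safety_s >= budget_s:
            break
        step()
        completed += 1
    return completed


# Usage with a fake clock so the behaviour is deterministic:
fake_time = {"now": 0.0}


def fake_clock():
    return fake_time["now"]


def fake_step():
    fake_time["now"] += 0.004  # pretend each refinement costs 4 ms


done = refine_until_deadline([fake_step] * 10, budget_s=1 / 60,
                             safety_s=0.004, clock=fake_clock)
print(done, "refinement steps fit in the frame")
```

Without the safety margin the loop would start a step it cannot finish in time and miss the frame, so the quality of the per-step cost estimate matters as much as the loop itself.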
Rohansi 20 hours ago [-]
There's only so high you can go because the game assets have a maximum quality. Maybe you'll be able to max out the 5090 but what about the next flagship GPU?
You're also likely not going to maximize all of bandwidth, compute, etc. because one of them will likely be your bottleneck. And it might be different depending on the GPU, too.
rustystump 15 hours ago [-]
Most games are strictly scaled on resolution due to how deferred pipelines run. This is exactly the slider to max or not max everything on a gpu for games. The more pixels the more memory and the more compute.
Rohansi 6 hours ago [-]
If you're rendering at native resolution, which many PC gamers do, going higher isn't significantly better because it just helps with antialiasing via supersampling. There's no point rendering so many more pixels just because you can, that's just a waste of electricity.
jayd16 6 hours ago [-]
The more pixels, the more compute of fragments but not necessarily more memory. A fragment might hit the same texel as an adjacent fragment.
Certainly not more from main memory, and maybe not more from the vram either depending on how the pipeline goes.
It's not a linear slider.
Retr0id 1 days ago [-]
Memory safety is orthogonal to side-channels, and hardware-enforced isolation (e.g. IOMMU) is more powerful than compiler-enforced isolation (but both are good!)
bobmcnamara 1 days ago [-]
Oh no now I have to worry about shaders row hammering my OS ram /s
Retr0id 1 days ago [-]
You really do have to worry about that!
RiverCrochet 23 hours ago [-]
Isn't this how the Xbox 360 got hacked? Not necessarily rowhammer but other methods.. IIRC some shader code in King Kong was able to affect CPU execution or something like that.
seemaze 1 days ago [-]
And here I am with 128GB Strix Halo longingly eyeing the Blackwell cards that spit tokens 10-20x the speed.
The question is the ultimate shape of knowledge compression and bandwidth optimization that we arrive at, I suppose.
canyp 1 days ago [-]
If you haven't already, check/increase the GPU memory carve-out on your UEFI.
that link actually recommends not doing it from UEFI and doing it via software
Salgat 1 days ago [-]
That was the main reason for the big hype around Memristors 15 years ago. High density, high speed persistent memory to completely remove the need for hdd/ssds, potentially even removing the need for external memory altogether. So frustrating that it still seems like we're a long ways from that becoming reality. There's some renewed interest in Memristors as they can simulate neural network connections in models, so maybe the funding will return for it.
zozbot234 1 days ago [-]
The one example of persistent memory that managed to reach the mass market was Intel Optane/3D XPoint (still popular today among people looking to save on RAM costs) and that used a kind of phase-change memory, which is only tangentially related to memristors. ReRAM is somewhat closer, but it's also been less successful so far.
Melatonic 3 hours ago [-]
Optane was still much slower than RAM, and not that much faster than NVMe (theoretically).
ForOldHack 1 days ago [-]
Well, back in the day... The Mac IIfx had video memory (dual-ported RAM) that could be read and written through different ports. Wicked fast. It took 486DX2s more than a year to catch up.
Lplololopo 13 hours ago [-]
" Even local AI use cases don’t substantially or meaningfully benefit from faster memory, at least to average consumers."
What do you mean by this? Memory bandwidth is fundamental to the speed of a local AI model.
pbalau 1 days ago [-]
What is the difference between unified memory and shared memory?
Shared memory has existed since the first CPU with an embedded GPU came to market, and you could set in the BIOS how much memory went to which component.
I do have an opinion about how unified memory could be different, but I want a proper explanation.
saltcured 1 days ago [-]
I'm not sure everyone uses the terms consistently, but the difference is that the old "shared" memory was reserving a section to act as VRAM under the control of the GPU, ignored by the OS. The CPU ran the same kind of code pretending there is a "bus transfer" between host memory and graphics memory.
In unified memory, all the memory is host memory and data can go from program to GPU with zero copy movements. The addresses of buffers can be shared via appropriate MMU translation support, so that the application and graphics subsystem are communicating effectively through the basic RAM cache coherency protocols over the same buffers.
Edit to add: Aside from the zero copy transfer potential, it also means dynamic allocation strategies can shift the balance between host and graphics allocations on the fly. Individual image and message buffers can be allocated on the fly instead of setting a static split between the two worlds.
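As a loose analogy only, the copy versus zero-copy distinction described above can be shown with plain Python buffers. This does not touch a GPU or an MMU; `staged_copy` and `zero_copy_view` are hypothetical names chosen for illustration.

```python
def staged_copy(host_buffer):
    # Old "reserved" model: the data is duplicated into a separate region,
    # so later writes by the producer are not visible in the copy.
    return bytes(host_buffer)

def zero_copy_view(host_buffer):
    # Unified model: the consumer gets a view onto the very same bytes.
    return memoryview(host_buffer)

frame = bytearray(b"frame-0")
copied = staged_copy(frame)
shared = zero_copy_view(frame)

frame[6] = ord("1")      # producer updates the buffer in place

print(copied)            # the duplicate still holds the old contents
print(bytes(shared))     # the view reflects the update without a transfer
```

In the real hardware case the "view" is a shared address mapping kept consistent by cache coherency rather than a language-level object, but the cost difference has the same shape: one path moves bytes, the other does not.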
johnny22 1 days ago [-]
Reserved sounds like it would have been a better term now that I'm reading this many years later.
stego-tech 21 hours ago [-]
You got it in one! That's exactly what makes unified memory superior for current use cases, and different from the shared memory woes of old.
pbalau 1 days ago [-]
That's my understanding, or, maybe a better word would be "guess". The CPU telling the GPU: this is your memory now.
surajrmal 22 hours ago [-]
To some degree this is how it already feels to program basically anything with DMA today. You map hardware into an IOMMU and stop touching the memory when the hardware is supposed to use it, and then you reclaim it afterwards. So the model from the OS side feels the same; the difference is that the hardware is not copying the memory into some local memory to operate on it.
ImprobableTruth 1 days ago [-]
Shared memory of the past meant reserving a part of the memory for the GPU, which could then not be used or accessed by the CPU. If the CPU wanted to access something, it had to copy it from the GPU's section of the memory to its own. Unified memory means both just fully share the same memory.
cthalupa 1 days ago [-]
For these in specific, they appear basically transparently to the GPU. There's a lot of software/firmware stuff for this, but also a different hardware architecture - while the RAM is on the CPU die, the nvlink-c2c gives it extremely low latency and 600GB/s bandwidth between the GPU and CPU.
Rohansi 1 days ago [-]
Marketing, mostly? But perhaps also more flexibility with how much memory the GPU can directly access without reserving it.
MBCook 1 days ago [-]
No. Let’s define terms, as others have pointed out they’re not perfect.
Unified memory is what Apple is doing, other phones do, and many low end built in GPUs have done in PCs for ages. There is only one physical memory pool. Both the CPU and GPU can access it at full speed.
This means no copying between pools of memory. No speed penalty accessing the CPU memory from GPU or vice versa. If the GPU only needs 2 GB to draw the desktop it only uses 2 GB of the pool. Or it can use 45 GB if it needs it and the CPU doesn’t. But all memory has to be the same speed, and that ain’t cheap given how fast GPUs like things. I don’t know if expandable memory is possible, and since they use the same bus they compete for bandwidth. Seems theoretically easier to program for to me.
The opposite is what’s been common in graphics cards since the 2D era. CPU and GPU have their own memory and can talk over PCI/AGP/PCI-E. This is what I think they mean by shared memory, if it’s not what’s the point in touting unified?
In this model if the GPU uses 2 GB of its 12 GB total, the other 10 isn’t available to the OS at full speed and I’m not aware of any operating systems that would use it for programs/cache by default. If the GPU needs 45 GB… too bad. You have to page things in and out of GPU memory over the much slower system bus. Starting a game means loading assets into main memory then transferring them to the GPU (newer tech can accelerate this). But the CPU can have slower memory than the GPU, saving money. Memory expansion on the CPU side is easy. And the CPU saturating its memory bus has no effect on the speed of the GPU memory bus because it’s physically separate. More complicated memory model but it’s the one everyone is used to.
Which is better is a matter of opinion and workload needs.
Rohansi 1 days ago [-]
Yes, I know there is an actual difference vs. dedicated GPUs with their own VRAM. I say it's marketing because Apple popularized the unified memory term even though, as you said, it existed in iGPUs long before Apple Silicon and was called shared GPU memory.
> I don’t know if expandable memory is possible
It technically is. These new systems (mostly) get their high bandwidth by using more channels (wider bus) of normal RAM modules. A system that has LPCAMM2 sockets should allow using the same LPDDR5X memory but you'd need a socket per two channels. A typical PC only supports two channels so having four (two sockets) would double the bandwidth.
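The "more channels, wider bus" point can be checked with the usual peak-bandwidth arithmetic: bytes per transfer times transfers per second. This is a rough sketch; the DDR5-5600 rate is an illustrative assumption, and the 28,000 MT/s figure for the 5090 is chosen because it reproduces the 512-bit, roughly 1.8 TB/s numbers quoted in this thread.

```python
def peak_bandwidth_gbs(bus_width_bits, mega_transfers_per_sec):
    """Theoretical peak bandwidth in GB/s (decimal), ignoring real-world overheads."""
    bytes_per_transfer = bus_width_bits / 8
    return bytes_per_transfer * mega_transfers_per_sec / 1000

# Typical dual-channel PC (128-bit) vs. a four-channel (256-bit) system at the same speed.
dual = peak_bandwidth_gbs(128, 5600)
quad = peak_bandwidth_gbs(256, 5600)
print(dual, quad)   # doubling the width doubles the peak figure

# 512-bit bus at 28,000 MT/s, matching the 5090 numbers cited in the thread.
print(peak_bandwidth_gbs(512, 28000))
```

Sustained bandwidth is lower than these peaks, and the function says nothing about latency.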
MBCook 1 days ago [-]
Bandwidth by going wider, not faster. That makes sense.
Gareth321 1 days ago [-]
System RAM has much lower bandwidth and less predictable access. Notably, the transfer from system to GPU is very slow. About 30x slower. LLMs aren’t designed to queue or parallelise operations to account for this. They just become much slower.
NikolaNovak 1 days ago [-]
The "one big drawback" is the lack of consumer upgrades, and the seemingly arbitrary prices charged by vendors for memory upgrades at time of system purchase. I'm not saying it has to be that way, but seems like it has been so far :-(
wren6991 21 hours ago [-]
> Even local AI use cases don’t substantially or meaningfully benefit from faster memory, at least to average consumers.
I'm not sure what you mean by this. Memory bandwidth is the main bottleneck for single-user decode. The bottleneck is actually more severe for end-user inference than cloud inference, because end users don't have the option to increase arithmetic intensity by computing tokens for multiple clients in the same pass.
One thing we've learned from Apple is the viability of spamming more LPDDR5X channels (up to 1024-bit total bus width on M3U) as a means of achieving high bandwidth while keeping the cost/capacity reasonable.
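A back-of-the-envelope sketch of why single-user decode is bandwidth-bound: for a dense model, each generated token has to stream roughly all the weights from memory once, so bandwidth divided by weight size gives an upper bound on tokens per second. The function name and the 1 byte per parameter (8-bit weights) choice are assumptions for illustration; KV-cache traffic, MoE sparsity and compute limits are ignored.

```python
def decode_tokens_per_sec_upper_bound(bandwidth_gbs, params_billions, bytes_per_param):
    """Bandwidth-only ceiling on single-stream decode speed for a dense model."""
    weights_gb = params_billions * bytes_per_param
    return bandwidth_gbs / weights_gb

# Bandwidth figures quoted in this thread, with a 27B dense model at 8-bit weights.
for name, bw in [("275 GB/s class", 275), ("614 GB/s class", 614), ("1792 GB/s class", 1792)]:
    print(name, round(decode_tokens_per_sec_upper_bound(bw, 27, 1.0), 1))
```

Batching several requests reuses each weight read across multiple tokens, which is the arithmetic-intensity lever that a lone end user does not have.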
jorvi 1 days ago [-]
Yeah, no. GDDR is functionally very different than SDRAM.
GDDR tries to push out as much bandwidth as possible, because that really matters for (traditional) GPU workloads. A constant but insignificant (= correctable) error rate is considered completely fine for GDDR, because that sacrifice allows the memory to be pushed much farther.
Meanwhile most (traditional) SDRAM workloads don't give a hoot about bandwidth but really care about latency. And ideally you want no errors, hence ECC RAM being so venerated.
If you unify memory, you're gonna have to choose to sacrifice one of those workloads or go suboptimal for both.
Weirdly enough this mostly matters for non-gaming workloads. The Apple M-series are absolute monsters in gaming, completely crushing the RTX XX90 editions in performance-per-watt, but as soon as memory bandwidth becomes paramount the M-series falls heavily behind.
GTP 1 days ago [-]
While I'm a supporter of Rust, I have to point out that Rust's memory safety doesn't help against side-channel attacks.
jmyeet 1 days ago [-]
Unified memory is only a feature because NVidia so aggressively uses VRAM for market segmentation.
The 5090 ($2k MSRP but realistically $3-3.5k) is almost the same as the RTX 6000 Pro (~$10k). Same memory bandwidth (1800GB/s). Slightly different CUDA cores (21k vs 24k). Big difference? VRAM (32GB vs 96GB).
NVidia ultimately doesn't want to upset this segmentation so the RTX Spark will never undermine their other offerings. This is why I think Apple has a real market opportunity if they choose to embrace it.
Salgat 1 days ago [-]
To this day I do not get why Intel doesn't just offer massive memory options for their cards. Just charge what it costs to add the extra memory, no upcharge, and they will never be able to keep up with demand. Cheap VRAM is enough to justify a lot of open source investment into challenging CUDA.
zozbot234 1 days ago [-]
> To this day I do not get why Intel doesn't just offer massive memory options for their cards.
They seem to? Intel Arc is the cheapest option by far for a discrete card with 32GB VRAM.
to11mtm 1 days ago [-]
They took longer than everyone expected and then shortly after release they made announcements that made people worry that Intel might kill the project the way they tend to kill GPU projects.
(I still kinda want to get one tho.)
Auracle 1 days ago [-]
That’s not massive, though. Make it 96GB at $2,000 (ok, probably impossible right now, but they could have before the surge in prices) and you’ll see developers work really hard to make AI tooling work for their cards, CUDA be damned. The same goes for AMD.
It’s like they both want to rely on market segmentation for VRAM too but fail to realize that it’s their only potential inroad right now.
zozbot234 1 days ago [-]
If you buy three 32GB GPUs, that's 96GB total at a very reasonable price. An AI model splits easily by layers, so running on multiple GPUs is quite feasible.
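A minimal sketch of what "splits by layers" means in practice: contiguous blocks of layers are assigned to each GPU until its memory is full. The function, the uniform 1 GB layers and the 30 GB usable capacity per 32 GB card are all hypothetical; real frameworks also have to place the KV cache and activations.

```python
def split_layers(layer_sizes_gb, gpu_capacity_gb, num_gpus):
    """Greedily assign contiguous layers to GPUs; returns one list of layer indices per GPU."""
    assignment = [[] for _ in range(num_gpus)]
    gpu = 0
    used = 0.0
    for i, size in enumerate(layer_sizes_gb):
        if size > gpu_capacity_gb:
            raise ValueError(f"layer {i} does not fit on a single GPU")
        if used + size > gpu_capacity_gb:
            # Current GPU is full: move on to the next one.
            gpu += 1
            used = 0.0
            if gpu >= num_gpus:
                raise ValueError("model does not fit on the available GPUs")
        assignment[gpu].append(i)
        used += size
    return assignment

# 80 layers of 1 GB each across three cards with 30 GB usable apiece.
plan = split_layers([1.0] * 80, gpu_capacity_gb=30.0, num_gpus=3)
print([len(layers) for layers in plan])
```

With this kind of split only the activations at each boundary cross the PCIe bus, which is why it tolerates slow interconnects, but the GPUs take turns working on a single request rather than running in parallel.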
schubidubiduba 7 hours ago [-]
Doesn't split as easily on an Intel GPU as on an NVIDIA GPU though, regarding software support. Sure, it's probably not too difficult if you know what you're doing, but not sure how big that market would be.
htrp 1 days ago [-]
Missed a zero here.
Needs 320 GB Vram
ActorNightly 1 days ago [-]
Memory is just one part. AMD has had offerings competitive to NVIDIA for quite some time, but nobody uses AMD cards.
The biggest advantage with NVIDIA is CUDA.
overfeed 22 hours ago [-]
> but nobody uses AMD cards
AMD is selling every MI card it makes, and the market wants more of them.
ActorNightly 2 hours ago [-]
They are only selling because Nvidia is hard to get, and something is better than nothing.
Melatonic 3 hours ago [-]
It's also ECC ram but to be fair - yes quite overpriced. The RTX Pro line are basically what the Titan line used to be but way way more expensive.
dahart 1 days ago [-]
I have so many questions… Since Apple already sells unified memory systems, what is the market opportunity you envision? Do you see Nvidia and Apple as competitors, and how? (And I’m not suggesting they’re not, necessarily, but I want to hear where you’re coming from, and they do have very different markets.) Hasn’t Apple used storage size (RAM & disk) for market segmentation for decades? And how does a machine with 128GB unified mem not potentially cut into some people’s reasons for wanting a 96GB GPU?
JohnBooty 1 days ago [-]
I'm not the person you're replying to, but I wholeheartedly agree with them...
Quick background: doing AI inference requires three things. Lots of memory, lots of memory bandwidth, and of course plenty of compute that has access to that memory.
Quick reference: nVidia 5090 has 1,792 GB/sec bandwidth. 3090 gets about 1000 GB/sec. DGX Spark and AMD 395 whatever get about 275 GB/sec.
Apple M1 Max gets 400GB/sec, M5 Max gets 614GB/sec. Ultra variants get 2x that bandwidth, base variants get 1/2 that bandwidth. However... their compute is rather weak.
Right now, Apple's offerings are juuuuuust fast enough to run dense 27B models at usable speeds at like, 10% of the performance/watt of nVidia. They're world-leading general purpose CPUs but not killer GPUs.
By all accounts, these Windows PCs nVidia is touting seem to have DGX Spark like performance, which is less than impressive. Same with the upcoming AMD AI-oriented consumer stuff.
The other context here is that running your own AI at home is just starting to become feasible in terms of open model availability and the ability to run it at usable speeds. Many are interested in it for reasons of privacy, security, and cost certainty vs. buying tokens.
> Since Apple already sells unified memory systems, what is the market opportunity you envision?
nVidia and AMD can't make their consumer offerings too good at AI, because that risks interfering with their higher-margin data center sales.
(And, let's face it. Even if nVidia did release a 6090 with 64-128GB of memory for an affordable price, consumers wouldn't get their hands on them anyway because people would just start filling data centers with them)
So.
Now you see Apple's opportunity, right? No data center sales to interfere with. No relationship with nVidia or AMD to worry about.
They could choose to make an absolute beast of a home AI machine. The M5 Ultra, if announced, might be that. It's admittedly a niche market, but people are already buying 64GB+ Macs faster than Apple can make them and they're fetching high prices on the used market as well.
The only real questions are if this market is even something Apple would find time to care about, and if they could secure enough DRAM to make a go at it. They are enormous obviously but they're feeling the RAM pinch just like everybody.
zozbot234 1 days ago [-]
They use different technology for their VRAM though. Apple, AMD Strix and NVidia DGX/RTX Spark use LPDDR, whereas discrete cards will be either GDDR or HBM. That directly impacts the memory bandwidth figures. As for compute available, Apple and AMD still have very good figures there for what's essentially a general-purpose iGPU that ships as part of the stock system, rather than a special-purpose piece of dedicated hardware.
robotresearcher 23 hours ago [-]
The M5 has 16 dedicated ‘Neural Engine’ cores and a ‘Neural accelerator’ in each of its conventional GPU cores. It’s been pretty special-purpose juiced for inference.
zozbot234 23 hours ago [-]
When it comes to the very largest models the ANE seems to be only marginally useful for prefill. The M5 Neural Accelerators (NAX) help a lot but at a real cost wrt. power and thermals.
robotresearcher 23 hours ago [-]
Yep, but Apple products don’t spend most of their time running huge models. They are running lots of little ones all the time, using hardware designed for that.
zozbot234 23 hours ago [-]
It seems that you're agreeing with what I wrote above. They ship a general-purpose stock system and tailor their compute offering towards that. Accelerating 'lots of little models' fits naturally into what they offer, in a way that a more compute-intensive design might not.
robotresearcher 22 hours ago [-]
Yep, I misunderstood your point. Thanks for your patience. In my defense, the 'general purpose system' has a lot of model-inference-specific hardware. But not LLM-specific hardware.
If there's an M5 Ultra it'll be interesting to see what they've optimized it for.
MBCook 1 days ago [-]
There’s something else. Memory size.
Even if a Mac isn’t the fastest in raw numbers it may be faster if it can load the whole model in its ram (went up to 512 GB before shortages) than a couple 32 GB cards could with the data having to be constantly loaded over PCI-E. Because unified memory means the Apple GPUs can access all 512 GB at full speed.
My understanding is this is the advantage that’s pushing huge Mac Studio demand. Because it was the only way to give GPUs so much memory at price points anywhere near.
Yeah you can do way better once you’re in the 5 digits. But below that Apple had a specific advantage for some.
JohnBooty 24 hours ago [-]
You're correct about some things but mostly wrong.
Yes, a Mac with 128GB+ will let you load some pretty big models.
However, you're still not going to be able to run them at usable speeds. Here are some M5 Max benchmarks on a Qwen 27B model w/ 290K context.... 12 tokens/sec output.
And that's a 27B model. So yes, a M5 Max 128GB will let you load some pretty big models - can probably fit 120B in there with room left over for context. But the M5 Max still doesn't have the compute to make it practical, at least from an interactive usage standpoint - 120B dense model is going to be like an order of magnitude slower than 27B. You have to understand the computation going on here. LLMs are basically a huge many-to-many operation, and those operations themselves are pretty heavy.
So back to my previous post... you need three things. You need fast memory, you need a lot of it, and you need GPU compute with direct access to that fast memory. The M5 Max has like, 1.5 of the 3.
The M5 Ultra (if it ever exists) could kinda hit all 3, although actually getting your hands on one will be quite the lottery ticket.
> My understanding is this is the advantage that’s pushing huge Mac Studio demand.
This is true, but also, people who made this investment found that they're still not very usable for those HUGE models. Don't take my word for it though. Lots of benchmarks out there. r/localllama is pretty active too.
zozbot234 24 hours ago [-]
12 tok/s can absolutely be "usable output" depending on what you're doing. I agree though that the 27B dense model often feels slow due to an overall weakness of memory throughput on that particular platform. Most real-world 120B models though will be MoE-based with only a small fraction of active parameters, and these run quite well. Also, dense models can benefit from batching, which is at least marginally viable with Qwen if you stick to shorter contexts and smaller batches.
jmyeet 1 days ago [-]
Apple offers relatively affordable options for a high-memory workstation that uses unified memory. They previously offered 256/512GB Mac Studios (both discontinued). Because of this they can keep larger models in memory.
BUT you just can't compete with NVidia performance for LLM workloads (mostly inference) for two reasons:
1. The memory bandwidth just can't compete with a 5090 (1800GB/s). The best current Mac is ~900GB/s. That directly caps tokens/sec and might be manageable but there's another problem; and
2. The raw FLOPS just can't compete with even a 5090. It probably needs to natively support FP4/FP8 to at least maintain a number format parity with NVidia. But beside that, NVidia just has more raw FLOPS.
According to Google, an M5 Max does ~70 FP16 TFLOPS while a 5090 does 380. If Apple can close that gap to at least be competitive and also hold larger models in shared VRAM, that would be a competitive advantage and it would directly attack NVidia's market segmentation.
The Mac Studio last came out in March last year. So we may get an update in Q3. Many are pinning their hopes on this. But it might not happen until next year. When it was released the M4 was the state of the art and it came with either the M4 Max or M3 Ultra (which, as I understand it, is basically 2 M3s stuck together, kind of). What people are hoping for is an M5 Ultra with >1000GB/s of memory bandwidth, ideally 200+ FP16 TFLOPS and hopefully FP4/FP8 support.
You can chain Mac Studios together into a cluster with TB5 too.
But it's reasonably likely that the next Mac Studio will be only incrementally better than the last generation.
zozbot234 1 days ago [-]
Even low-VRAM cards are actually very useful for running the comparatively smaller dense layers in large local MoE models. This only requires transfering very small amounts of data across the PCIe bus (similar to pipeline parallelism) so it fits nicely around the existing bottlenecks on that hardware.
woodson 1 days ago [-]
> 5090 ($2k MSRP but realistically $3-3.5k)
These days, more like >$4.1K (at least in the US).
simonebrunozzi 1 days ago [-]
What should Apple do, in your view, to "embrace" it?
Nevermark 23 hours ago [-]
Mx Extreme = 2 x Mx Ultra = more cores. (Opportunity: processor chiplets could be designed to integrate in higher quantities.)
Increase RDMA cross-bar linking from 4x to 8x = a lotta ports, a switch, or a stacking interface.
Regular RAM size/speed scaling: 512GB -> 1TB Mac Studios. Wider RAM and RDMA paths * clocks.
Given the low power envelope of today's Mac Studios, and bandwidth limits, lots of room to scale up, if Apple chooses. My fantasy: 2x cores, 2x RAM sizes, 2x RDMA devices, 2-4x RAM & RDMA bandwidth.
pjjpo 19 hours ago [-]
Isn't the big drawback not having a swappable GPU? Perhaps that's not as important anymore but I'm not sure we've confirmed the market demand for that.
aabdi 1 days ago [-]
If this thing only has as much GPU bandwidth as the Spark, it’s kinda pointless.
cthalupa 1 days ago [-]
Not true. This is aimed squarely at the Strix Halo and Mac markets. It's basically just strictly better than the Strix, and it's not clear cut vs the Macs in any sort of blanket statement.
My M5 Max 128GB MBP decodes faster than one of my Sparks, but the Spark's prefill is so much faster it can often answer the same query before the Mac's prefill is finished. If you have large prompts, low cacheability, etc., a Spark might be a very good option.
Not to mention you can get two Sparks and the MBP will be 85%+ of the cost at half the RAM.
I'm kind of tempted to pick one up. Leave running big models to my dual dgx setup, and all the misc. random stuff on an rtx.
zozbot234 1 days ago [-]
Prefill will be a huge deal if batched unattended inference of SOTA models (on consumer platforms) becomes viable, because at that point it's the main remaining bottleneck. If running 30 inferences together boosts your decode throughput to 3x (that's consistent with some very rough experiments, though these haven't even looked at trying to mask SSD offload latency just yet), that's a 10x in total decode time but a 30x in total prefill time, because prefill workloads are fully compute bound already on consumer platforms and don't benefit from batching much at all.
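The arithmetic in the comment above, written out as a small sketch. The 30-request batch and the 3x decode speedup are the commenter's rough figures; the function name and the assumption that prefill gets no batching benefit at all (speedup of 1) are simplifications for illustration.

```python
def batched_cost(n_requests, decode_speedup, prefill_speedup=1.0):
    """Total time for a batch, in units of one request's decode and prefill time."""
    total_decode = n_requests / decode_speedup
    total_prefill = n_requests / prefill_speedup
    return total_decode, total_prefill

decode_x, prefill_x = batched_cost(30, decode_speedup=3.0)
print(decode_x, prefill_x)   # decode grows 10x, prefill grows the full 30x
```

So as batching makes decode cheaper per request, prefill becomes the larger share of the wall-clock time.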
aabdi 19 hours ago [-]
Fair, but I don’t see what case you have w this. Mind sharing?
Seems niche to be both uncacheable and long context?
Asmod4n 1 days ago [-]
Yeah, you only see double-digit performance degradation going from PCIe 5 to 3 with a 5090 (at x16 speed); with everything else it's in the single digits.
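For scale, a sketch of the theoretical per-direction PCIe bandwidth for the generations mentioned, using the published per-lane transfer rates and the 128b/130b line encoding used from PCIe 3.0 through 5.0. The function name is hypothetical and protocol overhead beyond line encoding is ignored.

```python
# Per-lane raw signalling rate in GT/s for each PCIe generation.
GT_PER_SEC = {3: 8, 4: 16, 5: 32}

def pcie_bandwidth_gbs(gen, lanes=16):
    """Theoretical one-direction bandwidth in GB/s after 128b/130b encoding."""
    payload_fraction = 128 / 130
    return GT_PER_SEC[gen] * lanes * payload_fraction / 8

for gen in (3, 4, 5):
    print(f"PCIe {gen}.0 x16: {pcie_bandwidth_gbs(gen):.2f} GB/s")
```

Even the PCIe 5.0 x16 figure is roughly 28 times below the 1,792 GB/s VRAM bandwidth quoted for the 5090 elsewhere in this thread, which is why workloads that stay resident in VRAM barely notice the link speed.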
stego-tech 1 days ago [-]
And the thing we gamers forget is that we’re the outlier. We’re the edge case.
Most consumers will never really care about, let alone see, the difference in PCIe or memory bandwidth impacts from such a shift to unified memory pools. We might (being, at least in my case, a huge nerd), but I’m increasingly of the opinion that if modern blockbuster games are built for upscaling/reconstruction anyhow, then suddenly such sacrifices to performance seem acceptable relative to the gains in efficiency.
jayd16 1 days ago [-]
Well I mean, the idea with games is it all fits in vram. You really don't want to be thrashing. It's that things are still so slow that they must be avoided entirely, no?
No copy unified memory will help with that but you do pay the read speed costs.
BoredPositron 1 days ago [-]
gen3 is 16 years old.
nalekberov 1 days ago [-]
It’s also the reason why you will never be able to repair or upgrade your computer in the future. From a technological point of view these are indeed big advancements.
However, I couldn’t care less about faster CPU when:
1. It limits my ability to upgrade my system
2. Windows gets increasingly bloated and slower
merb 1 days ago [-]
LPCAMM2
supertroop 1 days ago [-]
Intel was doing UMA with their i740 graphics in the late 90s. Codename TIMNA was cancelled, but they pioneered it and used it on their GPU/CPU chips as well as their breakthrough 810 chipset that dominated the graphics market for a decade. It was despised because it was ubiquitous and a low-performing graphics engine, but games had to accommodate it.
Funny that it is getting credit only now.
p_l 1 days ago [-]
SGI O2 was the famous "unified memory architecture" graphics system, two years before i740 that didn't really do UMA.
O2 was popular in systems where large textures or dynamically generated textures (like mapping external video input to a texture) were important.
up2isomorphism 1 days ago [-]
This kind of post shows you have little idea why cpu and gpu are not sharing memory in the first place.
Izikiel43 1 days ago [-]
> The Unified Memory pool is what will continue to be the “game changer” in systems architecture, especially outside of data centers.
The ps4 was the prime example of this, and how it could run so many great games.
testing22321 1 days ago [-]
> The Unified Memory pool is the “game changer”
M1 knocking from 2020.
Game changed, past tense, six years ago.
This is catch-up.
jandrese 1 days ago [-]
Hell, SGI O2s from 1996 had this. For all of the hype the performance gains were pretty modest.
zdw 1 days ago [-]
FWIW, the O2's UMA let it handle far more textures than almost any other contemporary system with reasonable performance.
Most other SGIs had single or low double-digit megabytes of texture memory, whereas the O2 could host one gigabyte of unified memory and use a huge chunk of that for textures.
wmf 1 days ago [-]
UMA was never about performance and it still isn't. Spark is slower than a 5090.
JMiao 1 days ago [-]
did they learn why? were there other gains?
p_l 1 days ago [-]
O2 GPU was slower than other SGI options at the time, however it could use hilariously larger pool of memory without copying, which meant that O2 could use approaches that were punishingly hard (very tight transfer loops) or impossible (huge textures that couldn't be virtualized due to needing whole texture).
That was because unlike other GPUs at the time, O2's didn't have dedicated memory but shared the memory with CPU - way slower, but zero copies and bigger.
Arguably early home computers and workstations also used "unified memory" :D
Rohansi 1 days ago [-]
FYI it existed long before that. Shared memory between CPU and iGPU has been a thing for a long time.
Ah. Well, what kind of consumer hardware/software combo could I purchase to use this? outside of perhaps the... PS4?
Rohansi 1 days ago [-]
Everything that doesn't have a discrete GPU has unified memory these days. If you're asking for something closer to the RTX Spark or Apple Silicon then look at AMD's Strix Halo systems.
throwaway27448 24 hours ago [-]
> Everything that doesn't have a discrete GPU has unified memory these days.
Sorry, I meant before the M1 came out. And you and I both know that "unified memory" doesn't refer to allocating ram to the gpu for zero-swap sharing.
Rohansi 23 hours ago [-]
> Unified memory is supported on Linux by all modern AMD GPUs from the Vega series onward
Every AMD APU since introduction of HSA did it, which is how AMD ended up doing SoCs for PS4, PS5, and Xbox
throwaway27448 24 hours ago [-]
Ok, so which one of these contemporary or previous chipsets could compete with the M1 for inference? Perhaps I'm missing some major detail.
Rohansi 22 hours ago [-]
You are missing a major detail: integrated GPUs are crap. They win on efficiency but not on raw compute. Before AI (and crypto too, I guess) people bought GPUs to render graphics and that was their main consumer. People built more and more demanding games that required increasingly powerful GPUs to render well. Gaming systems always had a discrete GPU so there was no reason to scale up integrated GPUs because they wouldn't sell, or they would be a waste of die space.
I don't think the M1 specifically focused on inference. Their goal was to replace Intel/AMD/Nvidia with their own chips, and since the previous Macs shipped discrete GPUs, they had to match or beat those so they don't ship something slower.
ac29 23 hours ago [-]
The M1 isn't particularly good at inference, so pretty much every major current competitor with a 256+ bit unified memory system is better: AMD Strix Halo, NVIDIA DGX Spark, possibly Intel Panther Lake.
throwaway27448 10 hours ago [-]
Sure, but none of these shipped before the M1. That was the first chip I encountered that managed to do something useful without a discrete GPU.
p_l 1 hours ago [-]
Even in Apple land the M1 isn't the first with unified memory. Pretty much all Intel on-chip GPUs (Sandy Bridge and newer) count; it was even a reason for driver issues early in Intel's new dedicated GPU lineup, as the drivers expected unified memory. But the M1 is essentially a modification of an iPad chip, and you can see "unified memory" there going all the way back to the first chip Apple bought from Samsung to power the iPhone 2G.
Rohansi 6 hours ago [-]
You were likely only using Intel systems then. AMD systems have had iGPUs capable of light gaming for a very long time. It took Intel a long time to get to that level, after M1 (2020).
bombcar 1 days ago [-]
I want unified but not uniform - everything can address anything, but you can add slower RAM to the system without requiring an entirely new chip. NUMA is cool.
zokier 1 days ago [-]
AMD Fusion knocking from 2010.
vlovich123 1 days ago [-]
> (which is good for Rust adherents, I figure).
As a Rust adherent, please do not put words in our mouths or set up unrealistic expectations for other people by linking together concepts at a very shallow level.
Language level memory safety has no answer for hardware security flaws which is what side channel attacks are. No programming language can provide memory privacy if another chip in your machine can read your memory. Just like no programming language can protect your application from a kernel vulnerability of the kernel it’s running on.
stego-tech 1 days ago [-]
Damn. That wasn’t my intention at all, I was just pointing out that Rust has another reason to see wider adoption vis a vis the usual Valley advertising bullshit of deliberately conflating hardware security with software security. I personally give no fucks what something is written in, only that it’s written well enough that I don’t have to twist arms or babysit yet another sloppy piece of code in my enterprise.
b112 1 days ago [-]
But... it's rust.
infecto 1 days ago [-]
"I am not sure how many people will run AI models locally. It still seems like a niche application to me. However, it will make decent machines to play video games."
I don't know who will be the winner but with some of the recent releases from gemma it seems more probable that you may run some models locally if only from a cost perspective, not even considering business security. Not sure how this type of architecture would make for good gaming though, puts into question the whole statement.
"Ranked in the top 2% of scientists globally (Stanford/Elsevier 2025) and among GitHub's top 1000 developers" - side note but this guy puts this everywhere, gives me probably the inverse of what he is marketing for.
root-parent 1 days ago [-]
"I am not sure how many people will run AI models locally. It still seems like a niche application to me. However, it will make decent machines to play video games..."
This is the 2026 edition of Ken Olsen:
"There is no reason anyone would want a computer in their home"
throw0101a 1 days ago [-]
> This is the 2026 edition of Ken Olsen: "There is no reason anyone would want a computer in their home"
Digging into this:
> In conclusion, there is evidence that Ken Olsen did doubt the need for computers in the home, but the evidence is based primarily on the testimony of David Ahl who was perturbed when the personal computer project he championed at DEC was not supported by Olsen in 1974.
> Olsen’s resistance may have been similar to that expressed by another DEC executive, Gordon Bell. In 1980 Bell thought home terminals would act as gateways to remote computers which would provide appropriate services.
It was supposedly said in 1977: most computers at that time were not small, and so it would not be surprising that people would not expect the general public to desire a large, power-hungry, noisy apparatus in their house.
wccrawford 1 days ago [-]
That's exactly the point. Until recently, AI models that could run on home machines were so bad that it was very hard to imagine anyone wanting to.
And, like the overly large machines of 1977, models are getting faster, leaner, and better. It's happening a lot quicker, though.
Silhouette 1 days ago [-]
This is why I'm bearish on Anthropic, OpenAI, and friends. I am not confident that we will continue to see the same pace of improvement in frontier model capabilities as we have seen over the past year or two - not using similar mathematics at least. But I think that getting results that are close enough to the same standard to be a realistic substitute but in a model small enough to run locally may well happen quite quickly. And if it does - where is the moat to defend these AI organisations with their astronomical budgets when they're already starting to price more realistically and that's already killing a lot of the hype they've enjoyed until very recently? They have an accidental moat because they bought up the global supply chain for storage but that surely isn't going to last once the data centres to hold that storage are becoming liabilities.
throw0101a 8 hours ago [-]
> This is why I'm bearish on Anthropic, OpenAI, and friends.
Just because you can do more and more things at home (thanks Moore and Dennard), doesn't preclude needing things also done remotely. The number of at-home systems seems to have fed a growing number of remote systems (especially once always-on connectivity became ubiquitous).
It's basically the angle Apple is going for: do as much locally (for the sake of privacy), and then offload when it becomes "too much".
api 1 days ago [-]
If model performance asymptotes and CPU/GPU and RAM keep growing, even slowly, then eventually we will have frontier models on desktop that are totally competitive with hosted. It’s only a matter of time.
You already can if you’re willing to spend many thousands of dollars on a beast of a machine. I’m talking about middle tier desktops and laptops here. Maybe eventually even phones.
The only way hosted stays strongly competitive in that world is if they can keep pushing the frontier or by playing the classic social media and SaaS games of network effect building and integrations.
Many people might still use hosted, of course, but what I really mean is that their multiples won’t be justified and they will have little to no moat. AI will become commoditized, like a sophisticated next generation form of an encyclopedia with search.
kristov 1 days ago [-]
We kinda ended up with terminals connected to mainframes anyway. The terminal being the web browser, and the mainframe being SaaS. So it wasn't that far off.
supermatt 1 days ago [-]
the network is the computer
parineum 1 days ago [-]
It doesn't really need this much explanation.
People take these quotes out of context all the time. Said in a business context, there was no need, at that time, for someone to have a personal computer.
There's no business justification in 1977 for a personal computer department at a business. It's similar to the gates quote about RAM (I think it was 64KB?).
These statements aren't meant to be forever quotes. They're business plan quotes.
michaelcampbell 1 days ago [-]
> It's similar to the gates quote about RAM (I think it was 64KB?)
640, and Bill Gates said he either never said that, or at least never remembered having said it. I think there is no evidence anywhere that he did.
That exact quote? No, never.
He said something like: current computers at the time had 64kb of RAM, so the OS was designed with a limit of 640kb, and he believed this would give them 10 years of future proofing. As it happened, that limit was reached much faster, in about 6 years.
valleyer 14 hours ago [-]
MS-DOS didn't create that limit; the physical memory map of the 5150 did. So Microsoft (and Gates) would not have made that decision.
glimshe 1 days ago [-]
Or maybe he simply made a mistake. Big deal. This doesn't speak negatively of his other achievements.
shermantanktop 1 days ago [-]
He had a long career and presumably many successes, and is fallible like the rest of us. But a half-remembered zinger with no context makes for zippier posts I guess.
The early popularity of Minitel, the continued popularity of ssh/tmux, and the web browser itself indicate that bespoke client applications are not the only way. He wasn’t directionally wrong.
wslh 1 days ago [-]
The simple explanation is that predicting the future is generally impossible. It doesn't matter if it's Olsen or anybody else.
dakolli 1 days ago [-]
I will not be spending thousands in hardware to run the world's most mediocre llms at meh speeds. Sorry. I know for llm bros they think every output made by an LLM is magic, like every NFT guy thought every NFT collection was game changing, but there's nothing useful you can do with llms and 128gb of RAM (and there never will be) unless you have llm psychosis. Who cares.
Gigachad 23 hours ago [-]
Nothing isn't quite right but you wouldn't be using it like the hosted ones. 128gb is more than enough to run models to index my files and photos, denoise photos / AI photo masking, magic eraser type tasks for images, frame generation for gaming, etc.
Even for a lot of LLM type tasks, 128gb is likely more than enough to control a lot of PC configuration and automation with natural language.
Nobody ever said that, at least not as an assertion or prediction. The actual instances of similar language are from multiple people describing their earlier thoughts before they learned it wasn’t true.
throw1234567891 1 days ago [-]
There’s no public proof this has ever been said, and if it was, that it was not taken out of context.
DonHopkins 1 days ago [-]
I have that many browser tabs.
fg137 1 days ago [-]
You seriously think running LLM is the same thing as general computing?
ako 1 days ago [-]
It’s better, it’s useful even for those who don’t have a deep knowledge of computers. I’d expect more AI users than programmers, than ms-word users, than excel users.
fg137 11 hours ago [-]
You are confusing "using AI" with "running LLM locally".
AaronAPU 1 days ago [-]
That’s too strong of an assertion.
Local models aren’t deterministically equivalent in capabilities to foundation models. Home computers are Turing complete, just like a mainframe. They are just slower. Often not slower enough to matter.
sandworm101 1 days ago [-]
Most people are ok with slower. An AI that lets you edit a family picture, in say 30 seconds, locally is preferable to one that is instantaneous but requires you to submit that picture to examination/storage/training/sale in someone else's AI ecosystem. If I want to crop my ex out of family photos, I should not have to first give that photo to Microsoft. If I want an LLM to write a book report for me, I don't want it also alerting my school. And if I write a memo for a client, and I want an LLM to check the spelling, I don't want that memo leaked either.
robotresearcher 1 days ago [-]
It’s completely technically possible to have cloud services where customer data is opaque to the provider. Some of Apple’s services are like this already, for example.
I think there’s a sweet spot currently with munging your data blindly on the server so that your client device battery still lasts all day.
Meanwhile Apple and others push on with making client side models more efficient so that eventually the server costs and complexities go away.
fg137 11 hours ago [-]
This.
If asked to choose between photo editing done within 3s using cloud provider vs an average of 30s using local compute, most consumers will choose the former without hesitation.
Most users' usage is also going to fall nicely in the free tier of a typical freemium pricing model, like ChatGPT today.
People who talk endlessly about local inference have no idea about user workflows and usability.
dominotw 1 days ago [-]
don't want to share my pics with "cloud services"
Pxtl 1 days ago [-]
I'd like to think so but the existence of Google and Apple and Microsoft's cloud based photo tools with phone integration suggests that's false.
You could run a pretty good home server on $50 of gear and yet we never saw any real adoption of OwnCloud/NextCloud style products as an alternative to Google Drive/Photos or Apple Cloud.
Why should LLM/Transformers be any different? Especially when you need a proper expensive GPU to run them instead of a Raspberry Pi?
thewebguyd 1 days ago [-]
Apple's photo tools run on device, and they'll probably ship more on device foundation models at WWDC too.
On-device AI is going to be important, I think. It doesn't have to take the form of a chatbot UI to be useful.
com2kid 1 days ago [-]
After the latest round of cloud storage price increases my non technical wife has been asking if we can do local backups instead...
parineum 1 days ago [-]
> Most people are ok with slower. An AI that lets you edit a family picture, in say 30 seconds, locally is preferable to one that is instantaneous but requires you to submit that picture to examination/storage/training/sale in someone else's AI ecosystem.
Maybe if you ask them that question, but if you show them two products, they'll definitely prefer the faster one. 30 seconds is a long time to watch a progress bar.
sandworm101 1 days ago [-]
Fast and public, or slow and private. Not everyone wants, or is allowed to, share their data with the AI world. And do not doubt that every bit shared with an AI service will be used for training.
parineum 1 days ago [-]
The question here is about markets though. Not everyone wants x but if the vast majority of people want y, x is going to be niche and expensive.
You don't think the commercials for Google's AI photo features are going to have an impact on Apple users if their phones can only do a worse version of that feature and it takes longer?
spwa4 1 days ago [-]
Plus there's the other question. If this thing is slower ... what's the price? The desktop/mini-pc version of this is $3000, after all. At this performance level what is an acceptable price for the laptops?
People definitely aren't going to accept more expensive + slower ...
jb1991 1 days ago [-]
He’s just a braggart. When you see something like this in somebody’s personal bio on social media, it’s basically a banner that means “take everything I say in the context of me promoting myself.”
flatline 1 days ago [-]
The HN crowd is, by and large, not the target audience for his self promotion. I guarantee there is one and this is more or less effective.
smcleod 1 days ago [-]
Qwen 3.6 is far ahead of Gemma for most (but not all) things. I've deployed it out across a number of M5 MacBooks and it's genuinely useful for many tasks. It won't replace an Opus or current gen Sonnet sized model but it's still amazingly good for its size and probably as good as or just a bit before Sonnet 4 era. Far more reliable for tool calling, coding, agentic tasks and faster than the Gemma models especially with MTP.
zozbot234 1 days ago [-]
Qwen 3.6 is a toy compared to DeepSeek V4 Flash or Pro. These models can now run on Apple Silicon hardware with as little as 32GB RAM for the Flash (with 2-bit quant, which is still quite capable) using SSD offloading, with just-about-reasonable performance for interactive use, and far better performance on longer contexts than Qwen (due to the more efficient KV cache/attention mechanisms in DeepSeek).
Very significant improvements may be viable for unattended inference via large-scale batches, which can reuse sparse experts and thereby mask some of the latency involved - this is quite unique to DeepSeek, again due to its efficient KV cache.
greenavocado 1 days ago [-]
Qwen 3.6 27B still curb stomps Deepseek V4 in coding
epolanski 1 days ago [-]
1. Deepseek V4 is still in preview (training is not finished)
2. Qwen is much more demanding and borderline unusable on consumer hardware because it's a dense model. The 27B parameters are active all time for each token. It's not a MoE architecture where a router activates only some of them.
3. Qwen doesn't like quantization at all.
kgeist 1 days ago [-]
I have to disagree with most claims. I run Qwen3.6-27b at 260k context and 40-60 tok/sec. It handles most coding problems as well as Sonnet 4.6 under OpenCode on our production tasks. (As an experiment, I run the same prompts for the same issues in parallel for Qwen 3.6 and Sonnet 4.6 and usually see little difference in performance). I see zero degradation from quantization in practice.
Last time I tried running large MoEs on this PC, they had inferior quality at 2-3 bits compared to much smaller dense models at 5-6 bits, and were slower anyway.
zozbot234 1 days ago [-]
A 260k context (close to the stock maximum for Qwen, though it's possible to extend it) will take ~16GB RAM for storing the KV cache, barring quantization tricks which severely degrade quality. That's a whole lot more than what DeepSeek requires for a similar context length, and makes it infeasible to batch multiple inferences together. This used to be the status quo for consumer inference, in fact it still is for models like Kimi and GLM (which can sometimes be smarter than even DeepSeek V4 Pro!) but we can also do better nowadays.
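For anyone who wants to sanity-check KV cache figures like the one above, here is a rough sketch of the usual sizing formula for standard attention with an unquantized fp16 cache. The layer count, KV head count, and head dimension below are illustrative assumptions picked to land in the same ballpark, not the published Qwen configuration, and `kv_cache_bytes` is a made-up helper name.

```python
def kv_cache_bytes(tokens, attn_layers, kv_heads, head_dim, bytes_per_value=2):
    """Approximate KV cache size for standard (non-compressed) attention.

    The factor of 2 covers keys and values, which are stored per token,
    per attention layer, and per KV head.
    """
    return 2 * attn_layers * kv_heads * head_dim * bytes_per_value * tokens


# Hypothetical config: 16 full-attention layers, 8 KV heads, head_dim 128, fp16.
size = kv_cache_bytes(tokens=260_000, attn_layers=16, kv_heads=8, head_dim=128)
print(f"{size / 2**30:.1f} GiB for a 260k-token context")
```

The cache grows linearly with context length, which is why a more compact attention scheme matters more the longer the session runs.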
kgeist 1 days ago [-]
[dead]
ColonelPhantom 1 days ago [-]
Deepseek V4 Flash still has 13B active params though? That is about half as many as Qwen3.6-27B (and much more than Qwen3.6-35B-A3B). Given that RAM (even on a base M4 or 'regular' Intel/AMD system) is like an order of magnitude faster than an SSD, even Qwen 27B running from RAM will be much faster than any Deepseek V4 model with SSD offloading. And the MoE will be much faster still.
Qwen 27B is also small enough to completely fit in a high-end consumer or mid-end pro GPU, like an RTX 5090 or Radeon PRO R9700. I found results claiming 30 tokens per second generation for 27B(-Q4_K_XL) on an R9700. I doubt you get more than 5 tokens per second doing SSD MoE streaming.
Even for relatively short contexts, I honestly already find the ~30B class MoE models to be only borderline acceptable in terms of speed on my laptop (Ryzen 7 7840U, 64 GB LPDDR5-6400), though I use Gemma 4 26B-A4B more than Qwen3.6 35B-A3B.
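The active-parameter comparison above can be turned into a back-of-the-envelope speed estimate. This is only a sketch of the common rule of thumb that token generation is memory-bandwidth bound, so each token requires reading every active weight once. The 102.4 GB/s figure assumes a 128-bit LPDDR5-6400 bus, the 4-bit weight size is an assumption, and real throughput will come in below this ceiling.

```python
def decode_tokens_per_sec_ceiling(active_params, bits_per_weight, bandwidth_bytes_per_sec):
    """Upper bound on decode speed if every active weight is read once per token."""
    bytes_per_token = active_params * bits_per_weight / 8
    return bandwidth_bytes_per_sec / bytes_per_token


BANDWIDTH = 102.4e9  # assumed: 128-bit LPDDR5-6400, in bytes per second

dense_27b = decode_tokens_per_sec_ceiling(27e9, 4, BANDWIDTH)  # dense: all 27B active
moe_a4b = decode_tokens_per_sec_ceiling(4e9, 4, BANDWIDTH)     # MoE: ~4B active

print(f"dense 27B ceiling: {dense_27b:.1f} tok/s")
print(f"MoE 4B-active ceiling: {moe_a4b:.1f} tok/s")
```

Swapping in SSD bandwidth for the same function gives a feel for how much streaming weights from disk costs when an expert is not already in RAM.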
zozbot234 1 days ago [-]
> even Qwen 27B running from RAM will be much faster than any Deepseek V4 model with SSD offloading.
If you have reasonable amounts of RAM to cache the most likely experts, that's not true at all. Qwen 27B is marginally faster on a nearly empty context, then falls behind as context length increases due to the different attention mechanisms. Prefill for Qwen is much faster, but you're still comparing vastly different model sizes and capabilities. DeepSeek Flash is the best deal overall.
> completely fit in a high-end consumer or mid-end pro GPU
Or you could fit the dense portion of a much more capable model and still take advantage of that hardware.
ColonelPhantom 1 days ago [-]
> the most likely experts
Is that how MoEs work? I thought that an important constraint for MoEs is that experts need to be uniformly used to make sure they can be used effectively. If there is a 'common subset', that, if anything, sounds like a symptom of undertraining (i.e. the same trick will not work as well for Deepseek V4.1).
Also, even if your MoE hitrate is 90%, you still spend half your time waiting for the SSD, giving similar total speed to a 27B model!
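The arithmetic behind that 90% claim, as a minimal sketch. It assumes an SSD fetch costs about 10x a RAM fetch (the "order of magnitude" mentioned upthread); the ratio and the function name are illustrative assumptions, not measurements.

```python
def ssd_time_fraction(hit_rate, ssd_slowdown=10.0):
    """Share of expert-fetch time spent waiting on the SSD.

    hit_rate: fraction of expert fetches served from RAM (cost: 1 unit each).
    ssd_slowdown: assumed cost of an SSD fetch relative to a RAM fetch.
    """
    ram_time = hit_rate * 1.0
    ssd_time = (1.0 - hit_rate) * ssd_slowdown
    return ssd_time / (ram_time + ssd_time)


frac = ssd_time_fraction(0.90)
print(f"{frac:.0%} of fetch time is spent on the SSD at a 90% hit rate")
```

Under these assumptions the 10% of misses account for slightly more than half of the total fetch time, and the share only drops quickly once the hit rate gets very close to 100%.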
Finally, it looks like Deepseek V4 is pretty much only runnable with antirez's ds4, and SSD streaming only works with Metal; but I would like to try what you say with llama.cpp which uses mmap to also potentially do SSD streaming. (I can maybe try the large Qwen3.5 MoEs?)
> as context length increases
What kind of context length do you consider reasonable, though? From what I know, all models (even frontier ones) start degrading once you pass a few hundred thousand tokens. So realistically, limiting context size might even improve quality, especially if you use token-efficient harnesses.
> Or you could fit the dense portion of a much more capable model and still take advantage of that hardware.
Your point about consumer hardware was that it would be "borderline unusable" when running Qwen 3.6 27B. However, you need much less hardware to run a 27B than DSv4 Flash. In addition, you can do the same 'trick' with low-end GPUs and small MoEs: my desktop with 32 GB DDR4-3200 and an RTX 2070 8GB can run the ~30B class MoEs at 20-30 tokens per second and similar speeds to my laptop.
zozbot234 1 days ago [-]
> Is that how MoEs work?
For any given workload/session? Empirically, yes, that's what has been found across different models. There's quite a bit of predictability that makes caching helpful.
> Also, even if your MoE hitrate is 90%, you still spend half your time waiting for the SSD, giving similar total speed to a 27B model!
There are ways of masking some of that latency, though it requires some architecture-specific cleverness which is less directly applicable to a generic engine like llama.cpp.
> Finally, it looks like Deepseek V4 is pretty much only runnable with antirez's ds4, and SSD streaming only works with Metal
The llama.cpp folks are working on adding support, and the ds4 project is working on CUDA support for streaming inference, targeting the DGX Spark.
> From what I know, all models (even frontier ones) start degrading once you pass a few hundred thousand tokens.
DeepSeek V4 seems to do quite well on recall tasks even with large context. That's one plausible benefit of its compressed attention mechanism, compared to earlier models. Some degradation will likely still be there, but it's not necessarily obvious.
As for why people are calling Qwen 27B "borderline unusable" that may have to do with it being a dense model which makes for an increased compute intensity and pushes users towards discrete GPU platforms, since those tend to have the most compute overall as far as consumer hardware is concerned. I might agree that Qwen 27B is quite ideally tailored towards these platforms, but that does come with some limitations.
trollbridge 1 days ago [-]
You can run the 35B A3B model which is an MoE. Runs great on a 5090.
Pxtl 1 days ago [-]
I've got a Qwen 3.5 running on a 12GB 3060 and it's dumb as a stump but still smart enough to get some useful work done. Since it's my daily driver desktop I haven't jumped to 3.6 since last time I did I quickly ran out of vram and locked the desktop environment.
But yeah, the Qwen line is pretty impressive on commodity hardware.
derefr 1 days ago [-]
I must be using LLMs very differently than y'all, because I can't think of a single thing I would rely on an LLM that's "dumb as a stump" to do for me.
To me, LLMs are for asking research questions + exploring design spaces + pointing at codebases to investigate bugs. And those all benefit from the model being as "smart" (in terms of both fluid intelligence and burned-in knowledge) as possible.
I'm guessing there exist problems where "intelligence past a certain point" doesn't matter, so these medium-sized models can match the performance of the bigger models. But what problems might those be?
Pxtl 1 days ago [-]
Things that are tedious but simple but I'm unfamiliar with.
"Go add a gh action to compile and deploy this thing and run its tests" is one I've found it's good at. Yes I know how to make a gh pipeline but it's always a hassle to remember what goes where.
Cranking out unit tests is okay. It's good at summarizing things so it's not half bad at writing jsdoc/xmldoc comments.
epolanski 1 days ago [-]
Qwen suffers quantization a lot, rendering it borderline unusable.
unmole 1 days ago [-]
> you may run some models locally if only from a cost perspective
I have a hard time believing running a model on a laptop will be cheaper than running it in a datacenter. Why wouldn't economies of scale apply here as with every other computation?
wazdra 1 days ago [-]
This is assuming that you'll be charged only for the fraction of computing that you consumed. But you are actually paying for their infrastructure, for the R&D (and also the computation that went into training the model) etc.
It is not clear that, for your own small computations, these kinds of costs are needed, but you will still pay your share of the investment the provider made so that they could serve everyone's computation needs.
hungryhobbit 1 days ago [-]
But, currently ... you're not. AI companies are operating at a loss, and are being subsidized by their investors.
Local may or may not be cheaper than remote now, depending on the details, but the factors you describe won't affect the math nearly as much as they will once that subsidization ends.
dannyw 22 hours ago [-]
Not for API pricing. The latest models are not subsidised API wise anymore.
Qwen3.6 is practically indistinguishable from Sonnet 4.6 at least in my personal experience. And Sonnet 4.6 is not that cheap.
wjnc 1 days ago [-]
In that analogy bigtech AI is currently investing in cleaner air for all of us? We _could_ breathe it through their hose, but might as well breathe it outside.
zozbot234 1 days ago [-]
The datacenter setting has huge economies of scale for low-latency, just-in-time inference using extremely large models, but that's not the only viable use of AI. Batched, unattended inference of possibly smaller and weaker models, while theoretically viable in a datacenter setting, is far from the best use of that hardware. This is where local AI is at its best.
lrae 19 hours ago [-]
Does it apply for every other computation? Purely for the computation part?
You can host all kinds of things locally cheaper right now than in the cloud, no? (At least pre memory price hikes.)
It does, of course, come with its downsides like availability/reliability, less convenience, scaling options,..., but purely the computing price - I don't see why it wouldn't be cheaper in the future - at least for some use cases.
dgellow 1 days ago [-]
A laptop is really a pretty bad form factor to run LLMs. Worst cooling, more expensive memory that you cannot replace, resell value depreciating fast. It’s fine for tinkering, small scale research, and demos but it’s definitely niche.
The vision NVIDIA is selling is pure marketing IMHO
TylerE 1 days ago [-]
Because economy of scale isn't really the right metric here. A machine you were going to buy anyway essentially has a TCO of $0.
dofm 1 days ago [-]
AI models will pretty undeniably affect your electricity bill; yes you already own the computer, but it will cost more to run it if it's doing inference!
TylerE 1 days ago [-]
To a point, but we're talking a laptop, not a server farm. Even if you're going fullbore wide open 24/7 that's about $150/yr in electricity bills at average rates. Not quite nothing but in terms of AI costs that's pretty close to rounding to zero.
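A quick way to check that figure. This is a sketch with assumed inputs: a sustained 100 W draw for a laptop under full load and a residential rate of $0.17 per kWh. Both numbers are assumptions for illustration, so plug in your own.

```python
HOURS_PER_YEAR = 24 * 365


def annual_electricity_cost(watts, dollars_per_kwh):
    """Yearly cost of running a device flat out, 24/7."""
    kwh_per_year = watts / 1000 * HOURS_PER_YEAR
    return kwh_per_year * dollars_per_kwh


# Assumed: 100 W sustained draw, $0.17 per kWh.
cost = annual_electricity_cost(watts=100, dollars_per_kwh=0.17)
print(f"${cost:.2f} per year")
```

With those inputs the result lands close to the $150/yr mentioned above; a machine that only runs inference part of the day would cost proportionally less.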
itishappy 1 days ago [-]
It's cheaper for the AI provider to use your laptop instead of their datacenter.
jerf 1 days ago [-]
What "every other computation"? I seem to have a lot of processing power at my disposal here, between my cell phones, laptops, gaming PCs, various other hardware devices.
You're going to need to analyze the problem much more deeply because it sounds like the standards you are implicitly applying would result in "economically, everything should be centrally hosted" but that is clearly not the result that obtains. Even a modern mid-grade cell phone is no slouch; you may not be running a current-gen frontier AI on it but you certainly can do a lot of other rather intense things locally that would have been laughable 10 years ago, like surprisingly high powered games.
strictnein 1 days ago [-]
I also don't get why this twitter user is linked here, versus all the news articles about this new hardware that have been everywhere over the past number of days.
latch 17 hours ago [-]
I also dislike his self-promotion, but his work _is_ well known and, as far as I know, well looked upon. I think he has more expertise and knowledge in this area than most (including what you'd find in the news).
bespokedevelopr 1 days ago [-]
The security aspect is the main driver why I’m seeing so many businesses investing in local hardware. They know the models aren’t as good (caveat that they also can’t run Chinese models) and that’s ok. Places that really care about security and data governance already aren’t on the bleeding edge. They wait for the nice stable lts version, they lock down dev machines in frustrating ways and have lots of IT admin layers.
But they also want to taste the sweet fruit of AI so the only way to do this that a CISO will approve is on local air gapped hardware. It’s a niche but still a billion dollar niche.
I suspect personal privacy and need to run AI workflows to handle the litany of administration tasks of a household will be what result in regular need for local AI.
Apple is already out front with this on a personal, individual level, but they are not obviously headed toward multiuser/family-level ~biz admin with a persistent server running local LLM.
voidfunc 1 days ago [-]
> "Ranked in the top 2% of scientists globally (Stanford/Elsevier 2025) and among GitHub's top 1000 developers"
This made me laugh. I can only imagine how insufferable this person is to deal with.
falsemyrmidon 1 days ago [-]
> this guy puts this everywhere, gives me probably the inverse of what he is marketing for.
Do you think he's in Mensa too?
GeekyBear 1 days ago [-]
> However, it will make decent machines to play video games."
Where you will need games to be rewritten for ARM to get full performance, just like on Apple's M series chips.
jayd16 1 days ago [-]
Maybe they just mean from a "it can run a lot of DLSS" perspective.
unstatusthequo 1 days ago [-]
I hope a family-level AI appliance is a thing later. Local non-cloud assistant that lives in the house, families interact via voice or phones or whatever. Knows the contextual family stuff you need, etc.
Pxtl 1 days ago [-]
We didn't get people buying family-level file servers for the family photo gallery and documents at any real scale, so i doubt we'll see similar for AI especially when the cost is that much higher for GPUs vs an SBC machine.
JMiao 1 days ago [-]
because nas hardware and software suck and everything else was a poorly executed subscription product...i think one was called helm, another was by early twitter alumni. imagine a home device that manages and maintains itself and is a joy to interact with.
Pxtl 1 days ago [-]
And why would the hypothetical "OwnAI" product be any different?
JMiao 1 days ago [-]
not automatically, but a meaningful step up in ease of use (managing photo/video backup from all family devices) without a subscription would be a solid foundation
epolanski 1 days ago [-]
DeepSeek Flash v4 is the leading local AI on 128GB machines, and DS4 is still in preview (training not finished), no?
Especially on Dwarfstar.
sandworm101 1 days ago [-]
Lots of people are already running AI locally. They are the people buying up all the consumer-grade Nvidia GPUs. What are they doing with them? Well, the same things people with home media or email servers are doing: stuff they don't want to share with the general public.
Zetaphor 1 days ago [-]
I want to reduce my dependency on companies like Google, OpenAI, and Anthropic. Aside from the concerns of data sharing I'm also not a fan of how they run their operations, for example Anthropic now using xAI's Colossus data center which is poisoning a marginalized community, or OpenAI getting in bed with the military.
Not everything I want to use an LLM for requires "PhD level intelligence", and increasingly I'm finding more uses that involve sharing my personal data.
Yesterday my local model helped me when looking for a doctor who is in-network for my insurance. I threw it a screenshot from the providers search results and it looked up reviews for all of them.
sandworm101 1 days ago [-]
My local AI is currently upscaling an old British comedy from sub-DVD quality to 1k. (It is not available other than on DVD.) It looks like it will take about a week for my pair of 5060s to chew through the task.
eszed 1 days ago [-]
Which show?
sandworm101 1 days ago [-]
Chelmsford 123
I own the DVDs so I'm OK upscaling/editing my own copies for my own use. But if I ran the task on an ai service I would no doubt trigger copyright issues.
pratnala 1 days ago [-]
Which model are you running?
Zetaphor 1 days ago [-]
Qwen 3.6 35B-A3B and 27B both at Q8 on a Strix Halo machine
cyanydeez 1 days ago [-]
128GB seems the sweet spot for local models. I can program and install most GitHub projects with opencode and Qwen 32B with MTP.
anyone who's addicted to token throughput is losing the operational knowledge and offline capabilities.
if you aren't moving to the AMD 395 or Macs then you're hitching a ride on the expensive calorie ride
throw1234567891 1 days ago [-]
If you could buy a 256GB you’d be claiming that 256GB is a sweet spot. But I agree with you. Crack-tokens are not the future.
cyanydeez 1 days ago [-]
no, the fact that Macs and x86 and soon ARM are all going to have 128GB models in every sector, yeah, sure.
But watching everyone flounder because claude goes down or forcing you on API costs.
I'm programming things that'd take me days with a PC that, without OpenAI's VRAM shenanigans, would cost you $2k.
It's more than just 'this is what I could do' it's definitely about 'this is what anyone could do with a new PC purchase'.
throw1234567891 1 days ago [-]
You must be unaware that System76 was already selling 192GB machines, mac studios used to be 512GB max. The only reason why we don’t have them anymore is that we are in RAM shortage.
cyanydeez 1 days ago [-]
I'm aware you can have more. The term "SWEET SPOT" references an area that anyone/everyone can get to and isn't some magical expensive unicorn.
You're doing what the IT industry has been addicted to for decades: number goes up.
throw1234567891 1 days ago [-]
> You're doing what the IT industry has been addicted to for decades: number goes up.
No, I have hands-on experience with bigger models, and understand the advantages of using them.
cyanydeez 1 days ago [-]
You mean you're addicted to not understanding anything you do. That's fine. The rest of us aren't going to experience the glory of API bills going up.
You also probably believe you need to 'escape the permanent underclass'
throw1234567891 1 days ago [-]
You assume I use a subscription. There are other options but they require more than 128GB unified RAM. You also assume a lot about how I work. And those final assumptions about what and how I think of others speak more about your anxieties rather than what I think.
You assume a lot. Sometimes it’s good to simply ask a question.
speed_spread 1 days ago [-]
Those 192GB aren't unified memory though. 128GB on Mac or 395 can be used by both CPU and GPU. It's the GPU + large memory that opens up fast local LLM inference.
throw1234567891 1 days ago [-]
Yes, true. But if we had the ability to buy that much RAM in the laptop, everyone would be looking in that direction. Until this thing discussed here comes to the market, “we didn’t have computers with unified 128GB RAM either” (except for Macs).
iLoveOncall 1 days ago [-]
> "Ranked in the top 2% of scientists globally (Stanford/Elsevier 2025) and among GitHub's top 1000 developers" - side note but this guy puts this everywhere, gives me probably the inverse of what he is marketing for.
Lol yeah seriously, that stinks "I ask AI to generate a huge amount of bullshit and upload it to pad irrelevant stats".
Absolute loser.
nkurz 1 days ago [-]
I agree that it sends the wrong signal, but actually Daniel is great. He cares tremendously about doing work that is actually real-world useful. I've co-written a few papers with him, and he's really hard working and open to outside suggestions. The danger is that if you send him comments, he'll eventually manage to rope you into writing a new and improved version. Seriously, if you are a non-academic computer scientist with a good idea that you want to publish, he'd be incredibly open to working with you.
As to why he now has this on his blog? I also cringe when I read it. I presume someone told him he should self-promote more, and this is his lame attempt to do so. He's almost certainly the most cited person in his department, but it's entirely possible that none of his colleagues actually know this. Cut him some slack. Self-promotion is not his strength. He's a nerd's nerd, and not a marketer. I'll mention to him that his attempt here might be backfiring when I'm next in contact with him.
infecto 1 days ago [-]
I cringe calling it out but it just stood out as it was plastered everywhere and I actually have never seen his links before.
hgoel 1 days ago [-]
I kind of get it in the sense that every academic has to make themselves somewhat comfortable with self-promotion even if they don't like it. It's an important part of getting funding, but putting a blurb like that everywhere just hurts his credibility I think.
iLoveOncall 1 days ago [-]
> As to why he now has this on his blog?
He doesn't just have it on his blog, he has it EVERYWHERE. Sometimes 2 or 3 times on the same page.
dgacmu 1 days ago [-]
He's not a loser; he's done some really fun work that many people use daily. I've used his range mapping trick in multiple projects/papers. It's elegant.
It sounds like he's gotten bad advice about how to market himself /or/ this is being marketed to people who have bigger checks to write and whom he believes will be responsive to this kind of marketing. As an academic, it rubs me very wrong - I think it's detrimental to the field when we get into h-index stacking contests or citation count comparisons. But I don't know what incentives he's responding to, which seems important for putting this stuff in context.
(as an aside, it turns out that polars + fastexcel is about 10x faster than pandas + openpyxl for searching that dataset, if anyone else is curious what he was actually talking about. :)
netsharc 1 days ago [-]
I found his website, https://www.lemire.me/en/ , and the "2%" brag is the very first sentence, geez.
Being the top x% is what OnlyFans girls brag about, professor...
For posterity: It's rank 34 at the time of this comment
SkiFire13 1 days ago [-]
That line looks very cringe indeed, but the guy has some crazy good blogposts on SIMD stuff.
SwtCyber 1 days ago [-]
I think the local-model use case is going to become less niche pretty quickly if the models keep getting smaller and more capable. Even if most people do not care about privacy or offline use, the cost argument is pretty strong
dagmx 1 days ago [-]
This feels fluff to me on the part of the author (whose work I don’t want to trivialize) but I don’t think they’ve actually looked deeper than a paper spec sheet on this.
1. Yes it has the same number of cores as a 5070 mobile. It’s also running at a shared peak of 2/3 the bandwidth and a shared peak of 2/3 the TDP. The GPU by itself will likely perform at half the dedicated unit’s performance.
2. Apple may not have SVE2 but they do have the AMX (private) and SME. I don’t see why he thinks the SVE2 will give him more performance than the SME.
3. He mentions a single core type but doesn’t mention the total makeup. We already have known for a year how the DGX Spark compares to Apple chips. For CPU it’s roughly equivalent to an M3 Pro and for GPU compute (not rasterization) it’s between an M4 Pro and M4 Max without considering bandwidth.
The real advantage to these is that they run CUDA. That’s it. Otherwise when they launch they’ll be 2-3 generations behind where Apple is and 1 gen behind AMD.
The other super power of the DGX Spark was the NIC for pairing them together. But that’s been removed here too.
storus 1 days ago [-]
> GPU compute (not rasterization) it’s between an M4 Pro and M4 Max without considering bandwidth
You are likely thinking about token generation which is dependent on memory bandwidth where Apple has an edge. Spark's GPU compute is way higher than even M5 Max (17 FP32 TFlops), around 2x FP32 TFlops... It's literally 6144 CUDA cores like desktop 5070, slowed down by slow memory and lower TDP (29.7 vs 31 FP32 TFlops on 5070).
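For reference, peak FP32 figures like these are usually derived as cores × 2 FLOPs per cycle (one fused multiply-add) × clock. A rough sketch that back-solves the clock each quoted figure would imply; the 2-FLOPs-per-core-per-cycle convention is an assumption about how the numbers were produced, and the core count and TFLOPS values are simply the ones quoted in this comment, not measurements:

```python
# Peak FP32 throughput under the usual marketing convention:
# each core retires one fused multiply-add (counted as 2 FLOPs) per cycle.
FLOPS_PER_CORE_PER_CYCLE = 2  # assumption: standard FMA counting


def peak_tflops(cores: int, clock_ghz: float) -> float:
    """Theoretical peak in TFLOPS for a given core count and clock."""
    return cores * FLOPS_PER_CORE_PER_CYCLE * clock_ghz / 1000.0


def implied_clock_ghz(cores: int, tflops: float) -> float:
    """Clock (GHz) that a quoted peak TFLOPS figure implies."""
    return tflops * 1000.0 / (cores * FLOPS_PER_CORE_PER_CYCLE)


CORES = 6144  # core count quoted in the comment
for label, tflops in [("Spark (quoted)", 29.7), ("desktop 5070 (quoted)", 31.0)]:
    clock = implied_clock_ghz(CORES, tflops)
    print(f"{label}: {tflops} TFLOPS implies a clock of about {clock:.2f} GHz")
```

With identical core counts, the gap between the two quoted peaks comes down entirely to the assumed clock, which is where a lower TDP would show up.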
dagmx 1 days ago [-]
That’s only if you consider FP32 specifically. On average the M5 Max will pull ahead for tasks like GPU raytracing (it’s currently the fastest mobile GPU for Blender rendering) and token generation and other things that benefit from the higher memory bandwidth.
I’d also mention that you’re comparing peaks which the RTX Spark won’t be hitting. The top TDP is less than that of the DGX Spark.
I just think anyone calling this a beast and a game changer is conflating/extrapolating from different form factors and constraints.
well_ackshually 10 hours ago [-]
> fastest mobile GPU for Blender rendering
cool story, but nobody cares about mobile GPUs for blender. A 4080 eats an M5 Max alive for breakfast. The 5080 in my machine that cost me 1500€ runs circles around an M5 Max that would cost me over 6000€. And when in 5 years the 5080 isn't enough, I can upgrade it to a 7080 or whatever, which will remain compatible.
If you're a professional, soldered products like the RTX Spark or Apple's offering are a dead end. They are literally never worth it.
dagmx 8 hours ago [-]
As a professional in the space, a ton of people DO care about mobile performance. If you’d go to SIGGRAPH in the last few years you’d see how the landscape has really changed.
It’s not going to be the primary place of creation but there’s a lot of usefulness in having a portable workstation or that entire segment of the laptop workspace wouldn’t exist.
In either case, it’s beside the point because the point is talking about the compute levels of a GPU in the same form factor.
Foobar8568 10 hours ago [-]
And nobody cares about 5080.
well_ackshually 9 hours ago [-]
Exactly. A 5080 is just an enthusiast gamer card. If you're part of a large company that requires you to run Blender/3DSMax/etc (read: disney/pixar sized), you're going to have an A6000 in there, or even just a render farm.
Game dev & asset work is probably happy with a 5080 and that's what most rendering/dev machines would have.
The addressable market of "i have 6000 to blow and i need meh performance on anything related to 3D rendering" is small, and benchmarks make it look bigger than it really is.
dagmx 7 hours ago [-]
Ironically the two studios you mentioned don’t actively render on GPUs and it’s an area which shows that even these small SoCs can punch way above their weight if you look at their pure compute power.
Disney’s Hyperion is CPU based and RenderMan XPU is just exiting beta after over a decade.
But while they do stack their workstations with higher end GPUs for artist throughput in viewports it’s mostly just for the higher memory to fit unoptimized scenes in. None of the studios or major films I’ve worked on have had their on desk artists be raster rate gated but just memory gated.
But again, that’s beside the point, because it’s still a valuable metric when comparing perf between similar chipsets.
There are already more creatives using their consumer grade hardware to make stuff. And even the studios you mentioned do actually use laptops on the go for parts of their creation pipelines for various things like virtual production scouting etc.
cthalupa 1 days ago [-]
Prefill is another advantage vs. Apple. It's way way way way faster on a spark than it is even on an m5 max.
Same model, same quant, same query, as close to as matched settings as I can get from vllm, and for workloads with large prompts + low cacheability, one of my sparks will often be done responding before the mbp is done with prefill.
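A back-of-envelope way to see why prefill tracks compute rather than memory bandwidth: processing a prompt costs roughly 2 FLOPs per active parameter per prompt token, while the weights only need to be streamed once for the whole batch of prompt tokens. The sketch below is an estimate under stated assumptions, not a benchmark; the parameter count, prompt length, and sustained-throughput values are hypothetical placeholders, and attention cost (which grows with prompt length) is ignored.

```python
def prefill_seconds(active_params: float, prompt_tokens: int,
                    sustained_flops: float) -> float:
    """Rough lower bound on prefill time.

    Assumes ~2 FLOPs per active parameter per prompt token and ignores
    attention, which adds more work for long prompts.
    """
    return 2.0 * active_params * prompt_tokens / sustained_flops


# Hypothetical inputs: 3e9 active parameters, a 100k-token prompt, and two
# machines whose sustained matmul throughput differs by 4x.
params, tokens = 3e9, 100_000
for name, flops in [("machine A", 20e12), ("machine B", 80e12)]:
    print(f"{name}: about {prefill_seconds(params, tokens, flops):.1f} s of prefill")
```

Under this model the prefill time scales inversely with sustained compute, so a 4x compute gap becomes a 4x gap in time to first token on large, uncacheable prompts.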
wmf 1 days ago [-]
Lemire is very narrowly interested in CPU SIMD so within that niche it may be interesting. As you said, overall the Spark is good but not great.
oofbey 17 hours ago [-]
I cannot fathom why he brings up CPU SIMD as a potential comparative weakness on the NVIDIA Spark when it has teraflops of CUDA sitting right there.
well_ackshually 10 hours ago [-]
Because you won't run your sorting algorithm that runs every frame on a CUDA kernel. CPU performance matters more than however many tflops of CUDA you have under hand as soon as you do silly things like "run an OS" and "use your PC for anything but shitting out tokens"
llm_nerd 1 days ago [-]
It is absolutely fluff, and the only reason this worthless tweet is on the front page of HN is that this audience has a habit of canonizing certain people, and treating each of their bowel movements as prophetic.
Guy suddenly became aware of a chip that the rest of the industry long knew about, seems completely unaware of the competitors, and posts about how it's a BEAST and will be a GAME CHANGER.
Like the DGX Spark was a game changer? Eh, it has mostly been a massive disappointment. An overpriced nvidia laptop isn't going to change the equation an iota.
trympet 18 hours ago [-]
Yes. This reads like a LinkedIn post rehashing old news. I’m not even in the industry.
modeless 1 days ago [-]
The Qualcomm Snapdragon X2 Elite Extreme trounces Nvidia's chip in single core CPU performance. It beats Intel and AMD's best, too. It has unified memory. It's the only CPU in the same league as Apple's M-series in both CPU performance and power efficiency. And it's available in laptops today, not later this year. People are sleeping on Qualcomm.
arjie 1 days ago [-]
Garbage operating system support. If you can’t do Linux support it’s a bit pointless because there’s two platforms for this that matter: Linux and Darwin.
Qualcomm is like AMD was for GPUs for like decades. Lots of announcements and people on the Internet are huge fans based on web pages they’ve read but if you try to make it work it’s a nightmare.
Snapdragon X Elite doesn’t work on Linux so it’s a pointless platform. Enthusiasts have made M1 work better. Literally have old Macs running rather than use Qualcomm.
someguyornotidk 14 hours ago [-]
Yep, very much this. I don't bother looking to them for anything in this space because the hostility they show towards general Linux support appears to be rooted in principle and appears to be very much deliberate. It almost feels like Linux support would run counter to how they want their processors to be used.
Whether this is true or not, it's pretty safe to assume anything based on their stuff is not for me.
zozbot234 13 hours ago [-]
It's the usual issue with Qualcomm SoCs being designed as bespoke embedded platforms as opposed to standard general-purpose compute. So Linux support for these chips ends up being heavily vendor specific, i.e. it's not just Qualcomm it's also the platform OEMs.
I assume because most windows installs are corporate IT garbage that if anyone cared about performance they could just turn off one of the three endpoint protection services or tune the backup service down and get better results than processor upgrades.
raudette 9 hours ago [-]
this
It drives me nuts, I look at cumulative CPU time, and this is all my work laptop does.
satvikpendem 16 hours ago [-]
Alternatively, just use Windows.
TiredOfLife 16 hours ago [-]
It has Linux support from wsl2
ksec 1 days ago [-]
It trounces ARM's old CPU design. The X925 used in this Nvidia chip is 2 years old. X930 or C1 has shipped with Mediatek Dimensity 9500 which is what the Snapdragon 8 Elite Gen 5 / X2 Elite should be compared to. Qualcomm still has a lead in performance, but it is shrinking.
But perhaps more importantly, Nvidia seems to be doing a lot better with its ecosystem. Nvidia has much better distribution channels and partners building on top of its PC gaming GPUs. It also has game developer relations that are unmatched by anyone in the industry.
Qualcomm has so far failed to execute this, both in PC and on their server CPU side.
Danox 1 days ago [-]
Microsoft is sleeping on Qualcomm with their lousy port of Windows to Arm processors…
hypercube33 1 days ago [-]
I'm not sure they are sleeping. I have an older version and it can run games and other things just fine; it's just overpriced and not properly cooled. The driver/firmware support from Lenovo / Qualcomm is pure garbage. You're lucky to get a driver update to fix anything. For months it just overheated and video would start corrupting, but that finally got fixed. You can't just go to Qualcomm's website and download new drivers even though it looks like you can. They really don't get how modern GPUs work on Windows: driver updates that optimize for games are really important because of how Windows is, but the experience is like pulling teeth. If the systems were Neo priced (500-700 USD) and had a cooling fan I'd be all on board with these systems. Right now, AMD with unified memory is just the better deal for the $1200 (2025) systems to run Windows and an average workload.
sedatk 1 days ago [-]
> with their lousy port of Windows to Arm processors…
What's lousy about it? I use it daily and have zero problems.
criticalfault 1 days ago [-]
And is Qualcomm sleeping on Linux?
embedding-shape 1 days ago [-]
Seems like not? Judging based on https://github.com/qualcomm-linux something is happening, although I can't say how much. They definitely seem awake at least.
jeroenhd 1 days ago [-]
The problem with these chips on Linux is that something has been happening for months but you still end up needing to download special editions of ARM Linux images to get these devices to work properly.
Some distros still need extracting Qualcomm firmware from Windows to get Linux to work properly. Audio remains a challenge, like x86 Linux decades ago. Apparently camera stuff works these days but produces images of subpar quality.
These issues also occur on normal Linux. My experience with my Lenovo+Intel laptop was that it took three months after release for the firmware to work properly (and the Nvidia drivers took much longer, but that's my fault for buying something containing Nvidia hardware). Intel managed to do what Qualcomm did in months rather than years.
I hope Qualcomm finally sorts this shit out, I really do, but with the prices of computers these days, I'm going to need to see quite the discount before I'll consider buying anything with a Snapdragon.
ChocolateGod 1 days ago [-]
> but you still end up needing to download special editions of ARM Linux images to get these devices to work properly
This is a problem with Linux on ARM generally (Android has had it since inception), it's not a Qualcomm problem.
criticalfault 14 hours ago [-]
it's a UEFI problem
they seem to have dealt with this for the server hardware
criticalfault 14 hours ago [-]
or acpi.
justincormack 1 days ago [-]
They run a hypervisor under the OS, and don't support actually running directly on the hardware; it's very odd.
Elixir6419 1 days ago [-]
One of the biggest issues I see is the devicetree nonsense. It makes every single laptop and BIOS version unique and requires a lot of housekeeping. There are also big chunks of work (as I understand it) to be done around hibernate and decent suspend support.
My experience (wanted to use x13s as daily driver) is that there was good progress for about a year, while jhovold was leading the charge, but something expired and Qualcomm, as far as I can tell, forgot that some progress should happen on x1 and x8c as well as x2.
gsnedders 1 days ago [-]
It feels deeply unfortunate that even with Windows on AArch64 requiring ACPI that it still doesn’t suffice for Linux, unlike on x86.
And I know a lot of that lies on the vendors, but it does feel unfortunate (from a standardisation/conformance/certification point of view) that Windows requiring it doesn’t make it easy to boot other OSes!
stefan_ 1 days ago [-]
Yes, Ubuntu on the previous gen Snapdragon X is still trash.
reactordev 1 days ago [-]
10000000x this. They have been sleeping on Arm since windows phone. I just don’t see them ever having an original thought again.
They could have had a 128core arm chip by now.
adabyron 1 days ago [-]
They have original thoughts! It's just that those employees get squashed by other divisions or having to meet short term quarterly profits it seems.
There's also the whole giant trillion dollar company doesn't want to invest and let small ideas grow. They only focus on things that move the needle, which isn't much at the size.
Had Microsoft executed and invested, they could have made a comeback imo in search, mobile & hardware. Unfortunately there's a major lack of leadership, or they just don't want those areas.
reactordev 1 days ago [-]
No doubt the individuals have thoughts. It’s just corporate never actually does ANY of them. It’s an MBA mill.
dylan604 1 days ago [-]
Unless the chip was called Copilot, they are not thinking anything about it. If it was called Copilot, they'd have already figured out how to shove it down your throat.
bradfa 1 days ago [-]
Qualcomm is a “fool me once, shame on you, fool me twice, you don’t fool me twice” kind of situation. So many horrible experiences in the past that people are going to be hesitant.
Qualcomm are trying harder now it seems. But it will take time to repair their reputation in the PC market.
thewebguyd 1 days ago [-]
They burned me with the first gen Snapdragon X Elite. Before the various laptops with it were out they promised Linux support. Here we are, years later, still no fully OOTB support. Ironically, the GPU firmware was just mainlined in the kernel 4 months ago, but they still haven't done the same for the 1st gen X elite.
Tuxedo computers tried and didn't succeed either.
I will never buy Qualcomm again. I avoid them on phones as well by just buying Apple. They do not support their hardware beyond the release.
jeroenhd 1 days ago [-]
> I avoid them on phones as well by just buying Apple
To each their own, but I don't recall Apple ever mainlining any of their drivers on Linux. You're rightfully angry on the laptop side of things, but Apple is much worse than Qualcomm when it comes to open source support for their phones.
Qualcomm probably shouldn't have promised Linux support in the first place. Everyone seems to love Apple's hardware even though you're practically stuck with macOS. Had Qualcomm just stuck to Windows-only, they would've probably received a much better reception by the tech press.
mlinhares 1 days ago [-]
Apple doesn’t sell general purpose computers outside of their own hardware so this doesn’t make any sense.
izacus 13 hours ago [-]
It makes absolutely the same amount of sense as the original critique - you just prefer to defend one bad actor because you like the brand.
dismalaf 1 days ago [-]
At least Apple tells you they don't support anything except their own OS, Qualcomm just pretends to offer support.
derefr 1 days ago [-]
Can you say more? I don't have any memory of Qualcomm-related scandals(?), but I just read the news; I've never really been a user of their chips.
re-thc 1 days ago [-]
> Qualcomm are trying harder now it seems.
Not really, the 1st iteration got stuck in legal land and other delays.
bradfa 1 days ago [-]
They hired a good number of smart people who know how to do open source. So they’re trying. We shall see if it works.
darkwater 1 days ago [-]
Is it well supported under Linux?
modeless 1 days ago [-]
Qualcomm has been upstreaming Linux support for some of their chips but they're not working fast enough and I don't think the latest chips are there yet unfortunately.
diabllicseagull 1 days ago [-]
I've been keeping an eye on the state of Linux on the first gen of X Elite and it's sad that the potential is not fully materialized outside WoA. Take a look at what peeps are going through:
No, not at all, those machines are currently unusable on Linux.
dismalaf 1 days ago [-]
Too bad Qualcomm provides shit drivers for Linux, never updates any of their drivers (had a Samsung/Qualcomm phone with drivers years behind the equivalent Google Pixel phone), etc... They are the absolute worst actor in the entire computing world, don't care how fast their chip is.
alt227 1 days ago [-]
Why do people care so much about single core performance? We are all professionals here and I bet most of our workloads are multi core. I get that these new arm chips from Apple and Qualcomm are great at one thing at a time, but for professional workloads high end x64 chips still cannot be beaten on the desktop.
Remnant44 1 days ago [-]
I agree with you, but also:
Outside of anything else, Amdahl's law means that as the parallel performance grows, we become _more_ limited by the inherently serial code, and thus single core performance, not less.
Given that single core performance is "harder" (can't just throw more cores/sockets at the problem), it's also critically important.
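To put numbers on that, here is a minimal sketch of Amdahl's law. The 95% parallel fraction is a hypothetical workload chosen for illustration, not something measured on any of the chips discussed here:

```python
def amdahl_speedup(parallel_fraction: float, cores: int) -> float:
    """Overall speedup when only `parallel_fraction` of the work scales with cores."""
    serial = 1.0 - parallel_fraction
    return 1.0 / (serial + parallel_fraction / cores)


def serial_share_of_runtime(parallel_fraction: float, cores: int) -> float:
    """Fraction of the remaining wall-clock time spent in the serial part."""
    serial = 1.0 - parallel_fraction
    return serial / (serial + parallel_fraction / cores)


P = 0.95  # hypothetical: 95% of the work parallelizes perfectly
for cores in (4, 16, 64):
    speedup = amdahl_speedup(P, cores)
    share = serial_share_of_runtime(P, cores)
    print(f"{cores:>2} cores: {speedup:5.2f}x speedup, "
          f"serial code is {share:.0%} of the runtime")
```

As the core count grows, the serial share of the runtime climbs toward 100%, which is why a faster single core keeps paying off even on heavily threaded workloads.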
dagmx 1 days ago [-]
What x86 chips have the same or higher number of cores in the form factors that these chips are available in and are also more performant?
Strix Halo is 16 cores. Intel Core Ultra 9 285HX is 24. Apple is 18. Qualcomm is something similar too but I can’t recall. NVIDIA is 20.
Until you get to threadripper/epyc or Xeon territories (completely different form factors and TDPs) the arm chips are ahead of x86 on both power and perf. And even when you get to those areas, arm is equivalent or outperforms them, as can be seen in the recent neoverse x3 and Vera benchmarks.
wmf 1 days ago [-]
Nvidia Spark has 10 mid cores and 10 fake cores. Probably all the others are faster.
modeless 1 days ago [-]
Single core performance is the biggest factor for most day-to-day use of a computer, the stuff I do on a laptop. It's more important than peak multi core performance for web browsing and games. I only care about multi core performance when I'm compiling, and I usually do heavy compiles on a remote machine rather than on my laptop.
hulitu 1 days ago [-]
> Why do people care so much about single core performance?
Because that's the only part where this chip excels.
People have been comparing apples with oranges for ages.
Razengan 1 days ago [-]
> X2 Elite Extreme
I'll wait for the 365 AI Ultimate Professional Enterprise Edition: Origins version
SecretDreams 1 days ago [-]
People aren't sleeping on Qualcomm, they're tired of Microsoft Windows as a janky ass OS.
re-thc 1 days ago [-]
> People are sleeping on Qualcomm.
Technically speaking, Qualcomm acquired Nuvia, which is where this came from, and that company came from ex-Apple engineers wanting to do what Apple said no to for its chips.
I have been somewhat surprised at the lack of commentators observing that this is Microsoft and above all NVIDIA launching a device that is fundamentally at odds with the metered cloud model of AI.
When you look at the other announcements and murmurings (better offline BYOK for Copilot, talk of an unmetered AI future) I think it’s clear that these two firms understand that cloud-only AI is not sustainable or inherently in their interests. But their willingness to undermine OpenAI with a product like this is notable.
Yokohiii 9 hours ago [-]
For me this is a push to segment the market into consumer and industrial grade RAM. Even NVIDIA and MS are not stupid enough to think they can keep going with RAM prices exploding. Consumers need hardware to subscribe to their AI stuff.
LLMs will get bigger and even with 128GB (that many won't saturate), you won't run future frontier models. For LLM vendors and integrators it's a handy thing to move lower quality inference to the consumers.
Also running local doesn't have to mean that the models have open weights. MS will likely start to distribute closed models at scale once the hardware is there.
thewebguyd 1 days ago [-]
Yeah, "unmetered intelligence" was probably the most used phrase at MS BUILD this past week. They are going hard into local AI
dofm 1 days ago [-]
I don't think you can interpret this as anything other than a sanctioned rebuke, right? Everyone has a strong visceral sense for what that means.
Copilot just got proper "offline" BYOK support, didn't it? Presumably that was one of the things they were talking about. Though I imagine that has something to do with the fact that Zed has supported that properly for months.
tantalor 1 days ago [-]
Maybe. Or they are simply hedging their bets.
gscott 20 hours ago [-]
They get to keep more AI usage from big providers, send telemetry, and inject ads.
GodelNumbering 1 days ago [-]
I do not see how it is a 'beast of a' anything. It has 300GB/s memory bandwidth, barely above AMD Strix Halo (256GB/s) with the same 128GB RAM, and less than half the memory bandwidth of the M5 Max 128GB (614GB/s). I'm emphasizing memory bandwidth because I suppose most people interested in it are AI enthusiasts. Also, Windows.
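For single-stream token generation, a common rule of thumb is that each generated token has to stream the active weights from memory once, so bandwidth divided by bytes read per token gives a ceiling on tokens per second. A sketch using the bandwidth figures quoted above; the 35 GB weight footprint is a hypothetical example (roughly a dense 35B-parameter model at 8 bits per weight), and real throughput will be lower than this bound:

```python
def max_tokens_per_second(bandwidth_gb_s: float, weights_gb: float) -> float:
    """Upper bound on decode speed if every token reads all active weights once."""
    return bandwidth_gb_s / weights_gb


WEIGHTS_GB = 35.0  # hypothetical: ~35B dense parameters at ~1 byte each

# Bandwidth figures as quoted in the comment above.
machines = [("this chip", 300.0), ("Strix Halo", 256.0), ("M5 Max", 614.0)]
for name, bandwidth in machines:
    ceiling = max_tokens_per_second(bandwidth, WEIGHTS_GB)
    print(f"{name}: at most about {ceiling:.1f} tokens/s")
```

A mixture-of-experts model reads far fewer bytes per token, so the absolute numbers change, but the ratio between machines stays the ratio of their bandwidths.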
bigyabai 23 hours ago [-]
Unlike the M5 Max, it should have usable context prefill. It's feasible to run 256k token workflows that would take the better half of an hour for TTFT on the M5.
moffkalast 1 days ago [-]
They have a lot of software groundwork ahead of them to make an ARM CPU viable for any kind of desktop use outside direct inference or training usage too.
AMD has the advantage that their x86 machines run everything, Apple maintains the whole MacOS stack, while Nvidia can barely scrape together one Ubuntu release per Jetson generation, it's beyond embarrassing. Maybe they ought to put those agents they keep droning about to some actual work on their OS support.
bigyabai 23 hours ago [-]
> Nvidia can barely scrape together one Ubuntu release per Jetson generation
Why would they do more? It's an LTS distro, the Nvidia drivers are updated for as long as the hardware's compute capability is supported.
Nvidia's ARM drivers are updated constantly, and battle-tested as the backbone in hundreds of thousands of Grace ARM servers.
moffkalast 12 hours ago [-]
Yes it's an LTS, which means it only makes sense to build something on it early in its release cycle, so the platform itself needs to keep up with new releases. Orins will go EoL next year, so good luck with that.
That's not even considering the lazy out-of-tree patchwork support Nvidia does for their products on top of that. Maybe it's different in this case for Windows since it forces a rolling release, but I seriously doubt they'll do it properly instead of forking some version and keeping it around for 10 years like absolute idiots.
bigyabai 6 hours ago [-]
> That's not even considering the lazy out-of-tree patchwork support Nvidia does for their products
For their ARM SOCs? Almost every single ARM OEM on the consumer market is begging you to use out-of-tree blobs for basic firmware support. Nvidia's stance isn't ideal but it's also not unique (or damning) to the rest of their ARM competitors.
Melatonic 3 hours ago [-]
We need a physical architectural change if we really want to improve this. You can only physically wire so much memory next to a chip in a 2 dimensional (flat) design - limited by the edge lengths of the chip (CPU or GPU)
Basically what we need is a chip that also has pins or some type of attachment system on the top (physically) or maybe below where the chip itself connects to the motherboard.
Imagine a CPU you can just plug in a block of HBM memory on top of (or on "bottom" of). This would allow a much larger physical surface area for putting ram cache near the compute cores itself because you would not be limited to edge lengths.
Cooling the whole thing would be a methodology change (might need liquid coolers that sandwich in between the ram cache and compute and cool both)
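The edge-versus-surface argument can be sketched with simple geometry: connections routed out along the four edges of a square die grow linearly with its side length, while connections made across a whole face grow with its area. The die size and connection pitch below are hypothetical round numbers for illustration, not real packaging figures:

```python
def edge_connections(side_um: int, pitch_um: int) -> int:
    """Connections that fit along the four edges of a square die."""
    return 4 * side_um // pitch_um


def face_connections(side_um: int, pitch_um: int) -> int:
    """Connections that fit across one full face of the same die."""
    return (side_um // pitch_um) ** 2


SIDE_UM = 20_000  # hypothetical 20 mm x 20 mm die
PITCH_UM = 100    # hypothetical 0.1 mm spacing between connections

edge = edge_connections(SIDE_UM, PITCH_UM)
face = face_connections(SIDE_UM, PITCH_UM)
print(f"edge-limited: {edge} connections, face-limited: {face} connections")
print(f"ratio: {face // edge}x more room when stacking on a face")
```

Doubling the side length doubles the edge budget but quadruples the face budget, which is the appeal of stacking memory above or below the compute die.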
1970-01-01 9 hours ago [-]
Local models becoming thousands of dollars instead of millions to run is a story the public genuinely seems to be unaware of. If the order of magnitude falls again, the markets are cooked. The cheap chips barrier is even artificial and unsustainable. The next big story in local AI adoption will be big players doing chip hoarding-as-a-strategy.
wombat-man 8 hours ago [-]
I think this is part of why capex is so high for the big players. Yes, data centers are expensive to build but it's also about outbidding for the labor and materials needed to build those. They'd rather build compute they don't need and then lease it than let competitors have their own machines.
iceflinger 8 hours ago [-]
I'm fully convinced now that this is the real reason for the OpenAI RAM wafer deals.
fg137 1 days ago [-]
I don't think this is going to get any traction in the general consumer world, even less relevant than Apple Vision Pro.
(HN reaction to Vision Pro back in 2024 is almost hilarious if not ridiculous, looking at it today. I knew it would be a flop and I was so right.)
monster_truck 22 hours ago [-]
Wild that it took me so much scrolling to find some sense!
Spark DGX also remains a nothingburger, I would be livid if I spent this kind of money and had to waste time chasing down power cap bugs or A/B/C testing each firmware version to find the one that is least slow and also does not fail https://dredyson.com/the-hidden-truth-about-dgx-spark-perfor...
GuestFAUniverse 1 days ago [-]
And who in 2026 is still anal-fixated on a "Windows" PC?
It's just a personal computer. It normally runs multiple operating systems just fine.
Windows PC sounds like people talking about tech who are either paid by M$, or embed pictures into Word documents to send them.
Nobody has to kill the fun those OS-agnostic machines allow by artificially binding them to a shitty OS.
zdragnar 1 days ago [-]
Enterprise, of course. They probably buy more PCs than the rest of the market combined.
Even for personal use, I'd imagine the number of people dual booting Windows and something else is a very tiny minority.
Saying "Windows PC" is a pretty reasonable way to distinguish between "made by Apple" and "made by someone else" because the market of PCs that aren't made by Apple and don't come with Windows is really, really tiny.
To be honest, this seems like a strange hill to take such an aggressive stance upon.
crazygringo 1 days ago [-]
> And who in 2026 is still anal-fixated on a "Windows" PC?
I'm assuming it's just clarifying this isn't about Macs.
The term "PC" is ambiguous, since it can either refer to all personal computers in its original meaning, or to the IBM PC lineage that is mainly contrasted with Macs. Remember the famous "I'm a Mac, I'm a PC" ads.
When you just say "PC", people today genuinely don't know which meaning you are referring to. And "IBM PC" is antiquated, and "IBM PC clone" is even worse. So "Windows PC" is a pretty decent name.
Do you have a better suggestion? Because "Non-Mac PC" doesn't exactly roll off the tongue. If you say "Windows PC", everyone knows what you mean.
And it's not an "anal fixation", there's no need to be gratuitously insulting.
Aperocky 21 hours ago [-]
Well, there's the other problem, windows sucks.
I prefer Windows XP, or even Windows Vista, to Windows 11 with its copilot. And it's been a downhill race, even macs are more of your own personal machine than Windows today, which is saying a lot.
PC should be a PC, Windows is as they advertised, a Copilot PC.
jeroenhd 1 days ago [-]
Hopefully anyone who wants to run anything other than Windows on an Nvidia-produced device has learned their lessons at this point. Although, a cursed Nvidia Hackintosh would be extremely funny.
For normal people, there are three computer operating systems: Windows, Apple, and ChromeOS. Nvidia isn't going with ChromeOS and Apple hates their guts, so Windows is the only normal operating system they can market.
Their marketing makes clear that these devices aren't the piddly Chromebooks that ruined the desktop experience for so many people (expensive Chromebooks were nice, but rare in practice).
Qualcomm promised Linux support, failed to deliver, and now anybody burnt by their promise won't want to buy their hardware again. If they promise a Windows PC, people won't have reason to complain when Linux or FreeBSD or SerenityOS won't boot on there. Given Qualcomm's failures here, Nvidia is probably doing the right thing.
dylan604 1 days ago [-]
> Although, a cursed Nvidia Hackintosh would be extremely funny.
I did this for years. We ran Resolve color correction suites with external chassis to place multiple Nvidia GPUs in it at a fraction of the cost of the shitty TrashCanMac that was available. Lots of people continued to use the 2012 Cheese Grater MacPro with its older CPUs. The only way to get modern (at the time) compute in a Mac was to use a Hackintosh. Since it wasn't for personal use, not having things like AppStore, Messages, Music, etc wasn't a big deal, so building a Hackintosh was easier.
I built one for personal prosumer use around the time of the 1080s that gave me more machine for the dollar than Apple offered. Once the M-series chips came out and were capable of what the Hackintosh was doing for me, that put me off building anything newer.
kmac_ 1 days ago [-]
Windows is dying a death by a thousand small, user-unfriendly decisions. This is genuinely sad because the technology underlying Windows is actually very robust and flexible.
So, the partnership is maybe natural, but not prospective. Also, note how Linux is getting popular among gamers. Of course, it's way behind Windows, but the direction of the change is clear.
I'm convinced that Nvidia is not primarily targeting the consumer market and that the ultimate goal for its CPUs is the server space. The company invests effort where the money is, and consumer products account for only a fraction of its total revenue. Maintaining a presence in the consumer market seems more like a way to avoid a complete pivot than a strategic priority.
alkonaut 1 days ago [-]
And this isn't a "Windows PC" in the traditional sense. The reason people run Windows in the enterprise (and for some desktop home uses like games) is still hardware and software compat.
I run it for work because we make Windows programs. We use drivers that don't exist on Win-for-ARM yet. So to most people a "Windows PC" is still an x64 Windows PC. The risk for MS if compat isn't good enough on Windows-Arm64 is that people might as well shift away from Windows entirely if they need new software and hardware anyway.
jayd16 1 days ago [-]
A big push specifically for Windows ARM from Nvidia seems like relevant information.
bigyabai 1 days ago [-]
> It's just a personal computer
Your x86 machines were, but these are ARM SOCs. Many of them don't even support UEFI, let alone the upstream Linux kernel.
rvba 1 days ago [-]
Getting rid of UEFI is bad?
speed_spread 1 days ago [-]
However bad UEFI is, it's still better than the fragmented ARM boot wasteland.
bigyabai 1 days ago [-]
Can you quote where I said that?
dylan604 1 days ago [-]
Sir! I'm not an LLM.
SwtCyber 1 days ago [-]
The interesting part to me isn't really the Cortex-X925 vs AVX-512 comparison, but Nvidia trying to make the GPU the center of a Windows PC rather than an add-in card
QQ00 1 days ago [-]
>but Nvidia trying to make the GPU the center of a Windows PC rather than an add-in card
Over the last decade, a lot of software (especially the popular and industry-standard packages) shifted to GPU-accelerated designs. That push started before NVIDIA even tried to capitalize on it.
empiree 3 hours ago [-]
Interesting move. More competition in CPUs is good for everyone. My only question is software support, that's usually where these non-x86 launches stall. Hoping it's more than a spec-sheet flex.
I follow Daniel Lemire and like his contributions, I also understand that the HN thread was created for discussion purposes, but I'd really appreciate having a reference to the spec or a source to the claims made, either here on HN or on the tweet itself.
I dislike the cycle of propagating news and assuming that someone else double-checked it.
- NVIDIA RTX Spark powers the world’s first Windows PCs purpose-built for personal agents, featuring 1 petaflop of AI performance, industry-leading power efficiency, full-stack NVIDIA AI and graphics technology, and up to 128GB of unified memory.
- NVIDIA and Microsoft collaborate to deliver a native Windows experience for personal agents, including new security primitives and NVIDIA OpenShell to run agents securely on primary devices.
- RTX Spark lets creators, AI developers and gamers render ultralarge 90GB+ 3D scenes, edit 12K 4:2:2 video, generate 4K AI videos, run 120B-parameter LLMs with up to 1 million tokens context using agents locally, and play AAA games at 1440p and over 100 frames per second.
- Adobe is rearchitecting Photoshop and Premiere from the ground up for RTX Spark to deliver 2x faster AI and graphics performance.
- RTX Spark-powered slim Windows laptops with all-day battery life and premium displays, as well as compact desktop PCs available this fall from ASUS, Dell, HP, Lenovo, Microsoft Surface and MSI, with models from Acer and GIGABYTE to follow.”
gghh 14 hours ago [-]
Sorry for the lazy question, but some people here know this off the top of their head, so I'm asking: what is the memory bandwidth of this chip?
The last time I checked an NVidia situation was for the DGX Spark (the GB10 chip). It has regular LPDDR5X, which by the JEDEC standard cannot go beyond ~270 GB/s, i.e. 8533 Mbit/s per pin on a 256-bit bus.
So yeah, Lemire seems to go "OMG unified memory, they're following Apple's path..." ok, but Apple pulled off a much faster interconnect, 800 GB/s ballpark, and I'm trying to understand (not really, I'm asking you to try to understand, he he) how this laptop fares in that regard.
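For anyone who wants to check the ~270 GB/s figure, here is a minimal sketch of the arithmetic, assuming the number quoted above means 8533 mega-transfers per second per pin across a 256-bit bus. The function name is made up for illustration, and the result is a theoretical peak, not a measured number.

```python
def peak_bandwidth_gb_s(transfer_rate_mt_s: float, bus_width_bits: int) -> float:
    """Theoretical peak memory bandwidth in decimal GB/s.

    transfer_rate_mt_s: per-pin data rate in mega-transfers per second.
    bus_width_bits: total width of the memory bus in bits.
    """
    bytes_per_transfer = bus_width_bits / 8          # bits -> bytes moved per transfer
    bytes_per_second = transfer_rate_mt_s * 1e6 * bytes_per_transfer
    return bytes_per_second / 1e9                    # bytes/s -> GB/s


# Figures quoted in the comment above (LPDDR5X at 8533 MT/s, 256-bit bus).
lpddr5x = peak_bandwidth_gb_s(8533, 256)
print(f"LPDDR5X-8533 on a 256-bit bus: {lpddr5x:.0f} GB/s peak")

# How far that is from the ~800 GB/s ballpark mentioned for Apple.
print(f"800 GB/s is {800 / lpddr5x:.1f}x that peak")
```

Real-world sustained bandwidth is normally lower than this peak, so treat it as a ceiling.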
easygenes 12 hours ago [-]
This is the same chip and same memory. Only difference is it is going in a laptop, so will be more thermally limited.
gghh 11 hours ago [-]
Gotcha, thanks for explaining. Indeed I noticed it was 128G of memory just like the DGX.
kcb 1 days ago [-]
The same chip that's been available in the DGX Spark for like 8 months now... why are we pretending like it's the next big thing?
siliconc0w 24 hours ago [-]
I can't really see wide adoption of local LLMs unless prices really start to climb. It makes sense to use cheaper hosted smaller models like Sonnet or even Kimi but these won't run a Kimi-class model and that is really the floor for non-toy agentic tasks. Spending 5k to avoid a $20 subscription really only makes sense for niche security reasons.
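To put a rough number on the "5k to avoid a $20 subscription" comparison, here is a back-of-the-envelope sketch. It assumes a flat monthly fee and ignores electricity, resale value, and pay-as-you-go token pricing, so it only illustrates the flat-subscription case; the function name is hypothetical.

```python
def break_even_months(hardware_cost: float, monthly_subscription: float) -> float:
    """Months of subscription fees that add up to the hardware price."""
    return hardware_cost / monthly_subscription


months = break_even_months(5000, 20)
print(f"Break-even after {months:.0f} months (~{months / 12:.1f} years)")
```

Heavy API usage billed per token changes the picture a lot, which is the case other commenters here are arguing about.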
zzzoom 23 hours ago [-]
I'd bet on the inverse: China scaling DRAM production until the price crashes, and the whole US stock market that is propped on top of that scarcity going down with it.
maipen 12 hours ago [-]
Tariffs.
mathgladiator 23 hours ago [-]
I think we really haven't even seen how GenAI can influence new products and games. Have you consumed Dungeon Crawler Carl?
satvikpendem 15 hours ago [-]
Was it made with AI?
mathgladiator 7 hours ago [-]
No, but AI is a central theme and also the key way to build such a dynamic experience.
Melatonic 3 hours ago [-]
What does he mean at the end when he talks about Intel processors - is he referencing something supported in AVX-512 that is unique to AMD or ?
tosh 1 days ago [-]
nb: poster is Daniel Lemire (https://lemire.me), who is very skilled in getting performance out of compute hardware (e.g. via simd, cache usage etc)
infecto 1 days ago [-]
As he likes to share often, "He ranks among the top 2% of scientists globally (Stanford/Elsevier 2025) and is one of GitHub's top 1000 most followed developers. "
tosh 1 days ago [-]
based on citations and github stars? or what's the context there?
infecto 1 days ago [-]
I was adding further citation based on his own claims. Not sure what context is missing.
tempodox 1 days ago [-]
Still, Microslop has repeatedly proven their ability to slow everything down to a crawl no matter how powerful the hardware. If you want it to be fast, don’t use Windows.
seanalltogether 1 days ago [-]
Is it really unified memory? AMD Strix Halo is "unified" but you still have to allocate memory separately for cpu vs gpu. Apple Silicon is true unified memory.
flakiness 1 days ago [-]
My understanding is that this is the limitation from Windows not from AMD SoC. There are several internet resources to "enable unified memory support" on linux eg [1].
As a side note, qualcomm chip set on Android has been doing this for years (like Apple) so it's not super unique thing. It's more like there was no need before.
Even then the "reserved" section is a carve out guaranteed chunk to allow stuff that might need contiguous physical memory (display scan out buffers and page tables, for example) and similar.
The GPU can still happily use all the rest of the memory for other use cases - which tend to be the bulk of allocations anyway. Though there might be performance implications - for example "moving" buffer ownership to the GPU would need to evict CPU caches, and often 4k pages and tlb lookups can be a pretty inefficient situation for GPU-style accesses.
That's been pretty standard for any SoC for decades. And "differences" to apple's SoC are more implementation details.
Keyframe 1 days ago [-]
yes, but more due to OS limitations than hardware. You can use their GTT which is then _true_ UMA where GPU can grab whatever it wants from the memory pool.
This isn't the first time we have UMA on the PC, btw. When SGI did their PC workstations, their 320 and 540 PC workstations had what they called Cobalt graphics chipset and crossbar with their IVC architecture. They bypassed AGP at the time completely. It was quite unique to see strict UMA on a PC. Haven't seen it since until these new systems we're seeing now on PCs and Mac.
eigenspace 1 days ago [-]
That's a software question, not a hardware question.
Some software assumes pre-defined set-aside pools of memory reserved for video purposes, but the chip does actually have access to the whole pool.
ApatheticCosmos 1 days ago [-]
Strix halo is unified memory. The memory allocation set in BIOS is overridden by the operating system if it has the capability.
SwtCyber 1 days ago [-]
For local models, the useful part is not just having 128GB attached to the package. It is whether the GPU can practically use that memory without the usual VRAM-style constraints
glitchc 1 days ago [-]
Memory bandwidth is what matters, unified or otherwise. Discrete GPUs don't have unified memory either.
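A rough way to see why bandwidth dominates: for single-stream decoding of a dense model, every weight has to be read from memory once per generated token, so bandwidth divided by model size gives a ceiling on tokens per second. The sketch below assumes exactly that (batch size 1, no KV-cache traffic, compute never the bottleneck); the 30 GB model size is a hypothetical value picked for illustration, and the bandwidth figures are the ones quoted elsewhere in this thread.

```python
def max_decode_tokens_per_s(bandwidth_gb_s: float, weights_gb: float) -> float:
    """Upper bound on single-stream decode speed, assuming every weight
    is read from memory once per generated token (dense model, batch 1)."""
    return bandwidth_gb_s / weights_gb


WEIGHTS_GB = 30.0  # hypothetical model size, chosen only for illustration

for label, bandwidth in [
    ("~270 GB/s (LPDDR5X figure from this thread)", 270.0),
    ("819 GB/s (M5 Max figure from this thread)", 819.0),
    ("1800 GB/s (RTX 5090 figure from this thread)", 1800.0),
]:
    ceiling = max_decode_tokens_per_s(bandwidth, WEIGHTS_GB)
    print(f"{label}: at most {ceiling:.1f} tokens/s")
```

Prompt processing (prefill) is a different story, since it is mostly compute-bound.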
fc417fc802 1 days ago [-]
> you still have to allocate memory separately for cpu vs gpu
That's an API issue not a hardware issue. Regardless, I believe the major APIs permit seamlessly sharing pointers at this point? (I have no experience doing that though.)
joe_mamba 1 days ago [-]
>AMD Strix Halo is "unified" but you still have to allocate memory separately for cpu vs gpu.
IIRC that's to maintain BIOS and Windows (+games & apps) backwards compatibility, but memory access speeds are the same.
ankurdhama 1 days ago [-]
It is unified in the sense that the OS can dynamically assign memory to CPU and GPU. Apple silicon is not alien tech that other silicon vendors cannot implement.
comandillos 24 hours ago [-]
I don't really get the hype around the whole N1X thing when in reality this is the same almost 1-year-old GB10 that was released with the DGX Spark and proved to be quite a disappointment.
mohamedkoubaa 23 hours ago [-]
The commenters here seem to have forgotten that computers can do things other than inference
Schnitz 23 hours ago [-]
How is this different from something like the AMD Ryzen AI Max that can already be purchased and supports 128GB unified memory? Seriously curious.
nine_k 23 hours ago [-]
Maybe CUDA support, or something else specific to NVidia?
embedding-shape 1 days ago [-]
> up to 6,144 state-of-the-art CUDA cores
A RTX Pro 6000 has ~24K 5th generation tensor cores, I'm guessing this would then be 1/4 of the count but 6th generation? Wasn't clear from the images.
gravypod 1 days ago [-]
What is more important than core count is how the caching architecture is laid out. They could lay out those 6k cuda cores in a layout which provides much larger blocks of cache to smaller number of cores. That would increase the memory bandwidth which would be better for inference.
embedding-shape 1 days ago [-]
Sounds like the memory bandwidth is worse though;
> The memory is not as fast as dedicated GPU memory, but it is cheap enough while delivering enough bandwidth to run AI models locally.
Also "cheap while delivering enough" certainly sounds like someone is trying to temper expectations. It sounds like something sitting in-between GPU+VRAM inference and CPU+RAM one, not as a step above/besides GPU+VRAM.
gravypod 1 days ago [-]
Having slower memory may not actually lead to lower memory bandwidth. The cuda cores can be broken up into compute complexes with larger blocks of memory directly attached to the cores. These could be filled with read operations from the bulk system memory. You can start executing and then page the next batch of data in while compute is working. For LLMs you don't have much random memory access, you can sequence your accesses in blocks.
If these chips become popular I am sure you will see LLM architectures taking advantage of the parallelism.
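The "page the next batch in while compute is working" idea is usually called double buffering. Here is a toy sketch of the ordering in plain Python; `load_block` and `compute` are hypothetical stand-ins for a read from bulk memory and a GPU kernel, and this says nothing about how the actual hardware or CUDA runtime schedules things.

```python
from concurrent.futures import ThreadPoolExecutor


def load_block(index: int) -> list[float]:
    """Hypothetical stand-in for reading one block from bulk system memory."""
    return [float(index)] * 4


def compute(block: list[float]) -> float:
    """Hypothetical stand-in for the compute kernel that consumes a block."""
    return sum(block)


def run_double_buffered(num_blocks: int) -> list[float]:
    """Process blocks in order while prefetching the next one in the background."""
    results: list[float] = []
    if num_blocks == 0:
        return results
    with ThreadPoolExecutor(max_workers=1) as pool:
        pending = pool.submit(load_block, 0)              # start loading the first block
        for i in range(num_blocks):
            block = pending.result()                      # wait until block i is loaded
            if i + 1 < num_blocks:
                pending = pool.submit(load_block, i + 1)  # kick off the prefetch of block i+1
            results.append(compute(block))                # compute on block i meanwhile
    return results


print(run_double_buffered(4))
```

In CPython the two steps will not truly run in parallel for pure-Python work, so this only shows the sequencing: the load of block i+1 is issued before the compute on block i starts.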
cthalupa 1 days ago [-]
> The cuda cores can be broken up into compute complexes with larger blocks of memory directly attached to the cores.
Perhaps in theory, but for the gb10 stuff the memory is all on the CPU die and connected to the GPU die via nvlink-c2c
dh2022 1 days ago [-]
A beast of a Windows PC to do what? Run Teams, Excel, Outlook, and a browser all at the same time? We could do that just fine in 2010…
mrweasel 1 days ago [-]
Just give it a year or two and Windows will drag that sucker into the mud and run everything just as sluggish as ever.
The idea that any hardware performance increase will be eaten up by terrible software is an evergreen. A computer that could serve as the single server for a medium size enterprise 20 years ago, is no longer able to serve as a desktop for a receptionist. I'm not even sure we're talking diminishing returns anymore, we're probably past the point of maximum yield and into the negative returns at this point.
mariopt 1 days ago [-]
I think most people are not understanding what this kind of laptop will provide.
Before we get local AI, we'll be using hybrid AI.
Running big models locally is unrealistic ($$$$$) but, if you imagine an Agentic Workflow where some bits run on the cloud and other smaller tasks locally, it's an amazing deal. You don't need Opus/Code/DeepSeek/Kimi/etc to do basic stuff that models like Gemma4:12b/Qwen-27b can do locally with much less latency.
Having a laptop where I can use a remote big model and combine it with 5 local domain-specific models is something I would love to do today. Imagine using OpenCode with a small model that decides which tasks run locally, then decides whether there's a good local model for task XYZ or whether to use a cloud model.
My main concern is: is this hardware powerful enough to allow quick switching between local models? Unlikely, but I hope I'm wrong.
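A minimal sketch of what that routing step could look like, assuming a static table of local models keyed by task kind. Every model name and task kind here is made up for illustration; a real setup would presumably use a small model to classify the task rather than a lookup.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Model:
    name: str
    local: bool
    skills: frozenset  # task kinds this model is considered good enough for


# Hypothetical registry of small local models (names are placeholders).
LOCAL_MODELS = [
    Model("local-code-small", True, frozenset({"rename", "format"})),
    Model("local-summarizer", True, frozenset({"summarize"})),
]

# Hypothetical cloud model used when no local model covers the task.
CLOUD_FALLBACK = Model("cloud-frontier", False, frozenset())


def route(task_kind: str) -> Model:
    """Pick the first local model that lists the task kind, else fall back to cloud."""
    for model in LOCAL_MODELS:
        if task_kind in model.skills:
            return model
    return CLOUD_FALLBACK


for kind in ("summarize", "format", "architecture-review"):
    chosen = route(kind)
    where = "local" if chosen.local else "cloud"
    print(f"{kind} -> {chosen.name} ({where})")
```

The model-switching cost raised above is not captured here: if the chosen local model is not already resident in memory, loading it is the slow part.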
Gareth321 1 days ago [-]
Given the incredible progress of local models, on present trajectory I think we see comparable levels of performance to frontier models in two years on 128GB unified RAM and 6-bit quantisation. Note how the frontier models are now hitting superior benchmarks with only 200,000 tokens. I think we still have a long way to go with distillation.
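For a sense of scale on "128GB unified RAM and 6-bit quantisation", here is the weight-only arithmetic as a small sketch. It assumes a flat bits-per-weight figure and ignores the KV cache, activations, quantisation metadata, and whatever the OS itself needs, so the real headroom is smaller. The 120B parameter count is the one from the press release quoted elsewhere in this thread.

```python
def weights_gb(params_billion: float, bits_per_weight: float) -> float:
    """Approximate size of the model weights in decimal GB."""
    return params_billion * bits_per_weight / 8  # billions of params * bytes per param


MEMORY_GB = 128  # unified memory pool size discussed in this thread

for bits in (16, 8, 6, 4):
    size = weights_gb(120, bits)
    verdict = "fits" if size <= MEMORY_GB else "does not fit"
    print(f"120B params at {bits}-bit: {size:.0f} GB of weights, {verdict} in {MEMORY_GB} GB")
```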
zuzululu 15 hours ago [-]
it should be but im more excited i can game on it
Waterluvian 1 days ago [-]
It’s an opportunity for them to start doing away with the whole ATX thing where owners had freedom to mix and match at their own pleasure.
burnt-resistor 1 days ago [-]
They'll ship a welded-shut box that requires an activation key to power on. Users will get to pick which color sleeve it uses, though.
VortexLain 1 days ago [-]
I really hope this will have proper GNU/Linux support, otherwise it will end up the same way Qualcomm ARM PCs did.
QQ00 1 days ago [-]
guaranteed it will be like Qualcomm arm, it's a partnership with Microsoft after all. we may see a community project to make Linux work on it but it will not have an official first class support and many things will likely not work properly.
AmazingTurtle 1 days ago [-]
while unified memory may offer better performance than unsoldered DDR system memory, it still won't be as great as 1.8TB/s bandwidth on high end consumer GPUs right now.
Nvidia's master plan may be making it the new normal to have "only" 400GB/s bandwidth, thus gatekeeping local model usage further behind "more memory but not as fast as the cloud can do it"
dangus 1 days ago [-]
I think it’s an interesting theory but a bit too conspiracy theory-ish.
Nvidia just wants to sell stuff to everyone.
And I think for professionals doing local AI work, products like Strix Halo and Apple Silicon are a competitive threat.
A big part of maintaining the leading software ecosystem is ensuring you have competitive hardware for all your users.
I also think the RTX Spark product is relatively low effort for Nvidia. Grab a Mediatek CPU and slap an Nvidia GPU on the die. Sure, that’s oversimplifying it, but still.
PedroBatista 1 days ago [-]
Don't want to be too harsh, maybe I'm missing something, but the CPU is at least 2 years old, internally it has been a complete shitshow and that's a minor hiccup when compared to the firmware and software situation.
It's an interesting "newcomer" and the more the better but calling this a "beast" and a "game changer" is ridiculous to say the least.
Then there is the price..
Animats 1 days ago [-]
How much is this supposed to cost, fully populated with 128GB of RAM? How much would this laptop cost?
It's not that the NVidia chip has that much RAM built in, after all. It's that it can address that much. RAM is sold separately.
cthalupa 1 days ago [-]
The dgx spark is the same chip and those are in the low 3s to 5 range for most of them depending on manu, storage config, etc. The dgx sparks also have connectx 7 cards in them to support the 200gbps networking for RoCE.
So I would expect the mini PCs to come in less than the sparks. Laptops I assume will be close in price with the addition of all the other laptop stuff.
forrestthewoods 1 days ago [-]
No. It’s all integrated. Not something you can buy separately or upgrade later.
Animats 1 days ago [-]
See [1]. There's not 128GB of on-chip memory. "Integrated" memory in this context means that the GPUs and CPUs all use the same memory. There are on-chip caches, of course.
It is all integrated into one monolithic “superchip” package. The 128GB of RAM isn’t going to be purchased separately or be upgradable. At least according to all indicators. Which is what I was responding to.
wmf 1 days ago [-]
$4,000-5,000
amacbride 1 days ago [-]
It's effectively the same as the GB10 in the DGX Spark (Blackwell architecture, 6,144 CUDA cores, perf-wise comparable to an RTX 5070).
I've found it very useful for running big models, but it's not a screaming powerhouse in terms of raw compute.
adamnemecek 1 days ago [-]
They are early versions, wait 4 years.
vegabook 23 hours ago [-]
I ran two gens of Jetson board and I have zero confidence in this. NVDA is printing in the data centre and everything else has no staying power.
bigyabai 23 hours ago [-]
I've run two gens of Jetson, two gens of Nvidia/Linux dGPU and one gen of Tegra. The dGPUs have much better drivers than the devkit hardware does.
ozgrakkurt 1 days ago [-]
Says running local LLMs isn’t relevant. Then says it is decent for games, which is just correct if you compare any GPU remotely similarly priced. I don’t understand what point he is making.
proxysna 1 days ago [-]
It's just a DGX spark with faster memory and a windows boot?
JBiserkov 23 hours ago [-]
Offtopic, but "Twitter. Now that's a name I've not heard in a long time. A long time".
noveltyaccount 1 days ago [-]
> “Our goal is to deliver unmetered intelligence to every home and every desk with Windows,” said Satya Nadella, chairman and CEO of Microsoft. “RTX Spark marks a real breakthrough towards that vision.”
I expect computers with this chip will be about $4000. If Microsoft can deliver on local AI models that can orchestrate Windows and have solid real world intelligence, that will be an inexpensive business purchase compared to pay as you go tokens. I'm excited to see how this plays out.
yoyohello13 23 hours ago [-]
If it runs well with Linux, I’m sold. A Windows pc will never see the inside of my network.
seabrookmx 22 hours ago [-]
The DGX spark desktop shipped with Ubuntu, but I haven't seen if it has a bunch of Nvidia specific repos for drivers etc. needed to make it function.
Assuming all that stuff is upstreamed (and they aren't using oddball webcam/input devices etc) it should have much better support than Qualcomm.
Fingers crossed!
alberth 1 days ago [-]
Is this essentially an Apple M-Series chip in concept?
trynumber9 20 hours ago [-]
No, it stems from a lineage of Tegra chips that pre-date the M-Series.
This chip was called GB10. One of its predecessors, GV10 was shipped in 2018.
It was a 256 bit, unified-memory system on a chip with a Volta GPU and 12 ARM Cores. GB10 is a 256 bit, unified-memory system on a chip with a Blackwell GPU and 10+10 ARM cores.
Aperocky 21 hours ago [-]
Why is it only for Windows PC, can we not run Linux or at minimum SteamOS?
daft_pink 1 days ago [-]
Sounds good, but how much does it cost? Is this going to be an affordable laptop or $6000.
shadowpho 1 days ago [-]
I mean fast 128gb of ram is like $2k so it’ll be $3k after overhead for just the ram portion of device.
YasuoTanaka 1 days ago [-]
128GB of unified memory is a dream come true for local LLMs. VRAM has been the ultimate bottleneck for developers.
adrian_b 1 days ago [-]
The competitor for this NVIDIA CPU will not be the now old AMD Strix Halo, but its successor (launched recently), which supports up to 192 GB of unified memory. Thus 128 GB is no longer SOTA.
While this NVIDIA system is inferior from the point of view of the memory capacity, its main advantage is that the top models will have a bigger GPU, i.e. with 6144 or 5120 FP32 execution units, compared to 2560 for the AMD GPU (compared to the NVIDIA CPU, the AMD CPU has a better multi-threaded performance for legacy programs, and a much better multi-threaded performance for the applications that use AVX-512).
However, these top models with big GPUs will also be much more expensive than the competing AMD system, while also being much more expensive than a laptop or mini-PC with an equivalent discrete NVIDIA GPU (which has the disadvantage of having direct access only to a much smaller, even if faster, memory).
christkv 1 days ago [-]
I don’t think there is much improvement in compute for the new strix halo revision. The next one supposedly adds rdna4 cores or similar and more memory channels
adrian_b 10 hours ago [-]
There is no improvement in the CPU or GPU, except for minor increases in the clock frequency.
The memory interface is a little faster, but the greatest improvement is +50% in the memory capacity, both over the old Strix Halo and over NVIDIA Spark.
However, even the Strix Halo CPU was better than the NVIDIA/Mediatek CPU.
NVIDIA's only advantage (in its more expensive variants) is a GPU equivalent to an RTX 5070.
It remains to be seen what the prices of the NVIDIA Spark models with big GPUs will be, but the rumors are that they start at around $3000 and go up, with the upper limit for 128 GB DRAM and an uncut GPU still unknown.
It also remains to be seen whether the variants with the biggest GPU can use it effectively when having a rather low memory bandwidth for such a big GPU.
zamadatix 1 days ago [-]
I have a 128 GB LPDDR5X machine. It's a great workstation laptop (which is why I got it) but the memory bandwidth is just awful if you're wanting to use it for AI. An old Epyc CPU will fare better both in terms of being able to run full-sized larger models and in having higher memory bandwidth, and that's not a recommendation to go that route either, as it's still not worth it.
avocadoking 1 days ago [-]
It could help with exploding external LLM costs.
Interesting to see how the adoption will be, which will mainly depend on the price.
SwtCyber 1 days ago [-]
This is what makes it interesting to me as well
zackify 1 days ago [-]
[dead]
cyberziko 1 days ago [-]
good to know, hope the price will be affordable, having a pc becoming a luxury :)
dgellow 1 days ago [-]
I’m not sure if you’re aware but there is a supply chain shortage for pretty much everything needed for a PC that isn’t expected to be solved this year or next year. There is no way that can be affordable
8note 1 days ago [-]
or moreso, the available supply has been eaten by rampant speculation, and hyperscalers have overpurchased vs the datacenters they can actually get built and power
crims0n 1 days ago [-]
Certainly not in the year of our lord, 2026. Maybe in a few years though.
BoredPositron 1 days ago [-]
Mediatek and Nvidia the horsemen of abandoning hardware after a year. The Jetson family still left a bad taste in my mouth.
thewebguyd 1 days ago [-]
Qualcomm is too. They mainlined the GPU firmware for the X Elite 2nd gen, but still have not done so for their 1st gen X Elite which they promised full Linux support for and failed to deliver, and have now moved on pretending they never said that.
burnt-resistor 1 days ago [-]
How dare you question the golden goose egg-laying algorithm for trillions in stock valuation!
oldnetguy 23 hours ago [-]
SGI had unified memory back in 1996.
htk 1 days ago [-]
The M1 Max from 2021 has better memory bandwidth. The M3 Max can be specced to 128GB.
Nothing new here, apart from being able to use CUDA on a less power hungry system.
bigyabai 22 hours ago [-]
The M1 Max has an unusably slow GPU for inference. TTFT on real-world contexts can be over 10 minutes.
> Nothing new here, apart from being able to use CUDA on a less power hungry system.
CUDA has been running on ARM SOCs since the Tegra K1, 12 years ago. Nvidia is not new to ARM, nor is CUDA.
cryo32 1 days ago [-]
Yeah, when laptops are shipping with 8GB and Microsoft is suddenly interested in native apps, nope.
Tech companies have strangled their own market.
thewebguyd 1 days ago [-]
Laptops shipping with less RAM is exactly the reason to be interested in native apps again. Every app being a chrome/EdgeWebView process is the problem.
npn 1 days ago [-]
Is this somehow satire? This is just the dgx spark with keyboard and monitor in a convenient format. Since it has more stuff, I'm sure that the price mark up will increase too.
Up to $5000 because why not?
With that money you can build a real PC with rtx 5090!
thewebguyd 1 days ago [-]
Not with 128GB (less OS) available to the GPU you can't. The unified memory is the point with this machine (and the dgx spark).
derefr 1 days ago [-]
> The game changer is the unified 128 GB memory. That is the path Apple took years ago. Instead of separate memory for the CPU and GPU, everything shares a single pool. It is increasingly popular.
> The memory is not as fast as dedicated GPU memory, but it is cheap enough while delivering enough bandwidth to run AI models locally.
So, the reason "dedicated GPU memory" is fast, isn't because it's "dedicated"; it's because the types of memory built into GPU cards — GDDR and HBM — are designed for throughput over latency.
Which is to say, GDDR and HBM memory could be shared with the CPU in UMA while still being "fast" (for GPU use-cases.) In fact, the PS4/5 and Xbox 360 / One X / Series consoles have UMA architectures that use GDDR memory as their main memory, with no regular DDR memory to be found.
What I don't understand: why don't we see UMA architectures where there's both regular DDR and GDDR/HBM memory mapped into the address space of the CPU+GPU? That seems like the best of both worlds: you'd have some memory that's "tuned" for random-access CPU usage (regular DDR), and some memory that's "tuned" for streaming GPU usage (GDDR/HBM), but either type of memory can still be put to the use it wasn't "tuned" for, just with slightly-worse performance.
I guess you'd need to do a bit of software work:
1. a bit of work in the OS kernel / malloc library to get CPU workloads to "prefer" allocating DDR memory over the GDDR/HBM memory until they've exhausted DDR memory (or maybe not, if you just tell the kernel the GDDR/HBM memory is something like a zswap thinpool);
2. and a bit of work in supported ML frameworks, to teach them about a hybrid strategy between UMA "allocate anywhere, it's all the same" and NUMA "keep assets in VRAM if possible; if you spill assets to RAM, then they must stream into VRAM on access" (i.e. "at allocation time, allocate as if the system were NUMA, VRAM first then spilling to RAM; but at execution time, use the UMA codepaths, no need to copy RAM into VRAM.")
...but once that's done, it's done.
Rohansi 17 hours ago [-]
Theoretically, maybe? But they are completely different interfaces so it would surely get complicated. It's also approaching the current behavior in non-unified memory systems where you have two pools of memory with different performance characteristics. You'll realistically want the CPU to always use low latency memory and the GPU to use high bandwidth memory with very little moving between them.
It's going to be amazing. Almost twice as fast for only 10 times the heat. Consumers aren't concerned with efficiency they only care about performance.
danielovichdk 1 days ago [-]
A hardware company that proposes you buy more hardware from them.
Must be a new business model.
....
Step into my office
Why ?
Because you are fucking fired
dcreater 1 days ago [-]
They announced RTX spark days ago. Why is this post linking to a "leak" tweet on the frontpage now?
epolanski 1 days ago [-]
Not gonna lie, I'm buying one of the 128GB ram ones for local inference if price is human.
jmyeet 1 days ago [-]
This is the RTX Spark [1].
The obvious comparison here is the M5 Max where you can buy a Macbook Pro with 128GB of also unified memory. Obviously CUDA cores are specific to NVidia so it's hard to directly compare but I've seen claims that the M5 Max is roughly equivalent to ~4000 CUDA cores. This obviously depends on workload and whether the CPU supports the precision you want to use (eg FP4).
The M5 Max has memory bandwidth of 819GB/s. The RTX Spark I believe is ~600. So it might be slightly better than the current generation of Macs but likely worse than the expected M5 Ultras of the new Mac Studios (likely Q3 2026).
For comparison, a 5090 has >20k CUDA cores and 1800GB/s memory bandwidth with 32GB of VRAM. The RTX 6000 Pro (at ~$10k) has 96GB of VRAM, same bandwidth and ~24k CUDA cores.
We have to see what RTX Spark systems sell for but the DGX Spark is in the Mac Studio price range (~$4k).
I do think Apple has a real opportunity here but their offerings aren't quite there yet. The M5 Ultras might be a really attractive option for local LLMs. I expect them to be in high demand.
> I've seen claims that the M5 Max is roughly equivalent to ~4000 CUDA cores
Who claimed that? The M5 is still a raster focused GPU, dedicated matmul blocks be damned. For some workloads that napkin math might work out, but for many others it's a wild overshoot. Time-to-first-token still favors CUDA, and real-world training workloads aren't getting anywhere near Apple Silicon.
All of the memory bandwidth in the world is useless if you spend 15 minutes processing 64k tokens worth of context prefill. This is where CUDA shines.
thrance 1 days ago [-]
Will it support Linux?
1 days ago [-]
2OEH8eoCRo0 1 days ago [-]
Are their enterprise orders slowing down? Why use precious maxed out fab capacity on consumer stuff when it could be an enterprise chip?
zamadatix 1 days ago [-]
It uses LPDDR5X instead of VRAM and will still sell for a premium while pushing their presence even further in every side of the AI market. This was one area AMD was ahead in and now Nvidia is probably better off making this to compete on that front while still being better off than making a 5090.
fc417fc802 1 days ago [-]
That doesn't answer the question. If the high margin enterprise GPUs are saturating the fab capacity you wouldn't expect them to be pushing this. But IIRC those all have oodles of integrated HBM at this point so I wonder if fab capacity for that has become a bottleneck.
zamadatix 1 days ago [-]
I believe it does - the reasons why are exactly differences like LPDDR5X vs HBM3e. Not every fab is capable of making any type of chip another fab makes. If you can make a product with different chips and still sell it for a premium why would you not just because the fabs for your DC product's chips are busy?
Looking at it more, I believe the story repeats with the TSMC processes used for the CPU vs chips like GB200 as well.
Even if none of the above were the case, the question still isn't "why not make the enterprise GPU" it's "why not make the higher margin per chip area product". If the NV1/GB10 take less die space and cost a lot it's not immediately apparent the enterprise GPU actually nets Nvidia more $ per die or not. That's why it's relevant these will be sold at a premium.
1 days ago [-]
dofm 1 days ago [-]
It already is an enterprise chip. This is about Microsoft not having the equivalent of an M3 Max or whatever laptop.
And maybe for NVIDIA and MS it is also about them quietly betting that local models are, in fact, going to be good enough for most tasks pretty soon.
easygenes 11 hours ago [-]
Mostly a strategy move to protect the CUDA moat… Apple would take over mobile inference in a clean sweep without competition.
thewebguyd 1 days ago [-]
This is an enterprise offering. I'd take a guess it's to try and stop the bleeding over to macOS. This launch, plus WSL containers, their own de-bloat winget config, mxc, etc. all seems like they're saying "pls stop leaving for macOS, see, Windows can be a great dev machine too."
wmf 1 days ago [-]
This chip was designed before the shortages. I think they'll order just enough units to say they released it but not enough to put a dent in Rubin.
jqpabc123 1 days ago [-]
I am not sure how many people will run AI models locally. It still seems like a niche application to me.
I'd say this relates directly to the cost of running AI models remotely.
And we won't know what the actual cost will be until AI vendors recover the huge pile of cash they've dumped into development (plus interest).
chpatrick 1 days ago [-]
I think it's niche now because getting the hardware to run it is expensive and the quantized models don't work as well. If those improve then it would be a no-brainer to pay a one-off for the hardware instead of a fortune for API calls.
dofm 1 days ago [-]
I am not really convinced that four bit quantisation is that bad; almost certainly six will be enough. But Google claim that their QAT tech in Gemma, which they are surely using or testing in Gemini, preserves nearly source-model quality while reducing footprint.
The hardware for 50 tokens per second with a four bit quantisation of Gemma 4 26B or the sparse Qwen 3.6 is not really that expensive: it’s a secondhand M1 Max.
Beyond that, I agree. I think moving planning tasks to local is a now thing, not that it really has much impact on token spend. I also think many small coding tasks are fully within the grasp of the above two models.
The main issue right now is that the software landscape is rather confusing, but I reckon uncomplicated Gemma 4 26B QAT support with MTP is a few weeks away.
1 days ago [-]
jqpabc123 1 days ago [-]
AI vendors are attempting to offer the whole apple. And they are spending huge sums of money in the process.
But most businesses don't really care about most of the apple --- they only need their special bite out of it.
For example, doctors mainly care about medicine. Nvidia is attempting to provide the hardware needed for local, specialized models.
dofm 1 days ago [-]
I think it is likely to appeal to video and photo editors who want to use AI tools (the press release has a quote from Blackmagic Design, as well as from Adobe, who I think have no stomach for their own cloud AI).
But I don’t know about specialised: this could run quite large models with MoE.
dgellow 1 days ago [-]
Performance of local models is pretty bad compared to what AI vendors offer; token generation is just too slow to be that useful. And you need to allocate GBs of memory, something that will stay very expensive to buy for a long time.
Running local models will stay niche for a while, unless we see breakthroughs
jqpabc123 1 days ago [-]
Dumb idea --- how about if we limit local models to specific domains --- medicine for example.
Most doctors don't care much about engineering or accounting or software development or 10000 other things that big vendor models address.
This area is yet to be really explored. Nvidia aims to provide the hardware to do so.
CamperBob2 1 days ago [-]
That's a fairly obvious idea, not dumb at all, but unfortunately it doesn't seem to pan out. Trying to specialize an LLM in one area harms its 'cognition' in all areas. For instance, if you train a coding model without all the Shakespeare and soap operas and Wikipedia and pirated Stephen King books and ancient Roman history and whatever, you end up with a worse coding model.
The article is not backed up by reality. Why would anyone use anything but a domain-specific LLM, if they actually worked?
The author is probably confusing RAG with pretraining. You can RAG on PubMed but you can't arrive at a competitive model by pretraining solely on it.
6 hours ago [-]
sometimelurker 1 days ago [-]
cant wait til someone figures how to run Linux on one of these
easygenes 11 hours ago [-]
That happened a year ago when these shipped as the DGX Spark with only Linux pre installed.
einpoklum 1 days ago [-]
Intel's basic architecture keeps accelerators away from main system memory, unlike, for example, IBM's POWER architecture where the CPU and GPU are equal 'users' of memory. It's not a great breakthrough to suggest something different. The problem is - it's different, and not compatible with a lot, or most, or all, existing hardware. Also, there are some security concerns, as @stego-tech noted.
shevy-java 1 days ago [-]
And it will be expensive - right?
Nvidia is milking the market now. We need more competition again - currently we have a mafia controlling the prices, not just Nvidia but all the AI companies. The price increases should be paid by them, not by us. The "free market" is being manipulated by them here.
emsign 1 days ago [-]
They are useless if RAM prices are this high. $800 laptops with maximum 8GB are currently the norm, Windows 11 can't run on them decently. No matter how fast the SoC is with overpriced RAM they are slow. Systems that can make good use of them with 64-128GB are not affordable anymore thanks to Nvidia and co. This is a smokescreen. They'll probably sell them packaged as compute modules anyway.
alt227 1 days ago [-]
> Windows 11 can't run on them decently
Windows 11 can run just fine on 8GB of memory; what can't is Google Chrome.
llm_nerd 1 days ago [-]
Does this person know that this is the same GB chip in the DGX Spark? It isn't some proposed thing, it's a chip loads of people have on their desk right now, and there are endless benchmarks of it.
Decent single core (a long way from Apple level, but decent), but it makes up for it in cores to provide M5 level performance, CPU wise. On memory bandwidth it is kind of starved, at 1/6th that of many GPUs.
They got Microsoft to customize Windows for the RTX Spark, and will likely have to brutally throttle it when running as a laptop (it's literally a 140W TDP chip), and that's neat. It's going to be a very expensive laptop.
SwtCyber 1 days ago [-]
This is probably the better way to frame it: not "Nvidia is proposing a new CPU system" but "Nvidia is trying to move an existing GB/Spark-class platform into a Windows PC form factor"
Apreche 1 days ago [-]
I heard the memory bandwidth is not just slower than on a GPU, as expected, but is significantly slower than Apple’s unified memory.
MrBuddyCasino 1 days ago [-]
CPU/GPU is decent (800 GB/s or so), memory is slowish (300 GB/s or so). Some Apple M are slower, some are faster.
dagmx 1 days ago [-]
Where did you get those numbers from?
DGX Spark has a maximum of 273 GB/s bandwidth in ideal scenarios (hard to reach)
That puts it between an M5 (153) and M5 Pro (307)
MrBuddyCasino 1 days ago [-]
The 900 GB/s is from the NVLink-C2C interconnect, if you were wondering about that. They quote "up to 900 GB/s of bidirectional bandwidth between GPU and CPU".
Mind you, that's not to/from memory, which indeed only has 273 GB/s.
dagmx 1 days ago [-]
Ah I see. But the only C2C equivalent on the Apple side is the UltraFusion which is 2.5TB/s if I recall correctly.
MrBuddyCasino 1 days ago [-]
Yes, it's not an "Apple M killer" at all. Also, the available official performance numbers are partially overstated (1 Petaflop is only possible for sparse FP4 models, "in theory").
Perhaps a sobering rule of thumb: if it was actually useful, you couldn't buy them because someone would scoop them all up to shove them in a DC and make money with it.
MrBuddyCasino 1 days ago [-]
Plus John Carmack has reviewed it, he was not amazed.
throwaway5752 1 days ago [-]
"Major banana producer suggests shifting more ice cream store menus to banana splits, and increasing the amount of bananas per serving"
theturtle 1 days ago [-]
[dead]
sylware 12 hours ago [-]
[dead]
sisve 1 days ago [-]
> I am not sure how many people will run AI models locally. It still seems like a niche application to me.
Bill Gates had a quote some years ago...
People have still not learned how fast we improve our tech and how much cheaper things get, I guess :)
dgellow 1 days ago [-]
Memory isn’t getting cheap soon, and you need a lot of it for local models
sisve 1 days ago [-]
All depends. The current technology will be cheaper in a year or two. The best cutting edge stuff will probably be even more expensive. But in 10 years' time... we can run current SOTA models (or models that are equally good) on our local hardware
dgellow 1 days ago [-]
Ah yes, if you count in decades, for sure I expect to run them locally
chaostheory 1 days ago [-]
We had a thing called globalism that drastically reduced costs. Globalism right now is on life support. Given geopolitics, I don’t see how it’s going to survive.
sisve 11 hours ago [-]
You are right. I haven't considered that enough. I do agree with you that globalism will be decreased even in the context of a decade, but I think it will be on a level that still give us some cost reduction. But I do think (at least hope) geopolitics will be better in 5-7 years.
PeterStuer 1 days ago [-]
"I am not sure how many people will run AI models locally. It still seems like a niche application to me."
Clip me :). You are currently living through the final stages of unrestricted computing in the hands of the 'public'. Our regimes are going to pull up the drawbridge in the name of 'safety'. Download the open models asap and prepare for an airgapped computing environment. That will be your frontier in not extremely neutered AI in the near future.
I am so hoping I'm completely wrong on this btw.
1 days ago [-]
shevy-java 1 days ago [-]
They have bought the governments, so what you model will probably become true. They just inflate the prices right now.
> Reduce overall memory cost, by letting system builders purchase a single type of memory in bulk instead of having to figure out GDDR vs DDR memory placement (important for SFF/portable machines)
The trouble with this is cost. In principle you could get the same 1.8TB/s of memory bandwidth as the RTX 5090 has, with the better latency of DDR5, by using DDR5 with a 1536-bit bus. This is indeed what multi-socket servers do: two sockets with 768 bits of memory channels per socket. But now check how much those system boards cost.
But the remaining alternatives are both worse. If you use GDDR for the unified memory then GDDR costs more than DDR and you're going to have significantly worse latency for the CPU. If you use DDR without a 3-4 times wider bus than the already-wide GPU then the GPU gets starved for bandwidth.
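The bus-width arithmetic in this argument can be checked with the usual peak-bandwidth formula (bytes per transfer times transfers per second). This is a rough sketch: the function name is made up, the 1800 GB/s figure is the rounded RTX 5090 number quoted in the thread, and real sustained bandwidth is lower than theoretical peak.

```python
def peak_bandwidth_gbs(bus_width_bits: int, transfer_rate_mts: int) -> float:
    """Theoretical peak in GB/s: (bus width in bytes) x (mega-transfers per second)."""
    bytes_per_transfer = bus_width_bits / 8
    return bytes_per_transfer * transfer_rate_mts * 1e6 / 1e9


RTX_5090_GBS = 1800.0  # rounded figure quoted in the thread

ddr5_512 = peak_bandwidth_gbs(512, 9600)    # same bus width as the 5090
ddr5_1536 = peak_bandwidth_gbs(1536, 9600)  # three times wider

print(f"DDR5-9600, 512-bit:  {ddr5_512:.1f} GB/s "
      f"({ddr5_512 / RTX_5090_GBS:.0%} of the 5090)")
print(f"DDR5-9600, 1536-bit: {ddr5_1536:.1f} GB/s")
```

So a 512-bit DDR5-9600 bus lands at roughly a third of the GDDR figure, and it takes about three times the width to catch up, which is where the "3-4 times wider bus" comes from.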
It also has way better throughput because it's physically surrounding the chip itself and wired in a way that maximises this.
The real problem is interconnect speed and latency. We have made tons of progress elsewhere but AI is exposing that the interconnect in many systems is just not great. Even future PCIE 6.0 is fairly bandwidth constrained compared to 8 channels of DDR memory or the way we solder GDDR next to the chip.
We moved on from AGP and older formats to PCI-E and I think it's time to do that again. And maybe even "slot" based implementations in general for both RAM (system and graphics) and GPUs.
In summary, we need consumer machines and workstations to use pin-based stuff like LPCAMM RAM. And the interconnect on the motherboard itself needs to be both wider (more bandwidth) and lower latency. This might require moving on from the motherboard being two-dimensional only (a flat board) to something like an L shape to gain more physical board space.
Since CPUs are highly optimized, both increasing the latency of the main memory and increasing the size of L3 will probably lead to larger L3 latency.
And yes, a L4 cache can be one way out of that problem. Another way is making the L3 cache lines wider and working the hell out of improving your management algorithm.
It's not a theoretically impossible problem. It's also not something you can solve automatically with a bit more money or some simple decisions. It's possible this is the best architecture available, but it's not certain by any means.
Everything is ultimately a compromise of some sort, and modern Unified Memory feels like one of the better compromises out there given the current plateauing of hardware scaling, the growing costs associated with memory and NAND, and the shifting complexity from hardware (more instruction sets, more accelerators, more cores) to software (more abstraction layers, more machine learning).
Transitioning over to wild speculation here, I think that most likely this will be treated as part of an absurdly large L3 (ala 3D V-Cache) or as an additional L4. In either case I expect the latency and power tradeoffs introduced to be tolerated as "good enough" even for the highest end consumer gear. (Actually I wonder if some sort of special case cache would be feasible, with memory addresses flagged by the graphics driver and regular CPU related stuff skipping over it entirely. But by then we've squarely entered the territory of vaguely unhinged rambling on my part.)
Alternatively if the performance caveats are deemed to be important enough to justify the added complexity it wouldn't surprise me to see the HBM treated as an independent memory pool analogous to that of a dGPU. That wouldn't change the current status quo with respect to the GPU APIs but it would significantly ameliorate the memory bandwidth bottleneck for inference workloads and from a software perspective is a drop in replacement. You'd still write the code targeting the dGPU with explicit swapping to RAM but when run on an appropriate APU it would get a massive speedup for free instead of suddenly being starved for bandwidth while also performing unnecessary copy operations.
Game dev here. For anyone reading this - it’s not because we’re lazy, it’s because _it’s really hard to do_.
One of the biggest differences between the current generation consoles and the current gen PCs is unified memory.
A unified pool of memory suddenly makes that simultaneously easier, but also far more flexible, which frees up developer time and bandwidth to focus on other, more important tasks.
The problem is that when you need something in gpu you have to go through RAM first (unless you have DMA which is a more recent addition). That doesn’t just add latency it also adds an extra step of cache invalidation, so you have to plan for that from the highest level of gameplay. If you need to prepare for a GPU memory miss _and_ a CPU memory miss as a worst case all the time, it’s very hard to make good use of the bandwidth in the best case
I'm not a game developer, but there would also seem to be a link between resource usage by the engine and whatever content the production side are making. For all the commentary about how brilliant the id software engines are, if you examine the levels you pass through they're also very efficient with what they demand out of the engine - it's like an orchestra playing well together, not one instrument that means you can do anything.
That spec is also a throughput measured per second whereas our frame rates are much higher than 1/s. At 60hz, that’s now between 140 and 800 textures a frame. If you miss _one_ you don’t get that back.
A single main character in a game can be 2-5 regular textures, plus all of the extra mapping textures we have these days. Now do landscapes, environments, props, background videos, and it all adds up. 4k textures are pretty universally used. If you look at a tiny object up close we need a higher res texture to be able to show it neatly.
You also have memory pressure - raytracing makes heavy use of VRAM so you have to make the tradeoff of how much do you want to allocate to caching lighting, vs how much you want to keep textures and geo around.
Lastly, as you say, actually keeping up with 360GB/s from the CPU side is tough. If you require any transformation or CPU operations that’s just not going to happen. If you need to pull from disk, even on an NVMe drive reading synchronously, the max throughput is < 10% of that, and that assumes you are actually reading 360GB from disk. If you pause to do anything else, you’ll significantly slow that down. Players also generally don’t like it if we thrash their NVMe disks :)
Plus, DLSS can greatly reduce the bandwidth requirements for 4K gaming.
Let me put it this way: what I care about is how quickly data arrives after a bunch of shader threads request it. Throughput is one way for hardware to reduce that time. The other way is to hide the latency (GPUs do a lot to keep themselves busy while waiting for memory), but those strategies can only do so much.
Lower memory throughput almost always leads to a longer runtime of GPU calls in practice, and thus lower update rates.
I promise you we want our games to look as good as you want them to look.
LPCAMM2/SOCAMM2 exist, heck I think Framework is using LPCAMM2 in one of their new laptops.
Heck, I'm willing to bet that a lot of manufacturers would rather go that route than soldered in, if for no other reason than the relative cost of warranty work between the two.
However, people probably need to stop being obsessed with ultrathin laptops for that to happen.
I've never been able to understand this. Once we made it down to ~20 mm (which for the record still accommodates dual-stacked SO-DIMMs, a 2.5 inch bay, and a user replaceable battery but not an RJ45 jack) I don't understand what the practical impact of any further reduction is supposed to be. Regardless of how thin you make it the thing will still be a massive rectangle that you can't flex or press on.
There's very wide variation between laptops in how noticeably they'll flex or yield or creak when pressed. Laptops with a build quality that actually feels solid are far from being ubiquitous or even a majority.
Doubling the thickness of my MacBook Air would probably make it regress on that solid feeling, unless the weight was also significantly increased.
And regardless of whether current laptop form factors could accommodate a 2.5" drive, there's no use in doing so. That drive form factor is entirely obsolete for laptops and is just a waste of space and materials, and has been for about a decade.
I'm not sure why you seem to think that making something thicker would reduce the stiffness or strength. It's generally the opposite - see the concept of a torsion box. Anyway that wasn't the point. The point was that regardless of how thin you make the thing it will forever remain a cumbersome and delicate item that you have to treat with care when packing so what meaningful positive impact does shaving off those last few mm have? It's never made any sense to me.
Unified memory doesn't have to be soldered on or serviceable. That's a choice Apple made because it fit their product vision, but it's not mandatory in the slightest.
CPUs don't slot in for a reason
I would much prefer two SODIMM sockets with the option to go to 32MB shared video memory, or DDR4/DDR5. Give me OPTIONS!
So, it does not have to be soldered.
I don't know how linear or sensitive CPU and GPU benchmarks are to such a 20% slowdown, but i don't think Apple wants to pay it. And it looks like the next generation will be even closer to the SOC.
We're also hitting the limit of DDR5 here (before moving to multiplexed)
I would guess if you had LPCAMM2 located physically around the CPU (one or two on each of the 4 CPU edges) you could also reduce that latency.
If you wanted to get sleep right and improve battery life, that was the trade off.
Thought getting sleep right was something that happened before MS decided they need to be able to wake your PC any time they want and not hardware related much.
It’s possible if you’re willing to go with RAM that is much slower than GPUs like but that CPUs often use. That’s what integrated graphics laptops have done for a long time, right?
But can you get high end CPU and GPU performance with unified memory and maintain user upgradable memory in a reasonable way? That’s what I don’t know.
LPCAMM and similar solutions exist, but have never been demonstrated running at speeds that match what the leading soldered memory systems are using; there's always been some speed penalty. I'm not sure we've ever seen a system demonstrated using LPCAMM or similar for a 512-bit bus to match Apple's Max tier SoCs, so it's somewhat of an open question whether those solutions can offer upgradability at the high end of the market for unified memory systems.
LPCAMM2 supports up to 9600MT/s, which appears to be the same speed Apple is using.
> I'm not sure we've ever seen a system demonstrated using LPCAMM or similar for a 512-bit bus
Servers commonly use a 768-bit DDR5 memory bus per socket even without LPCAMM and LPCAMM allows shorter traces than traditional DIMMs. It's basically down to most existing DDR5 system boards/sockets having been designed before anyone was trying to run LLMs on consumer hardware, e.g. AM5 has a 128-bit memory bus and you're not changing that without a new socket. But every memory generation gets a new socket anyway, and the existing Threadripper Pro socket has a 512-bit memory bus as well.
Moreover, making the bus wider is "easy" -- the main problem with it is that it adds cost. Apple's least expensive machines use the same 128-bit memory bus as most PCs and the ones with the 512-bit bus cost as much as Threadripper if not more.
The difference here is in what the standard defines on paper vs what is actually shipping in products and readily available off the shelf. Who's selling a whole system with LPCAMM2 certified for 9600MT/s? Intel's current-gen Panther Lake top of the line laptop chips are rated for 9600MT/s when using soldered LPDDR5x but only 7467MT/s when using LPCAMM2, according to their current datasheet: https://www.intel.com/content/www/us/en/content-details/8721...
That puts the memory speed currently supported by Intel with LPCAMM2 at a lag of 1.5 years and counting behind Apple's shipping memory speeds. Intel's own shipping memory speed moved past 7467MT/s a few months earlier than even Apple's.
> Servers commonly use a 768-bit DDR5 memory bus per socket even without LPCAMM and LPCAMM allows shorter traces than traditional DIMMs.
> Moreover, making the bus wider is "easy"
Citations needed. Servers aren't anywhere close to 9600MT/s yet; Intel and AMD are at 6400MT/s. The trace length advantages offered by LPCAMM2 don't necessarily mean the traces for the sixth or eighth channel would be short enough for 9600MT/s (which again, is not yet available even in a 128-bit configuration in shipping hardware). Adding more channels to even a LPCAMM2 configuration means adding more trace length, because only two modules can actually be adjacent to the CPU socket. (Maybe you could get to 512-bit with modules on the front and back of the board while maintaining trace lengths short enough to reach meaningfully higher speeds than regular DDR5, but so far nobody is doing that or even talking about it.)
But for throughput, servers with 12 channels have pretty high theoretical bandwidth even with slower memory.
Does it need to be leading, though? Being median is just fine for what high-RAM systems are intended to be used for.
"Abdul Jabar, couldn't have made these prices, with a sky hook."
Actually the opposite is true. Socketed RAM can be overclocked and have its timings adjusted, while soldered RAM can't. Take two Lenovos, one soldered (Carbon X1) and one T590 with one slot: Crucial 16GB, 260-pin SODIMM, DDR4 PC4-19200. Exact same processor; the X1 has soldered DDR3 at 532.0 MHz (PC3-1066). The T590 has DDR4 PC4-19200 at 1200 MHz.
Both have a Core i7 8665U... and the T590 is much faster, with socketed ram.
If you look at eg. an Intel laptop chip, you'll see they design and build a memory PHY that can interface with either DDR5 or LPDDR5x. They don't support splitting it to have one controller operating with DDR5 and the other with LPDDR5x, for fairly obvious reasons: more complex hardware, harder for software/operating systems to manage optimally, and not a lot of benefits to drive demand and justify the expenses. The speed difference between LPDDR5x and DDR5 isn't really large enough to use LPDDR5x as an L4 cache; it would be more like two different NUMA nodes, with complications for laptop power management.
If you want somebody to build a chip with more than the usual 128-bit bus and make some of the memory controllers use LPDDR and some DDR5, then you're asking for a significant increase in chip cost due to the extra memory PHYs and pin count. That cost is only justified if almost all products using the bigger chips are going to actually take advantage of the full complement of memory controllers.
What happened to PCIe 8 and CXL?
PCIe6 is a much larger change than 'just bump up the transfer rate': the encoding changed too (on top of the new code length, it's no longer NRZ), so everyone needed to design and validate the new encoding block, negotiation, etc.
That said, I'm guessing PCIe7 will be a 'smoother' transition from PCIe6, i.e. we might see 7.0 products in 2027. That will theoretically get you ~240GB/sec on an x16 link, or a little less than the theoretical max of a current Strix Halo. (I'm guessing, however, that PCIe protocol overhead will make the difference larger.)
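For a rough sense of where that x16 number comes from, here is a sketch of the raw link-rate arithmetic. The per-lane rates (64 GT/s for 6.0, 128 GT/s for 7.0) are the published generation targets; the Strix Halo memory configuration (256-bit LPDDR5X at 8000 MT/s) is my assumption, and the function names are invented. The raw figure ignores FLIT and protocol overhead, which is why a usable number like ~240 GB/s comes out lower.

```python
def pcie_raw_gbs(gt_per_s_per_lane: float, lanes: int) -> float:
    """Raw one-direction link rate in GB/s, before FLIT/protocol overhead."""
    return gt_per_s_per_lane * lanes / 8  # one bit per transfer per lane


def memory_peak_gbs(bus_width_bits: int, transfer_rate_mts: int) -> float:
    """Theoretical peak memory bandwidth in GB/s."""
    return bus_width_bits / 8 * transfer_rate_mts / 1000


PCIE_GT_PER_LANE = {"6.0": 64, "7.0": 128}

pcie6_x16 = pcie_raw_gbs(PCIE_GT_PER_LANE["6.0"], 16)
pcie7_x16 = pcie_raw_gbs(PCIE_GT_PER_LANE["7.0"], 16)
strix_halo = memory_peak_gbs(256, 8000)  # assumed configuration

print(f"PCIe 6.0 x16 raw: {pcie6_x16:.0f} GB/s per direction")
print(f"PCIe 7.0 x16 raw: {pcie7_x16:.0f} GB/s per direction")
print(f"Strix Halo memory peak (assumed config): {strix_halo:.0f} GB/s")
```

On these assumptions the raw PCIe 7.0 x16 rate only ties the APU's local memory peak, so after overhead the link ends up a little behind it.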
Most systems barely need more gpu memory than what is required for video, browsing etc.
Just because we found a new usecase doesn't flip that on its head.
Besides, I want to keep doing what I'm doing today. So if I need 128GB today and my local AI needs 128 GB then I'd need 256 GB to keep doing the same work.
The argument rather seems to be that we shouldn't use such expensive memory on the GPU. Which might be true if you only want to do inference on it.
It is ambitious, and absurd... like all CEOs that eventually go loopy. =3
I'm honestly a little confused by what you mean here. Why would we want to maximize those things? Games are about consistent output under the frame deadline, not full saturation of the hardware.
Why would anyone try to saturate a 5090 with their game? The addressable market is tiny and you'd have to hope their full spec runs as well as or better than your test rig or they'll still not hit framerate.
You're also likely not going to maximize all of bandwidth, compute, etc. because one of them will likely be your bottleneck. And it might be different depending on the GPU, too.
Certainly not more from main memory, and maybe not more from the vram either depending on how the pipeline goes.
It's not a linear slider.
The question is ultimate shape of knowledge compression and bandwidth optimization at which we arrive I suppose.
More details: https://rocm.docs.amd.com/en/docs-7.2.0/how-to/system-optimi...
What do you mean by this? Memory bandwidth is fundamental to the speed of a local AI model.
Shared memory existed since the first CPU with an embedded GPU came to market and you could set in BIOS how much memory goes to what component.
I do have an opinion about how unified memory could be different, but I want a proper explanation.
In unified memory, all the memory is host memory and data can go from program to GPU with zero copy movements. The addresses of buffers can be shared via appropriate MMU translation support, so that the application and graphics subsystem are communicating effectively through the basic RAM cache coherency protocols over the same buffers.
Edit to add: Aside from the zero copy transfer potential, it also means dynamic allocation strategies can shift the balance between host and graphics allocations on the fly. Individual image and message buffers can be allocated on the fly instead of setting a static split between the two worlds.
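A host-side analogy for the copy versus zero-copy distinction, in plain Python. This is not GPU code and says nothing about how any driver implements it: `bytes(...)` stands in for a transfer into a separate pool, and `memoryview(...)` stands in for two parties addressing the same buffer.

```python
# A buffer the application writes into (hypothetical 8-byte "frame").
frame = bytearray(8)

# Discrete-pool model: the consumer receives its own copy of the data.
copied = bytes(frame)

# Unified-pool model: the consumer is handed a view of the same bytes.
shared = memoryview(frame)

# The producer updates the buffer after handing it over.
frame[0] = 0xFF

# The copy is stale and would need another transfer; the view sees the update.
print(copied[0], shared[0])
```

The copy has to be refreshed by moving the data again, while the shared view picks up the change without any transfer, which is the property the comment above describes.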
Unified memory is what Apple is doing, other phones do, and many low end built in GPUs have done in PCs for ages. There is only one physical memory pool. Both the CPU and GPU can access it at full speed.
This means no copying between pools of memory. No speed penalty accessing the CPU memory from GPU or vice versa. If the GPU only needs 2 GB to draw the desktop it only uses 2 GB of the pool. Or it can use 45 GB if it needs it and the CPU doesn’t. But all memory has to be the same speed, and that ain’t cheap given how fast GPUs like things. I don’t know if expandable memory is possible, and since they use the same bus they compete for bandwidth. Seems theoretically easier to program for to me.
The opposite is what’s been common in graphics cards since the 2D era. CPU and GPU have their own memory and can talk over PCI/AGP/PCI-E. This is what I think they mean by shared memory, if it’s not what’s the point in touting unified?
In this model if the GPU uses 2 GB of its 12 GB total, the other 10 isn’t available to the OS at full speed and I’m not aware of any operating systems that would use it for programs/cache by default. If the GPU needs 45 GB… too bad. You have to page things in and out of GPU memory over the much slower system bus. Starting a game means loading assets into main memory then transferring them to the GPU (newer tech can accelerate this). But the CPU can have slower memory than the GPU, saving money. Memory expansion on the CPU side is easy. And the CPU saturating its memory bus has no effect on the speed of the GPU memory bus because it’s physically separate. It’s a more complicated memory model, but it’s the one everyone is used to.
Which is better is a matter of opinion and workload needs.
> I don’t know if expandable memory is possible
It technically is. These new systems (mostly) get their high bandwidth by using more channels (wider bus) of normal RAM modules. A system that has LPCAMM2 sockets should allow using the same LPDDR5X memory but you'd need a socket per two channels. A typical PC only supports two channels so having four (two sockets) would double the bandwidth.
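The "wider bus means more bandwidth" point can be sketched as simple arithmetic: theoretical peak bandwidth is bus width in bytes times the transfer rate. The LPDDR5X-8533 speed grade and the list of bus widths below are illustrative assumptions, not specifications of any particular product.

```python
def peak_bandwidth_gb_s(bus_width_bits: int, mega_transfers_per_s: float) -> float:
    """Theoretical peak memory bandwidth in GB/s (no refresh or protocol losses)."""
    return bus_width_bits / 8 * mega_transfers_per_s / 1000


# Assumed speed grade for illustration: LPDDR5X at 8533 MT/s.
for bits in (128, 256, 512, 1024):
    print(f"{bits:>4}-bit @ 8533 MT/s: {peak_bandwidth_gb_s(bits, 8533):7.1f} GB/s")

# The DDR5-9600 on a 512-bit bus case mentioned earlier in the thread.
print(f"512-bit DDR5-9600: {peak_bandwidth_gb_s(512, 9600):.1f} GB/s")
```

Doubling the width from 128 to 256 bits doubles the peak, which matches the "four channels instead of two" claim above. The 512-bit DDR5-9600 figure lands at roughly a third of the 1.8TB/s quoted for the RTX 5090.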
I'm not sure what you mean by this. Memory bandwidth is the main bottleneck for single-user decode. The bottleneck is actually more severe for end-user inference than cloud inference, because end users don't have the option to increase arithmetic intensity by computing tokens for multiple clients in the same pass.
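A back-of-envelope sketch of why decode is bandwidth-bound: if every generated token has to stream all active weights from memory once, then bandwidth divided by weight size is an upper bound on tokens per second. The function below is a simplification under that assumption. It ignores KV cache reads and compute limits, and the batch scaling only holds until compute becomes the bottleneck, which is not modelled here.

```python
def decode_tokens_per_s_upper_bound(
    params_billion: float,
    bits_per_weight: float,
    bandwidth_gb_s: float,
    batch_size: int = 1,
) -> float:
    """Bandwidth-only upper bound on aggregate decode throughput (tokens/s)."""
    # GB of weights streamed per forward pass (dense model: all weights are active).
    weights_gb = params_billion * bits_per_weight / 8
    # One pass over the weights serves every sequence in the batch.
    return batch_size * bandwidth_gb_s / weights_gb


# A 27B dense model at 5 bits per weight, using bandwidth figures quoted in this thread.
for name, bw in (("RTX 5090", 1792), ("~273 GB/s unified memory box", 273)):
    bound = decode_tokens_per_s_upper_bound(27, 5, bw)
    print(f"{name}: <= {bound:.1f} tokens/s for a single user")

# A server batching 8 clients reads the same weights once for 8 tokens.
print(decode_tokens_per_s_upper_bound(27, 5, 1792, batch_size=8))
```

The batch parameter is the "arithmetic intensity" lever described above: a single home user is stuck at `batch_size=1`, while a cloud provider can amortise each weight read across many clients.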
One thing we've learned from Apple is the viability of spamming more LPDDR5X channels (up to 1024-bit total bus width on M3U) as a means of achieving high bandwidth while keeping the cost/capacity reasonable.
GDDR tries to push out as much bandwidth as possible, because that really matters for (traditional) GPU workloads. A constant but insignificant (= correctable) error rate is considered completely fine for GDDR, because that sacrifice allows the memory to be pushed much farther.
Meanwhile most (traditional) SDRAM workloads don't give a hoot about bandwidth but really care about latency. And ideally you want no errors, hence ECC RAM being so venerated.
If you unify memory, you're gonna have to choose to sacrifice one of those workloads or go suboptimal for both.
Weirdly enough this mostly matters for non-gaming workloads. The Apple M-series are absolute monsters in gaming, completely crushing the RTX XX90 editions in performance-per-watt, but as soon as memory bandwidth becomes paramount the M-series falls heavily behind.
The 5090 ($2k MSRP but realistically $3-3.5k) is almost the same as the RTX 6000 Pro (~$10k). Same memory bandwidth (1800GB/s). Slightly different CUDA cores (21k vs 24k). Big difference? VRAM (32GB vs 96GB).
NVidia ultimately doesn't want to upset this segmentation so the RTX Spark will never undermine their other offerings. This is why I think Apple has a real market opportunity if they choose to embrace it.
They seem to? Intel Arc is the cheapest option by far for a discrete card with 32GB VRAM.
(I still kinda want to get one tho.)
It’s like they both want to rely on market segmentation for VRAM too but fail to realize that it’s their only potential inroad right now.
Needs 320 GB Vram
The biggest advantage with NVIDIA is CUDA.
AMD is selling every MI card it makes, and the market wants more of them.
Quick background: doing AI inference requires three things. Lots of memory, lots of memory bandwidth, and of course plenty of compute that has access to that memory.
Quick reference: nVidia 5090 has 1,792 GB/sec bandwidth. 3090 gets about 1000 GB/sec. DGX Spark and AMD 395 whatever get about 275 GB/sec.
Apple M1 Max gets 400GB/sec, M5 Max gets 614GB/sec. Ultra variants get 2x that bandwidth, base variants get 1/2 that bandwidth. However... their compute is rather weak.
Right now, Apple's offerings are juuuuuust fast enough to run dense 27B models at usable speeds at like, 10% of the performance/watt of nVidia. They're world-leading general purpose CPUs but not killer GPUs.
By all accounts, these Windows PCs nVidia is touting seem to have DGX Spark like performance, which is less than impressive. Same with the upcoming AMD AI-oriented consumer stuff.
The other context here is that running your own AI at home is just starting to become feasible in terms of open model availability and the ability to run it at usable speeds. Many are interested in it for reasons of privacy, security, and cost certainty vs. buying tokens.
nVidia and AMD can't make their consumer offerings too good at AI, because that risks interfering with their higher-margin data center sales. (And, let's face it. Even if nVidia did release a 6090 with 64-128GB of memory for an affordable price, consumers wouldn't get their hands on them anyway because people would just start filling data centers with them.)
So.
Now you see Apple's opportunity, right? No data center sales to interfere with. No relationship with nVidia or AMD to worry about.
They could choose to make an absolute beast of a home AI machine. The M5 Ultra, if announced, might be that. It's admittedly a niche market, but people are already buying 64GB+ Macs faster than Apple can make them and they're fetching high prices on the used market as well.
The only real questions are if this market is even something Apple would find time to care about, and if they could secure enough DRAM to make a go at it. They are enormous obviously but they're feeling the RAM pinch just like everybody.
If there's an M5 Ultra it'll be interesting to see what they've optimized it for.
Even if a Mac isn’t the fastest in raw numbers it may be faster if it can load the whole model in its ram (went up to 512 GB before shortages) than a couple 32 GB cards could with the data having to be constantly loaded over PCI-E. Because unified memory means the Apple GPUs can access all 512 GB at full speed.
My understanding is this is the advantage that’s pushing huge Mac Studio demand. Because it was the only way to give GPUs so much memory at price points anywhere near.
Yeah you can do way better once you’re in the 5 digits. But below that Apple had a specific advantage for some.
Yes, a Mac with 128GB+ will let you load some pretty big models.
However, you're still not going to be able to run them at usable speeds. Here are some M5 Max benchmarks on a Qwen 27B model w/ 290K context.... 12 tokens/sec output.
https://www.reddit.com/r/oMLX/comments/1swztoh/m5_max_128gb_...
And that's a 27B model. So yes, a M5 Max 128GB will let you load some pretty big models - can probably fit 120B in there with room left over for context. But the M5 Max still doesn't have the compute to make it practical, at least from an interactive usage standpoint - 120B dense model is going to be like an order of magnitude slower than 27B. You have to understand the computation going on here. LLMs are basically a huge many-to-many operation, and those operations themselves are pretty heavy.
So back to my previous post... you need three things. You need fast memory, you need a lot of it, and you need GPU compute with direct access to that fast memory. The M5 Max has like, 1.5 of the 3.
The M5 Ultra (if it ever exists) could kinda hit all 3, although actually getting your hands on one will be quite the lottery ticket.
This is true, but also, people who made this investment found that they're still not very usable for those HUGE models. Don't take my word for it though. Lots of benchmarks out there. r/localllama is pretty active too. BUT you just can't compete with NVidia performance for LLM workloads (mostly inference) for two reasons:
1. The memory bandwidth just can't compete with a 5090 (1800GB/s). The best current Mac is ~900GB/s. That directly caps tokens/sec and might be manageable but there's another problem; and
2. The raw FLOPS just can't compete with even a 5090. It probably needs to natively support FP4/FP8 to at least maintain a number format parity with NVidia. But beside that, NVidia just has more raw FLOPS.
According to Google, an M5 Max does ~70 FP16 TFLOPS while a 5090 does 380. If Apple can close that gap to at least be competitive and also hold larger models in shared VRAM, that would be a competitive advantage and it would directly attack NVidia's market segmentation.
The Mac Studio last came out March last year. So we may get an update in Q3. Many are pinning their hopes on this. But it might not happen until next year. When it was released the M4 was the state of the art and it came with either the M4 Max or M3 Ultra (which, as I understand it, is basically 2 M3s stuck together, kind of). What people are hoping for is an M5 Ultra with >1000GB/s of memory bandwidth, ideally 200+ FP16 TFLOPS and hopefully FP4/FP8 support.
You can chain Mac Studios together into a cluster with TB5 too.
But it's reasonably likely that the next Mac Studio will be only incrementally better than the last generation.
These days, more like >$4.1K (at least in the US).
Increase RDMA cross-bar linking from 4x to 8x = a lotta ports, a switch, or a stacking interface.
Regular RAM size/speed scaling: 512GB -> 1TB Mac Studios. Wider RAM and RDMA paths * clocks.
Given the low power envelope of today's Mac Studios, and bandwidth limits, lots of room to scale up, if Apple chooses. My fantasy: 2x cores, 2x RAM sizes, 2x RDMA devices, 2-4x RAM & RDMA bandwidth.
My M5 Max 128gb MBP decodes faster than one of my Sparks, but the Spark's prefill is so much faster it can often answer the same query before the mac's prefill is finished. If you have large prompts, low cacheability, etc., a Spark might be a very good option.
Not to mention you can get two Sparks and the MBP will be 85%+ of the cost at half the RAM.
I'm kind of tempted to pick one up. Leave running big models to my dual dgx setup, and all the misc. random stuff on an rtx.
Seems niche to be both uncacheable and long context?
Most consumers will never really care about, let alone see, the difference in PCIe or memory bandwidth impacts from such a shift to unified memory pools. We might (being, at least in my case, a huge nerd), but I’m increasingly of the opinion that if modern blockbuster games are built for upscaling/reconstruction anyhow, then suddenly such sacrifices to performance seem acceptable relative to the gains in efficiency.
No copy unified memory will help with that but you do pay the read speed costs.
However, I couldn’t care less about faster CPU when:
1. It limits my ability to upgrade my system
2. Windows gets increasingly bloated and slower
Funny that it is getting credit only now.
O2 was popular in systems where large textures or dynamically generated textures (like mapping external video input to a texture) were important.
The ps4 was the prime example of this, and how it could run so many great games.
M1 knocking from 2020.
Game changed, past tense, six years ago. This is catch-up.
Most other SGIs had single or low double-digit megabytes of texture memory, whereas the O2 could host one gigabyte of unified memory and use a huge chunk of that for textures.
That was because unlike other GPUs at the time, O2's didn't have dedicated memory but shared the memory with CPU - way slower, but zero copies and bigger.
Arguably early home computers and workstations also used "unified memory" :D
Sorry, I meant before the M1 came out. And you and I both know that "unified memory" doesn't refer to allocating ram to the gpu for zero-swap sharing.
Vega series is 2017.
https://rocm.docs.amd.com/projects/HIP/en/docs-6.3.0/how-to/...
I don't think the M1 specifically focused on inference. Their goal was to replace Intel/AMD/Nvidia with their own chips, and since the previous Macs shipped discrete GPUs, they had to match or beat those so they don't ship something slower.
As a Rust adherent, please do not put words in our mouths or set up unrealistic expectations for other people by linking together concepts at a very shallow level.
Language level memory safety has no answer for hardware security flaws which is what side channel attacks are. No programming language can provide memory privacy if another chip in your machine can read your memory. Just like no programming language can protect your application from a kernel vulnerability of the kernel it’s running on.
I don't know who will be the winner but with some of the recent releases from gemma it seems more probable that you may run some models locally if only from a cost perspective, not even considering business security. Not sure how this type of architecture would make for good gaming though, puts into question the whole statement.
"Ranked in the top 2% of scientists globally (Stanford/Elsevier 2025) and among GitHub's top 1000 developers" - side note but this guy puts this everywhere, gives me probably the inverse of what he is marketing for.
This is the 2026 edition of Ken Olsen: "There is no reason anyone would want a computer in their home"
Digging into this:
> In conclusion, there is evidence that Ken Olsen did doubt the need for computers in the home, but the evidence is based primarily on the testimony of David Ahl who was perturbed when the personal computer project he championed at DEC was not supported by Olsen in 1974.
> Olsen’s resistance may have been similar to that expressed by another DEC executive, Gordon Bell. In 1980 Bell thought home terminals would act as gateways to remote computers which would provide appropriate services.
* https://quoteinvestigator.com/2017/09/14/home-computer/
It was supposedly said in 1977: most computers at that time were not small, and so it would not be surprising that people would not expect the general public to desire a large, power-hungry, noisy apparatus in their house.
And, like the overly large machines of 1977, models are getting faster, leaner, and better. It's happening a lot quicker, though.
Just because you can do more and more things at home (thanks Moore and Dennard), doesn't preclude needing things also done remotely. The number of at-home systems seems to have fed a growing number of remote systems (especially once always-on connectivity became ubiquitous).
It's basically the angle Apple is going for: do as much locally (for the sake of privacy), and then offload when it becomes "too much".
You already can if you’re willing to spend many thousands of dollars on a beast of a machine. I’m talking about middle tier desktops and laptops here. Maybe eventually even phones.
The only way hosted stays strongly competitive in that world is if they can keep pushing the frontier or by playing the classic social media and SaaS games of network effect building and integrations.
Many people might still use hosted, of course, but what I really mean is that their multiples won’t be justified and they will have little to no moat. AI will become commoditized, like a sophisticated next generation form of an encyclopedia with search.
People take these quotes out of context all the time. Said in a business context, there was no need, at that time, for someone to have a personal computer.
There's no business justification in 1977 for a personal computer department at a business. It's similar to the gates quote about RAM (I think it was 64KB?).
These statements aren't meant to be forever quotes. They're business plan quotes.
640, and Bill Gates said he either never said that, or at least never remembered having said it. I think there is no evidence anywhere that he did.
https://www.computerworld.com/article/1563853/the-640k-quote...
The early popularity of Minitel, the continued popularity of ssh/tmux, and the web browser itself indicates that bespoke client applications are not the only way. He wasn’t directionally wrong.
Even for a lot of LLM type tasks, 128gb is likely more than enough to control a lot of PC configuration and automation with natural language.
Nobody ever said that, at least not as an assertion or prediction. The actual instances of similar language are from multiple people describing their earlier thoughts before they learned it wasn’t true.
Local models aren’t deterministically equivalent in capabilities to foundation models. Home computers are turing complete; just like a mainframe. They are just slower. Often not slower enough to matter.
I think there’s a sweet spot currently with munging your data blindly on the server so that your client device battery still lasts all day.
Meanwhile Apple and others push on with making client side models more efficient so that eventually the server costs and complexities go away.
If asked to choose between photo editing done within 3s using cloud provider vs an average of 30s using local compute, most consumers will choose the former without hesitation.
Most users' usage is also going to fall nicely in the free tier of a typical freemium pricing model, like ChatGPT today.
People who talk endlessly about local inference have no idea about user workflows and usability.
You could run a pretty good home server on $50 of gear and yet we never saw any real adoption of OwnCloud/NextCloud style products as an alternative to Google Drive/Photos or Apple Cloud.
Why should LLM/Transformers be any different? Especially when you need a proper expensive GPU to run them instead of a Raspberry Pi?
On-device AI is going to be important, I think. It doesn't have to take the form of a chatbot UI to be useful.
Maybe if you ask them that question, but if you show them two products, they'll definitely prefer the faster one. 30 seconds is a long time to watch a progress bar.
You don't think the commercials for Google's AI photo features are going to have an impact on Apple users if their phones can only do a worse version of that feature and it takes longer?
People definitely aren't going to accept more expensive + slower ...
Very significant improvements may be viable for unattended inference via large-scale batches, which can reuse sparse experts and thereby mask some of the latency involved - this is quite unique to DeepSeek, again due to its efficient KV cache.
2. Qwen is much more demanding and borderline unusable on consumer hardware because it's a dense model. All 27B parameters are active all the time for each token. It's not a MoE architecture where a router activates only some of them.
3. Qwen doesn't like quantization at all.
Settings: RTX 5090, 5-bit weights (Unsloth), FP8 KV cache.
Last time I tried running large MoEs on this PC, they had inferior quality at 2-3 bits compared to much smaller dense models at 5-6 bits, and were slower anyway.
Qwen 27B is also small enough to completely fit in a high-end consumer or mid-end pro GPU, like an RTX 5090 or Radeon PRO R9700. I found results claiming 30 tokens per second generation for 27B(-Q4_K_XL) on an R9700. I doubt you get more than 5 tokens per second doing SSD MoE streaming.
Even for relatively short contexts, I honestly already find the ~30B class MoE models to be only borderline acceptable in terms of speed on my laptop (Ryzen 7 7840U, 64 GB LPDDR5-6400), though I use Gemma 4 26B-A4B more than Qwen3.6 35B-A3B.
If you have reasonable amounts of RAM to cache the most likely experts, that's not true at all. Qwen 27B is marginally faster on a nearly empty context, then falls behind as context length increases due to the different attention mechanisms. Prefill for Qwen is much faster, but you're still comparing vastly different model sizes and capabilities. DeepSeek Flash is the best deal overall.
> completely fit in a high-end consumer or mid-end pro GPU
Or you could fit the dense portion of a much more capable model and still take advantage of that hardware.
Is that how MoEs work? I thought that an important constraint for MoEs is that experts need to be uniformly used to make sure they can be used effectively. If there is a 'common subset', that, if anything, sounds like a symptom of undertraining (i.e. the same trick will not work as well for Deepseek V4.1).
Also, even if your MoE hitrate is 90%, you still spend half your time waiting for the SSD, giving similar total speed to a 27B model!
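A small sketch of the arithmetic behind that claim. It assumes expert reads are purely bandwidth-bound, that RAM hits and SSD misses are served one after another with no overlap, and that every expert is the same size. The bandwidth numbers are hypothetical and chosen only to show when the "half your time" figure holds.

```python
def ssd_time_fraction(hit_rate: float, ram_gb_s: float, ssd_gb_s: float) -> float:
    """Fraction of expert-loading time spent waiting on the SSD.

    Assumes serialized, bandwidth-bound reads of equally sized experts.
    """
    ram_time = hit_rate / ram_gb_s          # time per unit of data served from RAM
    ssd_time = (1 - hit_rate) / ssd_gb_s    # time per unit of data served from SSD
    return ssd_time / (ram_time + ssd_time)


# Hypothetical bandwidths: RAM exactly 9x faster than the SSD.
print(f"{ssd_time_fraction(0.90, ram_gb_s=90, ssd_gb_s=10):.2f}")

# Hypothetical unified-memory box: a wider gap makes the SSD dominate even more.
print(f"{ssd_time_fraction(0.90, ram_gb_s=270, ssd_gb_s=7):.2f}")

# A higher hit rate shrinks the SSD share again.
print(f"{ssd_time_fraction(0.99, ram_gb_s=270, ssd_gb_s=7):.2f}")
```

With a 90% hit rate, the SSD accounts for half the time exactly when RAM is nine times faster than the SSD. Any overlap between SSD reads and compute would reduce the real cost below what this serialized model predicts.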
Finally, it looks like Deepseek V4 is pretty much only runnable with antirez's ds4, and SSD streaming only works with Metal; but I would like to try what you say with llama.cpp which uses mmap to also potentially do SSD streaming. (I can maybe try the large Qwen3.5 MoEs?)
> as context length increases
What kind of context length do you consider reasonable, though? From what I know, all models (even frontier ones) start degrading once you pass a few hundred thousand tokens. So realistically, limiting context size might even improve quality, especially if you use token-efficient harnesses.
> Or you could fit the dense portion of a much more capable model and still take advantage of that hardware.
Your point about consumer hardware was that it would be "borderline unusable" when running Qwen 3.6 27B. However, you need much less hardware to run a 27B than DSv4 Flash. In addition, you can do the same 'trick' with low-end GPUs and small MoEs: my desktop with 32 GB DDR4-3200 and an RTX 2070 8GB can run the ~30B class MoEs at 20-30 tokens per second and similar speeds to my laptop.
For any given workload/session? Empirically, yes, that's what has been found across different models. There's quite a bit of predictability that makes caching helpful.
> Also, even if your MoE hitrate is 90%, you still spend half your time waiting for the SSD, giving similar total speed to a 27B model!
There are ways of masking some of that latency, though it requires some architecture-specific cleverness which is less directly applicable to a generic engine like llama.cpp.
> Finally, it looks like Deepseek V4 is pretty much only runnable with antirez's ds4, and SSD streaming only works with Metal
The llama.cpp folks are working on adding support, and the ds4 project is working on CUDA support for streaming inference, targeting the DGX Spark.
> From what I know, all models (even frontier ones) start degrading once you pass a few hundred thousand tokens.
DeepSeek V4 seems to do quite well on recall tasks even with large context. That's one plausible benefit of its compressed attention mechanism, compared to earlier models. Some degradation will likely still be there, but it's not necessarily obvious.
As for why people are calling Qwen 27B "borderline unusable" that may have to do with it being a dense model which makes for an increased compute intensity and pushes users towards discrete GPU platforms, since those tend to have the most compute overall as far as consumer hardware is concerned. I might agree that Qwen 27B is quite ideally tailored towards these platforms, but that does come with some limitations.
But yeah, the Qwen line is pretty impressive on commodity hardware.
To me, LLMs are for asking research questions + exploring design spaces + pointing at codebases to investigate bugs. And those all benefit from the model being as "smart" (in terms of both fluid intelligence and burned-in knowledge) as possible.
I'm guessing there exist problems where "intelligence past a certain point" doesn't matter, so these medium-sized models can match the performance of the bigger models. But what problems might those be?
"Go add a gh action to compile and deploy this thing and run its tests" is one I've found it's good at. Yes I know how to make a gh pipeline but it's always a hassle to remember what goes where.
Cranking out unit tests is okay. It's good at summarizing things so it's not half bad at writing jsdoc/xmldoc comments.
I have a hard time believing running a model on a laptop will be cheaper than running it in a datacenter. Why wouldn't economies of scale apply here as with every other computation?
Local may or may not be cheaper than remote now, depending on the details, but the factors you describe won't affect the math nearly as much as they will once that subsidization ends.
Qwen3.6 is practically indistinguishable from Sonnet 4.6, at least in my personal experience. And Sonnet 4.6 is not that cheap.
The vision NVIDIA is selling is pure marketing IMHO
You're going to need to analyze the problem much more deeply because it sounds like the standards you are implicitly applying would result in "economically, everything should be centrally hosted" but that is clearly not the result that obtains. Even a modern mid-grade cell phone is no slouch; you may not be running a current-gen frontier AI on it but you certainly can do a lot of other rather intense things locally that would have been laughable 10 years ago, like surprisingly high powered games.
But they also want to taste the sweet fruit of AI so the only way to do this that a CISO will approve is on local air gapped hardware. It’s a niche but still a billion dollar niche.
I suspect personal privacy and need to run AI workflows to handle the litany of administration tasks of a household will be what result in regular need for local AI.
Apple is already out front with this on a personal, individual level, but they are not obviously headed toward multiuser/family-level ~biz admin with a persistent server running local LLM.
This made me laugh. I can only imagine how insufferable this person is to deal with.
Do you think he's in mensa too?
Where you will need games to be rewritten for ARM to get full performance, just like on Apple's M series chips.
Especially on Dwarfstar.
Not everything I want to use an LLM for requires "PhD level intelligence", and increasingly I'm finding more uses that involve sharing my personal data.
Yesterday my local model helped me when looking for a doctor who is in-network for my insurance. I threw it a screenshot from the providers search results and it looked up reviews for all of them.
I own the DVDs so I'm OK upscaling/editing my own copies for my own use. But if I ran the task on an ai service I would no doubt trigger copyright issues.
anyone who's addicted to token throughput is losing the operational knowledge and offline capabilities.
if you aren't moving to the AMD 395 or Macs then you're hitching a ride on the expensive calorie ride
But watching everyone flounder because claude goes down or forcing you on API costs.
I'm programming things that'd take me days with a PC that, without OpenAI's VRAM shenanigans, would cost you $2k.
It's more than just 'this is what I could do' it's definitely about 'this is what anyone could do with a new PC purchase'.
You're doing what the IT industry has been addicted to for decades: number goes up.
No, I have a hands on experience with bigger models, and understand the advantages of using them.
You also probably believe you need to 'escape the permanent underclass'
You assume a lot. Sometimes it’s good to simply ask a question.
Lol yeah seriously, that stinks "I ask AI to generate a huge amount of bullshit and upload it to pad irrelevant stats".
Absolute loser.
As to why he now has this on his blog? I also cringe when I read it. I presume someone told him he should self-promote more, and this is his lame attempt to do so. He's almost certainly the most cited person in his department, but it's entirely possible that none of his colleagues actually know this. Cut him some slack. Self-promotion is not his strength. He's a nerd's nerd, and not a marketer. I'll mention to him that his attempt here might be backfiring when I'm next in contact with him.
He doesn't just have it on his blog, he has it EVERYWHERE. Sometimes 2 or 3 times on the same page.
It sounds like he's gotten bad advice about how to market himself /or/ this is being marketed to people who have bigger checks to write and whom he believes will be responsive to this kind of marketing. As an academic, it rubs me very wrong - I think it's detrimental to the field when we get into h-index stacking contests or citation count comparisons. But I don't know what incentives he's responding to, which seems important for putting this stuff in context.
(as an aside, it turns out that polars + fastexcel is about 10x faster than pandas + openpyxl for searching that dataset, if anyone else is curious what he was actually talking about. :)
Being the top x% is what OnlyFans girls brag about, professor...
And it's not exactly brain surgery, is it? https://www.youtube.com/watch?v=THNPmhBl-8I
Citation needed
1. Yes it has the same number of cores as a 5070 mobile. It’s also running at a shared peak of 2/3 the bandwidth and a shared peak of 2/3 the TDP. The GPU by itself will likely perform at half the dedicated units performance
2. Apple may not have SVE2 but they do have the AMX (private) and SME. I don’t see why he thinks the SVE2 will give him more performance than the SME.
3. He mentions a single core type but doesn’t mention the total makeup. We already have known for a year how the DGX Spark compares to Apple chips. For CPU it’s roughly equivalent to an M3 Pro and for GPU compute (not rasterization) it’s between an M4 Pro and M4 Max without considering bandwidth.
The real advantage to these is that they run CUDA. That’s it. Otherwise when they launch they’ll be 2-3 generations behind where Apple is and 1 gen behind AMD.
The other super power of the DGX Spark was the NIC for pairing them together. But that’s been removed here too.
You are likely thinking about token generation which is dependent on memory bandwidth where Apple has an edge. Spark's GPU compute is way higher than even M5 Max (17 FP32 TFlops), around 2x FP32 TFlops... It's literally 6144 CUDA cores like desktop 5070, slowed down by slow memory and lower TDP (29.7 vs 31 FP32 TFlops on 5070).
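The TFLOPS figures above follow from the usual peak formula: cores times 2 FLOPs per clock (one fused multiply-add) times clock speed. As a sketch, the code below works backwards from the numbers in this comment to the clock each one implies. The 2-FLOPs-per-core-per-clock factor is the standard convention for peak FP32 and is an assumption here, not something the comment states.

```python
def fp32_tflops(cuda_cores: int, clock_ghz: float) -> float:
    """Peak FP32 TFLOPS, assuming one FMA (2 FLOPs) per core per clock."""
    return cuda_cores * 2 * clock_ghz / 1000


def implied_clock_ghz(cuda_cores: int, tflops: float) -> float:
    """Clock speed that a quoted peak TFLOPS figure implies under the same assumption."""
    return tflops * 1000 / (cuda_cores * 2)


cores = 6144  # core count quoted above for both parts
for label, tflops in (("Spark-class part", 29.7), ("desktop 5070", 31.0)):
    print(f"{label}: {tflops} TFLOPS implies ~{implied_clock_ghz(cores, tflops):.2f} GHz")
```

With identical core counts, the gap between 29.7 and 31 TFLOPS comes down to roughly a 0.1 GHz difference in peak clock, which is why sustained clocks under a lower TDP matter more than the headline figure.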
I’d also mention that you’re comparing peaks which the RTX Spark won’t be hitting. The top TDP is less than that of the DGX Spark.
I just think anyone calling this a beast and a game changer is conflating/extrapolating from different form factors and constraints.
cool story, but nobody cares about mobile GPUs for blender. A 4080 eats an M5 Max alive for breakfast. The 5080 in my machine that cost me 1500€ runs circles around an M5 Max that would cost me over 6000€. And when in 5 years the 5080 isn't enough, I can upgrade it to a 7080 or whatever, which will remain compatible.
If you're a professional, soldered products like the RTX Spark or Apple's offering are a dead end. They are literally never worth it.
It’s not going to be the primary place of creation but there’s a lot of usefulness in having a portable workstation or that entire segment of the laptop workspace wouldn’t exist.
In either case, it’s beside the point because the point is talking about the compute levels of a GPU in the same form factor.
Game dev & asset work is probably happy with a 5080 and that's what most rendering/dev machines would have.
The addressable market of "i have 6000 to blow and i need meh performance on anything related to 3D rendering" is small, and benchmarks make it look bigger than it really is.
Disney’s Hyperion is CPU based and RenderMan XPU is just exiting beta after over a decade.
But while they do stack their workstations with higher end GPUs for artist throughput in viewports, it’s mostly just for the higher memory to fit unoptimized scenes in. None of the studios or major films I’ve worked on have had their on-desk artists gated by raster rate, only by memory.
But again, beside the point, because it’s still valuable as a metric when comparing perf between similar chipsets.
There are already more creatives using their consumer grade hardware to make stuff. And even the studios you mentioned do actually use laptops on the go for parts of their creation pipelines for various things like virtual production scouting etc.
Same model, same quant, same query, as close to as matched settings as I can get from vllm, and for workloads with large prompts + low cacheability, one of my sparks will often be done responding before the mbp is done with prefill.
Guy suddenly became aware of a chip that the rest of the industry long knew about, seems completely unaware of the competitors, and posts about how it's a BEAST and will be a GAME CHANGER.
Like the DGX Spark was a game changer? Eh, it has mostly been a massive disappointment. An overpriced nvidia laptop isn't going to change the equation an iota.
Qualcomm is like AMD was for GPUs for like decades. Lots of announcements and people on the Internet are huge fans based on web pages they’ve read but if you try to make it work it’s a nightmare.
Snapdragon X Elite doesn’t work on Linux so it’s a pointless platform. Enthusiasts have made M1 work better. Literally have old Macs running rather than use Qualcomm.
Whether this is true or not, it's pretty safe to assume anything based on their stuff is not for me.
It drives me nuts, I look at cumulative CPU time, and this is all my work laptop does.
But perhaps more importantly, Nvidia seems to be doing a lot better with its ecosystem. Nvidia has much better distribution channels and partners building on top of its PC gaming GPUs. It also has game developer relations that are unmatched by anyone in the industry.
Qualcomm has so far failed to execute on this, both in PC and on their server CPU side.
What's lousy about it? I use it daily and have zero problems.
Some distros still need extracting Qualcomm firmware from Windows to get Linux to work properly. Audio remains a challenge, like x86 Linux decades ago. Apparently camera stuff works these days but produces images of subpar quality.
These issues also occur on normal Linux. My experience with my Lenovo+Intel laptop was that it took three months after release for the firmware to work properly (and the Nvidia drivers took much longer, but that's my fault for buying something containing Nvidia hardware). Intel managed to do what Qualcomm did in months rather than years.
I hope Qualcomm finally sorts this shit out, I really do, but with the prices of computers these days, I'm going to need to see quite the discount before I'll consider buying anything with a Snapdragon.
This is a problem with Linux on ARM generally (Android has had it since inception), it's not a Qualcomm problem.
they seem to have dealt with this for the server hardware
My experience (wanted to use x13s as daily driver) is that there was good progress for about a year, until jhovold was leading the charge, but something expired and Qualcomm, as far as I can tell, forgot that some progress should happen on x1 and x8c as well as x2.
And I know a lot of that lies on the vendors, but it does feel unfortunate (from a standardisation/conformance/certification point of view) that Windows requiring it doesn’t make it easy to boot other OSes!
They could have had a 128core arm chip by now.
There's also the whole giant trillion dollar company doesn't want to invest and let small ideas grow. They only focus on things that move the needle, which isn't much at the size.
Had Microsoft executed and invested, they could have made a come back imo in both search, mobile & hardware. Unfortunately major lack of leadership or they just don't want those areas.
Qualcomm are trying harder now it seems. But it will take time to repair their reputation in the PC market.
Tuxedo computers tried and didn't succeed either.
I will never buy Qualcomm again. I avoid them on phones as well by just buying Apple. They do not support their hardware beyond the release.
To each their own, but I don't recall Apple ever mainlining any of their drivers on Linux. You're rightfully angry on the laptop side of things, but Apple is much worse than Qualcomm when it comes to open source support for their phones.
Qualcomm probably shouldn't have promised Linux support in the first place. Everyone seems to love Apple's hardware even though you're practically stuck with macOS. Had Qualcomm just stuck to Windows-only, they would've probably received a much better reception by the tech press.
Not really, the 1st. iteration got stuck in legal land and other delays.
https://discourse.ubuntu.com/t/ubuntu-concept-snapdragon-x-e...
Outside of anything else, Amdahl's law means that as the parallel performance grows, we become _more_ limited by the inherently serial code, and thus single core performance, not less.
Given that single core performance is "harder" (can't just throw more cores/sockets at the problem), it's also critically important.
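To put numbers on the Amdahl's law point, here is a small Python sketch. The 95% parallel fraction is an assumed value for illustration, not a measurement of any real workload. It prints the overall speedup and the share of the remaining runtime spent in serial code as the core count grows.

```python
def amdahl_speedup(parallel_fraction: float, workers: int) -> float:
    """Overall speedup when only parallel_fraction of the work scales with workers."""
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / workers)


def serial_share_of_runtime(parallel_fraction: float, workers: int) -> float:
    """Fraction of the new wall-clock time that is spent in the serial part."""
    serial_fraction = 1.0 - parallel_fraction
    return serial_fraction / (serial_fraction + parallel_fraction / workers)


ASSUMED_PARALLEL_FRACTION = 0.95  # illustrative assumption

for cores in (1, 4, 16, 64):
    speedup = amdahl_speedup(ASSUMED_PARALLEL_FRACTION, cores)
    share = serial_share_of_runtime(ASSUMED_PARALLEL_FRACTION, cores)
    print(f"{cores:>3} cores: speedup {speedup:5.2f}x, serial share {share:.0%}")
```

With these assumed numbers the serial 5% of the original work grows to most of the runtime by 64 cores, and the speedup can never exceed 20x, so from that point only faster single-core execution helps.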
Strix Halo is 16 cores. Intel Core Ultra 9 285HX is 24. Apple is 18. Qualcomm is something similar too but I can’t recall. NVIDIA is 20.
Until you get to threadripper/epyc or Xeon territories (completely different form factors and TDPs) the arm chips are ahead of x86 on both power and perf. And even when you get to those areas, arm is equivalent or outperforms them, as can be seen in the recent neoverse x3 and Vera benchmarks.
Because that's the only part where this chip excels.
People have been comparing apples with oranges for ages.
I'll wait for the 365 AI Ultimate Professional Enterprise Edition: Origins version
Technically speaking, Qualcomm acquired Nuvia, which is where this came from and that company came from ex-Apple engineers wanting to do what Apple said no for their chips.
So it's almost the same CPU design (in its origins).
Is there a desktop version ? For real work ?
https://nvidianews.nvidia.com/news/nvidia-microsoft-windows-...
I have been somewhat surprised at the lack of commentators observing that this is Microsoft and above all NVIDIA launching a device that is fundamentally at odds with the metered cloud model of AI.
When you look at the other announcements and murmurings (better offline BYOK for Copilot, talk of an unmetered AI future) I think it’s clear that these two firms understand that cloud-only AI is not sustainable or inherently in their interests. But their willingness to undermine OpenAI with a product like this is notable.
LLMs will get bigger and even with 128GB (that many won't saturate), you won't run future frontier models. For LLM vendors and integrators it's a handy thing to move lower quality inference to the consumers.
Also running local doesn't have to mean that the models have open weights. MS will likely start to distribute closed models at scale once the hardware is there.
Copilot just got proper "offline" BYOK support, didn't it? Presumably that was one of the things they were talking about. Though I imagine that has something to do with the fact that Zed has supported that properly for months.
AMD has the advantage that their x86 machines run everything, Apple maintains the whole MacOS stack, while Nvidia can barely scrape together one Ubuntu release per Jetson generation, it's beyond embarrassing. Maybe they ought to put those agents they keep droning about to some actual work on their OS support.
Why would they do more? It's an LTS distro, the Nvidia drivers are updated for as long as the hardware's compute capability is supported.
Nvidia's ARM drivers are updated constantly, and battle-tested as the backbone in hundreds of thousands of Grace ARM servers.
That's not even considering the lazy out-of-tree patchwork support Nvidia does for their products on top of that. Maybe it's different in this case for Windows since it forces a rolling release, but I seriously doubt they'll do it properly instead of forking some version and keeping it around for 10 years like absolute idiots.
For their ARM SOCs? Almost every single ARM OEM on the consumer market is begging you to use out-of-tree blobs for basic firmware support. Nvidia's stance isn't ideal but it's also not unique (or damning) to the rest of their ARM competitors.
Basically what we need is a chip that also has pins or some type of attachment system on the top (physically) or maybe below where the chip itself connects to the motherboard.
Imagine a CPU you can just plug in a block of HBM memory on top of (or on "bottom" of). This would allow a much larger physical surface area for putting ram cache near the compute cores itself because you would not be limited to edge lengths.
Cooling the whole thing would be a methodology change (might need liquid coolers that sandwich in between the ram cache and compute and cool both)
(HN reaction to Vision Pro back in 2024 is almost hilarious if not ridiculous, looking at it today. I knew it would be a flop and I was so right.)
Spark DGX also remains a nothingburger, I would be livid if I spent this kind of money and had to waste time chasing down power cap bugs or A/B/C testing each firmware version to find the one that is least slow and also does not fail https://dredyson.com/the-hidden-truth-about-dgx-spark-perfor...
It's just a personal computer. It normally runs multiple operating systems just fine.
Windows PC sounds like people talking about tech who are either paid by M$, or embed pictures into Word documents to send them.
Nobody has to kill the fun those OS-agnostic machines allow by artificially binding them to a shitty OS.
Even for personal use, I'd imagine the people dual booting Windows and something else are a very tiny minority.
Saying "Windows PC" is a pretty reasonable way to distinguish between "made by Apple" and "made by someone else" because the market of PCs that aren't made by Apple and don't come with Windows is really, really tiny.
To be honest, this seems like a strange hill to take such an aggressive stance upon.
I'm assuming it's just clarifying this isn't about Macs.
The term "PC" is ambiguous, since it can either refer to all personal computers in its original meaning, or to the IBM PC lineage that is mainly contrasted with Macs. Remember the famous "I'm a Mac, I'm a PC" ads.
When you just say "PC", people today genuinely don't know which meaning you are referring to. And "IBM PC" is antiquated, and "IBM PC clone" is even worse. So "Windows PC" is a pretty decent name.
Do you have a better suggestion? Because "Non-Mac PC" doesn't exactly roll off the tongue. If you say "Windows PC", everyone knows what you mean.
And it's not an "anal fixation", there's no need to be gratuitously insulting.
I prefer Windows XP, or even Windows Vista, to Windows 11 with its copilot. And it's been a downhill race, even macs are more of your own personal machine than Windows today, which is saying a lot.
PC should be a PC, Windows is as they advertised, a Copilot PC.
For normal people, there are three computer operating systems: Windows, Apple, and ChromeOS. Nvidia isn't going with ChromeOS and Apple hates their guts, so Windows is the only normal operating system they can market.
Their marketing makes clear that these devices aren't the piddly Chromebooks that ruined the desktop experience for so many people (expensive Chromebooks were nice, but rare in practice).
Qualcomm promised Linux support, failed to deliver, and now anybody burnt by their promise won't want to buy their hardware again. If they promise a Windows PC, people won't have reason to complain when Linux or FreeBSD or SerenityOS won't boot on there. Given Qualcomm's failures here, Nvidia is probably doing the right thing.
I did this for years. We ran Resolve color correction suites with external chassis to place multiple Nvidia GPUs in it at a fraction of the cost of the shitty TrashCanMac that was available. Lots of people continued to use the 2012 Cheese Grater MacPro with its older CPUs. The only way to get modern (at the time) compute in a Mac was to use a Hackintosh. Since it wasn't for personal use, not having things like AppStore, Messages, Music, etc wasn't a big deal, so building a Hackintosh was easier.
I built one for personal prosumer use around the time of the 1080s that allowed me more machine for the dollar than Apple offered. Once the M-series chips came out and were capable of what the Hackintosh was doing for me, I was put off building anything newer.
So, the partnership is maybe natural, but not prospective. Also, note how Linux is getting popular among gamers. Of course, it's way behind Windows, but the direction of the change is clear.
I'm convinced that Nvidia is not primarily targeting the consumer market and that the ultimate goal for its CPUs is the server space. The company invests effort where the money is, and consumer products account for only a fraction of its total revenue. Maintaining a presence in the consumer market seems more like a way to avoid a complete pivot than a strategic priority.
I run it for work because we make Windows programs. We use drivers that don't exist on Win-for-ARM yet. So to most people a "Windows PC" is an x64 Windows PC still. The risk for MS if compat isn't good enough for Windows-Arm64 is that people might as well shift from Windows entirely if they need new software and hardware anyway.
Your x86 machines were, but these are ARM SOCs. Many of them don't even support UEFI, let alone the upstream Linux kernel.
Over the last decade, a lot of software (especially the popular and industry-standard packages) shifted to GPU-accelerated designs. That push started before NVIDIA even tried to capitalize on it.
I dislike the cycle of propagating news and assuming that someone else double-checked it.
“News Summary:
- NVIDIA RTX Spark powers the world’s first Windows PCs purpose-built for personal agents, featuring 1 petaflop of AI performance, industry-leading power efficiency, full-stack NVIDIA AI and graphics technology, and up to 128GB of unified memory.
- NVIDIA and Microsoft collaborate to deliver a native Windows experience for personal agents, including new security primitives and NVIDIA OpenShell to run agents securely on primary devices.
- RTX Spark lets creators, AI developers and gamers render ultralarge 90GB+ 3D scenes, edit 12K 4:2:2 video, generate 4K AI videos, run 120B-parameter LLMs with up to 1 million tokens context using agents locally, and play AAA games at 1440p and over 100 frames per second.
- Adobe is rearchitecting Photoshop and Premiere from the ground up for RTX Spark to deliver 2x faster AI and graphics performance.
- RTX Spark-powered slim Windows laptops with all-day battery life and premium displays, as well as compact desktop PCs available this fall from ASUS, Dell, HP, Lenovo, Microsoft Surface and MSI, with models from Acer and GIGABYTE to follow.”
The last time I checked an NVIDIA part was the DGX Spark (the GB10 chip): it has regular LPDDR5X, which by JEDEC standard cannot go beyond ~270 GB/s, i.e. 8533 Mbit/s per pin on a 256-bit bus.
So yeah Lemire seems to go "OMG unified memory, they're following Apple path..." ok, but Apple pulled off a much faster interconnect, 800 GB/s ballpark, and I'm trying to understand (not really, I'm asking you to try to understand, he he) how this laptop fares in that regard.
As a side note, Qualcomm chipsets on Android have been doing this for years (like Apple), so it's not a super unique thing. It's more like there was no need before.
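For anyone who wants to check that figure, the arithmetic is just the per-pin transfer rate times the bus width in bytes. A quick Python sketch (the function name is made up for this example, and this is the theoretical peak, not what you will measure in practice):

```python
def peak_bandwidth_gb_s(transfer_rate_mt_s: float, bus_width_bits: int) -> float:
    """Theoretical peak in GB/s: transfers per second times bytes per transfer."""
    bytes_per_transfer = bus_width_bits / 8
    return transfer_rate_mt_s * 1e6 * bytes_per_transfer / 1e9


# LPDDR5X at 8533 MT/s per pin on a 256-bit bus, as described above.
print(round(peak_bandwidth_gb_s(8533, 256), 1), "GB/s")
```

That works out to about 273 GB/s, which matches the ~270 GB/s ceiling quoted above.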
[1] https://www.jeffgeerling.com/blog/2025/increasing-vram-alloc...
The GPU can still happily use all the rest of the memory for other use cases - which tend to be the bulk of allocations anyway. Though there might be performance implications - for example "moving" buffer ownership to the GPU would need to evict CPU caches, and often 4k pages and tlb lookups can be a pretty inefficient situation for GPU-style accesses.
That's been pretty standard for any SoC for decades. And "differences" to apple's SoC are more implementation details.
This isn't the first time we have UMA on the PC, btw. When SGI did their PC workstations, their 320 and 540 PC workstations had what they called Cobalt graphics chipset and crossbar with their IVC architecture. They bypassed AGP at the time completely. It was quite unique to see strict UMA on a PC. Haven't seen it since until these new systems we're seeing now on PCs and Mac.
Some software assumes pre-defined set-aside pools of memory reserved for video purposes, but the chip does actually have access to the whole pool.
That's an API issue not a hardware issue. Regardless, I believe the major APIs permit seamlessly sharing pointers at this point? (I have no experience doing that though.)
IIRC that's to maintain BIOS and Windows (+games & apps) backwards compatibility, but memory access speeds are the same.
A RTX Pro 6000 has ~24K 5th generation tensor cores, I'm guessing this would then be 1/4 of the count but 6th generation? Wasn't clear from the images.
> The memory is not as fast as dedicated GPU memory, but it is cheap enough while delivering enough bandwidth to run AI models locally.
Also "cheap while delivering enough" certainly sounds like someone is trying to temper expectations. It sounds like something sitting in-between GPU+VRAM inference and CPU+RAM one, not as a step above/besides GPU+VRAM.
If these chips become popular I am sure you will see LLM architectures taking advantage of the parallelism.
Perhaps in theory, but for the gb10 stuff the memory is all on the CPU die and connected to the GPU die via nvlink-c2c
The idea that any hardware performance increase will be eaten up by terrible software is an evergreen. A computer that could serve as the single server for a medium size enterprise 20 years ago, is no longer able to serve as a desktop for a receptionist. I'm not even sure we're talking diminishing returns anymore, we're probably past the point of maximum yield and into the negative returns at this point.
Before we get local AI, we'll be using hybrid AI.
Running big models locally is unrealistic ($$$$$) but, if you imagine an Agentic Workflow where some bits run on the cloud and other smaller tasks locally, it's an amazing deal. You don't need Opus/Code/DeepSeek/Kimi/etc to do basic stuff that models like Gemma4:12b/Qwen-27b can do locally with much less latency.
Having a laptop where I can use a remote big model and combine it with 5 local domain-specific models is something I would love to do today. Imagine using OpenCode with a small model deciding which tasks run locally: it checks whether you have a good local model for task XYZ, and otherwise uses a cloud model.
My main concern is: is this hardware powerful enough to allow quick switching between local models? Unlikely, but I hope I'm wrong.
Nvidia's master plan may be making it the new normal to have "only" 400GB/s bandwidth, thus gatekeeping local model usage further behind "more memory but not as fast as the cloud can do it"
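A minimal sketch of that kind of routing, in Python. Everything here is hypothetical: the model names, the task kinds and the token budget are placeholders, and a plain lookup table stands in for the small model that would make the decision in a real setup.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Task:
    kind: str           # e.g. "rename", "summarize", "architecture"
    prompt_tokens: int


# Hypothetical registry of task kinds that a local specialist is assumed to handle.
LOCAL_MODELS = {
    "rename": "local-small-code",
    "summarize": "local-small-general",
}
LOCAL_CONTEXT_LIMIT = 8_000  # assumed local context budget, in tokens
CLOUD_MODEL = "cloud-large"  # placeholder name for the remote model


def route(task: Task) -> str:
    """Pick a local model when one is registered and the prompt fits, else go to the cloud."""
    local_model = LOCAL_MODELS.get(task.kind)
    if local_model is not None and task.prompt_tokens <= LOCAL_CONTEXT_LIMIT:
        return local_model
    return CLOUD_MODEL


for task in (Task("rename", 500), Task("architecture", 500), Task("summarize", 50_000)):
    print(task.kind, "->", route(task))
```

The open question from the comment still applies: each local model has to be resident (or quick to load) for this to beat just calling the cloud.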
Nvidia just wants to sell stuff to everyone.
And I think for professionals doing local AI work, products like Strix Halo and Apple Silicon are a competitive threat.
A big part of maintaining the leading software ecosystem is ensuring you have competitive hardware for all your users.
I also think the RTX Spark product is relatively low effort for Nvidia. Grab a Mediatek CPU and slap an Nvidia GPU on the die. Sure, that’s oversimplifying it, but still.
It's an interesting "newcomer" and the more the better but calling this a "beast" and a "game changer" is ridiculous to say the least.
Then there is the price..
It's not that the NVidia chip has that much RAM built in, after all. It's that it can address that much. RAM is sold separately.
So I would expect the mini PCs to come in less than the sparks. Laptops I assume will be close in price with the addition of all the other laptop stuff.
[1] https://www.nvidia.com/en-us/products/rtx-spark/
It is all integrated into one monolithic “superchip” package. The 128GB of RAM isn’t going to be purchased separately or be upgradable. At least according to all indicators. Which is what I was responding to.
I've found it very useful for running big models, but it's not a screaming powerhouse in terms of raw compute.
I expect computers with this chip will be about $4000. If Microsoft can deliver on local AI models that can orchestrate Windows and have solid real world intelligence, that will be an inexpensive business purchase compared to pay as you go tokens. I'm excited to see how this plays out.
Assuming all that stuff is upstreamed (and they aren't using oddball webcam/input devices etc) it should have much better support than Qualcomm.
Fingers crossed!
While this NVIDIA system is inferior from the point of view of the memory capacity, its main advantage is that the top models will have a bigger GPU, i.e. with 6144 or 5120 FP32 execution units, compared to 2560 for the AMD GPU (compared to the NVIDIA CPU, the AMD CPU has a better multi-threaded performance for legacy programs, and a much better multi-threaded performance for the applications that use AVX-512).
However, these top models with big GPUs will also be much more expensive than the competing AMD system, while also being much more expensive than a laptop or mini-PC with an equivalent discrete NVIDIA GPU (which has the disadvantage of having direct access only to a much smaller, even if faster, memory).
The memory interface is a little faster, but the greatest improvement is +50% in the memory capacity, both over the old Strix Halo and over NVIDIA Spark.
However, even the Strix Halo CPU was better than the NVIDIA/Mediatek CPU.
NVIDIA has only the advantage (in its more expensive variants) of a GPU equivalent to an RTX 5070.
It remains to be seen what the prices of the NVIDIA Spark models with big GPUs will be, but the rumors are that they start at around $3000 and go up, with the upper limit for 128 GB DRAM and an uncut GPU still unknown.
It also remains to be seen whether the variants with the biggest GPU can use it effectively when having a rather low memory bandwidth for such a big GPU.
Nothing new here, apart from being able to use CUDA on a less power hungry system.
> Nothing new here, apart from being able to use CUDA on a less power hungry system.
CUDA has been running on ARM SOCs since the Tegra K1, 12 years ago. Nvidia is not new to ARM, nor is CUDA.
Tech companies have strangled their own market.
Up to $5000 because why not?
With that money you can build a real PC with rtx 5090!
> The memory is not as fast as dedicated GPU memory, but it is cheap enough while delivering enough bandwidth to run AI models locally.
So, the reason "dedicated GPU memory" is fast, isn't because it's "dedicated"; it's because the types of memory built into GPU cards — GDDR and HBM — are designed for throughput over latency.
Which is to say, GDDR and HBM memory could be shared with the CPU in UMA while still being "fast" (for GPU use-cases.) In fact, the PS4/5 and Xbox 360 / One X / Series consoles have UMA architectures that use GDDR memory as their main memory, with no regular DDR memory to be found.
What I don't understand: why don't we see UMA architectures where there's both regular DDR and GDDR/HBM memory mapped into the address space of the CPU+GPU? That seems like the best of both worlds: you'd have some memory that's "tuned" for random-access CPU usage (regular DDR), and some memory that's "tuned" for streaming GPU usage (GDDR/HBM), but either type of memory can still be put to the use it wasn't "tuned" for, just with slightly-worse performance.
I guess you'd need to do a bit of software work:
1. a bit of work in the OS kernel / malloc library to get CPU workloads to "prefer" allocating DDR memory over the GDDR/HBM memory until they've exhausted DDR memory (or maybe not, if you just tell the kernel the GDDR/HBM memory is something like a zswap thinpool);
2. and a bit of work in supported ML frameworks, to teach them about a hybrid strategy between UMA "allocate anywhere, it's all the same" and NUMA "keep assets in VRAM if possible; if you spill assets to RAM, then they must stream into VRAM on access" (i.e. "at allocation time, allocate as if the system were NUMA, VRAM first then spilling to RAM; but at execution time, use the UMA codepaths, no need to copy RAM into VRAM.")
...but once that's done, it's done.
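To make the "prefer one tier, spill to the other" idea concrete, here is a toy Python model of the allocation policy. It is only a sketch of the bookkeeping: the pool names, sizes and the `allocate` helper are invented for this example, and a real kernel or ML framework would do this with NUMA-style memory policies rather than a dictionary.

```python
class Pool:
    """One memory tier with a fixed capacity, tracked in GB."""

    def __init__(self, name: str, capacity_gb: float) -> None:
        self.name = name
        self.capacity_gb = capacity_gb
        self.used_gb = 0.0

    def try_alloc(self, size_gb: float) -> bool:
        if self.used_gb + size_gb <= self.capacity_gb:
            self.used_gb += size_gb
            return True
        return False


# Each consumer tries its "tuned" tier first, then spills to the other one.
PREFERENCE = {"cpu": ("ddr", "gddr"), "gpu": ("gddr", "ddr")}


def allocate(pools: dict, consumer: str, size_gb: float) -> str:
    """Return the name of the tier that satisfied the request."""
    for tier in PREFERENCE[consumer]:
        if pools[tier].try_alloc(size_gb):
            return tier
    raise MemoryError(f"no tier can fit {size_gb} GB for {consumer}")


pools = {"ddr": Pool("ddr", 32), "gddr": Pool("gddr", 16)}
print(allocate(pools, "gpu", 12))  # fits in the GPU-tuned tier
print(allocate(pools, "gpu", 8))   # GPU tier is nearly full, so this spills to DDR
print(allocate(pools, "cpu", 20))  # CPU allocations prefer DDR
```

Because both tiers sit in one address space, a spilled allocation is only slower, not inaccessible, so no copy step is modeled here.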
Nvidia going from GPU to CPU now?
https://theretroweb.com/chipsets/182
https://www.nvidia.com/en-us/drivers/uli-m6117c/
We aren't so naive as to move from a locked IP ISA like x86 to another locked IP ISA such as ARM.
Right?
A powerful new chapter for Windows PCs, accelerated by Nvidia RTX Spark
https://news.ycombinator.com/item?id=48352693
Nvidia RTX Spark
https://news.ycombinator.com/item?id=48352939
Must be a new business model.
....
Step into my office
Why ?
Because you are fucking fired
The obvious comparison here is the M5 Max where you can buy a Macbook Pro with 128GB of also unified memory. Obviously CUDA cores are specific to NVidia so it's hard to directly compare but I've seen claims that the M5 Max is roughly equivalent to ~4000 CUDA cores. This obviously depends on workload and whether the CPU supports the precision you want to use (eg FP4).
The M5 Max has memory bandwidth of 819GB/s. The RTX Spark I believe is ~600. So it might be slightly better than the current generation of Macs but likely worse than the expected M5 Ultras of the new Mac Studios (likely Q3 2026).
For comparison, a 5090 has >20k CUDA cores and 1800GB/s memory bandwidth with 32GB of VRAM. The RTX 6000 Pro (at ~$10k) has 96GB of VRAM, same bandwidth and ~24k CUDA cores.
We have to see what RTX Spark systems sell for but the DGX Spark is in the Mac Studio price range (~$4k).
I do think Apple has a real opportunity here but their offerings aren't quite there yet. The M5 Ultras might be a really attractive option for local LLMs. I expect them to be in high demand.
[1]: https://news.ycombinator.com/item?id=48352939
Who claimed that? The M5 is still a raster focused GPU, dedicated matmul blocks be damned. For some workloads that napkin math might work out, but for many others it's a wild overshoot. Time-to-first-token still favors CUDA, and real-world training workloads aren't getting anywhere near Apple Silicon.
All of the memory bandwidth in the world is useless if you spend 15 minutes processing 64k tokens worth of context prefill. This is where CUDA shines.
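For scale, the prefill figure above can be turned into a rate, and total response time split into a prefill part and a decode part. A small Python sketch, assuming "64k" means 64,000 tokens; the 500 output tokens and 50 tok/s decode rate are made-up inputs for the example, not benchmarks.

```python
def implied_prefill_rate(prompt_tokens: int, minutes: float) -> float:
    """Prompt-processing rate in tokens/s implied by a prefill duration."""
    return prompt_tokens / (minutes * 60)


def response_time_s(prompt_tokens: int, output_tokens: int,
                    prefill_tok_s: float, decode_tok_s: float) -> float:
    """Total time: prefill over the prompt plus decode over the output."""
    return prompt_tokens / prefill_tok_s + output_tokens / decode_tok_s


rate = implied_prefill_rate(64_000, 15)
print(round(rate, 1), "tok/s prefill")

# Hypothetical decode numbers, for illustration only.
total = response_time_s(64_000, 500, prefill_tok_s=rate, decode_tok_s=50.0)
print(round(total), "s total, of which", round(500 / 50.0), "s is decode")
```

At that prefill rate the decode speed barely matters for long prompts, which is the point being made about bandwidth alone not being enough.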
Looking at it more, I believe the story repeats with the TSMC processes used for the CPU vs chips like GB200 as well.
Even if none of the above were the case, the question still isn't "why not make the enterprise GPU" it's "why not make the higher margin per chip area product". If the NV1/GB10 take less die space and cost a lot it's not immediately apparent the enterprise GPU actually nets Nvidia more $ per die or not. That's why it's relevant these will be sold at a premium.
And maybe for NVIDIA and MS it is also about them quietly betting that local models are, in fact, going to be good enough for most tasks pretty soon.
I'd say this relates directly to the cost of running AI models remotely.
And we won't know what the actual cost will be until AI vendors recover the huge pile of cash they've dumped into development (plus interest).
The hardware for 50 tokens per second with a four bit quantisation of Gemma 4 26B or the sparse Qwen 3.6 is not really that expensive: it’s a secondhand M1 Max.
Beyond that, I agree. I think moving planning tasks to local is a now thing, not that it really has much impact on token spend. I also think many small coding tasks are fully within the grasp of the above two models.
The main issue right now is that the software landscape is rather confusing, but I reckon uncomplicated Gemma 4 26B QAT support with MTP is a few weeks away.
But most businesses don't really care about most of the apple --- they only need their special bite out of it.
For example, doctors mainly care about medicine. Nvidia is attempting to provide the hardware needed for local, specialized models.
But I don’t know about specialised: this could run quite large models with MoE.
Running local models will stay niche for a while, unless we see breakthroughs
Most doctors don't care much about engineering or accounting or software development or 10000 other things that big vendor models address.
This area is yet to be really explored. Nvidia aims to provide the hardware to do so.
I'm not sure anyone really understands why.
The author is probably confusing RAG with pretraining. You can RAG on PubMed but you can't arrive at a competitive model by pretraining solely on it.
Nvidia is milking the market now. We need more competition again - currently we have a mafia controlling the prices, not just Nvidia but all the AI companies. The price increases should be paid by them, not by us. The "free market" is being manipulated by them here.
Windows 11 can run just fine on 8GB of memory; what can't is Google Chrome.
Decent single core (a long way from Apple level, but decent), but it makes up for it in cores to provide M5 level performance, CPU wise. On memory bandwidth it is kind of starved, at 1/6th that of many GPUs.
They got Microsoft to customize Windows for the RTX Spark, and will likely have to brutally throttle it when running as a laptop (it's literally a 140W TDP chip), and that's neat. It's going to be a very expensive laptop.
DGX Spark has a maximum of 273 GB/s bandwidth in ideal scenarios (hard to reach)
That puts it between an M5 (153) and M5 Pro (307)
Mind you that's not to/from memory, which indeed only has 273 GB/s.
Perhaps a sobering rule of thumb: if it was actually useful, you couldn't buy them because someone would scoop them all up to shove them in a DC and make money with it.
Bill Gates had a quote some years ago...
People have still not learned how fast we improve our tech and how much cheaper things get, I guess :)
Clip me :). You are currently living through the final stages of unrestricted computing in the hands of the 'public'. Our regimes are going to pull up the drawbridge in the name of 'safety'. Download the open models asap and prepare for an airgapped computing environment. That will be your frontier in not extremely neutered AI in the near future.
I am so hoping I'm completely wrong on this btw.