pyentropy 8 hours ago [-]
Take a look at the harmony repo which specifies the internal OpenAI format - the effort level is specified in the context after the <|start|> tag - https://github.com/openai/harmony
Note that inference libs also have parsers that put hard limits on reasoning tokens with separate counters (similar to how you can put a limit on token generation per completion versus waiting for an <eos>). For that, take a look at vllm reasoning docs.
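A minimal sketch of what such a separate reasoning counter could look like. This is not vLLM's actual parser: the tag strings, the `next_token` callable, and `toy_model` are all hypothetical stand-ins, and tokens are plain strings to keep it readable.

```python
from typing import Callable, List

# Hypothetical tag/token strings; real models use their own special tokens.
THINK_OPEN = "<think>"
THINK_CLOSE = "</think>"
EOS = "<eos>"


def generate(next_token: Callable[[List[str]], str],
             max_reasoning: int, max_total: int) -> List[str]:
    """Decode with two budgets: one for reasoning tokens, one for everything."""
    out: List[str] = []
    in_think = False
    reasoning_used = 0
    while len(out) < max_total:          # overall cap, like a max-tokens limit
        tok = next_token(out)
        if tok == EOS:
            break
        # Reasoning budget spent: override the model and close the block early.
        if in_think and tok != THINK_CLOSE and reasoning_used >= max_reasoning:
            tok = THINK_CLOSE
        out.append(tok)
        if tok == THINK_OPEN:
            in_think = True
        elif tok == THINK_CLOSE:
            in_think = False
        elif in_think:
            reasoning_used += 1          # only tokens inside the block count
    return out


def toy_model(context: List[str]) -> str:
    """Stand-in model that would keep 'thinking' forever if nobody stopped it."""
    if not context:
        return THINK_OPEN
    if THINK_CLOSE not in context:
        return "step"
    if context[-1] == THINK_CLOSE:
        return "answer"
    return EOS


capped = generate(toy_model, max_reasoning=3, max_total=50)
cut_off = generate(toy_model, max_reasoning=100, max_total=4)
```

The point is only that the reasoning cap and the overall cap are tracked independently: the first run is forced out of the think block and still answers, while the second hits the overall limit mid-thought.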
I think you have the right answer, but I'm struggling to understand: does changing the effort change the prompt at the start of the conversation? Why design it this way at all? Why not just add a parameter at the end or something? At least that wouldn't break the cache.
Maybe something like this: append a hidden suffix to the conversation asking it to think more, e.g.
conversation....
Hey please help
[think more]
pyentropy 2 hours ago [-]
I'm considering the possibility that it's good to break the prefix and cache because the LLM itself was rewarded (during post-training) with different prefixes/system prompts, each containing reasoning traces of the correct size.
I might be very, very wrong, though. LLMs disagree with me: they insist that the cache is preserved and that the system message doesn't have to change when the effort level changes across turns (even though it often contains the effort level in context), and that all you have to do is tell the inference lib that parses think tags to early-close any that run too long.
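To make the cache question concrete, here is a toy comparison, assuming a prefix cache that can only reuse the longest exactly matching token prefix. The whitespace "tokenizer", the `effort=` system line, and the `[think more]` suffix are all made up for illustration; it says nothing about what any provider actually does.

```python
from typing import List


def tokenize(parts: List[str]) -> List[str]:
    """Toy tokenizer: whitespace split over the joined messages."""
    return " ".join(parts).split()


def shared_prefix_len(a: List[str], b: List[str]) -> int:
    """Number of leading tokens two sequences have in common."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


history = ["user: hi", "assistant: hello", "user: please help"]

# Variant A: effort lives in the system message at the start.
low = tokenize(["system: effort=low"] + history)
high = tokenize(["system: effort=high"] + history)
reuse_prefix_variant = shared_prefix_len(low, high)

# Variant B: fixed system message, effort appended as a hidden suffix.
base = tokenize(["system: helpful"] + history)
with_suffix = tokenize(["system: helpful"] + history + ["[think more]"])
reuse_suffix_variant = shared_prefix_len(base, with_suffix)
```

In this toy setup, switching effort in the system message leaves only the first token reusable, while the suffix variant reuses the whole earlier conversation. Whether the training-side benefit of a prefix outweighs that is exactly the open question here.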
aabdi 9 hours ago [-]
Different models do slight variants.
Usually it's done in post-training to enforce behavior based on the prompt, i.e. a system prompt with thinking:max or low or whatever.
Enforcement then goes via constrained decoding: checking for the think start and end tokens with max lengths, or other variations.
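As a rough sketch of how the two halves could fit together: the effort level is written into the system prompt (the part the model is trained to respond to), and the same level picks a hard cap for the decoder to enforce. The `thinking:<level>` line follows the wording above; the level names, the base prompt, and the budget numbers are invented for illustration.

```python
from typing import Tuple

# Hypothetical per-level caps on think-block length, in tokens.
THINK_BUDGETS = {"low": 256, "medium": 2048, "high": 8192, "max": 32768}


def plan_request(effort: str,
                 base: str = "You are a helpful assistant.") -> Tuple[str, int]:
    """Return (system prompt, max think tokens) for a given effort level."""
    if effort not in THINK_BUDGETS:
        raise ValueError(f"unknown effort level: {effort!r}")
    # Soft control: the model was (by assumption) post-trained on this marker.
    system_prompt = f"{base}\nthinking:{effort}"
    # Hard control: the decoder closes the think block past this length.
    return system_prompt, THINK_BUDGETS[effort]


prompt, budget = plan_request("low")
```

The prompt marker only nudges the model toward shorter or longer traces; the cap is the backstop for when it ignores the nudge.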
sometimelurker 7 hours ago [-]
They use multi-token prediction behind the scenes, and that might interact with the CoT in a strange way. Maybe they have different MTP models for different thinking modes? If so, that's interesting.
pyentropy 7 hours ago [-]
The number of tokens you predict at a time (multi or not) has nothing to do with whether the model wants to emit no, some, or a lot of reasoning tokens inside the reasoning tags -- similar to how branch prediction will not really change a for loop's iteration count.
__patchbit__ 10 hours ago [-]
At a guess, it may be tied to the token length of the context window. Down-selecting is consistent with the warning message, forcing a cutoff to the context window, with "cache" being used as a synonym for it. Increasing the headroom for more "thinking" should allow the implementation to access more resources without warning about the cache breaking.
bjourne 5 hours ago [-]
LLMs work by generating the most likely continuation of a prompt. But they can also generate multiple likely continuations. This creates multiple branches, which in turn can generate even more branches. The LLM can then evaluate the branches, prune the unpromising ones, and merge the best ones. More branches means more tokens, which means more effort.
simianwords 5 hours ago [-]
this has nothing to do with the thinking effort however
bjourne 4 hours ago [-]
Yes, it does. Breadth of search is exactly what the effort setting controls.
https://docs.vllm.ai/en/latest/features/reasoning_outputs/#a...
https://developers.openai.com/api/docs/guides/reasoning
See https://developers.openai.com/cookbook/articles/openai-harmo... and src/openai/types/shared/reasoning_effort.py