[rl] Using JobConfig as the centralized config system for inference and simple GRPO #2191

wwwjn · 2026-01-02T16:59:35Z

Stack from ghstack (oldest at bottom):

Add job_config.py to extend current JobConfig. Now an issue is trainer's config and generator's config are not symmetric, eg Parallelism and Generation.parallelism
Use job config system as the centralized / source-of-truth config, loading config from run_configs/qwen3_0.6b.toml file.
Refactor the generator to use EngineArgs() and LLMEngine(), instead of LLM()
Rename simple_rl_multiprocess -> simple_grpo to be more descriptive
Clean up unused code branch

Test: (trainer ddp = 2, n_generator =1)

Following-up refactors:

Refactor2: vllm model register - using setup.py and plugin instead of import
Refactor3: Weight updater, by directly passing state_dict (DTensor) between trainer and generator
Refactor4: Use torchtitan Trainer, modularize each component

[ghstack-poisoned]

allenwang28

I like this direction, thanks! Mostly nits here

allenwang28 · 2026-01-02T22:05:21Z

torchtitan/experiments/rl/unified/README.md

 ```
-Right now we only support VLLM_COMPAT mode, which could achieve trainer and generator bitwise identical. We are working on support UNIFIED mode,
-which uses a unified model definition for trainer and generator.
+We uses a unified model definition for trainer and generator, which could achieve trainer and generator bitwise identical.


Suggested change

We uses a unified model definition for trainer and generator, which could achieve trainer and generator bitwise identical.

We use a unified model definition for the trainer and generator, ensuring bitwise-identical models to address a class of subtle correctness bugs in RL for LLMs.

nit

allenwang28 · 2026-01-02T22:18:59Z

torchtitan/experiments/rl/unified/actors/generator.py

-            math_reward_function if use_real_dataset else trivial_reward_function
-        )
+        # Reward function. TODO: Use a real reward function
+        self.reward_fn = trivial_reward_function


I guess the RL job definition would need to define a callable here, is this the idea for later?

allenwang28 · 2026-01-02T22:24:11Z

torchtitan/experiments/rl/unified/actors/generator.py

        sampling_params = SamplingParams(
            temperature=temperature,
            max_tokens=max_new_tokens,
-            n=n_samples_per_prompt,


n_samples_per_prompt fulfills the same purpose as group_size I assume. I see below that we're preferring to submit a prompt multiple times instead of relying on vLLM. Is this due to batch invariance or something else? I'd assume that letting vLLM do it is better performance wise

allenwang28 · 2026-01-02T22:28:19Z

torchtitan/experiments/rl/unified/infer.py

-        type=int,
-        default=1,
-        help="Number of GPUs for tensor parallelism (default: 1 for single GPU)",
+def infer():


Suggested change

def infer():

def generate():

nit, but infer() makes me think of like getting the logits

allenwang28 · 2026-01-02T22:30:19Z

torchtitan/experiments/rl/unified/simple_grpo.py

    # Create process meshes
-    trainer_mesh = this_host().spawn_procs(per_host={"gpus": 2})
+    trainer_mesh = this_host().spawn_procs(
+        per_host={"gpus": trainer_ddp_size * trainer_tp_size}


in the future we should deduce the total number of GPUs needed for a given trainer parallelism

config sys v1

5bc3c98

[ghstack-poisoned]

pytorch-bot bot added the ciflow/8gpu label Jan 2, 2026

meta-cla bot added the CLA Signed This label is managed by the Meta Open Source bot. label Jan 2, 2026

This was referenced Jan 2, 2026

config sys for simple_grpo v1 #2192

Closed

config sys for simple_grpo v2 #2193

Closed

[rl] refactor model registery #2194

Open

wwwjn closed this Jan 2, 2026

Update on "config sys v1"

9727ee8

[ghstack-poisoned]

wwwjn reopened this Jan 2, 2026

wwwjn mentioned this pull request Jan 2, 2026

[rl] Using JobConfig as the centralized config system for inference and simple GRPO #2188

Closed

wwwjn changed the title ~~config sys v1~~ [rl] Using JobConfig as the centralized config system for inference and simple GRPO Jan 2, 2026

allenwang28 reviewed Jan 2, 2026

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

[rl] Using JobConfig as the centralized config system for inference and simple GRPO #2191

[rl] Using JobConfig as the centralized config system for inference and simple GRPO #2191

Uh oh!

wwwjn commented Jan 2, 2026 •

edited

Loading

Uh oh!

allenwang28 left a comment

Uh oh!

allenwang28 Jan 2, 2026

Uh oh!

allenwang28 Jan 2, 2026

Uh oh!

allenwang28 Jan 2, 2026

Uh oh!

allenwang28 Jan 2, 2026

Uh oh!

allenwang28 Jan 2, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

	We uses a unified model definition for trainer and generator, which could achieve trainer and generator bitwise identical.
	We use a unified model definition for the trainer and generator, ensuring bitwise-identical models to address a class of subtle correctness bugs in RL for LLMs.

[rl] Using JobConfig as the centralized config system for inference and simple GRPO #2191

Are you sure you want to change the base?

[rl] Using JobConfig as the centralized config system for inference and simple GRPO #2191

Uh oh!

Conversation

wwwjn commented Jan 2, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

allenwang28 left a comment

Choose a reason for hiding this comment

Uh oh!

allenwang28 Jan 2, 2026

Choose a reason for hiding this comment

Uh oh!

allenwang28 Jan 2, 2026

Choose a reason for hiding this comment

Uh oh!

allenwang28 Jan 2, 2026

Choose a reason for hiding this comment

Uh oh!

allenwang28 Jan 2, 2026

Choose a reason for hiding this comment

Uh oh!

allenwang28 Jan 2, 2026

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

wwwjn commented Jan 2, 2026 •

edited

Loading