- Move qwen3.5-397b-int4-autoround.yaml to recipes/4x-spark-cluster/
  per maintainer request (multi-node recipes live in a separate directory)
- Add PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to recipe env
- Optimize Ray for GB10 UMA (128 GB of memory shared between CPU and GPU):
  - Disable the Ray dashboard (saves ~1.2 GiB per node)
  - Limit the Ray object store to 1 GiB (the default is 30% of RAM, about 33 GiB)
  - Disable pre-started idle workers (saves ~8 GiB on the head node)
  - Set --num-cpus 2 and --disable-usage-stats on all nodes
  - Net effect: 40+ GiB freed across the 4-node cluster for model weights and KV cache
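
For reference, the settings above could map onto `ray start` invocations roughly as in the sketch below. This is an illustration under assumptions, not the recipe's actual launch script: the head address `head-node:6379` and the variable names are hypothetical, the dashboard flag is shown on the head node only, and the setting that disables pre-started idle workers is left out because this message does not name it. The script only assembles and prints the commands; it does not start Ray.

```shell
# Sketch only: assembles the `ray start` commands and prints them; it does not launch Ray.

# Allocator setting added to the recipe env (value taken from this change).
export PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True

# --object-store-memory takes a size in bytes; 1 GiB = 1024^3 bytes.
OBJECT_STORE_BYTES=$((1024 * 1024 * 1024))

# Hypothetical head address; replace with the cluster's real head node.
HEAD_ADDR="${HEAD_ADDR:-head-node:6379}"

# Flags applied on every node.
COMMON_ARGS="--num-cpus=2 --object-store-memory=${OBJECT_STORE_BYTES} --disable-usage-stats"

# Head node: also turn the dashboard off.
HEAD_CMD="ray start --head --port=6379 --include-dashboard=false ${COMMON_ARGS}"

# Worker nodes: join the head with the same resource limits.
WORKER_CMD="ray start --address=${HEAD_ADDR} ${COMMON_ARGS}"

echo "head:   ${HEAD_CMD}"
echo "worker: ${WORKER_CMD}"
```

Keeping the shared flags in one variable makes it harder for the head and worker nodes to drift apart when the limits are tuned later.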