Put in a phase-shift interpolation system and run a phase alignment and calibration routine.
Think of a string of inverters with a mux to pick the delay.
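
A rough behavioural sketch of that calibration loop, in Python rather than RTL. It assumes each mux tap selects the output after one more inverter stage, and that something like a phase detector can report the error for a given tap. The function names, the per-stage delays, and the picosecond units are all made up for illustration; real silicon would measure the error instead of computing it.

```python
def tap_delays(stage_delays_ps):
    """Cumulative delay seen at each mux tap (tap k = after k+1 stages)."""
    total = 0.0
    delays = []
    for d in stage_delays_ps:
        total += d
        delays.append(total)
    return delays


def phase_error_ps(delay_ps, target_ps, period_ps):
    """Signed phase error, wrapped into [-period/2, period/2)."""
    err = (delay_ps - target_ps) % period_ps
    if err >= period_ps / 2:
        err -= period_ps
    return err


def calibrate(stage_delays_ps, target_ps, period_ps):
    """Sweep every tap and return (best_tap, residual_error_ps)."""
    delays = tap_delays(stage_delays_ps)
    best = min(
        range(len(delays)),
        key=lambda k: abs(phase_error_ps(delays[k], target_ps, period_ps)),
    )
    return best, phase_error_ps(delays[best], target_ps, period_ps)


if __name__ == "__main__":
    # Hypothetical per-stage delays with some transistor mismatch baked in.
    stages = [20, 22, 19, 21, 20, 23, 18, 21]
    tap, err = calibrate(stages, target_ps=100, period_ps=1000)
    print(f"select tap {tap}, residual error {err} ps")
```

The point of sweeping rather than computing the tap up front is that the routine absorbs whatever the stage delays turned out to be on that particular die, so you do not need to know the mismatch in advance.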
Or synchronize using a phase-shifted clock and then re-sync in phase.
Trying to adjust metal lines to compensate for variance in the transistors will come back to bite you.
Serpentine path balancing is a possible improvement as well, and yes, you can fit it if you want (willing to wager?)