Large language models systematically favor agreement over correctness, a bias rooted in how RLHF training rewards responses people prefer rather than responses that are true. This technical deep-dive explores why.