Computing-in-memory (CIM) architectures have successfully enhanced convolutional neural network (CNN) performance, but automating the design of high-performance CIM-based transformer accelerators remains challenging. In particular, the joint design space of hardware configuration and mapping is extremely large due to the complex model structure and data flow. To address this problem, we propose Harmony, a hardware and mapping co-exploration framework that optimizes hybrid CIM-based vision transformer accelerators. We define a universal design-space representation for implementing vision transformers on CIM-based accelerators that supports hybrid and heterogeneous features; the corresponding design space comprises the hardware configuration of CIM macros and their spatial mapping scheme. Furthermore, we propose a knowledge-guided grid search (KGGS) algorithm and an improved genetic algorithm (IGA) to boost exploration efficiency. The orthogonal experiment and dominance analysis in KGGS yield exploration probabilities for the different parameters and ensure its stability, while the order crossover and swap mutation in IGA preserve relative order, avoiding legalization passes during iteration. Performance experiments show that Harmony achieves, on average, a 48% area reduction, a 13% latency reduction, a 32% energy reduction, and 1.27x higher energy efficiency compared with the baseline. The accuracy experiment demonstrates that our hybrid architecture achieves a favorable trade-off between accuracy and performance compared with all-SRAM CIM-based accelerators.
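The legalization-free property of the IGA operators can be illustrated with a minimal sketch. The abstract does not specify the exact operator details, so the following uses the standard order crossover (OX) and a simple swap mutation on permutation-encoded mappings; the function names and the mutation rate are illustrative assumptions. The key point is that both operators produce valid permutations by construction, so no repair step is needed between iterations.

```python
import random

def order_crossover(parent_a, parent_b):
    """Order crossover (OX) sketch: copy a random slice from parent_a,
    then fill the remaining positions with parent_b's genes in their
    relative order. The child is always a valid permutation, so no
    legalization/repair pass is required afterwards."""
    n = len(parent_a)
    i, j = sorted(random.sample(range(n), 2))
    child = [None] * n
    child[i:j + 1] = parent_a[i:j + 1]              # inherit the slice
    taken = set(child[i:j + 1])
    fill = [g for g in parent_b if g not in taken]  # relative order kept
    pos = 0
    for k in range(n):
        if child[k] is None:
            child[k] = fill[pos]
            pos += 1
    return child

def swap_mutation(individual, rate=0.1):
    """Swap mutation sketch: exchange two positions with a given
    probability; a permutation stays a permutation, so the mapping
    remains legal by construction."""
    child = list(individual)
    if random.random() < rate:
        i, j = random.sample(range(len(child)), 2)
        child[i], child[j] = child[j], child[i]
    return child
```

In a mapping context, each gene would index a layer-to-macro assignment; because offspring never duplicate or drop genes, every candidate produced during the search remains a legal spatial mapping.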