The rising complexity of large-scale heterogeneous architectures, such as those composed of off-the-shelf processors coupled with fixed-function logic, poses challenges for traditional simulation methodologies. While prior work has explored trace-based simulation techniques that offer good trade-offs between simulation accuracy and speed, most such proposals are limited to simulating chip multiprocessors (CMPs) with up to hundreds of threads. There remains a need for a framework that can flexibly and accurately model different heterogeneous systems and that scales to a larger number of cores. We present HetSim, a trace-driven, synchronization- and dependency-aware framework for fast and accurate pre-silicon performance and power estimation of heterogeneous systems with up to thousands of cores. HetSim operates in four stages: compilation, emulation, trace generation, and trace replay. Given (i) a specification file, (ii) a multi-threaded implementation of the target application, and (iii) an architectural and power model of the target hardware, HetSim generates performance and power estimates with no further user intervention. HetSim distinguishes itself from existing approaches by emulating target hardware functionality as software primitives. HetSim is packaged with primitives that are commonplace across many accelerator designs, and the framework can easily be extended to support custom primitives. We demonstrate the utility of HetSim through design-space exploration on two recent target architectures: (i) a reconfigurable many-core accelerator, and (ii) a heterogeneous, domain-specific accelerator. Overall, HetSim demonstrates simulation time speedups of 3.2$\times$-10.4$\times$ (average 5.0$\times$) over gem5 in syscall emulation mode, with average deviations in simulated time and power consumption of 15.1% and 10.9%, respectively.
HetSim is also validated against silicon for the second target, estimating performance with an average deviation of 25.5%.