Due to the growing complexity of Integrated Circuits (ICs), automating Hardware Description Language (HDL) code generation is becoming increasingly important. Although large language models (LLMs) have become highly proficient at generating computer programs, they have had limited success in producing efficient VHDL code, largely because no suitable VHDL training set exists for these models. In this paper, we present a VHDL dataset built from 356 GitHub repositories comprising 39,000 VHDL files. We systematically preprocess each file to extract its key structural components (libraries, packages, entities, architectures, components, and process blocks) as well as inter-file dependencies derived from unit-under-test (UUT) references, entity declarations, and component instantiations, using VHDL-specific heuristics and regular expressions. This yields rich contextual information essential for fine-tuning and evaluating LLMs on VHDL code generation. Furthermore, to enable structured evaluation with VHDLBench, we introduce a module-masking procedure that selectively removes a key module (an entity, architecture, component, or specific code segment), creating paired samples: a masked code segment and its corresponding removed snippet. These pairs allow users to assess two key capabilities in LLMs: Code Structure Learning (CSL), which tests a model's ability to generate coherent in-file structures, and Masked Module Completion (MMC), which evaluates how well the model infers missing modules and captures inter-file dependencies.
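As a rough illustration of the extraction and masking steps described above, the following Python sketch uses simplified regular expressions to pull entity and architecture declarations from a VHDL file and to produce a (masked file, removed snippet) pair. The patterns, function names, and mask token here are our assumptions for illustration, not the actual preprocessing code.

```python
import re

# Simplified, assumed patterns for VHDL entity and architecture blocks;
# real VHDL parsing needs more robust heuristics (comments, strings, etc.).
ENTITY_RE = re.compile(
    r"entity\s+(\w+)\s+is.*?end(?:\s+entity)?(?:\s+\1)?\s*;",
    re.IGNORECASE | re.DOTALL,
)
ARCH_RE = re.compile(
    r"architecture\s+(\w+)\s+of\s+(\w+)\s+is.*?end(?:\s+architecture)?(?:\s+\1)?\s*;",
    re.IGNORECASE | re.DOTALL,
)

def extract_modules(vhdl_source: str) -> dict:
    """Return entity names and (architecture, entity) pairs found in a file."""
    entities = [m.group(1) for m in ENTITY_RE.finditer(vhdl_source)]
    architectures = [(m.group(1), m.group(2)) for m in ARCH_RE.finditer(vhdl_source)]
    return {"entities": entities, "architectures": architectures}

def mask_entity(vhdl_source: str, name: str, token: str = "-- <MASK>") -> tuple:
    """Replace one entity declaration with a mask token.

    Returns (masked_source, removed_snippet), i.e. one paired sample.
    """
    for m in ENTITY_RE.finditer(vhdl_source):
        if m.group(1).lower() == name.lower():
            snippet = m.group(0)
            masked = vhdl_source[:m.start()] + token + vhdl_source[m.end():]
            return masked, snippet
    raise ValueError(f"entity {name!r} not found")

# Minimal example file exercising both extraction and masking.
sample = """
library ieee;
use ieee.std_logic_1164.all;

entity counter is
  port (clk : in std_logic; q : out std_logic);
end entity counter;

architecture rtl of counter is
begin
  q <= clk;
end architecture rtl;
"""

info = extract_modules(sample)
masked, snippet = mask_entity(sample, "counter")
```

In this sketch, `masked` would serve as the model input and `snippet` as the reference completion for a Masked Module Completion evaluation.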
By providing a comprehensive VHDL dataset and structured evaluation metrics, this work lays the groundwork for automating VHDL code generation. The rich contextual information derived from our preprocessing approach offers a valuable resource for fine-tuning LLMs for hardware design applications. We believe this work will contribute significantly to advancing the efficiency and accuracy of VHDL code generation, paving the way for more streamlined and scalable Integrated Circuit development.