Real-time semantic understanding for drone swarms demands robust multi-object perception and high-level reasoning under tight size, weight, power, and bandwidth constraints. Existing approaches either offload computation to the cloud or to ground stations, gaining rich semantics at the cost of latency and fragile links, or rely on lightweight on-board trackers that provide only low-level bounding boxes and trajectories. This paper bridges that gap with an edge-native pixels-to-semantics pipeline that combines a streaming FPGA front-end with an on-device large language model (LLM). A near-sensor FPGA accelerator on a Zynq UltraScale+ MPSoC performs real-time multi-object tracking and emits a compact, structured state encoding stable IDs, kinematic summaries, and interaction cues; a fine-tuned, quantized LLM running on the on-board ARM cores consumes this state to infer swarm-level behaviors, inter-object relations, and potential risks, without processing raw pixels or relying on network connectivity. On a KV260 platform, the complete pipeline achieves an end-to-end visual latency of 66.7~ms per frame (about 15 frames per second) at $320\times240$ resolution, operates at approximately 2.9~W, and uses less than 2\% of the device's LUT/FF resources and a single BRAM block. We compare our system against hybrid and full-offloading architectures in terms of latency, bandwidth, and power, and against representative FPGA-based visual front-ends in terms of hardware footprint. Finally, we evaluate the on-board LLM against an online ChatGPT 5.1 Auto baseline on a real multi-UAV rounding sequence: the self-deployed model predicts the corrected object count and mission type without error, whereas the general-purpose baseline exhibits substantial errors. Together, these results demonstrate a practical low-SWaP solution for semantic scene understanding in drone swarms.
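
To make the pixels-to-semantics hand-off concrete, the sketch below illustrates one plausible shape for the compact per-object state the FPGA front-end could emit and how it might be serialized into a short textual prompt for the on-board LLM. This is an illustrative assumption rather than the paper's actual encoding: the names \texttt{TrackState}, \texttt{serialize\_state}, and \texttt{neighbors}, and the prompt layout, are hypothetical.

\begin{verbatim}
# Hypothetical sketch (not the paper's interface): one possible representation
# of the FPGA tracker's compact state and its serialization into a short,
# token-efficient prompt for the on-board LLM. Field names are illustrative.
from dataclasses import dataclass, field
from typing import List


@dataclass
class TrackState:
    """Per-object summary emitted by the FPGA tracker (assumed fields)."""
    track_id: int          # stable ID maintained across frames
    x: float               # image-plane position (pixels)
    y: float
    vx: float              # kinematic summary: velocity estimate (pixels/frame)
    vy: float
    neighbors: List[int] = field(default_factory=list)  # interaction cue: nearby IDs


def serialize_state(frame_idx: int, tracks: List[TrackState]) -> str:
    """Render the structured state as one compact text line per tracked object."""
    lines = [f"frame {frame_idx}, {len(tracks)} objects:"]
    for t in tracks:
        lines.append(
            f"id={t.track_id} pos=({t.x:.0f},{t.y:.0f}) "
            f"vel=({t.vx:+.1f},{t.vy:+.1f}) near={t.neighbors}"
        )
    return "\n".join(lines)


if __name__ == "__main__":
    demo = [
        TrackState(3, 112.0, 87.0, +1.4, -0.2, neighbors=[5]),
        TrackState(5, 130.0, 91.0, +1.1, -0.1, neighbors=[3]),
    ]
    # The resulting text, not raw pixels, is what a quantized LLM would consume.
    print(serialize_state(frame_idx=42, tracks=demo))
\end{verbatim}

Under this assumed format, bandwidth scales with the number of tracked objects rather than with image resolution, which is consistent with the abstract's claim that the LLM reasons over structured state instead of raw pixels.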