Large Language Models (LLMs) are widely deployed in cloud environments but face serious risks to model confidentiality and data privacy. Trusted Execution Environments (TEEs) provide confidential computing through strong isolation and data encryption, yet most TEEs cannot meet the heavy computation and memory demands of LLM inference. In this work, we evaluate lightweight LLMs in TEE-enabled environments using Intel Trust Domain Extensions (TDX). We analyze the memory usage of LLMs to guide private-memory allocation and compare tokens per second (TPS) across CPU-only, CPU–GPU hybrid, and TEE-based settings. We also apply quantization techniques to further accelerate inference in the TDX environment. Our results show that for lightweight LLMs (up to 7B parameters), TPS on TDX is up to 4x higher than on CPU. In addition, INT4 quantization provides up to 3x higher throughput and reduces storage by approximately 70% compared with FP16.