Key Takeaways
- For organizations using AI/ML technologies, it is crucial to systematically track the carbon footprint of the ML lifecycle and to implement best practices in the model development and deployment stages.
- Tracking these energy demands is challenging: there are no standardized methods for calculating energy consumption, and accurately measuring AI's carbon footprint is complex.
- Emissions can be classified into two types: operational emissions, which cover the energy cost of running training and inference plus the supporting ML hardware; and lifecycle emissions, which additionally include the embodied carbon emitted during the manufacturing of all components involved, from chips to data center buildings.
- Best practices to build a sustainable ML lifecycle include prioritizing efficient model selection, optimizing models to reduce complexity, choosing efficient hardware (CPU, GPU, or NPU), and weighing cloud hosting against on-premise infrastructure.
- There are open-source tools like CodeCarbon and MLCarbon to track and reduce energy consumption. Cloud platforms such as Google Cloud Platform (GCP) and Amazon Web Services (AWS) enable sustainability in AI workloads by offering tools to minimize carbon footprints.
Introduction
In the last few years, the AI adoption rate has increased dramatically, leading to complex and highly compute-intensive AI/ML systems. Organizations constantly update their ML infrastructure to support model training and deployment, which consumes vast amounts of energy and resources and prompts data centers to upgrade their facilities to meet these escalating demands. Big tech companies such as Google, Microsoft, and Amazon are even exploring nuclear energy as a potential way to power their AI infrastructure, but that remains a longer-term prospect.
In current ML systems, the difficulty lies in accurately measuring and tracking energy requirements throughout the model lifecycle. It is therefore crucial to systematically and continuously track the carbon footprint of the ML lifecycle and to implement best practices in the model development and deployment stages. By striking a balance between performance and energy efficiency throughout the ML process, researchers and practitioners can contribute to more sustainable innovation in AI.
However, tracking the energy demands is not easy, because it comes with several challenges:
- Lack of carbon footprint awareness in AI systems.
- Lack of standardized methods to calculate energy consumption.
- Accurately measuring AI's carbon footprint is complex, making it hard to compare different models and track progress.
- Energy efficiency is routinely traded off in favor of performance.
- Shifting to green energy efficient sources is hampered by geographic and technical limitations.
- Performance and cost efficiency are the primary key performance indicators (KPIs) for software development. Energy is not.
LLMs have become more complex and capable over the last few years, and they have also grown exponentially in size. The advent of transformer-based and Mixture of Experts (MoE) architectures has further accelerated this growth. By 2022, large language models like GPT-3 had reached 175 billion parameters, and growth accelerated further with models like GPT-4, which have even larger architectures. This trend continues with newer models.
Figure 1: Model and training dataset sizes over time. Image source: Epoch AI, ‘Data on Notable AI Models’, published online at epoch.ai. Accessed 1 May 2025.
The figure illustrates how AI models have grown in size and complexity in the span of just three to four years. Table 1 shows the steady growth in model size for a few representative models.
Model Type | Parameters |
---|---|
GPT-1 | 114 Million |
GPT-2 | 1.5 Billion |
GPT-3 | 175 Billion |
GPT-4 | 1 Trillion |
Llama 2 | 70 Billion (largest variant) |
Llama 3.1 | 405 Billion (largest variant) |
Table 1: Model and its Parameters
Many natural language processing tasks have seen large performance gains thanks to these larger models, but the models need far more computational resources, memory, and energy to train and run, raising concerns about their environmental impact.
Looking deeper into the model itself, the terms floating point operations (FLOPs) and parameters help us correlate model size with the compute required. While FLOPs measure a model's computational cost, parameters represent the model's size and capacity. As models grow to mammoth sizes, FLOPs and parameter counts typically increase, leading to higher accuracy but also higher energy consumption during training and inference. This underlines the importance of balancing model performance and efficiency.
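As a rough illustration, a common rule of thumb (not a figure from this article) is that transformer training compute is approximately 6 × parameters × training tokens. A few lines of Python show how quickly compute scales with model size; the GPT-2 token count here is an assumption for illustration:

# Back-of-the-envelope training compute using the common
# ~6 * parameters * training_tokens approximation for transformers.
def training_flops(parameters: float, tokens: float) -> float:
    return 6 * parameters * tokens

for name, params, tokens in [
    ("GPT-2 (1.5B params, ~40B tokens assumed)", 1.5e9, 40e9),
    ("GPT-3 (175B params, 300B tokens)", 175e9, 300e9),
]:
    flops = training_flops(params, tokens)
    print(f"{name}: ~{flops:.2e} FLOPs (~{flops / 1e21:.1f} zettaFLOPs)")

For GPT-3, this simple estimate (~315 zettaFLOPs) lands close to the 314 total zettaFLOPs listed for GPT-3 in the MLCarbon database example later in this article.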
Researchers are exploring ways to create more efficient models that can achieve similar or better performance with fewer parameters by incorporating techniques such as model compression, knowledge distillation, and the development of more sophisticated architectures that can more effectively leverage smaller parameter counts.
Carbon footprint of ML lifecycle
The stages of the ML lifecycle are data processing, training, and inference. Data processing involves tasks like data collection and feature engineering, which are computationally lighter than training and inference. Training involves intensive computation to build accurate models, which are then deployed and used repeatedly for inference across many applications; some models run 24x7, which drives overall energy demand up. The common assumption is that training, with its massive data and resource requirements, must be the more energy-intensive phase, but that is not the case.
Compared to training, inference accounts for a more significant portion of energy consumption because its cumulative energy use across billions of user queries adds up quickly. Big tech companies have reported that the majority of ML energy use comes from the inference stage rather than from training large language models.
Emissions related to ML can be classified into two types.
Operational emissions refer to the energy cost of running training and inference, as well as the supporting ML hardware, including data center overhead like cooling. Lifecycle emissions include, in addition to operational emissions, the embodied carbon emitted during the manufacturing of all components involved, from chips to data center buildings. Lifecycle emissions in particular are very difficult to quantify and address.
Software engineers have more visibility into operational emissions than lifecycle emissions, and simply raising awareness of and monitoring operational emissions already helps sustainability significantly. There are open-source tools and cloud platform features to track and reduce energy consumption. AI/ML engineers can focus on optimizing model size, selecting efficient hardware, and choosing energy-efficient data centers in locations with access to renewable energy; informed decisions from model building through deployment and maintenance can significantly reduce energy consumption.
How to measure the carbon footprint of your code
Several tools and frameworks exist with unique features and use cases for measuring the carbon footprint of ML models. In this article, we’ll look into the following two frameworks:
CodeCarbon is an open-source lightweight Python library that estimates emissions during ML model training. It tracks energy used by local machines and cloud environments. It supports integration with ML frameworks like PyTorch, Keras, and Scikit-learn. CodeCarbon allows you to track multiple training runs, compare different model sizes, and monitor GPU vs. CPU emissions across different regions.
The tool can be installed using the following command. The setup instructions are also mentioned on the CodeCarbon GitHub repo:
pip install codecarbon
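For a minimal quick start, CodeCarbon also exposes a decorator-style API; the sketch below is based on its documented track_emissions decorator, with the project name and dummy workload chosen only for illustration:

from codecarbon import track_emissions

@track_emissions(project_name="demo_training")  # results are logged to emissions.csv
def train():
    # Placeholder for your actual training code
    return sum(i * i for i in range(10_000_000))

train()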
Let’s look at an example of how varying dataset sizes and model complexities impact the emissions.
I trained three Random Forest classifiers of varying complexities using scikit-learn and measured their carbon emissions using CodeCarbon. The experiments were conducted on an Apple M1 Pro processor with 10 CPU cores. To simulate different computational loads, I created three scenarios with increasing complexity. Each model was trained using scikit-learn's RandomForestClassifier with varying n_estimators (number of trees) and dataset sizes. The emissions were tracked using CodeCarbon's EmissionsTracker, which measured the energy consumption during model training and converted it to CO2-equivalent emissions. The results are shown in Table 2 and Figure 2, and the code is available as a notebook script on GitHub.
Instantiate an EmissionsTracker object and call its start() and stop() methods around the machine learning code whose emissions you want to track.
import time

import pandas as pd
from codecarbon import EmissionsTracker
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split


def compare_model_complexities():
    scenarios = [
        {
            'name': 'Basic RF (100 trees)',
            'n_samples': 10000,
            'n_estimators': 100,
            'max_depth': 5
        },
        {
            'name': 'Intermediate RF (500 trees)',
            'n_samples': 50000,
            'n_estimators': 500,
            'max_depth': 10
        },
        {
            'name': 'Complex RF (1000 trees)',
            'n_samples': 100000,
            'n_estimators': 1000,
            'max_depth': None
        }
    ]
    results = []
    for scenario in scenarios:
        # Generate a synthetic dataset for this scenario
        X, y = make_classification(
            n_samples=scenario['n_samples'],
            n_features=50,
            n_informative=40,
            random_state=42
        )
        X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
        # Configure a tracker for this run
        tracker = EmissionsTracker(
            project_name=f"rf_{scenario['name']}",
            output_dir="emissions",
            log_level="warning",
            measure_power_secs=15
        )
        tracker.start()  # start emissions tracking
        start_time = time.time()
        model = RandomForestClassifier(
            n_estimators=scenario['n_estimators'],
            max_depth=scenario['max_depth'],
            random_state=42
        )
        model.fit(X_train, y_train)
        accuracy = model.score(X_test, y_test)
        duration = time.time() - start_time
        emissions = tracker.stop()  # stop tracking; returns kg CO2eq
        results.append({
            'scenario': scenario['name'],
            'samples': scenario['n_samples'],
            'n_trees': scenario['n_estimators'],
            'duration': duration,
            'emissions': emissions,
            'accuracy': accuracy,
            'emissions_per_sample': emissions / scenario['n_samples']
        })
    return pd.DataFrame(results)


# Run comparison
results_df = compare_model_complexities()
To analyze the results, load the CSV file that CodeCarbon writes and print it with pandas:
results_df = pd.read_csv('emissions/emissions.csv')
print("Emissions details:", results_df)
CodeCarbon has a built-in logger that appends a row to a CSV file named emissions.csv in the configured output directory (the project root by default) for each run. The outputs can also be sent to a monitoring platform such as Prometheus or an observability platform such as LogFire for better tracking.
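If you run the comparison several times, the per-run rows can also be aggregated programmatically. The sketch below assumes CodeCarbon's default CSV columns (project_name, duration, emissions, energy_consumed); verify the column names against the version you have installed:

import pandas as pd

runs = pd.read_csv('emissions/emissions.csv')
summary = (
    runs.groupby('project_name')[['duration', 'emissions', 'energy_consumed']]
        .mean()
        .sort_values('emissions')
)
print(summary)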
scenario | samples | n_trees | duration (s) | emissions (kg CO2eq) | accuracy |
---|---|---|---|---|---|
Basic RF (100 trees) | 10000 | 100 | 1.527904 | 0.000002 | 0.851 |
Intermediate RF (500 trees) | 50000 | 500 | 74.29512 | 0.000103 | 0.9295 |
Complex RF (1000 trees) | 100000 | 1000 | 566.5934 | 0.000637 | 0.96075 |
Table 2: Model Complexity vs Performance vs Emissions for RF configurations
Figure 2: Model Complexity vs Performance vs Emissions for RF configurations.
- A simple Random Forest with 100 trees trained on 10k samples emitted a negligible 0.000002 kg CO2eq with a short run time of 1.53 seconds, delivering an accuracy of 85.1%.
- An intermediate RF with 500 trees trained on 50k samples emitted 0.000103 kg CO2eq with a longer runtime of 74.29 seconds, delivering an accuracy of 92.95%. Here, we see a noticeable increase in emissions correlating with higher computational demands.
- A complex Random Forest with 1000 trees trained on 100k samples emitted 0.000637 kg CO2eq with a much longer runtime of 566.59 seconds, delivering an accuracy of 96.08%. This reflects the increased computational cost of the larger sample size and additional trees, which translates directly into higher emissions.
The results demonstrate that while model accuracy improves with complexity, the environmental cost increases dramatically. The intermediate model represents the best balance of accuracy versus emissions, showing a substantial improvement in accuracy for a moderate increase in emissions. The findings suggest that practitioners should carefully consider whether marginal improvements in model performance justify the associated environmental cost.
Common Issues & Solutions
The overall setup is straightforward, and a structured approach delivers actionable insights for optimizing the environmental impact of ML workflows. Still, while setting up and running CodeCarbon, you may encounter a few common issues:
- "Another instance of codecarbon running". This lock file error can be resolved by implementing automatic cleanup. Use cleanup_tracker() before starting to ensure a clean tracking environment and clean up between runs.
- "No emissions data recorded". This error is due to an insufficient computation load. Ensure ML workloads are measurable with a minimum recommended training time greater than 5 minutes.
- Sometimes the wait to see emissions data is long because, by default, the tracker only writes emissions data when it is stopped. For longer runs, you can save intermediate data by calling the flush() method, as sketched below.
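A minimal sketch of that pattern follows; the epoch count and train_one_epoch() placeholder are illustrative only:

import time
from codecarbon import EmissionsTracker

def train_one_epoch():
    time.sleep(1)  # placeholder for a real training step

tracker = EmissionsTracker(project_name="long_training", output_dir="emissions")
tracker.start()
try:
    for epoch in range(20):        # placeholder training loop
        train_one_epoch()
        if epoch % 5 == 0:
            tracker.flush()        # persist intermediate emissions data
finally:
    tracker.stop()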
CodeCarbon effectively tracks operational emissions generated during code execution, including energy consumed by CPUs, GPUs, and RAM, and works well for smaller ML models. However, it is not currently well-suited for large-scale LLM emissions tracking: it cannot be extended to dense or MoE LLMs because it disregards critical model architectural parameters.
MLCarbon, an open-source carbon footprint modeling tool, considers architectural parameters and is the most comprehensive framework for LLMs. It supports the end-to-end phases of an LLM's lifecycle: training, inference, experimentation, and storage. This includes operational carbon emissions (from energy use during operation) and embodied carbon emissions (from hardware manufacturing and infrastructure).
Let's review how MLCarbon can be used with an example and then discuss its key results.
Example Scenario:
Imagine a machine learning or a data science engineer in the early stages of designing a new large language model for text summarization. They are considering using a Transformer-based architecture and are debating between two potential model sizes: one with approximately 30 billion parameters and another, more computationally intensive, with around 100 billion parameters. They plan to train these models on a large text corpus and have access to a cloud computing platform offering NVIDIA A100 GPUs in a data center with a reported Power Usage Effectiveness (PUE) of 1.15 and an estimated carbon intensity of 0.35 tCO2eq/MWh. They anticipate a training duration of 30 days for the smaller model and 60 days for the larger one, using a certain number of GPUs.
The engineer would need to gather the following information to use MLCarbon to estimate the operational carbon footprint before commencing training. Simply clone the GitHub repo, update the database.csv file with the relevant information, and execute the llmcarbon_tutorial.py script to generate the emissions data.
Parameters for database.csv file include:
- LLM Architecture: Transformer-based (though MLCarbon might allow for direct input of parameter count).
- Parameter Count: 30 billion (for the first scenario) and 100 billion (for the second scenario).
- Training Dataset Size (in tokens): An estimate based on their corpus.
- Hardware Configuration: NVIDIA A100 GPUs, and the planned number of GPUs for each model size.
- Training Duration: 30 days for the 30B parameter model, 60 days for the 100B parameter model.
- Data Center Specification: PUE of 1.15 and carbon intensity of 0.35 tCO2eq/MWh.
For the GPT-3 dense model, the corresponding row in the database.csv file looks like this:
GPT3,dense,175,,300,0.429,1.1,V100,300,330,125,24.6,0.197,10000,314,14.8,552.1,1287
Table 3 below provides a clear and concise breakdown of the fields and their corresponding meanings.
Key | Explanation |
---|---|
name | Model Name |
type | Model type, "dense" (vs "MoE" for mixture of experts) |
parameter # (B) | Model parameters in billions, sourced from model papers |
token # (B) | Training tokens in billions, source details in papers |
CO2eq/KWh | Grid data for training location |
PUE | Power Usage Effectiveness (PUE), sourced from data center specifications or cloud provider reports |
computing device | Training infrastructure details, such as GPU/TPU type |
device TDP (W) | Thermal Design Power in watts, based on hardware documentation |
avg. system power (W) | Average system power consumption in watts, based on GPU specifications |
peak TFLOPs/s | Peak theoretical FLOPS achieved, sourced from training logs |
achieved TFLOPs/s | Actual FLOPS achieved during training, from training logs |
hardware efficiency | Calculated as achieved_TFLOPs / peak_TFLOPs |
device # | Number of GPUs/TPUs used, sourced from training logs |
total zettaFLOPs | Total computational effort in zettaFLOPs, based on training infrastructure details |
training days | Total training duration in days, sourced from technical papers |
actual tCO2eq | Actual carbon emissions in tons |
Table 3: Model configuration keys in database file
MLCarbon then uses its internal models (the FLOP model, hardware efficiency model, and operational carbon model) to estimate the energy consumption (in MWh) and the resulting operational carbon emissions (in tCO2eq) for both potential model configurations. Procuring and saving this information should be automated so engineers can easily access it and gain insight into the carbon footprint before the training or inference phases; if gathering the data requires too much manual effort, measuring the carbon footprint will not become part of the process.
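For intuition about the arithmetic behind such estimates, here is a simplified back-of-the-envelope sketch, not MLCarbon's actual model: operational footprint ≈ device count × average power × training hours × PUE × grid carbon intensity. The GPU counts and average power per GPU below are assumptions chosen only to illustrate the example scenario:

def operational_tco2eq(num_gpus, avg_power_w, days, pue, tco2eq_per_mwh):
    # Convert total watt-hours to MWh, then apply the grid carbon intensity
    energy_mwh = num_gpus * avg_power_w * 24 * days * pue / 1e6
    return energy_mwh, energy_mwh * tco2eq_per_mwh

# Assumed configurations for the 30B and 100B scenarios (illustrative only)
for label, gpus, days in [("30B model", 256, 30), ("100B model", 512, 60)]:
    mwh, tco2 = operational_tco2eq(gpus, avg_power_w=400, days=days,
                                   pue=1.15, tco2eq_per_mwh=0.35)
    print(f"{label}: ~{mwh:.0f} MWh, ~{tco2:.0f} tCO2eq")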
The results report a delta value, defined as (predicted_value / actual_value) − 1. Negative values indicate under-prediction of emissions, positive values indicate over-prediction, and values closer to 0 indicate better accuracy. For GPT-3, the delta is 0.00225, the best result with the smallest error.
While tools like CodeCarbon are most valuable for post hoc analysis of environmental impact in ML, tools like MLCarbon address the gap in preemptive evaluation during the model selection phase, allowing engineers to estimate the carbon footprint of different model architectures and hardware configurations before resource-intensive training begins. Even with such predictive capabilities, however, further reductions in emissions can be achieved by actively strategizing model selection and optimization and by leveraging energy-efficient cloud resources.
Best practices to build a sustainable ML lifecycle
To further reduce emissions and enhance sustainability in machine learning, a holistic approach that combines thoughtful model selection, continuous model optimization, and the strategic use of energy-efficient cloud computing is crucial. AI/ML practitioners should be aware of the strategies below to reduce operational emissions and build sustainable ML systems.
1. Prioritizing Efficient Model Selection
The choice of the initial model architecture significantly impacts energy consumption and carbon emissions. As highlighted in the paper "The Carbon Footprint of Machine Learning Training Will Plateau, Then Shrink", selecting efficient ML model architectures, such as sparse models rather than dense ones, can reduce computation by approximately 5 to 10 times.
Before choosing a closed-source or open-source LLM, engineers should ask: can the given problem be solved with a less compute- or resource-intensive model? For example, consider DistilBERT over BERT, and GPT-3.5/4 Turbo over GPT-3/4. DistilBERT is a compressed version of BERT that retains 97% of its performance with 40% fewer parameters and 60% faster inference, making it more energy efficient. Turbo models use fewer resources per inference; they are cost-effective and suitable for high-volume tasks where efficiency is critical.
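As a quick illustration of the size difference, here is a sketch using the Hugging Face transformers library (downloading the checkpoints requires network access):

from transformers import AutoModel

for checkpoint in ["bert-base-uncased", "distilbert-base-uncased"]:
    model = AutoModel.from_pretrained(checkpoint)
    params = sum(p.numel() for p in model.parameters())
    print(f"{checkpoint}: {params / 1e6:.0f}M parameters")
# bert-base-uncased has roughly 110M parameters; distilbert-base-uncased roughly 66M.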
2. Model optimizations to reduce complexity
Reducing the computation needed for a given model while maintaining its performance is a significant challenge. Table 4 summarizes common optimization techniques and their energy benefits.
First, consider building task-specific models. Task-specific models focus on a single domain-specific application rather than serving as general-purpose models. This domain focus makes them very efficient and great for reducing latency and overall compute costs.
Using low-compute models, e.g., Small Language Models (SLMs), is another optimization strategy that minimizes resource consumption while maintaining performance for specific tasks. This approach is particularly useful in environments with limited computational capacity, such as edge devices. SLMs are language models built on transformer architectures with far fewer parameters; as a result, they often require less training data and fewer computational resources, reducing training times and costs. They are well suited to resource-constrained environments like edge devices, yet can still handle multiple tasks within their domain.
Other techniques, such as fine-tuning architectures and pruning, can lead to more efficient models without compromising performance. Fine-tuning involves adjusting pre-trained models to specific tasks and enhancing performance while reducing the need for extensive training from scratch. It can restore accuracy and robustness even after model compression and pruning.
Pruning removes unimportant parameters and connections from a model, reducing its size and computational requirements and improving inference speed. When done carefully, pruning can maintain or even improve model efficiency without significant loss in accuracy. For example, a pruned model might retain 90% of its original accuracy while using only 10% of its original parameters. The trade-off lies in finding the right balance between model size reduction and performance preservation; a pruned model may require fine-tuning to regain some of the lost accuracy.
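A minimal sketch of unstructured magnitude pruning with PyTorch's built-in utilities (the layer size and 90% pruning ratio are arbitrary examples) looks like this:

import torch
import torch.nn.utils.prune as prune

layer = torch.nn.Linear(512, 256)

# Zero out the 90% of weights with the smallest L1 magnitude
prune.l1_unstructured(layer, name="weight", amount=0.9)

# Make the pruning permanent by removing the reparameterization
prune.remove(layer, "weight")

sparsity = (layer.weight == 0).float().mean().item()
print(f"Weight sparsity: {sparsity:.0%}")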
Quantization is another technique that reduces computational load and energy consumption. It lowers the precision of the model weights (for example, from 32-bit floats to 8-bit integers) without significantly affecting model performance.
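Below is a brief sketch of post-training dynamic quantization with PyTorch, applied to a toy two-layer model; real deployments would quantize a trained network and validate accuracy afterwards:

import torch

model = torch.nn.Sequential(
    torch.nn.Linear(512, 256),
    torch.nn.ReLU(),
    torch.nn.Linear(256, 10),
)

# Convert Linear layers to use int8 weights at inference time
quantized = torch.quantization.quantize_dynamic(
    model, {torch.nn.Linear}, dtype=torch.qint8
)

x = torch.randn(1, 512)
print(quantized(x).shape)  # same output shape, smaller and cheaper weights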
Optimization Techniques and Energy Benefits
Optimization Technique | Type | Energy Benefits | Use Cases |
---|---|---|---|
Task-Specific Models | Model Design | Reduced training time and computational load | Text summarization, recommendation systems |
Low-Compute Power Models | Model Design | Efficient on edge devices, lower energy usage | TinyML applications, IoT devices |
Fine-Tuning Architectures | Model Optimization | Improved performance with minimal retraining | Transfer learning, domain-specific tasks |
Pruning | Model Optimization | Reduced model size and computational overhead | Deep learning models for deployment |
Quantization | Model Optimization | Lower precision reduces computational load | Edge AI, large-scale inference |
HBM (High Bandwidth Memory) | Hardware Optimization | Faster Tensor I/O, reduced energy consumption | Training/inference acceleration |
Table 4: Optimization Techniques and Energy benefits
3. Choosing efficient hardware
Choosing energy-efficient hardware, such as TPUs or GPUs optimized for deep learning workloads rather than CPUs, can help reduce energy consumption without compromising the ML system's performance. Table 5 compares the hardware types (CPU, GPU, and NPU) used in AI workloads, focusing on their functions, processing capabilities, performance levels, and energy usage. The comparison shows the importance of choosing hardware that balances performance with energy usage, aiding sustainability efforts in AI development. Note that while GPUs have high overall energy consumption, they are more energy-efficient than CPUs for certain AI workloads, such as training.
Hardware type | CPU | GPU | NPU |
---|---|---|---|
Function | General Purpose Computing | High Performance Computing | AI Inference Acceleration |
Processing | Serial | Parallel | Massive Parallelism for Neural Network Operations |
Performance | Moderate | High | Very High |
Energy Usage | High | Very High | Low |
Table 5: Hardware Type and Efficiency
4. On-Premise Vs. Cloud
Leveraging cloud computing generally improves data center energy efficiency compared to on-premise solutions, because cloud providers run custom warehouse-scale facilities designed for better PUE and carbon-free energy (CFE). It does, however, require users to choose data center locations with cleaner energy mixes: pick the region with the highest CFE or lowest carbon intensity when building new applications, and combine that with optimizing model efficiency.
Cloud platforms such as Google Cloud Platform (GCP) and Amazon Web Services (AWS) enable sustainability in AI workloads by offering tools to minimize carbon footprints. GCP allows users to select low-carbon regions based on metrics like CFE percentage and grid carbon intensity, with regions like Montréal and Finland achieving near 100% CFE. AWS reduces the carbon footprint of AI workloads by optimizing infrastructure, transitioning to renewable energy, and leveraging purpose-built silicon chips, achieving up to 99% carbon reductions compared to on-premises setups. Both platforms help organizations align AI operations with sustainability goals. However, many engineers and organizations overlook the CFE and PUE metrics these cloud platforms provide when choosing regions; sustainability metrics are not always front-of-mind, and users often prioritize performance, cost, and business metrics instead.
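As a simple illustration of making CFE part of region selection, the sketch below ranks candidate regions by CFE percentage. The region names are real GCP regions, but the CFE and carbon-intensity numbers are placeholders; use the current figures your cloud provider publishes:

# Hypothetical sustainability data per region; replace with your provider's
# published CFE% and grid carbon intensity (gCO2eq/kWh) figures.
regions = {
    "northamerica-northeast1 (Montréal)": {"cfe_pct": 100, "gco2_kwh": 5},
    "europe-north1 (Finland)":            {"cfe_pct": 95,  "gco2_kwh": 20},
    "us-central1 (Iowa)":                 {"cfe_pct": 80,  "gco2_kwh": 120},
}

best_region, best_metrics = max(regions.items(), key=lambda kv: kv[1]["cfe_pct"])
print(f"Lowest-carbon candidate region: {best_region} ({best_metrics['cfe_pct']}% CFE)")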
The Google article, "The Carbon Footprint of Machine Learning Training Will Plateau, Then Shrink", demonstrates that by strategically applying the 4Ms (Model, Machine, Mechanization, and Map) best practices, energy usage can be reduced by up to 100x and CO2 emissions by up to 1000x in machine learning training.
The 4Ms are best practices proposed by Google that significantly reduce the environmental impact of the machine learning training and development process. First, Model focuses on selecting efficient ML model architectures, like sparse models, which can decrease computation by 5 to 10 times. Second, Machine involves using processors optimized for ML, such as TPUs or recent GPUs, which can improve performance per Watt by factors of 2 to 5 compared to general-purpose processors. Third, Mechanization refers to using cloud computing, which offers better data center energy efficiency and can reduce energy costs by 1.4 to 2 times compared to on-premise data centers. Lastly, Map emphasizes choosing cloud computing locations with cleaner energy sources, potentially reducing the gross carbon footprint by 5 to 10 times. Applying all four practices together can substantially reduce energy consumption and CO2 emissions, as the sketch below illustrates.
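To see how the individual factors compound into the headline numbers, here is the arithmetic using the upper-bound improvement factors quoted above; actual savings depend on the workload:

# Upper-bound improvement factors for each of the 4Ms
model_factor = 10          # efficient/sparse architectures: 5-10x less compute
machine_factor = 5         # ML-optimized processors: 2-5x better perf/Watt
mechanization_factor = 2   # cloud data centers: 1.4-2x better energy efficiency
map_factor = 10            # low-carbon regions: 5-10x lower gross carbon

energy_reduction = model_factor * machine_factor * mechanization_factor
carbon_reduction = energy_reduction * map_factor
print(f"Energy reduction: up to {energy_reduction}x")   # up to 100x
print(f"CO2e reduction:   up to {carbon_reduction}x")   # up to 1000x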
Conclusion
The increasing complexity and growth of AI models, particularly LLMs, show how energy-intensive AI systems are becoming. It is therefore critical to measure the energy footprint across the entire ML lifecycle, from data processing through training and, critically, inference. Measuring the footprint enables informed decisions during the model development and deployment phases, leading to energy-efficient AI/ML systems. Tools like CodeCarbon and MLCarbon provide valuable means to measure and track these emissions. However, measurement alone is insufficient.
The path towards sustainable AI lies in a multifaceted strategy. In addition to tracking emissions, thoughtful model selection, reducing computational load, choosing energy-efficient hardware, and strategically leveraging cloud infrastructure in low-carbon regions can further reduce the energy footprint. By embracing these best practices, we can build energy-efficient AI/ML systems and drive innovation responsibly.