SageMaker Debugger docs. This section walks you through the Debugger profiling report section by section. With Amazon SageMaker AI, data scientists and developers can quickly build and train machine learning models, and then deploy them into a production-ready hosted environment. To learn more about Debugger, see Amazon SageMaker Debugger. Amazon SageMaker Debugger provides transparent visibility into training jobs and saves training metrics into your Amazon S3 bucket. This troubleshooting guide aims to help you understand and resolve common issues that might arise when working with the SageMaker Python SDK. The collections_to_save argument takes a tensor configuration through the CollectionConfig API, which requires name and parameters arguments. Configure the Debugger-specific parameters when constructing a SageMaker estimator to gain visibility and insights into your training job. These attributions can be provided for specific predictions and at a global level for the model as a whole. Configuring SageMaker Debugger: regardless of which of the two ways above you have enabled SageMaker Debugger, you can configure it using the SageMaker Python SDK. The following sections provide notebooks and code samples that show how to use Debugger rules to monitor SageMaker training jobs. The following figure shows how this process works in the case where your model is deployed to a real-time endpoint. Construct a Run instance. The following videos provide a tour of Amazon SageMaker Debugger capabilities using SageMaker Studio and SageMaker AI notebook instances. Amazon SageMaker Training Compiler is a feature of SageMaker Training that speeds up training jobs by optimizing model execution. The following screenshot shows a collage of the Debugger profiling report. Debugger's debugging functionality for model optimization is about analyzing non-converging training jobs. You can use the Debugger built-in rules, provided by Amazon SageMaker Debugger, to analyze metrics and tensors collected while training your models. If you're registering a model within Model Registry, you can use the integration to add auditing information. smdebug retrieves and filters the tensors generated by Debugger, such as gradients, weights, and biases. If you want access to the hook to configure certain things that cannot be configured through the SageMaker SDK, you can retrieve the hook as follows. This guide walks you through the content of the SageMaker Debugger Insights dashboard, starting with the System Metrics tab. For SageMaker AI XGBoost training jobs, use the Debugger CreateXgboostReport rule to receive a comprehensive report of the training progress and results. Use SageMaker Debugger to create output tensor files that are compatible with TensorBoard. Amazon SageMaker Studio Classic provides an experiments browser that you can use to view lists of experiments and runs. Amazon SageMaker Studio Classic is a web-based integrated development environment (IDE) for machine learning (ML).
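As a minimal sketch of the collections_to_save-style configuration described above, using the SageMaker Python SDK's CollectionConfig and DebuggerHookConfig classes (the bucket path and save intervals below are placeholder values, not from the original text):

```python
from sagemaker.debugger import CollectionConfig, DebuggerHookConfig

# Each CollectionConfig names a built-in tensor collection and tunes it
# through the `parameters` dict (name and parameters, as described above).
hook_config = DebuggerHookConfig(
    s3_output_path="s3://your-bucket/debugger-output",  # placeholder bucket
    collection_configs=[
        CollectionConfig(name="weights", parameters={"save_interval": "100"}),
        CollectionConfig(name="gradients", parameters={"save_interval": "100"}),
    ],
)
```

The resulting hook_config object is then passed to an estimator through its debugger_hook_config argument.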
In a single visual interface, you can do everything from data preparation and experimentation to deployment and monitoring. SageMaker Clarify provides feature attributions based on the concept of Shapley values. This site is based on the SageMaker Examples repository on GitHub. Debugger automatically generates output tensor files that are compatible with TensorBoard. SageMaker Debugger helps you develop and optimize model performance and computation. To learn about SageMaker Model Monitor, see Data and model quality monitoring with Amazon SageMaker Model Monitor. SageMaker AI Debugger offers the Rule API operation that monitors training job progress and errors for the success of training your model. The following estimator class methods are useful for accessing your SageMaker training job information and retrieving output paths of training data collected by Debugger. It provides an integrated Jupyter authoring notebook instance for easy access to your data sources for exploration and analysis, so you don't have to manage servers. TensorBoard can be accessed in SageMaker AI either programmatically through the sagemaker.interactive_apps.tensorboard module or through the TensorBoard landing page in the SageMaker console, and it automatically finds and displays all training job output data in a compatible format. For a tutorial on what you can do after creating the trial and how to visualize the results, see SageMaker Debugger - Visualizing Debugging Results. SageMaker Canvas: an AutoML service that gives people with no coding experience the ability to build models and make predictions with them. To learn more, see SageMaker Debugger interactive report. Amazon SageMaker Example Notebooks: welcome to Amazon SageMaker. To activate or update the Debugger monitoring configuration for a training job that is currently running, use the SageMaker AI estimator extension methods. The SageMaker Debugger Insights dashboard uses an ml.m5.4xlarge instance to process and render the visualizations. Debugger supports profiling functionality for performance optimization to identify computation issues, such as system bottlenecks and underutilization, and to help optimize hardware resource utilization at scale. Let's start with a quick introduction to model pruning. Amazon SageMaker Debugger built-in rules can be configured for a training job using the DebugHookConfig, DebugRuleConfiguration, ProfilerConfig, and ProfilerRuleConfiguration objects through the SageMaker CreateTrainingJob API operation. It creates a multi-GPU, multi-node training job using Horovod. IAM role: for a SageMaker training container to start with the SSM agent, provide an IAM role with SSM permissions. Learn how to create an estimator for default system monitoring and customized framework profiling with different profiling options using Debugger. class sagemaker.experiments.Run(experiment_name, run_name=None, experiment_display_name=None, run_display_name=None, tags=None, sagemaker_session=None, artifact_bucket=None, artifact_prefix=None): a collection of parameters, metrics, and artifacts to create an ML model. Configure Debugger with Amazon CloudWatch Events and AWS Lambda to take action based on Debugger rule evaluation status.
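As a hedged illustration of the estimator methods mentioned above for retrieving Debugger output paths (the image URI and role ARN below are placeholders):

```python
from sagemaker.estimator import Estimator

estimator = Estimator(
    image_uri="<training-image-uri>",  # placeholder training image
    role="arn:aws:iam::111122223333:role/SageMakerRole",  # placeholder role ARN
    instance_count=1,
    instance_type="ml.m5.xlarge",
)
# After estimator.fit(...) has started a Debugger-enabled training job,
# these methods return the S3 locations of the collected data:
print(estimator.latest_job_debugger_artifacts_path())
print(estimator.latest_job_profiler_artifacts_path())
```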
In this tutorial, you will learn how to use SageMaker Debugger and its built-in rules to debug your model. Although the SDK provides a simplified workflow, you might encounter various exceptions or errors. You'll learn how to perform various tasks related to SageMaker HyperPod, whether you prefer a visual interface or working with commands. Examples of how to use SageMaker Debugger. This notebook shows how to: * Host a machine learning model in Amazon SageMaker and capture inference requests, results, and metadata * Schedule Clarify bias monitor to monitor predictions for bias drift on a regular basis. * Schedule Clarify explainability monitor to monitor predictions for feature attribution drift on a regular basis. Amazon SageMaker Python SDK is an open source library for training and deploying machine-learned models on Amazon SageMaker. How SageMaker Debugger works: an archive of an older blog post. For more information about the Debugger-specific parameters, see SageMaker AI Estimator in the Amazon SageMaker Python SDK. Amazon SageMaker AI with TensorBoard: to offer greater compatibility with the open-source community tools within the SageMaker AI Training platform, SageMaker AI hosts TensorBoard as an application in the SageMaker AI domain. Explore the Debugger features and learn how you can debug and improve your machine learning models efficiently by using Debugger. The following sections outline the process needed to automate training job termination using CloudWatch and Lambda. Use the CollectionConfig API operation to configure tensor collections. You need to specify the right image URI in the RuleEvaluatorImage parameter, and the following examples walk you through how to set up the JSON strings. The preceding topics focus on using Debugger through the Amazon SageMaker Python SDK, which is a wrapper around the AWS SDK for Python (Boto3) and SageMaker API operations. (Optional) Install the SageMaker and SMDebug Python SDKs: to use the Debugger profiling features released in December 2020, ensure that you have the latest versions of the SageMaker and SMDebug SDKs installed. The following screenshot shows the full view of the SageMaker AI Data Manager tab in the TensorBoard application. Model Monitor uses rules to detect drift in your models and alerts you when it happens. In the following topics, you'll learn how to use the SageMaker Debugger built-in rules. Amazon SageMaker Debugger is a feature that offers the capability to debug machine learning and deep learning models during training by identifying and detecting problems with the models in real time. Download the SageMaker Debugger profiling report while your training job is running or after the job has finished, using the Amazon SageMaker Python SDK and AWS Command Line Interface (CLI). Available frameworks are Apache MXNet, TensorFlow, PyTorch, and XGBoost. The following table outlines a variety of sample notebooks that address different use cases of the Amazon SageMaker XGBoost algorithm. The Amazon SageMaker Studio user interface is split into three distinct parts.
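A short sketch of the offline analysis side with the smdebug client library, loading what a Debugger-enabled job saved; the S3 path and the tensor name are hypothetical placeholders:

```python
from smdebug.trials import create_trial

# Point this at the Debugger output of a running or finished training job.
trial = create_trial("s3://your-bucket/debugger-output")  # placeholder path
print(trial.tensor_names())            # names of all saved tensors
loss = trial.tensor("losses/loss")     # hypothetical tensor name
for step in loss.steps():              # iterate saved training steps
    print(step, loss.value(step))
```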
The rule_parameters argument adjusts the default key values of the built-in rules listed in the List of Debugger built-in rules. You can also use the SageMaker AI console UI to open the TensorBoard application. This section provides guidance on managing SageMaker HyperPod through the SageMaker AI console UI or the AWS Command Line Interface (CLI). To learn more about how to configure the DebugHookConfig parameter, see Use the SageMaker and Debugger Configuration API Operations to Create, Update, and Debug Your Training Job. For more information about monitoring training jobs using CloudWatch, see Monitor Amazon SageMaker. Train a model using the input training dataset. Amazon SageMaker Debugger: Amazon SageMaker Debugger allows you to detect anomalies while training your machine learning model by emitting relevant data during training, storing the data, and then analyzing it. To use Debugger with customized containers, you need to make a minimal change to your training script to implement the Debugger hook callback and retrieve tensors from training jobs. Amazon SageMaker Debugger provides functionality to save tensors during training of machine learning jobs and analyze those tensors (awslabs/sagemaker-debugger). To see an example using Debugger in a SageMaker training job, you can reference one of the notebook examples in the SageMaker Notebook Examples GitHub repository. For any hook configuration you customize for saving output tensors, Debugger has the flexibility to create scalar summaries. Note: the SageMaker Debugger Insights dashboard runs a Studio Classic application on an ml.m5.4xlarge instance. The smdebug client library is an open source library that powers SageMaker Debugger by calling the saved training data from training jobs. Amazon SageMaker Debugger automates the debugging process of machine learning training jobs. Introduction to SMDebug. SMDebug: Amazon SageMaker Debugger Client Library. Table of Contents: Overview, Install the smdebug library, Debugger-supported Frameworks, How It Works, Examples, SageMaker Debugger in Action, Further Documentation and References, License, Release Notes. The API calls the Amazon SageMaker CreateTrainingJob API to start model training. Amazon SageMaker AI is a fully managed machine learning service. When you initiate a SageMaker training job, SageMaker Debugger starts monitoring the resource utilization of the Amazon EC2 instances by default. Use the Amazon SageMaker Debugger dashboard in Amazon SageMaker Studio Classic to analyze the computational performance of your training job on Amazon EC2 instances. Debug training jobs in real time, detect non-converging conditions, and improve model performance using Amazon SageMaker Debugger. Refer to the SageMaker developer guide's Get Started page. Make sure you determine which output tensors and scalars to collect, and modify code lines in your training script using any of the following tools: TensorBoardX, TensorFlow Summary Writer, PyTorch Summary Writer, or SageMaker Debugger. This page gives information about the distinct parts and their components. Batch size: in distributed training, as more nodes are added, batch sizes should increase proportionally.
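A minimal sketch of adjusting a built-in rule's default key values through rule_parameters; the num_steps override below is an illustrative value, not a recommendation:

```python
from sagemaker.debugger import Rule, rule_configs

# base_config takes the built-in rule method; rule_parameters overrides
# default keys documented in the List of Debugger built-in rules.
rules = [
    Rule.sagemaker(
        base_config=rule_configs.loss_not_decreasing(),
        rule_parameters={"num_steps": "10"},  # illustrative override
    )
]
```

The rules list is then passed to an estimator through its rules argument.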
Amazon SageMaker Debugger tutorials: the following topics walk you through tutorials from the basics to advanced use cases of monitoring, profiling, and debugging SageMaker training jobs using Debugger. DebugHookConfig: configuration information for the Amazon SageMaker Debugger hook parameters, metric and tensor collections, and storage paths. For example, if you used an ML model for college admissions, the explanations could help determine why the model made a particular prediction. You can use the training and Debugger rule job status in the CloudWatch logs to take further actions when there are training issues. If you want to adjust the built-in rule parameter values and customize tensor collection regex, configure the base_config and rule_parameters parameters for the ProfilerRule.sagemaker and Rule.sagemaker classmethods. Find more information and references about using Amazon SageMaker Debugger in the following topics. You can also use this for MXNet, PyTorch, and XGBoost estimators. Code Editor: Code Editor extends Studio so that you can write, test, debug, and run your analytics and machine learning code in an environment based on Visual Studio Code - Open Source ("Code-OSS"). SageMaker Debugger cannot collect model output tensors from the torch.nn.functional API operations. These tools can help ML modelers, developers, and other internal stakeholders understand model characteristics as a whole prior to deployment and debug predictions provided by the model after it's deployed. The profiling report is generated based on the built-in rules for monitoring and profiling. You can do the following while a training job is running. Amazon SageMaker Model Monitor automatically monitors machine learning (ML) models in production and notifies you when quality issues happen. Explaining how SageMaker Debugger works. This class initializes a TrainingCompilerConfig instance. To learn more about the programming model for analysis using the SageMaker Debugger SDK, see SageMaker Debugger Analysis. Amazon SageMaker Debugger Support for TensorFlow: the Amazon SageMaker Debugger Python SDK and its client library smdebug now fully support TensorFlow 2.3 with the latest version release. In case you need to manually configure the SageMaker API operations using AWS Boto3 or the AWS Command Line Interface (CLI) for other SDKs, see the following topics. Amazon SageMaker Clarify provides tools to help explain how machine learning (ML) models make predictions. Using Amazon SageMaker Debugger with your own PyTorch container: Amazon SageMaker is a managed platform to build, train, and host machine learning models. Studio Classic includes all of the tools you need to take your models from data preparation to experimentation to production with increased productivity. To enable remote debugging for your training job, SageMaker AI needs to start the SSM agent in the training container when the training job starts. Each SageMaker Debugger Insights tab runs one Studio Classic kernel session.
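A minimal sketch of the ProfilerRule.sagemaker classmethod with its base_config argument, here using the built-in ProfilerReport rule with its default parameters (no customization assumed):

```python
from sagemaker.debugger import ProfilerRule, rule_configs

# base_config is where you call the built-in rule method; default
# parameter values are used here.
profiler_rules = [
    ProfilerRule.sagemaker(base_config=rule_configs.ProfilerReport()),
]
```

As with the debugging rules, this list is handed to the estimator's rules argument so the rule job runs alongside training.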
This notebook demonstrates how we can use SageMaker Debugger and SageMaker Experiments to perform iterative model pruning. Save tensors using Debugger built-in collections: you can use built-in collections of tensors using the CollectionConfig API and save them using the DebuggerHookConfig API. There are two options to open the TensorBoard application through the SageMaker AI console. The report is automatically aggregated depending on the output tensor regex, recognizing whether your training job is binary classification, multiclass classification, or regression. The API uses the configuration you provided to create the estimator and the specified input training data to send the CreateTrainingJob request to Amazon SageMaker. Model Monitor is integrated with SageMaker Clarify to improve visibility into potential bias. The Debugger rules monitor training job status, and a CloudWatch Events rule watches the Debugger rule training job evaluation status. Debugger provides pre-built tensor collections that cover a variety of regular expressions (regex) of parameters if using Debugger-supported deep learning frameworks and machine learning algorithms. For more information, see Update the Details of a Model Version. Use TensorBoard within Amazon SageMaker AI to debug and analyze your machine learning model and the training job of the model. You need to specify the right image URI in the RuleEvaluatorImage parameter, and the following examples walk you through how to set up the request body for the create_training_job() function. It covers a range of common scenarios. The AWS CLI, SageMaker AI Estimator API, and the Debugger APIs enable you to use any Docker base images to build and customize containers to train your models. You can filter the list of experiments by entity name, type, and tags. Receive training reports autogenerated by Debugger. The following lists the Debugger rules, including information and an example of how to configure and deploy each built-in rule. You can choose one of these entities to view detailed information about the entity or choose multiple entities for comparison. This offers a high-level experience of accessing the Amazon SageMaker API operations. You can track the system utilization rates, statistics overview, and built-in rule analysis through the Insights dashboard. Also make sure that you specify the TensorBoard data output path as the log directory (log_dir) for the callback in the training container. Following this guide, specify the CreateXgboostReport rule when you construct an estimator, as shown in the sketch below. You can use the SageMaker Python SDK to interact with Amazon SageMaker AI within your Python scripts or Jupyter notebooks. The Debugger report provides insights into your training jobs and suggests recommendations to improve your model performance.
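A sketch of attaching the CreateXgboostReport rule with the SageMaker Python SDK; the list is passed through the estimator's rules argument:

```python
from sagemaker.debugger import Rule, rule_configs

# The built-in XGBoost report rule generates the comprehensive
# training report described above.
rules = [Rule.sagemaker(rule_configs.create_xgboost_report())]
```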
In addition, you can configure alerts so you can troubleshoot violations as they arise and promptly initiate retraining. The following topics show how to use the CollectionConfig and DebuggerHookConfig API operations, followed by examples of how to use the Debugger hook to save, access, and visualize output tensors. Debugger provides several profiling features. The SageMaker Debugger Rule class configures debugging rules to debug your training job. SageMaker Experiments enables you to retrieve the training information as trials through SageMaker Studio and supports visualization of the training job. Amazon SageMaker AI provides two debugging tools to help identify such convergence issues and gain visibility into your models. Multiple kernel sessions for multiple SageMaker Debugger Insights tabs run on a single instance. The sagemaker-debugger Python SDK provides tools for adapting your training script before training and analysis tools after training. Built-in algorithms and pretrained models in Amazon SageMaker: SageMaker provides algorithms for training machine learning models, classifying images, detecting objects, analyzing text, forecasting time series, reducing data dimensionality, and clustering data groups. This topic walks you through a high-level overview of the Amazon SageMaker Debugger workflow. When you close a SageMaker Debugger Insights tab, the corresponding kernel session also shuts down. Amazon SageMaker Debugger comes with a client library called the sagemaker-debugger Python SDK. Configuring the hook using the SageMaker Python SDK: after you make the minimal changes to your training script, you can configure the hook with parameters to the SageMaker Debugger API operation, DebuggerHookConfig. class sagemaker.huggingface.TrainingCompilerConfig(enabled=True, debug=False), bases TrainingCompilerConfig: the SageMaker Training Compiler configuration class. Studio Classic lets you build, train, debug, deploy, and monitor your ML models. As shown in the following example code, add the built-in tensor collections you want to debug. From training jobs, Debugger allows you to run your own training script (Zero Script Change experience) using Debugger built-in features, Hook and Rule, to capture tensors, gives you the flexibility to build customized Hooks and Rules for configuring tensors as you want, and makes the tensors available for analysis. The base_config argument is where you call the built-in rule methods. To turn off debugging, set the debugger_hook_config parameter to False. On this page, you'll learn how to adapt your training script using the client library. The current release of SageMaker XGBoost is based on the original XGBoost versions 1.0, 1.2, 1.3, and 1.5. This section walks you through the Debugger XGBoost training report. When you write a PyTorch training script, it is recommended to use the torch.nn modules instead.
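A hedged sketch of that minimal training-script change, assuming a PyTorch script running inside a SageMaker training container where Debugger injects its JSON hook configuration; the model and loss below are placeholders:

```python
import torch.nn as nn
import smdebug.pytorch as smd

model = nn.Linear(10, 2)           # placeholder model (use your own module)
criterion = nn.CrossEntropyLoss()  # placeholder loss

# Inside a SageMaker container, this reads the hook configuration that
# SageMaker writes when DebuggerHookConfig is set on the estimator.
hook = smd.Hook.create_from_json_file()
hook.register_module(model)    # capture module inputs and outputs
hook.register_loss(criterion)  # capture loss values at each step
```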
Debugger provides automatic detection of training problems through its built-in rules, and you can find a full list of the built-in rules for debugging in the List of Debugger Built-in Rules. This SageMaker Debugger module provides high-level methods to set up Debugger configurations to monitor, profile, and debug your training job. Amazon CloudWatch collects Amazon SageMaker AI model training job logs and Amazon SageMaker Debugger rule processing job logs. The debugging rules analyze tensor outputs from your training job and monitor conditions that are critical for the success of the training job. You can use Shapley values to determine the contribution that each feature made to model predictions. You can conduct an online or offline analysis by loading collected output tensors from S3 buckets paired with training jobs during or after training. To use Amazon SageMaker Experiments when running a training job on SageMaker or SageMaker Autopilot, all you need to do is add a parameter to the Estimator that defines which experiment the trial is associated with. A library for training and deploying machine learning models on Amazon SageMaker (aws/sagemaker-python-sdk). Estimator configuration with parameters for basic profiling using the Amazon SageMaker Debugger Python modules: configure SageMaker Debugger profiling, monitor resource utilization metrics, configure profiler rules, activate built-in profiler rules, adjust the basic profiling configuration, configure framework profiling, and update the profiling configuration. The report shows result plots only for the rules that found issues. Background: Amazon SageMaker Model Monitor continuously monitors the quality of ML models in production. SageMaker is made up of a large family of services named in the form "SageMaker X," and accurately understanding each service's features and use cases is the key to mastering the exam. It provides an XGBoost estimator that executes a training script in a managed XGBoost environment. Refer to the SageMaker developer guide's Get Started page. SageMaker geospatial capabilities: build, train, and deploy ML models using geospatial data. Amazon SageMaker Debugger built-in rules can be configured for a training job using the create_training_job() function of the AWS Boto3 SageMaker AI client, as sketched below. The following example shows how to use the default settings of Debugger hook configurations to construct a SageMaker AI TensorFlow estimator. To run these notebooks, you will need a SageMaker Notebook Instance or SageMaker Studio. This is a synchronous operation. The following code shows a complete example. This notebook will walk you through creating a TensorFlow training job with the SageMaker Debugger profiling feature enabled. Warning: if you disable it, you won't be able to view the comprehensive Studio Debugger insights dashboard and the autogenerated profiling report.
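A sketch of the request fields involved when configuring a built-in rule through Boto3, under the assumption that the placeholder rule-evaluator image URI and bucket are replaced with real, region-specific values:

```python
# Request fragments for the AWS Boto3 SageMaker client's create_training_job()
# call. Image URI, bucket, and rule name are placeholders.
debug_hook_config = {
    "S3OutputPath": "s3://your-bucket/debug-output",
    "CollectionConfigurations": [
        {"CollectionName": "losses", "CollectionParameters": {"save_interval": "50"}}
    ],
}
debug_rule_configurations = [
    {
        "RuleConfigurationName": "LossNotDecreasing",
        # Region-specific Debugger rule evaluator image (placeholder).
        "RuleEvaluatorImage": "123456789012.dkr.ecr.us-west-2.amazonaws.com/rule-image:latest",
        "RuleParameters": {"rule_to_invoke": "LossNotDecreasing"},
    }
]
# Pass these as the DebugHookConfig and DebugRuleConfigurations fields of
# boto3.client("sagemaker").create_training_job(), alongside the usual
# TrainingJobName, AlgorithmSpecification, RoleArn, and resource settings.
```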
Amazon SageMaker Debugger's built-in rules analyze tensors emitted during the training of a model. The SageMaker Debugger Insights dashboard runs a Studio Classic app on an ml.m5.4xlarge instance. The following procedure shows how to access the related CloudWatch logs. Load the files to visualize in TensorBoard and analyze your SageMaker training jobs. The Debugger reports provide insights into your training jobs and suggest recommendations to improve your model performance. While constructing a SageMaker AI estimator, activate SageMaker Debugger by specifying the debugger_hook_config parameter. SageMaker Experiments automatically tracks the inputs, parameters, configurations, and results of your iterations as runs. Receive profiling reports autogenerated by Debugger. You are encouraged to configure the hook from the SageMaker Python SDK so you can run different jobs with different configurations without having to modify your script. Amazon SageMaker Model Card is integrated with SageMaker Model Registry. Build a Custom Training Container and Debug Training Jobs with Amazon SageMaker Debugger: Amazon SageMaker Debugger enables you to debug your model through its built-in rules and tools (smdebug hook and core features) to store and retrieve output tensors in Amazon Simple Storage Service (S3). There are two aspects to this configuration. This site highlights example Jupyter notebooks for a variety of machine learning use cases that you can run in SageMaker. When you open the TensorBoard application, TensorBoard opens with the SageMaker AI Data Manager tab. You can track and debug model parameters, such as weights, gradients, biases, and scalar values of your training job.
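Tying the pieces together, a hedged end-to-end sketch of activating Debugger through the debugger_hook_config parameter while constructing an estimator; the script name, role ARN, bucket, and framework versions are placeholder values:

```python
from sagemaker.tensorflow import TensorFlow
from sagemaker.debugger import (
    CollectionConfig,
    DebuggerHookConfig,
    Rule,
    rule_configs,
)

estimator = TensorFlow(
    entry_point="train.py",  # hypothetical training script
    role="arn:aws:iam::111122223333:role/SageMakerRole",  # placeholder role ARN
    instance_count=1,
    instance_type="ml.p3.2xlarge",
    framework_version="2.3",
    py_version="py37",
    # Activate Debugger by specifying the hook configuration...
    debugger_hook_config=DebuggerHookConfig(
        s3_output_path="s3://your-bucket/debug-output",  # placeholder bucket
        collection_configs=[CollectionConfig(name="losses")],
    ),
    # ...and attach a built-in rule to evaluate the saved tensors.
    rules=[Rule.sagemaker(rule_configs.loss_not_decreasing())],
)
estimator.fit()  # Debugger saves tensors and runs the rule job during training
```

Setting debugger_hook_config=False instead would turn off tensor collection entirely, as noted earlier.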