Artificial-intelligence-driven scanning probe microscopy

Sample preparation

The samples were prepared in situ by sublimation of MgPc molecules (Sigma-Aldrich) at 650 K (deposition rate ≈ 0.014 molecules nm⁻² min⁻¹; sub-monolayer coverages) onto a clean Ag(100) surface (Mateck GmbH) held at room temperature. The Ag surface was prepared in ultra-high vacuum (UHV) by repeated cycles of Ar⁺ sputtering and annealing at 720 K. The base pressure was below 1 × 10⁻⁹ mbar during molecular deposition.

STM measurements

The STM measurements were performed using a commercial scanning probe microscope (Createc) capable of both STM and non-contact AFM at low temperature (down to 4.6 K) and in UHV. This setup includes two probe-positioning systems: a coarse one for macroscopic approach and lateral positioning of the probe above the sample, and a fine one consisting of a piezo scanner that allows for high-resolution imaging. The lateral range of the fine piezo scanner at 4.6 K is ±425 nm, i.e., each approach of the probe to the sample with the coarse system defines a scanning region of 850 × 850 nm². Nanonis electronics and SPM software (SPECS) were used to operate the setup. All measurements were performed at 4.6 K with a Pt/Ir tip. All topographic images were acquired in constant-current mode (V_bias = 1 V, I_t = 25 pA) at a scan speed of 80 nm s⁻¹.

After moving the probe macroscopically to a new sample area, DeepSPM extends the z piezo scanner to ~80% of its maximum extension, to maximize the range of tunneling current feedback controlling the z position of the scanner without crashing or losing contact. If tunneling contact is lost, DeepSPM re-approaches the probe. DeepSPM also handles probe crashes (see section below “Detection and fixing of lost contact/probe crash”).

After approaching the probe to the sample, DeepSPM waits ~120 s before starting a new scan, to let thermal drift and (mainly) creep of the z piezo settle. New measurements start at the neutral position of the xy scan piezo (that is, x = y = 0), where no voltages are applied to the xy piezo scanner, to minimize the piezo creep in the xy scanning plane. DeepSPM selects the next imaging region by minimizing the distance that the probe travels between regions (see Fig. 4, section “Finding the next imaging region” and Supplementary Fig. 1), further reducing xy piezo creep. After moving the probe to a new imaging region, DeepSPM first records a partial image that is usually severely affected by distortions due to xy piezo creep. This image is discarded and a second image, not or only minimally affected, is recorded at the same position.

During autonomous operation, the RL agent of DeepSPM continues to learn about probe conditioning (see main text and below). To avoid damaging a good probe by unnecessary conditioning during autonomous operation, DeepSPM initiates a probe-conditioning episode only after ten consecutive images have been classified as “bad probe” by the classifier CNN (Supplementary Fig. 6). The episode is terminated as soon as the first image is classified as “good probe” by the classifier CNN. Images that are neither part of a conditioning episode (which includes the ten consecutive images that triggered it) nor disregarded due to “bad sample”, lost probe–sample contact or a crashed probe are labeled as “good image”.
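The triggering rule described above can be sketched as a small state machine. This is a minimal illustration, not the DeepSPM implementation: the class name `ConditioningTrigger` and its interface are hypothetical, and only the threshold of ten consecutive “bad probe” images and the termination on the first “good probe” image are taken from the text.

```python
class ConditioningTrigger:
    """Tracks classifier verdicts and decides when conditioning is active."""

    def __init__(self, bad_streak_needed=10):
        self.bad_streak_needed = bad_streak_needed  # ten in the text
        self.bad_streak = 0
        self.in_episode = False

    def update(self, is_good_probe):
        """Feed one classifier verdict; return True while an episode is active."""
        if self.in_episode:
            if is_good_probe:
                # The first "good probe" image terminates the episode.
                self.in_episode = False
                self.bad_streak = 0
            return self.in_episode
        if is_good_probe:
            # A good image interrupts the streak of bad images.
            self.bad_streak = 0
        else:
            self.bad_streak += 1
            if self.bad_streak >= self.bad_streak_needed:
                self.in_episode = True
        return self.in_episode


trigger = ConditioningTrigger()
before = [trigger.update(False) for _ in range(9)]  # nine bad images in a row
started = trigger.update(False)                     # tenth consecutive bad image
ended = trigger.update(True)                        # first good image afterwards
```

Requiring a streak rather than a single verdict makes the trigger tolerant of isolated misclassifications, at the cost of recording some images with a degraded probe.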

Finding the next imaging region

For each new approach area, DeepSPM starts acquiring data at the center of the scanning range (Fig. 4a). DeepSPM uses a binary map to block the imaging regions that have already been scanned (Supplementary Fig. 1). If a region is identified as “bad sample” (e.g., excessive roughness is detected), DeepSPM defines a larger circular area around it (as further roughness is expected in the vicinity), and avoids scanning this area. The radius \(r_{\mathrm{forbidden}}\) of this area is increased as consecutive imaging regions are identified as “bad sample”:

$$r_{\mathrm{forbidden}} = \sqrt{t} \times 25\,\mathrm{nm},$$


where t is the number of consecutive times an area with excessive roughness was detected.
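As an illustration of how this radius could be applied to the binary map, the following sketch computes the forbidden radius and marks the affected cells of a grid. The grid layout, the cell size and the function names are assumptions made for this example; only the square-root scaling and the 25 nm prefactor come from the text.

```python
import math


def forbidden_radius_nm(t, base_nm=25.0):
    """Forbidden radius sqrt(t) * 25 nm, where t counts consecutive
    "bad sample" detections."""
    return math.sqrt(t) * base_nm


def block_circle(blocked, cx_nm, cy_nm, radius_nm, cell_nm):
    """Mark every cell whose center lies within radius_nm of (cx_nm, cy_nm).

    blocked is a list of rows of booleans (hypothetical map layout);
    coordinates are measured from the corner of the map.
    """
    for row in range(len(blocked)):
        for col in range(len(blocked[0])):
            x = (col + 0.5) * cell_nm
            y = (row + 0.5) * cell_nm
            if math.hypot(x - cx_nm, y - cy_nm) <= radius_nm:
                blocked[row][col] = True
    return blocked


# A 100 nm x 100 nm map with 10 nm cells, one "bad sample" hit at its center.
grid = [[False] * 10 for _ in range(10)]
block_circle(grid, 50.0, 50.0, forbidden_radius_nm(1), cell_nm=10.0)
n_blocked = sum(cell for row in grid for cell in row)
```

Because the radius grows with the square root of t, the blocked area grows linearly with the number of consecutive detections.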

As probe-conditioning actions can cause debris and roughness on the sample, DeepSPM also blocks a circular area around the location of each performed conditioning action. The size of this area depends on the executed action (Supplementary Table 2). DeepSPM chooses the next imaging region centered at position \(\mathbf{v}_t = x_t\hat{\mathbf{x}} + y_t\hat{\mathbf{y}}\) (\(x_t\), \(y_t\) are coordinates with respect to the center of the approach area) that minimizes

$$d = \left\| \mathbf{v}_t - \mathbf{v}_{t-1} \right\|_2 + \alpha \left\| \mathbf{v}_t \right\|_1$$


provided that this is a valid imaging region according to the established binary map. Here, \(\mathbf{v}_{t-1}\) denotes the position of the center of the last imaging region, \(\|\cdot\|_1\) is the Manhattan norm, and \(\|\cdot\|_2\) is the standard Euclidean norm. The parameter α controls the relative weight of the two distances, i.e., to the center of the last scanned region (to minimize travel distance), and to the center of the approach area (to efficiently use the entire available area). We found that α = 1 works well. This algorithm minimizes the distance the probe travels between consecutive imaging regions, reducing the impact of xy piezo creep. Once the area defined by the fine piezo scanner range has been filled, or the distance to the center of the next available scanning region is larger than 500 nm, DeepSPM moves the probe (macroscopically, with the coarse positioning system) to a new approach area.
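The selection rule can be sketched as a minimization of d over a list of candidate region centers. The candidate list, the `is_valid` callback (standing in for a lookup in the binary map) and the function names are assumptions for illustration; the 500 nm criterion for changing the approach area is not modeled here.

```python
import math


def travel_cost(v, v_prev, alpha=1.0):
    """Euclidean distance to the last region plus alpha times the
    Manhattan distance to the approach-area center (the origin)."""
    euclidean = math.hypot(v[0] - v_prev[0], v[1] - v_prev[1])
    manhattan = abs(v[0]) + abs(v[1])
    return euclidean + alpha * manhattan


def next_region(candidates, v_prev, is_valid, alpha=1.0):
    """Return the valid candidate with the smallest cost, or None if the
    map offers no valid region (a new approach area would be needed)."""
    valid = [v for v in candidates if is_valid(v)]
    if not valid:
        return None
    return min(valid, key=lambda v: travel_cost(v, v_prev, alpha))


# Positions in nm relative to the approach-area center.
candidates = [(0, 0), (100, 0), (0, 100), (100, 100)]
scanned = {(0, 0), (100, 0)}
chosen = next_region(candidates, (100, 0), lambda v: v not in scanned)
chosen_no_pull = next_region(candidates, (100, 0), lambda v: v not in scanned, alpha=0.0)
```

With α = 1 the region at (0, 100) wins because it lies closer to the center of the approach area, whereas with α = 0 only the travel distance counts and (100, 100) is selected.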

DeepSPM architecture

The DeepSPM framework consists of two components: (i) the controller, written in Python and TensorFlow, and (ii) a TCP server, written in LabVIEW. The controller contains the image processing, classifier CNN, and RL agent. The TCP server creates an interface between the controller and the Nanonis SPM software. The controller sends commands via TCP, e.g., for acquiring and recording an image or for executing a conditioning action at a certain location. The server receives these commands and executes them on the Nanonis SPM software. It returns the resulting imaging data via TCP to the controller, where it is processed to determine the next command. Based on this design, the agent can operate on hardware decoupled from the Nanonis SPM software.
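The request and reply pattern between controller and server can be sketched as follows. The wire format (newline-terminated JSON), the command name `acquire_image` and its parameters are purely hypothetical, since the actual protocol is not specified here; a connected socket pair stands in for the controller and the server so that the example runs in a single process.

```python
import json
import socket


def send_command(sock, name, **params):
    """Send one newline-terminated JSON command (hypothetical wire format)."""
    message = json.dumps({"command": name, "params": params}) + "\n"
    sock.sendall(message.encode("utf-8"))


def receive_message(sock):
    """Read bytes until a newline arrives and decode the JSON payload."""
    buffer = bytearray()
    while not buffer.endswith(b"\n"):
        chunk = sock.recv(4096)
        if not chunk:
            raise ConnectionError("peer closed the connection")
        buffer.extend(chunk)
    return json.loads(buffer.decode("utf-8"))


# In-process demonstration with a connected socket pair.
controller, server = socket.socketpair()
send_command(controller, "acquire_image", x_nm=0.0, y_nm=0.0, size_nm=10.0)
request = receive_message(server)            # server side: read the command
answer = {"status": "ok", "echo": request["command"]}
server.sendall((json.dumps(answer) + "\n").encode("utf-8"))
reply = receive_message(controller)          # controller side: read the reply
controller.close()
server.close()
```

Keeping the instrument behind a network interface of this kind is what allows the controller to run on a separate machine from the SPM software.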

Training and test dataset for classifier CNN

We compiled a dataset of 7589 images (constant-current STM topography, 64 × 64 pixels) of MgPc molecules on Ag(100), acquired via human operation. We assigned to each image a ground truth label of the categories “good probe” (25%) or “bad probe” (75%). We randomly split the data into a training (76%) and test set (24%). We used the latter to test the performance of the classifier CNN on unseen data (i.e., not used for training). The dataset is available online at

It is important to note that the classifier CNN was trained to distinguish a “good probe” from a “bad probe”. The classifier CNN was not trained to identify a specific type of probe defect in the case of a “bad probe”. Figure 1b shows examples of possible imaging defects, including different types of probe defects (recognized as “bad probe” by the classifier CNN) and other image acquisition issues (e.g., lost contact and excessive sample roughness) that are detected algorithmically.

CNN architecture

We used the same sequential architecture for both the classifier CNN and the action CNN of the RL agent, differing only in their output layer and specific hyper-parameters (see below). The basic structure is adapted from the VGG network [21]. We used a total of 12 convolutional layers: four sets of three 3 × 3 layers (with 64, 128, 256, and 512 feature maps, respectively) and 2 × 2 max-pooling after the first two sets. The convolutional layers are followed by two fully connected layers, each consisting of 4096 neurons. Each layer, except the output layer, uses a ReLU activation function and batch normalization [36]. The input to all networks consisted of 64 × 64 pixel constant-current STM topography images. We used Dropout [37] with a probability of 0.5 after each fully connected layer to reduce overfitting. The network weights were initialized using Xavier initialization [38].
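To make the layer arithmetic explicit, the sketch below traces the feature-map shape through the convolutional stack. It assumes stride-1 convolutions with 'same' padding (so that only the pooling layers change the spatial size) and a single input channel; neither assumption is stated in the text, and the function name is hypothetical.

```python
def trace_feature_maps(input_size=64):
    """Follow the spatial size and channel count through the conv stack."""
    filters_per_set = [64, 128, 256, 512]  # four sets, from the text
    convs_per_set = 3                      # three 3x3 layers per set
    pooled_sets = {0, 1}                   # 2x2 max-pooling after the first two sets
    size, channels, n_conv = input_size, 1, 0
    shapes = []
    for index, filters in enumerate(filters_per_set):
        for _ in range(convs_per_set):
            channels = filters             # 'same' padding keeps the size (assumption)
            n_conv += 1
        if index in pooled_sets:
            size //= 2                     # 2x2 pooling halves each dimension
        shapes.append((size, size, channels))
    return n_conv, shapes


n_conv, shapes = trace_feature_maps()
# Number of values passed to the first fully connected layer.
flattened = shapes[-1][0] * shapes[-1][1] * shapes[-1][2]
```

Under these assumptions the stack contains the 12 convolutional layers quoted above and ends in 16 × 16 feature maps with 512 channels.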

Classifier CNN

The classifier CNN uses the architecture above. It has a single-neuron output layer with a sigmoid activation function. This output (ranging from 0 to 1) gives the classifier CNN’s estimate of the probability that the input image was recorded with a “good probe”. The decision threshold was set to 0.9. Note that DeepSPM requires ten consecutive images classified as “bad probe” to start a conditioning episode (Supplementary Fig. 6). We trained the classifier CNN using the ADAM optimizer [39] with a cross-entropy loss, L2 weight decay with a value of 5 × 10⁻⁵, and a learning rate of 10⁻³. To account for the imbalance of our training set (“good probe” 25% and “bad probe” 75%), we weighted STM images labeled as “good probe” by a factor of 8 when computing the loss [40]. In addition, we increased the available amount of training data via data augmentation, randomly flipping the input STM images horizontally or vertically. Note that all training data consisted of experimental data previously acquired and labeled manually.
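The class weighting and the decision threshold can be illustrated with a per-image loss in plain Python. This is a sketch of a weighted binary cross-entropy, not the training code itself: the function names are hypothetical, the weight decay term is omitted, and the strict comparison against the threshold is an assumption.

```python
import math


def weighted_cross_entropy(p_good, label_good, good_weight=8.0, eps=1e-7):
    """Cross-entropy for one image, with "good probe" examples up-weighted."""
    p = min(max(p_good, eps), 1.0 - eps)  # clamp to avoid log(0)
    if label_good:
        return -good_weight * math.log(p)
    return -math.log(1.0 - p)


def is_good_probe(p_good, threshold=0.9):
    """Apply the decision threshold to the sigmoid output."""
    return p_good > threshold


loss_good = weighted_cross_entropy(0.5, True)   # mistake on a "good probe" image
loss_bad = weighted_cross_entropy(0.5, False)   # same uncertainty on a "bad probe" image
```

For the same predicted probability, an error on a “good probe” image contributes eight times more to the loss, which counteracts the 25%/75% class imbalance.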

Reinforcement learning agent and action CNN

Our RL agent, which is responsible for selecting probe-conditioning actions, is based on double DQN [34], an extension of DQN [28]. We modified the double DQN algorithm to suit the requirements of DeepSPM as follows. The action CNN controlling the RL agent uses the architecture above, with a single constant-current STM image as input, whereas the original DQN uses a stack of four subsequent images. Our action CNN has an output layer consisting of 12 nodes, one for each conditioning action. The output of each node is interpreted as the Q-value of the corresponding action, i.e., the expected future reward to be received after executing it. We initialized the weights of the action CNN (excluding the output layer) with those of the previously trained classifier CNN, based on the assumption that the features learned by the latter are useful for the action CNN [41]. The output layer, which differs in size between the two networks, was initialized with Xavier initialization [38]. To train the action CNN, we let it operate the SPM, acquiring images, and selecting and executing probe-conditioning actions repeatedly when deemed necessary (Figs. 1 and 2). A conditioning episode consists of the sequence of probe-conditioning actions required to obtain a good probe. Once sufficient probe quality was reached (i.e., the probability predicted by the classifier CNN exceeded 0.9), the conditioning episode was terminated. Random conditioning actions (up to five) were then applied to reset (i.e., re-damage) the probe, until the predicted probability dropped below 0.1. The RL agent received a constant reward of −1 for every executed probe-conditioning action. It received a reward of +10 for each terminated training episode, i.e., each time the probe was deemed good again. We chose these reward values heuristically by testing them in a simulated environment. In these simulations, the RL agent executed conditioning actions and the reward protocol was applied based on images resulting from the convolution of a good, clean synthetic image with a model kernel representing the probe morphology. Following a conditioning action, this kernel was updated stochastically. In this reward scheme, the RL agent receives a positive cumulative reward for short conditioning episodes, which it therefore favors, whereas it receives a negative cumulative reward, and is thus penalized, for longer episodes.
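The reward scheme and the learning target can be sketched in a few lines. The first function sums the undiscounted rewards of an episode that ends with a good probe. The second computes the standard double DQN target, in which the online network selects the next action and a separate target network evaluates it; since the algorithm was modified for DeepSPM, this standard form is shown only as a reference point, and the function names and the explicit `gamma` argument are choices made for this example.

```python
def episode_return(n_actions, step_reward=-1.0, success_reward=10.0):
    """Undiscounted cumulative reward of an episode ending in a good probe."""
    return n_actions * step_reward + success_reward


def double_dqn_target(reward, q_online_next, q_target_next, terminal, gamma):
    """Standard double DQN target for one transition.

    q_online_next and q_target_next hold the Q-values of the next image
    from the online and target networks, one entry per action.
    """
    if terminal:
        return reward  # no future reward after the episode has ended
    # The online network picks the action, the target network scores it.
    best_action = max(range(len(q_online_next)), key=q_online_next.__getitem__)
    return reward + gamma * q_target_next[best_action]


short_episode = episode_return(3)    # three actions, then a good probe
long_episode = episode_return(12)    # twelve actions, then a good probe
target = double_dqn_target(-1.0, [0.2, 0.9, 0.1], [1.0, 2.0, 3.0], False, gamma=0.5)
```

With a reward of −1 per action and +10 on success, an episode breaks even at ten actions, so shorter episodes yield a positive return and longer ones a negative return.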

The RL agent uses ε-greedy exploration to gather experience [25]. For each conditioning step, the agent chooses a conditioning action probabilistically based on the parameter ε (0 ≤ ε ≤ 1): it chooses randomly with probability ε, and it chooses the action with the largest predicted future reward (Q-value) with probability (1 − ε). For example, if ε = 1, action selection is strictly random; if ε = 0, action selection is based strictly on the predicted Q-value. We start training (Supplementary Fig. 2) with 500 random steps (ε = 1) that are used to pre-fill an experience replay buffer [28]. This buffer contains all experiences the agent has gathered so far, each consisting of an input image, the chosen action and its outcome (the next image assessed by the classifier CNN, as well as the reward received). We used data augmentation, adding four experiences to the buffer for each step. These additional experiences consisted of images flipped horizontally and vertically. After 500 steps (i.e., 2000 experiences in the buffer), we started training the action CNN with the buffer data. We used the ADAM optimizer [39] with a batch size of 64 images processed simultaneously and with a constant learning rate of 5 × 10⁻⁴. We limited the buffer size to 15,000, with new experiences replacing the oldest ones (first-in, first-out). To allow parallel execution and increase the overall performance of the training, we decoupled the gathering of experience and the learning into separate threads. During training, we decreased ε linearly over 500 steps, from 1.0 to 0.05. After reaching ε = 0.05, we continued training for an additional 4360 steps, during which we kept ε = 0.05 constant [34]. We used a constant discount factor of γ = 0.95 [25].
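The exploration schedule, the action selection and the replay buffer can be sketched as follows. Two points are interpretations rather than statements from the text: the linear decay of ε is assumed to begin immediately after the 500 random steps, and the four experiences per step are assumed to be the original image together with its horizontal, vertical and combined flips. All function names are hypothetical.

```python
import random
from collections import deque


def epsilon_at(step, random_steps=500, decay_steps=500, eps_start=1.0, eps_end=0.05):
    """Epsilon schedule: constant, then linear decay, then constant."""
    if step < random_steps:
        return eps_start
    progress = min((step - random_steps) / decay_steps, 1.0)
    return eps_start + progress * (eps_end - eps_start)


def choose_action(q_values, epsilon, rng=random):
    """Random action with probability epsilon, otherwise the largest Q-value."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))
    return max(range(len(q_values)), key=q_values.__getitem__)


def flips(image):
    """Original image plus horizontal, vertical and combined flips."""
    horizontal = [row[::-1] for row in image]
    vertical = image[::-1]
    both = [row[::-1] for row in vertical]
    return [image, horizontal, vertical, both]


# A bounded deque drops the oldest entries first (first-in, first-out).
replay_buffer = deque(maxlen=15000)


def store(image, action, reward, next_image):
    """Add the experience and its flipped copies to the buffer."""
    for img, nxt in zip(flips(image), flips(next_image)):
        replay_buffer.append((img, action, reward, nxt))


store([[1, 2], [3, 4]], 0, -1.0, [[5, 6], [7, 8]])
greedy = choose_action([0.1, 0.7, 0.2], epsilon=0.0)
```

Flipping the input and the resulting image in the same way keeps each augmented experience self-consistent, which is why both are transformed together in `store`.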

Testing of the RL agent

After training the RL agent, we tested its performance in operating the STM by comparing it with the probe-conditioning performance achieved via random conditioning action selection. During this evaluation, we allowed the action CNN to continue learning from continuous data acquisition with a constant ε = 0.05. Except for this value of ε, the testing process matches that of the RL agent training above. To achieve a meaningful comparison, we accounted for the fact that the state of the sample and the probe changes after each executed conditioning action (Supplementary Figs. 3 and 4), by adopting an interleaved evaluation scheme. That is, RL agent action selection and random selection alternate in conditioning the probe, switching after each completed probe-conditioning episode.

STM image pre-processing

The scanning plane of the probe is never perfectly parallel to the local surface of the sample. This results in a background gradient in the STM images that depends on the macroscopic position and, to a lesser extent, on the nanoscopic shape of the probe. This gradient was removed in each image by fitting and subtracting a plane using RANSAC [42] (Python scikit-learn implementation; polynomial of degree 1, residual threshold of 5 × 10⁻¹², max trials of 1000). The acquired STM data were further normalized and offset to the range [−1, 1], i.e., such that pixels corresponding to the flat Ag(100) had values of −1 and those corresponding to the maximum apparent height of MgPc (~2 Å) had values of 1. In addition, we limited the range of values to [−1.5, 1.5], setting any value outside this range to the closest bound of the interval.
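The normalization and clipping steps can be sketched as a simple linear map. The example assumes that the plane has already been subtracted, so that flat Ag(100) sits at a height of 0 Å, and it uses 2 Å as the reference molecular height; the RANSAC plane fit itself is not reproduced here, and the function names are hypothetical.

```python
def normalize_height(z_angstrom, molecule_height_angstrom=2.0):
    """Map 0 A (flat surface) to -1 and the molecule height to +1,
    then clip the result to the interval [-1.5, 1.5]."""
    scaled = 2.0 * z_angstrom / molecule_height_angstrom - 1.0
    return max(-1.5, min(1.5, scaled))


def normalize_image(image_angstrom):
    """Apply the normalization to every pixel of a 2D list of heights."""
    return [[normalize_height(z) for z in row] for row in image_angstrom]


# One row of plane-subtracted heights in angstrom, including two outliers.
normalized = normalize_image([[0.0, 1.0, 2.0, 5.0, -3.0]])
```

Clipping bounds the influence of outliers such as adsorbate clusters or tip-induced spikes on the network input.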

Finding an appropriate action location

For a given acquired STM image, DeepSPM executes a probe-conditioning action at the center of the largest clean Ag(100) square area (Fig. 3). This center is found by calculating a binary map from the pre-processed image (see above), where pixels close (≤0.1 Å) to the fitted surface plane are considered empty, i.e., belonging to a clean Ag(100) patch, and all others are considered occupied. The center of the largest clean Ag(100) square area within this binary map is chosen as the conditioning location. We defined an area requirement for each conditioning action (Supplementary Table 1). A conditioning action is allowed, and can be selected by the agent, only if the available square area meets this requirement.
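One way to locate such a square is the classic dynamic-programming search for the largest all-empty square in a binary grid, sketched below. This is an illustrative choice of algorithm, not necessarily the one used in DeepSPM; the function names are hypothetical, and the binary map is built here from plane-subtracted heights given in ångström.

```python
def binary_map(height_angstrom, tolerance=0.1):
    """True marks occupied pixels, i.e., those further than the tolerance
    from the fitted plane (heights are given relative to that plane)."""
    return [[abs(z) > tolerance for z in row] for row in height_angstrom]


def largest_clean_square(occupied):
    """Return (side, center_row, center_col) of the largest empty square."""
    rows, cols = len(occupied), len(occupied[0])
    # side_at[r][c]: side of the largest empty square whose bottom-right
    # corner is the pixel (r, c).
    side_at = [[0] * cols for _ in range(rows)]
    best = (0, None, None)
    for r in range(rows):
        for c in range(cols):
            if occupied[r][c]:
                continue
            if r == 0 or c == 0:
                side_at[r][c] = 1
            else:
                side_at[r][c] = 1 + min(
                    side_at[r - 1][c], side_at[r][c - 1], side_at[r - 1][c - 1]
                )
            side = side_at[r][c]
            if side > best[0]:
                # Step back half a side from the corner to reach the center.
                best = (side, r - (side - 1) / 2, c - (side - 1) / 2)
    return best


T, F = True, False
occupied = [
    [T, F, F, F],
    [F, F, F, F],
    [F, F, F, T],
    [F, F, F, F],
]
side, center_row, center_col = largest_clean_square(occupied)
```

The returned side length, converted to nanometers with the pixel size, can then be compared with the area requirement of each conditioning action.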

Detection and fixing of lost contact/probe crash

DeepSPM is able to detect and fix any potential loss of probe–sample contact during scanning. It does so by monitoring the extension (z-range) of the fine piezo scanner: if the fine piezo scanner extends in the z-direction beyond a specified threshold (towards the sample surface), DeepSPM prevents the potential loss of probe–sample contact by re-approaching the probe towards the sample with the coarse probe-positioning system (until the probe is within an acceptable distance range from the sample). Data acquisition can then continue at the same position. Similarly, DeepSPM can prevent probe–sample crashes by increasing the probe–sample distance with the coarse probe-positioning system if the fine piezo scanner retracts in the z-direction beyond a specified threshold (away from the sample surface).
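The monitoring logic amounts to a two-sided threshold check on the z extension of the fine piezo scanner, as in the sketch below. The representation of the extension as a fraction of the full range and the threshold values of 0.1 and 0.9 are hypothetical, since the actual thresholds are not given here.

```python
def check_z_extension(z_fraction, lower=0.1, upper=0.9):
    """Decide what the coarse positioning system should do.

    z_fraction is the fine-piezo extension as a fraction of its range,
    with 1.0 meaning fully extended towards the sample (assumed convention).
    """
    if z_fraction > upper:
        # Piezo almost fully extended: contact is about to be lost.
        return "approach"
    if z_fraction < lower:
        # Piezo almost fully retracted: the probe risks crashing.
        return "withdraw"
    return "continue"


decisions = [check_z_extension(z) for z in (0.95, 0.5, 0.05)]
```

Keeping the piezo near the middle of its range leaves the feedback loop room to follow the surface in both directions.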