
Outplaying elite table tennis players with an autonomous robot

Coordinate system

We use right-handed conventions with the origin of the coordinate system at the centre of the playing surface of the table, in which the x-axis points towards the human player side of the table and the z-axis points upwards.

Perception

Ball triangulation

We use nine cameras synchronized with the actuators of the robot by a 200 Hz trigger signal to accurately locate the ball in the volume of the Olympic-sized court. At each trigger event, the cameras capture 1,440 × 1,080 pixel Bayer8 colour images. To reduce data transfer and improve scalability (more cameras can increase robustness and accuracy44), each camera is equipped with a field programmable gate array that performs hardware-accelerated two-dimensional (2D) ball detection. The field programmable gate arrays process the images through a segmentation pipeline to produce a compressed 2D detection mask, which is streamed to a central server through an embedded CPU. The server verifies the shape of the ball and triangulates its 3D position using pre-calibrated camera parameters. The entire process is completed within 10.2 ms.
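The server-side triangulation step can be illustrated with a standard direct linear transform (DLT) least-squares triangulation. This is a minimal sketch assuming pre-calibrated 3 × 4 projection matrices; the function name and interface are ours, not the paper's implementation.

```python
import numpy as np

def triangulate(points_2d, proj_mats):
    """Least-squares triangulation of one 3D point from >= 2 pinhole views.

    points_2d : (N, 2) pixel detections; proj_mats : list of N 3x4 projection
    matrices. Builds the homogeneous DLT system A X = 0 (two rows per view)
    and solves it via SVD; the null vector is the point in homogeneous form.
    """
    rows = []
    for (u, v), P in zip(points_2d, proj_mats):
        rows.append(u * P[2] - P[0])
        rows.append(v * P[2] - P[1])
    A = np.stack(rows)
    _, _, Vt = np.linalg.svd(A)
    X = Vt[-1]
    return X[:3] / X[3]  # dehomogenize
```

With exact (noise-free) detections from two calibrated views, this recovers the 3D position to numerical precision; with noisy multi-camera detections it returns the algebraic least-squares estimate.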

Camera placement is optimized using a custom covariance matrix adaptation evolution strategy (CMA-ES) algorithm45. The optimizer determines the lens selection, mounting height and orientation for each camera, subject to constraints such as the number of towers, desired coverage volume and a minimum projected 2D ball radius (5 pixels).

Spin estimation

The angular velocity of the ball is estimated by observing the movement of the logo printed on the surface of the official ball. To accurately capture the high-speed moving and rotating logo, we develop a mirror-based event vision tracking system called the gaze control system (GCS). The GCS comprises three components: (1) an event camera4 for low-latency, low-motion-blur imaging; (2) a telephoto, electrically tunable lens to magnify the ball and keep it in focus; and (3) a set of rotatable mirrors to track the ball smoothly (Fig. 2d). Given the 3D triangulation results, the mirrors and lens are controlled to track and focus on the ball with the system delay compensated by predicting the ball trajectory using the ball aerodynamics. With the ball being tracked, its contour on the event camera frame is first detected by a CNN46. Then the events on the ball are processed by two spin estimators, namely, a low-latency estimator based on another CNN33 and a high-accuracy but slower estimator based on CMax34. The CNN estimates the angular velocities with heteroscedastic uncertainties from accumulated events and is trained on pseudo-ground-truth data obtained by CMax using heteroscedastic regression47.

Events are aggregated into a polarity-separated surface of active events48 with a 15 ms accumulation time window, in which timestamps are minimum/maximum normalized to a range between 0 and 1. We use a centred 320 × 320 pixel hardware crop of the original 1,280 × 720 pixel frame.
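The event representation can be sketched as follows. The event layout (x, y, t, polarity), the assumption that valid timestamps are strictly positive, and the array shapes are ours; only the polarity separation, the 15 ms window and the min/max normalization come from the text.

```python
import numpy as np

def surface_of_active_events(events, shape=(320, 320), window=0.015):
    """Polarity-separated surface of active events (SAE).

    events : (N, 4) array of (x, y, t, polarity in {0, 1}); timestamps are
    assumed strictly positive. Only events inside the trailing `window`
    seconds are kept. Each pixel of the two polarity channels stores the
    latest event timestamp, min/max normalized per channel to [0, 1].
    """
    sae = np.zeros((2, *shape), dtype=np.float64)
    if len(events) == 0:
        return sae
    t_end = events[:, 2].max()
    keep = events[events[:, 2] >= t_end - window]
    for x, y, t, p in keep:
        sae[int(p), int(y), int(x)] = max(sae[int(p), int(y), int(x)], t)
    for c in range(2):
        active = sae[c] > 0          # pixels that received at least one event
        if active.any():
            lo, hi = sae[c][active].min(), sae[c][active].max()
            sae[c][active] = (sae[c][active] - lo) / (hi - lo) if hi > lo else 1.0
    return sae
```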

The angular velocities estimated by the CNN are refined asynchronously by CMax. To achieve both low-latency and high accuracy, the robot agent Ace uses the angular velocities obtained by the CNN at the beginning of the trajectory and switches to the ones obtained by CMax as soon as they become available with low uncertainty. Because the spin estimation uncertainty increases when the logo is invisible, we place three GCSs to track the ball from multiple perspectives, as shown in Fig. 2a, and combine the multi-view measurements based on the respective uncertainties.
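The paper states only that the multi-view measurements are combined "based on the respective uncertainties"; a common realization of that idea is inverse-variance weighting, sketched below under the simplifying assumption of one scalar uncertainty per view.

```python
import numpy as np

def fuse_spin_estimates(omegas, sigmas):
    """Fuse per-view angular-velocity estimates by inverse-variance weighting.

    omegas : (V, 3) spin vectors from V gaze control systems;
    sigmas : (V,) per-view standard deviations. Views in which the logo is
    invisible report a large sigma and are therefore down-weighted.
    Returns the fused spin vector and the fused standard deviation.
    """
    sigmas = np.asarray(sigmas, dtype=float)
    w = 1.0 / sigmas ** 2
    w = w / w.sum()
    fused = (w[:, None] * np.asarray(omegas, dtype=float)).sum(axis=0)
    fused_sigma = np.sqrt(1.0 / (1.0 / sigmas ** 2).sum())
    return fused, fused_sigma
```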

Simulation

Ball aerodynamics

The aerodynamics of the ball in flight are governed by the drag fd, Magnus fM and gravitational fg forces. Given that the ball’s angular velocity ω is approximately constant over short flight intervals, the flight dynamics can be modelled as

$$m\dot{{\bf{v}}}={{\bf{f}}}_{{\rm{d}}}+{{\bf{f}}}_{{\rm{M}}}+{{\bf{f}}}_{{\rm{g}}}=-\frac{1}{2}{c}_{{\rm{d}}}\,{\rho }_{{\rm{a}}{\rm{i}}{\rm{r}}}{r}^{2}{\rm{\pi }}\parallel {\bf{v}}\parallel {\bf{v}}-{c}_{{\rm{M}}}\,{\rho }_{{\rm{a}}{\rm{i}}{\rm{r}}}\frac{4}{3}{r}^{3}{\rm{\pi }}{\bf{v}}\times {\boldsymbol{\omega }}+m{\bf{g}}$$

(1)

where v is the ball velocity, ρair = 1.204 kg m−3 (density of dry air at room temperature and standard pressure), m = 2.7 × 10−3 kg (ball mass), r = 0.02 m (ball radius), cd = 0.55 (drag coefficient), and g = (0, 0, −9.81)T m s−2 (gravitational acceleration). Unlike the base model49, which treats the Magnus coefficient cM as constant, we model it as \({c}_{{\rm{M}}}=0.1\frac{\Vert {\bf{v}}\Vert }{r\Vert {\boldsymbol{\omega }}\Vert }-0.001\).
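Equation (1) with the velocity-dependent Magnus coefficient can be integrated numerically; a minimal sketch follows. The integrator choice (semi-implicit Euler) and the zero-spin guard are ours, not the paper's.

```python
import numpy as np

# Constants from equation (1): air density, ball mass, radius, drag coefficient
RHO_AIR, MASS, RADIUS, CD = 1.204, 2.7e-3, 0.02, 0.55
G = np.array([0.0, 0.0, -9.81])

def ball_acceleration(v, omega):
    """Acceleration from drag, Magnus and gravity, following equation (1)."""
    speed, spin = np.linalg.norm(v), np.linalg.norm(omega)
    f_d = -0.5 * CD * RHO_AIR * RADIUS**2 * np.pi * speed * v
    # Velocity-dependent Magnus coefficient; guard against zero spin (our choice)
    c_m = 0.1 * speed / (RADIUS * spin) - 0.001 if spin > 0 else 0.0
    f_m = -c_m * RHO_AIR * (4.0 / 3.0) * RADIUS**3 * np.pi * np.cross(v, omega)
    return (f_d + f_m) / MASS + G

def simulate_flight(p, v, omega, dt=1e-3, steps=100):
    """Semi-implicit Euler flight integration; omega held constant in flight."""
    p, v = np.array(p, float), np.array(v, float)
    omega = np.array(omega, float)
    for _ in range(steps):
        v = v + dt * ball_acceleration(v, omega)
        p = p + dt * v
    return p, v
```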

Ball–table contact model

The table contact model49, which assumes instantaneous point contact, is enhanced to capture some effects of surface contact on the coefficient of restitution, εtable, by modelling it as \({\varepsilon }^{{\rm{table}}}=0.98-0.02{v}_{z}^{-}\). The post-contact velocities are given by

$${{\bf{v}}}^{+}={C}_{v,v}^{\,{\rm{t}}{\rm{a}}{\rm{b}}{\rm{l}}{\rm{e}}}{{\bf{v}}}^{{\boldsymbol{-}}}+{C}_{v,\omega }^{\,{\rm{t}}{\rm{a}}{\rm{b}}{\rm{l}}{\rm{e}}}{{\boldsymbol{\omega }}}^{{\boldsymbol{-}}}$$

(2)

$${{\boldsymbol{\omega }}}^{+}={C}_{\omega ,v}^{\,{\rm{t}}{\rm{a}}{\rm{b}}{\rm{l}}{\rm{e}}}{{\bf{v}}}^{{\boldsymbol{-}}}+{C}_{\omega ,\omega }^{\,{\rm{t}}{\rm{a}}{\rm{b}}{\rm{l}}{\rm{e}}}{{\boldsymbol{\omega }}}^{{\boldsymbol{-}}}$$

(3)

$$\begin{array}{cc}{C}_{v,v}^{\,\mathrm{table}}=\left(\begin{array}{ccc}1-\alpha & 0 & 0\\ 0 & 1-\alpha & 0\\ 0 & 0 & -{\varepsilon }^{\mathrm{table}}\end{array}\right) & {C}_{v,\omega }^{\,\mathrm{table}}=\left(\begin{array}{ccc}0 & \alpha r & 0\\ -\alpha r & 0 & 0\\ 0 & 0 & 0\end{array}\right)\\ {C}_{\omega ,v}^{\,\mathrm{table}}=\left(\begin{array}{ccc}0 & -\frac{3\alpha }{2r} & 0\\ \frac{3\alpha }{2r} & 0 & 0\\ 0 & 0 & 0\end{array}\right) & {C}_{\omega ,\omega }^{\,\mathrm{table}}=\left(\begin{array}{ccc}1-\frac{3\alpha }{2} & 0 & 0\\ 0 & 1-\frac{3\alpha }{2} & 0\\ 0 & 0 & 1\end{array}\right),\end{array}$$

where superscripts ‘−’ and ‘+’ are pre- and post-contact quantities, respectively, and

$$\alpha =\alpha ({{\bf{v}}}^{-},{{\boldsymbol{\omega }}}^{-})=\left\{\begin{array}{cc}\mu (1+{{\varepsilon }}^{{\rm{t}}{\rm{a}}{\rm{b}}{\rm{l}}{\rm{e}}})\frac{{v}_{z}^{-}}{\parallel {{\bf{v}}}_{{\rm{T}}}\parallel } & ({\nu }_{{\rm{s}}} > 0)\\ \frac{2}{5} & ({\nu }_{{\rm{s}}}\le 0)\end{array}\right.$$

(4)

where the contact type is determined as sliding if νs > 0 and rolling if νs ≤ 0, with

$${\nu }_{{\rm{s}}}={\nu }_{{\rm{s}}}({{\bf{v}}}^{-},{{\boldsymbol{\omega }}}^{-})=1-\frac{5}{2}\mu (1+{{\varepsilon }}^{{\rm{t}}{\rm{a}}{\rm{b}}{\rm{l}}{\rm{e}}})\frac{{v}_{z}^{-}}{\parallel {{\bf{v}}}_{{\rm{T}}}\parallel },$$

(5)

$${{\bf{v}}}_{{\rm{T}}}=\left(\begin{array}{c}{v}_{x}^{-}-r{\omega }_{y}^{-}\\ {v}_{y}^{-}+r{\omega }_{x}^{-}\\ 0\end{array}\right).$$

(6)

εtable and μ are the coefficient of restitution and the dynamic coefficient of friction between ball and table, respectively, modelled as \({\varepsilon }^{{\rm{table}}}={\varepsilon }^{{\rm{table}}}({v}_{z}^{-})=0.98-0.02{v}_{z}^{-}\) and μ = 0.25 from experimental data.
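A sketch of the bounce map in equations (2)-(6) follows. One interpretation choice is ours: \(v_z^-\) is taken as the magnitude of the normal impact speed, so that the restitution model 0.98 − 0.02 v_z decreases with harder impacts; function names and the zero-tangential-velocity guard are also ours.

```python
import numpy as np

MU, R = 0.25, 0.02  # dynamic friction coefficient and ball radius

def table_bounce(v, w):
    """Post-contact (v+, w+) from pre-contact (v-, w-), equations (2)-(6).

    Assumes v_z^- enters the restitution and contact-type formulas as the
    magnitude of the downward normal velocity (our interpretation).
    """
    v, w = np.asarray(v, float), np.asarray(w, float)
    vz = abs(v[2])                     # normal impact speed
    eps = 0.98 - 0.02 * vz             # coefficient of restitution
    # Tangential velocity of the contact point, equation (6)
    v_t = np.array([v[0] - R * w[1], v[1] + R * w[0], 0.0])
    vt_norm = np.linalg.norm(v_t)
    # Contact type, equation (5); no tangential slip => rolling (our guard)
    nu_s = 1.0 - 2.5 * MU * (1.0 + eps) * vz / vt_norm if vt_norm > 0 else -1.0
    # Equation (4): sliding vs rolling value of alpha
    alpha = MU * (1.0 + eps) * vz / vt_norm if nu_s > 0 else 0.4
    Cvv = np.diag([1 - alpha, 1 - alpha, -eps])
    Cvw = np.array([[0, alpha * R, 0], [-alpha * R, 0, 0], [0, 0, 0]])
    Cwv = np.array([[0, -1.5 * alpha / R, 0], [1.5 * alpha / R, 0, 0], [0, 0, 0]])
    Cww = np.diag([1 - 1.5 * alpha, 1 - 1.5 * alpha, 1.0])
    return Cvv @ v + Cvw @ w, Cwv @ v + Cww @ w
```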

Ball–racket contact model

The linear model proposed in the literature49 is extended to handle the wide ranges of linear and angular velocities encountered in professional-level table tennis by incorporating (1) a velocity-dependent coefficient of restitution and (2) a residual correction neural network to correct model errors. The base linear model shares the same structure as in equations (2) and (3) and is defined as

$${{\bf{v}}}^{+}={R}^{{\rm{T}}}{C}_{v,v}^{\,{\rm{r}}{\rm{a}}{\rm{c}}{\rm{k}}{\rm{e}}{\rm{t}}}R({{\bf{v}}}^{-}-{{\bf{v}}}^{{\rm{r}}{\rm{a}}{\rm{c}}{\rm{k}}{\rm{e}}{\rm{t}}})+{R}^{{\rm{T}}}{C}_{v,\omega }^{\,{\rm{r}}{\rm{a}}{\rm{c}}{\rm{k}}{\rm{e}}{\rm{t}}}R{{\boldsymbol{\omega }}}^{-}+{{\bf{v}}}^{{\rm{r}}{\rm{a}}{\rm{c}}{\rm{k}}{\rm{e}}{\rm{t}}}$$

(7)

$${{\boldsymbol{\omega }}}^{+}={R}^{{\rm{T}}}{C}_{\omega ,v}^{\,{\rm{r}}{\rm{a}}{\rm{c}}{\rm{k}}{\rm{e}}{\rm{t}}}R{{\bf{v}}}^{-}+{R}^{{\rm{T}}}{C}_{\omega ,\omega }^{\,{\rm{r}}{\rm{a}}{\rm{c}}{\rm{k}}{\rm{e}}{\rm{t}}}R{{\boldsymbol{\omega }}}^{-}$$

(8)

with

$$\begin{array}{cc}{C}_{v,v}^{\,\mathrm{racket}}=\left(\begin{array}{ccc}1-k & 0 & 0\\ 0 & 1-k & 0\\ 0 & 0 & -{\varepsilon }^{\mathrm{racket}}\end{array}\right) & {C}_{v,\omega }^{\,\mathrm{racket}}=\left(\begin{array}{ccc}0 & kr & 0\\ -kr & 0 & 0\\ 0 & 0 & 0\end{array}\right)\\ {C}_{\omega ,v}^{\,\mathrm{racket}}=\left(\begin{array}{ccc}0 & -\frac{3k}{2r} & 0\\ \frac{3k}{2r} & 0 & 0\\ 0 & 0 & 0\end{array}\right) & {C}_{\omega ,\omega }^{\,\mathrm{racket}}=\left(\begin{array}{ccc}1-\frac{3k}{2} & 0 & 0\\ 0 & 1-\frac{3k}{2} & 0\\ 0 & 0 & 1\end{array}\right),\end{array}$$

where R is the rotation matrix from the local frame of the racket to the global frame of reference, vracket is the racket linear velocity at impact and k is a coefficient relating tangential quantities. \({\varepsilon }^{\mathrm{racket}}={\gamma }_{1}{{\rm{e}}}^{{\gamma }_{2}|{v}_{z}^{-{\prime} }|}\) is modelled as a function of the normal relative velocity \({v}_{z}^{-{\prime} }\) by fitting the coefficients γ1, γ2 on data collected from games. The residual correction neural network is a small multilayer perceptron trained on game data that reduces both the velocity and angular-velocity errors by 4% on average.

Sensor modelling

To model the ball triangulation obtained from APS cameras, we sample latency from a uniform distribution and noise from a zero-mean Gaussian distribution, and apply dropout of sensor measurements with a fixed probability. For the spin estimation, latency and dropout are modelled similarly, with additional dropout applied directly after racket contact to reflect tracking loss of GCS around these events. Both precision (sensor noise) and accuracy (sensor bias) of GCS are modelled using separate zero-mean Gaussian distributions for spin magnitude and axis. However, accuracy is sampled once per contact event to mimic the bias introduced by GCS reinitialization at these events.
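A sketch of the camera-measurement model described above follows; all parameter values (latency range, noise scale, dropout probability) are illustrative placeholders, not the paper's calibrated values.

```python
import numpy as np

rng = np.random.default_rng(0)

def noisy_ball_measurement(p_true, latency_range=(0.008, 0.012),
                           sigma=0.005, p_drop=0.02):
    """Simulated triangulation measurement for one trigger event.

    Latency is sampled from a uniform distribution, position noise from a
    zero-mean Gaussian, and the measurement is dropped entirely with a fixed
    probability (returning (None, None)).
    """
    if rng.random() < p_drop:
        return None, None  # dropout: sensor measurement is lost
    latency = rng.uniform(*latency_range)
    measured = np.asarray(p_true, float) + rng.normal(0.0, sigma, size=3)
    return measured, latency
```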

Physics perturbations

To improve the simulation-to-reality transfer, ball state perturbations are added after table contact. Each Cartesian component of ball linear and angular velocities is perturbed independently using a zero-mean Gaussian distribution.

Robot dynamics

Robot joints are modelled as decoupled, delayed linear time-invariant systems, in which each joint i is described by

$$\dot{{\boldsymbol{\zeta }}}_{i}(t)={A}_{i}{{\boldsymbol{\zeta }}}_{i}(t)+{B}_{i}{u}_{i}(t-{\tau }_{i})$$

(9)

with \({{\boldsymbol{\zeta }}}_{i}(t)\) the state of joint i, \({u}_{i}\) the commanded input and \({\tau }_{i}\) the actuation delay.

Episode definition

An episode begins when the ball is in free flight, moving towards the robot. An episode ends when one of four conditions is met: (1) the ball is out of play or no longer legal; (2) the robot hits the ball; (3) the ball passes the racket of the robot; or (4) the joint trajectory produced by Ace would result in a collision with itself or the table.

Rewards

The reward function used during training consists of several terms, all of which are calculated after the episode has finished, that is, as a function of the terminal state. Although reward terms vary across policies to induce different skills, they can be categorized by assigning specific rewards for (1) missing the ball; (2) hitting the ball but failing to return it; or (3) successfully returning the ball:

$$\left\{\begin{array}{cc}{R}_{\text{miss}} & \text{if robot fails to hit the ball}\\ {R}_{\text{hit}}^{{\rm{\neg }}\text{return}} & \text{if robot hits the ball but fails to return it}\\ {R}_{\text{hit}}^{\text{return}} & \text{if robot hits the ball and returns it}\end{array}\right.$$

(10)

A subset of the policies use a reward formulation for \({R}_{\,{\rm{hit}}}^{{\rm{return}}}\) that can be parameterized by a desired y-landing position (ydesired) and a set of reward weights (wreward = (wp, ws), where wp ∈ (0, 1) and ws ∈ (−1, 1)). ydesired is used to calculate a reward based on the distance between ydesired and the achieved y-landing position, wp is used to weight this distance reward and ws is used to weight a term proportional to the angular velocity in the y-axis of the ball frame on landing. By sampling these conditioning variables, these policies can exhibit a variety of different behaviours such as aiming, topspin and backspin.
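The case structure of equation (10), with the parameterized return reward described above, can be sketched as follows. The constants, the normalization of the distance term and the function signature are illustrative assumptions, not the paper's reward values.

```python
# Illustrative terminal-reward constants (not the paper's actual values).
R_MISS, R_HIT_NO_RETURN = -1.0, 0.0

def terminal_reward(hit, returned, y_landed=0.0, spin_y=0.0,
                    y_desired=0.0, w_p=0.5, w_s=0.0, table_half_width=0.7625):
    """Terminal reward following equation (10).

    Three cases: miss, hit-without-return, and successful return. The return
    reward mixes a y-landing distance term (weighted by w_p) with a term
    proportional to the y-axis angular velocity on landing (weighted by w_s).
    """
    if not hit:
        return R_MISS
    if not returned:
        return R_HIT_NO_RETURN
    # Distance reward in [0, 1]: 1 at the desired landing point (our normalization)
    distance_reward = 1.0 - abs(y_landed - y_desired) / (2 * table_half_width)
    return w_p * distance_reward + w_s * spin_y
```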

States

The state in our RL framework can be written as \({{\bf{s}}}_{t}=({{\bf{s}}}_{t}^{\mathrm{ball}},{{\bf{s}}}_{t}^{\mathrm{robot}},{{\bf{s}}}_{t}^{\mathrm{skill}})\). \({{\bf{s}}}_{t}^{{\rm{ball}}}\) is the ball state consisting of ball position and spin histories of length N, along with their associated time-stamps. \({{\bf{s}}}_{t}^{{\rm{robot}}}\) is the robot state and consists of the joint states (position, velocity and acceleration) and end effector state (pose and twist) associated with the terminal state of \({Q}_{t-1}^{\ast (1:T)}\) (for further details see Supplementary Information section 1.4.1). For policies trained with parameterized reward functions, the state is further augmented with \({{\bf{s}}}_{t}^{{\rm{skill}}}\), which is the fixed skill state composed of ydesired and wreward. st is used to infer actions at, which are subsequently mapped to joint trajectories and reset plans (see Supplementary Information sections 1.4.2 and 1.4.3). This process requires a time budget of 5 ms and so st must be constructed 5 ms before the next set of commands is sent to the robot (Extended Data Fig. 1).

Actions

Actions, \({{\bf{a}}}_{t}\in {(-1,1)}^{2{N}_{q}}\), are sampled from a tanh squashed multivariate Gaussian distribution. This forms an abstract space, in which for each joint there are two actions that define a target joint position and velocity 32 ms into the future (see Supplementary Information section 1.4.2).
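Sampling from a tanh-squashed multivariate Gaussian, as SAC-style policies do, can be sketched as below; the action dimension and the numerical epsilon are illustrative, and the log-probability correction shown is the standard change-of-variables term rather than anything specific to Ace.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_tanh_gaussian(mean, log_std):
    """Sample an action from a tanh-squashed diagonal Gaussian.

    Returns the action in (-1, 1)^d and its log-probability, including the
    tanh change-of-variables correction log|d tanh(u)/du| = log(1 - a^2).
    """
    mean, log_std = np.asarray(mean, float), np.asarray(log_std, float)
    std = np.exp(log_std)
    u = mean + std * rng.standard_normal(mean.shape)  # pre-squash Gaussian sample
    a = np.tanh(u)                                    # squash into (-1, 1)
    log_prob = (-0.5 * ((u - mean) / std) ** 2 - log_std
                - 0.5 * np.log(2 * np.pi)).sum()
    log_prob -= np.log(1.0 - a ** 2 + 1e-6).sum()     # squashing correction
    return a, log_prob
```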

Transition probability function

The transition probability function is as follows:

$$\left\{\begin{array}{l}{{\bf{s}}}_{t+1}^{{\rm{b}}{\rm{a}}{\rm{l}}{\rm{l}}}\,\sim \,{f}_{{\rm{b}}{\rm{a}}{\rm{l}}{\rm{l}}}({{\bf{s}}}_{t},{{\bf{a}}}_{t})\\ {{\bf{s}}}_{t+1}^{{\rm{r}}{\rm{o}}{\rm{b}}{\rm{o}}{\rm{t}}}\,=\,{f}_{{\rm{r}}{\rm{o}}{\rm{b}}{\rm{o}}{\rm{t}}}({{\bf{s}}}_{t},{{\bf{a}}}_{t})\\ {{\bf{s}}}_{t+1}^{{\rm{s}}{\rm{k}}{\rm{i}}{\rm{l}}{\rm{l}}}\,=\,{{\bf{s}}}_{t}^{{\rm{s}}{\rm{k}}{\rm{i}}{\rm{l}}{\rm{l}}}\end{array}\right.$$

(11)

where fball is a stochastic function that depends on sensors and physics modelling in the simulator, and frobot is a deterministic function depending on \({Q}_{t-1}^{* (1:T)}\) and the robot dynamics.

Initial state training distribution

During training in simulation, an episode starts with an initial state that is sampled from three independent distributions:

  1.

    Initial ball state \({{\bf{s}}}_{0}^{{\rm{ball}}}\): the initial state of the ball is sampled from a kernel density estimation (KDE) model fit either to synthetic or human data. For the synthetic dataset, shots are uniformly sampled from a range of initial ball states and checked for validity. KDE models are generated for both returns (that is, shots performed during a rally) and serves. During training, serves and returns are sampled at a ratio of 3:7, and the initial state is sampled with a fixed probability from either the synthetic or the human KDE models (Supplementary Information section 1.4.3).

  2.

    Initial robot state \({{\bf{s}}}_{0}^{{\rm{robot}}}\): the initial robot state can be static or dynamic. Static states are sampled with the arm in a neutral configuration, and prismatic actuators are initialized uniformly within their allowed range, whereas dynamic states are sampled from reset plans stored during previous training episodes.

  3.

    Initial skill state \({{\bf{s}}}_{0}^{{\rm{skill}}}\): ydesired is sampled uniformly within the bounds of the opponent’s side of the table. wreward is sampled in a way that is biased towards sparse reward weight vectors and boundary values (Supplementary Information section 1.4.5).
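Sampling initial ball states from a Gaussian KDE with the 3:7 serve:return mix can be sketched as follows; the isotropic bandwidth and the state layout are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def kde_sample(data, bandwidth, n):
    """Draw n samples from a Gaussian KDE fitted to `data` ((M, d) array).

    Picks a data point uniformly, then adds isotropic Gaussian noise of scale
    `bandwidth` -- exactly the generative process a Gaussian KDE defines.
    """
    data = np.asarray(data, float)
    idx = rng.integers(0, len(data), size=n)
    return data[idx] + rng.normal(0.0, bandwidth, size=(n, data.shape[1]))

def sample_initial_ball_states(serve_data, return_data, n, bandwidth=0.05):
    """Mix serves and returns at the 3:7 ratio used during training."""
    n_serve = rng.binomial(n, 0.3)
    return np.vstack([kde_sample(serve_data, bandwidth, n_serve),
                      kde_sample(return_data, bandwidth, n - n_serve)])
```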

Algorithm

To train the deep RL policy, we use SAC36 asynchronously with multiple data collection tasks in parallel2 (see Supplementary Table 8 for hyperparameters). We use asymmetric actor–critic27,28,29, providing the ground-truth ball state from the simulator to the critic and sequences of sensor measurements to the actor. Apart from the standard policy loss, an auxiliary loss is added to the policy to reconstruct the ground truth ball state from its ball state embedding. When collecting experience, we apply three different forms of data augmentation as follows:

  1.

    Symmetric augmentation to mirror all states, actions and rewards with respect to the XZ plane (that is, the plane containing the centre-line of the table and perpendicular to both the table and the net).

  2.

    Event tables50 to store transitions leading to predetermined events in separate replay buffers for stratified sampling of the mini-batch. The events used in our training pipeline are defined based on heuristics and include the following events: near miss, ball hit, ball returned, high-speed return, high-top-spin return, high-back-spin return (see Supplementary Information section 1.4.8).

  3.

    Hindsight experience replay51 to augment RL transitions with an additional copy in which ydesired is equal to the achieved position, wp is equal to 1, and the maximum position-based reward is given.
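The hindsight relabelling in item 3 can be sketched as below; the transition layout (a flat dictionary) and the maximum-reward value are illustrative, not the paper's data structures.

```python
def hindsight_relabel(transition, max_position_reward=1.0):
    """Hindsight experience replay augmentation for one RL transition.

    Returns the original transition plus a copy in which y_desired is set to
    the achieved landing position, w_p is set to 1, and the maximum
    position-based reward is assigned.
    """
    relabelled = dict(transition)
    relabelled["y_desired"] = transition["y_achieved"]  # pretend we aimed here
    relabelled["w_p"] = 1.0
    relabelled["reward"] = max_position_reward
    return [transition, relabelled]
```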

Feasible action for optimal control

Mapping algorithm

The action at sampled from the deep RL policy is mapped from the abstract set \({(-1,1)}^{2{N}_{q}}\) to the feasible set of joint position and velocity pairs 32 ms in the future, using a mapping algorithm. The generic mapping algorithm can be stated as follows: Let \({\mathbb{X}}\subset {{\mathbb{R}}}^{n}\) be the compact base set with centre \(\bar{{\bf{x}}}\) and \({\mathbb{Y}}\subset {{\mathbb{R}}}^{n}\) be the compact target set with centre \(\bar{{\bf{y}}}\). For a given mapping \(({{\bf{x}}}_{i}\in {\mathbb{X}},{{\bf{y}}}_{i}\in {\mathbb{Y}})\), if \({{\bf{y}}}_{i}=\bar{{\bf{y}}}\) then \({{\bf{x}}}_{i}=\bar{{\bf{x}}}\), otherwise

$$\begin{array}{c}{{\bf{x}}}_{i}=\bar{{\bf{x}}}+{{\boldsymbol{\delta }}}_{i}\\ {{\bf{y}}}_{i}=\bar{{\bf{y}}}+\frac{{\beta }_{i}}{{\alpha }_{i}}f({{\boldsymbol{\delta }}}_{i})\\ {\alpha }_{i}\ge 1:\bar{{\bf{x}}}+{\alpha }_{i}{{\boldsymbol{\delta }}}_{i}\in \partial {\mathbb{X}}\\ {\beta }_{i} > 0:\bar{{\bf{y}}}+{\beta }_{i}f({{\boldsymbol{\delta }}}_{i})\in \partial {\mathbb{Y}}\end{array}$$

(12)

where \(\partial {\mathbb{X}}\) and \(\partial {\mathbb{Y}}\) are the boundaries of \({\mathbb{X}}\) and \({\mathbb{Y}}\), respectively. The ratio \(\frac{{\beta }_{i}}{{\alpha }_{i}}\) determines the location of yi between \(\bar{{\bf{y}}}\) and \(\partial {\mathbb{Y}}\), whereas the function f() modifies δi to account for the shape differences between \({\mathbb{X}}\) and \({\mathbb{Y}}\). The mapping is bijective, taking centre to centre and boundary to boundary of the two sets.
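Equation (12) can be made concrete for the special case of axis-aligned boxes with f taken as the identity; that restriction, and the function interface, are our simplifications of the generic algorithm.

```python
import numpy as np

def map_box_to_box(x, x_lo, x_hi, y_lo, y_hi):
    """Map x in box X = [x_lo, x_hi] to y in box Y = [y_lo, y_hi], eq. (12).

    With f = identity: centres map to centres, boundary points map to boundary
    points, and interior points keep their relative radial position
    beta_i / alpha_i along the direction delta_i from the centre.
    """
    x, x_lo, x_hi = map(np.asarray, (x, x_lo, x_hi))
    y_lo, y_hi = np.asarray(y_lo), np.asarray(y_hi)
    x_c, y_c = (x_lo + x_hi) / 2.0, (y_lo + y_hi) / 2.0
    delta = x - x_c
    if not delta.any():
        return y_c  # centre maps to centre
    # alpha scales delta to the boundary of X; beta scales it to the boundary of Y
    with np.errstate(divide="ignore"):
        alpha = np.min(np.where(delta != 0,
                                ((x_hi - x_lo) / 2.0) / np.abs(delta), np.inf))
        beta = np.min(np.where(delta != 0,
                               ((y_hi - y_lo) / 2.0) / np.abs(delta), np.inf))
    return y_c + (beta / alpha) * delta
```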

Optimization problem formulation

We use the result of the mapping as a terminal position and velocity constraint for an optimization problem that computes reference trajectories for each robot joint as cubic splines that minimize jerk. By definition of the problem, the result of the mapping is always inside the maximum control invariant set, which is the largest subset of the feasible state space containing the initial states from which the associated MPC problem is recursively feasible52 (Supplementary Information section 1.5.1). The result of the mapping forms the initial state for the next optimization problem. The optimization is solved using DAQP53 and sampled at 1 kHz to generate \({Q}_{t}^{* (1:T)}\).
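A single cubic segment satisfying terminal position and velocity constraints can be written in closed form; this sketch covers only one unconstrained Hermite segment, whereas the system above solves a constrained QP (via DAQP) over the full spline.

```python
import numpy as np

def cubic_segment(q0, v0, qT, vT, T, rate_hz=1000):
    """Cubic (Hermite) joint trajectory hitting terminal constraints at time T.

    q(t) = a0 + a1 t + a2 t^2 + a3 t^3 with q(0) = q0, q'(0) = v0,
    q(T) = qT, q'(T) = vT; with four boundary conditions the cubic is unique.
    Sampled at `rate_hz` (1 kHz as in the text).
    """
    a0, a1 = q0, v0
    a2 = (3 * (qT - q0) - (2 * v0 + vT) * T) / T**2
    a3 = (-2 * (qT - q0) + (v0 + vT) * T) / T**3
    t = np.arange(0.0, T + 0.5 / rate_hz, 1.0 / rate_hz)
    q = a0 + a1 * t + a2 * t**2 + a3 * t**3
    dq = a1 + 2 * a2 * t + 3 * a3 * t**2
    return t, q, dq
```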

Reset trajectories

For every \({Q}_{t}^{* (1:T)}\) produced, a reset trajectory is required that moves the robot from the terminal state of \({Q}_{t}^{* (1:T)}\) to a target stationary reset position. Ace uses a near time-optimal variation of MPC (see Supplementary Information section 1.5.2) to generate these reset trajectories. They are executed as soon as one of the termination criteria for the RL episode is satisfied (Supplementary Information section 1.4.1). If the episode is terminated due to a predicted collision, then the reset trajectory from the previous RL step is executed.

The target reset position is chosen as either a constant neutral configuration or a configuration computed by a prepare policy network. The prepare policy is trained using a dataset constructed from elite-level rallies for high-dexterity shot execution. From each recorded rally, we extract (1) the ball state at the start of an episode, \({{\bf{s}}}_{0}^{{\rm{ball}}}\); (2) ydesired; and (3) the subsequent racket position \({{\bf{x}}}_{{t}_{{\rm{c}}}}^{{\rm{r}}{\rm{a}}{\rm{c}}{\rm{k}}{\rm{e}}{\rm{t}}}\) executed by the robot at contact time tc. For each \({{\bf{x}}}_{{t}_{{\rm{c}}}}^{{\rm{r}}{\rm{a}}{\rm{c}}{\rm{k}}{\rm{e}}{\rm{t}}}\), we compute offline the optimal reset configuration \({{\bf{q}}}_{\,{\rm{r}}{\rm{e}}{\rm{s}}{\rm{e}}{\rm{t}}}^{\ast }\) that maximizes a dexterity objective \(D({\bf{q}},{{\bf{x}}}_{{t}_{{\rm{c}}}}^{{\rm{r}}{\rm{a}}{\rm{c}}{\rm{k}}{\rm{e}}{\rm{t}}})\), subject to kinematic constraints. This process yields a training dataset \({\mathcal{D}}={\{{({{\bf{s}}}_{0}^{{\rm{b}}{\rm{a}}{\rm{l}}{\rm{l}}},{y}_{{\rm{d}}{\rm{e}}{\rm{s}}{\rm{i}}{\rm{r}}{\rm{e}}{\rm{d}}},{{\bf{q}}}_{{\rm{r}}{\rm{e}}{\rm{s}}{\rm{e}}{\rm{t}}}^{\ast })}^{i}\}}_{i=1}^{M}\) of M samples. During deployment, for each shot the agent receives \(({{\bf{s}}}_{0}^{{\rm{ball}}},{y}_{{\rm{desired}}})\) as input and predicts the optimal \({{\bf{q}}}_{\,{\rm{r}}{\rm{e}}{\rm{s}}{\rm{e}}{\rm{t}}}^{{\rm{d}}{\rm{e}}{\rm{s}}{\rm{i}}{\rm{r}}{\rm{e}}{\rm{d}}}\) that supports dexterous execution of subsequent actions. \({{\bf{q}}}_{\,{\rm{r}}{\rm{e}}{\rm{s}}{\rm{e}}{\rm{t}}}^{{\rm{d}}{\rm{e}}{\rm{s}}{\rm{i}}{\rm{r}}{\rm{e}}{\rm{d}}}\) is sampled from a Gaussian distribution estimated from the stored reset configurations \({\{{{\bf{q}}}_{\,{\rm{r}}{\rm{e}}{\rm{s}}{\rm{e}}{\rm{t}}}^{\ast j}\}}_{j=1}^{N}\) of the N nearest neighbours of \(({{\bf{s}}}_{0}^{{\rm{ball}}},{y}_{{\rm{desired}}})\) in the dataset, found by KD-tree (k-dimensional tree) search.
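The deployment-time lookup can be sketched as a nearest-neighbour query followed by Gaussian sampling; a brute-force search stands in here for the KD-tree used at scale, and the diagonal Gaussian fit is our simplification.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_reset_config(query, dataset_keys, dataset_resets, n_neighbours=5):
    """Predict a reset configuration for a (ball state, y_desired) query.

    Finds the N nearest dataset entries (brute force here, in place of a
    KD-tree) and samples from a diagonal Gaussian fitted to their stored
    optimal reset configurations.
    """
    d = np.linalg.norm(np.asarray(dataset_keys, float) - np.asarray(query, float),
                       axis=1)
    nearest = np.argsort(d)[:n_neighbours]
    q_star = np.asarray(dataset_resets, float)[nearest]
    mean, std = q_star.mean(axis=0), q_star.std(axis=0)
    return rng.normal(mean, std)
```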

Policy sampler

Ace uses multiple rally-specific policies trained to optimize different objectives and therefore requires a sampling strategy during matches. Ace uses four different strategies for sampling the policies (see Supplementary Information section 1.7.1 for details):

  1.

    Fixed: a single policy is sampled with a fixed probability of 1.

  2.

    Random: a policy is chosen at random on a shot-by-shot basis from a subset of policies.

  3.

    Heuristic: a set of heuristics dictates the policy sampling on a shot-by-shot basis. The heuristics map the characteristics of the incoming ball to the most appropriate policy.

  4.

    Data-driven: a supervised learning model is trained to classify winning and losing shots based on data from elite table tennis players other than the seven players in the evaluation. The model is used to identify shots with the highest predicted win rate, and the policy most capable of producing those shots is sampled.

For policies conditioned on ydesired and wreward, Ace samples them from the same fixed probability distribution used during training, that is, uniform for ydesired and sparse but biased towards boundary values for wreward. As these policies are conditioned on ydesired, they also afford the use of the prepare policy, which requires ydesired as input.

Serve design

Ace achieves ITTF-compliant serves by executing a single-arm toss using the ball cup mounted on its end effector (Fig. 2c), followed by striking the ball during its free fall. Although standard ITTF rules require a free-hand toss, one-handed serves are permitted when a player has a physical disability that impedes them from properly tossing the ball with the free hand, providing a precedent for our implementation.

For the serve tossing, we collect human serve demonstrations and re-target them to the kinematics of the robot using an optimization procedure54. The resulting motion is a trajectory of joint commands utoss(t) that produces a valid ball toss when executed by the robot. We define tlift as the time index of the tossing trajectory at which the acceleration of the ball approximates that of gravity, that is, the moment the ball is released from the cup.

In simulation, the ball-striking motion ustrike(t) is obtained by connecting the current state of the robot at t = tlift to a racket state produced by a genetic algorithm (GA). This connection trajectory is generated by an MPC in racket space, for which the GA searches the optimal parameters \({{\boldsymbol{\xi }}}_{s}^{\ast }\) (Supplementary Information section 1.8.2) that maximize a fitness function \({{\mathcal{F}}}_{\theta }(\cdot )\). This fitness function is designed to capture serve metrics of interest (for example, ball velocity, spin, landing position), which conditions the type of serves produced. The final serve trajectory is the concatenation of utoss(t), t ∈ (0, tlift), followed by ustrike(t).

As ustrike(t) is based on simulated physics, we assess the effectiveness of each serve on the real robot in dedicated sessions with a coach. Serves deemed sufficiently challenging for matches undergo repeated open-loop execution (at least 20 times) to verify their reliability. If the estimated failure probability of a serve is 5% or less, it is added to the library for open-loop use during matches. If it exceeds 5%, we attempt closed-loop MPC execution, in which the parameters \({{\boldsymbol{\xi }}}_{s}^{\ast }\) are updated online with the actual hitting states output by a ball flight predictor. If this procedure decreases the failure rate to 5% or less, the serve is added to the library for closed-loop use. Details on how the serves were selected from the library can be found in Supplementary Information section 1.8.1.

Experimental protocol

Players warm up with another player for up to 15 min before the match. The player practices with the robot immediately before the start of the match for 2 min, as directed by the rules of ITTF (https://www.ittf.com). During this practice period, Ace uses a policy that returns balls to a fixed position with moderate top spin, as is common practice. Players were informed that if they enter the court side of the robot, a safety light curtain triggers an emergency system stop. However, crossing the centre line is rare in high-level games, and such a trigger never occurred in our experiments. They were instructed to wear goggles to protect their eyes. They choose goggles from a variety of sizes and colours to minimize the impact on their performance. The player decides whether to serve or receive first. The player is eligible to call a 1-min time-out during the match. Elite D used this option during the third game. Following the ITTF rules, the robot racket is shown to the player and umpire before the match. All equipment, including table (SAN-EI), net (Butterfly), balls (Nittaku), racket (VICTAS and Butterfly) and floor mat, is approved by the ITTF. We use diffused lights to ensure a uniform light intensity over the whole playing area with around 1,400 lux as directed in the rules of ITTF.



■ Original source

First published at: www.nature.com

Publication date: 2026-04-22 06:00:00

Author: Peter Dürr
