Planning from Point Clouds over Continuous Actions for Multi-object Rearrangement

Conference on Robot Learning (CoRL) 2025

Kallol Saha*1, Amber Li*1, Angela Rodriguez-Izquierdo*2, Lifan Yu1,
Ben Eisner1, Maxim Likhachev1, David Held1
1Robotics Institute, Carnegie Mellon University    2Princeton University
*Equal Contribution

Abstract

Multi-object rearrangement is a challenging task that requires robots to reason about a physical 3D scene and the effects of a sequence of actions. While traditional task planning methods have been shown to be effective for long-horizon manipulation, they require discretizing the continuous state and action space into symbolic descriptions of objects, object relationships, and actions. Our proposed method instead takes a partially-observed point cloud of an initial scene and plans to a goal-satisfying configuration, without needing to discretize the set of actions or object relationships. We formulate the planning problem as an A* search over the space of possible point cloud rearrangements. We sample point cloud transformations from a learned, domain-specific prior and then search for a sequence of such transformations that leads from the initial state to a goal. We evaluate our method in terms of task planning success and task execution success in both simulated and real-world environments. We experimentally demonstrate that our method produces successful plans and outperforms a policy-learning approach; we also perform ablation experiments that show the importance of search in our approach.



System Overview

Our method takes as input an RGB-D image of the scene, the names of the objects, and a goal function. From this input, it obtains a segmented point cloud, over which A* search plans in continuous point-cloud space to reach a goal-satisfying configuration. At each step, A* explores possible actions, each consisting of (1) an object to move and (2) a placement for that object. The learned object suggester guides object selection, candidate placements are sampled from a learned placement suggester, and a learned model deviation estimator steers the search toward actions that are more physically plausible. The method outputs a plan consisting of objects and their transformations, which the robot can execute without additional ground-truth information.
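The search loop described above can be sketched as follows. This is a minimal, hypothetical illustration, not the paper's implementation: the learned components (object suggester, placement suggester, model deviation estimator) are replaced with toy stand-ins, the scene is a dict mapping object names to (N, 3) point clouds, and actions are pure translations rather than full SE(3) transforms.

```python
import heapq
import numpy as np

def suggest_objects(scene):
    # Toy object suggester: consider every object equally movable.
    return sorted(scene)

def suggest_placements(obj_cloud, step=0.5):
    # Toy placement suggester: four axis-aligned translations.
    # (The learned suggester would condition on the object's point cloud.)
    for dx, dy in [(step, 0), (-step, 0), (0, step), (0, -step)]:
        yield np.array([dx, dy, 0.0])

def deviation_cost(scene, obj, t):
    # Toy model deviation estimator: treat every action as equally plausible.
    return 0.0

def centroid(cloud):
    return cloud.mean(axis=0)

def state_key(scene):
    # Round centroids only for duplicate detection; planning itself
    # stays in continuous space.
    return tuple(tuple(np.round(centroid(scene[n]), 3)) for n in sorted(scene))

def astar_rearrange(scene, goal_obj, goal_pos, tol=0.25, max_expansions=10_000):
    h = lambda s: float(np.linalg.norm(centroid(s[goal_obj]) - goal_pos))
    counter = 0  # tie-breaker so heapq never compares scene dicts
    open_set = [(h(scene), counter, 0.0, scene, [])]
    seen = set()
    while open_set and max_expansions > 0:
        _, _, g, s, plan = heapq.heappop(open_set)
        if h(s) <= tol:
            return plan  # list of (object, translation) actions
        key = state_key(s)
        if key in seen:
            continue
        seen.add(key)
        max_expansions -= 1
        for obj in suggest_objects(s):
            for t in suggest_placements(s[obj]):
                child = dict(s)
                child[obj] = s[obj] + t  # apply the sampled transformation
                cost = g + 1.0 + deviation_cost(s, obj, t)
                counter += 1
                heapq.heappush(
                    open_set,
                    (cost + h(child), counter, cost, child, plan + [(obj, t)]),
                )
    return None
```

With a "block" at the origin and a goal centroid at (1, 0, 0), the sketch returns a two-step plan of +0.5 m translations; swapping in learned suggesters changes only the three stand-in functions.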



Object and Placement Suggesters

Learned object and placement suggesters. (a) The object suggester operates on a point cloud observation of the scene. From its outputs, we derive a probability distribution over which objects in the scene can feasibly be moved. (b) Given an object to move and the point cloud of the scene, the placement suggester samples multiple candidate transformations specifying where that object might be moved next.
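As a hypothetical sketch of these two outputs: per-object scores can be normalized into a sampling distribution, and placements can be drawn as rigid transforms. The score values, the planar-placement restriction, and the parameter ranges below are illustrative assumptions, not the paper's learned models.

```python
import numpy as np

def object_distribution(scores):
    # Softmax over per-object scores -> probability each object is moved next.
    z = np.exp(scores - np.max(scores))  # shift for numerical stability
    return z / z.sum()

def sample_placements(rng, k=3, xy_range=0.3, max_yaw=np.pi):
    # Draw k candidate placements as 4x4 homogeneous transforms:
    # a yaw rotation about z plus an (x, y) translation.
    transforms = []
    for _ in range(k):
        yaw = rng.uniform(-max_yaw, max_yaw)
        x, y = rng.uniform(-xy_range, xy_range, size=2)
        T = np.eye(4)
        c, s = np.cos(yaw), np.sin(yaw)
        T[:2, :2] = [[c, -s], [s, c]]
        T[:2, 3] = [x, y]
        transforms.append(T)
    return transforms
```

Sampling several transforms per object, rather than committing to one, is what lets the search branch over multiple candidate placements.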



Data Collection

We train our learned placement suggester on demonstrations of each task. The demonstrations are given as a sequence of RGB-D images recorded from a camera, together with the semantic names of the objects in the scene. The initial observation is converted into a segmented point cloud using the camera extrinsics and Grounded Segment Anything. With a single camera, we can only get partial observations of the scene, which can lead to occlusions. To avoid this issue, we generate all subsequent data by transforming each object's initial point cloud by the transformation applied at that time step in the demonstration. To track how the objects move, we use CoTracker3.
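The carry-forward step above can be sketched as follows, under the assumption that each demonstration step yields a 4x4 rigid transform for the moved object (the function names and data layout here are illustrative). Instead of re-observing the scene at each step, every object's initial partial point cloud is propagated through the demonstrated transforms.

```python
import numpy as np

def apply_transform(cloud, T):
    # Apply a 4x4 homogeneous transform to an (N, 3) point cloud.
    homo = np.hstack([cloud, np.ones((len(cloud), 1))])
    return (homo @ T.T)[:, :3]

def rollout_clouds(initial_clouds, steps):
    """initial_clouds: dict name -> (N, 3) array from the first frame.
    steps: list of (name, T) demonstrated transforms, in order.
    Returns the sequence of scene point clouds, one entry per time step."""
    scene = {n: c.copy() for n, c in initial_clouds.items()}
    frames = [dict(scene)]
    for name, T in steps:
        scene = dict(scene)  # copy so earlier frames stay untouched
        scene[name] = apply_transform(scene[name], T)
        frames.append(scene)
    return frames
```

Because only the initial observation is ever segmented, the later frames inherit its partial coverage but never gain new occlusions from re-observation.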



Comparison to Points2Plans

Similar to our approach, prior work such as Points2Plans plans to solve long-horizon manipulation tasks from partially-observed point clouds. In contrast to our method, Points2Plans and its baselines use relational state abstractions to convert the point cloud observations into a symbolic representation of the scene for planning. We compare against these approaches by running our method in their constrained packing environment, where the robot must place items into a spatially constrained space (a cupboard) by carefully reasoning about the objects' planned placement positions. Our method slightly outperforms Points2Plans on this task, achieving 100% task planning success and 84% task execution success on an evaluation set of 5 different initial configurations, each run with 100 different random seeds.



Simulation Results

We compare the execution performance of our method to 3D Diffusion Policy (DP3), an end-to-end imitation learning policy. Our DP3 baseline is trained on a dataset of 23 task-specific demonstrations of a simulated block-stacking task, provided by a human expert. All of the demonstrations lead to a single consistent goal configuration, since DP3 is not goal-conditioned. We evaluate on five tasks of varying complexity, where task complexity is defined by the number of steps in an optimal plan to reach a goal state. Although DP3 achieves some success on 2-step tasks, it is unable to complete any 3- or 4-step tasks. See Table 2 in our paper for success rates.



Qualitative Analysis: Finding Efficient Path Lengths

Our method can often complete the task more efficiently than the human demonstrations, since our learned object and placement suggesters capture a goal-agnostic distribution over relevant objects and placements. By sampling from this distribution, our approach explores different candidate solutions, allowing it to find shorter and more effective paths to the goal than those shown in the demonstrations.


Generalization to Unseen Configurations

Our method generalizes to previously unseen configurations and successfully finds optimal paths for them as well. This demonstrates that the object and placement suggesters have learned a generalizable distribution, which the search leverages to plan efficiently in new scenes.



Acknowledgements

This material is based upon work supported by ONR MURI N00014-24-1-2748 and by the Toyota Research Institute. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the Office of Naval Research or Toyota Research Institute.


BibTeX

@inproceedings{saha2025planning,
    title={Planning from Point Clouds over Continuous Actions for Multi-object Rearrangement},
    author={Saha, Kallol and Li, Amber and Rodriguez-Izquierdo, Angela and Yu, Lifan and Eisner, Ben and Likhachev, Maxim and Held, David},
    booktitle={Conference on Robot Learning (CoRL)},
    year={2025}
}