A New Paradigm for Robotics

Making Robots Learn Instantly

Adham Ghazali
December 22, 2022

Robotics has not seen as much progress as other areas of AI. Industrial robots are effective at certain tasks, such as welding car parts, but they are costly and not very versatile. Instead of working on more practical capabilities, like improving movement and decision-making, the industry tends to focus on creating expensive, specialized robots that perform narrow tricks, such as playing the violin or smiling, without any practical purpose.

One of the core issues behind this underachievement is the core AI needed to produce such practical and cost-effective robots. For example, today's robotic navigation works very differently from the way we navigate as human beings. If I gave you a photo of my house, taken a few days ago in an unfamiliar neighborhood with no house numbers, you might be able to find my house by following the streets and going around the block. You might take a few wrong turns at first, but you would eventually locate my house and build a mental map of the neighborhood. The next time you visit, you would likely navigate to my house right away without taking any wrong turns. Such exploration and navigation is easy for humans.

To enable robots with such exploration and navigation behavior, we need to learn from diverse prior datasets collected in the real world. However, gathering a large amount of data from demonstrations, or even with randomized exploration, can be challenging for the robot. It needs to generalize to unseen neighborhoods, recognize visual and dynamical similarities across scenes, and learn a representation of visual observations that is robust to distractors like weather conditions and obstacles. Since such factors are hard to model and transfer from simulated environments, we tackle these problems by building a multimodal learning algorithm that combines language, visuals, and actions.

AI image generators create images that mix reality and fantasy, and they have become increasingly popular on the internet. The images produced are often random and whimsical, providing a window into the imagination of the human designer. A text prompt can quickly generate an image that our brains are wired to appreciate.

Image models appear to encode some sort of understanding about the natural or physical world, either dynamically or geometrically. For example, if you ask a model to generate a stable configuration of blocks, it will generate a block configuration that is stable. If you tell it to generate an unstable configuration of blocks, it will look very unstable. And if you say "a cat riding a bike", it will generate something 'feasible', even though it has never seen anything like it.

What makes this technology so good is a mix of the following:

Large Language Models (LLMs) are self-supervised

LLMs are trained to predict the next word(s) in a sentence, so their training is self-supervised: the targets are derived from the text itself, with no human labeling required. The data used to train LLMs ranges from the entire works of Shakespeare to C++ code snippets. LLMs are zero-shot learners, meaning no domain-tailored training data is needed to adapt them to new tasks.
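To make the self-supervision idea concrete, here is a minimal sketch of how next-word training pairs fall out of raw text with no annotation; the whitespace "tokenizer" is a toy stand-in for a real one:

```python
# Sketch: self-supervised next-token data construction.
# The training targets are just the text shifted by one position,
# so no human annotation is needed.
text = "robots learn to navigate"
tokens = text.split()  # toy whitespace tokenizer

# Each (context, target) pair asks: given the words so far, what comes next?
pairs = [(tokens[:i], tokens[i]) for i in range(1, len(tokens))]
for context, target in pairs:
    print(context, "->", target)
```

Every sentence on the internet yields such pairs for free, which is what makes internet-scale training data available.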

CLIP (Contrastive Language-Image Pre-Training)

CLIP is a contrastive network that connects text to visuals by embedding both in a shared space. Like LLMs, CLIP demonstrates zero-shot capabilities on downstream tasks with little to no domain-specific training data.
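A minimal sketch of CLIP-style zero-shot matching: embed an image and candidate captions into a shared space, then pick the caption most similar to the image. The vectors below are hand-made stand-ins for real encoder outputs, not CLIP's actual representations:

```python
import numpy as np

# Cosine similarity between two embedding vectors.
def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

image_emb = np.array([0.9, 0.1, 0.0])  # pretend image-encoder output
caption_embs = {
    "a photo of a dog": np.array([0.8, 0.2, 0.1]),
    "a photo of a car": np.array([0.0, 0.1, 0.9]),
}

# Zero-shot "classification": the best caption is the nearest one in
# embedding space, with no task-specific training.
best = max(caption_embs, key=lambda c: cosine(image_emb, caption_embs[c]))
print(best)  # -> a photo of a dog
```

In real CLIP the two encoders are trained contrastively so that matching image-caption pairs land close together; the nearest-caption lookup above is all that zero-shot classification then requires.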

Diffusion Models for Image Synthesis

Diffusion models were originally proposed in 2015. They work by corrupting the training data by progressively adding Gaussian noise, and slowly wiping out details in the data until it becomes pure noise. Then, a neural network is trained to reverse this corruption process. Running this reversed corruption process synthesizes data from pure noise by gradually de-noising it until a clean sample is produced. There has been a recent resurgence in interest in diffusion models due to their training stability and the promising results they have achieved in terms of image and audio quality.
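The forward (corruption) process described above can be sketched in a few lines. The linear beta schedule here is an illustrative assumption, not any particular paper's settings; a real model would also train a network to reverse each noising step:

```python
import numpy as np

# Forward diffusion: x_t = sqrt(alpha_bar_t) * x_0 + sqrt(1 - alpha_bar_t) * eps.
# As t grows, alpha_bar_t shrinks toward 0 and the sample becomes pure noise.
rng = np.random.default_rng(0)

T = 1000
betas = np.linspace(1e-4, 0.02, T)       # assumed linear noise schedule
alpha_bar = np.cumprod(1.0 - betas)      # cumulative signal-retention factor

x0 = rng.normal(size=(8, 8))             # stand-in for a clean "image"

def noisy_sample(x0, t):
    eps = rng.normal(size=x0.shape)
    return np.sqrt(alpha_bar[t]) * x0 + np.sqrt(1 - alpha_bar[t]) * eps

# Early timestep: mostly signal. Late timestep: almost pure Gaussian noise.
print(alpha_bar[10], alpha_bar[T - 1])
```

The denoising network is trained to predict the added noise at each timestep; running it backwards from pure noise is what synthesizes a clean sample.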

Large-Scale Datasets

Internet-scale image-caption pairs are readily available to anyone with access to the internet. See the work at LAION.AI.

Proposed Solution: Robot Navigation & Decision-Making in Unknown Environments

Robotic navigation and decision-making (motion planning) have been approached as a problem of 3D reconstruction (Perception) and planning, as well as an end-to-end learning problem.

The first method requires hand engineering that is difficult to scale from one environment to another. End-to-end learning is a large black box that is uncontrollable and cannot be debugged, leading to unpredictable development cycles.

Our proposed approach integrates learning and planning, and can utilize side information such as schematic roadmaps, text instructions, satellite maps and GPS coordinates as a planning heuristic, without relying on them being accurate.

Our method is divided into four sub-components:

  • An Action Detection Network (ADN): looks at the robot's current camera observation(s) and performs action-conditioned pre-training. This network is trained to detect potential actions the robot can execute (sub-goals).
  • A Local Traversability Model: looks at the robot's current camera observation and a potential sub-goal to infer how easily that sub-goal can be reached.
  • A Heuristic Model: looks at overhead maps for hints and evaluates how appropriate each sub-goal is for reaching the goal. These scores are used by a heuristic planner to identify the best waypoint toward the final destination.
  • A Motion Model: takes the sub-goal, the current sensor observations, and the detected action, and produces a local driving trajectory. The model is trained to reproduce a self-supervised local SLAM algorithm.
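As a rough illustration of how these four components might fit together at planning time, here is a toy sketch. Every function below is a hand-written placeholder standing in for the corresponding learned model, not an actual implementation:

```python
# Toy sketch: score candidate sub-goals by combining a traversability
# estimate with a map-based heuristic, then pick the best one.
# All values and function bodies are illustrative placeholders.

def detect_actions(observation):
    # Stand-in for the Action Detection Network: propose candidate sub-goals.
    return ["left_turn", "straight", "right_turn"]

def traversability(observation, subgoal):
    # Stand-in for the local traversability model (higher = easier to reach).
    return {"left_turn": 0.2, "straight": 0.9, "right_turn": 0.5}[subgoal]

def map_heuristic(overhead_map, subgoal, goal):
    # Stand-in for the heuristic model reading hints from an overhead map.
    return 0.3 if subgoal == "right_turn" else 0.1

def plan_step(observation, overhead_map, goal):
    # Heuristic planner: pick the sub-goal with the best combined score.
    best = max(
        detect_actions(observation),
        key=lambda sg: traversability(observation, sg)
        + map_heuristic(overhead_map, sg, goal),
    )
    return best  # the motion model would turn this into a local trajectory

print(plan_step("camera_frame", "satellite_tile", "goal_pose"))  # -> straight
```

Because the heuristic only adds a score rather than dictating the choice, an inaccurate overhead map can bias the planner but cannot override what the traversability model sees in the camera.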

Our target is to use an image-based learned controller and a goal-directed heuristic to navigate to goals a few kilometers away and execute novel tasks once we arrive, in previously unseen environments, without performing any explicit geometric reconstruction, utilizing only a topological representation of the environment.

We also add language instructions as inputs between two adjacent sub-goals. Sub-goals are sampled at random distances, as long as they remain visible; e.g., 'stop next to the Starbucks behind the building' is an invalid sub-goal because the target is not visible.

The resulting method should be robust to unreliable maps, GPS, and commands, since the low-level controller (Decision-making Model) ultimately makes decisions based on egocentric image observations.

Adham Ghazali

An experienced entrepreneur who loves to lead early-stage engineering teams in the fields of AI, Robotics, and Autonomous Systems.