Waleed D. Khamies1, Montaser Mohammedalamen1, Benjamin Rosman2
1University of Khartoum, 2University of the Witwatersrand, CSIR
waleed.daud@outlook.com, montaserfath@gmail.com, benjros@gmail.com
I. INTRODUCTION
A. Problem Definition
Limbs are hugely valuable to many people: they improve mobility, support daily activities, and help people stay independent. Designing an artificial limb customised for one person is costly (around 50K USD) and time-consuming for manufacturers. Designing intelligent prosthetics that cope with the large differences between humans (such as body dimensions, weight, height and walking style) is further complicated by the large variability in response across individuals. One key reason for this is that the interactions between humans and prostheses are not well understood, which limits our ability to predict how a person will adapt his or her movement. Physics-based, biomechanical simulations are well-positioned to advance this field, as they allow many experiments to be run at low cost.
B. Environment
We use the OpenSim ProstheticsEnv environment, which models one human leg and a prosthetic in place of the other leg (see Fig. 1). OpenSim is a 3D human model simulator: the observations describe joints, muscles and tendons, the action space has 19 dimensions, and the reward R_t is the negative distance from the desired velocity, as in Eq. (1):

R_t = -\lVert v_t - v_{\mathrm{desired}} \rVert \qquad (1)

where v_t is the horizontal velocity vector of the pelvis. A limitation of the OpenSim environment is that it is very slow to run, due to the high number of observations and state variables.
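For concreteness, a minimal interaction loop with the environment is sketched below. It assumes the gym-style interface of the osim-rl package; the import path, constructor flag and step/reset signatures are assumptions about that library rather than details stated in this paper.

```python
# Minimal sketch of interacting with ProstheticsEnv (assumed osim-rl API).
import numpy as np
from osim.env import ProstheticsEnv

env = ProstheticsEnv(visualize=False)      # 3D model: one human leg, one prosthetic
observation = env.reset()

episode_return = 0.0
for t in range(1000):                      # 1,000 timesteps per episode, as in our experiments
    action = np.random.uniform(0.0, 1.0, size=19)   # 19 muscle activations in [0, 1]
    observation, reward, done, info = env.step(action)
    episode_return += reward               # reward: negative distance from the desired pelvis velocity
    if done:
        break
print("episode return:", episode_return)
```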
C. Reinforcement Learning (RL) algorithms
Reinforcement Learning (RL) can help prosthetics calibrate to differences between humans and between walking environments [1]. RL is a machine learning paradigm in which an agent learns an optimal policy for sequential decision making without complete knowledge of the environment [2]. The agent explores the environment by taking an action A_t and updates its policy according to the reward function so as to maximise the reward R_t. We use the DDPG [3], TRPO and PPO algorithms to train the agent. DDPG is a model-free, off-policy actor-critic algorithm that uses deep function approximators and can learn policies in high-dimensional, continuous action spaces.
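Our experiments use the ChainerRL implementations of these algorithms. Purely to illustrate the DDPG update itself (critic regression towards a bootstrapped target, actor ascent on the critic, Polyak-averaged target networks), a minimal PyTorch sketch is given below; the network sizes, learning rates and observation dimension are assumptions, not the settings of our experiments.

```python
# Core DDPG update, sketched in PyTorch for illustration only (the paper's
# experiments used ChainerRL); dimensions and hyperparameters are placeholders.
import torch
import torch.nn as nn

OBS_DIM, ACT_DIM, GAMMA, TAU = 158, 19, 0.99, 0.005   # illustrative values

def mlp(in_dim, out_dim, out_act=None):
    layers = [nn.Linear(in_dim, 256), nn.ReLU(), nn.Linear(256, out_dim)]
    if out_act is not None:
        layers.append(out_act)
    return nn.Sequential(*layers)

actor = mlp(OBS_DIM, ACT_DIM, nn.Sigmoid())            # muscle activations in [0, 1]
critic = mlp(OBS_DIM + ACT_DIM, 1)                     # Q(s, a)
target_actor = mlp(OBS_DIM, ACT_DIM, nn.Sigmoid())
target_critic = mlp(OBS_DIM + ACT_DIM, 1)
target_actor.load_state_dict(actor.state_dict())
target_critic.load_state_dict(critic.state_dict())
actor_opt = torch.optim.Adam(actor.parameters(), lr=1e-4)
critic_opt = torch.optim.Adam(critic.parameters(), lr=1e-3)

def ddpg_update(obs, act, rew, next_obs, done):
    """One gradient step on a mini-batch (obs, act, rew, next_obs, done) sampled
    from a replay buffer; rew and done have shape (batch, 1)."""
    # Critic: regress Q(s, a) towards the bootstrapped target.
    with torch.no_grad():
        next_q = target_critic(torch.cat([next_obs, target_actor(next_obs)], dim=-1))
        target = rew + GAMMA * (1.0 - done) * next_q
    critic_loss = ((critic(torch.cat([obs, act], dim=-1)) - target) ** 2).mean()
    critic_opt.zero_grad(); critic_loss.backward(); critic_opt.step()

    # Actor: ascend the critic's estimate of Q(s, pi(s)).
    actor_loss = -critic(torch.cat([obs, actor(obs)], dim=-1)).mean()
    actor_opt.zero_grad(); actor_loss.backward(); actor_opt.step()

    # Polyak-average the target networks towards the online networks.
    with torch.no_grad():
        for net, tgt in ((actor, target_actor), (critic, target_critic)):
            for p, tp in zip(net.parameters(), tgt.parameters()):
                tp.mul_(1.0 - TAU).add_(TAU * p)
```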
D. Imitation Learning
The main problem with RL algorithms is the time needed to solve the problem (the training time), because the algorithm must explore the environment and adapt its policy according to the reward at every timestep. There are several ways to accelerate learning in RL, such as cross-domain transfer [2] and inter-task mapping via artificial neural networks (ANNs) [5]. Imitation learning is a specific subset of RL in which a learner tries to mimic an expert's actions in order to achieve the best performance; we use it here by implementing the DAgger algorithm. The main advantage of DAgger is that the expert teaches the learner how to recover from past mistakes [4], and we aim to leverage this to illustrate behaviour learning. The DAgger algorithm has been shown to achieve expert-level performance after a few data-aggregation iterations [6]; a minimal sketch of the loop is given after the assumptions below. To use imitation learning, two assumptions must hold:
- Similarity between the expert and the target agent in the action space, the observation space and the reward function.
- The environment must be described by a Markov Decision Process (MDP).
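The following toy sketch shows the shape of the DAgger loop under these assumptions: states are visited by rolling out the learner's own policy, every visited state is labelled with the expert's action, and the learner is retrained on the aggregated dataset. The environment, expert and learner below are hypothetical stand-ins, not ProstheticsEnv, the DDPG expert or our actual network.

```python
# Toy DAgger loop; ToyEnv, expert_policy and LinearLearner are hypothetical
# stand-ins for ProstheticsEnv, the trained DDPG expert and the naive agent.
import numpy as np

OBS_DIM, ACT_DIM = 8, 3                        # toy sizes for illustration

class ToyEnv:
    """Gym-style stand-in environment (reset/step only)."""
    def reset(self):
        self.s = np.random.randn(OBS_DIM)
        return self.s
    def step(self, action):
        self.s = self.s + 0.1 * np.random.randn(OBS_DIM)
        return self.s, -float(np.linalg.norm(action)), False, {}

def expert_policy(obs):
    """Stand-in for the trained expert policy."""
    return np.tanh(obs[:ACT_DIM])

class LinearLearner:
    """Stand-in for the naive agent's policy."""
    def __init__(self):
        self.W = np.zeros((ACT_DIM, OBS_DIM))
    def act(self, obs):
        return self.W @ obs
    def fit(self, states, actions):
        # Supervised regression of expert labels on the aggregated states.
        self.W = np.linalg.lstsq(states, actions, rcond=None)[0].T

env, learner = ToyEnv(), LinearLearner()
states, actions = [], []
for iteration in range(10):                    # data-aggregation iterations
    obs = env.reset()
    for t in range(200):
        # Visit states under the learner's policy (expert on the first iteration)...
        a = expert_policy(obs) if iteration == 0 else learner.act(obs)
        states.append(obs)
        actions.append(expert_policy(obs))     # ...but always label with the expert's action
        obs, reward, done, info = env.step(a)
        if done:
            break
    learner.fit(np.asarray(states), np.asarray(actions))
```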
TABLE I
| Algorithm | Maximum reward | Mean reward |
|-----------|----------------|-------------|
| DDPG      | 113            | -42         |
| TRPO      | 194            | 43          |
| PPO       | 70             | -58         |
In the standard DAgger algorithm, the target agent exploits the expert policy and stops exploring the environment. This may be a problem, as the target agent should balance exploitation and exploration. We propose some improvements to the DAgger algorithm to encourage exploration.
II. EXPERIMENTS
We run the following experiments:
- Run the RL algorithms (DDPG, TRPO and PPO) in the OpenSim ProstheticsEnv for 2,000 episodes of 1,000 timesteps each, to give the agent more time to walk or stand up.
- Train a DDPG agent to achieve a positive reward (around +100) on the standing-up task.
- Use that agent as the expert to evaluate the DAgger algorithm.
- Modify DAgger so that the expert agent labels the target agent's actions based on the timestep reward, by comparing the timestep reward of the expert agent and of the target agent at a given timestep: if the expert agent has a lower reward than the target agent, the expert keeps the target agent's action, and vice versa.
- Use the target action value instead of the timestep reward for this comparison, summing the timestep rewards from a given state-action pair until the end of the episode.
- Use the epsilon-greedy method [2], where the algorithm chooses between the target action with probability 1−ε and the expert action with probability ε (sketches of these rules are given after this list).
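A minimal sketch of the last two rules, under our reading of the descriptions above, is shown below; the function names and the value of epsilon are placeholders rather than the exact implementation.

```python
# Sketch of the modified labelling and action-selection rules; names and the
# epsilon value are placeholders, not the exact implementation used here.
import random

def label_action(expert_action, target_action, expert_reward, target_reward):
    """Reward-based labelling: if the expert's reward (timestep reward or
    action value) is lower than the target agent's, keep the target agent's
    action as the label; otherwise use the expert's action."""
    return target_action if expert_reward < target_reward else expert_action

def select_action(expert_action, target_action, epsilon=0.1):
    """Epsilon-greedy mixing: take the expert's action with probability epsilon
    and the target agent's own action with probability 1 - epsilon."""
    return expert_action if random.random() < epsilon else target_action
```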
III. RESULTS
The highest maximum and mean rewards were achieved by TRPO (see Table I and Fig. 2), but it takes more time than PPO and DDPG because it needs to compute the inverse of a matrix, which is slow. Despite this reward, the agent cannot walk for more than one step and sometimes falls before the first step. The DAgger algorithm achieved the best average reward compared with the other algorithms, which balance exploitation and exploration (see Fig. 3). We think the reasons for this are:
- The expert policy has a high reward.
- There is high similarity between the expert and the naive agent.
- The naive agent needs more running time, i.e. a larger number of iterations.
The main problem with the timestep-reward modification is that it compares timestep rewards, which are short-term signals with large variation between timesteps, whereas with the action-value comparison the variation between episodes is small. The naive agent obtained a greater reward than the expert agent, so the roles can be exchanged: the expert can become the naive agent and the naive agent the expert, which would decrease training time significantly.
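For the action-value variant, the quantity compared for each state-action pair is the undiscounted sum of timestep rewards until the end of the episode (our reading of the description in Section II; the exact form used may differ):

G_t = \sum_{k=t}^{T} r_k

where T is the final timestep of the episode and r_k is the reward at timestep k.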
IV. CONCLUSIONS
We have applied imitation learning in a humanoid environment to accelerate the learning process. The naive agent reaches convergence within 5 iterations, whereas the expert needs 100 iterations, which corresponds to a 95% reduction in training time. The DAgger algorithm achieved the best average reward compared with the other algorithms, which balance exploitation and exploration. These algorithms should work better when there is some degree of variation between the expert and the naive agent, and this is what we plan to investigate in future work by applying imitation learning from normal human legs to a prosthetic; the main challenge will be identifying the differences and similarities between them.
V. RESEARCH LIMITATIONS
- The prosthetic model cannot walk for long distances and can even fall before completing the first step.
- Each experiment was run only once, so we plan to repeat each experiment several times with different random seeds and report the average and variance.
- We used the same hyperparameters for all algorithms in order to benchmark them fairly; we still need to select the best hyperparameters for each algorithm and environment.
- We benchmarked only three RL algorithms, and only from one library (ChainerRL), so we plan to use different implementations.
REFERENCES
[1] T. Garikayi, D. van den Heever and S. Matope, "Robotic prosthetic challenges for clinical applications," in IEEE International Conference on Control and Robotics Engineering (ICCRE), Singapore, 2016, pp. 1-5. doi: 10.1109/ICCRE.2016.7476146.
[2] G. Joshi and G. Chowdhary, "Cross-Domain Transfer in Reinforcement Learning using Target Apprentice," 2018.
[3] T. Lillicrap, J. J. Hunt, A. Pritzel, N. Heess, T. Erez, Y. Tassa, D. Silver and D. Wierstra, "Continuous control with deep reinforcement learning," CoRR, 2015.
[4] A. Attia and S. Dayan, "Global overview of Imitation Learning," 2018.
[5] Q. Cheng, X. Wang and L. Shen, "An Autonomous Inter-task Mapping Learning Method via Artificial Neural Network for Transfer Learning," 2017. doi: 10.1109/ROBIO.2017.8324510.
[6] J. J. Zhu, "DAgger algorithm implementation," GitHub repository, 2017.