RLApplications.bib

Castanon:1997p23922
Approximate dynamic programming for sensor management
D. Castanon
Proceedings of the 36th IEEE Conference on Decision and Control  2  1202 -- 1207  (1997)
http://ieeexplore.ieee.org/search/srchabstract.jsp?arnumber=657615&isnumber=14272&punumber=5239&k2dockey=657615@ieeecnfs
http://dx.doi.org/10.1109/CDC.1997.657615
This paper studies the problem of dynamic scheduling of multi-mode sensor resources for the problem of classification of multiple unknown objects. Because of the uncertain nature of the object types, the problem is formulated as a partially observed Markov decision problem with a large state space. The paper describes a hierarchical algorithm approach for efficient solution of sensor scheduling problems with large numbers of objects, based on a combination of stochastic dynamic programming and nondifferentiable optimization techniques. The algorithm is illustrated with an application involving classification of 10,000 unknown objects
Keywords: application, sensor management, object classification, dynamic scheduling of multi-mode sensor resources, RL
SzeFongYau:2000p23917
Performance analysis of the approximate dynamic programming algorithm for parameter estimation of superimposed signals
S. F. Yau and Y. Bresler
IEEE Transactions on Signal Processing  48  1274 -- 1286  (2000)
http://ieeexplore.ieee.org/search/srchabstract.jsp?arnumber=839975&isnumber=18164&punumber=78&k2dockey=839975@ieeejrns
http://dx.doi.org/10.1109/78.839975
We consider the classical problem of fitting a model composed of multiple superimposed signals to noisy data using the criteria of maximum likelihood (ML) or subspace fitting, jointly termed generalized subspace fitting (GSF). We analyze a previously proposed approximate dynamic programming algorithm (ADP), which provides a computationally efficient solution to the associated multidimensional multimodal optimization problem. We quantify the error introduced by the approximations in ADP and deviations from the key local interaction signal model (LISMO) modeling assumption in two ways. First, we upper bound the difference between the exact minimum of the GSF criterion and its value at the ADP estimate and compare the ADP with GSF estimates obtained by exhaustive multidimensional search on a fine lattice. Second, motivated by the similar accuracy bounds, we use perturbation analysis to derive approximate expressions for the MSE of the ADP estimates. These various results provide, for the first time, an effective tool to predict the performance of the ADP algorithm for various signal models at nonasymptotic conditions of interest in practical applications. In particular, they demonstrate that for the classical problems of sinusoid retrieval and array processing, ADP performs comparably to exact (but expensive) maximum likelihood (ML) over a wide range of signal-to-noise ratios (SNRs) and is therefore an attractive algorithm
Keywords: inverse problem, search problem, application, RL, array processing
Shervais:2003p23875
Intelligent supply chain management using adaptive critic learning
S. Shervais and T. Shannon and G. Lendaris
IEEE Transactions on Systems, Man and Cybernetics, Part A  33  235 -- 244  (2003)
http://ieeexplore.ieee.org/search/srchabstract.jsp?arnumber=1219461&isnumber=27401&punumber=3468&k2dockey=1219461@ieeejrns
http://dx.doi.org/10.1109/TSMCA.2003.809214
A set of neural networks is employed to develop control policies that are better than fixed, theoretically optimal policies, when applied to a combined physical inventory and distribution system in a nonstationary demand environment. Specifically, we show that model-based adaptive critic approximate dynamic programming techniques can be used with systems characterized by discrete valued states and controls. The control policies embodied by the trained neural networks outperformed the best, fixed policies (found by either linear programming or genetic algorithms) in a high-penalty cost environment with time-varying demand.
Keywords: supply chain management, RL, application
Bahati:2008p24269
Adapting to Run-Time Changes in Policies Driving Autonomic Management
R. Bahati and M. Bauer
Fourth International Conference on Autonomic and Autonomous Systems (ICAS 2008)    88 -- 93  (2008)
http://ieeexplore.ieee.org/search/srchabstract.jsp?arnumber=4488327&isnumber=4488306&punumber=4488305&k2dockey=4488327@ieeecnfs
http://dx.doi.org/10.1109/ICAS.2008.47
The use of policies within autonomic computing has received significant interest in the recent past. Policy-driven management offers significant benefit since it makes it more straightforward to define and modify system behavior at run-time, through policy manipulation, rather than through re-engineering. In this paper, we present an adaptive policy-driven autonomic management system which makes use of reinforcement learning methodologies to determine how to best use a set of active policies to meet different performance objectives. The focus, in particular, is on strategies for adapting what has been learned for one set of policy actions to a "similar" set of policies when run-time policy modifications occur. We illustrate the impact of the adaptation strategies on the behavior of a multi-component Web server.
Keywords: autonomic computing, web services, RL, application
Salkham:2008p23971
A Collaborative Reinforcement Learning Approach to Urban Traffic Control Optimization
A. Salkham and R. Cunningham and A. Garg and V. Cahill
2008 IEEE/WIC/ACM International Conference on Web Intelligence and Intelligent Agent Technology  2  560 -- 566  (2008)
http://ieeexplore.ieee.org/search/srchabstract.jsp?arnumber=4740684&isnumber=4740584&punumber=4740404&k2dockey=4740684@ieeecnfs
http://dx.doi.org/10.1109/WIIAT.2008.88
The high growth rate of vehicles per capita now poses a real challenge to efficient Urban Traffic Control (UTC). An efficient solution to UTC must be adaptive in order to deal with the highly-dynamic nature of urban traffic. In the near future, global positioning systems and vehicle-to-vehicle/infrastructure communication may provide a more detailed local view of the traffic situation that could be employed for better global UTC optimization. In this paper we describe the design of a next-generation UTC system that exploits such local knowledge about a junction's traffic in order to optimize traffic control. Global UTC optimization is achieved using a local Adaptive Round Robin (ARR) phase switching model optimized using Collaborative Reinforcement Learning (CRL). The design employs an ARR-CRL-based agent controller for each signalized junction that collaborates with neighbouring agents in order to learn appropriate phase timing based on the traffic pattern. We compare our approach to non-adaptive fixed-time UTC system and to a saturation balancing algorithm in a large-scale simulation of traffic in Dublin's inner city centre. We show that the ARR-CRL approach can provide significant improvement resulting in up to ~57% lower average waiting time per vehicle compared to the saturation balancing algorithm.
Keywords: application, traffic control, urban traffic, RL, phase-timing control
Liu:2008p23751
Adaptive Critic Learning Techniques for Engine Torque and Air--Fuel Ratio Control
D. Liu and H. Javaherian and O. Kovalenko and T. Huang
IEEE Transactions on Systems, Man, and Cybernetics, Part B  38  988 -- 993  (2008)
http://ieeexplore.ieee.org/search/srchabstract.jsp?arnumber=4554033&isnumber=4567535&punumber=3477&k2dockey=4554033@ieeejrns
http://dx.doi.org/10.1109/TSMCB.2008.922019
A new approach for engine calibration and control is proposed. In this paper, we present our research results on the implementation of adaptive critic designs for self-learning control of automotive engines. A class of adaptive critic designs that can be classified as (model-free) action-dependent heuristic dynamic programming is used in this research project. The goals of the present learning control design for automotive engines include improved performance, reduced emissions, and maintained optimum performance under various operating conditions. Using the data from a test vehicle with a V8 engine, we developed a neural network model of the engine and neural network controllers based on the idea of approximate dynamic programming to achieve optimal control. We have developed and simulated self-learning neural network controllers for both engine torque (TRQ) and exhaust air--fuel ratio (AFR) control. The goal of TRQ control and AFR control is to track the commanded values. For both control problems, excellent neural network controller transient performance has been achieved.
Keywords: engine torque control, actor-critic, application, control of automotive engines, RL, exhaust air-fuel ratio control
Jodogne:2006p10601
Approximate Policy Iteration for Closed-Loop Learning of Visual Tasks
S. Jodogne and C. Briquet and J. Piater
ECML 2006      (2006)
http://www.springerlink.com/index/u7541v105r6t7gv0.pdf
Approximate Policy Iteration (API) is a reinforcement learning paradigm that is able to solve high-dimensional, continuous control problems. We propose to exploit API for the closed-loop learning of mappings from images to actions. This approach requires a family of function approximators that maps visual percepts to a real-valued function. For this purpose, we use Regression Extra-Trees, a fast, yet accurate and versatile machine learning algorithm. The inputs of the Extra-Trees consist of a set of visual features that digest the informative patterns in the visual signal. We also show how to parallelize the Extra-Tree learning process to further reduce the computational expense, which is often essential in visual tasks. Experimental results on real-world images are given that indicate that the combination of API with Extra-Trees is a promising framework for the interactive learning of visual tasks.
Keywords: visual navigation, RL, application
Mori:2005p10821
Off-Policy Natural Actor-Critic
T. Mori and Y. Nakamura and S. Ishii
IEICE Technical Report (Institute of Electronics, Information and Communication Engineers)      (2005)
http://isw3.naist.jp/IS/TechReport/report/2005007.ps
The recently developed Natural Actor-Critic (NAC) [1] [2], which employs natural policy gradient learning for the actor and LSTD-Q(λ) for the critic, has provided a good model-free reinforcement learning scheme applicable to high-dimensional systems. Since NAC is an on-policy learning method, however, a new sample sequence is required for estimating sufficient statistics in the policy gradient after the current policy or the policy's parameterization is modified; otherwise the gradient estimate is biased. Moreover, the control of exploration and exploitation should be performed by a direct operation on the policy, which may be a large constraint on introducing an exploratory factor. To overcome these problems, we propose an off-policy NAC in this article, in which the policy gradient is estimated without any asymptotic bias by using past system trajectories obtained under the control of past policies. In addition, this method allows the exploration to be controlled from outside of the policy optimization; this is more effective than the direct policy operation. We also propose a hierarchical parameterization of the policy and an off-policy NAC learning for that policy. Computer experiments using a snake-like robot simulator show that our new off-policy NAC is so effective that the number of required trajectories is much smaller than that required by the on-policy method. Moreover, even when the policy parameterization is high dimensional, our hierarchical off-policy NAC learning successfully obtains a good policy, by controlling the policy's effective dimensionality dynamically. These improvements allow a more efficient application in high-dimensional control problems.
Keywords: snake-like robot, RL, application
ElFakdi:2008p24405
Policy gradient based Reinforcement Learning for real autonomous underwater cable tracking
A. El-Fakdi and M. Carreras
2008 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS 2008)    3635 -- 3640  (2008)
http://ieeexplore.ieee.org/search/srchabstract.jsp?arnumber=4650873&isnumber=4650570&punumber=4637508&k2dockey=4650873@ieeecnfs
http://dx.doi.org/10.1109/IROS.2008.4650873
This paper proposes a field application of a high-level reinforcement learning (RL) control system for solving the action selection problem of an autonomous robot in a cable tracking task. The learning system is characterized by using a direct policy search method for learning the internal state/action mapping. Policy-only algorithms may suffer from long convergence times when dealing with real robotics. In order to speed up the process, the learning phase has been carried out in a simulated environment and, in a second step, the policy has been transferred and tested successfully on a real robot. Future work plans to continue the learning process on-line on the real robot while performing the mentioned task. We demonstrate its feasibility with real experiments on the underwater robot ICTINEU AUV.
Keywords: underwater cable tracking, RL, application
PaterninaArboleda:2008p24039
Simulation-optimization using a reinforcement learning approach
C. Paternina-Arboleda and J. Montoya-Torres and A. Fabregas-Ariza
2008 Winter Simulation Conference (WSC 2008)    1376 -- 1383  (2008)
http://ieeexplore.ieee.org/search/srchabstract.jsp?arnumber=4736213&isnumber=4736042&punumber=4712635&k2dockey=4736213@ieeecnfs
http://dx.doi.org/10.1109/WSC.2008.4736213
The global optimization of complex systems such as industrial systems often necessitates the use of computer simulation. In this paper, we suggest the use of reinforcement learning (RL) algorithms and artificial neural networks for the optimization of simulation models. Several types of variables are taken into account in order to find global optimum values. After a first evaluation through mathematical functions with known optima, the benefits of our approach are illustrated through the example of an inventory control problem frequently found in manufacturing systems. Single-item and multi-item inventory cases are considered. The efficiency of the proposed procedure is compared against a commercial tool.
Keywords: inventory control, RL, application
Kephart:2007p4269
Coordinating multiple autonomic managers to achieve specified power-performance tradeoffs
J. Kephart and H. Chan and R. Das and D. Levine and G. Tesauro and F. Rawson and C. Lefurgy
Proceedings of the Fourth International Conference on Autonomic Computing    24  (2007)
Keywords: power management, RL, application
Murray:2001p23935
Globally convergent approximate dynamic programming applied to an autolander
J. Murray and C. Cox and R. Saeks and G. Lendaris
Proceedings of the 2001 American Control Conference  4  2901 -- 2906  (2001)
http://ieeexplore.ieee.org/search/srchabstract.jsp?arnumber=946342&isnumber=20472&punumber=7520&k2dockey=946342@ieeecnfs
http://dx.doi.org/10.1109/ACC.2001.946342
A globally convergent nonlinear approximate dynamic programming algorithm is described, and an implementation of the algorithm in the linear case is developed. The resultant linear approximate dynamic programming algorithm is illustrated via the design of an autolander for the NASA X-43 research aircraft, without a priori knowledge of the X-43's flight dynamics.
Keywords: autolander, aircraft control, RL, application
QinminYang:2008p23987
A Suite of Robust Controllers for the Manipulation of Microscale Objects
Q. Yang and S. Jagannathan
IEEE Transactions on Systems, Man, and Cybernetics, Part B  38  113 -- 125  (2008)
http://ieeexplore.ieee.org/search/srchabstract.jsp?arnumber=4389965&isnumber=4432675&punumber=3477&k2dockey=4389965@ieeejrns
http://dx.doi.org/10.1109/TSMCB.2007.909943
A suite of novel robust controllers is introduced for the pickup operation of microscale objects in a microelectromechanical system (MEMS). In MEMS, adhesive, surface tension, friction, and van der Waals forces are dominant. Moreover, these forces are typically unknown. The proposed robust controller overcomes the unknown contact dynamics and ensures its performance in the presence of actuator constraints by assuming that the upper bounds on these forces are known. On the other hand, for the robust adaptive critic-based neural network (NN) controller, the unknown dynamic forces are estimated online. It consists of an action NN for compensating the unknown system dynamics and a critic NN for approximating a certain strategic utility function and tuning the action NN weights. By using the Lyapunov approach, the uniform ultimate boundedness of the closed-loop manipulation error is shown for all the controllers for the pickup task. To imitate a practical system, a few system states are considered to be unavailable due to the presence of measurement noise. An output feedback version of the adaptive NN controller is proposed by exploiting the separation principle through a high-gain observer design. The problem of measurement noise is also overcome by constructing a reference system. Simulation results are presented and compared to substantiate the theoretical conclusions.
Keywords: microactuators, RL, micromanipulators, application
Frampton:2008p24259
Using dialogue acts to learn better repair strategies for spoken dialogue systems
M. Frampton and O. Lemon
2008 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP 2008)    5045 -- 5048  (2008)
http://ieeexplore.ieee.org/search/srchabstract.jsp?arnumber=4518792&isnumber=4517521&punumber=4505270&k2dockey=4518792@ieeecnfs
http://dx.doi.org/10.1109/ICASSP.2008.4518792
Repair or error-recovery strategies are an important design issue in spoken dialogue systems (SDSs) - how to conduct the dialogue when there is no progress (e.g. due to repeated ASR errors). Nearly all current SDSs use hand-crafted repair rules, but a more robust approach is to use reinforcement learning (RL) for data-driven dialogue strategy learning. However, as well as usually being tested only in simulation, current RL approaches use small state spaces which do not contain linguistically motivated features such as "dialogue acts" (DAs). We show that a strategy learned with DA features outperforms hand-crafted and slot-status strategies when tested with real users (+9% average task completion, p < 0.05). We then explore how using DAs produces better repair strategies, e.g. focus-switching. We show that DAs are useful in deciding both when to use a repair strategy, and which one to use.
Keywords: spoken dialogue systems, RL, application
Peroni:2005p23889
Optimal control of a fed-batch bioreactor using simulation-based approximate dynamic programming
C. Peroni and N. Kaisare and J. Lee
IEEE Transactions on Control Systems Technology  13  786 -- 790  (2005)
http://ieeexplore.ieee.org/search/srchabstract.jsp?arnumber=1501862&isnumber=32226&punumber=87&k2dockey=1501862@ieeejrns
http://dx.doi.org/10.1109/TCST.2005.852105
In this brief, we extend the simulation-based approximate dynamic programming (ADP) method to optimal feedback control of fed-batch reactors. We consider a free-end problem, wherein the batch time is considered in finding the optimal feeding strategy in addition to the final time productivity. In ADP, the optimal solution is parameterized in the form of profit-to-go function. The original definition of profit-to-go is modified to include the decision of batch termination. Simulations from heuristic feeding policies generate the initial profit-to-go versus state data. An artificial neural network then approximates profit-to-go as a function of process state. Iterations of the Bellman equation are used to improve the profit-to-go function approximator. The profit-to-go function approximator thus obtained, is then implemented in an online controller. This method is applied to cloned invertase expression in Saccharomyces cerevisiae in a fed-batch bioreactor.
Keywords: fed-batch reactors, bioreactors, RL, application
Esfahani:2008p24015
Widest K-Shortest Paths Q-Routing: A New QoS Routing Algorithm in Telecommunication Networks
A. Esfahani and M. Analoui
2008 International Conference on Computer Science and Software Engineering  4  1032 -- 1035  (2008)
http://ieeexplore.ieee.org/search/srchabstract.jsp?arnumber=4722795&isnumber=4722542&punumber=4721667&k2dockey=4722795@ieeecnfs
http://dx.doi.org/10.1109/CSSE.2008.1264
Today, various kinds of sources (such as voice, video or data) with diverse traffic characteristics and quality of service (QoS) requirements, multiplexed at very high rates, lead to significant traffic problems such as packet losses, transmission delays and delay variations, caused mainly by congestion in the networks. The prediction of these problems in real time is quite difficult, making the effectiveness of "traditional" methodologies based on analytical models questionable. This article proposes and evaluates a QoS routing policy for communication networks with irregular topology and traffic, called widest K-shortest paths Q-routing. The technique used to evaluate the reinforcement signals is Q-learning. Compared to standard Q-routing, the exploration of paths is limited to the K best loop-free paths in terms of hop count (number of routers in a path), leading to a substantial reduction of convergence time. This work thus proposes a routing scheme, based on reinforcement learning, that improves the delay factor. We use Q-learning as the reinforcement learning technique and introduce the K-shortest-paths idea into the learning process. The proposed algorithm is applied to two different topologies, and OPNET is used to evaluate its performance. The evaluation is done for two traffic conditions, namely low load and high load.
Keywords: network routing, RL, application
Ranjan:2008p24387
Reinforcement learning for dynamic channel allocation in mobile cellular systems
R. Ranjan and A. Phophalia
2008 International Conference on Recent Advances in Microwave Theory and Applications (MICROWAVE 2008)    924 -- 927  (2008)
http://ieeexplore.ieee.org/search/srchabstract.jsp?arnumber=4763228&isnumber=4762962&punumber=4749420&k2dockey=4763228@ieeecnfs
http://dx.doi.org/10.1109/AMTA.2008.4763228
In cellular communication systems, an important problem is to allocate the communication resource (bandwidth) so as to maximize the service provided to a set of mobile callers whose demand for service changes randomly. This problem is formulated as a dynamic programming problem and we use a reinforcement learning (RL) method to find dynamic channel allocation policies that are better than previous heuristic solutions. The policies obtained perform well for a broad variety of call traffic patterns. The superior performance of the proposed technique in terms of empirical blocking probability is illustrated in simulation examples.
Keywords: application, dynamic channel allocation, mobile radio, RL, cellular radio
Draper:1996p10670
Modeling object recognition as a Markov decision process
B. A. Draper
ICPR'96      (1996)
http://ieeexplore.ieee.org/xpls/abs_all.jsp?arnumber=547241
Keywords: vision, rooftop recognition, RL, application
Padhi:2003p23907
Approximate dynamic programming based optimal neurocontrol synthesis of a chemical reactor process using proper orthogonal decomposition
R. Padhi and S. Balakrishnan
Proceedings of the 2003 International Joint Conference on Neural Networks  3  1891 -- 1896  (2003)
http://ieeexplore.ieee.org/search/srchabstract.jsp?arnumber=1223696&isnumber=27485&punumber=8672&k2dockey=1223696@ieeecnfs
http://dx.doi.org/10.1109/IJCNN.2003.1223696
The concept of approximate dynamic programming and adaptive critic neural network based optimal controller is extended in this study to include systems governed by partial differential equations. An optimal controller is synthesized for a dispersion type tubular chemical reactor, which is governed by two coupled nonlinear partial differential equations. It consists of three steps: First, empirical basis functions are designed using the 'Proper Orthogonal Decomposition' technique and a low-order lumped parameter system to represent the infinite-dimensional system is obtained by carrying out a Galerkin projection. Second, approximate dynamic programming technique is applied in a discrete time framework, followed by the use of a dual neural network structure called adaptive critics, to obtain optimal neurocontrollers for this system. In this structure, one set of neural networks captures the relationship between the state variables and the control, whereas the other set captures the relationship between the state and the costate variables. Third, the lumped parameter control is then mapped back to the spatial dimension using the same basis functions to result in a feedback control. Numerical results are presented that illustrate the potential of this approach. It should be noted that the procedure presented in this study can be used in synthesizing optimal controllers for a fairly general class of nonlinear distributed parameter systems.
Keywords: chemical reactor, RL, application
Wen:2009p23963
A Machine Learning Method for Dynamic Traffic Control and Guidance on Freeway Networks
K. Wen and S. Qu and Y. Zhang
2009 International Asia Conference on Informatics in Control, Automation and Robotics (CAR '09)    67 -- 71  (2009)
http://ieeexplore.ieee.org/search/srchabstract.jsp?arnumber=4777196&isnumber=4777174&punumber=4777173&k2dockey=4777196@ieeecnfs
http://dx.doi.org/10.1109/CAR.2009.96
A distributed approach to Reinforcement Learning in tasks of ramp metering and dynamic route guidance is presented. The problem domain, a freeway integration control application, is formulated as a distributed reinforcement learning problem. The DRL .....
Keywords: freeway network, application, traffic control, ramp metering, RL, dynamic guidance
Tsitsiklis:1999p8185
Optimal stopping of Markov processes: Hilbert space theory, approximation algorithms, and an application to pricing high-dimensional financial derivatives
J. Tsitsiklis and B. van Roy
IEEE Transactions on Automatic Control  44  1840 -- 1851  (1999)
Keywords: pricing, Asian option, RL, application
WeiminLi:2008p23743
Dynamic energy management for hybrid electric vehicle based on approximate dynamic programming
W. Li and G. Xu and Z. Wang and Y. Xu
7th World Congress on Intelligent Control and Automation (WCICA 2008)    7864 -- 7869  (2008)
http://ieeexplore.ieee.org/search/srchabstract.jsp?arnumber=4594156&isnumber=4592780&punumber=4577718&k2dockey=4594156@ieeecnfs
http://dx.doi.org/10.1109/WCICA.2008.4594156
In this paper, an approximate dynamic programming (ADP) based strategy for real-time energy control of parallel hybrid electric vehicles (HEV) is presented. The aim is to develop a fuel-optimal control which does not rely on a priori knowledge of the future driving conditions (global optimal control), but only on the current system operation. Approximate dynamic programming is an on-line learning method, which controls the system while simultaneously learning its characteristics in real time. A suboptimal energy control is then obtained with a proper definition of a cost function to be minimized at each time instant. The cost function includes the fuel consumption, emissions and the deviation of battery SOC (state of charge). Our approach guarantees an optimization of vehicle performance and an adaptation to driving conditions. Simulation results over standard driving cycles are presented to demonstrate the effectiveness of the proposed stochastic approach. It was found that the obtained ADP control algorithm outperforms a traditional rule-based control strategy.
Keywords: fuel-optimal control of automotive engines, parallel hybrid electric vehicles, RL, application
XinXu:2008p23724
Self-learning path-tracking control of autonomous vehicles using kernel-based approximate dynamic programming
X. Xu and H. Zhang and B. Dai and H. He
2008 IEEE International Joint Conference on Neural Networks (IJCNN 2008, IEEE World Congress on Computational Intelligence)    2182 -- 2189  (2008)
http://ieeexplore.ieee.org/search/srchabstract.jsp?arnumber=4634099&isnumber=4633757&punumber=4625775&k2dockey=4634099@ieeecnfs
http://dx.doi.org/10.1109/IJCNN.2008.4634099
With the fast development of robotics and intelligent vehicles, there has been much research work on modeling and motion control of autonomous vehicles. However, due to model complexity, and unknown disturbances from dynamic environment, the motion control of autonomous vehicles is still a difficult problem. In this paper, a novel self-learning path-tracking control method is proposed for a car-like robotic vehicle, where kernel-based approximate dynamic programming (ADP) is used to optimize the controller performance with little prior knowledge on vehicle dynamics. The kernel-based ADP method is a recently developed reinforcement learning algorithm called kernel least-squares policy iteration (KLSPI), which uses kernel methods with automatic feature selection in policy evaluation to get better generalization performance and learning efficiency. By using KLSPI, the lateral control performance of the robotic vehicle can be optimized in a self-learning and data-driven style. Compared with previous learning control methods, the proposed method has advantages in learning efficiency and automatic feature selection. Simulation results show that the proposed method can obtain an optimized path-tracking control policy only in a few iterations, which will be very practical for real applications.
Keywords: mobile robotics, RL, application
DerongLiu:2005p23877
A self-learning call admission control scheme for CDMA cellular networks
D. Liu and Y. Zhang and H. Zhang
IEEE Transactions on Neural Networks  16  1219 -- 1228  (2005)
http://ieeexplore.ieee.org/search/srchabstract.jsp?arnumber=1510721&isnumber=32346&punumber=72&k2dockey=1510721@ieeejrns
http://dx.doi.org/10.1109/TNN.2005.853408
In the present paper, a call admission control scheme that can learn from the network environment and user behavior is developed for code division multiple access (CDMA) cellular networks that handle both voice and data services. The idea is built upon a novel learning control architecture with only a single module instead of two or three modules in adaptive critic designs (ACDs). The use of an adaptive critic approach for call admission control in wireless cellular networks is new. The call admission controller can perform learning in real-time as well as in offline environments, and the controller improves its performance as it gains more experience. Another important contribution in the present work is the choice of utility function for the present self-learning control approach, which makes the present learning process much more efficient than existing learning control methods. The performance of our algorithm will be shown through computer simulation and compared with existing algorithms.
Keywords: CDMA cellular networks, call admission control, RL, application
Gosavi:2004p8079
A reinforcement learning algorithm based on policy iteration for average reward: Empirical results with yield management and convergence analysis
A. Gosavi
Mach Learn  55  5--29  (2004)
Keywords: airline yield management, RL, application
Marbach:1998p17964
Reinforcement Learning for Call Admission Control and Routing in Integrated Service Networks
P. Marbach and O. Mihatsch and M. Schulte and J. Tsitsiklis
Advances in Neural Information Processing Systems      (1998)
http://www.cs.toronto.edu/~marbach/PUBL/nips97.ps.gz
Keywords: routing, call admission control, RL, application
Williams:2007p23764
Approximate Dynamic Programming for Communication-Constrained Sensor Network Management
J. Williams and J. Fisher and A. Willsky
IEEE Transactions on Signal Processing  55  4300 -- 4311  (2007)
http://ieeexplore.ieee.org/search/srchabstract.jsp?arnumber=4276992&isnumber=4276952&punumber=78&k2dockey=4276992@ieeejrns
http://dx.doi.org/10.1109/TSP.2007.896099
Resource management in distributed sensor networks is a challenging problem. This can be attributed to the fundamental tradeoff between the value of information contained in a distributed set of measurements versus the energy costs of acquiring measurements, fusing them into the conditional probability density function (pdf) and transmitting the updated conditional pdf. Communications is commonly the highest contributor among these costs, typically by orders of magnitude. Failure to consider this tradeoff can significantly reduce the operational lifetime of a sensor network. While a variety of methods have been proposed that treat a subset of these issues, the approaches are indirect and usually consider at most a single time step. In the context of object tracking with a distributed sensor network, we propose an approximate dynamic programming approach that integrates the value of information and the cost of transmitting data over a rolling time horizon. We formulate this tradeoff as a dynamic program and use an approximation based on a linearization of the sensor model about a nominal trajectory to simultaneously find a tractable solution to the leader node selection problem and the sensor subset selection problem. Simulation results demonstrate that the resulting algorithm can provide similar estimation performance to that of the common most informative sensor selection method for a fraction of the communication cost.
Keywords: resource management, sensor network, RL, application
JiaMa:2008p23765
Dual Heuristic Programming Based Neurocontroller for Vibration Isolation Control
J. Ma and T. Yang and Z. Hou and M. Tan and D. Liu
2008 IEEE International Conference on Networking, Sensing and Control (ICNSC 2008)    874 -- 879  (2008)
http://ieeexplore.ieee.org/search/srchabstract.jsp?arnumber=4525339&isnumber=4525150&punumber=4489617&k2dockey=4525339@ieeecnfs
http://dx.doi.org/10.1109/ICNSC.2008.4525339
The dual heuristic programming (DHP) approach has a superior capability to solve approximate dynamic programming problems in the family of adaptive critic designs (ACDs). In this paper, a DHP based controller is developed for vibration isolation applications. In the specific case of an active-passive isolator, a multilayer feedforward neural network is pre-trained as a differentiable model of the plant for the adaptation of the critic and action networks. In addition, in order to avoid plunging into local minima during the training process, pseudorandom signals, which represent the vibrations of the base, are applied to the vibration isolation system. This technique greatly improves the robustness of the DHP controller against unknown disturbances. Moreover, the "shadow critic" training strategy is adopted to improve the convergence rate of the training. Simulation results have shown that the developed DHP controller alleviates vibration disturbances more effectively and has better control performance in comparison with the passive isolator. Additionally, as compared with one adapted on-line, the pre-trained model network makes the training more efficient. Therefore, it further demonstrates the effectiveness of the DHP controller designed with the pre-trained model network even in the presence of unmodeled uncertainties.
Keywords: vibration isolation, RL, application
WeiSongLin:2008p23733
Adaptive critic motion controller based on sparse radial basis function network
W. Lin and C. Tu
2008 World Automation Congress (WAC 2008)    1 -- 9  (2008)
http://ieeexplore.ieee.org/search/srchabstract.jsp?arnumber=4698998&isnumber=4698939&punumber=4681917&k2dockey=4698998@ieeecnfs
Motion controllers capable of incremental learning and optimization can automatically tune their parameters to pursue optimal control. By implementing reinforcement learning and approximate dynamic programming, an adaptive critic motion controller is shown to be able to achieve this objective. The control policy and the adaptive critic are implemented by sparse radial basis function networks. The policy and the critic updating rules are derived. The ability and performance of the adaptive critic motion controller are demonstrated by the control of a rotary inverted pendulum system.
Keywords: rotary inverted pendulum, RL, application
Rashid:2008p24379
Investigating a Dynamic Loop Scheduling with Reinforcement Learning Approach to Load Balancing in Scientific Applications
M. Rashid and I. Banicescu and R. Cario
2008 International Symposium on Parallel and Distributed Computing (ISPDC '08)    123 -- 130  (2008)
http://ieeexplore.ieee.org/search/srchabstract.jsp?arnumber=4724238&isnumber=4724214&punumber=4724213&k2dockey=4724238@ieeecnfs
http://dx.doi.org/10.1109/ISPDC.2008.25
The advantages of integrating reinforcement learning (RL) techniques into scientific parallel time-stepping applications have been revealed in research work over the past few years. The object of the integration was to automatically select the most appropriate dynamic loop scheduling (DLS) algorithm from a set of available algorithms with the purpose of improving the application performance via load balancing during the application execution. This paper investigates the performance of such a dynamic loop scheduling with reinforcement learning (DLS-with-RL) approach to load balancing. The DLS-with-RL is most suitable for use in time-stepping scientific applications with large number of steps. The RL agent's characteristics depend on a learning rate parameter and a discount factor parameter. An application simulating wavepacket dynamics that incorporates a DLS-with-RL approach is allowed to execute on a cluster of workstations to investigate the influences of these parameters. The RL agent implemented two RL algorithms: QLEARN and SARSA learning. Preliminary results indicate that on a fixed number of processors, the simulation completion time is not sensitive to the values of the learning parameters used in the experiments. The results also indicate that for this application, there is no advantage of choosing one RL technique over another, even though the techniques differed significantly in the number of times they selected the various DLS algorithms.
Keywords: scientific parallel time-stepping application, resource allocation, application, processor scheduling, RL
Papadaki:2005p23883
Resource management in CDMA networks based on approximate dynamic programming
K. Papadaki and V. Friderikos
The 14th IEEE Workshop on Local and Metropolitan Area Networks (LANMAN 2005)    6 pp.  (2005)
http://ieeexplore.ieee.org/search/srchabstract.jsp?arnumber=1541536&isnumber=32915&punumber=10349&k2dockey=1541536@ieeecnfs
http://dx.doi.org/10.1109/LANMAN.2005.1541536
In this paper a power and rate control scheme for downlink packet transmission in CDMA networks is proposed. Under the assumption of stochastic packet arrivals and channel states, the base station transmits to multiple mobile users at any time instant within rate and power capacity constraints. The objective is to maximize system throughput, while taking into account the queue length distribution over a time horizon. We are interested in optimal rate allocation policies over time and thus we formulate the problem as a discrete stochastic dynamic program. This dynamic program (DP) is exponentially complex in the number of users, which renders it impractical, and therefore we use an approximate dynamic programming algorithm to obtain sub-optimal rate allocation policies in real time. Numerical results reveal that the proposed algorithm increased performance (in terms of a number of different measured parameters such as average queue size) by a factor of at least 3.5 compared to a number of different baseline greedy heuristics.
Keywords: downlink packet transmission, application, CDMA networks, RL, rate control, power
Lu:2008p23750
Direct Heuristic Dynamic Programming for Damping Oscillations in a Large Power System
C. Lu and J. Si and X. Xie
IEEE Transactions on Systems, Man, and Cybernetics, Part B  38  1008 -- 1013  (2008)
http://ieeexplore.ieee.org/search/srchabstract.jsp?arnumber=4554034&isnumber=4567535&punumber=3477&k2dockey=4554034@ieeejrns
http://dx.doi.org/10.1109/TSMCB.2008.923157
This paper applies a neural-network-based approximate dynamic programming method, namely, the direct heuristic dynamic programming (direct HDP), to a large power system stability control problem. The direct HDP is a learning- and approximation-based approach to addressing nonlinear coordinated control under uncertainty. One of the major design parameters, the controller learning objective function, is formulated to directly account for network-wide low-frequency oscillation with the presence of nonlinearity, uncertainty, and coupling effect among system components. Results include a novel learning control structure based on the direct HDP with applications to two power system problems. The first case involves static var compensator supplementary damping control, which is used to provide a comprehensive evaluation of the learning control performance. The second case aims at addressing a difficult complex system challenge by providing a new solution to a large interconnected power network oscillation damping control problem that frequently occurs in the China Southern Power Grid.
Keywords: oscillation damping, power system stabilization, application, power network, RL, static var compensator supplementary damping control
DiGiovanna:2009p24202
Coadaptive Brain--Machine Interface via Reinforcement Learning
J. DiGiovanna and B. Mahmoudi and J. Fortes and J. Principe and J. Sanchez
IEEE Transactions on Biomedical Engineering  56  54 -- 64  (2009)
http://ieeexplore.ieee.org/search/srchabstract.jsp?arnumber=4540104&isnumber=4783498&punumber=10&k2dockey=4540104@ieeejrns
http://dx.doi.org/10.1109/TBME.2008.926699
This paper introduces and demonstrates a novel brain--machine interface (BMI) architecture based on the concepts of reinforcement learning (RL), coadaptation, and shaping. RL allows the BMI control algorithm to learn to complete tasks from interactions with the environment, rather than an explicit training signal. Coadaptation enables continuous, synergistic adaptation between the BMI control algorithm and BMI user working in changing environments. Shaping is designed to reduce the learning curve for BMI users attempting to control a prosthetic. Here, we present the theory and in vivo experimental paradigm to illustrate how this BMI learns to complete a reaching task using a prosthetic arm in a 3-D workspace based on the user's neuronal activity. This semisupervised learning framework does not require user movements. We quantify BMI performance in closed-loop brain control over six to ten days for three rats as a function of increasing task difficulty. All three subjects coadapted with their BMI control algorithms to control the prosthetic significantly above chance at each level of difficulty.
Keywords: man-machine systems, brain-computer interfaces, RL, application
Giupponi:2009p23983
Fuzzy Neural Control for Economic-Driven Radio Resource Management in Beyond 3G Networks
L. Giupponi and R. Agustí and J. Pérez-Romero and O. Sallent
IEEE Transactions on Systems, Man, and Cybernetics, Part C: Applications and Reviews  39  170 -- 189  (2009)
http://ieeexplore.ieee.org/search/srchabstract.jsp?arnumber=4787023&isnumber=4787649&punumber=5326&k2dockey=4787023@ieeejrns
http://dx.doi.org/10.1109/TSMCC.2008.2001724
Joint radio resource management (JRRM) is the envisaged process aimed at optimizing the radio resource usage of wireless systems to satisfy the requirements of both the network operators and the users in the context of future generation wireless networks. In particular, this paper proposes a two-layered JRRM framework to improve the efficiency of multiradio and multioperator cellular scenarios. On the one hand, the intraoperator JRRM relies on fuzzy neural mechanisms with economic-driven reinforcement learning techniques to exploit radio resources within a single-operator domain. Microeconomic concepts are included in the proposed approach so that user profile differentiation can be considered when making a JRRM decision. On the other hand, interoperator JRRM enables subscribers to obtain service through other operators, if the home operator network is blocked. Simulation results in a number of different scenarios show that interoperator agreements established in a cooperative scenario benefit both the operators and users, which enables efficient load management and increased operator revenue.
Keywords: joint radio resource management, RL, application
Usynin:2008p24298
Prognostics-Driven Optimal Control for Equipment Performing in Uncertain Environment
A. Usynin and J. Hines and A. Urmanov
2008 IEEE Aerospace Conference    1 -- 9  (2008)
http://ieeexplore.ieee.org/search/srchabstract.jsp?arnumber=4526626&isnumber=4526225&punumber=4505446&k2dockey=4526626@ieeecnfs
http://dx.doi.org/10.1109/AERO.2008.4526626
This paper discusses the problem of optimal control for systems performing in uncertain environments, where little information is available regarding the system dynamics. A reinforcement learning approach is proposed to tackle the problem. A particular method to incorporate Prognostics and Health Management (PHM) information derived from the system of interest is proposed to improve the reinforcement learning routine. The ideas behind reinforcement learning-based search for optimal control strategies are outlined. A numerical example illustrating the benefits of using PHM information is given.
Keywords: reliability, aerospace control, RL, application
Costa:2008p24329
RL-Based Scheduling Strategies in Actual Grid Environments
B. Costa and I. Dutra and M. Mattoso
2008 International Symposium on Parallel and Distributed Processing with Applications (ISPA '08)    572 -- 577  (2008)
http://ieeexplore.ieee.org/search/srchabstract.jsp?arnumber=4725196&isnumber=4725093&punumber=4725092&k2dockey=4725196@ieeecnfs
http://dx.doi.org/10.1109/ISPA.2008.119
In this work, we study the behaviour of different resource scheduling strategies when doing job orchestration in grid environments. We empirically demonstrate that scheduling strategies based on reinforcement learning are a good choice to improve the overall performance of grid applications and resource utilization.
Keywords: grid computing, scheduling, RL, application
Powell:2005p23882
Approximate dynamic programming for high dimensional resource allocation problems
W. Powell and A. George and B. Bouzaiene-Ayari and H. Simao
Proceedings of the 2005 IEEE International Joint Conference on Neural Networks (IJCNN '05)  5  2989 -- 2994  (2005)
http://ieeexplore.ieee.org/search/srchabstract.jsp?arnumber=1556401&isnumber=33093&punumber=10421&k2dockey=1556401@ieeecnfs
http://dx.doi.org/10.1109/IJCNN.2005.1556401
There are wide arrays of discrete resource allocation problems (buffers in manufacturing, complex equipment in electric power, aircraft and locomotives in transportation) which need to be solved over time, under uncertainty. These can be formulated as dynamic programs, but typically exhibit high dimensional state, action and outcome variables (the three curses of dimensionality). For example, we have worked on problems where the dimensionality of these variables is in the ten thousand to one million range. We describe an approximation methodology for this problem class, and summarize the problem classes where the approach seems to be working well, and research challenges that we continue to face.
Keywords: resource allocation, overview, RL, application
Powell:2007p23827
The optimizing-simulator: Merging simulation and optimization using approximate dynamic programming
W. Powell
2007 Winter Simulation Conference    43 -- 53  (2007)
http://ieeexplore.ieee.org/search/srchabstract.jsp?arnumber=4419587&isnumber=4419576&punumber=4419575&k2dockey=4419587@ieeecnfs
http://dx.doi.org/10.1109/WSC.2007.4419587
There is a wide range of simulation problems that involve making decisions during the simulation, where we would like to make the best decisions possible, taking into account not only what we know when we make the decision, but also the impact of the decision on the future. Such problems can be formulated as dynamic programs, stochastic programs and optimal control problems, but these techniques rarely produce computationally tractable algorithms. We demonstrate how the framework of approximate dynamic programming can produce near-optimal (in some cases) or at least high quality solutions using techniques that are very familiar to the simulation community. The price of this challenge is that the simulation has to be run iteratively, using statistical learning techniques to produce the desired intelligence. The benefit is a reduced dependence on more traditional rule-based logic.
Keywords: overview, RL, application
Jasmin:2008p23950
A Reinforcement Learning algorithm to economic dispatch considering transmission losses
E. Jasmin and T. I. Ahamed and V. Jagathiraj
TENCON 2008 -- IEEE Region 10 Conference    1 -- 6  (2008)
http://ieeexplore.ieee.org/search/srchabstract.jsp?arnumber=4766652&isnumber=4766377&punumber=4753725&k2dockey=4766652@ieeecnfs
http://dx.doi.org/10.1109/TENCON.2008.4766652
Reinforcement Learning (RL) refers to a class of learning algorithms in which a learning system learns which action to take in different situations by using a scalar evaluation received from the environment on performing an action. RL has been successfully applied to many multi-stage decision making problems (MDPs) where in each stage the learning system decides which action has to be taken. The Economic Dispatch (ED) problem is an important scheduling problem in power systems, which decides the amount of generation to be allocated to each generating unit so that the total cost of generation is minimized without violating system constraints. In this paper we formulate the economic dispatch problem as a multi-stage decision making problem and develop an RL based algorithm to solve it. The performance of our algorithm is compared with other recent methods. The main advantage of our method is that it can learn the schedule for all possible demands simultaneously.
Keywords: economic dispatch, application, power generation scheduling, RL, power generation dispatch
Chang:2007p16093
An asymptotically efficient simulation-based algorithm for finite horizon stochastic dynamic programming
H. Chang and M. Fu and J. Hu and S. Marcus
IEEE Trans. on Automatic Control      (2007)
http://www.isr.umd.edu/~Marcus/docs/samw_final.pdf
Keywords: finite-horizon inventory control problem, RL, application
JunMoKang:1999p23919
Approximate dynamic programming solutions for lean burn engine aftertreatment
J. Kang and I. Kolmanovsky and J. Grizzle
Proceedings of the 38th IEEE Conference on Decision and Control  2  1703 -- 1708  (1999)
http://ieeexplore.ieee.org/search/srchabstract.jsp?arnumber=830269&isnumber=18006&punumber=6713&k2dockey=830269@ieeecnfs
http://dx.doi.org/10.1109/CDC.1999.830269
The competition to deliver fuel efficient and environmentally friendly vehicles is driving the automotive industry to consider ever more complex powertrain systems. Adequate performance of these new highly interactive systems can no longer be obtained through traditional approaches, which are intensive in hardware use and final control software calibration. The paper explores the use of dynamic programming to make model-based design decisions for a lean burn, direct injection spark ignition engine, in combination with a three way catalyst and lean NOx trap aftertreatment system. The primary contribution is the development of a very rapid method to evaluate the tradeoffs in fuel economy and emissions for this novel powertrain system, as a function of design parameters and controller structure, over a standard emission test cycle
Keywords: fuel economy, engine control, RL, application
Kovalenko:2004p23888
Neural network modeling and adaptive critic control of automotive fuel-injection systems
O. Kovalenko and D. Liu and H. Javaherian
Proceedings of the 2004 IEEE International Symposium on Intelligent Control    368 -- 373  (2004)
http://ieeexplore.ieee.org/search/srchabstract.jsp?arnumber=1387711&isnumber=30194&punumber=9528&k2dockey=1387711@ieeecnfs
http://dx.doi.org/10.1109/ISIC.2004.1387711
We investigate the applications of a class of adaptive critic designs that can be classified as (model-free) action-dependent heuristic dynamic programming (ADHDP). Adaptive critic designs are defined as designs that approximate dynamic programming in the general case, i.e., approximate optimal control over time in noisy, nonlinear environment. The goals of the present learning control design for automotive engines include improved performance, reduced emissions, and maintained optimum performance under various operating conditions. Using the data obtained from a test vehicle, we first develop a neural network model of the engine. The neural network controller is then designed based on the idea of approximate dynamic programming to achieve optimal control. In the simulation studies, the initial controller is trained using the neural network engine model developed rather than the actual engine. We have developed and tested self-learning neural network controllers for both engine torque and exhaust air-fuel ratio control. The goal of the engine torque control is to track the commanded torque. The objective of the air-fuel ratio control is to regulate the engine air-fuel ratio at specified set points. For both control problems, good transient performance of the neural network controller has been observed. A distinct feature of the present technique is the controller's real-time adaptation capability which allows the neural network controller to be further refined and improved in real-time vehicle operation through continuous learning and adaptation.
Keywords: automotive fuel injection, RL, application
Ernst:2005p4189
Approximate Value Iteration in the Reinforcement Learning Context. Application to Electrical Power System Control
D. Ernst and M. Glavic and P. Geurts and L. Wehenkel
International Journal of Emerging Electric Power Systems      (2005)
http://www.bepress.com/ijeeps/vol3/iss1/art1066
Keywords: Electrical Power System Control, RL, application
Das:1999p10438
Solving Semi-Markov Decision Problems Using Average Reward Reinforcement Learning
T. Das and A. Gosavi and S. Mahadevan and N. Marchalleck
MANAGEMENT SCIENCE      (1999)
http://www.cs.umass.edu/~mahadeva/papers/management-sci.ps.gz
Keywords: optimal preventive maintenance schedule, production inventory system, RL, application
Koole:2005p23872
Approximate dynamic programming in multi-skill call centers
G. Koole and A. Pot
Proceedings of the 2005 Winter Simulation Conference    8 pp.  (2005)
http://ieeexplore.ieee.org/search/srchabstract.jsp?arnumber=1574297&isnumber=33294&punumber=10515&k2dockey=1574297@ieeecnfs
http://dx.doi.org/10.1109/WSC.2005.1574297
We consider a multi-skill call center consisting of specialists and fully cross-trained agents. All traffic is inbound and there is a waiting queue for each skill type. Our objective is to obtain good call routing policies. In this paper we use the so-called policy iteration (PI) method. It is applied in the context of approximate dynamic programming (ADP). The standard PI method requires the exact value function, which is well known from dynamic programming. We remark that standard methods to obtain the value function suffer from the curse of dimensionality, i.e., the number of states grows exponentially with the size of the call center. Therefore, we replace the real value function by an approximation and we use techniques from ADP. The value function is approximated by simulating the system. We apply this method to our call routing problem, yielding very good results.
Keywords: call routing, RL, application
Valasek:2008p23972
Improved Adaptive--Reinforcement Learning Control for Morphing Unmanned Air Vehicles
J. Valasek and J. Doebbler and M. Tandale and A. Meade
IEEE Transactions on Systems, Man, and Cybernetics, Part B  38  1014 -- 1020  (2008)
http://ieeexplore.ieee.org/search/srchabstract.jsp?arnumber=4554213&isnumber=4567535&punumber=3477&k2dockey=4554213@ieeejrns
http://dx.doi.org/10.1109/TSMCB.2008.922018
This paper presents an improved Adaptive--Reinforcement Learning Control methodology for the problem of unmanned air vehicle morphing control. The reinforcement learning morphing control function that learns the optimal shape change policy is integrated with an adaptive dynamic inversion control trajectory tracking function. An episodic unsupervised learning simulation using the Q-learning method is developed to replace an earlier and less accurate Actor-Critic algorithm. Sequential Function Approximation, a Galerkin-based scattered data approximation scheme, replaces a K-Nearest Neighbors (KNN) method and is used to generalize the learning from previously experienced quantized states and actions to the continuous state-action space, all of which may not have been experienced before. The improved method showed smaller errors and improved learning of the optimal shape compared to the KNN.
Keywords: morphing control, unmanned aircraft control, RL, application
Hu:2008p24295
QELAR: A Q-learning-based Energy-Efficient and Lifetime-Aware Routing Protocol for Underwater Sensor Networks
T. Hu and Y. Fei
Performance, Computing and Communications Conference, 2008. IPCCC 2008. IEEE International    247 -- 255  (2008)
http://ieeexplore.ieee.org/search/srchabstract.jsp?arnumber=4745119&isnumber=4745069&punumber=4712735&k2dockey=4745119@ieeecnfs
http://dx.doi.org/10.1109/PCCC.2008.4745119
Underwater sensor network (UWSN) has emerged as a promising network technique for various aquatic applications in recent years. Due to some constraints in UWSNs, such as high latency, low bandwidth and high energy consumption, it is challenging to build networking protocols for UWSNs. In this paper, we focus on addressing the routing issue in UWSNs. We propose an adaptive, energy-efficient, and lifetime-aware routing protocol based on reinforcement learning, QELAR. Our protocol assumes generic MAC protocols and aims at prolonging the lifetime of networks by making residual energy of sensor nodes more evenly distributed. The residual energy of each node as well as the energy distribution among a group is factored in throughout the routing process to calculate the reward function, which aids in selecting the adequate forwarders for packets. We have performed extensive simulations of the proposed protocol on the Aqua-sim platform, and compared with one existing routing protocol (VBF) in terms of packet delivery rate, energy efficiency, latency and lifetime. The results show that the QELAR protocol yields 20% longer lifetime on average than VBF.
Keywords: routing, underwater sensor network, RL, application
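Purely as a hedged sketch of the general mechanism (Q-learning over forwarder choices with residual energy folded into the reward), and not the QELAR protocol itself, one might write something like the following; the topology, reward weights, and energy cost are invented for illustration.
    import random

    # Toy network: node 0 is the source, node 3 the sink; edges list neighbours.
    NEIGHBOURS = {0: [1, 2], 1: [2, 3], 2: [3], 3: []}
    energy = {n: 1.0 for n in NEIGHBOURS}        # residual energy per node (illustrative)
    Q = {(n, m): 0.0 for n in NEIGHBOURS for m in NEIGHBOURS[n]}
    ALPHA, GAMMA, EPS, TX_COST = 0.2, 0.9, 0.1, 0.05

    def reward(nxt):
        # Energy-aware reward: favour forwarders with high residual energy and
        # give a bonus for reaching the sink (weights are made up for the sketch).
        return (10.0 if nxt == 3 else -1.0) + 2.0 * energy[nxt]

    def choose(node):
        acts = NEIGHBOURS[node]
        if random.random() < EPS:
            return random.choice(acts)
        return max(acts, key=lambda m: Q[(node, m)])

    for _ in range(2000):                        # simulate many packet deliveries
        node = 0
        while node != 3:
            nxt = choose(node)
            r = reward(nxt)
            energy[nxt] = max(0.0, energy[nxt] - TX_COST)
            future = max((Q[(nxt, m)] for m in NEIGHBOURS[nxt]), default=0.0)
            Q[(node, nxt)] += ALPHA * (r + GAMMA * future - Q[(node, nxt)])
            node = nxt

    print(sorted(Q.items()))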
Powell:2003p4182
Approximate Dynamic Programming for Asset Management: Solving the curses of dimensionality
W. B. Powell
    233  (2003)
Keywords: asset management, RL, application
JongMinLee:2004p23886
Empirical model based control of nonlinear processes using approximate dynamic programming
J. M. Lee and J. Lee
American Control Conference, 2004. Proceedings of the 2004  4  3041 -- 3046 vol.4  (2004)
http://ieeexplore.ieee.org/search/srchabstract.jsp?arnumber=1384375&isnumber=30159&punumber=9503&k2dockey=1384375@ieeecnfs
A major difficulty associated with using an empirical nonlinear model for model-based control is that the model can be unduly extrapolated into regions of the state space where identification data were scarce or even nonexistent. Optimal control solutions obtained with such over-extrapolations can result in performances far worse than predicted by the model. In the multi-step predictive control setting, it is not straightforward to prevent such overuse of the model by forcing the optimizer to find a solution within the "trusted" regions of state space. Given the difficulty, we propose an approximate dynamic programming based approach for designing a model-based controller that avoids such abusage of an empirical model with respect to the distribution of the identification data. The approach starts with closed-loop test data obtained with some suboptimal controllers, e.g., PI controllers, and attempts to derive a new control policy that improves upon their performances. Iterative improvement based on successive closed-loop testing is possible. A diabatic CSTR example is provided to illustrate the proposed approach.
Keywords: diabatic CSTR, tank reactor, application, RL, bioreactor control
HongbingWang:2008p23948
RLPLA: A Reinforcement Learning Algorithm of Web Service Composition with Preference Consideration
H. Wang and P. Tang and P. Hung
Congress on Services Part II, 2008. SERVICES-2. IEEE    163 -- 170  (2008)
http://ieeexplore.ieee.org/search/srchabstract.jsp?arnumber=4700514&isnumber=4700469&punumber=4700468&k2dockey=4700514@ieeecnfs
http://dx.doi.org/10.1109/SERVICES-2.2008.15
There are many static and dynamic Web services composition strategies; however, literature on automatic composition is very rare. In this article, a new algorithm based on reinforcement learning is proposed to realize web service composition automatically and randomly. On the other hand, the existing composition prototype systems mainly focus on function-oriented composition, but not QoS-oriented composition. After understanding the function-oriented composition by reinforcement learning, this paper then introduces preference logic to seek a QoS optimization solution, which is a kind of qualitative solution. When compared with a quantitative solution it has many advantages. The result is a novel algorithm, RLPLA, which is an algorithm of Web services composition based on reinforcement learning and preference logic.
Keywords: web services, quality of service, RL, application
Nedich:2005p23879
Farsighted sensor management strategies for move/stop tracking
A. Nedich and M. Schneider and R. Washburn
Information Fusion, 2005 8th International Conference on  1  8  (2005)
http://ieeexplore.ieee.org/search/srchabstract.jsp?arnumber=1591905&isnumber=33512&punumber=10604&k2dockey=1591905@ieeecnfs
http://dx.doi.org/10.1109/ICIF.2005.1591905
We consider the sensor management problem arising in using a multi-mode sensor to track moving and stopped targets. The sensor management problem is to determine what measurements to take in time so as to optimize the utility of the collected data. Finding the best sequence of measurements is a hard combinatorial problem due to many factors, including the large number of possible sensor actions and the complexity of the dynamics. The complexity of the dynamics is due in part to the sensor dwell-time depending on the sensor mode, targets randomly starting and stopping, and the uncertainty in the sensor detection process. For such a sensor management problem, we propose a novel, computationally efficient, farsighted algorithm based on an approximate dynamic programming methodology. The algorithm's complexity is polynomial in the number of targets. We evaluate this algorithm against a myopic algorithm optimizing an information-theoretic scoring criterion. Our simulation results indicate that the farsighted algorithm performs better with respect to the average time the track error is below a specified goal value.
Keywords: tracking of moving, application, sensor management, RL, multi-mode sensors, stopped targets
Peters:2006p10138
Policy gradient methods for robotics
J. Peters and S. Schaal
Proceedings of the IEEE/RSJ International Conference on …      (2006)
http://www-clmc.usc.edu/publications/P/peters-IROS2006.pdf
Keywords: anthropomorphic arm, hitting a baseball, RL, application
Abbeel:2006p16213
Using inaccurate models in reinforcement learning
P. Abbeel and M. Quigley and A. Ng
Proceedings of the 23rd international conference on Machine …      (2006)
http://portal.acm.org/citation.cfm?id=1143844.1143845
In the model-based policy search approach to reinforcement learning (RL), policies are found using a model (or "simulator") of the Markov decision process. However, for high-dimensional continuous-state tasks, it can be extremely difficult to build an accurate model, and thus often the algorithm returns a policy that works in simulation but not in real-life. The other extreme, model-free RL, tends to require infeasibly large numbers of real-life trials. In this paper, we present a hybrid algorithm that requires only an approximate model, and only a small number of real-life trials. The key idea is to successively "ground" the policy evaluations using real-life trials, but to rely on the approximate model to suggest local changes. Our theoretical results show that this algorithm achieves near-optimal performance in the real system, even when the model is only approximate. Empirical results also demonstrate that---when given only a crude model and a small number of real-life trials---our algorithm can obtain near-optimal performance in the real system.
Keywords: DDP, application, flight simulator, RL, RC Car
Borkar:2008p24319
A learning scheme for stationary probabilities of large markov chains with examples
V. Borkar and D. Das and A. Banik and D. Manjunath
Communication, Control, and Computing, 2008 46th Annual Allerton Conference on    1097 -- 1099  (2008)
http://ieeexplore.ieee.org/search/srchabstract.jsp?arnumber=4797682&isnumber=4797526&punumber=4786970&k2dockey=4797682@ieeecnfs
http://dx.doi.org/10.1109/ALLERTON.2008.4797682
We describe a reinforcement learning based scheme to estimate the stationary distribution of subsets of states of large Markov chains. `Split sampling' ensures that the algorithm needs to just encode the state transitions and will not need to know any other property of the Markov chain. (An earlier scheme required knowledge of the column sums of the transition probability matrix.) This algorithm is applied to analyze the stationary distribution of the states of a node in an 802.11 network.
Keywords: communication network, RL, application
Lee:2006p1549
Approximate dynamic programming based approach to process control and scheduling
J. Lee and J. Lee
Computers & Chemical Engineering  30  1603--1618  (2006)
http://dx.doi.org/10.1016/j.compchemeng.2006.05.043
Keywords: process control, scheduling, RL, application
Castaon:2008p23723
Team task allocation and routing in risky environments under human guidance
D. Castanon and D. Ahner
Decision and Control, 2008. CDC 2008. 47th IEEE Conference on    1139 -- 1144  (2008)
http://ieeexplore.ieee.org/search/srchabstract.jsp?arnumber=4739148&isnumber=4738560&punumber=4721212&k2dockey=4739148@ieeecnfs
http://dx.doi.org/10.1109/CDC.2008.4739148
In this paper we design coordination policies for unmanned vehicles that select and perform tasks in uncertain environments where vehicles may fail. We develop algorithms that accept different levels of human guidance, from simple allocation of priorities through the use of task values to more complex task partitioning and load balancing techniques. The goal is to maximize expected value completed under human guidance. We develop alternative algorithms based on approximate dynamic programming versions appropriate for each level of guidance, and compare the resulting performance using simulation results.
Keywords: unmanned vehicle coordination, RL, application
Achbany:2008p23958
Continually Learning Optimal Allocations of Services to Tasks
Y. Achbany and I. J. Jureta and S. Faulkner and F. Fouss
Services Computing, IEEE Transactions on  1  141 -- 154  (2008)
http://ieeexplore.ieee.org/search/srchabstract.jsp?arnumber=4721292&isnumber=4768635&punumber=4629386&k2dockey=4721292@ieeejrns
http://dx.doi.org/10.1109/TSC.2008.12
Open service-oriented systems which autonomously and continually satisfy users' service requests to optimal levels are an appropriate response to the need for increased automation of information systems. Given a service request, an open service-oriented system interprets the functional and nonfunctional requirements laid out in the request and identifies the optimal selection of services, that is, identifies services. These services' coordinated execution optimally satisfies the requirements in the request. When selecting services, it is relevant to: (1) revise selections as new services appear and others become unavailable; (2) use multiple criteria, including nonfunctional ones to choose among competing services; (3) base the comparisons of services on observed, instead of advertised performance; and (4) allow for uncertainty in the outcome of service executions. To address issues (1)-(4), we propose the Multi-Criteria Randomized Reinforcement Learning (MCRRL) service selection approach. MCRRL learns and revises service selections using a novel multicriteria-driven (including quality of service parameters, deadline, reputation, cost, and preferences) reinforcement learning algorithm, which integrates the exploitation of data about individual services' past performance with optimal, undirected, continual exploration of new selections that involve services whose behavior has not been observed. The experiments indicate the algorithm behaves as expected and outperforms two standard approaches.
Keywords: optimal allocations, service to task mapping, RL, application
Maes:2007p11103
Sequence Labeling with Reinforcement Learning and Ranking Algorithms
F. Maes and L. Denoyer and P. Gallinari
ECML-07      (2007)
http://www.springerlink.com/index/v82h161v6760135j.pdf
Many problems in areas such as Natural Language Processing, Information Retrieval, or Bioinformatics involve the generic task of sequence labeling. In many cases, the aim is to assign a label to each element in a sequence. Until now, this problem has mainly been addressed with Markov models and Dynamic Programming. We propose a new approach where the sequence labeling task is seen as a sequential decision process. This method is shown to be very fast with good generalization accuracy. Instead of searching for a globally optimal label sequence, we learn to construct this optimal sequence directly in a greedy fashion. First, we show that sequence labeling can be modelled using Markov Decision Processes, so that several Reinforcement Learning (RL) algorithms can be used for this task. Second, we introduce a new RL algorithm which is based on the ranking of local labeling decisions.
Keywords: Natural Language Chunking, Spanish Named Entity Recognition, application, Handwritten Recognition, RL, ranking, sequence labeling
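A minimal sketch of the general framing only: each labeling decision is one step of a left-to-right decision process, learned here with plain tabular Q-learning. The toy vowel/consonant task, crude state representation, and all parameters are invented; the ranking-based algorithm the paper introduces is not reproduced.
    import random
    from collections import defaultdict

    # Toy task: label each character of a string as vowel 'V' or consonant 'C'.
    # Each position is one decision step; the state is the current character plus
    # the previous predicted label, and the reward is +1 for a correct label.
    DATA = [("cab", "CVC"), ("abba", "VCCV"), ("deed", "CVVC")]
    TAGS = ("V", "C")
    Q = defaultdict(float)
    ALPHA, GAMMA, EPS = 0.3, 0.9, 0.1

    def learn(episodes=5000):
        for _ in range(episodes):
            word, gold = random.choice(DATA)
            prev = "^"                                   # start-of-sequence marker
            for i, ch in enumerate(word):
                state = (ch, prev)
                if random.random() < EPS:
                    tag = random.choice(TAGS)
                else:
                    tag = max(TAGS, key=lambda t: Q[(state, t)])
                r = 1.0 if tag == gold[i] else -1.0
                nxt = (word[i + 1], tag) if i + 1 < len(word) else None
                future = max(Q[(nxt, t)] for t in TAGS) if nxt else 0.0
                Q[(state, tag)] += ALPHA * (r + GAMMA * future - Q[(state, tag)])
                prev = tag

    def label(word):
        prev, out = "^", []
        for ch in word:
            tag = max(TAGS, key=lambda t: Q[((ch, prev), t)])
            out.append(tag)
            prev = tag
        return "".join(out)

    learn()
    print(label("bead"))    # expected roughly "CVVC" after training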
Lendaris:2003p23880
Controller design via adaptive critic and model reference methods
G. Lendaris and R. Santiago and J. McCarthy and M. Carroll
Neural Networks, 2003. Proceedings of the International Joint Conference on  4  3173 -- 3178 vol.4  (2003)
http://ieeexplore.ieee.org/search/srchabstract.jsp?arnumber=1224080&isnumber=27487&punumber=8672&k2dockey=1224080@ieeecnfs
http://dx.doi.org/10.1109/IJCNN.2003.1224080
Dynamic programming (DP) is a principled way to design optimal controllers for certain classes of nonlinear systems; unfortunately, DP is computationally very expensive. The reinforcement learning methods known as adaptive critics (AC) provide comput.....
Keywords: aircraft control, control augmentation subsystem, application, hypersonic shaped airplane, RL
Balakrishna:2008p24316
Estimating Taxi-out times with a reinforcement learning algorithm
P. Balakrishna and R. Ganesan and L. Sherry and B. Levy
Digital Avionics Systems Conference, 2008. DASC 2008. IEEE/AIAA 27th    3.D.3-1 -- 3.D.3-12  (2008)
http://ieeexplore.ieee.org/search/srchabstract.jsp?arnumber=4702812&isnumber=4702732&punumber=4694773&k2dockey=4702812@ieeecnfs
http://dx.doi.org/10.1109/DASC.2008.4702812
Flight delays have a significant impact on the nation's economy. Taxi-out delays in particular constitute a significant portion of the block time of a flight. In the future, it can be expected that accurate predictions of 'wheels-off' tim.....
Keywords: aircraft control, policy evaluation, application, taxi-out time prediction, RL
JihWenSheu:2008p23757
Designing Automatic Train Regulation for MRT system by adaptive critic method
J. Sheu and W. Lin
Neural Networks, 2008. IJCNN 2008. (IEEE World Congress on Computational Intelligence). IEEE International Joint Conference on    406 -- 412  (2008)
http://ieeexplore.ieee.org/search/srchabstract.jsp?arnumber=4633824&isnumber=4633757&punumber=4625775&k2dockey=4633824@ieeecnfs
http://dx.doi.org/10.1109/IJCNN.2008.4633824
Because of the disturbance of the operation environment in a mass rapid transit (MRT) system, robustness against disturbance and schedule punctuality under control constraints are important issues to be considered in designing Automatic Train Regulation (ATR) for an MRT system. In this paper, a study of a suitable traffic model for designing the ATR system is presented, together with an ATR design based on the adaptive critic design (ACD) approach to approximate dynamic programming, specifically dual heuristic programming (DHP). Moreover, a method to handle the control constraint and the use of gain scheduling to deal with the time-variant environment of the MRT system are addressed as well. For comparison, simulations with real operation environment data are carried out for ATRs designed by both the adaptive critic method and the LQ optimization method.
Keywords: automatic train regulation, mass rapid transit systems, RL, application
Wang:2006p10294
Adaptive Routing for Sensor Networks using Reinforcement Learning
P. Wang and T. Wang
Proceedings of the Sixth IEEE International Conference on Computer and Information Technology       (2006)
http://doi.ieeecomputersociety.org/10.1109/CIT.2006.34
Efficient and robust routing is central to wireless sensor networks (WSN) that feature energy-constrained nodes, unreliable links, and frequent topology change. Most existing routing techniques are designed to reduce routing cost by optimizing a single goal, e.g., routing path length, load balance, or re-transmission rate; in real scenarios, however, these factors affect the routing performance in a complex way, leading to the need for a more sophisticated scheme that makes correct trade-offs. In this paper, we present a novel routing scheme, AdaR, which adaptively learns an optimal routing strategy, depending on multiple optimization goals. We base our approach on a least squares reinforcement learning technique, which is both data efficient and insensitive to the initial setting, thus ideal for the context of ad-hoc sensor networks. Experimental results suggest a significant performance gain over a naive Q-learning based implementation.
Keywords: routing, wireless sensor networks, RL, application
Frazier:2008p23726
Approximate Dynamic Programming in Knowledge Discovery for Rapid Response
P. Frazier and W. Powell and S. Dayanik and P. Kantor
42nd Hawaii International Conference on System Sciences, 2009. HICSS '09.     1 -- 10  (2008)
http://ieeexplore.ieee.org/search/srchabstract.jsp?arnumber=4755493&isnumber=4755314&punumber=4755313&k2dockey=4755493@ieeecnfs
http://dx.doi.org/10.1109/HICSS.2009.79
One knowledge discovery problem in the rapid response setting is the cost of learning which patterns are indicative of a threat. This typically involves a detailed follow-through, such as review of documents and information by a skilled analyst, or detailed examination of a vehicle at a border crossing point, in deciding which suspicious vehicles require investigation. Assessing various strategies and decision rules means we must compare not only the short term effectiveness of interrupting a specific traveler, or forwarding a specific document to an analyst, but we must also weigh the potential improvement in our profiles that results even from sending a "false alarm". We show that this problem can be recast as a dynamic programming problem with, unfortunately, a huge state space. Several specific heuristics are introduced to provide improved approximations to the solution. The problems of obtaining real-world data to sharpen the analysis are discussed briefly.
Keywords: threat management, RL, application
Enns:2003p23903
Helicopter trimming and tracking control using direct neural dynamic programming
R. Enns and J. Si
Neural Networks, IEEE Transactions on  14  929 -- 939  (2003)
http://ieeexplore.ieee.org/search/srchabstract.jsp?arnumber=1215408&isnumber=27335&punumber=72&k2dockey=1215408@ieeejrns
http://dx.doi.org/10.1109/TNN.2003.813839
This paper advances a neural-network-based approximate dynamic programming control mechanism that can be applied to complex control problems such as helicopter flight control design. Based on direct neural dynamic programming (DNDP), an approximate dynamic programming methodology, the control system is tailored to learn to maneuver a helicopter. The paper consists of a comprehensive treatise of this DNDP-based tracking control framework and extensive simulation studies for an Apache helicopter. A trim network is developed and seamlessly integrated into the neural dynamic programming (NDP) controller as part of a baseline structure for controlling complex nonlinear systems such as a helicopter. Design robustness is addressed by performing simulations under various disturbance conditions. All designs are tested using FLYRT, a sophisticated industrial scale nonlinear validated model of the Apache helicopter. This is probably the first time that an approximate dynamic programming methodology has been systematically applied to, and evaluated on, a complex, continuous state, multiple-input multiple-output nonlinear system with uncertainty. Though illustrated for helicopters, the DNDP control system framework should be applicable to general purpose tracking control.
Keywords: Apache helicopter, helicopter control, RL, application
JuJiang:2008p24258
Altitude control of aircraft using coefficient-based policy method
J. Jiang and H. Gong and J. Liu and H. Xu and Y. Chen
Electrical and Computer Engineering, 2008. CCECE 2008. Canadian Conference on    000361 -- 000364  (2008)
http://ieeexplore.ieee.org/search/srchabstract.jsp?arnumber=4564557&isnumber=4564482&punumber=4554522&k2dockey=4564557@ieeecnfs
http://dx.doi.org/10.1109/CCECE.2008.4564557
This paper proposes a coefficient-based policy searching method, the direct policy search (DPS), for searching (learning) and constructing policies for controlling the altitude of an aircraft. The DPS is a new and efficient reinforcement learning (RL) strategy combined with genetic algorithms (GAs). Specifically, an optimal policy in DPS consists of a set of coefficients which are learned using GA-based RL (GARL). The proposed method for learning an optimal policy is demonstrated in controlling the complicated altitude system of a Boeing 747 aircraft whose solution space consists of 20 variables. Simulation results show that this new approach produces competitive performances with the traditional algorithms such as the classical state-feedback algorithm and the pure RL algorithm.
Keywords: aircraft control, altitude control, application, RL, Boeing 747
Peshkin:2007p17962
Reinforcement Learning for Adaptive Routing
L. Peshkin and V. Savova
Arxiv preprint cs      (2007)
http://arxiv.org/abs/cs/0703138
Reinforcement learning means learning a policy--a mapping of observations into actions--based on feedback from the environment. The learning can be viewed as browsing a set of policies while evaluating them by trial through interaction with the environment. We present an application of gradient ascent algorithm for reinforcement learning to a complex domain of packet routing in network communication and compare the performance of this algorithm to other routing methods on a benchmark problem.
Keywords: packet routing, RL, application
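For readers unfamiliar with policy-gradient RL, the sketch below applies generic REINFORCE with a softmax policy to a two-path routing toy. This is a stand-in under invented assumptions (two paths with fixed mean delays), not the authors' gradient-ascent algorithm or their benchmark problem.
    import math, random

    # Toy routing problem: a source picks one of two paths to the destination.
    # Path 0 has low mean delay, path 1 high; rewards are negative delays.
    MEAN_DELAY = (1.0, 3.0)
    theta = [0.0, 0.0]                 # one preference per path (softmax policy)
    ALPHA = 0.05

    def softmax(prefs):
        m = max(prefs)
        exps = [math.exp(p - m) for p in prefs]
        z = sum(exps)
        return [e / z for e in exps]

    for episode in range(3000):
        probs = softmax(theta)
        a = 0 if random.random() < probs[0] else 1
        reward = -random.expovariate(1.0 / MEAN_DELAY[a])   # sampled negative delay
        # REINFORCE update: grad log pi(a) = indicator(a) - pi(a) for each preference
        for i in range(2):
            grad = (1.0 if i == a else 0.0) - probs[i]
            theta[i] += ALPHA * reward * grad

    print(softmax(theta))   # probability mass should concentrate on the faster path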
Armstrong:2008p23943
Reinforcement learning for automated performance tuning: Initial evaluation for sparse matrix format selection
W. Armstrong and A. Rendell
Cluster Computing, 2008 IEEE International Conference on    411 -- 420  (2008)
http://ieeexplore.ieee.org/search/srchabstract.jsp?arnumber=4663802&isnumber=4663724&punumber=4655410&k2dockey=4663802@ieeecnfs
http://dx.doi.org/10.1109/CLUSTR.2008.4663802
The field of reinforcement learning has developed techniques for choosing beneficial actions within a dynamic environment. Such techniques learn from experience and do not require teaching. This paper explores how reinforcement learning techniques might be used to determine efficient storage formats for sparse matrices. Three different storage formats are considered: coordinate, compressed sparse row, and blocked compressed sparse row. Which format performs best depends heavily on the nature of the matrix and the computer system being used. To test the above a program has been written to generate a series of sparse matrices, where any given matrix performs optimally using one of the three different storage types. For each matrix several sparse matrix vector products are performed. The goal of the learning agent is to predict the optimal sparse matrix storage format for that matrix. The proposed agent uses five attributes of the sparse matrix: the number of rows, the number of columns, the number of non-zero elements, the standard deviation of non-zeroes per row and the mean number of neighbours. The agent is characterized by two parameters: an exploration rate and a parameter that determines how the state space is partitioned. The ability of the agent to successfully predict the optimal storage format is analyzed for a series of 1,000 automatically generated test matrices.
Keywords: performance tuning of algorithms, RL, application
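A loose, bandit-style sketch of the idea described above: an epsilon-greedy agent picks a storage format from coarsely partitioned matrix attributes and updates a running estimate of the observed runtime. The attribute bins, the synthetic timing model, and the relative costs of the formats are all invented for illustration and do not reflect the paper's measurements.
    import random
    from collections import defaultdict

    FORMATS = ("coo", "csr", "bcsr")
    EPS, BINS = 0.1, 4
    counts = defaultdict(int)
    value = defaultdict(float)          # running mean runtime estimate per (state, format)

    def state_of(matrix):
        # Partition matrix attributes into coarse bins (a stand-in for the paper's
        # partitioning parameter); attributes here: rows, nnz, mean nnz per row.
        rows, nnz = matrix["rows"], matrix["nnz"]
        return (min(BINS - 1, rows // 2500), min(BINS - 1, nnz // 50000),
                min(BINS - 1, int(nnz / rows) // 8))

    def fake_spmv_time(matrix, fmt):
        # Synthetic stand-in for timing a sparse matrix-vector product in a given format.
        base = {"coo": 1.3, "csr": 1.0, "bcsr": 0.8 if matrix["blocked"] else 1.5}[fmt]
        return base * matrix["nnz"] * 1e-6 * random.uniform(0.9, 1.1)

    def choose(state):
        if random.random() < EPS or not any((state, f) in value for f in FORMATS):
            return random.choice(FORMATS)
        return min(FORMATS, key=lambda f: value[(state, f)])

    for _ in range(5000):
        m = {"rows": random.randint(1000, 10000), "blocked": random.random() < 0.5}
        m["nnz"] = m["rows"] * random.randint(5, 30)
        s = state_of(m)
        fmt = choose(s)
        t = fake_spmv_time(m, fmt)
        counts[(s, fmt)] += 1
        value[(s, fmt)] += (t - value[(s, fmt)]) / counts[(s, fmt)]   # incremental mean

    probe = {"rows": 8000, "nnz": 8000 * 20, "blocked": True}
    print(min(FORMATS, key=lambda f: value[(state_of(probe), f)]))    # likely "bcsr"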
Gosavi:2004p446
Reinforcement learning for long-run average cost
A. Gosavi
European Journal of Operational Research  155  654--674  (2004)
http://dx.doi.org/10.1016/S0377-2217(02)00874-3
Keywords: application, maintenance problem, RL
Richert:2008p24251
Increasing the Autonomy of Mobile Robots by On-line Learning Simultaneously at Different Levels of Abstraction
W. Richert and O. Luke and B. Nordmeyer and B. Kleinjohann
Autonomic and Autonomous Systems, 2008. ICAS 2008. Fourth International Conference on    154 -- 159  (2008)
http://ieeexplore.ieee.org/search/srchabstract.jsp?arnumber=4488338&isnumber=4488306&punumber=4488305&k2dockey=4488338@ieeecnfs
http://dx.doi.org/10.1109/ICAS.2008.26
We present a framework that is able to handle system and environmental changes by learning autonomously at different levels of abstraction. It is able to do so in continuous and noisy environments by 1) an active strategy learning module that uses reinforcement learning and 2) a dynamically adapting skill module that proactively explores the robot's own action capabilities and thereby provides actions to the strategy module. We present results that show the feasibility of simultaneously learning low-level skills and high-level strategies in order to reach a goal while reacting to disturbances like hardware damages. Thereby, the robot drastically increases its overall autonomy.
Keywords: hierarchy, mobile robots, RL, application
Ipek:2008p24343
Self-Optimizing Memory Controllers: A Reinforcement Learning Approach
E. Ipek and O. Mutlu and J. Martinez and R. Caruana
Computer Architecture, 2008. ISCA '08. 35th International Symposium on    39 -- 50  (2008)
http://ieeexplore.ieee.org/search/srchabstract.jsp?arnumber=4556714&isnumber=4556700&punumber=4556699&k2dockey=4556714@ieeecnfs
http://dx.doi.org/10.1109/ISCA.2008.21
Efficiently utilizing off-chip DRAM bandwidth is a critical issue in designing cost-effective, high-performance chip multiprocessors (CMPs). Conventional memory controllers deliver relatively low performance in part because they often employ fixed, rigid access scheduling policies designed for average-case application behavior. As a result, they cannot learn and optimize the long-term performance impact of their scheduling decisions, and cannot adapt their scheduling policies to dynamic workload behavior. We propose a new, self-optimizing memory controller design that operates using the principles of reinforcement learning (RL) to overcome these limitations. Our RL-based memory controller observes the system state and estimates the long-term performance impact of each action it can take. In this way, the controller learns to optimize its scheduling policy on the fly to maximize long-term performance. Our results show that an RL-based memory controller improves the performance of a set of parallel applications run on a 4-core CMP by 19% on average (up to 33%), and it improves DRAM bandwidth utilization by 22% compared to a state-of-the-art controller.
Keywords: scheduling, application, access scheduling, microcontrollers, RL
ChenglongFu:2008p23952
Gait Synthesis and Sensory Control of Stair Climbing for a Humanoid Robot
C. Fu and K. Chen
Industrial Electronics, IEEE Transactions on  55  2111 -- 2120  (2008)
http://ieeexplore.ieee.org/search/srchabstract.jsp?arnumber=4475434&isnumber=4505400&punumber=41&k2dockey=4475434@ieeejrns
http://dx.doi.org/10.1109/TIE.2008.921205
Stable and robust walking in various environments is one of the most important abilities for a humanoid robot. This paper addresses walking pattern synthesis and sensory feedback control for humanoid stair climbing. The proposed stair-climbing gait is formulated to satisfy the environmental constraint, the kinematic constraint, and the stability constraint; the selection of the gait parameters is formulated as a constrained nonlinear optimization problem. The sensory feedback controller is phase dependent and consists of the torso attitude controller, zero moment point compensator, and impact reducer. The online learning scheme of the proposed feedback controller is based on a policy gradient reinforcement learning method, and the learned controller is robust against external disturbance. The effectiveness of our proposed method was confirmed by walking experiments on a 32-degree-of-freedom humanoid robot.
Keywords: gait synthesis, application, walking, stair climbing, RL, humanoid robots
Kocsis:2001p18052
Learning Time Allocation Using Neural Networks
L. Kocsis and J. Uiterwijk and J. van den Herik
LECTURE NOTES IN COMPUTER SCIENCE      (2001)
http://www.springerlink.com/index/0PTV8ATA6EQ35CT6.pdf
The strength of a game-playing program is mainly based on the adequacy of the evaluation function and the efficacy of the search algorithm. This paper investigates how temporal difference learning and genetic algorithms can be used to improve various decisions made during game-tree search. The existent TD algorithms are not directly suitable for learning search decisions. Therefore we propose a modified update rule that uses the TD error of the evaluation function to shorten the lag between two rewards. The genetic algorithms can be applied directly to learn search decisions. For our experiments we selected the problem of time allocation from the set of search decisions. On each move the player can decide on a certain search depth, being constrained by the amount of time left. As testing ground, we used the game of Lines of Action, which has roughly the same complexity as Othello. From the results we conclude that both the TD and the genetic approach lead to good results when compared to the existent time-allocation techniques. Finally, a brief discussion of the issues that can emerge when the algorithms are applied to more complex search decisions is given.
Keywords: games, time-allocation, RL, application
Richter:2009p10231
Traffic Light Scheduling using Policy-Gradient Reinforcement Learning
S. Richter
ICAPS 2007 Doctoral Consortium      (2007)
http://abotea.rsise.anu.edu.au/satellite-events-icaps07/dc/dc-23.pdf
Keywords: traffic light scheduling, RL, application
Yadav:2007p23747
Robust/Optimal Temperature Profile Control of a High-Speed Aerospace Vehicle Using Neural Networks
V. Yadav and R. Padhi and S. Balakrishnan
Neural Networks, IEEE Transactions on  18  1115 -- 1128  (2007)
http://ieeexplore.ieee.org/search/srchabstract.jsp?arnumber=4267724&isnumber=4267697&punumber=72&k2dockey=4267724@ieeejrns
http://dx.doi.org/10.1109/TNN.2007.899229
An approximate dynamic programming (ADP)-based suboptimal neurocontroller to obtain desired temperature for a high-speed aerospace vehicle is synthesized in this paper. A 1-D distributed parameter model of a fin is developed from basic thermal physics principles. "Snapshot" solutions of the dynamics are generated with a simple dynamic inversion-based feedback controller. Empirical basis functions are designed using the "proper orthogonal decomposition" (POD) technique and the snapshot solutions. A low-order nonlinear lumped parameter system to characterize the infinite dimensional system is obtained by carrying out a Galerkin projection. An ADP-based neurocontroller with a dual heuristic programming (DHP) formulation is obtained with a single-network-adaptive-critic (SNAC) controller for this approximate nonlinear model. Actual control in the original domain is calculated with the same POD basis functions through a reverse mapping. Further contribution of this paper includes development of an online robust neurocontroller to account for unmodeled dynamics and parametric uncertainties inherent in such a complex dynamic system. A neural network (NN) weight update rule that guarantees boundedness of the weights and relaxes the need for the persistence of excitation (PE) condition is presented. Simulation studies show that in a fairly extensive but compact domain, any desired temperature profile can be achieved starting from any initial temperature profile. Therefore, the ADP and NN-based controllers appear to have the potential to become controller synthesis tools for nonlinear distributed parameter systems.
Keywords: aerospace vehicles, temperature profile control, RL, application
Barty:2005p20124
Temporal difference learning with kernels for pricing american style options
K. Barty and J. Roy and C. Strugarek
Preprint      (2005)
http://www.optimization-online.org/DB_FILE/2005/05/1133.pdf
We propose in this paper to study the problem of estimating the cost-to-go function for an infinite-horizon discounted Markov chain with possibly continuous state space. For implementation purposes, the state space is typically discretized. As soon as the dimension of the state space becomes large, the computation is no longer practicable, a phenomenon referred to as the curse of dimensionality. The approximation of dynamic programming problems is therefore of major importance. A powerful method for dynamic programming, often referred to as neuro-dynamic programming, consists in representing the Bellman function as a linear combination of a priori defined functions, called neurons. The choice of the neurons represents a delicate operation since it requires having an idea of the optimal solution. Furthermore, in a classical learning algorithm, once the choice of these neurons is made it is no longer modified, although the amount of the available information concerning the solution increases along the iterations. In other words, such algorithms are "locked" in the vector subspace generated by these neurons. Consequently, the algorithm is not able to reach the optimal solution if it does not belong to the neurons vector subspace. In this article, we propose an alternative approach very similar to temporal differences, based on functional gradient descent and using an infinite kernel basis. Our algorithm is a result of the combination of stochastic approximation ideas and nonparametric estimation concepts. Furthermore, our algorithm, though aimed at infinite dimensional problems, is implementable in practice. We prove the convergence of this algorithm under a few conditions, which are classical in stochastic approximation schemes. We conclude by showing on examples how this algorithm can be used to solve both infinite-horizon discounted Markov chain problems and Bermudan option pricing.
Keywords: pricing, American option, RL, application
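A hedged sketch of the core mechanism described in this abstract: a TD update that adds a kernel centred at the visited state, so the value function is not confined to a fixed neuron basis. The chain dynamics, kernel width, and step size are invented, and the paper's averaging and convergence machinery is omitted.
    import math, random

    # Value estimation for a simple discounted chain by kernel-based TD:
    # V(x) is a growing sum of Gaussian kernels centred at previously visited states.
    GAMMA, STEP, WIDTH = 0.9, 0.1, 0.5
    centers, weights = [], []

    def V(x):
        return sum(w * math.exp(-((x - c) ** 2) / (2 * WIDTH ** 2))
                   for c, w in zip(centers, weights))

    def transition(x):
        # Toy dynamics on [0, 10]: drift right with noise; reward is higher near 10.
        nx = min(10.0, max(0.0, x + random.uniform(-0.5, 1.0)))
        return nx, -abs(10.0 - nx) * 0.1

    x = random.uniform(0, 10)
    for _ in range(2000):
        nx, r = transition(x)
        td_error = r + GAMMA * V(nx) - V(x)
        # Functional gradient step: add a kernel at the current state, scaled by the TD error.
        centers.append(x)
        weights.append(STEP * td_error)
        x = nx

    print(V(0.0), V(5.0), V(9.5))   # values should increase toward the rewarding end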
Vengerov:2002p23920
Adaptive learning scheme for power control in wireless transmitters
D. Vengerov and N. Bambos and H. Berenji
Vehicular Technology Conference, 2002. Proceedings. VTC 2002-Fall. 2002 IEEE 56th  4  2390 -- 2394 vol.4  (2002)
http://ieeexplore.ieee.org/search/srchabstract.jsp?arnumber=1040649&isnumber=22309&punumber=8066&k2dockey=1040649@ieeecnfs
http://dx.doi.org/10.1109/VETECF.2002.1040649
We address the issue of power-controlled shared channel access in future wireless networks supporting packetized data traffic. We formulate this problem using the dynamic programming framework and present a distributed approximate dynamic programming algorithm capable of finding the optimal policy. The main tradeoff facing a transmitter is to balance its current power level with future backlog in the presence of stochastically changing interference. Simulation experiments demonstrate that the algorithm can achieve significant performance gains over the standard power control approach.
Keywords: power control, packet radio networks, RL, application
Williams:2005p23868
An approximate dynamic programming approach for communication constrained inference
J. Williams and J. Fisher and A. Willsky
Statistical Signal Processing, 2005 IEEE/SP 13th Workshop on    1202 -- 1207  (2005)
http://ieeexplore.ieee.org/search/srchabstract.jsp?arnumber=1628778&isnumber=34164&punumber=10843&k2dockey=1628778@ieeecnfs
Resource management in distributed sensor networks is a challenging problem. This can be attributed to the fundamental tradeoff between the value of information contained in a distributed set of measurements versus the energy costs of acquiring the measurements, fusing them into a model of uncertainty, and transmitting the resulting model. Communications is commonly the highest contributor among these costs, typically by orders of magnitude. Failure to consider this tradeoff can significantly reduce the operational lifetime of a sensor network. While a variety of methods have been proposed that treat a subset of these issues, the approaches are indirect and usually consider at most a single time step. In the context of target tracking with a distributed sensor network we propose an approximate dynamic programming approach which integrates the value of information and the cost of transmitting data over a rolling time horizon. Specifically, we consider tracking a single target and constrain the problem such that, at any time, a single sensor, referred to as the leader node, is activated to both sense and update the probabilistic model. The issue of selecting which sensor should be the leader at each time is of primary interest, as it directly impacts the trade-off between the estimation accuracy and the cost of communicating the probabilistic model from old leader node to new leader node. We formulate this trade-off as a dynamic program, and use an approximation based on a linearization of the sensor model about a nominal trajectory to find a tractable solution. Simulation results demonstrate that the resulting algorithm can provide similar estimation performance to that of the common most informative sensor selection method at a fraction of the communication energy cost.
Keywords: resource management, distributed sensor networks, RL, application
Pineau:2003p4682
Policy-contingent abstraction for robust robot control
J. Pineau and G. Gordon and S. Thrun
Proceedings of the 19th Annual Conference on Uncertainty in Artificial Intelligence (UAI), Acapulco, Mexico      (2003)
Keywords: mobile robotic assistant, RL, application
Wang:2008p23759
ADP-Based Optimal Adaptive Waveform Selection in Cognitive Radar
B. Wang and J. Wang and J. Li
Intelligent Information Technology Application Workshops, 2008. IITAW '08. International Symposium on    788 -- 790  (2008)
http://ieeexplore.ieee.org/search/srchabstract.jsp?arnumber=4732055&isnumber=4731858&punumber=4731857&k2dockey=4732055@ieeecnfs
http://dx.doi.org/10.1109/IITA.Workshops.2008.96
The problem of adaptive waveform selection can be viewed as a problem of stochastic dynamic programming. However, the computational cost of solving the optimality equations is very high as a result of the large state variable space. So far few suitable scheduling algorithms have been proposed. To address this shortcoming, we use approximate dynamic programming (ADP) to solve the adaptive waveform selection problem. The scheduling algorithm we propose can minimize the time taken to detect new targets, detect these targets in accordance with importance, and minimize the use of radar resources. Meanwhile, we have a lower computational cost compared with the traditional algorithms. Finally, the whole paper is summarized.
Keywords: scheduling, application, RL, cognitive radar, adaptive waveform selection
Galstyan:2004p17974
Resource allocation in the grid using reinforcement learning
A. Galstyan and K. Czajkowski and K. Lerman
Autonomous Agents and Multiagent Systems      (2004)
http://ieeexplore.ieee.org/xpls/abs_all.jsp?arnumber=1373688
Keywords: resource allocation, grid, RL, application
Crites:1998p17978
Elevator Group Control Using Multiple Reinforcement Learning Agents
R. Crites and A. Barto
Mach Learn      (1998)
http://www.springerlink.com/index/H023RU5337451W57.pdf
Recent algorithmic and theoretical advances in reinforcement learning (RL) have attracted widespread interest. RL algorithms have appeared that approximate dynamic programming on an incremental basis. They can be trained on the basis of real or simulated experiences, focusing their computation on areas of state space that are actually visited during control, making them computationally tractable on very large problems. If each member of a team of agents employs one of these algorithms, a new collective learning algorithm emerges for the team as a whole. In this paper we demonstrate that such collective RL algorithms can be powerful heuristic methods for addressing large-scale control problems. Elevator group control serves as our testbed. It is a difficult domain posing a combination of challenges not seen in most multi-agent learning research to date. We use a team of RL agents, each of which is responsible for controlling one elevator car. The team receives a global reward signal which appears noisy to each agent due to the effects of the actions of the other agents, the random nature of the arrivals and the incomplete observation of the state. In spite of these complications, we show results that in simulation surpass the best of the heuristic elevator control algorithms of which we are aware. These results demonstrate the power of multi-agent RL on a very large scale stochastic dynamic optimization problem of practical utility.
Keywords: elevator group control, RL, application
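The actual system trains neural-network agents in a detailed elevator simulator; as a grossly simplified sketch of the multi-agent setup only, the code below has two tabular Q-learning agents, each controlling one car, learning from a shared global reward. The building size, arrival process, and reward definition are invented for the example.
    import random
    from collections import defaultdict

    # Two elevator cars serve a 4-floor building; each agent only controls its own car
    # but both learn from the same global reward (negative total number of waiting calls).
    FLOORS, ACTIONS = 4, (-1, 0, 1)              # move down, stay, move up
    Q = [defaultdict(float), defaultdict(float)]
    ALPHA, GAMMA, EPS = 0.1, 0.9, 0.1

    def act(agent, state):
        if random.random() < EPS:
            return random.choice(ACTIONS)
        return max(ACTIONS, key=lambda a: Q[agent][(state, a)])

    for episode in range(3000):
        cars = [0, FLOORS - 1]
        waiting = [random.randint(0, 2) for _ in range(FLOORS)]   # calls per floor
        for _ in range(20):
            state = (tuple(cars), tuple(waiting))
            actions = [act(i, state) for i in range(2)]
            for i, a in enumerate(actions):
                cars[i] = min(FLOORS - 1, max(0, cars[i] + a))
                waiting[cars[i]] = 0                    # a car at a floor serves its calls
            if random.random() < 0.3:                   # a new hall call arrives
                waiting[random.randrange(FLOORS)] += 1
            reward = -sum(waiting)                      # shared global reward signal
            nstate = (tuple(cars), tuple(waiting))
            for i, a in enumerate(actions):
                best = max(Q[i][(nstate, b)] for b in ACTIONS)
                Q[i][(state, a)] += ALPHA * (reward + GAMMA * best - Q[i][(state, a)])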
Shervais:2001p23932
Improving theoretically-optimal and quasi-optimal inventory and transportation policies using adaptive critic based approximate dynamic programming
S. Shervais and T. Shannon
Neural Networks, 2001. Proceedings. IJCNN '01. International Joint Conference on  2  1008 -- 1013 vol.2  (2001)
http://ieeexplore.ieee.org/search/srchabstract.jsp?arnumber=939498&isnumber=20330&punumber=7474&k2dockey=939498@ieeecnfs
http://dx.doi.org/10.1109/IJCNN.2001.939498
We demonstrate the possibility of improving on theoretically-optimal fixed policies for control of physical inventory systems in a nonstationary fitness terrain, based on the combined application of evolutionary search and adaptive critic terrain following. We show that adaptive critic based approximate dynamic programming techniques based on plant-controller Jacobians can be used with systems characterized by discrete valued states and controls. Improvements over the best fixed policies (found using either an LP model or a genetic algorithm) in a high-penalty environment, average 83% under conditions both of stationary and nonstationary demand using real world data
Keywords: stock control, RL, application
Abate:2008p23767
An approximate dynamic programming approach to probabilistic reachability for stochastic hybrid systems
A. Abate and M. Prandini and J. Lygeros and S. Sastry
Decision and Control, 2008. CDC 2008. 47th IEEE Conference on    4018 -- 4023  (2008)
http://ieeexplore.ieee.org/search/srchabstract.jsp?arnumber=4739410&isnumber=4738560&punumber=4721212&k2dockey=4739410@ieeecnfs
http://dx.doi.org/10.1109/CDC.2008.4739410
This paper addresses the computational overhead involved in probabilistic reachability computations for a general class of controlled stochastic hybrid systems. An approximate dynamic programming approach is proposed to mitigate the curse of dimensionality issue arising in the solution to the stochastic optimal control reformulation of the probabilistic reachability problem. An algorithm tailored to this problem is introduced and compared with the standard numerical solution to dynamic programming on a benchmark example.
Keywords: probabilistic reachability, hybrid systems, RL, application
Pietrabissa:2006p23881
An approximate dynamic programming approach to admission control in WCDMA networks
A. Pietrabissa and D. A. Borza
Computer Aided Control System Design, 2006 IEEE International Conference on Control Applications, 2006 IEEE International Symposium on Intelligent Control, 2006 IEEE    2087 -- 2092  (2006)
http://ieeexplore.ieee.org/search/srchabstract.jsp?arnumber=4776962&isnumber=4776587&punumber=4768844&k2dockey=4776962@ieeecnfs
http://dx.doi.org/10.1109/CACSD-CCA-ISIC.2006.4776962
This paper presents a Connection Admission Control (CAC) algorithm for Wide-CDMA (WCDMA) networks based on an Approximate Dynamic Programming (ADP) approach: the network is modeled as a Markov Decision Process (MDP), and ADP algorithms are used to compute the optimal admission policy. Two innovative concepts are introduced: i) by formulating the problem as a linear programming problem, the blocking probabilities of the different classes of service supported by the network are controlled by appropriately modifying the reward function; ii) on the grounds of practical considerations on the CAC problem, a restricted structure policy is determined, based on the reduction of the action space for low load conditions. Numerical simulation results show the effectiveness of the approach.
Keywords: admission control, WCDMA networks, RL, application
Kwok:2004p10267
Reinforcement learning for sensing strategies
C. Kwok and D. Fox
Intelligent Robots and Systems (IROS-2004)      (2004)
http://ieeexplore.ieee.org/xpls/abs_all.jsp?arnumber=1389903
Since sensors have limited range and coverage, mobile robots often have to make decisions on where to point their sensors. A good sensing strategy allows a robot to collect information that is useful for its tasks. Most existing solutions to this active sensing problem choose the direction that maximally reduces the uncertainty in a single state variable. In more complex problem domains, however, uncertainties exist in multiple state variables, and they affect the performance of the robot in different ways. The robot thus needs to have more sophisticated sensing strategies in order to decide which uncertainties to reduce, and to make the correct trade-offs. In this work, we apply a least squares reinforcement learning method to solve this problem. We implemented and tested the learning approach in the RoboCup domain, where the robot attempts to reach a ball and accurately kick it into the goal. We present experimental results that suggest our approach is able to learn highly effective sensing strategies.
Keywords: application, RoboCup, sensing, RL, reach the ball, kick it
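The paper applies a least-squares RL method in the RoboCup domain; as an illustrative sketch of that family of methods only, the code below runs LSPI with LSTD-Q on a small chain with one-hot features (which makes the least-squares solution exact). The chain task, features, and sample sizes are invented and none of the robot-sensing details carry over.
    import numpy as np

    # LSTD-Q on a 5-state chain with two actions (left/right); reaching state 4 pays +1.
    # Features are one-hot over (state, action) pairs, so the solution is exact here.
    N_S, N_A, GAMMA = 5, 2, 0.9

    def phi(s, a):
        v = np.zeros(N_S * N_A)
        v[s * N_A + a] = 1.0
        return v

    def step(s, a):
        ns = min(N_S - 1, max(0, s + (1 if a == 1 else -1)))
        return ns, (1.0 if ns == N_S - 1 else 0.0)

    def lstdq(samples, policy):
        A = np.eye(N_S * N_A) * 1e-6          # small ridge term for invertibility
        b = np.zeros(N_S * N_A)
        for s, a, r, ns in samples:
            f, fn = phi(s, a), phi(ns, policy[ns])
            A += np.outer(f, f - GAMMA * fn)
            b += r * f
        return np.linalg.solve(A, b)

    rng = np.random.default_rng(0)
    samples = []
    for _ in range(5000):                     # uniformly sampled transitions
        s, a = rng.integers(N_S), rng.integers(N_A)
        ns, r = step(s, a)
        samples.append((s, a, r, ns))

    policy = np.zeros(N_S, dtype=int)         # start with "always move left"
    for _ in range(10):                       # LSPI: alternate LSTD-Q and greedy improvement
        w = lstdq(samples, policy)
        policy = np.array([np.argmax([w @ phi(s, a) for a in range(N_A)]) for s in range(N_S)])

    print(policy)                             # should prefer moving right toward state 4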
deFarias:2003p4366
The Linear Programming Approach to Approximate Dynamic Programming
D. de Farias and B. V. Roy
Operations Research  51  850--865  (2003)
Keywords: queueing network, RL, application
Schneegass:2008p23946
Uncertainty propagation for quality assurance in Reinforcement Learning
D. Schneegass and S. Udluft and T. Martinetz
Neural Networks, 2008. IJCNN 2008. (IEEE World Congress on Computational Intelligence). IEEE International Joint Conference on    2588 -- 2595  (2008)
http://ieeexplore.ieee.org/search/srchabstract.jsp?arnumber=4634160&isnumber=4633757&punumber=4625775&k2dockey=4634160@ieeecnfs
http://dx.doi.org/10.1109/IJCNN.2008.4634160
In this paper we address the reliability of policies derived by Reinforcement Learning on a limited amount of observations. This can be done in a principled manner by taking into account the derived Q-function's uncertainty, which stems from the uncertainty of the estimators used for the MDP's transition probabilities and the reward function. We apply uncertainty propagation in parallel with the Bellman iteration and achieve confidence intervals for the Q-function. In a second step we change the Bellman operator so as to achieve a policy guaranteeing the highest minimum performance with a given probability. We demonstrate the functionality of our method on artificial examples and show that, for an important problem class, even an enhancement of the expected performance can be obtained. Finally we verify this observation on an application to gas turbine control.
Keywords: gas turbine control, RL, application
Kalyanakrishnan:2007p10468
Batch reinforcement learning in a complex domain
S. Kalyanakrishnan and P. Stone
Proceedings of the 6th international joint conference on …      (2007)
http://portal.acm.org/citation.cfm?id=1329125.1329241
Temporal difference reinforcement learning algorithms are perfectly suited to autonomous agents because they learn directly from an agent's experience based on sequential actions in the environment. However, their most common algorithmic variants are relatively inefficient in their use of experience data, which in many agent-based settings can be scarce. In particular, they make just one learning "update" for each atomic experience. Batch reinforcement learning algorithms, on the other hand, aim to achieve greater data efficiency by saving experience data and using it in aggregate to make updates to the learned policy. Their success has been demonstrated in the past on simple domains like grid worlds and low-dimensional control applications like pole balancing. In this paper, we compare and contrast batch reinforcement learning algorithms with on-line algorithms based on their empirical performance in a complex, continuous, noisy, multiagent domain, namely RoboCup soccer Keepaway. We find that the two batch methods we consider, Experience Replay and Fitted Q Iteration, both yield significant gains in sample complexity, while achieving high asymptotic performance.
Keywords: Robocup Keepaway Soccer, RL, application
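As a hedged illustration of Fitted Q Iteration, one of the two batch methods the abstract compares, the sketch below repeatedly regresses Q-value targets onto a fixed batch of transitions from a toy 1-D task, using scikit-learn's ExtraTreesRegressor as the batch learner. The task, action set, and all parameters are invented; the Keepaway domain and the learners used in the paper are not reproduced.
    import numpy as np
    from sklearn.ensemble import ExtraTreesRegressor

    # Fitted Q Iteration on a toy 1-D task: state in [-1, 1], actions push left/right,
    # reward is highest near the origin. A fixed batch of random transitions is reused
    # for every regression pass, which is the data-efficiency point made in the abstract.
    GAMMA, ACTIONS = 0.95, (-0.1, 0.1)
    rng = np.random.default_rng(0)

    def step(s, a):
        ns = np.clip(s + a + rng.normal(0, 0.02), -1.0, 1.0)
        return ns, float(-abs(ns))

    batch = []
    for _ in range(3000):
        s = rng.uniform(-1, 1)
        a = ACTIONS[rng.integers(2)]
        ns, r = step(s, a)
        batch.append((s, a, r, ns))

    S = np.array([[s, a] for s, a, _, _ in batch])
    R = np.array([r for _, _, r, _ in batch])
    NS = np.array([ns for _, _, _, ns in batch])

    model = None
    for _ in range(25):                            # successive regression passes
        if model is None:
            targets = R                            # first pass: the target is just the reward
        else:
            q_next = np.max([model.predict(np.column_stack([NS, np.full(len(NS), a)]))
                             for a in ACTIONS], axis=0)
            targets = R + GAMMA * q_next
        model = ExtraTreesRegressor(n_estimators=30, random_state=0).fit(S, targets)

    greedy = lambda s: max(ACTIONS, key=lambda a: model.predict([[s, a]])[0])
    print(greedy(-0.5), greedy(0.5))               # expect push right, then push left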