My papers

This table is automagically generated from this BiBTeX file. The generator is Exhibit. Scroll down to the end of the page to find a link. I made every effort (that I could afford!) to make sure that the page looks OK in recent browsers (as of August 2010), including IE, Safari, Chrome, Firefox, and Opera. However, it may very well be that the page gets messed up on some systems or under some browser settings. In particular, you must enable Javascript to be able to properly view this page. Please let me know if the page is messed up, or if you find a broken link. This link points to the script file that I used to process my bib file. You may use the scripts and this page as a template at your own risk.
The purpose of this section is to let Google find and index the publication list. Here is a non-existent word, madgyhung, that will help me find out when Google starts to index this page.

2010 (15)

  1. Bartók, G., Szepesvári, Cs., and Zilles, S., Models of Active Learning in Group-structured State Spaces, Information and Computation, 208 , pp. 364--384, 2010. PDF (abstract) We investigate the problem of learning the transition dynamics of deterministic, discrete-state environments. We assume that an agent exploring such an environment is able to perform actions (from a finite set of actions) in the environment and to sense the state changes. The question investigated is whether the agent can learn the dynamics without visiting all states. Such a goal is unrealistic in general, hence we assume that the environment has structural properties an agent might exploit. In particular, we assume that the set of all action sequences forms an algebraic group. We introduce a learning model in different variants and study under which circumstances the corresponding ``group structured environments'' can be learned efficiently by experimenting with group generators (actions). It turns out that for some classes of such environments the choice of actions given to the agent determines if efficient learning is possible. Negative results are presented, even without efficiency constraints, for rather general classes of groups, showing that even with group structure, learning an environment from partial information is far from trivial. However, positive results for special subclasses of Abelian groups turn out to be a good starting point for the design of efficient learning algorithms based on structured representations.
  2. Szepesvári, Cs., Reinforcement Learning Algorithms for MDPs, Wiley Encyclopedia of Operations Research, Wiley, 2010.
  3. Szepesvári, Cs., Some Recent Algorithmic Results about the Exploration-vs-exploitation Dilemma, 2010. (link)
  4. Póczos, B., Greiner, R., Szepesvári, Cs., and Li, L., Budgeted Distribution Learning of Belief Net Parameters, ICML-10, 2010. PDF (abstract) Most learning algorithms assume that a data set is given initially. We address the common situation where data is not available initially, but can be obtained, at a cost. We focus on learning Bayesian belief networks (BNs) over discrete variables. As such BNs are models of probabilistic distributions, we consider the ``generative'' challenge of learning the parameters, for a fixed structure, that best match the true distribution. We focus on the budgeted learning setting, where there is a known fixed cost c_i for acquiring the value of the i-th feature for any specified instance, and a known total cost to spend acquiring all information. After formally defining this problem from a Bayesian perspective, we first consider allocation algorithms that must decide, before seeing any results, which features of which instances to probe. We show this is NP-hard, even if all variables are independent, then prove that the greedy allocation algorithm IGA is optimal when the costs are uniform and the features are independent, but can otherwise be sub-optimal. We then show that general (non-allocation) policies perform better, and explore the challenges of learning the parameters for general belief networks in this setting, describing conditions for when the obvious round-robin algorithm will, versus will not, work optimally. We also explore the effectiveness of this and various other heuristic algorithms.

2009 (12)

  1. Sutton, R.S., Maei, H.R., Precup, D., Bhatnagar, S., Silver, D., Szepesvári, Cs., and Wiewiora, E., Fast Gradient-Descent Methods for Temporal-Difference Learning with Linear Function Approximation, ICML-09, pp. 993--1000, 2009. PDF (abstract) Sutton, Szepesvari and Maei (2009) recently introduced the first temporal-difference learning algorithm compatible with both linear function approximation and off-policy training, and whose complexity scales only linearly in the size of the function approximator. Although their gradient temporal difference (GTD) algorithm converges reliably, it can be very slow compared to conventional linear TD (on on-policy problems where TD is convergent), calling into question its practical utility. In this paper we introduce two new related algorithms with better convergence rates. The first algorithm, GTD2, is derived and proved convergent just as GTD was, but uses a different objective function and converges significantly faster (but still not as fast as conventional TD). The second new algorithm, linear TD with gradient correction, or TDC, uses the same update rule as conventional TD except for an additional term which is initially zero. In our experiments on small test problems and in a Computer Go application with a million features, the learning rate of this algorithm was comparable to that of conventional TD. This algorithm appears to extend linear TD to off-policy learning with no penalty in performance while only doubling computational requirements.
  2. Maei, H.R., Szepesvári, Cs., Bhatnagar, S., Silver, D., Precup, D., and Sutton, R.S., Convergent Temporal-Difference Learning with Arbitrary Smooth Function Approximation, NIPS-22, pp. 1204--1212, 2009. PDF (abstract) We introduce the first temporal-difference learning algorithms that converge with smooth value function approximators, such as neural networks. Conventional temporal-difference (TD) methods, such as TD(lambda), Q-learning and Sarsa have been used successfully with function approximation in many applications. However, it is well known that off-policy sampling, as well as non-linear function approximation, can cause these algorithms to become unstable (i.e., the parameters of the approximator may diverge). Sutton et al (2009a,b) solved the problem of off-policy learning with linear TD algorithms by introducing a new objective function, related to the Bellman-error, and algorithms that perform stochastic gradient-descent on this function. In this paper, we generalize their work to non-linear function approximation. We present a Bellman error objective function and two gradient-descent TD algorithms that optimize it. We prove the asymptotic almost-sure convergence of both algorithms for any finite Markov decision process and any smooth value function approximator, under usual stochastic approximation conditions. The computational complexity per iteration scales linearly with the number of parameters of the approximator. The algorithms are incremental and are guaranteed to converge to locally optimal solutions.
  3. Yao, H., Bhatnagar, S., and Szepesvári, Cs., LMS-2: Towards an Algorithm that is as Cheap as LMS and Almost as Efficient as RLS, Proc. of the 48th IEEE Conference on Decision and Control (CDC-09), pp. 1181--1188, 2009. PDF
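The linear TD-with-gradient-correction (TDC) update described in the ICML-09 abstract above can be sketched in a few lines. This is a hedged illustration, not the authors' code: the toy five-state random walk, the one-hot features, and the step sizes are my assumptions.

```python
import numpy as np

def tdc_update(theta, w, phi, phi_next, r, gamma, alpha, beta):
    """One TDC step: the conventional TD update plus a correction term
    that is initially zero (w starts at zero)."""
    delta = r + gamma * (phi_next @ theta) - phi @ theta  # TD error
    theta = theta + alpha * (delta * phi - gamma * phi_next * (w @ phi))
    w = w + beta * (delta - w @ phi) * phi
    return theta, w

# Toy problem (my assumption): the classic 5-state random walk with one-hot
# features; an episode ends off either end, with reward 1 on the right.
rng = np.random.default_rng(0)
n = 5
theta, w = np.zeros(n), np.zeros(n)
s = 2
for _ in range(100000):
    s2 = s + rng.choice([-1, 1])
    done = s2 < 0 or s2 >= n
    r = 1.0 if s2 >= n else 0.0
    phi = np.eye(n)[s]
    phi_next = np.zeros(n) if done else np.eye(n)[s2]
    theta, w = tdc_update(theta, w, phi, phi_next, r, 1.0, 0.02, 0.05)
    s = 2 if done else s2
# theta should now be close to the true values 1/6, 2/6, ..., 5/6.
```

With one-hot features the correction term stays small and TDC behaves much like tabular TD(0); the point of the extra weight vector w is that the same update remains convergent under off-policy sampling, where plain TD can diverge.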

2008 (10)

  1. Munos, R. and Szepesvári, Cs., Finite Time Bounds for Fitted Value Iteration, JMLR, 9 , pp. 815--857, 2008. (link) PDF (abstract) This is the longer version of the ICML'2005 paper with all the proofs and some extra material. In this paper we develop a theoretical analysis of the performance of sampling-based fitted value iteration (FVI) to solve infinite state-space, discounted-reward Markovian decision processes (MDPs) under the assumption that a generative model of the environment is available. Our main results come in the form of finite-time bounds on the performance of two versions of sampling-based FVI. The convergence rate results obtained allow us to show that both versions of FVI are well behaved in the sense that, by using a sufficiently large number of samples, arbitrarily good performance can be achieved with high probability for a large class of MDPs. An important feature of our proof technique is that it permits the study of weighted $L^p$-norm performance bounds. As a result, our technique applies to a large class of function-approximation methods (e.g., neural networks, adaptive regression trees, kernel machines, locally weighted learning), and our bounds scale well with the effective horizon of the MDP. The bounds show a dependence on the stochastic stability properties of the MDP: they scale with the discounted-average concentrability of the future-state distributions. They also depend on a new measure of the approximation power of the function space, the inherent Bellman residual, which reflects how well the function space is ``aligned'' with the dynamics and rewards of the MDP. The conditions of the main result, as well as the concepts introduced in the analysis, are extensively discussed and compared to previous theoretical results. Numerical experiments are used to substantiate the theoretical findings.
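As a rough illustration of the sampling-based FVI scheme analyzed in the JMLR paper above (a sketch only, not the paper's algorithm or its bounds), each iteration regresses a value function onto one-sample Bellman backups computed at a fixed set of base points. The toy chain, the interpolation-based regressor, and all constants below are my assumptions.

```python
import numpy as np

def fitted_value_iteration(sample_next, states, actions, gamma, n_iter, fit):
    """Sampling-based FVI sketch: each iteration computes one-sample Bellman
    backups at the base points and fits a regressor to the resulting targets."""
    V = lambda s: 0.0                      # initial value function
    for _ in range(n_iter):
        targets = []
        for s in states:
            backups = []
            for a in actions:
                r, s2 = sample_next(s, a)  # one call to the generative model
                backups.append(r + gamma * V(s2))
            targets.append(max(backups))
        V = fit(states, targets)           # supervised regression step
    return V

# Toy chain on [0, 1] (an assumption): actions move the state by +/-0.1
# (clipped); reward 1 whenever the next state is at least 0.9.
def sample_next(s, a):
    s2 = min(max(s + a, 0.0), 1.0)
    return (1.0 if s2 >= 0.9 else 0.0), s2

def fit(xs, ys):                           # piecewise-linear regressor
    xs, ys = np.array(xs), np.array(ys)
    return lambda s: float(np.interp(s, xs, ys))

states = list(np.linspace(0.0, 1.0, 21))
V = fitted_value_iteration(sample_next, states, [-0.1, 0.1], 0.9, 30, fit)
# V(1.0) should approach 1 / (1 - 0.9) = 10 as n_iter grows.
```

Any regressor can be plugged in for `fit` (neural networks, regression trees, kernel machines); the paper's bounds say, roughly, that the quality of the resulting policy is governed by how well the chosen function class can represent Bellman backups of its own members.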

2007 (7)

  1. Audibert, J.-Y., Munos, R., and Szepesvári, Cs., Tuning Bandit Algorithms in Stochastic Environments, ALT-07, Springer, pp. 150--165, 2007. PDF

2006 (5)

2005 (5)

2004 (5)

2003 (3)

  1. French, M.C., Szepesvári, Cs., and Rogers, E., Performance of Nonlinear Approximate Adaptive Controllers, Wiley, 2003.

2001 (5)

  1. Lörincz, A., Hévizi, Gy., and Szepesvári, Cs., Ockham's Razor Modeling of the Matrisome Channels of the Basal Ganglia Thalamocortical Loop, International Journal of Neural Systems, 11 , pp. 1--2, 2001. (link) (abstract) A functional model of the basal ganglia-thalamocortical (BTC) loops is described. In our modeling effort, we try to minimize the complexity of our starting hypotheses. For that reason, we call this type of modeling Ockham's razor modeling. We have the additional constraint that the starting assumptions should not contradict experimental findings about the brain. First assumption: The brain lacks direct representation of paths but represents directions (called speed fields in control theory). Then control should be concerned with speed-field tracking (SFT). Second assumption: Control signals are delivered upon differencing in competing parallel channels of the BTC loops. This is modeled by extending SFT with differencing that gives rise to the robust Static and Dynamic State (SDS) feedback-controlling scheme. Third assumption: Control signals are expressed in terms of a gelatinous medium surrounding the limbs. This is modeled by expressing parameters of motion in parameters of the external space. We show that corollaries of the model fit properties of the BTC loops. The SDS provides proper identification of motion related neuronal groups of the putamen. Local minima arise during the controlling process that works in external space. The model explains the presence of parallel channels as the means of avoiding such local minima. Stability conditions of the SDS predict that the initial phase of learning is mostly concerned with selection of sign for the inverse dynamics. The model provides a scalable controller. State description in external space instead of configurational space reduces the dimensionality problem. A falsifying experiment is suggested. Computer experiments demonstrate the feasibility of the approach. We argue that the resulting scheme has a straightforward connectionist representation exhibiting population coding and Hebbian learning properties.

2000 (9)

  1. Sorantin, E., Schmidt, F., Mayer, H., Becker, M., Szepesvári, Cs., Graif, E., and Winkler, P., Computer Aided Diagnosis of Clustered Microcalcifications Using Artificial Neural Nets, Journal of Computing and Information Technology, 8 (2) , 2000. (link) PDF (abstract) Objective: Development of a fully automated computer application for detection and classification of clustered microcalcifications using neural nets. Material and Methods: Mammographic films with clustered microcalcifications of known histology were digitized. All clusters were rated by two radiologists on a 3-point scale: benign, indeterminate and malignant. Automatically detected microcalcifications were clustered. Features derived from those clusters were used as input to 2 artificial neural nets: one was trained to identify the indeterminate clusters, whereas the second ANN classified the remaining clusters into benign or malignant ones. Performance evaluation followed the patient-based receiver operating characteristic analysis. Results: For identification of patients with indeterminate clusters an Az-value of 0.8741 could be achieved. For the remaining patients their clusters could be classified as benign or malignant at an Az-value of 0.8749, a sensitivity of 0.977 and specificity of 0.471. Conclusions: A fully automated computer system for detection and classification of clustered microcalcifications was developed. The system is able to identify patients with indeterminate clusters, where additional investigations are recommended, and produces a reliable estimation of the biologic dignity for the remaining ones.
  2. French, M.C., Szepesvári, Cs., and Rogers, E., Scaling of LQ Performance in Approximate Adaptive Designs, Proceedings of the International Symposium on the Mathematical Theory of Networks and Systems (MTNS 2000), 2000.

1999 (6)

  1. Murvai, J., Barta, E., Vlahovicek, K., Szepesvári, Cs., Acatrinei, C., and Pongor, S., The SBASE Protein Domain Sequence Library Release 6.0., Nucleic Acids Research, 27 (1) , pp. 257--259, 1999.
  2. Kalmár, Zs. and Szepesvári, Cs., An Evaluation Criterion for Macro-Learning and Some Results, Mindmaker Ltd., TR-99-01, Budapest 1121, Konkoly Th. M. u. 29-33, HUNGARY 1999. PDF (abstract) It is known that a well-chosen set of macros makes it possible to considerably speed up the solution of planning problems. Recently, macros have been considered in the planning framework built on Markovian decision problems. However, so far no systematic approach has been put forth to investigate the utility of macros within this framework. In this article we begin to systematically study this problem by introducing the concept of multi-task MDPs defined with a distribution over the tasks. We propose an evaluation criterion for macro-sets that is based on the expected planning speed-up due to the usage of a macro-set, where the expectation is taken over the set of tasks. The consistency of the empirical speed-up maximization algorithm is shown in the finite case. For acyclic systems, the expected planning speed-up is shown to be proportional to the amount of ``time-compression'' due to the macros. Based on these observations a heuristic algorithm for learning macros is proposed. The algorithm is shown to return macros identical to those that one would design by hand in the case of a particular navigation-like multi-task MDP. Some related questions, in particular the problem of breaking up MDPs into multiple tasks, factorizing MDPs and learning generalizations over actions to enhance the amount of transfer, are also considered briefly at the end of the paper.

1998 (8)

  1. Szepesvári, Cs., Static and Dynamic Aspects of Optimal Sequential Decision Making, Szeged, Aradi vrt. tere 1, HUNGARY, 6720 1998. PDF (abstract) Foreword: In this thesis the theory of optimal sequential decisions having a general recursive structure is investigated via an operator theoretical approach, the recursive structure (of both the dynamics and the optimality criterion) being encoded into the so-called cost propagation operator. Decision problems like Markovian Decision Problems with expected or worst-case total discounted/undiscounted cost criterion; repeated zero-sum games such as Markov-games; or alternating Markov-games all admit such a recursive structure. Our setup has the advantage that it emphasizes their common properties as well as points out some differences. The thesis consists of two parts: in the first part the model is assumed to be known, while in the second one the models are to be explored. The setup of Part I is rather abstract but enables a unified treatment of a large class of sequential decision problems, namely the class where the total cost of decision policies is defined recursively by a so-called cost propagation operator. Under natural monotonicity and continuity conditions the greedy policies w.r.t. the optimal cost-to-go function turn out to be optimal, due to the recursive structure. Part II considers the case when the models are unknown and have to be explored and learnt. The price of considering unknown models is that here we have to restrict ourselves to models with an additive cost structure in order to obtain tractable learning situations. The almost sure convergence of the most frequently used algorithms proposed in the reinforcement learning community is proved. These algorithms are treated as multidimensional asynchronous stochastic approximation schemes and their convergence is deduced from the main theorem of the second part. The key to the method here is the so-called rescaling property of certain homogeneous processes. A practical and verifiable sufficient condition for the convergence of on-line learning policies to an optimal policy is formulated and a convergence rate is established.
  2. Kalmár, Zs., Szepesvári, Cs., and Lörincz, A., Module-Based Reinforcement Learning: Experiments with a Real Robot, Machine Learning, 31 , pp. 1--2, 1998. PDF (abstract) The behavior of reinforcement learning (RL) algorithms is best understood in completely observable, discrete-time controlled Markov chains with finite state and action spaces. In contrast, robot-learning domains are inherently continuous both in time and space, and moreover are partially observable. Here we suggest a systematic approach to solve such problems in which the available qualitative and quantitative knowledge is used to reduce the complexity of the learning task. The steps of the design process are to: i) decompose the task into subtasks using the qualitative knowledge at hand; ii) design local controllers to solve the subtasks using the available quantitative knowledge; and iii) learn a coordination of these controllers by means of reinforcement learning. It is argued that the approach enables fast, semi-automatic, but still high-quality robot control, as no fine-tuning of the local controllers is needed. The approach was verified on a non-trivial real-life robot task. Several RL algorithms were compared by ANOVA and it was found that the model-based approach worked significantly better than the model-free approach. The learnt switching strategy performed comparably to a handcrafted version. Moreover, the learnt strategy seemed to exploit certain properties of the environment which were not foreseen in advance, thus supporting the view that adaptive algorithms are advantageous over non-adaptive ones in complex environments.

1997 (9)

  1. Szepesvári, Cs. and Lörincz, A., High Precision Neurocontrol of a Chaotic Bioreactor, Nonlinear Analysis, 30 (3) , pp. 1669--1676, 1997.
  2. Szepesvári, Cs. and Lörincz, A., Approximate Inverse-Dynamics based Robust Control using Static and Dynamic State Feedback, Applications of Neural Adaptive Control Technology, Kalkkuhl, J. and Hunt, K.J. and Zbikowski, R. and Dzieliński, A. (Eds.), World Scientific, Singapore, pp. 151--197, 1997. PDF (abstract) It is rigorously shown that inverse-dynamics models can be used to stabilize plants of any order provided that the inverse-dynamics model is used in a mixed-mode fashion, that of a `Static and Dynamic' State-feedback (SDS) mode. When the resulting controller is used for tracking, increasing the gain of the dynamic feedback decreases the tracking error. Yet another attractive feature of the SDS scheme is that the inverse-dynamics model can be tuned on-line by any adaptation mechanism without cancelling stability, provided the conditions of the non-adaptive stability theorem hold at every time instant. Computer simulations of the control of a chaotic bioreactor and a `realistic' robotic manipulator demonstrate the robustness of the approach. It is shown that SDS control yields zero asymptotic error when controlling the bioreactor using an inverse-dynamics model which, when used in a traditional mode, would yield intolerably large errors. In the case of the robotic arm simulations the effects of perturbation and sampling frequency are investigated and the SDS control is compared with the non-adaptive computed torque method. A fully self-organizing associative neural network architecture that can be used to approximate the inverse dynamics in the form of a Position-and-Direction-to-Action (PDA) map is also described. Similarities between the basal ganglia - thalamocortical loops and the SDS scheme are discussed and it is argued that the SDS scheme could be viewed as a model of higher order motor functions of these areas.
  3. Kalmár, Zs., Szepesvári, Cs., and Lörincz, A., Module Based Reinforcement Learning for a Real Robot, Proceedings of the 6th European Workshop on Learning Robots, pp. 22--32, 1997. PDF (abstract) The behaviour of reinforcement learning (RL) algorithms is best understood in completely observable, finite state- and action-space, discrete-time controlled Markov-chains. Robot-learning domains, on the other hand, are inherently infinite both in time and space, and moreover they are only partially observable. In this article we suggest a systematic method whose motivation comes from the desire to transform the task-to-be-solved into a finite-state, discrete-time, ``approximately'' Markovian task, which is completely observable too. The key idea is to break up the problem into subtasks and design controllers for each of the subtasks. Then operating conditions are attached to the controllers (together the controllers and their operating conditions which are called modules) and possible additional features are designed to facilitate observability. A new discrete time-counter is introduced at the ``module-level'' that clicks only when a change in the value of one of the features is observed. The approach was tried out on a real-life robot. Several RL algorithms were compared and it was found that a model-based approach worked best. The learnt switching strategy performed equally well as a handcrafted version. Moreover, the learnt strategy seemed to exploit certain properties of the environment which could not have been seen in advance, which predicted the promising possibility that a learnt controller might outperform a handcrafted switching strategy in the future.

1996 (9)

  1. Ribeiro, C. H. C. and Szepesvári, Cs., Q-learning Combined with Spreading: Convergence and Results, Proceedings of ISRF-IEE International Conference: Intelligent and Cognitive Systems, Neural Networks Symposium, Tehran, Iran pp. 32--36, 1996.
  2. Littman, M.L. and Szepesvári, Cs., A Generalized Reinforcement Learning Model: Convergence and Applications, ICML, pp. 310--318, 1996. PDF
  3. Szepesvári, Cs. and Lörincz, A., Approximate Geometry Representations and Sensory Fusion, Neurocomputing, 12 , pp. 267--287, 1996. PDF (abstract) This paper summarizes the recent advances in the theory of self-organizing development of approximate geometry representations based on the use of neural networks. Part of this work is based on the theoretical approach of (Szepesvari, 1993), which is different from that of (Martinetz, 1993) and also is somewhat more general. The Martinetz approach treats signals provided by artificial neuron-like entities whereas the present work uses the entities of the external world as its starting point. The relationship between the present work and the Martinetz approach will be detailed. We approach the problem of approximate geometry representations by first examining the problem of sensory fusion, i.e., the problem of fusing information from different transductors. A straightforward solution is the simultaneous discretization of the output of all transductors, which means the discretization of a space defined as the product of the individual transductor output spaces. However, the geometry relations are defined for the external world only, so it is still an open question how to define the metrics on the product of output spaces. It will be shown that simple Hebbian learning can result in the formation of a correct geometry representation. Some topological considerations will be presented to help us clarify the underlying concepts and assumptions. The mathematical framework gives rise to a corollary on the ``topographical mappings'' realized by Kohonen networks. In fact, the present work as well as (Martinetz, 1993) may be considered as a generalization of Kohonen's topographic maps. We develop topographic maps with self-organizing interneural connections.
  4. Fomin, T., Rozgonyi, T., Szepesvári, Cs., and Lörincz, A., Self-organizing Multi-resolution Grid for Motion Planning and Control, International Journal of Neural Systems, 7 , pp. 757--776, 1996. PDF (abstract) A fully self-organizing neural network approach to low-dimensional control problems is described. We consider the problem of learning to control an object and solving the path planning problem at the same time. Control is based on the path planning model that follows the gradient of the stationary solution of a diffusion process working in the state space. Previous works are extended by introducing a self-organizing multigrid-like discretizing structure to represent the external world. Diffusion is simulated within a recurrent neural network built on this multigrid system. The novelty of the approach is that the diffusion on the multigrid is fast. Moreover, the diffusion process on the multigrid fits well the requirements of the path planning: it accelerates the diffusion in large free space regions while still keeps the resolution in small bottleneck-like labyrinths along the path. Control is achieved in the usual way: associative learning identifies the inverse dynamics of the system in a direct fashion. To this end there are introduced interneurons between neighbouring discretizing units that detect the strength of the steady-state diffusion and forward control commands to the control neurons via modifiable connections. This architecture forms the Multigrid Position-and-Direction-to-Action (MPDA) map. The architecture integrates reactive path planning and continuous motion control. It is also shown that the scheme leads to population coding for the actual command vector.

1995 (2)

1994 (7)

  1. Oláh, B. and Szepesvári, Cs., Complexity of Learning: the Case of Everyday Neural Networks, Proceedings of IEEE WCCI ICNN'94, Orlando, Florida pp. 61--65, 1994. PDF (abstract) Nowadays artificial neural networks (ANNs) receive increasing attention. However, current computer architectures do not yet allow the implementation of large ANNs. Thus it is an important question how the learning time of ANNs scales with their size (and/or with the size of the tasks). Judd introduced a computational framework for the learning problem (J. Judd, ``Neural Network Design and the Complexity of Learning.'' A Bradford Book, MIT Press, Cambridge, 1990) and proved that learning in neural networks is in general too hard, i.e., in the worst case learning in neural networks is NP-complete. However, in his proof he restricts the domain of neural network architectures and tasks in such a way that ``everyday'' neural network architectures, such as that of the back-propagation algorithm, are excluded. Consequently, Judd's proof does not tell us anything about these types of networks. First we outline a thorough framework for loading. The framework enables one to differentiate between loading problems at a finer level. Two theorems are presented about the complexity of learning for ``everyday'' ANN architectures. The first theorem says that for extended binary tasks, in the worst case, the loading problem is NP-complete, while the second says that for binary tasks and basis LUF there exists a polynomial time algorithm. From these results it seems that the loading problem for ``everyday'' neural networks is interesting from the mathematical point of view, as it lies on the boundary of efficiently solvable problems.

1993 (4)

  1. Szepesvári, Cs., The Role of Local Connections in Competitive Neural Networks: Collective Learning and Representation of Geometry, Szeged, Hungary 1993.
  2. Szepesvári, Cs. and Lörincz, A., Integration of ANN Cues, Dynamic AI Concepts and ANN Decision System into an Adaptive Self-Organizing Agent, 3rd Conf. on Artificial Intelligence, Koch, P. (Eds.), John von Neumann Society for Computer Sciences, Budapest, Hungary pp. 231--237, 1993.
  3. Szepesvári, Cs. and Lörincz, A., Integration of Artificial Neural Networks and Dynamic Concepts to an adaptive and self-organizing agent, Proc. of WCNN'93, Lawrence Erlbaum Associates, Inc. Publishers, New Jersey, Portland, Oregon, USA pp. 524--527, 1993.
  4. Szepesvári, Cs. and Lörincz, A., Integration of Artificial Neural Networks and Dynamic Concepts to an adaptive and self-organizing agent, Proc. of ICANN'93, Gielen, S. and Kappen, B. (Eds.), Springer-Verlag, London, Amsterdam, The Netherlands 1993.
Copyright notice: Papers are presented and may be downloaded to ensure timely dissemination of scholarly and technical work. Copyright and all rights therein are retained by authors or by other copyright holders. All persons copying this information are expected to adhere to the terms and constraints invoked by each copyright holder. In most cases, these works may not be reposted without explicit permission of the copyright holder.