多臂賭博機

在概率論和機器學習中，多臂賭博機問題（英語：multi-armed bandit problem）^[1]有時稱為K-或N-臂賭博機問題（英語：K-or N-armed bandit problem）^[2]，是一個必須在競爭（替代）之間分配一組固定的有限資源的問題。當每個選擇的屬性在分配時僅部分已知時，以最大化其預期收益的方式進行選擇，並且隨著時間的推移或通過向該選擇分配資源可能會更好地被理解。這是一個經典的強化學習問題，體現了探索-利用權衡困境^[3]^[4]。這個名字來源於想像一個賭徒坐在一排賭博機（或稱角子機、老虎機）前（有時被稱為「單臂賭博機」），他必須決定玩哪台機器，每台機器玩多少次以及玩的順序^[5]，並且是否繼續使用當前機器或嘗試不同的機器。多臂賭博機問題也屬於隨機調度的廣義範疇。

在該問題中，每台機器根據該機器特定的概率分布提供隨機獎勵，該獎勵是先驗未知的。賭徒的目標是最大化通過一系列槓桿拉動所獲得的獎勵總和^[4]。賭徒在每次試驗中面臨的關鍵權衡是在「利用」具有最高預期收益的機器和「探索」以獲得有關其他機器的預期收益的更多信息之間^[3]。機器學習也面臨著探索和利用之間的權衡。在實踐中，多臂賭博機已用於對諸如管理大型組織（如科學基金會或製藥公司）中的研究項目等問題進行建模^[3]^[4]。在問題的早期版本中，賭徒一開始對機器一無所知。

赫伯特·羅賓斯於1952年認識到該問題的重要性，在「實驗序貫設計的某些方面」中構建了收斂種群選擇策略^[6]。約翰·C·吉廷斯首次發表的吉廷斯指數定理給出了最大化預期折扣獎勵的最優策略^[7]。

參考資料

^ Auer, P.; Cesa-Bianchi, N.; Fischer, P. Finite-time Analysis of the Multiarmed Bandit Problem. Machine Learning. 2002, 47 (2/3): 235–256. doi:10.1023/A:1013689704352 .
^ Katehakis, M. N.; Veinott, A. F. The Multi-Armed Bandit Problem: Decomposition and Computation. Mathematics of Operations Research. 1987, 12 (2): 262–268. S2CID 656323. doi:10.1287/moor.12.2.262.
^ ^3.0 ^3.1 ^3.2 引用錯誤：沒有為名為Gittins89的參考文獻提供內容
^ ^4.0 ^4.1 ^4.2 引用錯誤：沒有為名為BF的參考文獻提供內容
^ Weber, Richard, On the Gittins index for multiarmed bandits, Annals of Applied Probability, 1992, 2 (4): 1024–1033, JSTOR 2959678, doi:10.1214/aoap/1177005588 
^ Robbins, H. Some aspects of the sequential design of experiments. Bulletin of the American Mathematical Society. 1952, 58 (5): 527–535. doi:10.1090/S0002-9904-1952-09620-8 .
^ J. C. Gittins. Bandit Processes and Dynamic Allocation Indices. Journal of the Royal Statistical Society. Series B (Methodological). 1979, 41 (2): 148–177. JSTOR 2985029. S2CID 17724147. doi:10.1111/j.2517-6161.1979.tb01068.x.

延伸閱讀

Scholia上有關多臂賭博機的資訊

Guha, S.; Munagala, K.; Shi, P., Approximation algorithms for restless bandit problems, Journal of the ACM, 2010, 58: 1–50, S2CID 1654066, arXiv:0711.3861 , doi:10.1145/1870103.1870106
Dayanik, S.; Powell, W.; Yamazaki, K., Index policies for discounted bandit problems with availability constraints, Advances in Applied Probability, 2008, 40 (2): 377–400, doi:10.1239/aap/1214950209  .
Powell, Warren B., Chapter 10, Approximate Dynamic Programming: Solving the Curses of Dimensionality, New York: John Wiley and Sons, 2007, ISBN 978-0-470-17155-4 .
Robbins, H., Some aspects of the sequential design of experiments, Bulletin of the American Mathematical Society, 1952, 58 (5): 527–535, doi:10.1090/S0002-9904-1952-09620-8  .
Sutton, Richard; Barto, Andrew, Reinforcement Learning, MIT Press, 1998, ISBN 978-0-262-19398-6, （原始內容存檔於2013-12-11） .

Allesiardo, Robin, A Neural Networks Committee for the Contextual Bandit Problem, Neural Information Processing – 21st International Conference, ICONIP 2014, Malaisia, November 03-06,2014, Proceedings, Lecture Notes in Computer Science 8834, Springer: 374–381, 2014, ISBN 978-3-319-12636-4, S2CID 14155718, arXiv:1409.8191 , doi:10.1007/978-3-319-12637-1_47 .

Weber, Richard, On the Gittins index for multiarmed bandits, Annals of Applied Probability, 1992, 2 (4): 1024–1033, JSTOR 2959678, doi:10.1214/aoap/1177005588  .
Katehakis, M.; C. Derman, Computing optimal sequential allocation rules in clinical trials, Adaptive statistical procedures and related topics, Institute of Mathematical Statistics Lecture Notes - Monograph Series 8: 29–39, 1986, ISBN 978-0-940600-09-6, JSTOR 4355518, doi:10.1214/lnms/1215540286 .
Katehakis, M.; A. F. Veinott, Jr., The multi-armed bandit problem: decomposition and computation, Mathematics of Operations Research, 1987, 12 (2): 262–268, JSTOR 3689689, S2CID 656323, doi:10.1287/moor.12.2.262.

外部連結

MABWiser, open source Python implementation of bandit strategies that supports context-free, parametric and non-parametric contextual policies with built-in parallelization and simulation capability.
PyMaBandits （頁面存檔備份，存於網際網路檔案館）, open source implementation of bandit strategies in Python and Matlab.
Contextual （頁面存檔備份，存於網際網路檔案館）, open source R package facilitating the simulation and evaluation of both context-free and contextual Multi-Armed Bandit policies.
bandit.sourceforge.net Bandit project （頁面存檔備份，存於網際網路檔案館）, open source implementation of bandit strategies.
Banditlib （頁面存檔備份，存於網際網路檔案館）, Open-Source implementation of bandit strategies in C++.
Leslie Pack Kaelbling and Michael L. Littman (1996). Exploitation versus Exploration: The Single-State Case.
Tutorial: Introduction to Bandits: Algorithms and Theory. Part1 （頁面存檔備份，存於網際網路檔案館）. Part2 （頁面存檔備份，存於網際網路檔案館）.
Feynman's restaurant problem, a classic example (with known answer) of the exploitation vs. exploration tradeoff.
Bandit algorithms vs. A-B testing （頁面存檔備份，存於網際網路檔案館）.
S. Bubeck and N. Cesa-Bianchi A Survey on Bandits （頁面存檔備份，存於網際網路檔案館）.
A Survey on Contextual Multi-armed Bandits （頁面存檔備份，存於網際網路檔案館）, a survey/tutorial for Contextual Bandits.
Blog post on multi-armed bandit strategies, with Python code （頁面存檔備份，存於網際網路檔案館）.
Animated, interactive plots （頁面存檔備份，存於網際網路檔案館） illustrating Epsilon-greedy, Thompson sampling, and Upper Confidence Bound exploration/exploitation balancing strategies.

[doi10.1023/A:1013689704352-1] Auer, P.; Cesa-Bianchi, N.; Fischer, P. Finite-time Analysis of the Multiarmed Bandit Problem. Machine Learning. 2002, 47 (2/3): 235–256. doi:10.1023/A:1013689704352 .

[2] Katehakis, M. N.; Veinott, A. F. The Multi-Armed Bandit Problem: Decomposition and Computation. Mathematics of Operations Research. 1987, 12 (2): 262–268. S2CID 656323. doi:10.1287/moor.12.2.262.

[Gittins89-3] 3.0 ^3.1 ^3.2 引用錯誤：沒有為名為Gittins89的參考文獻提供內容

[BF-4] 4.0 ^4.1 ^4.2 引用錯誤：沒有為名為BF的參考文獻提供內容

[weber-5] Weber, Richard, On the Gittins index for multiarmed bandits, Annals of Applied Probability, 1992, 2 (4): 1024–1033, JSTOR 2959678, doi:10.1214/aoap/1177005588 

[6] Robbins, H. Some aspects of the sequential design of experiments. Bulletin of the American Mathematical Society. 1952, 58 (5): 527–535. doi:10.1090/S0002-9904-1952-09620-8 .

[Gittins1979-7] J. C. Gittins. Bandit Processes and Dynamic Allocation Indices. Journal of the Royal Statistical Society. Series B (Methodological). 1979, 41 (2): 148–177. JSTOR 2985029. S2CID 17724147. doi:10.1111/j.2517-6161.1979.tb01068.x.

[1]

[2]

[3]

[4]

[5]

[6]

[7]