Reinforcement Learning


°­È­ÇнÀÀ̶õ ÀßÇÑ Çൿ¿¡ ´ëÇØ ÄªÂù ¹Þ°í À߸øÇÑ Çൿ¿¡ ´ëÇØ ¹úÀ» ¹ÞÀº °æÇèÀ» ÅëÇØ ÀÚ½ÅÀÇ Áö½ÄÀ» Å°¿ö³ª°¡´Â ÇнÀ¹ýÀÌ´Ù. ·Îº¿ (Robot) ÀÌ ¿©·¯ ¹øÀÇ ½ÇÆÐ¿Í ¼º°ø°æÇèÀ» ½×À¸¸ç ÁÖ¾îÁø ÀÛ¾÷À» Àß ¼öÇàÇÒ ¼ö ÀÖµµ·Ï ÇÏ´Â °ÍÀÌ´Ù. ·Îº¿Àº ¾î¶² »óÅ¿¡¼­ °¡´ÉÇÑ Çൿµé ÁßÀÇ Çϳª¸¦ ¼±ÅÃ, ÀÌ Çൿ °á°ú¿¡ µû¸¥ Æ÷»ó (reward) À» ¹Þ°í ³ª¼­ ´ÙÀ½ »óŸ¦ ¾Ë°Ô µÈ´Ù.......  (±èÁ¾È¯ 2001)

Reinforcement learning is a method of improving one's behavior through interaction with the environment and the reinforcement signal that results from it. Because it guarantees learning and adaptation without accurate prior knowledge of the environment, it is useful for robot learning. The biggest problem with reinforcement learning, however, is that learning becomes difficult when the reward for an action cannot be computed immediately. ... (Sim Kwee-Bo)

Reinforcement learning is learning what to do (how to map situations to actions) so as to maximize a numerical reward signal. Unlike in most of machine learning, the learner is not told directly which actions to take, but must instead discover which actions yield the best reward. ... (AI Topics : reinforcement learning)

paper :

±â°èÇнÀÀÇ ¹®Á¦¿¡¼­´Â ¾î¶² ȯ°æ¿¡ ³õ¿©ÀÖ´Â ¿¡ÀÌÀüÆ® (¶Ç´Â ·Îº¿) ¸¦ °¡Á¤ÇÏ°í, ±× ¿¡ÀÌÀüÆ®°¡ ÀÚ½ÅÀÇ ÇöÀç»óŸ¦ Áö°¢ÇÏ°í ÇൿÀ» ÇÑ´Ù. ¸¶Âù°¡Áö·Î ȯ°æÀº º¸»ó (reward, ±àÁ¤ÀûÀÌµç ºÎÁ¤ÀûÀ̵ç) À» ÇÑ´Ù. °­È­ÇнÀ ¾Ë°í¸®ÁòÀº ¹®Á¦ÀÇ ÇØ°á°úÁ¤¿¡¼­ ¿¡ÀÌÀüÆ®¿¡ ´ëÇÑ ´©ÀûµÈ º¸»óÀ» ÃÖ´ë·Î ¸¸µå´Â policy ¸¦ ãÀ¸·Á´Â °ÍÀÌ´Ù.

The environment is typically formulated as a finite-state Markov decision process (MDP), and reinforcement learning algorithms for this setting are closely related to dynamic programming techniques. The state transition probabilities and reward probabilities of the MDP are in general stochastic, but stationary over the course of the problem.
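To make the formulation concrete, here is a sketch of a finite-state MDP written out as explicit tables; the two states, two actions, and all numbers are invented for illustration:

    # P[s][a] lists (probability, next_state) pairs; R[s][a] is the expected reward.
    # Both tables are stochastic but stationary: they never change as time passes.
    P = {
        0: {"stay": [(1.0, 0)], "go": [(0.8, 1), (0.2, 0)]},
        1: {"stay": [(1.0, 1)], "go": [(1.0, 0)]},
    }
    R = {
        0: {"stay": 0.0, "go": 0.0},
        1: {"stay": 1.0, "go": 0.0},
    }

Dynamic programming methods operate directly on tables like these; the value iteration sketch after the Bellman equation paragraph below reuses this same representation.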

Reinforcement learning differs from the supervised learning problem in that correct input/output pairs are never presented, nor are sub-optimal actions explicitly corrected. Furthermore, the focus is on on-line performance, which involves finding a balance between exploration (of uncharted territory) and exploitation (of current knowledge).
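The standard illustration of this balance is the epsilon-greedy rule; a small sketch, assuming action-value estimates are already stored in a list (the function name and parameter values are mine, not from the sources):

    import random

    def epsilon_greedy(q_values, epsilon=0.1):
        # With probability epsilon, explore: try a uniformly random action.
        if random.random() < epsilon:
            return random.randrange(len(q_values))
        # Otherwise exploit: take the action with the highest current estimate.
        return max(range(len(q_values)), key=lambda a: q_values[a])

    action = epsilon_greedy([0.2, 0.5, 0.1])  # usually returns 1, sometimes explores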

Formally, the basic reinforcement learning model consists of:

  1. a set of environment states S;
  2. a set of actions A;
  3. a set of scalar "rewards" in ℝ.

At each time t, the agent perceives its state s_t ∈ S and the set of available actions A(s_t). It chooses an action a ∈ A(s_t) and receives from the environment the new state s_{t+1} and a reward r_{t+1}. Based on these interactions between agent and environment, the reinforcement learning agent must develop a policy π : S → A which maximizes the cumulative reward r_0 + r_1 + ... + r_n for MDPs that have a terminal state, or the cumulative reward Σ_t γ^t r_t for MDPs without terminal states (where γ is some "future reward" discounting factor between 0.0 and 1.0).
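The discounted sum is easy to state in code; a small sketch (function name and sample numbers are illustrative only):

    def discounted_return(rewards, gamma=0.9):
        # Computes the sum over t of gamma**t * r_t for a finite reward sequence.
        return sum(gamma ** t * r for t, r in enumerate(rewards))

    # With gamma = 1.0 this reduces to the undiscounted sum r_0 + r_1 + ... + r_n:
    print(discounted_return([0.0, 0.0, 1.0], gamma=1.0))  # 1.0
    print(discounted_return([0.0, 0.0, 1.0], gamma=0.9))  # ≈ 0.81 (= 0.9**2)

Discounting with γ < 1 keeps the infinite sum finite for non-terminating MDPs and expresses that early rewards are worth more than late ones.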

Reinforcement learning is particularly well suited to problems where long-term reward can be had at the expense of short-term reward. Problems of this kind are typically handled with a reinforcement learning technique known as temporal difference learning. Reinforcement learning has been applied successfully to a variety of problems, such as robot control, elevator scheduling, and backgammon.
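Q-learning is one widely used temporal difference method; a minimal sketch of its update rule, with the table layout, step size, and all numbers assumed for illustration:

    # Q[s][a] holds the current estimate of long-term (discounted) reward.
    def td_update(Q, s, a, r, s_next, alpha=0.1, gamma=0.9):
        # Temporal difference error: observed reward plus discounted best
        # future estimate, minus what we currently believe about (s, a).
        td_error = r + gamma * max(Q[s_next]) - Q[s][a]
        Q[s][a] += alpha * td_error

    Q = [[0.0, 0.0], [0.0, 0.0]]     # two states, two actions
    td_update(Q, s=0, a=1, r=1.0, s_next=1)
    print(Q[0][1])                   # 0.1: the estimate moved toward the target

Because the update only needs one observed transition (s, a, r, s'), it handles exactly the delayed-reward situation described above: credit propagates backward one step at a time.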

Reinforcement learning estimates the optimal value function (V), which indicates how desirable each state is. The estimates are based on the recursive Bellman equations. ... (Wikipedia : Reinforcement learning)
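Value iteration is the classic dynamic programming way to solve these equations on a known, finite MDP; a self-contained sketch reusing the invented two-state tables from above:

    # Bellman optimality backup, applied until the values stop changing:
    # V(s) <- max_a [ R(s,a) + gamma * sum over s' of P(s'|s,a) * V(s') ]
    P = {
        0: {"stay": [(1.0, 0)], "go": [(0.8, 1), (0.2, 0)]},
        1: {"stay": [(1.0, 1)], "go": [(1.0, 0)]},
    }
    R = {
        0: {"stay": 0.0, "go": 0.0},
        1: {"stay": 1.0, "go": 0.0},
    }
    gamma = 0.9
    V = {s: 0.0 for s in P}
    for _ in range(200):
        V = {s: max(R[s][a] + gamma * sum(p * V[s2] for p, s2 in P[s][a])
                    for a in P[s])
             for s in P}
    print(V)  # V(1) approaches 1 / (1 - gamma) = 10 by staying and collecting reward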

term :

°­È­ ÇнÀ (Reinforcement learning)    °­È­ (Reinforcement)   AlphaGo    ·Îº¿ (Robot)    ±â°èÇнÀ (Machine Learning)    µö ·¯´× (Deep Learning)

site :

Wikipedia : Reinforcement learning    Korean Wikipedia : Reinforcement learning

Behavior Planning of Autonomous Mobile Robots Using Reinforcement Learning : Chung-Ang University, Sim Kwee-Bo

[Robot Story] Reinforcement Learning... Praise Makes It Smarter : JoongAng Ilbo 2001. 10. 26, Kim Jong-Hwan

video :

Implementing NPC AI with Reinforcement Learning - Lee Kyung-Jong : IBS Inven Broadcasting : 2016/10/31


Useful and Fascinating Reinforcement Learning Worth Knowing - Kim Tae-Hoon : naver d2 : 2017/09/11


Introduction of Deep Reinforcement Learning - Kwak Dong-Hyun : naver d2 : 2017/09/04


Reinforcement Learning : An Introduction - Richard Sutton : Textbook Reading 1 : J Hong : 2017/07/31


Reinforcement Learning : An Introduction - Richard Sutton : Textbook Reading 2 : J Hong : 2017/07/31