Tutorial summary: Reinforcement Learning

Imam Firdaus · 9 minute read

Tutorial LSI Design Contest Okinawa

http://www.lsi-contest.com/2021/shiyou_3-1e.html

Reinforcement Learning

A field of machine learning. Through trial and error, the agent (the entity that performs actions) learns the behavior that yields the maximum reward.

Sequence of reinforcement learning

In reinforcement learning, learning is carried out using four elements: “Environment”, “Agent”, “Action”, and “Reward”.

  • Environment: the world in which the agent acts
  • Agent: the entity that acts
  • Action: the agent’s behavior
  • Reward: feedback given for an action

The sequence of reinforcement learning is shown in Fig. 1 below.

  1. For each time step t, the agent observes the current state from the environment.
  2. The agent chooses the action with the highest value (the highest Q value) among the actions it can take in the present state.
  3. The agent receives a reward for the good or bad results of the action.
  4. The Q value (an indicator of the value of an action) for state s_t and action a_t is updated.

Maze Exploration using reinforcement learning

  • Agent: a person
  • State: which of S1 ~ S25 the agent is in
  • Action: move in the direction of “→”, “↑”, “←”, “↓”
  • Reward: S5, S7, S8, S14, S17, S19, S20, S22: negative reward (demon); S25: positive reward (money); otherwise: no reward

Q-table

It can be rephrased that the purpose of reinforcement learning is to update the Q value, which is an index of the value of an action, and to complete a table of Q values that ultimately maximizes the reward the agent can obtain. Table 1 shows the Q-value table in its unlearned state for this example. *The initial values are generated randomly; S25 is set to 0 because it is the goal, so no further action is needed.

Table 1 quantifies the value of each action in each state. For example, in S1 the values of “→”, “↑”, “←”, and “↓” are 0.1, 0.3, 0.2, and 0.5, respectively, so the most valuable action is to move “↓” (down). In this example, the Q-value table is completed through reinforcement learning.
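As a rough illustration, the unlearned Q-table could be set up in MATLAB as sketched below; the 25×4 size follows the maze's states and actions, while the variable names are only illustrative.

```matlab
% Illustrative setup of the unlearned Q-table (variable names are hypothetical).
% 25 states (S1..S25), 4 actions: right, up, left, down.
num_states  = 25;
num_actions = 4;

Q_table = rand(num_states, num_actions);  % random initial Q values
Q_table(25, :) = 0;                       % S25 is the goal, so no further action is needed
```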

Q-learning

There are several methods for updating the Q value; this example uses Q-learning, which is one of them. The equation for the Q-value update in Q-learning is shown below.

$$Q_{new}(s_t,a_t) = (1-\alpha)Q(s_t,a_t)+\alpha(r_t+\gamma \max_a Q(s_{t+1},a))$$

The flow of the Q-value update is examined using a concrete example. The agent observes that its current state is S1. From Table 1, the action with the highest Q value, moving “↓”, is selected. The transition destination is S6, so a reward of 0 is received, and the Q value is updated based on this reward.
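A minimal sketch of this single update step, continuing from the Q_table initialized in the previous sketch; the values of alpha and gamma below are assumptions chosen only for illustration.

```matlab
% One Q-learning update for the S1 -> S6 step described above.
% alpha and gamma are assumed values, used here only for illustration.
alpha  = 0.1;   % learning rate
gamma  = 0.9;   % discount factor

s      = 1;     % current state S1
a      = 4;     % chosen action "down" (the highest Q value in S1)
s_next = 6;     % transition destination S6
r      = 0;     % reward received for this transition

Q_table(s, a) = (1 - alpha) * Q_table(s, a) + ...
                alpha * (r + gamma * max(Q_table(s_next, :)));
```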

Learning Results

Based on the learning example, the route taken by the agent is shown below.

MATLAB example

The project consists of six MATLAB files and one program-description document:

  1. Q_Learning.m : The main file
  2. Action.m : Function that decides the agent’s behavior
  3. Routing.m : Function that displays the route based on Q_table
  4. Search_Location.m : Function that finds the location of the agent
  5. update_Qvalue.m : Function that updates the Q values
  6. print_Qtable.m : Prints the Q table
  7. Description_of_the_program.docs : Explanation of each file and the algorithm

These files implement the Q-learning algorithm in MATLAB; the algorithm itself has been explained above.
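A rough sketch of how the main loop in Q_Learning.m might tie these files together is shown below; the function signatures, start state, and episode count are assumptions for illustration and may differ from the actual project files.

```matlab
% Hypothetical sketch of the main loop in Q_Learning.m.
% The signatures of Action, Search_Location, update_Qvalue, and print_Qtable
% are assumptions; the actual project files may define them differently.
num_episodes = 100;
goal_state   = 25;

for episode = 1:num_episodes
    state = 1;                                                   % start at S1
    while state ~= goal_state
        action = Action(Q_table, state);                         % choose an action
        [next_state, reward] = Search_Location(state, action);   % move the agent
        Q_table = update_Qvalue(Q_table, state, action, reward, next_state);
        state = next_state;
    end
end
print_Qtable(Q_table);                                           % show the learned Q values
```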

Challenge

For level 1, the program is run in Vivado. The program feeds in several values, and the simulated component outputs the maximum of the values it receives.

Commentary/Summary

Reinforcement learning learns behavior through trial and error in order to maximize reward. The best action in each state is represented through a Q table. After each action, the Q table is updated based on the Q-learning parameters and the reward. Q-learning does not require the environment to be fully modeled.

A Short Tutorial on Reinforcement Learning

https://www.youtube.com/watch?v=KHOKV4YsHSU

A machine learning approach concerned with how an agent makes decisions in an environment to maximize reward.

Elements of RL

  • Policy
  • Reward
  • Value
  • Environment model (optional)

Q-Learning

  • A model-free RL algorithm
  • The Q value is the value of an action taken from a particular state
  • Can be represented as an N×Z table, where N is the number of states and Z is the number of actions

Q-Learning Algorithm

Initialize parameters a, g, e (learning rate, discount factor, exploration rate)
Initialize the Q table
Loop over n episodes:
    Initialize state S
    Loop until a terminal state is reached:
        Choose action A from S using a policy derived from Q (e.g. epsilon-greedy)
        Take action A, observe reward R and next state S'
        Update Q(S,A) with
            Q(S,A) = Q(S,A) + a(R + g max_a Q(S',a) - Q(S,A))
        S <- S'
    Episode++

Action selection methods

  • Take the action with the largest value (greedy)
  • Take a random action with probability e, otherwise act greedily with probability 1 - e (epsilon-greedy); see the sketch below
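A minimal MATLAB sketch of epsilon-greedy selection; the Q table, state, and epsilon value below are placeholders, not taken from the video.

```matlab
% Epsilon-greedy action selection (all values are placeholders).
epsilon     = 0.1;                         % exploration probability
num_actions = 4;
state       = 1;
Q_table     = rand(25, num_actions);       % placeholder Q table

if rand() < epsilon
    action = randi(num_actions);           % explore: pick a random action
else
    [~, action] = max(Q_table(state, :));  % exploit: pick the greedy action
end
```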

Demo:

https://mladdict.com/q-learning-simulator

Commentary

This tutorial explains the elements of Q-learning, what Q-learning is, the Q-learning algorithm, and the action-selection methods used in Q-learning. A Q-learning demonstration is also shown.

Tutorial on RL Chip Architecture

https://www.youtube.com/watch?v=Kj4J-MdmCZo&list=PLkYjWBQDCTmD-4Vj12jzzDpi0i7pKHGHZ&index=17

System Model

  • The agent lives in an environment where it performs actions
  • An interpreter observes the environment and issues the reward
  • The reward, or reinforcement, is a quality figure for the action
  • It may be a positive or a negative number

Q-Learning Algorithm

  • One of the best-known and most widely used RL methods
  • Based on a quality (Q) matrix
  • Its size is N×Z, where N is the number of states and Z is the number of actions
  • Each row corresponds to a unique state
  • The action is chosen based on the values in that row's columns
  • At the beginning, the Q matrix is initialized with random or zero values
  • Updated using (a worked numerical example follows this list): $$Q_{new}(s_t,a_t) = (1-\alpha)Q(s_t,a_t)+\alpha(r_t+\gamma \max_a Q(s_{t+1},a))$$ where
    • $s_t$ and $s_{t+1}$ are the current state and the next state
    • $a_t$ and $a_{t+1}$ are the current action and the next action
    • $\gamma \in [0,1]$ is the discount factor: how much the agent accounts for long-term reward
    • $\alpha \in [0,1]$ is the learning rate: how much the agent accounts for the newest knowledge
    • $r_t$ is the current reward
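As a quick numerical check of the update equation above, with assumed example values $\alpha = 0.5$, $\gamma = 0.9$, $Q(s_t,a_t) = 0.5$, $r_t = 1$, and $\max_a Q(s_{t+1},a) = 0.8$:

$$Q_{new}(s_t,a_t) = (1-0.5)\cdot 0.5 + 0.5\,(1 + 0.9 \cdot 0.8) = 0.25 + 0.86 = 1.11$$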

Q-Learning Architecture

  • High-level architecture

  • The system consists of a Policy Generator (PG) and a Q-Learning Accelerator (QLA)

  • The PG is application-dependent

  • The agent receives the state $s_{t+1}$ and the reward $r_{t+1}$ from the observer

  • The next action is generated by the PG according to the values in the QLA

  • $s_t$, $a_t$, and $r_t$ are obtained by delaying $s_{t+1}$, $a_{t+1}$, and $r_{t+1}$ with registers

  • Q-Learning Accelerator architecture

  1. Block RAMs implement dual-port RAMs that each store an entire column of the Q matrix (one per action)

  2. The Q Updater implements the Q-matrix update equation

  3. The Max block chooses the maximum Q value; its propagation delay can be the main limitation

Implementation Results, Experiment, and Performance Comparison

Optimizations:

  • Reduce the number of multiplications in $Q_{upt}$
  • Reduce the mux propagation delay
  • Multiplications by $\alpha$ and $\gamma$ are replaced with right shifts (see the sketch after this list)
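A rough MATLAB sketch of the shift-based update mentioned in the last item, assuming $\alpha = 2^{-2}$, $\gamma = 2^{-1}$, and integer (fixed-point) Q values; all names, values, and bit widths here are illustrative and not taken from the reference design.

```matlab
% Shift-based Q update: multiplications by alpha and gamma become right shifts.
% Assumes alpha = 2^-2 and gamma = 2^-1; everything here is a placeholder.
Q_fx   = int32(randi([0 255], 25, 4));   % fixed-point Q matrix (placeholder)
s      = 1;  a = 4;  s_next = 6;         % example state, action, next state
r_fx   = int32(0);                       % fixed-point reward

alpha_shift = 2;                          % alpha = 1/4 -> right shift by 2
gamma_shift = 1;                          % gamma = 1/2 -> right shift by 1

q_old  = Q_fx(s, a);
target = r_fx + bitshift(max(Q_fx(s_next, :)), -gamma_shift);  % r + gamma*max Q
Q_fx(s, a) = q_old - bitshift(q_old, -alpha_shift) ...         % (1-alpha)*q_old
                   + bitshift(target, -alpha_shift);           % + alpha*target
```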

LSI Design Flow

  1. Mathematical Modelling
  2. Architecture design
    1. Processing element
    2. Memory unit
    3. Control unit
    4. System integration
  3. Hardware Modelling
  4. RTL Simulation
  5. FPGA Implementation

Commentary

This tutorial explains the RL system model and gives pseudocode for Q-learning. It also presents a hardware architecture used to accelerate Q-learning, along with methods for optimizing that hardware. Finally, the LSI design flow is described.

Introduction to Reinforcement Learning

https://www.youtube.com/watch?v=ClJgvgUS_xw

Reinforcement Learning

How an agent can become proficient in an unknown environment, given only its percepts and occasional rewards.

Reward: feedback that helps the agent know that something good or bad has happened

Differences from other ML paradigms

  1. No supervisor
  2. Feedback is delayed
  3. Time matters; the sequence of decisions is considered
  4. The agent’s actions affect the data it sees

Variables

History

  • The sequence of observations, actions, and rewards
  • $H_t = O_1, R_1, A_1, \dots, A_{t-1}, O_t, R_t$
  • All observable variables up to time t
  • The history influences (1) the agent’s actions and (2) the rewards

State

  • The information used to determine what happens next
  • A function of the history: $S_t = f(H_t)$
  • Sa = the agent’s state
  • Se = the environment’s state

Components

An RL agent may include one or more of these components (formal notation is sketched after the list):

  • Policy: the agent’s behaviour function
    • Maps from state to action
    • Can be deterministic or stochastic
  • Value function: the value of being in a certain state or of taking a certain action
    • A prediction of future reward
    • Can be used to select actions
  • Model: the agent’s representation of the environment
    • Predicts what the environment will do next
    • P predicts the next state
    • R predicts the next immediate reward
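Using standard RL notation (added here for reference; the video keeps these definitions informal), the three components can be written as:

$$\pi(a \mid s) = P[A_t = a \mid S_t = s]$$

$$v_\pi(s) = \mathbb{E}_\pi[R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \dots \mid S_t = s]$$

$$\mathcal{P}^a_{ss'} = P[S_{t+1} = s' \mid S_t = s, A_t = a], \qquad \mathcal{R}^a_s = \mathbb{E}[R_{t+1} \mid S_t = s, A_t = a]$$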

Example

RL Agent Taxonomy

  • Policy vs. Value Function
    • Value based
    • Policy based
    • Actor Critic
  • Model
    • Model free
    • Model based

Commentary

This tutorial explains reinforcement learning and rewards, the differences between RL and other ML paradigms, the variables involved, and the components of an RL agent. The types of RL agents are also described.

Vivado HLS Tutorial

https://drive.google.com/drive/folders/1ZgQ7w-WVJF_Rx3SzOPpMEYJiYxsiosax

This tutorial covers the entire RTL design workflow with HLS, which includes:

  1. HLS coding (videos 1-4)

    1. Create a project
      1. Name the project
      2. Select the board
    2. Download the library : https://github.com/definelicht/hlslib
    3. Include the library
    4. Create the data config
    5. Add the cflags
    6. Vector representation
    7. Create the vector reader
      1. Show the architecture
      2. Write the code
    8. Create the scalar writer
    9. Explain the architecture of the multiplier, adder, and accumulator
    10. Create the multiplier
    11. Create the adder
    12. Create the accumulator
    13. Create the top-level interface

    Self-note:

    • The cflags must be added to every cpp file
    • Make sure the cflags are also set for the top testbench
    • Enter only the flag itself, without the leading -cflag ""
  2. Functional simulation with a C testbench (video 5)

    1. Create the testbench
    2. Run the simulation
  3. RTL simulation with waveforms (video 6)

    1. Set the top-level function
    2. Create a directive on the interface module
    3. Create a directive on the adder
    4. Create a directive on the accumulator
    5. Create a directive on the top level
    6. Run C synthesis
    7. Run the analysis
    8. Timing simulation with C/RTL co-simulation

    Self-note:

    • If compilation fails, try moving the project to drive C:
  4. Export the HLS IP (video 6)

    1. Select the menu
    2. Select the options
    3. Synthesize to Vivado
  5. Vivado synthesis (video 7)

    1. Create a new Vivado project
    2. IP catalog -> Add repository
    3. Add a block design
    4. Run block automation
    5. Configure the PS-PL interface
    6. Add the HLS IP
    7. Run block automation
    8. Configure the PS-PL interface
    9. Validate the design
    10. Add a wrapper
    11. Generate the block design
    12. Run synthesis
    13. Run implementation
    14. Generate the bitstream
    15. Export the hardware design and bitstream
    16. Copy the hardware handoff file
    17. Rename the three files to the same name

    Self-note

  6. Write the PS code on the PYNQ (video 8)

    1. Download the PYNQ-Z1 board files : https://pynq.readthedocs.io/en/v2.3/overlay_design_methodology/board_settings.html
    2. Install those files into Vivado and Vitis
    3. Copy the three files to the PYNQ board using ssh, smb, or sftp.
    4. Open Jupyter Notebook at http://192.168.2.99:9090
    5. Run the commands as shown in the tutorial

Setting up the PYNQ framework:

  1. Prepare a USB cable
  2. Download the PYNQ-Z1 image file : https://github.com/Xilinx/PYNQ/releases
  3. Open the PYNQ 2.6 documentation if needed : https://pynq.readthedocs.io/en/v2.3/getting_started/pynq_z1_setup.html
  4. Prepare a network cable and a switch or router
  5. Prepare an SD card
  6. Add the board files to Vitis and Vivado
  7. Flash the image file to the SD card
  8. Change the jumper position on the board
  9. Plug the USB programming cable into USB power
  10. Plug in the Ethernet cable
  11. Flip the power switch

Self-note:

  • Recompile everything if the PYNQ board was not used from the start, including in Vivado and Vitis

VLSI Design Tutorial

This tutorial is a compilation of tutorials on VLSI design in Vivado:

  1. Module 1

  2. Module 2

  3. Module 5

  4. Module 6

  5. BRAM tutorial

  6. Xilinx Vivado tutorial

Using Interrupts

https://github.com/k0nze/zedboard_pl_to_ps_interrupt_example

Using AXI Full

https://github.com/k0nze/zedboard_axi4_master_burst_example

Using GPIO/MIO/EMIO