Tutorial summary: Reinforcement Learning

Imam Firdaus · 9 minute read

Tutorial LSI Design Contest Okinawa

http://www.lsi-contest.com/2021/shiyou_3-1e.html

Reinforcement Learning

A field of machine learning. Through trial and error, the agent (the entity that performs actions) learns the behavior that yields the maximum reward.

Sequence of reinforcement learning

In reinforcement learning, learning is carried out using four elements: “Environment”, “Agent”, “Action”, and “Reward”.

  • Environment: the world in which the agent acts
  • Agent: the entity that acts
  • Action: the agent’s behavior
  • Reward: feedback given for an action

The sequence of reinforcement learning is shown in Fig. 1 below.

  1. For each time step t, the agent observes the current state from the environment.
  2. The agent chooses the action with the highest value (the highest Q value) among the actions it can take in the present state.
  3. The agent receives a reward for the good or bad results of the action.
  4. The Q value (an indicator of the value of an action) for state s_t and action a_t is updated.

Maze Exploration using reinforcement learning

  • Agent: a person
  • State: which of S1 ~ S25 the agent is in
  • Action: move in the direction of “→”, “↑”, “←”, “↓”
  • Reward: S5, S7, S8, S14, S17, S19, S20, S22: negative reward (demon); S25: positive reward (money); otherwise: no reward

Q-table

It can be rephrased that the purpose of reinforcement learning is to update the Q value, which is an index of the value of an action, and to complete a table of Q values that ultimately maximizes the reward the agent can obtain. Table 1 shows the Q-value table in its unlearned state for this example. *The initial values are generated randomly; S25 is set to 0 because it is the goal, so no further action is needed.

Table 1 quantifies the value of each action in each state. For example, in S1 the values of “→”, “↑”, “←”, and “↓” are 0.1, 0.3, 0.2, and 0.5, respectively, so the most valuable action is to move “↓” (down). In this example, the Q-value table is completed through reinforcement learning.
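As a rough illustration, the unlearned Q-table could be set up in MATLAB as sketched below; the 25×4 size follows the maze's states and actions, while the variable names are only illustrative.

```matlab
% Illustrative setup of the unlearned Q-table (variable names are hypothetical).
% 25 states (S1..S25), 4 actions: right, up, left, down.
num_states  = 25;
num_actions = 4;

Q_table = rand(num_states, num_actions);  % random initial Q values
Q_table(25, :) = 0;                       % S25 is the goal, so no further action is needed
```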

Q-learning

There are several methods for updating the Q value; this example uses Q-learning, which is one of them. The equation for the Q-value update in Q-learning is shown below.

$$Q_{new}(s_t,a_t) = (1-\alpha)Q(s_t,a_t)+\alpha(r_t+\gamma \max_a Q(s_{t+1},a))$$

The flow of the Q-value update is examined using a concrete example. The agent observes that its current state is S1. From Table 1, the action with the highest Q value, moving “↓”, is selected. The transition destination is S6, so a reward of 0 is received, and the Q value is updated based on this reward.
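A minimal sketch of this single update step, continuing from the Q_table initialized in the previous sketch; the values of alpha and gamma below are assumptions chosen only for illustration.

```matlab
% One Q-learning update for the S1 -> S6 step described above.
% alpha and gamma are assumed values, used here only for illustration.
alpha  = 0.1;   % learning rate
gamma  = 0.9;   % discount factor

s      = 1;     % current state S1
a      = 4;     % chosen action "down" (the highest Q value in S1)
s_next = 6;     % transition destination S6
r      = 0;     % reward received for this transition

Q_table(s, a) = (1 - alpha) * Q_table(s, a) + ...
                alpha * (r + gamma * max(Q_table(s_next, :)));
```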

Learning Results

Based on the learning example, the route taken by the agent is shown below.

MATLAB example

The project consists of six MATLAB files and one program-description document:

  1. Q_Learning.m : The main file
  2. Action.m : Function that decides the agent’s behavior
  3. Routing.m : Function that displays the route based on Q_table
  4. Search_Location.m : Function that finds the location of the agent
  5. update_Qvalue.m : Function that updates the Q values
  6. print_Qtable.m : Prints the Q table
  7. Description_of_the_program.docs : Explanation of each file and the algorithm

These files implement the Q-learning algorithm in MATLAB; the algorithm itself has been explained above.
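A rough sketch of how the main loop in Q_Learning.m might tie these files together is shown below; the function signatures, start state, and episode count are assumptions for illustration and may differ from the actual project files.

```matlab
% Hypothetical sketch of the main loop in Q_Learning.m.
% The signatures of Action, Search_Location, update_Qvalue, and print_Qtable
% are assumptions; the actual project files may define them differently.
num_episodes = 100;
goal_state   = 25;

for episode = 1:num_episodes
    state = 1;                                                   % start at S1
    while state ~= goal_state
        action = Action(Q_table, state);                         % choose an action
        [next_state, reward] = Search_Location(state, action);   % move the agent
        Q_table = update_Qvalue(Q_table, state, action, reward, next_state);
        state = next_state;
    end
end
print_Qtable(Q_table);                                           % show the learned Q values
```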

Challenge

For level 1, the program is run in Vivado. The program feeds in several values, and the simulated component outputs the maximum of the values it receives.

Commentary/Summary

Reinforcement learning learns behavior through trial and error in order to maximize reward. The best action in each state is represented through a Q table. After each action, the Q table is updated based on the Q-learning parameters and the reward. Q-learning does not require the environment to be fully modeled.

A Short Tutorial on Reinforcement Learning

https://www.youtube.com/watch?v=KHOKV4YsHSU

A machine learning approach concerned with how an agent makes decisions in an environment to maximize reward.

Elements of RL

  • Policy
  • Reward
  • Value
  • Environment model (optional)

Q-Learning

  • A model-free RL algorithm
  • The Q value is the value of an action taken from a particular state
  • Can be represented as an N×Z table, where N is the number of states and Z is the number of actions

Q-Learning Algorithm

Initialize parameters a, g, e (learning rate, discount factor, exploration rate)
Initialize the Q table
Loop over n episodes:
    Initialize state S
    Loop until a terminal state is reached:
        Choose action A from S using a policy derived from Q (e.g. epsilon-greedy)
        Take action A, observe reward R and next state S'
        Update Q(S,A) with
            Q(S,A) = Q(S,A) + a(R + g max_a Q(S',a) - Q(S,A))
        S <- S'
    Episode++

Action selection methods

  • Take the action with the largest value (greedy)
  • Take a random action with probability e, otherwise act greedily with probability 1 - e (epsilon-greedy); see the sketch below
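A minimal MATLAB sketch of epsilon-greedy selection; the Q table, state, and epsilon value below are placeholders, not taken from the video.

```matlab
% Epsilon-greedy action selection (all values are placeholders).
epsilon     = 0.1;                         % exploration probability
num_actions = 4;
state       = 1;
Q_table     = rand(25, num_actions);       % placeholder Q table

if rand() < epsilon
    action = randi(num_actions);           % explore: pick a random action
else
    [~, action] = max(Q_table(state, :));  % exploit: pick the greedy action
end
```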

Demo:

https://mladdict.com/q-learning-simulator

Commentary

This tutorial explains the elements of Q-learning, what Q-learning is, the Q-learning algorithm, and the action-selection methods used in Q-learning. A Q-learning demonstration is also shown.

Tutorial on RL Chip Architecture

https://www.youtube.com/watch?v=Kj4J-MdmCZo&list=PLkYjWBQDCTmD-4Vj12jzzDpi0i7pKHGHZ&index=17

System Model

  • The agent lives in an environment where it performs actions
  • An interpreter observes the environment and issues the reward
  • The reward, or reinforcement, is a quality figure for the action
  • It may be a positive or a negative number

Q-Learning Algorithm

  • One of the best-known and most widely used RL methods
  • Based on a quality (Q) matrix
  • Its size is N×Z, where N is the number of states and Z is the number of actions
  • Each row corresponds to a unique state
  • The action is chosen based on the values in that row's columns
  • At the beginning, the Q matrix is initialized with random or zero values
  • Updated using (a worked numerical example follows this list): $$Q_{new}(s_t,a_t) = (1-\alpha)Q(s_t,a_t)+\alpha(r_t+\gamma \max_a Q(s_{t+1},a))$$ where
    • $s_t$ and $s_{t+1}$ are the current state and the next state
    • $a_t$ and $a_{t+1}$ are the current action and the next action
    • $\gamma \in [0,1]$ is the discount factor: how much the agent accounts for long-term reward
    • $\alpha \in [0,1]$ is the learning rate: how much the agent accounts for the newest knowledge
    • $r_t$ is the current reward
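As a quick numerical check of the update equation above, with assumed example values $\alpha = 0.5$, $\gamma = 0.9$, $Q(s_t,a_t) = 0.5$, $r_t = 1$, and $\max_a Q(s_{t+1},a) = 0.8$:

$$Q_{new}(s_t,a_t) = (1-0.5)\cdot 0.5 + 0.5\,(1 + 0.9 \cdot 0.8) = 0.25 + 0.86 = 1.11$$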

Q-Learning Architecture

  • High-level architecture

  • The system consists of a Policy Generator (PG) and a Q-Learning Accelerator (QLA)

  • The PG is application-dependent

  • The agent receives the state $s_{t+1}$ and the reward $r_{t+1}$ from the observer

  • The next action is generated by the PG according to the values in the QLA

  • $s_t$, $a_t$, and $r_t$ are obtained by delaying $s_{t+1}$, $a_{t+1}$, and $r_{t+1}$ with registers

  • Q-Learning Accelerator architecture

  1. Block RAMs implement dual-port RAMs that each store an entire column of the Q matrix (one per action)

  2. The Q Updater implements the Q-matrix update equation

  3. The Max block chooses the maximum Q value; its propagation delay can be the main limitation

Implementation Results, Experiment, and Performance Comparison

Optimizations:

  • Reduce the number of multiplications in $Q_{upt}$
  • Reduce the mux propagation delay
  • Multiplications by $\alpha$ and $\gamma$ are replaced with right shifts (see the sketch after this list)
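A rough MATLAB sketch of the shift-based update mentioned in the last item, assuming $\alpha = 2^{-2}$, $\gamma = 2^{-1}$, and integer (fixed-point) Q values; all names, values, and bit widths here are illustrative and not taken from the reference design.

```matlab
% Shift-based Q update: multiplications by alpha and gamma become right shifts.
% Assumes alpha = 2^-2 and gamma = 2^-1; everything here is a placeholder.
Q_fx   = int32(randi([0 255], 25, 4));   % fixed-point Q matrix (placeholder)
s      = 1;  a = 4;  s_next = 6;         % example state, action, next state
r_fx   = int32(0);                       % fixed-point reward

alpha_shift = 2;                          % alpha = 1/4 -> right shift by 2
gamma_shift = 1;                          % gamma = 1/2 -> right shift by 1

q_old  = Q_fx(s, a);
target = r_fx + bitshift(max(Q_fx(s_next, :)), -gamma_shift);  % r + gamma*max Q
Q_fx(s, a) = q_old - bitshift(q_old, -alpha_shift) ...         % (1-alpha)*q_old
                   + bitshift(target, -alpha_shift);           % + alpha*target
```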

LSI Design Flow

  1. Mathematical Modelling
  2. Architecture design
    1. Processing element
    2. Memory unit
    3. Control unit
    4. System integration
  3. Hardware Modelling
  4. RTL Simulation
  5. FPGA Implementation

Commentary

This tutorial explains the RL system model and gives pseudocode for Q-learning. It also presents a hardware architecture used to accelerate Q-learning, along with methods for optimizing that hardware. Finally, the LSI design flow is described.

Introduction to Reinforcement Learning

https://www.youtube.com/watch?v=ClJgvgUS_xw

Reinforcement Learning

How an agent can become proficient in an unknown environment, given only its percepts and occasional rewards.

Reward: feedback that helps the agent know that something good or bad has happened

Differences from other ML paradigms

  1. No supervisor
  2. Feedback is delayed
  3. Time matters; the sequence of decisions is considered
  4. The agent’s actions affect the data it sees

Variables

History

  • The sequence of observations, actions, and rewards
  • $H_t = O_1, R_1, A_1, \dots, A_{t-1}, O_t, R_t$
  • All observable variables up to time t
  • The history influences (1) the agent’s actions and (2) the rewards

State

  • The information used to determine what happens next
  • A function of the history: $S_t = f(H_t)$
  • Sa = the agent’s state
  • Se = the environment’s state

Components

An RL agent may include one or more of these components (formal notation is sketched after the list):

  • Policy: the agent’s behaviour function
    • Maps from state to action
    • Can be deterministic or stochastic
  • Value function: the value of being in a certain state or of taking a certain action
    • A prediction of future reward
    • Can be used to select actions
  • Model: the agent’s representation of the environment
    • Predicts what the environment will do next
    • P predicts the next state
    • R predicts the next immediate reward
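Using standard RL notation (added here for reference; the video keeps these definitions informal), the three components can be written as:

$$\pi(a \mid s) = P[A_t = a \mid S_t = s]$$

$$v_\pi(s) = \mathbb{E}_\pi[R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \dots \mid S_t = s]$$

$$\mathcal{P}^a_{ss'} = P[S_{t+1} = s' \mid S_t = s, A_t = a], \qquad \mathcal{R}^a_s = \mathbb{E}[R_{t+1} \mid S_t = s, A_t = a]$$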

Example

RL Agent Taxonomy

  • Policy vs. Value Function
    • Value based
    • Policy based
    • Actor Critic
  • Model
    • Model free
    • Model based

Commentary

This tutorial explains reinforcement learning and rewards, the differences between RL and other ML paradigms, the variables involved, and the components of an RL agent. The types of RL agents are also described.

Vivado HLS Tutorial

https://drive.google.com/drive/folders/1ZgQ7w-WVJF_Rx3SzOPpMEYJiYxsiosax

This tutorial covers the entire RTL design workflow with HLS, which includes:

  1. HLS coding (videos 1-4)

    1. Create a project
      1. Name the project
      2. Select the board
    2. Download the library : https://github.com/definelicht/hlslib
    3. Include the library
    4. Create the data config
    5. Add the cflags
    6. Vector representation
    7. Create the vector reader
      1. Show the architecture
      2. Write the code
    8. Create the scalar writer
    9. Explain the architecture of the multiplier, adder, and accumulator
    10. Create the multiplier
    11. Create the adder
    12. Create the accumulator
    13. Create the top-level interface

    Self-note:

    • The cflags must be added to every cpp file
    • Make sure the cflags are also set for the top testbench
    • Enter only the flag itself, without the leading -cflag ""
  2. Functional simulation with a C testbench (video 5)

    1. Create the testbench
    2. Run the simulation
  3. RTL simulation with waveforms (video 6)

    1. Set the top-level function
    2. Create a directive on the interface module
    3. Create a directive on the adder
    4. Create a directive on the accumulator
    5. Create a directive on the top level
    6. Run C synthesis
    7. Run the analysis
    8. Timing simulation with C/RTL co-simulation

    Self-note:

    • If compilation fails, try moving the project to drive C:
  4. Export the HLS IP (video 6)

    1. Select the menu
    2. Select the options
    3. Synthesize to Vivado
  5. Vivado synthesis (video 7)

    1. Create a new Vivado project
    2. IP catalog -> Add repository
    3. Add a block design
    4. Run block automation
    5. Configure the PS-PL interface
    6. Add the HLS IP
    7. Run block automation
    8. Configure the PS-PL interface
    9. Validate the design
    10. Add a wrapper
    11. Generate the block design
    12. Run synthesis
    13. Run implementation
    14. Generate the bitstream
    15. Export the hardware design and bitstream
    16. Copy the hardware handoff file
    17. Rename the three files to the same name

    Self-note

  6. Write the PS code on the PYNQ (video 8)

    1. Download the PYNQ-Z1 board files : https://pynq.readthedocs.io/en/v2.3/overlay_design_methodology/board_settings.html
    2. Install those files into Vivado and Vitis
    3. Copy the three files to the PYNQ board using ssh, smb, or sftp.
    4. Open Jupyter Notebook at http://192.168.2.99:9090
    5. Run the commands as shown in the tutorial

Setting up the PYNQ framework:

  1. Prepare a USB cable
  2. Download the PYNQ-Z1 image file : https://github.com/Xilinx/PYNQ/releases
  3. Open the PYNQ 2.6 documentation if needed : https://pynq.readthedocs.io/en/v2.3/getting_started/pynq_z1_setup.html
  4. Prepare a network cable and a switch or router
  5. Prepare an SD card
  6. Add the board files to Vitis and Vivado
  7. Flash the image file to the SD card
  8. Change the jumper position on the board
  9. Plug the USB programming cable into USB power
  10. Plug in the Ethernet cable
  11. Flip the power switch

Self-note:

  • Recompile everything if the PYNQ board was not used from the start, including in Vivado and Vitis

VLSI Design Tutorial

This tutorial is a compilation of tutorials on VLSI design in Vivado:

  1. Module 1

  2. Module 2

  3. Module 5

  4. Module 6

  5. BRAM tutorial

  6. Xilinx Vivado tutorial

Using Interrupts

https://github.com/k0nze/zedboard_pl_to_ps_interrupt_example

Using AXI Full

https://github.com/k0nze/zedboard_axi4_master_burst_example

Using GPIO/MIO/EMIO