Self-Attention Intro

Jan 15, 2025 | Apr 21, 2025

REF: https://www.bilibili.com/video/BV1v3411r78R/?spm_id_from=333.337.search-card.all.click&vd_source=8be5ba4fcf7c69b9960ed391f70c5fb0

Self-Attention

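Self-attention is a method/layer that takes an ordered input sequence and produces an output sequence in which each element combines positional information with the meaning of its surrounding context.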

Problem it addresses: inputs made up of multiple vectors (e.g. voice, graph).

Two ways to encode the set of input vectors

Three kinds of output

Intermediate process

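The drawback of passing each vector through an FC (fully connected) layer on its own is that the output ignores the surrounding context. Enlarging the window so the FC layer sees more neighbors gives it too many parameters, which hurts the result.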
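
To make the parameter growth concrete, here is a minimal sketch that counts the weights and biases of a single FC layer whose input is a window of concatenated vectors. The function name `fc_window_params` and the sizes are illustrative assumptions, not values from the lecture.

```python
def fc_window_params(d_in, d_out, window):
    """Weights plus biases of one FC layer fed `window` concatenated vectors."""
    return window * d_in * d_out + d_out

# Hypothetical sizes: 256-dim vectors in, 256-dim vector out.
for window in (1, 5, 21):
    print(window, fc_window_params(256, 256, window))
```

The count grows linearly with the window size, so covering a whole long sequence this way quickly becomes impractical.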

Two ways to obtain a

Obtaining Q, K, and V from a in batch
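
As a rough sketch of the batched step, all input vectors can be stacked as rows of one matrix and multiplied by three projection matrices. The names `A`, `W_q`, `W_k`, `W_v` and the sizes are assumptions for illustration; in a real model the projections are learned, here they are random.

```python
import numpy as np

rng = np.random.default_rng(0)
seq_len, d_model, d_k = 4, 8, 8            # hypothetical sizes

A = rng.normal(size=(seq_len, d_model))    # row i is the input vector a^i
W_q = rng.normal(size=(d_model, d_k))      # random stand-ins for learned weights
W_k = rng.normal(size=(d_model, d_k))
W_v = rng.normal(size=(d_model, d_k))

# One matrix product per projection handles every position at once.
Q = A @ W_q
K = A @ W_k
V = A @ W_v
```

Row i of `Q` is exactly the query that would be obtained by projecting a^i on its own, so nothing is lost by batching.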

Obtaining the attention score
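
One common way to turn queries and keys into attention scores is a dot product followed by a softmax. The sketch below assumes that choice and also divides by sqrt(d_k), which is the scaling used in the Transformer; the notes do not say which variant the lecture uses.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax along `axis`."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention_scores(Q, K):
    """Row i holds the normalized scores of query i against every key."""
    d_k = Q.shape[-1]
    return softmax(Q @ K.T / np.sqrt(d_k))   # assumed: scaled dot product

rng = np.random.default_rng(0)
Q = rng.normal(size=(4, 8))
K = rng.normal(size=(4, 8))
alpha = attention_scores(Q, K)               # shape (4, 4), rows sum to 1
```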

Obtaining the meaning at the current position

The full self-attention process from input to output
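
Putting the steps together, a single-head self-attention layer can be sketched as below. This assumes scaled dot-product scores and random stand-ins for the learned matrices; `self_attention` is a name chosen for this example.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(A, W_q, W_k, W_v):
    """A: (seq_len, d_model). Returns outputs B and the attention matrix."""
    Q, K, V = A @ W_q, A @ W_k, A @ W_v             # project every position
    alpha = softmax(Q @ K.T / np.sqrt(Q.shape[-1]))  # (seq_len, seq_len)
    B = alpha @ V                                    # weighted sum of values
    return B, alpha

rng = np.random.default_rng(0)
A = rng.normal(size=(5, 8))
W_q, W_k, W_v = (rng.normal(size=(8, 4)) for _ in range(3))
B, alpha = self_attention(A, W_q, W_k, W_v)
```

Every output row depends on every input row, and all rows are computed by the same few matrix products.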

Multi-head Self-attention

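
A minimal multi-head sketch, under the assumption that the projected vectors are split evenly into `n_heads` chunks, each head attends independently, and the heads are concatenated and mixed by an output matrix `W_o`. All names and sizes are illustrative.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_self_attention(A, W_q, W_k, W_v, W_o, n_heads):
    seq_len, d_model = A.shape
    d_head = d_model // n_heads                      # assumes an even split

    def split(X):
        # (seq_len, d_model) -> (n_heads, seq_len, d_head)
        return X.reshape(seq_len, n_heads, d_head).transpose(1, 0, 2)

    Q, K, V = split(A @ W_q), split(A @ W_k), split(A @ W_v)
    scores = Q @ K.transpose(0, 2, 1) / np.sqrt(d_head)
    heads = softmax(scores) @ V                      # each head on its own
    concat = heads.transpose(1, 0, 2).reshape(seq_len, d_model)
    return concat @ W_o                              # mix the heads

rng = np.random.default_rng(0)
A = rng.normal(size=(5, 8))
W_q, W_k, W_v, W_o = (rng.normal(size=(8, 8)) for _ in range(4))
out = multi_head_self_attention(A, W_q, W_k, W_v, W_o, n_heads=2)
```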

Representing the positional information of a
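
The notes do not say which positional representation is meant. As one possibility, the sketch below uses the fixed sinusoidal encoding from the original Transformer, added to each input vector; this choice is an assumption, and learned position vectors are another option.

```python
import numpy as np

def sinusoidal_positions(seq_len, d_model):
    """Fixed sin/cos position vectors; assumes d_model is even."""
    pos = np.arange(seq_len)[:, None]            # (seq_len, 1)
    i = np.arange(0, d_model, 2)[None, :]        # even dimension indices
    angle = pos / (10000 ** (i / d_model))
    E = np.zeros((seq_len, d_model))
    E[:, 0::2] = np.sin(angle)                   # even dimensions
    E[:, 1::2] = np.cos(angle)                   # odd dimensions
    return E

E = sinusoidal_positions(6, 8)
rng = np.random.default_rng(0)
A = rng.normal(size=(6, 8))
A_with_pos = A + E                               # position vector e^i added to a^i
```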

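The attention matrix is an I×I matrix of pairwise relevance, so it uses a lot of memory. One remedy is to consider only a window of nearby positions before and after each one.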
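
The windowing idea can be sketched with a mask that only lets each position attend to neighbors within `window` steps. For clarity this version still builds the full I×I score matrix and then masks it; an implementation that actually saves memory would compute only the band. Names and sizes are assumptions.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def local_attention(Q, K, V, window):
    n = Q.shape[0]
    idx = np.arange(n)
    allowed = np.abs(idx[:, None] - idx[None, :]) <= window
    scores = Q @ K.T / np.sqrt(Q.shape[-1])
    scores = np.where(allowed, scores, -np.inf)   # far positions get weight 0
    alpha = softmax(scores)
    return alpha @ V, alpha

rng = np.random.default_rng(0)
Q, K, V = (rng.normal(size=(6, 4)) for _ in range(3))
B, alpha = local_attention(Q, K, V, window=1)
```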

CNN is simplified self-attention.

I did not understand the figure for this point.

SA vs RNN

What does each vector take into account? A bidirectional RNN can also consider the whole sequence.

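The biggest difference is that in an RNN, memory from one end of the sequence has trouble reaching the other end, whereas SA only needs to adjust the attention weight on the far position.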

An RNN cannot compute all of its outputs in parallel, but SA can.

When applied to a graph, there is no need to compute attention scores; the edge weights are used directly. This is one form of Graph Neural Network.
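
A small sketch of that idea: the adjacency matrix plays the role of the attention matrix, so each node's output is a weighted sum over its connected nodes only. Normalizing each row to sum to 1 and including self-loops are assumptions made here so the weights behave like attention weights.

```python
import numpy as np

def graph_aggregate(adj, V):
    """adj: (n, n) non-negative edge weights; every row needs a nonzero entry."""
    weights = adj / adj.sum(axis=1, keepdims=True)   # assumed row normalization
    return weights @ V

# Hypothetical 3-node chain 0 - 1 - 2 with self-loops.
adj = np.array([[1.0, 1.0, 0.0],
                [1.0, 1.0, 1.0],
                [0.0, 1.0, 1.0]])
V = np.eye(3)                                        # one-hot node features
out = graph_aggregate(adj, V)
```

Node 0 has no edge to node 2, so node 2 contributes nothing to its output.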

SA and the Transformer were introduced together, so the two names are often used interchangeably. Later models that use SA are mostly named xxformer.

Transformer

The Transformer is a sequence-to-sequence model in which the output length is determined by the model itself.

Encoder

Given a row of input vectors, the encoder produces a row of output vectors.

Each block does the work of several layers.

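In the Transformer, unlike plain SA, a residual connection is added after the SA layer: the input a is added to the SA output b. Layer normalization is then applied to a+b. This is not batch normalization, which normalizes the same dimension across different features/examples (combining information from many of them) and therefore depends on batch-wide statistics. Layer normalization instead computes the mean m and the standard deviation std across the dimensions of a single feature: each token's embedding vector gets its own mean and standard deviation, so the normalization is local to the current token.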

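Feature: in sequence data, the embedding vector of one token (or word). If the embedding dimension is d, one token is a vector of d numbers.

Dimension: one coordinate of the embedding vector. For example, a vector with d=512 has a 1st dimension, a 2nd dimension, and so on up to the 512th.

Example: usually each token in a sequence, or a different input sequence (sample).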
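
The difference between the two normalizations comes down to the axis the statistics are taken over. The sketch below stores tokens as rows and dimensions as columns, and leaves out the learnable gain and bias that real implementations usually add.

```python
import numpy as np

def layer_norm(X, eps=1e-5):
    """Normalize each row (one token) over its own dimensions."""
    m = X.mean(axis=-1, keepdims=True)
    std = X.std(axis=-1, keepdims=True)
    return (X - m) / (std + eps)

def batch_norm(X, eps=1e-5):
    """Normalize each column (one dimension) over all tokens/examples."""
    m = X.mean(axis=0, keepdims=True)
    std = X.std(axis=0, keepdims=True)
    return (X - m) / (std + eps)

rng = np.random.default_rng(0)
X = rng.normal(loc=3.0, scale=2.0, size=(6, 8))   # 6 tokens, 8 dimensions
ln = layer_norm(X)   # every row has mean ~0
bn = batch_norm(X)   # every column has mean ~0
```

Changing one token changes the `batch_norm` result for every other token, while `layer_norm` of the other tokens is unaffected.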

The normalized output then passes through an FC network with another residual connection and is normalized once more, which gives the output of the encoder block.
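
The block described above can be sketched end to end as follows. It assumes single-head attention, a two-layer FC network with ReLU, and layer normalization without learnable gain or bias; the parameter names in `params` are made up for this example.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def layer_norm(X, eps=1e-5):
    m = X.mean(axis=-1, keepdims=True)
    std = X.std(axis=-1, keepdims=True)
    return (X - m) / (std + eps)

def encoder_block(A, p):
    Q, K, V = A @ p["W_q"], A @ p["W_k"], A @ p["W_v"]
    B = softmax(Q @ K.T / np.sqrt(Q.shape[-1])) @ V   # self-attention output b
    H = layer_norm(A + B)                             # residual a+b, then norm
    F = np.maximum(0.0, H @ p["W_1"]) @ p["W_2"]      # FC network (assumed ReLU)
    return layer_norm(H + F)                          # residual, then norm again

rng = np.random.default_rng(0)
d_model, d_ff = 8, 16
shapes = {"W_q": (d_model, d_model), "W_k": (d_model, d_model),
          "W_v": (d_model, d_model), "W_1": (d_model, d_ff),
          "W_2": (d_ff, d_model)}
params = {name: 0.1 * rng.normal(size=s) for name, s in shapes.items()}
A = rng.normal(size=(5, d_model))
out = encoder_block(A, params)
```

Because the output has the same shape as the input, blocks like this can be stacked one after another.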

Overall structure of the Transformer encoder

Decoder

A start vector is fed in first, and the outputs are then produced one at a time.

The output of the previous step is used as the new input.
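
The feed-back loop can be sketched without any real model. Here `toy_next_token` is a hypothetical stand-in for the decoder that always spells out a fixed sequence, and the `BEGIN`/`END` strings are placeholder tokens chosen for this example.

```python
BEGIN, END = "<BEGIN>", "<END>"

def toy_next_token(prefix):
    """Hypothetical decoder stand-in: emits a fixed sequence, then END."""
    target = ["ma", "chine", "learn", "ing"]
    step = len(prefix) - 1                 # tokens produced so far
    return target[step] if step < len(target) else END

def greedy_decode(next_token, max_len=20):
    tokens = [BEGIN]                       # start with the start token
    while len(tokens) - 1 < max_len:
        tok = next_token(tokens)           # decoder sees everything so far
        if tok == END:
            break
        tokens.append(tok)                 # previous output becomes new input
    return tokens[1:]

print(greedy_decode(toy_next_token))
```

The `max_len` cap is a safeguard in case the stand-in never emits END.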

The decoder in the Transformer

Internal differences between the encoder and the decoder

masked

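b2 only takes into account the vectors of a with index less than or equal to 2. Think of it as hiding what comes later: the encoder first reads the whole input to get the meaning of each word, and the decoder then works from front to back, using only the earlier positions to decide each later output.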
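
A minimal sketch of the masking, assuming scaled dot-product scores: positions after the current one get a score of minus infinity before the softmax, so their attention weight is exactly zero.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def masked_self_attention(Q, K, V):
    n = Q.shape[0]
    future = np.triu(np.ones((n, n), dtype=bool), k=1)   # True where j > i
    scores = Q @ K.T / np.sqrt(Q.shape[-1])
    scores = np.where(future, -np.inf, scores)           # hide later positions
    alpha = softmax(scores)
    return alpha @ V, alpha

rng = np.random.default_rng(0)
Q, K, V = (rng.normal(size=(4, 3)) for _ in range(3))
B, alpha = masked_self_attention(Q, K, V)
```

The first output can only attend to the first position, so `B[0]` equals `V[0]`, and the second output mixes only the first two value vectors.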

Autoregressive (AT): an END token is added to mark where the output stops. It may use the same symbol as BEGIN.

Non-autoregressive (NAT): all outputs are produced in one pass.

Passing information through cross attention

Take k and v from the encoder and q from the decoder.
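
In code the only change from self-attention is where the three projections read from. The sketch assumes scaled dot-product scores and random stand-ins for learned weights; `dec_states` and `enc_outputs` are names chosen here.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(dec_states, enc_outputs, W_q, W_k, W_v):
    Q = dec_states @ W_q        # queries come from the decoder
    K = enc_outputs @ W_k       # keys come from the encoder
    V = enc_outputs @ W_v       # values come from the encoder
    alpha = softmax(Q @ K.T / np.sqrt(Q.shape[-1]))
    return alpha @ V, alpha

rng = np.random.default_rng(0)
enc_outputs = rng.normal(size=(7, 8))    # 7 encoder positions
dec_states = rng.normal(size=(3, 8))     # 3 decoder positions so far
W_q, W_k, W_v = (rng.normal(size=(8, 8)) for _ in range(3))
out, alpha = cross_attention(dec_states, enc_outputs, W_q, W_k, W_v)
```

The attention matrix is rectangular: one row per decoder position, one column per encoder position.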

Different ways of connecting the information

Training

Cross entropy is used as the loss.
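
As a rough sketch, the loss for one output sequence can be computed as the average cross entropy between the predicted distribution at each position and the index of the correct token. The vocabulary size and the `logits` values below are made up.

```python
import numpy as np

def cross_entropy(logits, targets):
    """logits: (positions, vocab); targets: correct token index per position."""
    logits = logits - logits.max(axis=-1, keepdims=True)           # stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=-1, keepdims=True))
    return -log_probs[np.arange(len(targets)), targets].mean()

targets = np.array([0, 1, 2])
uniform = np.zeros((3, 4))               # no preference among 4 tokens
confident = np.zeros((3, 4))
confident[np.arange(3), targets] = 5.0   # high score on the correct token

loss_uniform = cross_entropy(uniform, targets)       # equals log(4)
loss_confident = cross_entropy(confident, targets)   # smaller
```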

At test time there is no ground truth, so how is the loss judged?

Some tips for training seq2seq models

Use other words as instructions to extract from the source text; copy the important parts of the source text as the summary.

Manually specify what to attend to and in what order.

Author:Zoet
URL:https://www.zoet.site//safety/Self-Attention%20Intro
Copyright:All articles in this blog, except for special statements, adopt BY-NC-SA agreement. Please indicate the source!

Tags: Notes