modelzoo.common.pytorch.layers.MultiheadAttention
import path: modelzoo.common.pytorch.layers.MultiheadAttention
MultiheadAttention (embed_dim, num_heads, dropout=0.0, bias=True, add_bias_kv=False, add_zero_attn=False, kdim=None, vdim=None, batch_first=False, device=None, dtype=None)
embed_dim – Total dimension of the model.
num_heads – Number of parallel attention heads. Note that embed_dim will be split across num_heads (i.e. each head will have dimension embed_dim // num_heads).
dropout – Dropout probability on attn_output_weights. Default: 0.0 (no dropout).
batch_first – If True, then the input and output tensors are provided as (batch, seq, feature). Default: False (seq, batch, feature). We only support batch_first = True now.
bias – If specified, adds bias to input / output projection layers. Default: True. Replaced with use_projection_bias and use_ffn_bias.
add_bias_kv – If specified, adds bias to the key and value sequences at dim=0. Default: False.
add_zero_attn – If specified, adds a new batch of zeros to the key and value sequences at dim=1. Default: False.
kdim – Total number of features for keys. Default: None (uses kdim=embed_dim).
vdim – Total number of features for values. Default: None (uses vdim=embed_dim).
use_projection_bias – Whether to use bias in the key, query, and value projections.
use_ffn_bias – Whether to use bias in the output projection.
attention_initializer – Projection kernel initializer. Defaults to xavier_uniform.
output_layer_initializer – If not None, use this initializer for the output transform layer. Defaults to None.
attention_type – The attention variant to execute. Currently accepts dot_product and scaled_dot_product. Defaults to scaled_dot_product.
device – The device to use for model parameters.
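A minimal construction sketch based on the parameters documented above. The concrete values (embed_dim=512, num_heads=8, dropout=0.1) are illustrative assumptions, and the import assumes the Cerebras ModelZoo package is available on your Python path.

from modelzoo.common.pytorch.layers import MultiheadAttention

# Illustrative configuration: 512 model dims split across 8 heads (64 dims per head).
attn = MultiheadAttention(
    embed_dim=512,
    num_heads=8,
    dropout=0.1,                          # dropout applied to attn_output_weights
    batch_first=True,                     # only batch_first=True is supported
    use_projection_bias=True,             # bias on the query/key/value projections
    use_ffn_bias=True,                    # bias on the output projection
    attention_type="scaled_dot_product",  # alternative: "dot_product"
)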
forward (query, key, value, key_padding_mask=None, need_weights=True, attn_mask=None, average_attn_weights=True, position_bias=None, rotary_position_embedding_helper=None)
query (Tensor): Queries, shape [batch_size, seq_length, embed_dim].
key (Tensor): Keys, shape [batch_size, seq_length, embed_dim].
value (Tensor): Values, shape [batch_size, seq_length, embed_dim].
attn_mask (Tensor): Attention mask. Can be 2D of shape [batch_size, seq_length], or 3D of shape [batch_size, query_length, seq_length].
key_padding_mask (Tensor): If specified, a mask of shape [batch_size, seq_length] indicating which elements within key to ignore for the purpose of attention (i.e. treat as "padding"). Defaults to None.
need_weights (bool): If specified, returns attn_output_weights in addition to attn_outputs. Default: False.
average_attn_weights (bool): If True, indicates that the returned attn_weights should be averaged across heads. Otherwise, attn_weights are provided separately per head. Note that this flag only has an effect when need_weights=True. Default: True (i.e. average weights across heads).
position_bias (Tensor): Tensor containing position bias to apply in attention.
rotary_position_embedding_helper (RotaryPositionEmbeddingHelper): Helper to create rotary embeddings according to the paper "RoFormer: Enhanced Transformer with Rotary Position Embedding".
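A minimal self-attention call sketch, assuming batch-first inputs of shape [batch_size, seq_length, embed_dim] as documented above. The tensor sizes and the boolean padding-mask convention are illustrative assumptions rather than details taken from the source; the query, key, and value tensors are passed positionally so the sketch does not depend on the exact parameter names of forward.

import torch
from modelzoo.common.pytorch.layers import MultiheadAttention

attn = MultiheadAttention(embed_dim=512, num_heads=8, batch_first=True)

batch_size, seq_length, embed_dim = 2, 16, 512
x = torch.rand(batch_size, seq_length, embed_dim)

# Boolean mask marking the last four positions of every sequence as padding.
# The expected mask dtype/convention is an assumption here; verify it against
# the modelzoo implementation.
key_padding_mask = torch.zeros(batch_size, seq_length, dtype=torch.bool)
key_padding_mask[:, -4:] = True

# Self-attention: queries, keys, and values are the same tensor.
out = attn(x, x, x, key_padding_mask=key_padding_mask)
# Depending on need_weights, the result is the attention output alone or a
# tuple that also includes attn_output_weights.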