tf.layers.AttentionLayer module
- class tf.layers.AttentionLayer.AttentionLayer(*args: Any, **kwargs: Any)
Bases: `modelzoo.common.tf.layers.BaseLayer.BaseLayer`
Multi-head attention layer. Based on the MLCommons model.
- Parameters
hidden_size (int) – Number of units in each projection output.
num_heads (int) – Number of attention heads.
use_projection_bias (bool) – Whether to use bias in the key, query, and value projections.
use_ffn_bias (bool) – Whether to use bias in the output projection.
initializer (str) – Projection kernel initializer. Defaults to `glorot_uniform`.
query_layer_initializer (initializer) – Query kernel initializer. Defaults to None, in which case `initializer` is used.
key_layer_initializer (initializer) – Key kernel initializer. Defaults to None, in which case `initializer` is used.
value_layer_initializer (initializer) – Value kernel initializer. Defaults to None, in which case `initializer` is used.
relative_attention_bias_weight_initializer (initializer) – Relative attention bias weight initializer. Defaults to None, in which case `initializer` is used.
output_layer_initializer (str or initializer) – If not None, use this initializer for the output transform layer. Defaults to None.
kernel_regularizer (Optional[Callable]) – Projection kernel regularizer. Defaults to None.
bias_regularizer (Optional[Callable]) – Projection bias regularizer. Defaults to None.
attention_type (str) – The attention variant to execute. Currently accepts `dot_product` and `scaled_dot_product`. Defaults to `scaled_dot_product`.
dropout_rate (float) – Dropout rate for the key-query weights. Defaults to 0.0.
dropout_seed (int) – Seed with which to initialize the dropout layer. Defaults to None.
use_relative_attention_bias (bool) – Whether to use relative position bias when calculating attention.
relative_attention_bias (Tensor) – Tensor with relative attention weights. Shape: `[num_relative_attention_buckets, num_heads]`. Defaults to None.
num_relative_attention_buckets (int) – Number of buckets used to calculate the relative position bias when `use_relative_attention_bias` is set to True.
bidirectional_relative_attention (bool) – Whether attention is bidirectional.
boundary_casting (bool) – If True, outputs the values in half precision and casts the input values up to full precision.
tf_summary (bool) – If True, saves the activations with `summary_layer`.
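For illustration, a minimal construction sketch is shown below. It assumes the class is importable as `modelzoo.common.tf.layers.AttentionLayer.AttentionLayer` (inferred from the Bases path above, not confirmed by this page), and the parameter values are illustrative only:

```python
import tensorflow as tf

# Assumed import path, inferred from the documented Bases
# (modelzoo.common.tf.layers.BaseLayer.BaseLayer).
from modelzoo.common.tf.layers.AttentionLayer import AttentionLayer

attn = AttentionLayer(
    hidden_size=512,                       # units in each projection output
    num_heads=8,                           # hidden_size should divide evenly by num_heads
    use_projection_bias=True,              # bias in key/query/value projections
    use_ffn_bias=True,                     # bias in the output projection
    initializer="glorot_uniform",          # the documented default
    attention_type="scaled_dot_product",   # or "dot_product"
    dropout_rate=0.1,                      # dropout on key-query weights
)
```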
- build(input_shape)
- call(q, v, mask=None, past_kv=None, cache_present_kv=False, training=True, position_bias=None, cache_position_bias=False)
Applies the attention mechanism to queries `q` and values `v`. Keys are set to be the same as `v`.
- Parameters
q (Tensor) – Queries, shape `[batch_size, seq_length, hidden_size]`.
v (Tensor) – Values, shape `[batch_size, seq_length, hidden_size]`.
mask (Tensor) – Attention mask. Can be 2D of shape `[batch_size, seq_length]`, or 3D of shape `[batch_size, query_length, seq_length]`.
past_kv (Tensor) – Past keys and values. Has shape `[2, batch_size, num_heads, seq_length, hidden_size / num_heads]`. The tensors in `[0,:,:,:,:]` and `[1,:,:,:,:]` contain the past keys and values, respectively. Defaults to None.
cache_present_kv (bool) – Specifies if the present keys and values must be cached and returned. Needed to speed up the computations when the decoder is called within an autoregressive loop. Defaults to False.
training (bool) – Whether the model is training. Needed to call `dropout` (after softmax) in the appropriate mode.
position_bias (Tensor) – Tensor containing the position bias to apply in attention.
cache_position_bias (bool) – Specifies if the position bias must be cached and returned. Needed to speed up the computations when the decoder is called within an autoregressive loop. Defaults to False.
- Returns
When `cache_present_kv` and `cache_position_bias` are both True, returns a tuple in which the 0th entry contains the attention output, the 1st entry contains a tensor of the keys and values computed at the current application of the attention layer, and the 2nd entry contains a tensor of the position bias computed at the current application of the attention layer. If `cache_present_kv` is False, no entry for present keys and values is provided. If `cache_position_bias` is False, no entry for position bias is provided. If both `cache_present_kv` and `cache_position_bias` are set to False, only the attention output tensor is returned.
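A hedged usage sketch of `call` follows, reusing the `attn` layer from the construction sketch above. The shapes are illustrative, and the return-value unpacking follows the conventions documented under Returns:

```python
batch_size, seq_length, hidden_size = 2, 16, 512  # illustrative shapes

q = tf.random.uniform([batch_size, seq_length, hidden_size])
v = tf.random.uniform([batch_size, seq_length, hidden_size])
mask = tf.ones([batch_size, seq_length])  # 2D mask: [batch_size, seq_length]

# Both cache flags default to False, so only the attention output is returned.
out = attn(q, v, mask=mask, training=True)

# Cache the present keys/values so the next step of an autoregressive
# loop can reuse them as `past_kv`; cache_position_bias stays False,
# so the result is a 2-tuple per the documented return convention.
out, present_kv = attn(q, v, mask=mask, cache_present_kv=True, training=False)
next_out = attn(q, v, past_kv=present_kv, training=False)
```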
- class tf.layers.AttentionLayer.SelfAttentionLayer(*args: Any, **kwargs: Any)
Bases: `tf.layers.AttentionLayer.AttentionLayer`
Multi-head self-attention layer.
- call(x, mask=None, past_kv=None, cache_present_kv=False, training=True, position_bias=None, cache_position_bias=False)
Applies the self-attention mechanism to the input `x`, using `x` as the queries; keys and values are set to be the same as the queries.
- Parameters
x (Tensor) – Input tensor, shape `[batch_size, seq_length, hidden_size]`, used as queries, keys, and values.
mask (Tensor) – Attention mask. Can be 2D of shape `[batch_size, seq_length]`, or 3D of shape `[batch_size, query_length, seq_length]`.
past_kv (Tensor) – Past keys and values. Has shape `[2, batch_size, num_heads, seq_length, hidden_size / num_heads]`. The tensors in `[0,:,:,:,:]` and `[1,:,:,:,:]` contain the past keys and values, respectively. Defaults to None.
cache_present_kv (bool) – Specifies if the present keys and values must be cached and returned. Needed to speed up the computations when the decoder is called within an autoregressive loop. Defaults to False.
training (bool) – Whether the model is training. Needed to call `dropout` (after softmax) in the appropriate mode.
position_bias (Tensor) – Tensor containing the position bias to apply in attention.
cache_position_bias (bool) – Specifies if the position bias must be cached and returned. Needed to speed up the computations when the decoder is called within an autoregressive loop. Defaults to False.
- Returns
When `cache_present_kv` and `cache_position_bias` are both True, returns a tuple in which the 0th entry contains the attention output, the 1st entry contains a tensor of the keys and values computed at the current application of the attention layer, and the 2nd entry contains a tensor of the position bias computed at the current application of the attention layer. If `cache_present_kv` is False, no entry for present keys and values is provided. If `cache_position_bias` is False, no entry for position bias is provided. If both `cache_present_kv` and `cache_position_bias` are set to False, only the attention output tensor is returned.
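A similar hedged sketch for `SelfAttentionLayer`, again assuming the import path inferred above; a single input `x` serves as queries, keys, and values:

```python
# Assumed import path, inferred from the documented Bases.
from modelzoo.common.tf.layers.AttentionLayer import SelfAttentionLayer

self_attn = SelfAttentionLayer(hidden_size=512, num_heads=8)

x = tf.random.uniform([2, 16, 512])  # [batch_size, seq_length, hidden_size]
y = self_attn(x, training=True)      # attention output with the same shape
```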