Model outputs
All models have outputs that are instances of subclasses of ModelOutput. Those are data structures containing all the information returned by the model, but that can also be used as tuples or dictionaries.
Let's see how this looks in an example:
from transformers import BertTokenizer, BertForSequenceClassification
import torch
# Load the pretrained tokenizer and the sequence classification model.
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertForSequenceClassification.from_pretrained('bert-base-uncased')
inputs = tokenizer("Hello, my dog is cute", return_tensors="pt")
labels = torch.tensor([1]).unsqueeze(0)  # Batch size 1
# Passing labels makes the model compute and return the loss.
outputs = model(**inputs, labels=labels)
The outputs object is a SequenceClassifierOutput. As we can see in the documentation of that class below, it has an optional loss, a logits, an optional hidden_states and an optional attentions attribute. Here we have the loss since we passed along labels, but we don't have hidden_states and attentions because we didn't pass output_hidden_states=True or output_attentions=True.
You can access each attribute as you would usually do, and if that attribute has not been returned by the model, you will get None. Here for instance outputs.loss is the loss computed by the model, and outputs.attentions is None.
When considering our outputs object as a tuple, it only considers the attributes that don't have None values. Here for instance, it has two elements, loss then logits, so outputs[:2] will return the tuple (outputs.loss, outputs.logits).
When considering our outputs object as a dictionary, it only considers the attributes that don't have None values. Here for instance, it has two keys that are loss and logits.
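The None-skipping behavior described above can be sketched without the library. The ToyOutput class below is a hypothetical stand-in written only for illustration (it is not the ModelOutput implementation, and its to_dict helper is an invented name); it shows how attribute access, tuple-style indexing and dictionary-style indexing can coexist.

```python
from dataclasses import dataclass, fields
from typing import Any, Optional


@dataclass
class ToyOutput:
    """Hypothetical stand-in for a model output; not the library class."""

    loss: Optional[float] = None
    logits: Optional[list] = None
    hidden_states: Optional[tuple] = None
    attentions: Optional[tuple] = None

    def to_dict(self) -> dict:
        # Keep only the attributes that were actually returned (not None).
        return {f.name: getattr(self, f.name)
                for f in fields(self) if getattr(self, f.name) is not None}

    def to_tuple(self) -> tuple:
        # Same filtering, but positional: declaration order is preserved.
        return tuple(self.to_dict().values())

    def __getitem__(self, key: Any) -> Any:
        # Strings index like a dictionary, integers and slices like a tuple.
        if isinstance(key, str):
            return self.to_dict()[key]
        return self.to_tuple()[key]


outputs = ToyOutput(loss=0.7, logits=[[0.2, -0.1]])
print(outputs.attentions)       # None: attribute access still works
print(outputs[:2])              # (0.7, [[0.2, -0.1]])
print(list(outputs.to_dict()))  # ['loss', 'logits']
```

With the real classes the same three access styles apply to the outputs object from the example above.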
We document here the generic model outputs that are used by more than one model type. Specific output types are documented on their corresponding model page.
ModelOutput

class transformers.file_utils.ModelOutput
Base class for all model outputs as dataclass. Has a __getitem__ that allows indexing by integer or slice (like a tuple) or strings (like a dictionary) that will ignore the None attributes. Otherwise behaves like a regular Python dictionary.
Warning
You can't unpack a ModelOutput directly. Use the to_tuple() method to convert it to a tuple before.
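One plausible reading of this warning, given that the class otherwise behaves like a regular dictionary, is that unpacking iterates over keys rather than values. The sketch below uses a plain OrderedDict as a hypothetical stand-in for an output object, and tuple(outputs.values()) stands in for to_tuple(); both substitutions are assumptions made for illustration.

```python
from collections import OrderedDict

# Hypothetical stand-in: an ordered dictionary holding the non-None fields.
outputs = OrderedDict(loss=0.7, logits=[[0.2, -0.1]])

# Direct unpacking iterates over the keys, not the values.
first, second = outputs
print(first, second)  # loss logits

# Converting to a tuple of values first gives the intended result.
loss, logits = tuple(outputs.values())
print(loss, logits)  # 0.7 [[0.2, -0.1]]
```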
BaseModelOutput

class transformers.modeling_outputs.BaseModelOutput(last_hidden_state: torch.FloatTensor = None, hidden_states: Optional[Tuple[torch.FloatTensor]] = None, attentions: Optional[Tuple[torch.FloatTensor]] = None)
Base class for model's outputs, with potential hidden states and attentions.
Parameters
last_hidden_state (torch.FloatTensor of shape (batch_size, sequence_length, hidden_size)) – Sequence of hidden-states at the output of the last layer of the model.
hidden_states (tuple(torch.FloatTensor), optional, returned when output_hidden_states=True is passed or when config.output_hidden_states=True) – Tuple of torch.FloatTensor (one for the output of the embeddings + one for the output of each layer) of shape (batch_size, sequence_length, hidden_size). Hidden-states of the model at the output of each layer plus the initial embedding outputs.
attentions (tuple(torch.FloatTensor), optional, returned when output_attentions=True is passed or when config.output_attentions=True) – Tuple of torch.FloatTensor (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length). Attention weights after the attention softmax, used to compute the weighted average in the self-attention heads.
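The shapes listed above follow directly from a handful of sizes, which can be handy for sanity checks. Here is a minimal sketch in plain Python; the helper name expected_shapes and the example sizes are illustrative assumptions, not part of the library.

```python
def expected_shapes(batch_size, sequence_length, hidden_size, num_layers, num_heads):
    """Return the documented shapes for each BaseModelOutput field."""
    hidden = (batch_size, sequence_length, hidden_size)
    attention = (batch_size, num_heads, sequence_length, sequence_length)
    return {
        "last_hidden_state": hidden,
        # One entry for the embedding output plus one per layer.
        "hidden_states": (hidden,) * (num_layers + 1),
        # One entry per layer.
        "attentions": (attention,) * num_layers,
    }


# Illustrative sizes only; treat them as assumptions.
shapes = expected_shapes(batch_size=1, sequence_length=8, hidden_size=768,
                         num_layers=12, num_heads=12)
print(len(shapes["hidden_states"]))  # 13
print(shapes["attentions"][0])       # (1, 12, 8, 8)
```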
BaseModelOutputWithPooling

class transformers.modeling_outputs.BaseModelOutputWithPooling(last_hidden_state: torch.FloatTensor = None, pooler_output: torch.FloatTensor = None, hidden_states: Optional[Tuple[torch.FloatTensor]] = None, attentions: Optional[Tuple[torch.FloatTensor]] = None)
Base class for model's outputs that also contains a pooling of the last hidden states.
Parameters
last_hidden_state (torch.FloatTensor of shape (batch_size, sequence_length, hidden_size)) – Sequence of hidden-states at the output of the last layer of the model.
pooler_output (torch.FloatTensor of shape (batch_size, hidden_size)) – Last layer hidden-state of the first token of the sequence (classification token) after further processing through the layers used for the auxiliary pretraining task. E.g. for BERT-family of models, this returns the classification token after processing through a linear layer and a tanh activation function. The linear layer weights are trained from the next sentence prediction (classification) objective during pretraining.
hidden_states (tuple(torch.FloatTensor), optional, returned when output_hidden_states=True is passed or when config.output_hidden_states=True) – Tuple of torch.FloatTensor (one for the output of the embeddings + one for the output of each layer) of shape (batch_size, sequence_length, hidden_size). Hidden-states of the model at the output of each layer plus the initial embedding outputs.
attentions (tuple(torch.FloatTensor), optional, returned when output_attentions=True is passed or when config.output_attentions=True) – Tuple of torch.FloatTensor (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length). Attention weights after the attention softmax, used to compute the weighted average in the self-attention heads.
BaseModelOutputWithCrossAttentions

class transformers.modeling_outputs.BaseModelOutputWithCrossAttentions(last_hidden_state: torch.FloatTensor = None, hidden_states: Optional[Tuple[torch.FloatTensor]] = None, attentions: Optional[Tuple[torch.FloatTensor]] = None, cross_attentions: Optional[Tuple[torch.FloatTensor]] = None)
Base class for model's outputs, with potential hidden states and attentions.
Parameters
last_hidden_state (torch.FloatTensor of shape (batch_size, sequence_length, hidden_size)) – Sequence of hidden-states at the output of the last layer of the model.
hidden_states (tuple(torch.FloatTensor), optional, returned when output_hidden_states=True is passed or when config.output_hidden_states=True) – Tuple of torch.FloatTensor (one for the output of the embeddings + one for the output of each layer) of shape (batch_size, sequence_length, hidden_size). Hidden-states of the model at the output of each layer plus the initial embedding outputs.
attentions (tuple(torch.FloatTensor), optional, returned when output_attentions=True is passed or when config.output_attentions=True) – Tuple of torch.FloatTensor (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length). Attention weights after the attention softmax, used to compute the weighted average in the self-attention heads.
cross_attentions (tuple(torch.FloatTensor), optional, returned when output_attentions=True and config.add_cross_attention=True is passed or when config.output_attentions=True) – Tuple of torch.FloatTensor (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length). Attention weights of the decoder's cross-attention layer, after the attention softmax, used to compute the weighted average in the cross-attention heads.
BaseModelOutputWithPoolingAndCrossAttentions

class transformers.modeling_outputs.BaseModelOutputWithPoolingAndCrossAttentions(last_hidden_state: torch.FloatTensor = None, pooler_output: torch.FloatTensor = None, hidden_states: Optional[Tuple[torch.FloatTensor]] = None, past_key_values: Optional[Tuple[Tuple[torch.FloatTensor]]] = None, attentions: Optional[Tuple[torch.FloatTensor]] = None, cross_attentions: Optional[Tuple[torch.FloatTensor]] = None)
Base class for model's outputs that also contains a pooling of the last hidden states.
Parameters
last_hidden_state (torch.FloatTensor of shape (batch_size, sequence_length, hidden_size)) – Sequence of hidden-states at the output of the last layer of the model.
pooler_output (torch.FloatTensor of shape (batch_size, hidden_size)) – Last layer hidden-state of the first token of the sequence (classification token) after further processing through the layers used for the auxiliary pretraining task. E.g. for BERT-family of models, this returns the classification token after processing through a linear layer and a tanh activation function. The linear layer weights are trained from the next sentence prediction (classification) objective during pretraining.
hidden_states (tuple(torch.FloatTensor), optional, returned when output_hidden_states=True is passed or when config.output_hidden_states=True) – Tuple of torch.FloatTensor (one for the output of the embeddings + one for the output of each layer) of shape (batch_size, sequence_length, hidden_size). Hidden-states of the model at the output of each layer plus the initial embedding outputs.
attentions (tuple(torch.FloatTensor), optional, returned when output_attentions=True is passed or when config.output_attentions=True) – Tuple of torch.FloatTensor (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length). Attention weights after the attention softmax, used to compute the weighted average in the self-attention heads.
cross_attentions (tuple(torch.FloatTensor), optional, returned when output_attentions=True and config.add_cross_attention=True is passed or when config.output_attentions=True) – Tuple of torch.FloatTensor (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length). Attention weights of the decoder's cross-attention layer, after the attention softmax, used to compute the weighted average in the cross-attention heads.
past_key_values (tuple(tuple(torch.FloatTensor)), optional, returned when use_cache=True is passed or when config.use_cache=True) – Tuple of tuple(torch.FloatTensor) of length config.n_layers, with each tuple having 2 tensors of shape (batch_size, num_heads, sequence_length, embed_size_per_head) and, optionally if config.is_encoder_decoder=True, 2 additional tensors of shape (batch_size, num_heads, encoder_sequence_length, embed_size_per_head). Contains pre-computed hidden-states (key and values in the self-attention blocks and, optionally if config.is_encoder_decoder=True, in the cross-attention blocks) that can be used (see past_key_values input) to speed up sequential decoding.
BaseModelOutputWithPast

class transformers.modeling_outputs.BaseModelOutputWithPast(last_hidden_state: torch.FloatTensor = None, past_key_values: Optional[Tuple[Tuple[torch.FloatTensor]]] = None, hidden_states: Optional[Tuple[torch.FloatTensor]] = None, attentions: Optional[Tuple[torch.FloatTensor]] = None)
Base class for model's outputs that may also contain a past key/values (to speed up sequential decoding).
Parameters
last_hidden_state (torch.FloatTensor of shape (batch_size, sequence_length, hidden_size)) – Sequence of hidden-states at the output of the last layer of the model. If past_key_values is used, only the last hidden-state of the sequences of shape (batch_size, 1, hidden_size) is output.
past_key_values (tuple(tuple(torch.FloatTensor)), optional, returned when use_cache=True is passed or when config.use_cache=True) – Tuple of tuple(torch.FloatTensor) of length config.n_layers, with each tuple having 2 tensors of shape (batch_size, num_heads, sequence_length, embed_size_per_head) and, optionally if config.is_encoder_decoder=True, 2 additional tensors of shape (batch_size, num_heads, encoder_sequence_length, embed_size_per_head). Contains pre-computed hidden-states (key and values in the self-attention blocks and, optionally if config.is_encoder_decoder=True, in the cross-attention blocks) that can be used (see past_key_values input) to speed up sequential decoding.
hidden_states (tuple(torch.FloatTensor), optional, returned when output_hidden_states=True is passed or when config.output_hidden_states=True) – Tuple of torch.FloatTensor (one for the output of the embeddings + one for the output of each layer) of shape (batch_size, sequence_length, hidden_size). Hidden-states of the model at the output of each layer plus the initial embedding outputs.
attentions (tuple(torch.FloatTensor), optional, returned when output_attentions=True is passed or when config.output_attentions=True) – Tuple of torch.FloatTensor (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length). Attention weights after the attention softmax, used to compute the weighted average in the self-attention heads.
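To make the cache layout concrete, here is a rough sketch that only tracks shapes during two decoding steps of a decoder-only model. The helper names are hypothetical, and two points are assumptions rather than statements from this page: that embed_size_per_head equals hidden_size divided by num_heads, and that the cached sequence_length grows by one for each newly decoded token.

```python
def cache_shapes(batch_size, num_heads, sequence_length, embed_size_per_head, n_layers):
    """Shapes of past_key_values for a decoder-only model: (key, value) per layer."""
    kv = (batch_size, num_heads, sequence_length, embed_size_per_head)
    return tuple((kv, kv) for _ in range(n_layers))


def last_hidden_state_shape(batch_size, new_tokens, hidden_size):
    """Only the newly fed positions are returned when a cache is supplied."""
    return (batch_size, new_tokens, hidden_size)


# Illustrative sizes only.
hidden_size, num_heads, n_layers = 768, 12, 12
embed_size_per_head = hidden_size // num_heads  # assumption: heads split hidden_size evenly

# Step 1: feed a 5-token prompt; the cache covers 5 positions.
past = cache_shapes(1, num_heads, 5, embed_size_per_head, n_layers)
# Step 2: feed one new token with the cache; assumed to grow by one position.
past = cache_shapes(1, num_heads, 5 + 1, embed_size_per_head, n_layers)

print(len(past), past[0][0])                       # 12 (1, 12, 6, 64)
print(last_hidden_state_shape(1, 1, hidden_size))  # (1, 1, 768)
```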
BaseModelOutputWithPastAndCrossAttentions

class transformers.modeling_outputs.BaseModelOutputWithPastAndCrossAttentions(last_hidden_state: torch.FloatTensor = None, past_key_values: Optional[Tuple[Tuple[torch.FloatTensor]]] = None, hidden_states: Optional[Tuple[torch.FloatTensor]] = None, attentions: Optional[Tuple[torch.FloatTensor]] = None, cross_attentions: Optional[Tuple[torch.FloatTensor]] = None)
Base class for model's outputs that may also contain a past key/values (to speed up sequential decoding).
Parameters
last_hidden_state (torch.FloatTensor of shape (batch_size, sequence_length, hidden_size)) – Sequence of hidden-states at the output of the last layer of the model. If past_key_values is used, only the last hidden-state of the sequences of shape (batch_size, 1, hidden_size) is output.
past_key_values (tuple(tuple(torch.FloatTensor)), optional, returned when use_cache=True is passed or when config.use_cache=True) – Tuple of tuple(torch.FloatTensor) of length config.n_layers, with each tuple having 2 tensors of shape (batch_size, num_heads, sequence_length, embed_size_per_head) and, optionally if config.is_encoder_decoder=True, 2 additional tensors of shape (batch_size, num_heads, encoder_sequence_length, embed_size_per_head). Contains pre-computed hidden-states (key and values in the self-attention blocks and, optionally if config.is_encoder_decoder=True, in the cross-attention blocks) that can be used (see past_key_values input) to speed up sequential decoding.
hidden_states (tuple(torch.FloatTensor), optional, returned when output_hidden_states=True is passed or when config.output_hidden_states=True) – Tuple of torch.FloatTensor (one for the output of the embeddings + one for the output of each layer) of shape (batch_size, sequence_length, hidden_size). Hidden-states of the model at the output of each layer plus the initial embedding outputs.
attentions (tuple(torch.FloatTensor), optional, returned when output_attentions=True is passed or when config.output_attentions=True) – Tuple of torch.FloatTensor (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length). Attention weights after the attention softmax, used to compute the weighted average in the self-attention heads.
cross_attentions (tuple(torch.FloatTensor), optional, returned when output_attentions=True and config.add_cross_attention=True is passed or when config.output_attentions=True) – Tuple of torch.FloatTensor (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length). Attention weights of the decoder's cross-attention layer, after the attention softmax, used to compute the weighted average in the cross-attention heads.
Seq2SeqModelOutput

class transformers.modeling_outputs.Seq2SeqModelOutput(last_hidden_state: torch.FloatTensor = None, past_key_values: Optional[Tuple[Tuple[torch.FloatTensor]]] = None, decoder_hidden_states: Optional[Tuple[torch.FloatTensor]] = None, decoder_attentions: Optional[Tuple[torch.FloatTensor]] = None, cross_attentions: Optional[Tuple[torch.FloatTensor]] = None, encoder_last_hidden_state: Optional[torch.FloatTensor] = None, encoder_hidden_states: Optional[Tuple[torch.FloatTensor]] = None, encoder_attentions: Optional[Tuple[torch.FloatTensor]] = None)
Base class for model encoder's outputs that also contains pre-computed hidden states that can speed up sequential decoding.
Parameters
last_hidden_state (torch.FloatTensor of shape (batch_size, sequence_length, hidden_size)) – Sequence of hidden-states at the output of the last layer of the decoder of the model. If past_key_values is used, only the last hidden-state of the sequences of shape (batch_size, 1, hidden_size) is output.
past_key_values (tuple(tuple(torch.FloatTensor)), optional, returned when use_cache=True is passed or when config.use_cache=True) – Tuple of tuple(torch.FloatTensor) of length config.n_layers, with each tuple having 2 tensors of shape (batch_size, num_heads, sequence_length, embed_size_per_head) and 2 additional tensors of shape (batch_size, num_heads, encoder_sequence_length, embed_size_per_head). Contains pre-computed hidden-states (key and values in the self-attention blocks and in the cross-attention blocks) that can be used (see past_key_values input) to speed up sequential decoding.
decoder_hidden_states (tuple(torch.FloatTensor), optional, returned when output_hidden_states=True is passed or when config.output_hidden_states=True) – Tuple of torch.FloatTensor (one for the output of the embeddings + one for the output of each layer) of shape (batch_size, sequence_length, hidden_size). Hidden-states of the decoder at the output of each layer plus the initial embedding outputs.
decoder_attentions (tuple(torch.FloatTensor), optional, returned when output_attentions=True is passed or when config.output_attentions=True) – Tuple of torch.FloatTensor (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length). Attention weights of the decoder, after the attention softmax, used to compute the weighted average in the self-attention heads.
cross_attentions (tuple(torch.FloatTensor), optional, returned when output_attentions=True is passed or when config.output_attentions=True) – Tuple of torch.FloatTensor (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length). Attention weights of the decoder's cross-attention layer, after the attention softmax, used to compute the weighted average in the cross-attention heads.
encoder_last_hidden_state (torch.FloatTensor of shape (batch_size, sequence_length, hidden_size), optional) – Sequence of hidden-states at the output of the last layer of the encoder of the model.
encoder_hidden_states (tuple(torch.FloatTensor), optional, returned when output_hidden_states=True is passed or when config.output_hidden_states=True) – Tuple of torch.FloatTensor (one for the output of the embeddings + one for the output of each layer) of shape (batch_size, sequence_length, hidden_size). Hidden-states of the encoder at the output of each layer plus the initial embedding outputs.
encoder_attentions (tuple(torch.FloatTensor), optional, returned when output_attentions=True is passed or when config.output_attentions=True) – Tuple of torch.FloatTensor (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length). Attention weights of the encoder, after the attention softmax, used to compute the weighted average in the self-attention heads.
CausalLMOutput

class transformers.modeling_outputs.CausalLMOutput(loss: Optional[torch.FloatTensor] = None, logits: torch.FloatTensor = None, hidden_states: Optional[Tuple[torch.FloatTensor]] = None, attentions: Optional[Tuple[torch.FloatTensor]] = None)
Base class for causal language model (or autoregressive) outputs.
Parameters
loss (torch.FloatTensor of shape (1,), optional, returned when labels is provided) – Language modeling loss (for next-token prediction).
logits (torch.FloatTensor of shape (batch_size, sequence_length, config.vocab_size)) – Prediction scores of the language modeling head (scores for each vocabulary token before SoftMax).
hidden_states (tuple(torch.FloatTensor), optional, returned when output_hidden_states=True is passed or when config.output_hidden_states=True) – Tuple of torch.FloatTensor (one for the output of the embeddings + one for the output of each layer) of shape (batch_size, sequence_length, hidden_size). Hidden-states of the model at the output of each layer plus the initial embedding outputs.
attentions (tuple(torch.FloatTensor), optional, returned when output_attentions=True is passed or when config.output_attentions=True) – Tuple of torch.FloatTensor (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length). Attention weights after the attention softmax, used to compute the weighted average in the self-attention heads.
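Since logits are scores before SoftMax, turning them into probabilities is left to the caller. The sketch below does this in plain Python for a single position; the logits values and the 4-token vocabulary are made up, and using the negative log-probability as the next-token loss is an assumption about the usual cross-entropy formulation rather than something stated on this page.

```python
import math


def softmax(scores):
    """Turn raw prediction scores into probabilities that sum to 1."""
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical logits for the last position over a 4-token vocabulary.
last_position_logits = [1.0, 3.0, 0.5, 2.0]
probs = softmax(last_position_logits)

# Greedy choice: the token id with the highest probability.
next_token_id = max(range(len(probs)), key=probs.__getitem__)
print(next_token_id)  # 1

# Assumed loss for this position if the true next token were id 1.
loss = -math.log(probs[1])
```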
CausalLMOutputWithCrossAttentions

class transformers.modeling_outputs.CausalLMOutputWithCrossAttentions(loss: Optional[torch.FloatTensor] = None, logits: torch.FloatTensor = None, past_key_values: Optional[Tuple[Tuple[torch.FloatTensor]]] = None, hidden_states: Optional[Tuple[torch.FloatTensor]] = None, attentions: Optional[Tuple[torch.FloatTensor]] = None, cross_attentions: Optional[Tuple[torch.FloatTensor]] = None)
Base class for causal language model (or autoregressive) outputs.
Parameters
loss (torch.FloatTensor of shape (1,), optional, returned when labels is provided) – Language modeling loss (for next-token prediction).
logits (torch.FloatTensor of shape (batch_size, sequence_length, config.vocab_size)) – Prediction scores of the language modeling head (scores for each vocabulary token before SoftMax).
hidden_states (tuple(torch.FloatTensor), optional, returned when output_hidden_states=True is passed or when config.output_hidden_states=True) – Tuple of torch.FloatTensor (one for the output of the embeddings + one for the output of each layer) of shape (batch_size, sequence_length, hidden_size). Hidden-states of the model at the output of each layer plus the initial embedding outputs.
attentions (tuple(torch.FloatTensor), optional, returned when output_attentions=True is passed or when config.output_attentions=True) – Tuple of torch.FloatTensor (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length). Attention weights after the attention softmax, used to compute the weighted average in the self-attention heads.
cross_attentions (tuple(torch.FloatTensor), optional, returned when output_attentions=True is passed or when config.output_attentions=True) – Tuple of torch.FloatTensor (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length). Cross-attention weights after the attention softmax, used to compute the weighted average in the cross-attention heads.
past_key_values (tuple(tuple(torch.FloatTensor)), optional, returned when use_cache=True is passed or when config.use_cache=True) – Tuple of torch.FloatTensor tuples of length config.n_layers, with each tuple containing the cached key, value states of the self-attention and the cross-attention layers if model is used in encoder-decoder setting. Only relevant if config.is_decoder = True. Contains pre-computed hidden-states (key and values in the attention blocks) that can be used (see past_key_values input) to speed up sequential decoding.
CausalLMOutputWithPast

class transformers.modeling_outputs.CausalLMOutputWithPast(loss: Optional[torch.FloatTensor] = None, logits: torch.FloatTensor = None, past_key_values: Optional[Tuple[Tuple[torch.FloatTensor]]] = None, hidden_states: Optional[Tuple[torch.FloatTensor]] = None, attentions: Optional[Tuple[torch.FloatTensor]] = None)
Base class for causal language model (or autoregressive) outputs.
Parameters
loss (torch.FloatTensor of shape (1,), optional, returned when labels is provided) – Language modeling loss (for next-token prediction).
logits (torch.FloatTensor of shape (batch_size, sequence_length, config.vocab_size)) – Prediction scores of the language modeling head (scores for each vocabulary token before SoftMax).
past_key_values (tuple(tuple(torch.FloatTensor)), optional, returned when use_cache=True is passed or when config.use_cache=True) – Tuple of tuple(torch.FloatTensor) of length config.n_layers, with each tuple having 2 tensors of shape (batch_size, num_heads, sequence_length, embed_size_per_head). Contains pre-computed hidden-states (key and values in the self-attention blocks) that can be used (see past_key_values input) to speed up sequential decoding.
hidden_states (tuple(torch.FloatTensor), optional, returned when output_hidden_states=True is passed or when config.output_hidden_states=True) – Tuple of torch.FloatTensor (one for the output of the embeddings + one for the output of each layer) of shape (batch_size, sequence_length, hidden_size). Hidden-states of the model at the output of each layer plus the initial embedding outputs.
attentions (tuple(torch.FloatTensor), optional, returned when output_attentions=True is passed or when config.output_attentions=True) – Tuple of torch.FloatTensor (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length). Attention weights after the attention softmax, used to compute the weighted average in the self-attention heads.
MaskedLMOutput

class transformers.modeling_outputs.MaskedLMOutput(loss: Optional[torch.FloatTensor] = None, logits: torch.FloatTensor = None, hidden_states: Optional[Tuple[torch.FloatTensor]] = None, attentions: Optional[Tuple[torch.FloatTensor]] = None)
Base class for masked language models outputs.
Parameters
loss (torch.FloatTensor of shape (1,), optional, returned when labels is provided) – Masked language modeling (MLM) loss.
logits (torch.FloatTensor of shape (batch_size, sequence_length, config.vocab_size)) – Prediction scores of the language modeling head (scores for each vocabulary token before SoftMax).
hidden_states (tuple(torch.FloatTensor), optional, returned when output_hidden_states=True is passed or when config.output_hidden_states=True) – Tuple of torch.FloatTensor (one for the output of the embeddings + one for the output of each layer) of shape (batch_size, sequence_length, hidden_size). Hidden-states of the model at the output of each layer plus the initial embedding outputs.
attentions (tuple(torch.FloatTensor), optional, returned when output_attentions=True is passed or when config.output_attentions=True) – Tuple of torch.FloatTensor (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length). Attention weights after the attention softmax, used to compute the weighted average in the self-attention heads.
Seq2SeqLMOutput

class transformers.modeling_outputs.Seq2SeqLMOutput(loss: Optional[torch.FloatTensor] = None, logits: torch.FloatTensor = None, past_key_values: Optional[Tuple[Tuple[torch.FloatTensor]]] = None, decoder_hidden_states: Optional[Tuple[torch.FloatTensor]] = None, decoder_attentions: Optional[Tuple[torch.FloatTensor]] = None, cross_attentions: Optional[Tuple[torch.FloatTensor]] = None, encoder_last_hidden_state: Optional[torch.FloatTensor] = None, encoder_hidden_states: Optional[Tuple[torch.FloatTensor]] = None, encoder_attentions: Optional[Tuple[torch.FloatTensor]] = None)
Base class for sequence-to-sequence language models outputs.
Parameters
loss (torch.FloatTensor of shape (1,), optional, returned when labels is provided) – Language modeling loss.
logits (torch.FloatTensor of shape (batch_size, sequence_length, config.vocab_size)) – Prediction scores of the language modeling head (scores for each vocabulary token before SoftMax).
past_key_values (tuple(tuple(torch.FloatTensor)), optional, returned when use_cache=True is passed or when config.use_cache=True) – Tuple of tuple(torch.FloatTensor) of length config.n_layers, with each tuple having 2 tensors of shape (batch_size, num_heads, sequence_length, embed_size_per_head) and 2 additional tensors of shape (batch_size, num_heads, encoder_sequence_length, embed_size_per_head). Contains pre-computed hidden-states (key and values in the self-attention blocks and in the cross-attention blocks) that can be used (see past_key_values input) to speed up sequential decoding.
decoder_hidden_states (tuple(torch.FloatTensor), optional, returned when output_hidden_states=True is passed or when config.output_hidden_states=True) – Tuple of torch.FloatTensor (one for the output of the embeddings + one for the output of each layer) of shape (batch_size, sequence_length, hidden_size). Hidden-states of the decoder at the output of each layer plus the initial embedding outputs.
decoder_attentions (tuple(torch.FloatTensor), optional, returned when output_attentions=True is passed or when config.output_attentions=True) – Tuple of torch.FloatTensor (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length). Attention weights of the decoder, after the attention softmax, used to compute the weighted average in the self-attention heads.
cross_attentions (tuple(torch.FloatTensor), optional, returned when output_attentions=True is passed or when config.output_attentions=True) – Tuple of torch.FloatTensor (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length). Attention weights of the decoder's cross-attention layer, after the attention softmax, used to compute the weighted average in the cross-attention heads.
encoder_last_hidden_state (torch.FloatTensor of shape (batch_size, sequence_length, hidden_size), optional) – Sequence of hidden-states at the output of the last layer of the encoder of the model.
encoder_hidden_states (tuple(torch.FloatTensor), optional, returned when output_hidden_states=True is passed or when config.output_hidden_states=True) – Tuple of torch.FloatTensor (one for the output of the embeddings + one for the output of each layer) of shape (batch_size, sequence_length, hidden_size). Hidden-states of the encoder at the output of each layer plus the initial embedding outputs.
encoder_attentions (tuple(torch.FloatTensor), optional, returned when output_attentions=True is passed or when config.output_attentions=True) – Tuple of torch.FloatTensor (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length). Attention weights of the encoder, after the attention softmax, used to compute the weighted average in the self-attention heads.
NextSentencePredictorOutput

class transformers.modeling_outputs.NextSentencePredictorOutput(loss: Optional[torch.FloatTensor] = None, logits: torch.FloatTensor = None, hidden_states: Optional[Tuple[torch.FloatTensor]] = None, attentions: Optional[Tuple[torch.FloatTensor]] = None)
Base class for outputs of models predicting if two sentences are consecutive or not.
Parameters
loss (torch.FloatTensor of shape (1,), optional, returned when next_sentence_label is provided) – Next sequence prediction (classification) loss.
logits (torch.FloatTensor of shape (batch_size, 2)) – Prediction scores of the next sequence prediction (classification) head (scores of True/False continuation before SoftMax).
hidden_states (tuple(torch.FloatTensor), optional, returned when output_hidden_states=True is passed or when config.output_hidden_states=True) – Tuple of torch.FloatTensor (one for the output of the embeddings + one for the output of each layer) of shape (batch_size, sequence_length, hidden_size). Hidden-states of the model at the output of each layer plus the initial embedding outputs.
attentions (tuple(torch.FloatTensor), optional, returned when output_attentions=True is passed or when config.output_attentions=True) – Tuple of torch.FloatTensor (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length). Attention weights after the attention softmax, used to compute the weighted average in the self-attention heads.
SequenceClassifierOutput

class transformers.modeling_outputs.SequenceClassifierOutput(loss: Optional[torch.FloatTensor] = None, logits: torch.FloatTensor = None, hidden_states: Optional[Tuple[torch.FloatTensor]] = None, attentions: Optional[Tuple[torch.FloatTensor]] = None)
Base class for outputs of sentence classification models.
 Parameters
loss (torch.FloatTensor of shape (1,), optional, returned when labels is provided) – Classification (or regression if config.num_labels==1) loss.
logits (torch.FloatTensor of shape (batch_size, config.num_labels)) – Classification (or regression if config.num_labels==1) scores (before SoftMax).
hidden_states (tuple(torch.FloatTensor), optional, returned when output_hidden_states=True is passed or when config.output_hidden_states=True) – Tuple of torch.FloatTensor (one for the output of the embeddings + one for the output of each layer) of shape (batch_size, sequence_length, hidden_size). Hidden-states of the model at the output of each layer plus the initial embedding outputs.
attentions (tuple(torch.FloatTensor), optional, returned when output_attentions=True is passed or when config.output_attentions=True) – Tuple of torch.FloatTensor (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length). Attention weights after the attention softmax, used to compute the weighted average in the self-attention heads.
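The description only states that the loss is a regression loss when config.num_labels==1 and a classification loss otherwise. As an illustration of that switch, here is a plain-Python sketch; the choice of mean squared error and mean cross-entropy, and the helper name sequence_loss, are assumptions rather than something the text above confirms.

```python
import math

def sequence_loss(logits, labels, num_labels):
    """Mean loss over a batch; logits is a list of rows of length num_labels."""
    if num_labels == 1:
        # Regression (assumed MSE): compare the single score with the float target.
        return sum((row[0] - y) ** 2 for row, y in zip(logits, labels)) / len(labels)
    # Classification (assumed cross-entropy): -log softmax(row)[target class].
    total = 0.0
    for row, y in zip(logits, labels):
        peak = max(row)
        log_norm = peak + math.log(sum(math.exp(s - peak) for s in row))
        total += log_norm - row[y]
    return total / len(labels)

regression = sequence_loss([[0.5], [2.0]], [1.0, 2.0], num_labels=1)
classification = sequence_loss([[0.0, 0.0]], [1], num_labels=2)
```

In both branches the result is a single scalar, which matches the documented loss shape of (1,).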
Seq2SeqSequenceClassifierOutput

class transformers.modeling_outputs.Seq2SeqSequenceClassifierOutput(loss: Optional[torch.FloatTensor] = None, logits: torch.FloatTensor = None, past_key_values: Optional[Tuple[Tuple[torch.FloatTensor]]] = None, decoder_hidden_states: Optional[Tuple[torch.FloatTensor]] = None, decoder_attentions: Optional[Tuple[torch.FloatTensor]] = None, cross_attentions: Optional[Tuple[torch.FloatTensor]] = None, encoder_last_hidden_state: Optional[torch.FloatTensor] = None, encoder_hidden_states: Optional[Tuple[torch.FloatTensor]] = None, encoder_attentions: Optional[Tuple[torch.FloatTensor]] = None)
Base class for outputs of sequence-to-sequence sentence classification models.
 Parameters
loss (torch.FloatTensor of shape (1,), optional, returned when labels is provided) – Classification (or regression if config.num_labels==1) loss.
logits (torch.FloatTensor of shape (batch_size, config.num_labels)) – Classification (or regression if config.num_labels==1) scores (before SoftMax).
past_key_values (tuple(tuple(torch.FloatTensor)), optional, returned when use_cache=True is passed or when config.use_cache=True) – Tuple of tuple(torch.FloatTensor) of length config.n_layers, with each tuple having 2 tensors of shape (batch_size, num_heads, sequence_length, embed_size_per_head) and 2 additional tensors of shape (batch_size, num_heads, encoder_sequence_length, embed_size_per_head). Contains pre-computed hidden-states (key and values in the self-attention blocks and in the cross-attention blocks) that can be used (see past_key_values input) to speed up sequential decoding.
decoder_hidden_states (tuple(torch.FloatTensor), optional, returned when output_hidden_states=True is passed or when config.output_hidden_states=True) – Tuple of torch.FloatTensor (one for the output of the embeddings + one for the output of each layer) of shape (batch_size, sequence_length, hidden_size). Hidden-states of the decoder at the output of each layer plus the initial embedding outputs.
decoder_attentions (tuple(torch.FloatTensor), optional, returned when output_attentions=True is passed or when config.output_attentions=True) – Tuple of torch.FloatTensor (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length). Attention weights of the decoder, after the attention softmax, used to compute the weighted average in the self-attention heads.
cross_attentions (tuple(torch.FloatTensor), optional, returned when output_attentions=True is passed or when config.output_attentions=True) – Tuple of torch.FloatTensor (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length). Attention weights of the decoder's cross-attention layer, after the attention softmax, used to compute the weighted average in the cross-attention heads.
encoder_last_hidden_state (torch.FloatTensor of shape (batch_size, sequence_length, hidden_size), optional) – Sequence of hidden-states at the output of the last layer of the encoder of the model.
encoder_hidden_states (tuple(torch.FloatTensor), optional, returned when output_hidden_states=True is passed or when config.output_hidden_states=True) – Tuple of torch.FloatTensor (one for the output of the embeddings + one for the output of each layer) of shape (batch_size, sequence_length, hidden_size). Hidden-states of the encoder at the output of each layer plus the initial embedding outputs.
encoder_attentions (tuple(torch.FloatTensor), optional, returned when output_attentions=True is passed or when config.output_attentions=True) – Tuple of torch.FloatTensor (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length). Attention weights of the encoder, after the attention softmax, used to compute the weighted average in the self-attention heads.
MultipleChoiceModelOutput

class transformers.modeling_outputs.MultipleChoiceModelOutput(loss: Optional[torch.FloatTensor] = None, logits: torch.FloatTensor = None, hidden_states: Optional[Tuple[torch.FloatTensor]] = None, attentions: Optional[Tuple[torch.FloatTensor]] = None)
Base class for outputs of multiple choice models.
 Parameters
loss (torch.FloatTensor of shape (1,), optional, returned when labels is provided) – Classification loss.
logits (torch.FloatTensor of shape (batch_size, num_choices)) – Classification scores (before SoftMax). num_choices is the second dimension of the input tensors (see input_ids above).
hidden_states (tuple(torch.FloatTensor), optional, returned when output_hidden_states=True is passed or when config.output_hidden_states=True) – Tuple of torch.FloatTensor (one for the output of the embeddings + one for the output of each layer) of shape (batch_size, sequence_length, hidden_size). Hidden-states of the model at the output of each layer plus the initial embedding outputs.
attentions (tuple(torch.FloatTensor), optional, returned when output_attentions=True is passed or when config.output_attentions=True) – Tuple of torch.FloatTensor (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length). Attention weights after the attention softmax, used to compute the weighted average in the self-attention heads.
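Since num_choices is the second dimension of both the inputs and the logits, each (example, choice) pair ends up with exactly one score. The sketch below shows that grouping in plain Python; the helper reshape_choice_scores and the flat score list are hypothetical, and it is an assumption that a model scores each pair independently before the scores are grouped.

```python
def reshape_choice_scores(flat_scores, num_choices):
    """Group one score per (example, choice) pair into rows of length num_choices."""
    if len(flat_scores) % num_choices:
        raise ValueError("number of scores must be a multiple of num_choices")
    return [flat_scores[i:i + num_choices]
            for i in range(0, len(flat_scores), num_choices)]

# Six hypothetical scores: batch_size=2 examples with num_choices=3 each.
flat_scores = [0.2, 1.5, -0.3, 0.9, 0.1, 0.4]
logits = reshape_choice_scores(flat_scores, num_choices=3)

# The predicted choice is the argmax over the num_choices dimension.
choices = [max(range(len(row)), key=row.__getitem__) for row in logits]
```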
TokenClassifierOutput

class transformers.modeling_outputs.TokenClassifierOutput(loss: Optional[torch.FloatTensor] = None, logits: torch.FloatTensor = None, hidden_states: Optional[Tuple[torch.FloatTensor]] = None, attentions: Optional[Tuple[torch.FloatTensor]] = None)
Base class for outputs of token classification models.
 Parameters
loss (torch.FloatTensor of shape (1,), optional, returned when labels is provided) – Classification loss.
logits (torch.FloatTensor of shape (batch_size, sequence_length, config.num_labels)) – Classification scores (before SoftMax).
hidden_states (tuple(torch.FloatTensor), optional, returned when output_hidden_states=True is passed or when config.output_hidden_states=True) – Tuple of torch.FloatTensor (one for the output of the embeddings + one for the output of each layer) of shape (batch_size, sequence_length, hidden_size). Hidden-states of the model at the output of each layer plus the initial embedding outputs.
attentions (tuple(torch.FloatTensor), optional, returned when output_attentions=True is passed or when config.output_attentions=True) – Tuple of torch.FloatTensor (one for each layer) of shape (batch_size, sequence_length, sequence_length) per head, i.e. (batch_size, num_heads, sequence_length, sequence_length). Attention weights after the attention softmax, used to compute the weighted average in the self-attention heads.
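Because the logits carry one score vector per token, turning them into labels is an argmax over the last dimension. A minimal plain-Python sketch follows; the label map id2label and the logits are made-up values for illustration only.

```python
def predict_token_labels(logits, id2label):
    """Map logits of shape (batch_size, sequence_length, num_labels) to label names."""
    predictions = []
    for sequence in logits:
        row = []
        for token_scores in sequence:
            # Index of the highest score for this token.
            best = max(range(len(token_scores)), key=token_scores.__getitem__)
            row.append(id2label[best])
        predictions.append(row)
    return predictions

id2label = {0: "O", 1: "B-PER", 2: "I-PER"}  # hypothetical label map
logits = [[[3.0, 0.1, 0.2], [0.0, 2.5, 0.3], [0.1, 0.2, 1.9]]]
labels = predict_token_labels(logits, id2label)
```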
QuestionAnsweringModelOutput

class transformers.modeling_outputs.QuestionAnsweringModelOutput(loss: Optional[torch.FloatTensor] = None, start_logits: torch.FloatTensor = None, end_logits: torch.FloatTensor = None, hidden_states: Optional[Tuple[torch.FloatTensor]] = None, attentions: Optional[Tuple[torch.FloatTensor]] = None)
Base class for outputs of question answering models.
 Parameters
loss (torch.FloatTensor of shape (1,), optional, returned when labels is provided) – Total span extraction loss is the sum of a Cross-Entropy for the start and end positions.
start_logits (torch.FloatTensor of shape (batch_size, sequence_length)) – Span-start scores (before SoftMax).
end_logits (torch.FloatTensor of shape (batch_size, sequence_length)) – Span-end scores (before SoftMax).
hidden_states (tuple(torch.FloatTensor), optional, returned when output_hidden_states=True is passed or when config.output_hidden_states=True) – Tuple of torch.FloatTensor (one for the output of the embeddings + one for the output of each layer) of shape (batch_size, sequence_length, hidden_size). Hidden-states of the model at the output of each layer plus the initial embedding outputs.
attentions (tuple(torch.FloatTensor), optional, returned when output_attentions=True is passed or when config.output_attentions=True) – Tuple of torch.FloatTensor (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length). Attention weights after the attention softmax, used to compute the weighted average in the self-attention heads.
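The start and end scores are produced independently, so taking the argmax of each can yield an end position that precedes the start. One common way to decode a span, sketched here in plain Python, is to search for the pair with the highest combined score subject to start <= end. The helper best_span and its max_answer_len cap are assumptions for illustration, not part of the output class.

```python
def best_span(start_logits, end_logits, max_answer_len=30):
    """Return (start, end, score) maximising start + end logit with start <= end."""
    best = None
    for start, s_score in enumerate(start_logits):
        last = min(len(end_logits), start + max_answer_len)
        for end in range(start, last):
            score = s_score + end_logits[end]
            if best is None or score > best[2]:
                best = (start, end, score)
    return best

# One hypothetical sequence; the raw argmaxes would give start=1, end=0.
start_logits = [0.1, 4.0, 0.2, 0.3]
end_logits = [5.0, 0.1, 3.0, 0.2]
span = best_span(start_logits, end_logits)
```

Here the unconstrained end argmax (position 0) is rejected and the span covering positions 1 to 2 is chosen instead.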
Seq2SeqQuestionAnsweringModelOutput

class transformers.modeling_outputs.Seq2SeqQuestionAnsweringModelOutput(loss: Optional[torch.FloatTensor] = None, start_logits: torch.FloatTensor = None, end_logits: torch.FloatTensor = None, past_key_values: Optional[Tuple[Tuple[torch.FloatTensor]]] = None, decoder_hidden_states: Optional[Tuple[torch.FloatTensor]] = None, decoder_attentions: Optional[Tuple[torch.FloatTensor]] = None, cross_attentions: Optional[Tuple[torch.FloatTensor]] = None, encoder_last_hidden_state: Optional[torch.FloatTensor] = None, encoder_hidden_states: Optional[Tuple[torch.FloatTensor]] = None, encoder_attentions: Optional[Tuple[torch.FloatTensor]] = None)
Base class for outputs of sequence-to-sequence question answering models.
 Parameters
loss (torch.FloatTensor of shape (1,), optional, returned when labels is provided) – Total span extraction loss is the sum of a Cross-Entropy for the start and end positions.
start_logits (torch.FloatTensor of shape (batch_size, sequence_length)) – Span-start scores (before SoftMax).
end_logits (torch.FloatTensor of shape (batch_size, sequence_length)) – Span-end scores (before SoftMax).
past_key_values (tuple(tuple(torch.FloatTensor)), optional, returned when use_cache=True is passed or when config.use_cache=True) – Tuple of tuple(torch.FloatTensor) of length config.n_layers, with each tuple having 2 tensors of shape (batch_size, num_heads, sequence_length, embed_size_per_head) and 2 additional tensors of shape (batch_size, num_heads, encoder_sequence_length, embed_size_per_head). Contains pre-computed hidden-states (key and values in the self-attention blocks and in the cross-attention blocks) that can be used (see past_key_values input) to speed up sequential decoding.
decoder_hidden_states (tuple(torch.FloatTensor), optional, returned when output_hidden_states=True is passed or when config.output_hidden_states=True) – Tuple of torch.FloatTensor (one for the output of the embeddings + one for the output of each layer) of shape (batch_size, sequence_length, hidden_size). Hidden-states of the decoder at the output of each layer plus the initial embedding outputs.
decoder_attentions (tuple(torch.FloatTensor), optional, returned when output_attentions=True is passed or when config.output_attentions=True) – Tuple of torch.FloatTensor (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length). Attention weights of the decoder, after the attention softmax, used to compute the weighted average in the self-attention heads.
cross_attentions (tuple(torch.FloatTensor), optional, returned when output_attentions=True is passed or when config.output_attentions=True) – Tuple of torch.FloatTensor (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length). Attention weights of the decoder's cross-attention layer, after the attention softmax, used to compute the weighted average in the cross-attention heads.
encoder_last_hidden_state (torch.FloatTensor of shape (batch_size, sequence_length, hidden_size), optional) – Sequence of hidden-states at the output of the last layer of the encoder of the model.
encoder_hidden_states (tuple(torch.FloatTensor), optional, returned when output_hidden_states=True is passed or when config.output_hidden_states=True) – Tuple of torch.FloatTensor (one for the output of the embeddings + one for the output of each layer) of shape (batch_size, sequence_length, hidden_size). Hidden-states of the encoder at the output of each layer plus the initial embedding outputs.
encoder_attentions (tuple(torch.FloatTensor), optional, returned when output_attentions=True is passed or when config.output_attentions=True) – Tuple of torch.FloatTensor (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length). Attention weights of the encoder, after the attention softmax, used to compute the weighted average in the self-attention heads.
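The nested layout of past_key_values in the sequence-to-sequence outputs can be easier to follow as concrete shapes. The sketch below builds the expected shape tuples in plain Python; the helper seq2seq_cache_shapes and the sizes passed to it are hypothetical, and placing the two self-attention tensors before the two cross-attention tensors follows the order in which the description lists them.

```python
def seq2seq_cache_shapes(n_layers, batch_size, num_heads,
                         decoder_len, encoder_len, head_dim):
    """Expected tensor shapes inside past_key_values, one inner tuple per layer."""
    self_attn = (batch_size, num_heads, decoder_len, head_dim)
    cross_attn = (batch_size, num_heads, encoder_len, head_dim)
    # Per layer: self-attention key and value, then cross-attention key and value.
    layer = (self_attn, self_attn, cross_attn, cross_attn)
    return tuple(layer for _ in range(n_layers))

shapes = seq2seq_cache_shapes(n_layers=6, batch_size=2, num_heads=8,
                              decoder_len=5, encoder_len=12, head_dim=64)
```

Only the self-attention entries depend on how many tokens have been decoded; the cross-attention entries depend on the encoder input length alone.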
TFBaseModelOutput

class transformers.modeling_tf_outputs.TFBaseModelOutput(last_hidden_state: tensorflow.python.framework.ops.Tensor = None, hidden_states: Optional[Tuple[tensorflow.python.framework.ops.Tensor]] = None, attentions: Optional[Tuple[tensorflow.python.framework.ops.Tensor]] = None)
Base class for model's outputs, with potential hidden states and attentions.
 Parameters
last_hidden_state (tf.Tensor of shape (batch_size, sequence_length, hidden_size)) – Sequence of hidden-states at the output of the last layer of the model.
hidden_states (tuple(tf.Tensor), optional, returned when output_hidden_states=True is passed or when config.output_hidden_states=True) – Tuple of tf.Tensor (one for the output of the embeddings + one for the output of each layer) of shape (batch_size, sequence_length, hidden_size). Hidden-states of the model at the output of each layer plus the initial embedding outputs.
attentions (tuple(tf.Tensor), optional, returned when output_attentions=True is passed or when config.output_attentions=True) – Tuple of tf.Tensor (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length). Attention weights after the attention softmax, used to compute the weighted average in the self-attention heads.
TFBaseModelOutputWithPooling

class transformers.modeling_tf_outputs.TFBaseModelOutputWithPooling(last_hidden_state: tensorflow.python.framework.ops.Tensor = None, pooler_output: tensorflow.python.framework.ops.Tensor = None, hidden_states: Optional[Tuple[tensorflow.python.framework.ops.Tensor]] = None, attentions: Optional[Tuple[tensorflow.python.framework.ops.Tensor]] = None)
Base class for model's outputs that also contains a pooling of the last hidden states.
 Parameters
last_hidden_state (tf.Tensor of shape (batch_size, sequence_length, hidden_size)) – Sequence of hidden-states at the output of the last layer of the model.
pooler_output (tf.Tensor of shape (batch_size, hidden_size)) – Last layer hidden-state of the first token of the sequence (classification token) further processed by a Linear layer and a Tanh activation function. The Linear layer weights are trained from the next sentence prediction (classification) objective during pretraining. This output is usually not a good summary of the semantic content of the input; you're often better off averaging or pooling the sequence of hidden-states for the whole input sequence.
hidden_states (tuple(tf.Tensor), optional, returned when output_hidden_states=True is passed or when config.output_hidden_states=True) – Tuple of tf.Tensor (one for the output of the embeddings + one for the output of each layer) of shape (batch_size, sequence_length, hidden_size). Hidden-states of the model at the output of each layer plus the initial embedding outputs.
attentions (tuple(tf.Tensor), optional, returned when output_attentions=True is passed or when config.output_attentions=True) – Tuple of tf.Tensor (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length). Attention weights after the attention softmax, used to compute the weighted average in the self-attention heads.
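The note on pooler_output suggests averaging the sequence of hidden-states instead. A minimal plain-Python sketch of that averaging follows, using nested lists in place of tensors. The attention_mask argument is an assumption (it is not part of this output class); it is used so that padding positions do not contribute to the mean.

```python
def mean_pool(last_hidden_state, attention_mask):
    """Average token vectors per sequence, ignoring positions where the mask is 0."""
    pooled = []
    for tokens, mask in zip(last_hidden_state, attention_mask):
        hidden_size = len(tokens[0])
        count = max(sum(mask), 1)  # avoid dividing by zero for an all-padding row
        sums = [0.0] * hidden_size
        for vector, keep in zip(tokens, mask):
            if keep:
                for i, value in enumerate(vector):
                    sums[i] += value
        pooled.append([s / count for s in sums])
    return pooled

# Shape (batch_size=1, sequence_length=3, hidden_size=2); the last token is padding.
hidden = [[[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]]]
mask = [[1, 1, 0]]
pooled = mean_pool(hidden, mask)
```

The result has shape (batch_size, hidden_size), the same as pooler_output.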
TFBaseModelOutputWithPoolingAndCrossAttentions

class transformers.modeling_tf_outputs.TFBaseModelOutputWithPoolingAndCrossAttentions(last_hidden_state: tensorflow.python.framework.ops.Tensor = None, pooler_output: tensorflow.python.framework.ops.Tensor = None, past_key_values: Optional[List[tensorflow.python.framework.ops.Tensor]] = None, hidden_states: Optional[Tuple[tensorflow.python.framework.ops.Tensor]] = None, attentions: Optional[Tuple[tensorflow.python.framework.ops.Tensor]] = None, cross_attentions: Optional[Tuple[tensorflow.python.framework.ops.Tensor]] = None)
Base class for model's outputs that also contains a pooling of the last hidden states.
 Parameters
last_hidden_state (tf.Tensor of shape (batch_size, sequence_length, hidden_size)) – Sequence of hidden-states at the output of the last layer of the model.
pooler_output (tf.Tensor of shape (batch_size, hidden_size)) – Last layer hidden-state of the first token of the sequence (classification token) further processed by a Linear layer and a Tanh activation function. The Linear layer weights are trained from the next sentence prediction (classification) objective during pretraining. This output is usually not a good summary of the semantic content of the input; you're often better off averaging or pooling the sequence of hidden-states for the whole input sequence.
past_key_values (List[tf.Tensor], optional, returned when use_cache=True is passed or when config.use_cache=True) – List of tf.Tensor of length config.n_layers, with each tensor of shape (2, batch_size, num_heads, sequence_length, embed_size_per_head). Contains pre-computed hidden-states (key and values in the attention blocks) that can be used (see past_key_values input) to speed up sequential decoding.
hidden_states (tuple(tf.Tensor), optional, returned when output_hidden_states=True is passed or when config.output_hidden_states=True) – Tuple of tf.Tensor (one for the output of the embeddings + one for the output of each layer) of shape (batch_size, sequence_length, hidden_size). Hidden-states of the model at the output of each layer plus the initial embedding outputs.
attentions (tuple(tf.Tensor), optional, returned when output_attentions=True is passed or when config.output_attentions=True) – Tuple of tf.Tensor (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length). Attention weights after the attention softmax, used to compute the weighted average in the self-attention heads.
cross_attentions (tuple(tf.Tensor), optional, returned when output_attentions=True is passed or when config.output_attentions=True) – Tuple of tf.Tensor (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length). Attention weights of the decoder's cross-attention layer, after the attention softmax, used to compute the weighted average in the cross-attention heads.
TFBaseModelOutputWithPast

class transformers.modeling_tf_outputs.TFBaseModelOutputWithPast(last_hidden_state: tensorflow.python.framework.ops.Tensor = None, past_key_values: Optional[List[tensorflow.python.framework.ops.Tensor]] = None, hidden_states: Optional[Tuple[tensorflow.python.framework.ops.Tensor]] = None, attentions: Optional[Tuple[tensorflow.python.framework.ops.Tensor]] = None)
Base class for model's outputs that may also contain a past key/values (to speed up sequential decoding).
 Parameters
last_hidden_state (tf.Tensor of shape (batch_size, sequence_length, hidden_size)) – Sequence of hidden-states at the output of the last layer of the model. If past_key_values is used, only the last hidden-state of the sequences of shape (batch_size, 1, hidden_size) is output.
past_key_values (List[tf.Tensor], optional, returned when use_cache=True is passed or when config.use_cache=True) – List of tf.Tensor of length config.n_layers, with each tensor of shape (2, batch_size, num_heads, sequence_length, embed_size_per_head). Contains pre-computed hidden-states (key and values in the attention blocks) that can be used (see past_key_values input) to speed up sequential decoding.
hidden_states (tuple(tf.Tensor), optional, returned when output_hidden_states=True is passed or when config.output_hidden_states=True) – Tuple of tf.Tensor (one for the output of the embeddings + one for the output of each layer) of shape (batch_size, sequence_length, hidden_size). Hidden-states of the model at the output of each layer plus the initial embedding outputs.
attentions (tuple(tf.Tensor), optional, returned when output_attentions=True is passed or when config.output_attentions=True) – Tuple of tf.Tensor (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length). Attention weights after the attention softmax, used to compute the weighted average in the self-attention heads.
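The interaction between past_key_values and last_hidden_state can be traced as shapes over a few decoding steps. The plain-Python sketch below is only an illustration: the helper decode_shapes and all sizes are hypothetical, and it assumes that the sequence_length dimension of each cached tensor counts every token processed so far.

```python
def decode_shapes(prompt_len, new_tokens, n_layers, batch_size,
                  num_heads, head_dim, hidden_size):
    """Yield (last_hidden_state shape, per-layer cache shapes) for each forward pass."""
    cached = 0
    steps = [prompt_len] + [1] * new_tokens  # full prompt first, then one token at a time
    for fed in steps:
        cached += fed
        hidden_shape = (batch_size, fed, hidden_size)
        # Leading dimension 2 stacks the key and the value tensor of a layer.
        cache_shape = (2, batch_size, num_heads, cached, head_dim)
        yield hidden_shape, [cache_shape] * n_layers

trace = list(decode_shapes(prompt_len=4, new_tokens=2, n_layers=12,
                           batch_size=1, num_heads=12, head_dim=64,
                           hidden_size=768))
```

After the first pass, each step feeds a single token, so last_hidden_state has a sequence dimension of 1 while the cache keeps growing.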
TFBaseModelOutputWithPastAndCrossAttentions

class transformers.modeling_tf_outputs.TFBaseModelOutputWithPastAndCrossAttentions(last_hidden_state: tensorflow.python.framework.ops.Tensor = None, past_key_values: Optional[List[tensorflow.python.framework.ops.Tensor]] = None, hidden_states: Optional[Tuple[tensorflow.python.framework.ops.Tensor]] = None, attentions: Optional[Tuple[tensorflow.python.framework.ops.Tensor]] = None, cross_attentions: Optional[Tuple[tensorflow.python.framework.ops.Tensor]] = None)
Base class for model's outputs that may also contain a past key/values (to speed up sequential decoding).
 Parameters
last_hidden_state (tf.Tensor of shape (batch_size, sequence_length, hidden_size)) – Sequence of hidden-states at the output of the last layer of the model. If past_key_values is used, only the last hidden-state of the sequences of shape (batch_size, 1, hidden_size) is output.
past_key_values (List[tf.Tensor], optional, returned when use_cache=True is passed or when config.use_cache=True) – List of tf.Tensor of length config.n_layers, with each tensor of shape (2, batch_size, num_heads, sequence_length, embed_size_per_head). Contains pre-computed hidden-states (key and values in the attention blocks) that can be used (see past_key_values input) to speed up sequential decoding.
hidden_states (tuple(tf.Tensor), optional, returned when output_hidden_states=True is passed or when config.output_hidden_states=True) – Tuple of tf.Tensor (one for the output of the embeddings + one for the output of each layer) of shape (batch_size, sequence_length, hidden_size). Hidden-states of the model at the output of each layer plus the initial embedding outputs.
attentions (tuple(tf.Tensor), optional, returned when output_attentions=True is passed or when config.output_attentions=True) – Tuple of tf.Tensor (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length). Attention weights after the attention softmax, used to compute the weighted average in the self-attention heads.
cross_attentions (tuple(tf.Tensor), optional, returned when output_attentions=True is passed or when config.output_attentions=True) – Tuple of tf.Tensor (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length). Attention weights of the decoder's cross-attention layer, after the attention softmax, used to compute the weighted average in the cross-attention heads.
TFSeq2SeqModelOutput

class transformers.modeling_tf_outputs.TFSeq2SeqModelOutput(last_hidden_state: tensorflow.python.framework.ops.Tensor = None, past_key_values: Optional[List[tensorflow.python.framework.ops.Tensor]] = None, decoder_hidden_states: Optional[Tuple[tensorflow.python.framework.ops.Tensor]] = None, decoder_attentions: Optional[Tuple[tensorflow.python.framework.ops.Tensor]] = None, cross_attentions: Optional[Tuple[tensorflow.python.framework.ops.Tensor]] = None, encoder_last_hidden_state: Optional[tensorflow.python.framework.ops.Tensor] = None, encoder_hidden_states: Optional[Tuple[tensorflow.python.framework.ops.Tensor]] = None, encoder_attentions: Optional[Tuple[tensorflow.python.framework.ops.Tensor]] = None)
Base class for model encoder's outputs that also contains pre-computed hidden states that can speed up sequential decoding.
 Parameters
last_hidden_state (tf.Tensor of shape (batch_size, sequence_length, hidden_size)) – Sequence of hidden-states at the output of the last layer of the decoder of the model. If past_key_values is used, only the last hidden-state of the sequences of shape (batch_size, 1, hidden_size) is output.
past_key_values (List[tf.Tensor], optional, returned when use_cache=True is passed or when config.use_cache=True) – List of tf.Tensor of length config.n_layers, with each tensor of shape (2, batch_size, num_heads, sequence_length, embed_size_per_head). Contains pre-computed hidden-states (key and values in the attention blocks) of the decoder that can be used (see past_key_values input) to speed up sequential decoding.
decoder_hidden_states (tuple(tf.Tensor), optional, returned when output_hidden_states=True is passed or when config.output_hidden_states=True) – Tuple of tf.Tensor (one for the output of the embeddings + one for the output of each layer) of shape (batch_size, sequence_length, hidden_size). Hidden-states of the decoder at the output of each layer plus the initial embedding outputs.
decoder_attentions (tuple(tf.Tensor), optional, returned when output_attentions=True is passed or when config.output_attentions=True) – Tuple of tf.Tensor (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length). Attention weights of the decoder, after the attention softmax, used to compute the weighted average in the self-attention heads.
cross_attentions (tuple(tf.Tensor), optional, returned when output_attentions=True is passed or when config.output_attentions=True) – Tuple of tf.Tensor (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length). Attention weights of the decoder's cross-attention layer, after the attention softmax, used to compute the weighted average in the cross-attention heads.
encoder_last_hidden_state (tf.Tensor of shape (batch_size, sequence_length, hidden_size), optional) – Sequence of hidden-states at the output of the last layer of the encoder of the model.
encoder_hidden_states (tuple(tf.Tensor), optional, returned when output_hidden_states=True is passed or when config.output_hidden_states=True) – Tuple of tf.Tensor (one for the output of the embeddings + one for the output of each layer) of shape (batch_size, sequence_length, hidden_size). Hidden-states of the encoder at the output of each layer plus the initial embedding outputs.
encoder_attentions (tuple(tf.Tensor), optional, returned when output_attentions=True is passed or when config.output_attentions=True) – Tuple of tf.Tensor (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length). Attention weights of the encoder, after the attention softmax, used to compute the weighted average in the self-attention heads.
TFCausalLMOutput

class transformers.modeling_tf_outputs.TFCausalLMOutput(loss: Optional[tensorflow.python.framework.ops.Tensor] = None, logits: tensorflow.python.framework.ops.Tensor = None, hidden_states: Optional[Tuple[tensorflow.python.framework.ops.Tensor]] = None, attentions: Optional[Tuple[tensorflow.python.framework.ops.Tensor]] = None)
Base class for causal language model (or autoregressive) outputs.
 Parameters
loss (tf.Tensor of shape (n,), optional, where n is the number of non-masked labels, returned when labels is provided) – Language modeling loss (for next-token prediction).
logits (tf.Tensor of shape (batch_size, sequence_length, config.vocab_size)) – Prediction scores of the language modeling head (scores for each vocabulary token before SoftMax).
hidden_states (tuple(tf.Tensor), optional, returned when output_hidden_states=True is passed or when config.output_hidden_states=True) – Tuple of tf.Tensor (one for the output of the embeddings + one for the output of each layer) of shape (batch_size, sequence_length, hidden_size). Hidden-states of the model at the output of each layer plus the initial embedding outputs.
attentions (tuple(tf.Tensor), optional, returned when output_attentions=True is passed or when config.output_attentions=True) – Tuple of tf.Tensor (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length). Attention weights after the attention softmax, used to compute the weighted average in the self-attention heads.
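Unlike the PyTorch outputs, the loss here has shape (n,), one entry per non-masked label. The plain-Python sketch below shows how such a per-token loss could come about. It is not the library's implementation: the use of -100 as the masking value and the one-position shift between logits and labels are both assumptions made for this illustration.

```python
import math

IGNORE_INDEX = -100  # assumed marker for masked labels; not stated in the text above

def per_token_lm_loss(logits, labels):
    """Cross-entropy for every non-masked label, returned as a flat list of length n."""
    losses = []
    for seq_logits, seq_labels in zip(logits, labels):
        # Position t predicts token t + 1, so drop the last logit and the first label.
        for scores, target in zip(seq_logits[:-1], seq_labels[1:]):
            if target == IGNORE_INDEX:
                continue
            peak = max(scores)
            log_norm = peak + math.log(sum(math.exp(s - peak) for s in scores))
            losses.append(log_norm - scores[target])
    return losses

# Shape (batch_size=1, sequence_length=3, vocab_size=2) with uniform scores.
logits = [[[0.0, 0.0], [0.0, 0.0], [0.0, 0.0]]]
labels = [[0, 1, IGNORE_INDEX]]
losses = per_token_lm_loss(logits, labels)
```

Averaging the returned list would give a single scalar comparable to the (1,)-shaped losses documented for the PyTorch classes.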
TFCausalLMOutputWithCrossAttentions

class transformers.modeling_tf_outputs.TFCausalLMOutputWithCrossAttentions(loss: Optional[tensorflow.python.framework.ops.Tensor] = None, logits: tensorflow.python.framework.ops.Tensor = None, past_key_values: Optional[List[tensorflow.python.framework.ops.Tensor]] = None, hidden_states: Optional[Tuple[tensorflow.python.framework.ops.Tensor]] = None, attentions: Optional[Tuple[tensorflow.python.framework.ops.Tensor]] = None, cross_attentions: Optional[Tuple[tensorflow.python.framework.ops.Tensor]] = None)
Base class for causal language model (or autoregressive) outputs.
 Parameters
loss (tf.Tensor of shape (n,), optional, where n is the number of non-masked labels, returned when labels is provided) – Language modeling loss (for next-token prediction).
logits (tf.Tensor of shape (batch_size, sequence_length, config.vocab_size)) – Prediction scores of the language modeling head (scores for each vocabulary token before SoftMax).
hidden_states (tuple(tf.Tensor), optional, returned when output_hidden_states=True is passed or when config.output_hidden_states=True) – Tuple of tf.Tensor (one for the output of the embeddings + one for the output of each layer) of shape (batch_size, sequence_length, hidden_size). Hidden-states of the model at the output of each layer plus the initial embedding outputs.
attentions (tuple(tf.Tensor), optional, returned when output_attentions=True is passed or when config.output_attentions=True) – Tuple of tf.Tensor (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length). Attention weights after the attention softmax, used to compute the weighted average in the self-attention heads.
cross_attentions (tuple(tf.Tensor), optional, returned when output_attentions=True is passed or when config.output_attentions=True) – Tuple of tf.Tensor (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length). Attention weights of the decoder's cross-attention layer, after the attention softmax, used to compute the weighted average in the cross-attention heads.
past_key_values (List[tf.Tensor], optional, returned when use_cache=True is passed or when config.use_cache=True) – List of tf.Tensor of length config.n_layers, with each tensor of shape (2, batch_size, num_heads, sequence_length, embed_size_per_head). Contains pre-computed hidden-states (key and values in the attention blocks) that can be used (see past_key_values input) to speed up sequential decoding.
TFCausalLMOutputWithPastÂ¶

class
transformers.modeling_tf_outputs.
TFCausalLMOutputWithPast
(loss: Optional[tensorflow.python.framework.ops.Tensor] = None, logits: tensorflow.python.framework.ops.Tensor = None, past_key_values: Optional[List[tensorflow.python.framework.ops.Tensor]] = None, hidden_states: Optional[Tuple[tensorflow.python.framework.ops.Tensor]] = None, attentions: Optional[Tuple[tensorflow.python.framework.ops.Tensor]] = None)[source]Â¶ Base class for causal language model (or autoregressive) outputs.
 Parameters
loss (tf.Tensor of shape (n,), optional, where n is the number of non-masked labels, returned when labels is provided) – Language modeling loss (for next-token prediction).
logits (tf.Tensor of shape (batch_size, sequence_length, config.vocab_size)) – Prediction scores of the language modeling head (scores for each vocabulary token before SoftMax).
past_key_values (List[tf.Tensor], optional, returned when use_cache=True is passed or when config.use_cache=True) – List of tf.Tensor of length config.n_layers, with each tensor of shape (2, batch_size, num_heads, sequence_length, embed_size_per_head). Contains pre-computed hidden-states (key and values in the attention blocks) that can be used (see past_key_values input) to speed up sequential decoding.
hidden_states (tuple(tf.Tensor), optional, returned when output_hidden_states=True is passed or when config.output_hidden_states=True) – Tuple of tf.Tensor (one for the output of the embeddings + one for the output of each layer) of shape (batch_size, sequence_length, hidden_size). Hidden-states of the model at the output of each layer plus the initial embedding outputs.
attentions (tuple(tf.Tensor), optional, returned when output_attentions=True is passed or when config.output_attentions=True) – Tuple of tf.Tensor (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length). Attention weights after the attention softmax, used to compute the weighted average in the self-attention heads.
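As a rough illustration of the past_key_values layout described above, the sketch below uses nested Python lists in place of tf.Tensor objects. The helper names (zeros, shape_of, append_step) and the toy sizes are assumptions made for this example and are not part of the library; a real model would concatenate tensors rather than append to lists.

```python
def zeros(shape):
    """Build a nested list of zeros with the given shape."""
    if len(shape) == 1:
        return [0.0] * shape[0]
    return [zeros(shape[1:]) for _ in range(shape[0])]


def shape_of(nested):
    """Return the shape of a rectangular, non-empty nested list."""
    dims = []
    while isinstance(nested, list):
        dims.append(len(nested))
        nested = nested[0]
    return tuple(dims)


# Toy sizes (assumptions for the example).
n_layers, batch_size, num_heads, head_dim = 2, 1, 2, 4
prompt_len = 3

# One entry per layer, each of shape
# (2, batch_size, num_heads, sequence_length, embed_size_per_head):
# index 0 holds the keys, index 1 holds the values.
past_key_values = [
    zeros((2, batch_size, num_heads, prompt_len, head_dim))
    for _ in range(n_layers)
]


def append_step(layer_cache, new_key, new_value):
    """Append the key/value vectors of one new position to a layer cache."""
    keys, values = layer_cache
    for b in range(len(keys)):
        for h in range(len(keys[b])):
            keys[b][h].append(new_key[b][h])
            values[b][h].append(new_value[b][h])


# Decoding one more token grows only the sequence_length axis.
for layer_cache in past_key_values:
    step_key = zeros((batch_size, num_heads, head_dim))
    step_value = zeros((batch_size, num_heads, head_dim))
    append_step(layer_cache, step_key, step_value)

cache_shape = shape_of(past_key_values[0])
```

Only the sequence_length axis changes between decoding steps, which is why reusing the cache avoids recomputing keys and values for earlier positions.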
TFMaskedLMOutput

class transformers.modeling_tf_outputs.TFMaskedLMOutput(loss: Optional[tensorflow.python.framework.ops.Tensor] = None, logits: tensorflow.python.framework.ops.Tensor = None, hidden_states: Optional[Tuple[tensorflow.python.framework.ops.Tensor]] = None, attentions: Optional[Tuple[tensorflow.python.framework.ops.Tensor]] = None)
Base class for masked language model outputs.
Parameters
loss (tf.Tensor of shape (n,), optional, where n is the number of non-masked labels, returned when labels is provided) – Masked language modeling (MLM) loss.
logits (tf.Tensor of shape (batch_size, sequence_length, config.vocab_size)) – Prediction scores of the language modeling head (scores for each vocabulary token before SoftMax).
hidden_states (tuple(tf.Tensor), optional, returned when output_hidden_states=True is passed or when config.output_hidden_states=True) – Tuple of tf.Tensor (one for the output of the embeddings + one for the output of each layer) of shape (batch_size, sequence_length, hidden_size). Hidden-states of the model at the output of each layer plus the initial embedding outputs.
attentions (tuple(tf.Tensor), optional, returned when output_attentions=True is passed or when config.output_attentions=True) – Tuple of tf.Tensor (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length). Attention weights after the attention softmax, used to compute the weighted average in the self-attention heads.
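To make the logits shape (batch_size, sequence_length, config.vocab_size) more concrete, here is a minimal sketch that reads a prediction off toy logits. Nested lists stand in for a tf.Tensor, and the values, the argmax helper and mask_position are all hypothetical.

```python
# Toy logits of shape (batch_size=1, sequence_length=3, vocab_size=4).
logits = [[[0.1, 2.0, 0.3, 0.0],
           [1.5, 0.2, 0.1, 0.4],
           [0.0, 0.1, 0.2, 3.0]]]


def argmax(scores):
    """Return the index of the largest score in a flat list."""
    return max(range(len(scores)), key=lambda i: scores[i])


# Highest-scoring vocabulary id at every position of every sequence.
predicted_ids = [[argmax(token_scores) for token_scores in sequence]
                 for sequence in logits]

# Hypothetical: suppose position 1 of the first sequence held the mask token.
mask_position = 1
predicted_at_mask = predicted_ids[0][mask_position]
```

Because the scores are taken before SoftMax, the argmax is the same as the argmax of the probabilities, so no normalization is needed to pick the most likely token.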
TFSeq2SeqLMOutput

class transformers.modeling_tf_outputs.TFSeq2SeqLMOutput(loss: Optional[tensorflow.python.framework.ops.Tensor] = None, logits: tensorflow.python.framework.ops.Tensor = None, past_key_values: Optional[List[tensorflow.python.framework.ops.Tensor]] = None, decoder_hidden_states: Optional[Tuple[tensorflow.python.framework.ops.Tensor]] = None, decoder_attentions: Optional[Tuple[tensorflow.python.framework.ops.Tensor]] = None, cross_attentions: Optional[Tuple[tensorflow.python.framework.ops.Tensor]] = None, encoder_last_hidden_state: Optional[tensorflow.python.framework.ops.Tensor] = None, encoder_hidden_states: Optional[Tuple[tensorflow.python.framework.ops.Tensor]] = None, encoder_attentions: Optional[Tuple[tensorflow.python.framework.ops.Tensor]] = None)
Base class for sequence-to-sequence language model outputs.
Parameters
loss (tf.Tensor of shape (n,), optional, where n is the number of non-masked labels, returned when labels is provided) – Language modeling loss.
logits (tf.Tensor of shape (batch_size, sequence_length, config.vocab_size)) – Prediction scores of the language modeling head (scores for each vocabulary token before SoftMax).
past_key_values (List[tf.Tensor], optional, returned when use_cache=True is passed or when config.use_cache=True) – List of tf.Tensor of length config.n_layers, with each tensor of shape (2, batch_size, num_heads, sequence_length, embed_size_per_head). Contains pre-computed hidden-states (key and values in the attention blocks) of the decoder that can be used (see past_key_values input) to speed up sequential decoding.
decoder_hidden_states (tuple(tf.Tensor), optional, returned when output_hidden_states=True is passed or when config.output_hidden_states=True) – Tuple of tf.Tensor (one for the output of the embeddings + one for the output of each layer) of shape (batch_size, sequence_length, hidden_size). Hidden-states of the decoder at the output of each layer plus the initial embedding outputs.
decoder_attentions (tuple(tf.Tensor), optional, returned when output_attentions=True is passed or when config.output_attentions=True) – Tuple of tf.Tensor (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length). Attention weights of the decoder, after the attention softmax, used to compute the weighted average in the self-attention heads.
cross_attentions (tuple(tf.Tensor), optional, returned when output_attentions=True is passed or when config.output_attentions=True) – Tuple of tf.Tensor (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length). Attention weights of the decoder's cross-attention layer, after the attention softmax, used to compute the weighted average in the cross-attention heads.
encoder_last_hidden_state (tf.Tensor of shape (batch_size, sequence_length, hidden_size), optional) – Sequence of hidden-states at the output of the last layer of the encoder of the model.
encoder_hidden_states (tuple(tf.Tensor), optional, returned when output_hidden_states=True is passed or when config.output_hidden_states=True) – Tuple of tf.Tensor (one for the output of the embeddings + one for the output of each layer) of shape (batch_size, sequence_length, hidden_size). Hidden-states of the encoder at the output of each layer plus the initial embedding outputs.
encoder_attentions (tuple(tf.Tensor), optional, returned when output_attentions=True is passed or when config.output_attentions=True) – Tuple of tf.Tensor (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length). Attention weights of the encoder, after the attention softmax, used to compute the weighted average in the self-attention heads.
TFNextSentencePredictorOutput

class transformers.modeling_tf_outputs.TFNextSentencePredictorOutput(loss: Optional[tensorflow.python.framework.ops.Tensor] = None, logits: tensorflow.python.framework.ops.Tensor = None, hidden_states: Optional[Tuple[tensorflow.python.framework.ops.Tensor]] = None, attentions: Optional[Tuple[tensorflow.python.framework.ops.Tensor]] = None)
Base class for outputs of models predicting if two sentences are consecutive or not.
Parameters
loss (tf.Tensor of shape (n,), optional, where n is the number of non-masked labels, returned when next_sentence_label is provided) – Next sentence prediction loss.
logits (tf.Tensor of shape (batch_size, 2)) – Prediction scores of the next sequence prediction (classification) head (scores of True/False continuation before SoftMax).
hidden_states (tuple(tf.Tensor), optional, returned when output_hidden_states=True is passed or when config.output_hidden_states=True) – Tuple of tf.Tensor (one for the output of the embeddings + one for the output of each layer) of shape (batch_size, sequence_length, hidden_size). Hidden-states of the model at the output of each layer plus the initial embedding outputs.
attentions (tuple(tf.Tensor), optional, returned when output_attentions=True is passed or when config.output_attentions=True) – Tuple of tf.Tensor (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length). Attention weights after the attention softmax, used to compute the weighted average in the self-attention heads.
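The (batch_size, 2) logits above are unnormalized scores; a minimal sketch of turning them into probabilities follows. The softmax helper and the toy values are made up for the example, and reading index 0 as "the second sentence is a continuation" is an assumption that should be checked against the documentation of the specific model.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a flat list of scores."""
    highest = max(scores)
    exps = [math.exp(s - highest) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


# Toy logits of shape (batch_size=2, 2).
logits = [[2.0, 0.0],
          [-1.0, 1.0]]

probs = [softmax(row) for row in logits]

# Assumption: index 0 means "continuation", index 1 means "not a continuation".
is_next = [row[0] > row[1] for row in probs]
```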
TFSequenceClassifierOutput

class transformers.modeling_tf_outputs.TFSequenceClassifierOutput(loss: Optional[tensorflow.python.framework.ops.Tensor] = None, logits: tensorflow.python.framework.ops.Tensor = None, hidden_states: Optional[Tuple[tensorflow.python.framework.ops.Tensor]] = None, attentions: Optional[Tuple[tensorflow.python.framework.ops.Tensor]] = None)
Base class for outputs of sentence classification models.
Parameters
loss (tf.Tensor of shape (batch_size, ), optional, returned when labels is provided) – Classification (or regression if config.num_labels==1) loss.
logits (tf.Tensor of shape (batch_size, config.num_labels)) – Classification (or regression if config.num_labels==1) scores (before SoftMax).
hidden_states (tuple(tf.Tensor), optional, returned when output_hidden_states=True is passed or when config.output_hidden_states=True) – Tuple of tf.Tensor (one for the output of the embeddings + one for the output of each layer) of shape (batch_size, sequence_length, hidden_size). Hidden-states of the model at the output of each layer plus the initial embedding outputs.
attentions (tuple(tf.Tensor), optional, returned when output_attentions=True is passed or when config.output_attentions=True) – Tuple of tf.Tensor (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length). Attention weights after the attention softmax, used to compute the weighted average in the self-attention heads.
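The loss above has shape (batch_size, ), that is, one value per example rather than a single averaged number. The sketch below shows what such a per-example cross-entropy looks like on toy logits. It assumes the classification case (config.num_labels > 1), and the helper name cross_entropy and all values are made up for illustration; it is not the library's loss implementation.

```python
import math


def cross_entropy(row, label):
    """Cross-entropy of one example: -log(softmax(row)[label])."""
    highest = max(row)
    log_sum_exp = highest + math.log(sum(math.exp(s - highest) for s in row))
    return log_sum_exp - row[label]


# Toy logits of shape (batch_size=2, num_labels=3) and matching labels.
logits = [[2.0, 0.5, -1.0],
          [0.0, 0.0, 0.0]]
labels = [0, 2]

# One loss value per example, i.e. shape (batch_size,).
loss = [cross_entropy(row, label) for row, label in zip(logits, labels)]

# Predicted class per example (ties resolve to the lowest index).
predictions = [max(range(len(row)), key=row.__getitem__) for row in logits]
```

With uniform logits, as in the second example, the loss equals log(num_labels), which is a handy sanity check for an untrained classifier.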
TFSeq2SeqSequenceClassifierOutput

class transformers.modeling_tf_outputs.TFSeq2SeqSequenceClassifierOutput(loss: Optional[tensorflow.python.framework.ops.Tensor] = None, logits: tensorflow.python.framework.ops.Tensor = None, past_key_values: Optional[List[tensorflow.python.framework.ops.Tensor]] = None, decoder_hidden_states: Optional[Tuple[tensorflow.python.framework.ops.Tensor]] = None, decoder_attentions: Optional[Tuple[tensorflow.python.framework.ops.Tensor]] = None, encoder_last_hidden_state: Optional[tensorflow.python.framework.ops.Tensor] = None, encoder_hidden_states: Optional[Tuple[tensorflow.python.framework.ops.Tensor]] = None, encoder_attentions: Optional[Tuple[tensorflow.python.framework.ops.Tensor]] = None)
Base class for outputs of sequence-to-sequence sentence classification models.
Parameters
loss (tf.Tensor of shape (1,), optional, returned when label is provided) – Classification (or regression if config.num_labels==1) loss.
logits (tf.Tensor of shape (batch_size, config.num_labels)) – Classification (or regression if config.num_labels==1) scores (before SoftMax).
past_key_values (List[tf.Tensor], optional, returned when use_cache=True is passed or when config.use_cache=True) – List of tf.Tensor of length config.n_layers, with each tensor of shape (2, batch_size, num_heads, sequence_length, embed_size_per_head). Contains pre-computed hidden-states (key and values in the attention blocks) of the decoder that can be used (see past_key_values input) to speed up sequential decoding.
decoder_hidden_states (tuple(tf.Tensor), optional, returned when output_hidden_states=True is passed or when config.output_hidden_states=True) – Tuple of tf.Tensor (one for the output of the embeddings + one for the output of each layer) of shape (batch_size, sequence_length, hidden_size). Hidden-states of the decoder at the output of each layer plus the initial embedding outputs.
decoder_attentions (tuple(tf.Tensor), optional, returned when output_attentions=True is passed or when config.output_attentions=True) – Tuple of tf.Tensor (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length). Attention weights of the decoder, after the attention softmax, used to compute the weighted average in the self-attention heads.
encoder_last_hidden_state (tf.Tensor of shape (batch_size, sequence_length, hidden_size), optional) – Sequence of hidden-states at the output of the last layer of the encoder of the model.
encoder_hidden_states (tuple(tf.Tensor), optional, returned when output_hidden_states=True is passed or when config.output_hidden_states=True) – Tuple of tf.Tensor (one for the output of the embeddings + one for the output of each layer) of shape (batch_size, sequence_length, hidden_size). Hidden-states of the encoder at the output of each layer plus the initial embedding outputs.
encoder_attentions (tuple(tf.Tensor), optional, returned when output_attentions=True is passed or when config.output_attentions=True) – Tuple of tf.Tensor (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length). Attention weights of the encoder, after the attention softmax, used to compute the weighted average in the self-attention heads.
TFMultipleChoiceModelOutput

class transformers.modeling_tf_outputs.TFMultipleChoiceModelOutput(loss: Optional[tensorflow.python.framework.ops.Tensor] = None, logits: tensorflow.python.framework.ops.Tensor = None, hidden_states: Optional[Tuple[tensorflow.python.framework.ops.Tensor]] = None, attentions: Optional[Tuple[tensorflow.python.framework.ops.Tensor]] = None)
Base class for outputs of multiple choice models.
Parameters
loss (tf.Tensor of shape (batch_size, ), optional, returned when labels is provided) – Classification loss.
logits (tf.Tensor of shape (batch_size, num_choices)) – Classification scores (before SoftMax). num_choices is the second dimension of the input tensors (see input_ids above).
hidden_states (tuple(tf.Tensor), optional, returned when output_hidden_states=True is passed or when config.output_hidden_states=True) – Tuple of tf.Tensor (one for the output of the embeddings + one for the output of each layer) of shape (batch_size, sequence_length, hidden_size). Hidden-states of the model at the output of each layer plus the initial embedding outputs.
attentions (tuple(tf.Tensor), optional, returned when output_attentions=True is passed or when config.output_attentions=True) – Tuple of tf.Tensor (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length). Attention weights after the attention softmax, used to compute the weighted average in the self-attention heads.
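The relationship between the input shape and the (batch_size, num_choices) logits can be sketched as follows. This is only an illustration of the reshaping idea under stated assumptions: toy_score is a hypothetical stand-in for the encoder plus classification head, and nested lists replace tensors.

```python
batch_size, num_choices, seq_len = 2, 3, 4

# Toy input_ids of shape (batch_size, num_choices, seq_len).
input_ids = [[[b * 100 + c * 10 + t for t in range(seq_len)]
              for c in range(num_choices)]
             for b in range(batch_size)]

# Flatten the first two axes: (batch_size * num_choices, seq_len).
flat_input_ids = [choice for example in input_ids for choice in example]


def toy_score(token_ids):
    """Hypothetical scorer returning one number per sequence."""
    return float(sum(token_ids))


# One score per flattened sequence.
flat_scores = [toy_score(ids) for ids in flat_input_ids]

# Reshape back so each row holds the scores of one example's choices.
logits = [flat_scores[b * num_choices:(b + 1) * num_choices]
          for b in range(batch_size)]

# Index of the highest-scoring choice for each example.
best_choice = [max(range(num_choices), key=row.__getitem__) for row in logits]
```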
TFTokenClassifierOutput

class transformers.modeling_tf_outputs.TFTokenClassifierOutput(loss: Optional[tensorflow.python.framework.ops.Tensor] = None, logits: tensorflow.python.framework.ops.Tensor = None, hidden_states: Optional[Tuple[tensorflow.python.framework.ops.Tensor]] = None, attentions: Optional[Tuple[tensorflow.python.framework.ops.Tensor]] = None)
Base class for outputs of token classification models.
Parameters
loss (tf.Tensor of shape (n,), optional, where n is the number of unmasked labels, returned when labels is provided) – Classification loss.
logits (tf.Tensor of shape (batch_size, sequence_length, config.num_labels)) – Classification scores (before SoftMax).
hidden_states (tuple(tf.Tensor), optional, returned when output_hidden_states=True is passed or when config.output_hidden_states=True) – Tuple of tf.Tensor (one for the output of the embeddings + one for the output of each layer) of shape (batch_size, sequence_length, hidden_size). Hidden-states of the model at the output of each layer plus the initial embedding outputs.
attentions (tuple(tf.Tensor), optional, returned when output_attentions=True is passed or when config.output_attentions=True) – Tuple of tf.Tensor (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length). Attention weights after the attention softmax, used to compute the weighted average in the self-attention heads.
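The following sketch illustrates why the loss has n entries, where n counts only the unmasked labels. It assumes that masked positions are marked with the label value -100; that value is a convention assumed here for the example, not something stated on this page, and all data is made up.

```python
# Assumption: positions labelled with this value are excluded from the loss.
IGNORE_INDEX = -100

# Toy logits of shape (batch_size=1, sequence_length=3, num_labels=2).
logits = [[[0.2, 1.0],
           [2.0, 0.1],
           [0.3, 0.4]]]
labels = [[1, 0, IGNORE_INDEX]]

# Per-token predicted label: argmax over the last axis.
predictions = [[max(range(len(token)), key=token.__getitem__)
                for token in sequence]
               for sequence in logits]

# Keep only (prediction, label) pairs at unmasked positions.
kept = [(p, l)
        for preds, labs in zip(predictions, labels)
        for p, l in zip(preds, labs)
        if l != IGNORE_INDEX]

n = len(kept)  # number of unmasked labels
accuracy = sum(p == l for p, l in kept) / n
```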
TFQuestionAnsweringModelOutput

class transformers.modeling_tf_outputs.TFQuestionAnsweringModelOutput(loss: Optional[tensorflow.python.framework.ops.Tensor] = None, start_logits: tensorflow.python.framework.ops.Tensor = None, end_logits: tensorflow.python.framework.ops.Tensor = None, hidden_states: Optional[Tuple[tensorflow.python.framework.ops.Tensor]] = None, attentions: Optional[Tuple[tensorflow.python.framework.ops.Tensor]] = None)
Base class for outputs of question answering models.
Parameters
loss (tf.Tensor of shape (batch_size, ), optional, returned when start_positions and end_positions are provided) – Total span extraction loss is the sum of a Cross-Entropy for the start and end positions.
start_logits (tf.Tensor of shape (batch_size, sequence_length)) – Span-start scores (before SoftMax).
end_logits (tf.Tensor of shape (batch_size, sequence_length)) – Span-end scores (before SoftMax).
hidden_states (tuple(tf.Tensor), optional, returned when output_hidden_states=True is passed or when config.output_hidden_states=True) – Tuple of tf.Tensor (one for the output of the embeddings + one for the output of each layer) of shape (batch_size, sequence_length, hidden_size). Hidden-states of the model at the output of each layer plus the initial embedding outputs.
attentions (tuple(tf.Tensor), optional, returned when output_attentions=True is passed or when config.output_attentions=True) – Tuple of tf.Tensor (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length). Attention weights after the attention softmax, used to compute the weighted average in the self-attention heads.
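One common way to use start_logits and end_logits is to pick the span whose summed start and end scores are highest, subject to the start not coming after the end. The sketch below shows that idea on toy values; best_span and its max_answer_len parameter are hypothetical names introduced for this example, not library functions.

```python
# Toy scores of shape (batch_size=1, sequence_length=4).
start_logits = [[0.1, 2.0, 0.3, 0.2]]
end_logits = [[0.0, 0.5, 3.0, 0.1]]


def best_span(starts, ends, max_answer_len=30):
    """Return (start, end) maximising starts[start] + ends[end], start <= end."""
    best, best_score = (0, 0), float("-inf")
    for s, start_score in enumerate(starts):
        # Only consider ends at or after the start, within the length limit.
        last = min(len(ends), s + max_answer_len)
        for e in range(s, last):
            score = start_score + ends[e]
            if score > best_score:
                best, best_score = (s, e), score
    return best


# One (start, end) token index pair per example in the batch.
spans = [best_span(s, e) for s, e in zip(start_logits, end_logits)]
```

Taking the two argmax positions independently can yield an end before the start, which is why the search above is constrained.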
TFSeq2SeqQuestionAnsweringModelOutput

class transformers.modeling_tf_outputs.TFSeq2SeqQuestionAnsweringModelOutput(loss: Optional[tensorflow.python.framework.ops.Tensor] = None, start_logits: tensorflow.python.framework.ops.Tensor = None, end_logits: tensorflow.python.framework.ops.Tensor = None, past_key_values: Optional[List[tensorflow.python.framework.ops.Tensor]] = None, decoder_hidden_states: Optional[Tuple[tensorflow.python.framework.ops.Tensor]] = None, decoder_attentions: Optional[Tuple[tensorflow.python.framework.ops.Tensor]] = None, encoder_last_hidden_state: Optional[tensorflow.python.framework.ops.Tensor] = None, encoder_hidden_states: Optional[Tuple[tensorflow.python.framework.ops.Tensor]] = None, encoder_attentions: Optional[Tuple[tensorflow.python.framework.ops.Tensor]] = None)
Base class for outputs of sequence-to-sequence question answering models.
Parameters
loss (tf.Tensor of shape (1,), optional, returned when labels is provided) – Total span extraction loss is the sum of a Cross-Entropy for the start and end positions.
start_logits (tf.Tensor of shape (batch_size, sequence_length)) – Span-start scores (before SoftMax).
end_logits (tf.Tensor of shape (batch_size, sequence_length)) – Span-end scores (before SoftMax).
past_key_values (List[tf.Tensor], optional, returned when use_cache=True is passed or when config.use_cache=True) – List of tf.Tensor of length config.n_layers, with each tensor of shape (2, batch_size, num_heads, sequence_length, embed_size_per_head). Contains pre-computed hidden-states (key and values in the attention blocks) of the decoder that can be used (see past_key_values input) to speed up sequential decoding.
decoder_hidden_states (tuple(tf.Tensor), optional, returned when output_hidden_states=True is passed or when config.output_hidden_states=True) – Tuple of tf.Tensor (one for the output of the embeddings + one for the output of each layer) of shape (batch_size, sequence_length, hidden_size). Hidden-states of the decoder at the output of each layer plus the initial embedding outputs.
decoder_attentions (tuple(tf.Tensor), optional, returned when output_attentions=True is passed or when config.output_attentions=True) – Tuple of tf.Tensor (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length). Attention weights of the decoder, after the attention softmax, used to compute the weighted average in the self-attention heads.
encoder_last_hidden_state (tf.Tensor of shape (batch_size, sequence_length, hidden_size), optional) – Sequence of hidden-states at the output of the last layer of the encoder of the model.
encoder_hidden_states (tuple(tf.Tensor), optional, returned when output_hidden_states=True is passed or when config.output_hidden_states=True) – Tuple of tf.Tensor (one for the output of the embeddings + one for the output of each layer) of shape (batch_size, sequence_length, hidden_size). Hidden-states of the encoder at the output of each layer plus the initial embedding outputs.
encoder_attentions (tuple(tf.Tensor), optional, returned when output_attentions=True is passed or when config.output_attentions=True) – Tuple of tf.Tensor (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length). Attention weights of the encoder, after the attention softmax, used to compute the weighted average in the self-attention heads.
FlaxBaseModelOutput

class transformers.modeling_flax_outputs.FlaxBaseModelOutput(last_hidden_state: jax._src.numpy.lax_numpy.ndarray = None, hidden_states: Optional[Tuple[jax._src.numpy.lax_numpy.ndarray]] = None, attentions: Optional[Tuple[jax._src.numpy.lax_numpy.ndarray]] = None)
Base class for model's outputs, with potential hidden states and attentions.
Parameters
last_hidden_state (jnp.ndarray of shape (batch_size, sequence_length, hidden_size)) – Sequence of hidden-states at the output of the last layer of the model.
hidden_states (tuple(jnp.ndarray), optional, returned when output_hidden_states=True is passed or when config.output_hidden_states=True) – Tuple of jnp.ndarray (one for the output of the embeddings + one for the output of each layer) of shape (batch_size, sequence_length, hidden_size). Hidden-states of the model at the output of each layer plus the initial embedding outputs.
attentions (tuple(jnp.ndarray), optional, returned when output_attentions=True is passed or when config.output_attentions=True) – Tuple of jnp.ndarray (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length). Attention weights after the attention softmax, used to compute the weighted average in the self-attention heads.
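To clarify what "attention weights after the attention softmax, used to compute the weighted average" means, here is a minimal single-head sketch with nested lists in place of jnp.ndarray. The raw scores and value vectors are invented for the example, and scaling, masking and multiple heads are deliberately left out.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a flat list of scores."""
    highest = max(scores)
    exps = [math.exp(s - highest) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


# Raw attention scores for one head: (sequence_length=2, sequence_length=2).
scores = [[1.0, 0.0],
          [0.0, 0.0]]

# Value vectors for the same head: (sequence_length=2, head_dim=2).
values = [[1.0, 0.0],
          [0.0, 1.0]]

# Attention weights: each row is a probability distribution over positions.
attention = [softmax(row) for row in scores]

# Weighted average of the value vectors, one output vector per query position.
head_dim = len(values[0])
context = [[sum(w * v[d] for w, v in zip(row, values)) for d in range(head_dim)]
           for row in attention]
```

Each entry of the attentions tuple holds weights of this kind for every batch item and head of one layer, which is where the (batch_size, num_heads, sequence_length, sequence_length) shape comes from.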
FlaxBaseModelOutputWithPast

class transformers.modeling_flax_outputs.FlaxBaseModelOutputWithPast(last_hidden_state: jax._src.numpy.lax_numpy.ndarray = None, past_key_values: Optional[Dict[str, jax._src.numpy.lax_numpy.ndarray]] = None, hidden_states: Optional[Tuple[jax._src.numpy.lax_numpy.ndarray]] = None, attentions: Optional[Tuple[jax._src.numpy.lax_numpy.ndarray]] = None)
Base class for model's outputs, with potential hidden states and attentions.
Parameters
last_hidden_state (jnp.ndarray of shape (batch_size, sequence_length, hidden_size)) – Sequence of hidden-states at the output of the last layer of the model.
past_key_values (Dict[str, jnp.ndarray]) – Dictionary of pre-computed hidden-states (key and values in the attention blocks) that can be used for fast autoregressive decoding. Pre-computed key and value hidden-states are of shape [batch_size, max_length].
hidden_states (tuple(jnp.ndarray), optional, returned when output_hidden_states=True is passed or when config.output_hidden_states=True) – Tuple of jnp.ndarray (one for the output of the embeddings + one for the output of each layer) of shape (batch_size, sequence_length, hidden_size). Hidden-states of the model at the output of each layer plus the initial embedding outputs.
attentions (tuple(jnp.ndarray), optional, returned when output_attentions=True is passed or when config.output_attentions=True) – Tuple of jnp.ndarray (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length). Attention weights after the attention softmax, used to compute the weighted average in the self-attention heads.
FlaxBaseModelOutputWithPooling

class transformers.modeling_flax_outputs.FlaxBaseModelOutputWithPooling(last_hidden_state: jax._src.numpy.lax_numpy.ndarray = None, pooler_output: jax._src.numpy.lax_numpy.ndarray = None, hidden_states: Optional[Tuple[jax._src.numpy.lax_numpy.ndarray]] = None, attentions: Optional[Tuple[jax._src.numpy.lax_numpy.ndarray]] = None)
Base class for model's outputs that also contains a pooling of the last hidden states.
Parameters
last_hidden_state (jnp.ndarray of shape (batch_size, sequence_length, hidden_size)) – Sequence of hidden-states at the output of the last layer of the model.
pooler_output (jnp.ndarray of shape (batch_size, hidden_size)) – Last layer hidden-state of the first token of the sequence (classification token) further processed by a Linear layer and a Tanh activation function. The Linear layer weights are trained from the next sentence prediction (classification) objective during pretraining.
hidden_states (tuple(jnp.ndarray), optional, returned when output_hidden_states=True is passed or when config.output_hidden_states=True) – Tuple of jnp.ndarray (one for the output of the embeddings + one for the output of each layer) of shape (batch_size, sequence_length, hidden_size). Hidden-states of the model at the output of each layer plus the initial embedding outputs.
attentions (tuple(jnp.ndarray), optional, returned when output_attentions=True is passed or when config.output_attentions=True) – Tuple of jnp.ndarray (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length). Attention weights after the attention softmax, used to compute the weighted average in the self-attention heads.
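The pooler_output description (first-token hidden-state, then a Linear layer, then Tanh) can be sketched in a few lines. The weight and bias below are hypothetical placeholders (an identity matrix and zeros) chosen so the result is easy to check; a trained model would have learned values, and nested lists stand in for jnp.ndarray.

```python
import math

hidden_size = 3

# Toy last_hidden_state of shape (batch_size=1, sequence_length=2, hidden_size=3).
last_hidden_state = [[[0.5, -1.0, 2.0],
                      [0.1, 0.2, 0.3]]]

# Hypothetical Linear layer parameters: identity weight and zero bias.
weight = [[1.0 if i == j else 0.0 for j in range(hidden_size)]
          for i in range(hidden_size)]
bias = [0.0] * hidden_size


def pool(sequence):
    """Apply Linear + Tanh to the hidden-state of the first token."""
    first_token = sequence[0]
    return [math.tanh(sum(w * x for w, x in zip(row, first_token)) + b)
            for row, b in zip(weight, bias)]


# Shape (batch_size, hidden_size): one pooled vector per sequence.
pooler_output = [pool(sequence) for sequence in last_hidden_state]
```

Because of the Tanh, every entry of pooler_output lies strictly between -1 and 1, whatever the scale of the hidden-states.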
FlaxBaseModelOutputWithPastAndCrossAttentions

class transformers.modeling_flax_outputs.FlaxBaseModelOutputWithPastAndCrossAttentions(last_hidden_state: jax._src.numpy.lax_numpy.ndarray = None, past_key_values: Optional[Tuple[Tuple[jax._src.numpy.lax_numpy.ndarray]]] = None, hidden_states: Optional[Tuple[jax._src.numpy.lax_numpy.ndarray]] = None, attentions: Optional[Tuple[jax._src.numpy.lax_numpy.ndarray]] = None, cross_attentions: Optional[Tuple[jax._src.numpy.lax_numpy.ndarray]] = None)
Base class for model's outputs that may also contain a past key/values (to speed up sequential decoding).
Parameters
last_hidden_state (jnp.ndarray of shape (batch_size, sequence_length, hidden_size)) – Sequence of hidden-states at the output of the last layer of the model. If past_key_values is used, only the last hidden-state of the sequences, of shape (batch_size, 1, hidden_size), is output.
past_key_values (tuple(tuple(jnp.ndarray)), optional, returned when use_cache=True is passed or when config.use_cache=True) – Tuple of tuple(jnp.ndarray) of length config.n_layers, with each tuple having 2 tensors of shape (batch_size, num_heads, sequence_length, embed_size_per_head) and, optionally if config.is_encoder_decoder=True, 2 additional tensors of shape (batch_size, num_heads, encoder_sequence_length, embed_size_per_head). Contains pre-computed hidden-states (key and values in the self-attention blocks and, optionally if config.is_encoder_decoder=True, in the cross-attention blocks) that can be used (see past_key_values input) to speed up sequential decoding.
hidden_states (tuple(jnp.ndarray), optional, returned when output_hidden_states=True is passed or when config.output_hidden_states=True) – Tuple of jnp.ndarray (one for the output of the embeddings + one for the output of each layer) of shape (batch_size, sequence_length, hidden_size). Hidden-states of the model at the output of each layer plus the initial embedding outputs.
attentions (tuple(jnp.ndarray), optional, returned when output_attentions=True is passed or when config.output_attentions=True) – Tuple of jnp.ndarray (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length). Attention weights after the attention softmax, used to compute the weighted average in the self-attention heads.
cross_attentions (tuple(jnp.ndarray), optional, returned when output_attentions=True and config.add_cross_attention=True is passed or when config.output_attentions=True) – Tuple of jnp.ndarray (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length). Attention weights of the decoder's cross-attention layer, after the attention softmax, used to compute the weighted average in the cross-attention heads.
FlaxSeq2SeqModelOutput

class transformers.modeling_flax_outputs.FlaxSeq2SeqModelOutput(last_hidden_state: jax._src.numpy.lax_numpy.ndarray = None, past_key_values: Optional[Tuple[Tuple[jax._src.numpy.lax_numpy.ndarray]]] = None, decoder_hidden_states: Optional[Tuple[jax._src.numpy.lax_numpy.ndarray]] = None, decoder_attentions: Optional[Tuple[jax._src.numpy.lax_numpy.ndarray]] = None, cross_attentions: Optional[Tuple[jax._src.numpy.lax_numpy.ndarray]] = None, encoder_last_hidden_state: Optional[jax._src.numpy.lax_numpy.ndarray] = None, encoder_hidden_states: Optional[Tuple[jax._src.numpy.lax_numpy.ndarray]] = None, encoder_attentions: Optional[Tuple[jax._src.numpy.lax_numpy.ndarray]] = None)
Base class for model encoder's outputs that also contains pre-computed hidden states that can speed up sequential decoding.
Parameters
last_hidden_state (jnp.ndarray of shape (batch_size, sequence_length, hidden_size)) – Sequence of hidden-states at the output of the last layer of the decoder of the model. If past_key_values is used, only the last hidden-state of the sequences, of shape (batch_size, 1, hidden_size), is output.
past_key_values (tuple(tuple(jnp.ndarray)), optional, returned when use_cache=True is passed or when config.use_cache=True) – Tuple of tuple(jnp.ndarray) of length config.n_layers, with each tuple having 2 tensors of shape (batch_size, num_heads, sequence_length, embed_size_per_head) and 2 additional tensors of shape (batch_size, num_heads, encoder_sequence_length, embed_size_per_head). Contains pre-computed hidden-states (key and values in the self-attention blocks and in the cross-attention blocks) that can be used (see past_key_values input) to speed up sequential decoding.
decoder_hidden_states (tuple(jnp.ndarray), optional, returned when output_hidden_states=True is passed or when config.output_hidden_states=True) – Tuple of jnp.ndarray (one for the output of the embeddings + one for the output of each layer) of shape (batch_size, sequence_length, hidden_size). Hidden-states of the decoder at the output of each layer plus the initial embedding outputs.
decoder_attentions (tuple(jnp.ndarray), optional, returned when output_attentions=True is passed or when config.output_attentions=True) – Tuple of jnp.ndarray (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length). Attention weights of the decoder, after the attention softmax, used to compute the weighted average in the self-attention heads.
cross_attentions (tuple(jnp.ndarray), optional, returned when output_attentions=True is passed or when config.output_attentions=True) – Tuple of jnp.ndarray (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length). Attention weights of the decoder's cross-attention layer, after the attention softmax, used to compute the weighted average in the cross-attention heads.
encoder_last_hidden_state (jnp.ndarray of shape (batch_size, sequence_length, hidden_size), optional) – Sequence of hidden-states at the output of the last layer of the encoder of the model.
encoder_hidden_states (tuple(jnp.ndarray), optional, returned when output_hidden_states=True is passed or when config.output_hidden_states=True) – Tuple of jnp.ndarray (one for the output of the embeddings + one for the output of each layer) of shape (batch_size, sequence_length, hidden_size). Hidden-states of the encoder at the output of each layer plus the initial embedding outputs.
encoder_attentions (tuple(jnp.ndarray), optional, returned when output_attentions=True is passed or when config.output_attentions=True) – Tuple of jnp.ndarray (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length). Attention weights of the encoder, after the attention softmax, used to compute the weighted average in the self-attention heads.
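The nested past_key_values structure described above (per layer, 2 self-attention tensors plus 2 cross-attention tensors) can be pictured by tracking shapes alone. In the sketch below each tensor is represented only by its shape tuple; the function name layer_cache_shapes and the sizes are assumptions for the example.

```python
def layer_cache_shapes(batch_size, num_heads, decoder_len, encoder_len, head_dim):
    """Shapes of the 4 cached tensors of one decoder layer."""
    self_attn = (batch_size, num_heads, decoder_len, head_dim)
    cross_attn = (batch_size, num_heads, encoder_len, head_dim)
    # Order used here: self-attention key, self-attention value,
    # cross-attention key, cross-attention value.
    return (self_attn, self_attn, cross_attn, cross_attn)


n_layers = 2

# Cache shapes after decoding 1 token and after decoding 2 tokens,
# for an encoder input of 7 tokens.
after_step_1 = tuple(layer_cache_shapes(1, 4, 1, 7, 16) for _ in range(n_layers))
after_step_2 = tuple(layer_cache_shapes(1, 4, 2, 7, 16) for _ in range(n_layers))

# The self-attention entries grow with the decoded length ...
self_attn_grew = after_step_2[0][0][2] == after_step_1[0][0][2] + 1
# ... while the cross-attention entries depend only on the encoder length.
cross_attn_fixed = after_step_2[0][2] == after_step_1[0][2]
```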
FlaxCausalLMOutputWithCrossAttentionsÂ¶

class
transformers.modeling_flax_outputs.
FlaxCausalLMOutputWithCrossAttentions
(logits: jax._src.numpy.lax_numpy.ndarray = None, past_key_values: Optional[Tuple[Tuple[jax._src.numpy.lax_numpy.ndarray]]] = None, hidden_states: Optional[Tuple[jax._src.numpy.lax_numpy.ndarray]] = None, attentions: Optional[Tuple[jax._src.numpy.lax_numpy.ndarray]] = None, cross_attentions: Optional[Tuple[jax._src.numpy.lax_numpy.ndarray]] = None)[source]Â¶ Base class for causal language model (or autoregressive) outputs.
Parameters
logits (jnp.ndarray of shape (batch_size, sequence_length, config.vocab_size)) – Prediction scores of the language modeling head (scores for each vocabulary token before SoftMax).
hidden_states (tuple(jnp.ndarray), optional, returned when output_hidden_states=True is passed or when config.output_hidden_states=True) – Tuple of jnp.ndarray (one for the output of the embeddings + one for the output of each layer) of shape (batch_size, sequence_length, hidden_size).
Hidden-states of the model at the output of each layer plus the initial embedding outputs.
attentions (tuple(jnp.ndarray), optional, returned when output_attentions=True is passed or when config.output_attentions=True) – Tuple of jnp.ndarray (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length).
Attention weights after the attention softmax, used to compute the weighted average in the self-attention heads.
cross_attentions (tuple(jnp.ndarray), optional, returned when output_attentions=True is passed or when config.output_attentions=True) – Tuple of jnp.ndarray (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length).
Cross-attention weights after the attention softmax, used to compute the weighted average in the cross-attention heads.
past_key_values (tuple(tuple(jnp.ndarray)), optional, returned when use_cache=True is passed or when config.use_cache=True) – Tuple of jnp.ndarray tuples of length config.n_layers, with each tuple containing the cached key and value states of the self-attention layers and, if the model is used in an encoder-decoder setting, of the cross-attention layers. Only relevant if config.is_decoder = True.
Contains pre-computed hidden-states (key and values in the attention blocks) that can be used (see past_key_values input) to speed up sequential decoding.
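As an illustration of how logits of shape (batch_size, sequence_length, config.vocab_size) are typically consumed, the sketch below picks the highest-scoring vocabulary id at the last position of each sequence (greedy decoding). It is a minimal sketch, not the library's generation code: plain nested Python lists stand in for jnp.ndarray so that it runs without JAX installed, and greedy_next_token and toy_logits are hypothetical names.

```python
def greedy_next_token(logits):
    """Pick the highest-scoring vocabulary id at the last position of each sequence.

    `logits` is a nested list shaped (batch_size, sequence_length, vocab_size);
    it stands in for the `logits` field of a causal LM output (assumption).
    """
    next_ids = []
    for sequence in logits:
        last_step = sequence[-1]  # scores for the token that would follow the sequence
        next_ids.append(max(range(len(last_step)), key=last_step.__getitem__))
    return next_ids


# Toy logits with batch_size=2, sequence_length=2, vocab_size=3.
toy_logits = [
    [[0.1, 0.2, 0.3], [2.0, 0.5, -1.0]],
    [[0.0, 0.0, 0.0], [-0.3, 0.1, 1.7]],
]
next_ids = greedy_next_token(toy_logits)
print(next_ids)  # [0, 2]
```

With real arrays, the same selection would usually be written as an argmax over the last axis of the final sequence position.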
FlaxMaskedLMOutput

class transformers.modeling_flax_outputs.FlaxMaskedLMOutput(logits: jax._src.numpy.lax_numpy.ndarray = None, hidden_states: Optional[Tuple[jax._src.numpy.lax_numpy.ndarray]] = None, attentions: Optional[Tuple[jax._src.numpy.lax_numpy.ndarray]] = None)
Base class for masked language model outputs.
Parameters
logits (jnp.ndarray of shape (batch_size, sequence_length, config.vocab_size)) – Prediction scores of the language modeling head (scores for each vocabulary token before SoftMax).
hidden_states (tuple(jnp.ndarray), optional, returned when output_hidden_states=True is passed or when config.output_hidden_states=True) – Tuple of jnp.ndarray (one for the output of the embeddings + one for the output of each layer) of shape (batch_size, sequence_length, hidden_size).
Hidden-states of the model at the output of each layer plus the initial embedding outputs.
attentions (tuple(jnp.ndarray), optional, returned when output_attentions=True is passed or when config.output_attentions=True) – Tuple of jnp.ndarray (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length).
Attention weights after the attention softmax, used to compute the weighted average in the self-attention heads.
FlaxSeq2SeqLMOutput

class transformers.modeling_flax_outputs.FlaxSeq2SeqLMOutput(logits: jax._src.numpy.lax_numpy.ndarray = None, past_key_values: Optional[Tuple[Tuple[jax._src.numpy.lax_numpy.ndarray]]] = None, decoder_hidden_states: Optional[Tuple[jax._src.numpy.lax_numpy.ndarray]] = None, decoder_attentions: Optional[Tuple[jax._src.numpy.lax_numpy.ndarray]] = None, cross_attentions: Optional[Tuple[jax._src.numpy.lax_numpy.ndarray]] = None, encoder_last_hidden_state: Optional[jax._src.numpy.lax_numpy.ndarray] = None, encoder_hidden_states: Optional[Tuple[jax._src.numpy.lax_numpy.ndarray]] = None, encoder_attentions: Optional[Tuple[jax._src.numpy.lax_numpy.ndarray]] = None)
Base class for sequence-to-sequence language model outputs.
Parameters
logits (jnp.ndarray of shape (batch_size, sequence_length, config.vocab_size)) – Prediction scores of the language modeling head (scores for each vocabulary token before SoftMax).
past_key_values (tuple(tuple(jnp.ndarray)), optional, returned when use_cache=True is passed or when config.use_cache=True) – Tuple of tuple(jnp.ndarray) of length config.n_layers, with each tuple having 2 tensors of shape (batch_size, num_heads, sequence_length, embed_size_per_head) and 2 additional tensors of shape (batch_size, num_heads, encoder_sequence_length, embed_size_per_head).
Contains pre-computed hidden-states (key and values in the self-attention blocks and in the cross-attention blocks) that can be used (see past_key_values input) to speed up sequential decoding.
decoder_hidden_states (tuple(jnp.ndarray), optional, returned when output_hidden_states=True is passed or when config.output_hidden_states=True) – Tuple of jnp.ndarray (one for the output of the embeddings + one for the output of each layer) of shape (batch_size, sequence_length, hidden_size).
Hidden-states of the decoder at the output of each layer plus the initial embedding outputs.
decoder_attentions (tuple(jnp.ndarray), optional, returned when output_attentions=True is passed or when config.output_attentions=True) – Tuple of jnp.ndarray (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length).
Attention weights of the decoder, after the attention softmax, used to compute the weighted average in the self-attention heads.
cross_attentions (tuple(jnp.ndarray), optional, returned when output_attentions=True is passed or when config.output_attentions=True) – Tuple of jnp.ndarray (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length).
Attention weights of the decoder's cross-attention layer, after the attention softmax, used to compute the weighted average in the cross-attention heads.
encoder_last_hidden_state (jnp.ndarray of shape (batch_size, sequence_length, hidden_size), optional) – Sequence of hidden-states at the output of the last layer of the encoder of the model.
encoder_hidden_states (tuple(jnp.ndarray), optional, returned when output_hidden_states=True is passed or when config.output_hidden_states=True) – Tuple of jnp.ndarray (one for the output of the embeddings + one for the output of each layer) of shape (batch_size, sequence_length, hidden_size).
Hidden-states of the encoder at the output of each layer plus the initial embedding outputs.
encoder_attentions (tuple(jnp.ndarray), optional, returned when output_attentions=True is passed or when config.output_attentions=True) – Tuple of jnp.ndarray (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length).
Attention weights of the encoder, after the attention softmax, used to compute the weighted average in the self-attention heads.
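The nested layout of past_key_values described above can be easier to follow as a table of shapes. The sketch below builds the expected shape tuples only, with no real tensors. The helper name expected_cache_shapes is hypothetical, and the ordering of the four entries within a layer (self-attention pair first, then cross-attention pair) is an assumption, since the description only states that there are two tensors of each shape.

```python
def expected_cache_shapes(n_layers, batch_size, num_heads, sequence_length,
                          encoder_sequence_length, embed_size_per_head):
    """Return the shapes described for `past_key_values`, one 4-tuple per layer.

    Assumed order within each layer: two self-attention tensors,
    then two cross-attention tensors.
    """
    self_attn = (batch_size, num_heads, sequence_length, embed_size_per_head)
    cross_attn = (batch_size, num_heads, encoder_sequence_length, embed_size_per_head)
    per_layer = (self_attn, self_attn, cross_attn, cross_attn)
    return tuple(per_layer for _ in range(n_layers))


shapes = expected_cache_shapes(
    n_layers=2, batch_size=1, num_heads=4,
    sequence_length=5, encoder_sequence_length=7, embed_size_per_head=8,
)
print(len(shapes), len(shapes[0]))  # 2 4
```

Only the self-attention shapes depend on the decoder sequence length; the cross-attention shapes depend on the encoder sequence length.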
FlaxNextSentencePredictorOutput

class transformers.modeling_flax_outputs.FlaxNextSentencePredictorOutput(logits: jax._src.numpy.lax_numpy.ndarray = None, hidden_states: Optional[Tuple[jax._src.numpy.lax_numpy.ndarray]] = None, attentions: Optional[Tuple[jax._src.numpy.lax_numpy.ndarray]] = None)
Base class for outputs of models predicting if two sentences are consecutive or not.
Parameters
logits (jnp.ndarray of shape (batch_size, 2)) – Prediction scores of the next sequence prediction (classification) head (scores of True/False continuation before SoftMax).
hidden_states (tuple(jnp.ndarray), optional, returned when output_hidden_states=True is passed or when config.output_hidden_states=True) – Tuple of jnp.ndarray (one for the output of the embeddings + one for the output of each layer) of shape (batch_size, sequence_length, hidden_size).
Hidden-states of the model at the output of each layer plus the initial embedding outputs.
attentions (tuple(jnp.ndarray), optional, returned when output_attentions=True is passed or when config.output_attentions=True) – Tuple of jnp.ndarray (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length).
Attention weights after the attention softmax, used to compute the weighted average in the self-attention heads.
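Because these logits are scores before SoftMax, they usually need to be normalized before they can be read as probabilities. The following is a minimal sketch of that step for logits of shape (batch_size, 2). Plain Python lists stand in for jnp.ndarray, softmax and toy_logits are illustrative names, and which of the two columns corresponds to "is a continuation" depends on the model, so the sketch does not assume either.

```python
import math


def softmax(scores):
    """Normalize a list of raw scores into probabilities that sum to 1."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]  # subtract the max for numerical stability
    total = sum(exps)
    return [e / total for e in exps]


# Toy logits with batch_size=2 and two classes per row.
toy_logits = [[2.0, 0.0], [-1.0, 1.0]]
probs = [softmax(row) for row in toy_logits]

for row in probs:
    print([round(p, 3) for p in row])
```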
FlaxSequenceClassifierOutput

class transformers.modeling_flax_outputs.FlaxSequenceClassifierOutput(logits: jax._src.numpy.lax_numpy.ndarray = None, hidden_states: Optional[Tuple[jax._src.numpy.lax_numpy.ndarray]] = None, attentions: Optional[Tuple[jax._src.numpy.lax_numpy.ndarray]] = None)
Base class for outputs of sentence classification models.
Parameters
logits (jnp.ndarray of shape (batch_size, config.num_labels)) – Classification (or regression if config.num_labels==1) scores (before SoftMax).
hidden_states (tuple(jnp.ndarray), optional, returned when output_hidden_states=True is passed or when config.output_hidden_states=True) – Tuple of jnp.ndarray (one for the output of the embeddings + one for the output of each layer) of shape (batch_size, sequence_length, hidden_size).
Hidden-states of the model at the output of each layer plus the initial embedding outputs.
attentions (tuple(jnp.ndarray), optional, returned when output_attentions=True is passed or when config.output_attentions=True) – Tuple of jnp.ndarray (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length).
Attention weights after the attention softmax, used to compute the weighted average in the self-attention heads.
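The distinction between classification and regression (config.num_labels==1) changes how the logits should be read. The sketch below shows one way to handle both cases. It is an illustration under stated assumptions rather than library code: interpret_logits is a hypothetical helper, and nested Python lists stand in for a jnp.ndarray of shape (batch_size, config.num_labels).

```python
def interpret_logits(logits):
    """Return one prediction per batch row.

    A row with a single value is treated as a regression score and returned as-is;
    otherwise the index of the largest score is returned as the predicted label.
    """
    predictions = []
    for row in logits:
        if len(row) == 1:
            predictions.append(row[0])  # regression: the score is the prediction
        else:
            predictions.append(max(range(len(row)), key=row.__getitem__))
    return predictions


print(interpret_logits([[0.2, 1.5, -0.3], [3.0, 0.1, 0.4]]))  # [1, 0]
print(interpret_logits([[0.73]]))  # [0.73]
```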
FlaxSeq2SeqSequenceClassifierOutput

class transformers.modeling_flax_outputs.FlaxSeq2SeqSequenceClassifierOutput(logits: jax._src.numpy.lax_numpy.ndarray = None, past_key_values: Optional[Tuple[Tuple[jax._src.numpy.lax_numpy.ndarray]]] = None, decoder_hidden_states: Optional[Tuple[jax._src.numpy.lax_numpy.ndarray]] = None, decoder_attentions: Optional[Tuple[jax._src.numpy.lax_numpy.ndarray]] = None, cross_attentions: Optional[Tuple[jax._src.numpy.lax_numpy.ndarray]] = None, encoder_last_hidden_state: Optional[jax._src.numpy.lax_numpy.ndarray] = None, encoder_hidden_states: Optional[Tuple[jax._src.numpy.lax_numpy.ndarray]] = None, encoder_attentions: Optional[Tuple[jax._src.numpy.lax_numpy.ndarray]] = None)
Base class for outputs of sequence-to-sequence sentence classification models.
Parameters
logits (jnp.ndarray of shape (batch_size, config.num_labels)) – Classification (or regression if config.num_labels==1) scores (before SoftMax).
past_key_values (tuple(tuple(jnp.ndarray)), optional, returned when use_cache=True is passed or when config.use_cache=True) – Tuple of tuple(jnp.ndarray) of length config.n_layers, with each tuple having 2 tensors of shape (batch_size, num_heads, sequence_length, embed_size_per_head) and 2 additional tensors of shape (batch_size, num_heads, encoder_sequence_length, embed_size_per_head).
Contains pre-computed hidden-states (key and values in the self-attention blocks and in the cross-attention blocks) that can be used (see past_key_values input) to speed up sequential decoding.
decoder_hidden_states (tuple(jnp.ndarray), optional, returned when output_hidden_states=True is passed or when config.output_hidden_states=True) – Tuple of jnp.ndarray (one for the output of the embeddings + one for the output of each layer) of shape (batch_size, sequence_length, hidden_size).
Hidden-states of the decoder at the output of each layer plus the initial embedding outputs.
decoder_attentions (tuple(jnp.ndarray), optional, returned when output_attentions=True is passed or when config.output_attentions=True) – Tuple of jnp.ndarray (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length).
Attention weights of the decoder, after the attention softmax, used to compute the weighted average in the self-attention heads.
cross_attentions (tuple(jnp.ndarray), optional, returned when output_attentions=True is passed or when config.output_attentions=True) – Tuple of jnp.ndarray (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length).
Attention weights of the decoder's cross-attention layer, after the attention softmax, used to compute the weighted average in the cross-attention heads.
encoder_last_hidden_state (jnp.ndarray of shape (batch_size, sequence_length, hidden_size), optional) – Sequence of hidden-states at the output of the last layer of the encoder of the model.
encoder_hidden_states (tuple(jnp.ndarray), optional, returned when output_hidden_states=True is passed or when config.output_hidden_states=True) – Tuple of jnp.ndarray (one for the output of the embeddings + one for the output of each layer) of shape (batch_size, sequence_length, hidden_size).
Hidden-states of the encoder at the output of each layer plus the initial embedding outputs.
encoder_attentions (tuple(jnp.ndarray), optional, returned when output_attentions=True is passed or when config.output_attentions=True) – Tuple of jnp.ndarray (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length).
Attention weights of the encoder, after the attention softmax, used to compute the weighted average in the self-attention heads.
FlaxMultipleChoiceModelOutput

class transformers.modeling_flax_outputs.FlaxMultipleChoiceModelOutput(logits: jax._src.numpy.lax_numpy.ndarray = None, hidden_states: Optional[Tuple[jax._src.numpy.lax_numpy.ndarray]] = None, attentions: Optional[Tuple[jax._src.numpy.lax_numpy.ndarray]] = None)
Base class for outputs of multiple choice models.
Parameters
logits (jnp.ndarray of shape (batch_size, num_choices)) – num_choices is the second dimension of the input tensors (see input_ids above).
Classification scores (before SoftMax).
hidden_states (tuple(jnp.ndarray), optional, returned when output_hidden_states=True is passed or when config.output_hidden_states=True) – Tuple of jnp.ndarray (one for the output of the embeddings + one for the output of each layer) of shape (batch_size, sequence_length, hidden_size).
Hidden-states of the model at the output of each layer plus the initial embedding outputs.
attentions (tuple(jnp.ndarray), optional, returned when output_attentions=True is passed or when config.output_attentions=True) – Tuple of jnp.ndarray (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length).
Attention weights after the attention softmax, used to compute the weighted average in the self-attention heads.
FlaxTokenClassifierOutput

class transformers.modeling_flax_outputs.FlaxTokenClassifierOutput(logits: jax._src.numpy.lax_numpy.ndarray = None, hidden_states: Optional[Tuple[jax._src.numpy.lax_numpy.ndarray]] = None, attentions: Optional[Tuple[jax._src.numpy.lax_numpy.ndarray]] = None)
Base class for outputs of token classification models.
Parameters
logits (jnp.ndarray of shape (batch_size, sequence_length, config.num_labels)) – Classification scores (before SoftMax).
hidden_states (tuple(jnp.ndarray), optional, returned when output_hidden_states=True is passed or when config.output_hidden_states=True) – Tuple of jnp.ndarray (one for the output of the embeddings + one for the output of each layer) of shape (batch_size, sequence_length, hidden_size).
Hidden-states of the model at the output of each layer plus the initial embedding outputs.
attentions (tuple(jnp.ndarray), optional, returned when output_attentions=True is passed or when config.output_attentions=True) – Tuple of jnp.ndarray (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length).
Attention weights after the attention softmax, used to compute the weighted average in the self-attention heads.
FlaxQuestionAnsweringModelOutput

class transformers.modeling_flax_outputs.FlaxQuestionAnsweringModelOutput(start_logits: jax._src.numpy.lax_numpy.ndarray = None, end_logits: jax._src.numpy.lax_numpy.ndarray = None, hidden_states: Optional[Tuple[jax._src.numpy.lax_numpy.ndarray]] = None, attentions: Optional[Tuple[jax._src.numpy.lax_numpy.ndarray]] = None)
Base class for outputs of question answering models.
Parameters
start_logits (jnp.ndarray of shape (batch_size, sequence_length)) – Span-start scores (before SoftMax).
end_logits (jnp.ndarray of shape (batch_size, sequence_length)) – Span-end scores (before SoftMax).
hidden_states (tuple(jnp.ndarray), optional, returned when output_hidden_states=True is passed or when config.output_hidden_states=True) – Tuple of jnp.ndarray (one for the output of the embeddings + one for the output of each layer) of shape (batch_size, sequence_length, hidden_size).
Hidden-states of the model at the output of each layer plus the initial embedding outputs.
attentions (tuple(jnp.ndarray), optional, returned when output_attentions=True is passed or when config.output_attentions=True) – Tuple of jnp.ndarray (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length).
Attention weights after the attention softmax, used to compute the weighted average in the self-attention heads.
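Span-start and span-end scores are normally combined to pick an answer span, and taking the two argmaxes independently can yield an end that precedes the start. The sketch below shows one common way to select a valid span for a single example. It is a hedged illustration, not the library's post-processing: best_span and max_answer_len are hypothetical names, and plain lists stand in for one row of start_logits and end_logits.

```python
def best_span(start_logits, end_logits, max_answer_len=15):
    """Return (start, end, score) maximizing start_logits[start] + end_logits[end]
    subject to start <= end < start + max_answer_len."""
    best = None
    for start, start_score in enumerate(start_logits):
        last_end = min(start + max_answer_len, len(end_logits))
        for end in range(start, last_end):
            score = start_score + end_logits[end]
            if best is None or score > best[2]:
                best = (start, end, score)
    return best


# One toy example with sequence_length=4.
start_scores = [0.1, 3.0, 0.2, -1.0]
end_scores = [2.5, 0.0, 2.0, 0.3]
span = best_span(start_scores, end_scores)
print(span)  # (1, 2, 5.0)
```

In this toy input the highest end score sits at position 0, before the best start at position 1, so the constraint changes the result compared with two independent argmaxes.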
FlaxSeq2SeqQuestionAnsweringModelOutput

class transformers.modeling_flax_outputs.FlaxSeq2SeqQuestionAnsweringModelOutput(start_logits: jax._src.numpy.lax_numpy.ndarray = None, end_logits: jax._src.numpy.lax_numpy.ndarray = None, past_key_values: Optional[Tuple[Tuple[jax._src.numpy.lax_numpy.ndarray]]] = None, decoder_hidden_states: Optional[Tuple[jax._src.numpy.lax_numpy.ndarray]] = None, decoder_attentions: Optional[Tuple[jax._src.numpy.lax_numpy.ndarray]] = None, cross_attentions: Optional[Tuple[jax._src.numpy.lax_numpy.ndarray]] = None, encoder_last_hidden_state: Optional[jax._src.numpy.lax_numpy.ndarray] = None, encoder_hidden_states: Optional[Tuple[jax._src.numpy.lax_numpy.ndarray]] = None, encoder_attentions: Optional[Tuple[jax._src.numpy.lax_numpy.ndarray]] = None)
Base class for outputs of sequence-to-sequence question answering models.
Parameters
start_logits (jnp.ndarray of shape (batch_size, sequence_length)) – Span-start scores (before SoftMax).
end_logits (jnp.ndarray of shape (batch_size, sequence_length)) – Span-end scores (before SoftMax).
past_key_values (tuple(tuple(jnp.ndarray)), optional, returned when use_cache=True is passed or when config.use_cache=True) – Tuple of tuple(jnp.ndarray) of length config.n_layers, with each tuple having 2 tensors of shape (batch_size, num_heads, sequence_length, embed_size_per_head) and 2 additional tensors of shape (batch_size, num_heads, encoder_sequence_length, embed_size_per_head).
Contains pre-computed hidden-states (key and values in the self-attention blocks and in the cross-attention blocks) that can be used (see past_key_values input) to speed up sequential decoding.
decoder_hidden_states (tuple(jnp.ndarray), optional, returned when output_hidden_states=True is passed or when config.output_hidden_states=True) – Tuple of jnp.ndarray (one for the output of the embeddings + one for the output of each layer) of shape (batch_size, sequence_length, hidden_size).
Hidden-states of the decoder at the output of each layer plus the initial embedding outputs.
decoder_attentions (tuple(jnp.ndarray), optional, returned when output_attentions=True is passed or when config.output_attentions=True) – Tuple of jnp.ndarray (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length).
Attention weights of the decoder, after the attention softmax, used to compute the weighted average in the self-attention heads.
cross_attentions (tuple(jnp.ndarray), optional, returned when output_attentions=True is passed or when config.output_attentions=True) – Tuple of jnp.ndarray (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length).
Attention weights of the decoder's cross-attention layer, after the attention softmax, used to compute the weighted average in the cross-attention heads.
encoder_last_hidden_state (jnp.ndarray of shape (batch_size, sequence_length, hidden_size), optional) – Sequence of hidden-states at the output of the last layer of the encoder of the model.
encoder_hidden_states (tuple(jnp.ndarray), optional, returned when output_hidden_states=True is passed or when config.output_hidden_states=True) – Tuple of jnp.ndarray (one for the output of the embeddings + one for the output of each layer) of shape (batch_size, sequence_length, hidden_size).
Hidden-states of the encoder at the output of each layer plus the initial embedding outputs.
encoder_attentions (tuple(jnp.ndarray), optional, returned when output_attentions=True is passed or when config.output_attentions=True) – Tuple of jnp.ndarray (one for each layer) of shape (batch_size, num_heads, sequence_length, sequence_length).
Attention weights of the encoder, after the attention softmax, used to compute the weighted average in the self-attention heads.