cerebras.modelzoo.data_preparation.nlp.hdf5_preprocessing.utils.create_features_auto_lm
- cerebras.modelzoo.data_preparation.nlp.hdf5_preprocessing.utils.create_features_auto_lm(token_ids, max_sequence_length, short_seq_prob=0, inverted_mask=False, pad_id=0, min_len=10, input_ids_dtype='int32', input_mask_dtype='int32', labels_dtype='int32', rng=None)
- Given a list of token_ids, generate input sequence and labels.
- Parameters
- token_ids (sequence) – List of token ids from which to create the features, labels and input mask. 
- max_sequence_length (int) – Maximum sequence length for data writes. 
- short_seq_prob (float) – Probability of generating short sequences from data. Defaults to 0. 
- inverted_mask (bool) – Whether to invert the input mask for runtime execution. Defaults to False. 
- min_len (int) – Minimum length of token_ids to be considered a valid sequence. Defaults to 10. 
- pad_id (int) – Id for pad token. Defaults to 0. 
- input_ids_dtype (str) – Dtype as string for input ids. Defaults to int32. 
- input_mask_dtype (str) – Dtype as string for input mask. Defaults to int32. 
- labels_dtype (str) – Dtype as string for labels. Defaults to int32. 
- rng (random.Random obj) – Instance of random.Random with its state already set (for example, seeded). Defaults to None. 
 
- Returns
- Tuple containing features and labels
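
The parameters above can be illustrated with a minimal, standalone sketch of how autoregressive language-model features are commonly built: shift the tokens by one position to obtain labels, then pad and mask up to max_sequence_length. This is not the modelzoo implementation. The function name `sketch_auto_lm_features`, the dictionary keys, the `None` return for short inputs, the truncation step, and the short-sequence sampling range are all assumptions made for illustration, and the dtype parameters are omitted because the sketch uses plain Python lists.

```python
import random


def sketch_auto_lm_features(token_ids, max_sequence_length, short_seq_prob=0.0,
                            inverted_mask=False, pad_id=0, min_len=10, rng=None):
    """Hypothetical stand-in for illustration; NOT the library's code."""
    rng = rng or random.Random(0)
    token_ids = list(token_ids)

    # Assumption: inputs shorter than min_len are rejected.
    if len(token_ids) < min_len:
        return None

    # Assumption: with probability short_seq_prob, keep only a random prefix.
    if rng.random() < short_seq_prob:
        token_ids = token_ids[: rng.randint(2, max_sequence_length - 1)]

    # Assumption: keep at most max_sequence_length + 1 tokens so that the
    # shifted inputs and labels both fit in max_sequence_length positions.
    token_ids = token_ids[: max_sequence_length + 1]

    # Next-token prediction: the label at position i is the token at i + 1.
    input_ids = token_ids[:-1]
    labels = token_ids[1:]

    # Pad to the fixed length; the mask is 1 on real tokens and 0 on padding.
    num_pad = max_sequence_length - len(input_ids)
    input_mask = [1] * len(input_ids) + [0] * num_pad
    input_ids = input_ids + [pad_id] * num_pad
    labels = labels + [pad_id] * num_pad

    # Optionally flip the mask so that padding positions are marked with 1.
    if inverted_mask:
        input_mask = [1 - m for m in input_mask]

    features = {"input_ids": input_ids, "input_mask": input_mask}
    return features, labels


# Usage: 12 tokens padded out to a sequence length of 16.
result = sketch_auto_lm_features(list(range(1, 13)), max_sequence_length=16,
                                 rng=random.Random(42))
if result is not None:
    features, labels = result
    print(features["input_ids"])
    print(features["input_mask"])
    print(labels)
```

Passing an explicitly seeded `random.Random` instance, as the rng parameter suggests, keeps any short-sequence sampling reproducible across runs.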