!wget -q https://github.com/volfenstein1/quarto/raw/refs/heads/main/WOLFGANG_TRAINING.tex
with open('WOLFGANG_TRAINING.tex', 'r', encoding='utf-8') as f:
    text = f.read()
Wolfgang-GPT is trained on a data set consisting mainly of my mathematics research papers. I used arxiv-latex-cleaner to clean up the tex files a bit; this mostly means the removal of all commented text.
You can see the training set here: WOLFGANG_TRAINING.tex.
print("Length of WOLFGANG_TRAINING dataset: ", len(text))
Length of WOLFGANG_TRAINING dataset: 3750026
# We can look at the first 1200 characters of this dataset; it consists of plain LaTeX code and is easily readable on its own.
print(text[:1200])
The Steenrod problem for closed orientable manifolds was solved completely by Thom.
Following this approach, we solve the Steenrod problem for closed orientable orbifolds, proving that the rational homology groups of a closed orientable orbifold have a basis consisting of classes represented by suborbifolds whose normal bundles have fiberwise trivial isotropy action.
Polyfold theory, as developed by Hofer, Wysocki, and Zehnder, has yielded a well-defined Gromov--Witten invariant via the regularization of moduli spaces.
As an application, we demonstrate that the polyfold Gromov--Witten invariants, originally defined via branched integrals, may equivalently be defined as intersection numbers against a basis of representing suborbifolds.
\subsection{The {S}teenrod problem}
The Steenrod problem was first presented in \cite{eilenberg1949problems} and asked the following question:
\textit{Can any homology class of a finite polyhedron be represented as an image of the fundamental class of some manifold?}
In \cite{thom1954quelques},\footnote{The reader should be advised that the commonly available English translation of this paper introduces a few errors which ar
chars = sorted(list(set(text)))
vocab_size = len(chars)
print('Unique characters: ', ''.join(chars))
print('Number of unique characters: ', vocab_size)
Unique characters:
Number of unique characters: 99
A tokenizer splits the training text into disjoint chunks and embeds the chunks into a vector space \(\mathbb{R}^n\).
We use a very rudimentary tokenizer, with chunks given by the individual characters: \[ \text{The Steenrod problem for...} \to \text{|T|h|e| |S|t|e|e|n|r|o|d| |p|r|o|b|l|e|m| |f|o|r|...} \]
Each unique character is encoded as a basis element of \(\mathbb{R}^n\) via a one-hot encoding: \[ \{ \text{unique characters} \} \to \mathbb{R}^{\#\{\text{unique characters}\}} \] and the decoder is the inverse.
We could obtain more sophistication by tokenizing on syllables and chunks of LaTeX code; see, for example, the tokenizer used by the MathBERTa model.
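As a minimal sketch of the one-hot encoding described above, the following uses a hypothetical four-character vocabulary built from the word "hello" (the names `toy_chars`, `toy_stoi`, and `W` are assumptions for illustration, not part of the training code). It also checks that multiplying a one-hot vector by a matrix simply selects a row of that matrix, which is how an embedding table can be read.

```python
import torch
import torch.nn.functional as F

# Hypothetical toy vocabulary, for illustration only.
toy_chars = sorted(set("hello"))                      # ['e', 'h', 'l', 'o']
toy_stoi = {ch: i for i, ch in enumerate(toy_chars)}

# Integer encoding of the string, then one-hot vectors in R^4.
ids = torch.tensor([toy_stoi[ch] for ch in "hello"])
one_hot = F.one_hot(ids, num_classes=len(toy_chars)).float()  # shape (5, 4)

# Decoding is the inverse: read off the position of the single 1 in each row.
decoded = ''.join(toy_chars[i] for i in one_hot.argmax(dim=-1).tolist())

# Multiplying a one-hot row by a matrix selects the corresponding matrix row.
W = torch.randn(len(toy_chars), 3)
rows_by_matmul = one_hot @ W
rows_by_index = W[ids]

print(decoded)
print(torch.allclose(rows_by_matmul, rows_by_index))
```

In practice the one-hot vectors are never materialized; the integer index is used directly to look up the row.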
string_to_integer = { char:idx for idx,char in enumerate(chars) }
integer_to_string = { idx:char for idx,char in enumerate(chars) }
encode = lambda s: [string_to_integer[c] for c in s]            # string -> list of integers
decode = lambda l: ''.join([integer_to_string[i] for i in l])   # list of integers -> string
# Raw strings keep the backslash in \omega literal
print(encode(r"This sentence is a test. Here is some math $(M,\omega)$."))
print(decode(encode(r"This sentence is a test. Here is some math $(M,\omega)$.")))
[54, 74, 75, 85, 2, 85, 71, 80, 86, 71, 80, 69, 71, 2, 75, 85, 2, 67, 2, 86, 71, 85, 86, 16, 2, 42, 71, 84, 71, 2, 75, 85, 2, 85, 81, 79, 71, 2, 79, 67, 86, 74, 2, 6, 10, 47, 14, 62, 81, 79, 71, 73, 67, 11, 6, 16]
This sentence is a test. Here is some math $(M,\omega)$.
# Encode the entire dataset WOLFGANG_TRAINING.tex and store it as a PyTorch tensor
import torch
data = torch.tensor(encode(text), dtype=torch.long)
print(data.shape, data.dtype)
print(data)
torch.Size([3750026]) torch.int64
tensor([54, 74, 71, ..., 2, 67, 84])
# Split the data into train and validation sets
n = int(0.9*len(data))
train_data = data[:n]
val_data = data[n:]
Neural Network Model
Given a sequence of tokens, we can train a neural network model to predict the most likely next token: \[ \text{|s|y|m|p|l|e|c|t|i|} \to \text{|c|}. \]
The input consists of a block of at most block_size tokens, and the target is the next token.
block_size = 8
train_data[:block_size+1]
tensor([54, 74, 71, 2, 53, 86, 71, 71, 80])
x = train_data[:block_size]
y = train_data[1:block_size+1]
for t in range(block_size):
    context = x[:t+1]
    target = y[t]
    print("When input is", context, "the target is:", target)
When input is tensor([54]) the target is: tensor(74)
When input is tensor([54, 74]) the target is: tensor(71)
When input is tensor([54, 74, 71]) the target is: tensor(2)
When input is tensor([54, 74, 71, 2]) the target is: tensor(53)
When input is tensor([54, 74, 71, 2, 53]) the target is: tensor(86)
When input is tensor([54, 74, 71, 2, 53, 86]) the target is: tensor(71)
When input is tensor([54, 74, 71, 2, 53, 86, 71]) the target is: tensor(71)
When input is tensor([54, 74, 71, 2, 53, 86, 71, 71]) the target is: tensor(80)
torch.manual_seed(1337) # seed value is an assumption
batch_size = 4 # how many independent sequences will we process in parallel?
block_size = 8 # what is the maximum context length for predictions?

def get_batch(split):
    data = train_data if split == 'train' else val_data
    # Randomly select batch_size indices with 0 <= idx < len(data) - block_size
    idx_x = torch.randint(0, len(data) - block_size, (batch_size,))
    x = torch.stack([data[idx:idx+block_size] for idx in idx_x])
    y = torch.stack([data[idx+1:idx+block_size+1] for idx in idx_x])
    return x, y

xb, yb = get_batch('train')
print('inputs:')
print(xb.shape)
print(xb)
print('targets:')
print(yb.shape)
print(yb)

for b in range(batch_size): # batch dimension
    for t in range(block_size): # time dimension
        context = xb[b, :t+1]
        target = yb[b,t]
        print(f"when input is {context.tolist()} the target: {target}")
torch.Size([4, 8])
tensor([[87, 85, 65, 93, 67, 62, 75, 80],
[81, 80, 2, 81, 72, 2, 86, 74],
[ 2, 58, 6, 2, 68, 71, 2, 67],
[80, 73, 2, 79, 67, 82, 14, 1]])
torch.Size([4, 8])
tensor([[85, 65, 93, 67, 62, 75, 80, 2],
[80, 2, 81, 72, 2, 86, 74, 71],
[58, 6, 2, 68, 71, 2, 67, 2],
[73, 2, 79, 67, 82, 14, 1, 68]])
when input is [87] the target: 85
when input is [87, 85] the target: 65
when input is [87, 85, 65] the target: 93
when input is [87, 85, 65, 93] the target: 67
when input is [87, 85, 65, 93, 67] the target: 62
when input is [87, 85, 65, 93, 67, 62] the target: 75
when input is [87, 85, 65, 93, 67, 62, 75] the target: 80
when input is [87, 85, 65, 93, 67, 62, 75, 80] the target: 2
when input is [81] the target: 80
when input is [81, 80] the target: 2
when input is [81, 80, 2] the target: 81
when input is [81, 80, 2, 81] the target: 72
when input is [81, 80, 2, 81, 72] the target: 2
when input is [81, 80, 2, 81, 72, 2] the target: 86
when input is [81, 80, 2, 81, 72, 2, 86] the target: 74
when input is [81, 80, 2, 81, 72, 2, 86, 74] the target: 71
when input is [2] the target: 58
when input is [2, 58] the target: 6
when input is [2, 58, 6] the target: 2
when input is [2, 58, 6, 2] the target: 68
when input is [2, 58, 6, 2, 68] the target: 71
when input is [2, 58, 6, 2, 68, 71] the target: 2
when input is [2, 58, 6, 2, 68, 71, 2] the target: 67
when input is [2, 58, 6, 2, 68, 71, 2, 67] the target: 2
when input is [80] the target: 73
when input is [80, 73] the target: 2
when input is [80, 73, 2] the target: 79
when input is [80, 73, 2, 79] the target: 67
when input is [80, 73, 2, 79, 67] the target: 82
when input is [80, 73, 2, 79, 67, 82] the target: 14
when input is [80, 73, 2, 79, 67, 82, 14] the target: 1
when input is [80, 73, 2, 79, 67, 82, 14, 1] the target: 68
# To allow parallel processing, data is passed to the transformer in batches of inputs.
# Example of a single batch input:
print(xb)
tensor([[87, 85, 65, 93, 67, 62, 75, 80],
[81, 80, 2, 81, 72, 2, 86, 74],
[ 2, 58, 6, 2, 68, 71, 2, 67],
[80, 73, 2, 79, 67, 82, 14, 1]])
import torch
import torch.nn as nn

class Wolfgang_Language_Model(nn.Module):

    def __init__(self, vocab_size):
        super().__init__()
        # nn.Embedding(m, n) can be interpreted simply as an m x n matrix.
        # In our case, it takes in the index of the given token and embeds it into a vector space of dimension vocab_size
        self.token_embedding_table = nn.Embedding(vocab_size, vocab_size)

    def forward(self, idx, targets=None):
        # B = batch_size
        # T = block_size
        # C = vocab_size
        # Input: idx, a (B x T)-tensor whose entries are the integer indices of characters
        # Output: logits, a (B x T x C)-tensor
        # But what is the interpretation of this output?
        # Given a character, the logits are a vector of unnormalized scores for the next character;
        # applying softmax turns them into predicted probabilities
        # Since there are vocab_size characters, the tensor picks up another dimension
        # How does it work?
        # Mathematically, we think of it as multiplying a one-hot vector by the matrix nn.Embedding(vocab_size, vocab_size)
        # Functionally, the index tells us which row of the matrix nn.Embedding(vocab_size, vocab_size) to pick out
        logits = self.token_embedding_table(idx) # (B,T,C)

        # For these predictions, return the loss via the cross_entropy loss function
        if targets is None:
            loss = None
        else:
            B, T, C = logits.shape
            logits = logits.view(B*T, C)
            targets = targets.view(B*T)
            loss = nn.functional.cross_entropy(logits, targets)
        return logits, loss

    def generate(self, idx, max_new_tokens):
        # Given idx, a (B x T)-tensor of character indices, predict the next max_new_tokens characters
        # The result is a (B x (T + max_new_tokens))-tensor
        for _ in range(max_new_tokens):
            # get the predictions
            logits, loss = self(idx)
            # focus only on the last time step
            logits = logits[:, -1, :] # becomes (B, C)
            # apply softmax to get probabilities
            probs = nn.functional.softmax(logits, dim=-1) # (B, C)
            # sample from the distribution
            idx_next = torch.multinomial(probs, num_samples=1) # (B, 1)
            # append sampled index to the running sequence
            idx = torch.cat((idx, idx_next), dim=1) # (B, T+1)
        return idx
m = Wolfgang_Language_Model(vocab_size)
logits, loss = m(xb, yb)
print(logits.shape)
print(loss)
print(decode(m.generate(idx = torch.zeros((1, 1), dtype=torch.long), max_new_tokens=100)[0].tolist()))
torch.Size([32, 99])
tensor(4.7986, grad_fn=<NllLossBackward0>)
5's;n{eδzXsC019;*W1[s}^X:?%rhk'H] Ou(ISDj)*+a&L
WS')0!bwTpfIH[^+7)SD|p*gK:Sg/&[#?|CbmZ2.5 @ovδ1<Qsuk
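For reference, a model that assigns equal probability to each of the 99 characters has cross-entropy loss \(\ln(99) \approx 4.595\). The untrained model's loss of 4.7986 sits slightly above that, since its randomly initialized logits are not exactly uniform. The sketch below checks the uniform baseline; the tensor shapes and the random targets are arbitrary assumptions.

```python
import math
import torch
import torch.nn.functional as F

vocab_size = 99  # number of unique characters reported above

# All-zero logits correspond to a uniform distribution over the vocabulary.
uniform_logits = torch.zeros(32, vocab_size)
targets = torch.randint(0, vocab_size, (32,))  # arbitrary targets

uniform_loss = F.cross_entropy(uniform_logits, targets).item()
print(uniform_loss, math.log(vocab_size))
```

Any trained model should end up well below this baseline.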
# Create a PyTorch optimizer
optimizer = torch.optim.AdamW(m.parameters(), lr=1e-3)
# Can minimize the loss directly via the PyTorch library
batch_size = 64
for steps in range(10000):
    # sample a batch of data
    xb, yb = get_batch('train')

    # evaluate the loss
    logits, loss = m(xb, yb)
    # reset the gradients, backpropagate, and update the parameters
    optimizer.zero_grad(set_to_none=True)
    loss.backward()
    optimizer.step()
Having trained the model, we can see what it outputs starting from an empty input. At this point, it can pick out some basic patterns between the placements of vowels, consonants, and spaces. We see some outputs that almost resemble words.
# Naive output for the model
print(decode(m.generate(idx = torch.zeros((1, 1), dtype=torch.long), max_new_tokens=500)[0].tolist()))
\ tor s U$z)$ *}_{s $. scet ima_nds,
\st on stalorba2n-sscolpoore p acthofinsihesedioo Gisuss d ge perineena atwetoly
$ $\rs oe t{d LQ) &= L_bet(S^{(n oucr{E$\p WZ}. tumalu'')=Ded Dr $mot ma' taruc{ioig $
\re ipowh_-it{N$ \e tis \woo too\tred}(Z\rureq_1,δ1\ti
Thio prat_hifiz_\ndunthin $$ abmbde the t h {
Co X}(T Zetbon atruca g $ asctin ad{glimabdela$ $\iquphas, bdond ocatithima, \, d{rissuct S}_A,je iof O$ mangin re $\ouc$, F_{Th N$, map'(�&Weri$ WZ, S\Lererovaves.
\sct, "Lan So}
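Unsurprisingly, the previous results are unintelligible. Although each block holds 8 characters, this model predicts the next character from the single preceding character alone, and there is only so much predictive power in that. To do better, the prediction must draw on more of the preceding context, which is what self-attention provides.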
Self-attention is defined by the initially opaque equation: \[ \text{Attention} (Q, K, V) = \text{softmax} \left( \frac{QK^T}{\sqrt{d_k}} \right) V. \] In what follows, we will decipher the meaning of this equation.
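Before taking the equation apart, here is a minimal sketch of it as a single function. The function name `attention`, the `causal` flag, and the tensor shapes are assumptions made for illustration; the lower-triangular mask is not part of the formula itself, but it is how a language model keeps each position from seeing later characters.

```python
import math
import torch
import torch.nn.functional as F

def attention(Q, K, V, causal=True):
    """Scaled dot-product attention for Q, K of shape (B, T, d_k) and V of shape (B, T, d_v)."""
    d_k = Q.shape[-1]
    scores = Q @ K.transpose(-2, -1) / math.sqrt(d_k)       # (B, T, T)
    if causal:
        T = Q.shape[-2]
        mask = torch.tril(torch.ones(T, T, dtype=torch.bool))
        # Positions above the diagonal get -inf, so softmax gives them weight 0.
        scores = scores.masked_fill(~mask, float('-inf'))
    weights = F.softmax(scores, dim=-1)                      # each row sums to 1
    return weights @ V, weights

torch.manual_seed(0)
Q, K, V = (torch.randn(2, 5, 16) for _ in range(3))
out, weights = attention(Q, K, V)
print(out.shape)      # (2, 5, 16)
print(weights[0])     # lower-triangular rows of attention weights
```

Each output row is a weighted average of the rows of V, with weights determined by how strongly the corresponding query matches each key.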
torch.manual_seed(42) # seed value is an assumption
# torch.tril masks a matrix to produce a lower triangular matrix
# Helpful to enforce that characters are influenced only by preceding characters
a = torch.tril(torch.ones(3, 3))
a = a / torch.sum(a, 1, keepdim=True)
b = torch.randint(0,10,(3,2)).float()
c = a @ b
print('a=')
print(a)
print('b=')
print(b)
print('c=')
print(c)
tensor([[1.0000, 0.0000, 0.0000],
[0.5000, 0.5000, 0.0000],
[0.3333, 0.3333, 0.3333]])
tensor([[2., 7.],
[6., 4.],
[6., 5.]])
tensor([[2.0000, 7.0000],
[4.0000, 5.5000],
[4.6667, 5.3333]])
torch.manual_seed(1337) # seed value is an assumption
B,T,C = 4,8,2 # batch, time, channels
x = torch.randn(B,T,C)
x.shape
torch.Size([4, 8, 2])
# We define xbow ('bag of words') as the average of the current and all preceding characters, i.e.,
# xbow[b,t] = avg{x[b,i]}_{i <= t}
xbow = torch.zeros((B,T,C))
for b in range(B):
    for t in range(T):
        xprev = x[b,:t+1] # (t+1,C)
        xbow[b,t] = torch.mean(xprev, 0)
# We can achieve the same result more efficiently using matrix multiplication
wei = torch.tril(torch.ones(T, T))
wei = wei / wei.sum(1, keepdim=True)
print(wei)
xbow2 = wei @ x # (T, T), broadcast to (B, T, T), @ (B, T, C) ----> (B, T, C)
torch.allclose(xbow, xbow2)
tensor([[1.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000],
[0.5000, 0.5000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000],
[0.3333, 0.3333, 0.3333, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000],
[0.2500, 0.2500, 0.2500, 0.2500, 0.0000, 0.0000, 0.0000, 0.0000],
[0.2000, 0.2000, 0.2000, 0.2000, 0.2000, 0.0000, 0.0000, 0.0000],
[0.1667, 0.1667, 0.1667, 0.1667, 0.1667, 0.1667, 0.0000, 0.0000],
[0.1429, 0.1429, 0.1429, 0.1429, 0.1429, 0.1429, 0.1429, 0.0000],
[0.1250, 0.1250, 0.1250, 0.1250, 0.1250, 0.1250, 0.1250, 0.1250]])
# version 3: use Softmax
tril = torch.tril(torch.ones(T, T))
wei = torch.zeros((T,T))
wei = wei.masked_fill(tril == 0, float('-inf'))
wei = nn.functional.softmax(wei, dim=-1)
xbow3 = wei @ x
torch.allclose(xbow, xbow3)
# version 4: self-attention!
torch.manual_seed(1337) # seed value is an assumption
B,T,C = 4,8,32 # batch, time, channels
x = torch.randn(B,T,C)

# let's see a single Head perform self-attention
head_size = 16
key = nn.Linear(C, head_size, bias=False)
query = nn.Linear(C, head_size, bias=False)
value = nn.Linear(C, head_size, bias=False)
k = key(x)   # (B, T, 16)
q = query(x) # (B, T, 16)
wei = q @ k.transpose(-2, -1) # (B, T, 16) @ (B, 16, T) ---> (B, T, T)

tril = torch.tril(torch.ones(T, T))
#wei = torch.zeros((T,T))
wei = wei.masked_fill(tril == 0, float('-inf'))
wei = nn.functional.softmax(wei, dim=-1)

v = value(x)
out = wei @ v
#out = wei @ x

out.shape
torch.Size([4, 8, 16])
wei[0]
tensor([[1.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000],
[0.1574, 0.8426, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000],
[0.2088, 0.1646, 0.6266, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000],
[0.5792, 0.1187, 0.1889, 0.1131, 0.0000, 0.0000, 0.0000, 0.0000],
[0.0294, 0.1052, 0.0469, 0.0276, 0.7909, 0.0000, 0.0000, 0.0000],
[0.0176, 0.2689, 0.0215, 0.0089, 0.6812, 0.0019, 0.0000, 0.0000],
[0.1691, 0.4066, 0.0438, 0.0416, 0.1048, 0.2012, 0.0329, 0.0000],
[0.0210, 0.0843, 0.0555, 0.2297, 0.0573, 0.0709, 0.2423, 0.2391]],
k = torch.randn(B,T,head_size)
q = torch.randn(B,T,head_size)
wei = q @ k.transpose(-2, -1) * head_size**-0.5
torch.softmax(torch.tensor([0.1, -0.2, 0.3, -0.2, 0.5]), dim=-1)
tensor([0.1925, 0.1426, 0.2351, 0.1426, 0.2872])
torch.softmax(torch.tensor([0.1, -0.2, 0.3, -0.2, 0.5])*8, dim=-1) # gets too peaky, converges to one-hot
tensor([0.0326, 0.0030, 0.1615, 0.0030, 0.8000])
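The factor `head_size**-0.5` is what keeps softmax away from this peaky regime. A rough sketch of why, assuming queries and keys with independent standard normal entries (the batch and time sizes below are arbitrary choices for illustration): each entry of `q @ k^T` is a sum of `head_size` products, so its variance is about `head_size`, and the scaling brings it back to about 1.

```python
import torch

torch.manual_seed(0)
B, T, head_size = 64, 32, 16  # arbitrary sizes for illustration
k = torch.randn(B, T, head_size)
q = torch.randn(B, T, head_size)

unscaled = q @ k.transpose(-2, -1)      # entries have variance about head_size
scaled = unscaled * head_size**-0.5     # entries have variance about 1

print(k.var().item(), q.var().item())
print(unscaled.var().item(), scaled.var().item())
```

With unit-variance scores at initialization, the attention weights start out diffuse rather than concentrated on a single position.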
class LayerNorm1d: # (used to be BatchNorm1d)

    def __init__(self, dim, eps=1e-5, momentum=0.1):
        self.eps = eps
        self.gamma = torch.ones(dim)
        self.beta = torch.zeros(dim)

    def __call__(self, x):
        # calculate the forward pass
        xmean = x.mean(1, keepdim=True) # mean over the features of each input
        xvar = x.var(1, keepdim=True) # variance over the features of each input
        xhat = (x - xmean) / torch.sqrt(xvar + self.eps) # normalize to unit variance
        self.out = self.gamma * xhat + self.beta
        return self.out

    def parameters(self):
        return [self.gamma, self.beta]
torch.manual_seed(1337) # seed value is an assumption
module = LayerNorm1d(100)
x = torch.randn(32, 100) # batch size 32 of 100-dimensional vectors
x = module(x)
x.shape
torch.Size([32, 100])
x[:,0].mean(), x[:,0].std() # mean,std of one feature across all batch inputs
(tensor(0.1469), tensor(0.8803))
x[0,:].mean(), x[0,:].std() # mean,std of a single input from the batch, of its features
(tensor(-9.5367e-09), tensor(1.0000))
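As a sanity check, the same per-input normalization can be compared against PyTorch's built-in layer norm. This is a sketch under one stated assumption: the built-in uses the biased variance estimate, so the hand-written version below passes `unbiased=False`, whereas `LayerNorm1d` above uses the default unbiased estimate and therefore differs from the built-in by a small factor.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
x = torch.randn(32, 100)  # batch of 32 inputs with 100 features each
eps = 1e-5

# Per-input normalization written out by hand, with the biased variance.
mean = x.mean(dim=1, keepdim=True)
var = x.var(dim=1, keepdim=True, unbiased=False)
manual = (x - mean) / torch.sqrt(var + eps)

# Built-in layer norm over the last dimension, without learnable scale or shift.
builtin = F.layer_norm(x, normalized_shape=(100,), eps=eps)
print(torch.allclose(manual, builtin, atol=1e-5))
```

Either way, every input is normalized on its own, independently of the other inputs in the batch.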
Raw Code
Code all in one place. Can run this as a single cell if you want!
import torch
import torch.nn as nn
from torch.nn import functional as F

# hyperparameters
batch_size = 16 # how many independent sequences will we process in parallel?
block_size = 64 # what is the maximum context length for predictions?
max_iters = 10000
eval_interval = 100
learning_rate = 1e-3
device = 'cuda' if torch.cuda.is_available() else 'cpu'
eval_iters = 200
n_embd = 64
n_head = 4
n_layer = 4
dropout = 0.0
# ------------

with open('WOLFGANG_TRAINING.tex', 'r', encoding='utf-8') as f:
    text = f.read()

# here are all the unique characters that occur in this text
chars = sorted(list(set(text)))
vocab_size = len(chars)
# create a mapping from characters to integers
stoi = { ch:i for i,ch in enumerate(chars) }
itos = { i:ch for i,ch in enumerate(chars) }
encode = lambda s: [stoi[c] for c in s] # encoder: take a string, output a list of integers
decode = lambda l: ''.join([itos[i] for i in l]) # decoder: take a list of integers, output a string

# Train and test splits
data = torch.tensor(encode(text), dtype=torch.long)
n = int(0.9*len(data)) # first 90% will be train, rest val
train_data = data[:n]
val_data = data[n:]

# data loading
def get_batch(split):
    # generate a small batch of data of inputs x and targets y
    data = train_data if split == 'train' else val_data
    ix = torch.randint(len(data) - block_size, (batch_size,))
    x = torch.stack([data[i:i+block_size] for i in ix])
    y = torch.stack([data[i+1:i+block_size+1] for i in ix])
    x, y = x.to(device), y.to(device)
    return x, y
@torch.no_grad() # no gradients are needed for evaluation
def estimate_loss():
    out = {}
    model.eval()
    for split in ['train', 'val']:
        losses = torch.zeros(eval_iters)
        for k in range(eval_iters):
            X, Y = get_batch(split)
            logits, loss = model(X, Y)
            losses[k] = loss.item()
        out[split] = losses.mean()
    model.train()
    return out
class Head(nn.Module):
    """ one head of self-attention """

    def __init__(self, head_size):
        super().__init__()
        self.key = nn.Linear(n_embd, head_size, bias=False)
        self.query = nn.Linear(n_embd, head_size, bias=False)
        self.value = nn.Linear(n_embd, head_size, bias=False)
        self.register_buffer('tril', torch.tril(torch.ones(block_size, block_size)))
        self.dropout = nn.Dropout(dropout)

    def forward(self, x):
        B,T,C = x.shape
        k = self.key(x)   # (B,T,head_size)
        q = self.query(x) # (B,T,head_size)
        # compute attention scores ("affinities")
        # note: this scales by C = n_embd, whereas the attention formula scales by d_k = head_size
        wei = q @ k.transpose(-2,-1) * C**-0.5 # (B, T, head_size) @ (B, head_size, T) -> (B, T, T)
        wei = wei.masked_fill(self.tril[:T, :T] == 0, float('-inf')) # (B, T, T)
        wei = F.softmax(wei, dim=-1) # (B, T, T)
        wei = self.dropout(wei)
        # perform the weighted aggregation of the values
        v = self.value(x) # (B,T,head_size)
        out = wei @ v # (B, T, T) @ (B, T, head_size) -> (B, T, head_size)
        return out
class MultiHeadAttention(nn.Module):
    """ multiple heads of self-attention in parallel """

    def __init__(self, num_heads, head_size):
        super().__init__()
        self.heads = nn.ModuleList([Head(head_size) for _ in range(num_heads)])
        self.proj = nn.Linear(n_embd, n_embd)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x):
        # concatenate the head outputs along the channel dimension
        out = torch.cat([h(x) for h in self.heads], dim=-1)
        out = self.dropout(self.proj(out))
        return out
class FeedFoward(nn.Module):
    """ a simple linear layer followed by a non-linearity """

    def __init__(self, n_embd):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_embd, 4 * n_embd),
            nn.ReLU(),
            nn.Linear(4 * n_embd, n_embd),
        )

    def forward(self, x):
        return self.net(x)
class Block(nn.Module):
    """ Transformer block: communication followed by computation """

    def __init__(self, n_embd, n_head):
        # n_embd: embedding dimension, n_head: the number of heads we'd like
        super().__init__()
        head_size = n_embd // n_head
        self.sa = MultiHeadAttention(n_head, head_size)
        self.ffwd = FeedFoward(n_embd)
        self.ln1 = nn.LayerNorm(n_embd)
        self.ln2 = nn.LayerNorm(n_embd)

    def forward(self, x):
        # residual connections around self-attention and the feed-forward layer
        x = x + self.sa(self.ln1(x))
        x = x + self.ffwd(self.ln2(x))
        return x
# character-level transformer language model
class wolfgang_LLM(nn.Module):

    def __init__(self):
        super().__init__()
        # token and position lookup tables, both embedding into dimension n_embd
        self.token_embedding_table = nn.Embedding(vocab_size, n_embd)
        self.position_embedding_table = nn.Embedding(block_size, n_embd)
        self.blocks = nn.Sequential(*[Block(n_embd, n_head=n_head) for _ in range(n_layer)])
        self.ln_f = nn.LayerNorm(n_embd) # final layer norm
        self.lm_head = nn.Linear(n_embd, vocab_size)

    def forward(self, idx, targets=None):
        B, T = idx.shape

        # idx and targets are both (B,T) tensor of integers
        tok_emb = self.token_embedding_table(idx) # (B,T,C)
        pos_emb = self.position_embedding_table(torch.arange(T, device=device)) # (T,C)
        x = tok_emb + pos_emb # (B,T,C)
        x = self.blocks(x) # (B,T,C)
        x = self.ln_f(x) # (B,T,C)
        logits = self.lm_head(x) # (B,T,vocab_size)

        if targets is None:
            loss = None
        else:
            B, T, C = logits.shape
            logits = logits.view(B*T, C)
            targets = targets.view(B*T)
            loss = F.cross_entropy(logits, targets)

        return logits, loss

    def generate(self, idx, max_new_tokens):
        # idx is (B, T) array of indices in the current context
        for _ in range(max_new_tokens):
            # crop idx to the last block_size tokens
            idx_cond = idx[:, -block_size:]
            # get the predictions
            logits, loss = self(idx_cond)
            # focus only on the last time step
            logits = logits[:, -1, :] # becomes (B, C)
            # apply softmax to get probabilities
            probs = F.softmax(logits, dim=-1) # (B, C)
            # sample from the distribution
            idx_next = torch.multinomial(probs, num_samples=1) # (B, 1)
            # append sampled index to the running sequence
            idx = torch.cat((idx, idx_next), dim=1) # (B, T+1)
        return idx
model = wolfgang_LLM()
m = model.to(device)
# print the number of parameters in the model
print(sum(p.numel() for p in m.parameters())/1e6, 'M parameters')

# create a PyTorch optimizer
optimizer = torch.optim.AdamW(model.parameters(), lr=learning_rate)

for iter in range(max_iters):

    # every once in a while evaluate the loss on train and val sets
    if iter % eval_interval == 0 or iter == max_iters - 1:
        losses = estimate_loss()
        print(f"step {iter}: train loss {losses['train']:.4f}, val loss {losses['val']:.4f}")

    # sample a batch of data
    xb, yb = get_batch('train')

    # evaluate the loss
    logits, loss = model(xb, yb)
    # reset the gradients, backpropagate, and update the parameters
    optimizer.zero_grad(set_to_none=True)
    loss.backward()
    optimizer.step()

# generate from the model
context = torch.zeros((1, 1), dtype=torch.long, device=device)
print(decode(m.generate(context, max_new_tokens=2000)[0].tolist()))
0.216163 M parameters
step 0: train loss 4.8271, val loss 4.8334
step 100: train loss 3.0829, val loss 3.1061
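The reported count of 0.216163 M parameters can be reproduced by hand. The tally below is a sketch that assumes vocab_size = 99, block_size = 64, n_embd = 64, n_head = 4, n_layer = 4, bias-free key, query, and value projections, and a feed-forward layer of hidden width 4 * n_embd; the variable names are chosen for illustration.

```python
vocab_size, block_size = 99, 64
n_embd, n_head, n_layer = 64, 4, 4
head_size = n_embd // n_head

token_emb = vocab_size * n_embd                    # token embedding table
pos_emb = block_size * n_embd                      # position embedding table
attn = n_head * 3 * n_embd * head_size             # key, query, value (no bias)
proj = n_embd * n_embd + n_embd                    # output projection with bias
ffwd = (n_embd * 4 * n_embd + 4 * n_embd) + (4 * n_embd * n_embd + n_embd)
layer_norms = 2 * (2 * n_embd)                     # two layer norms, scale and shift each
per_block = attn + proj + ffwd + layer_norms
final_ln = 2 * n_embd                              # final layer norm
lm_head = n_embd * vocab_size + vocab_size         # final linear layer with bias

total = token_emb + pos_emb + n_layer * per_block + final_ln + lm_head
print(total / 1e6, 'M parameters')
```

The lower-triangular masks are registered as buffers, not parameters, so they do not enter the count.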
\; \bigcup_{i'\in U'}_{\lambda} (v_y\times W_{x',0,k}).
Assume that $w=\Theta)
\rightarrow (a,v)=\{0,1\}$.
To bumpple, that this properts closed to a parameter a correspond $Z$\bm{W}$, where there $\Phi_x$.
\end{$(I-s_a,o}, D_j$.
\ssc^\infty\in W\setminus \oplus W\Q^+=\wh{e}^2&@<\arrow{\Gamma^\ast}|_{x_{a_1\wh{\neq_x^*}_2 (a,v,\tau(g'),l(o_i)\to \abs{\tau (0)} \leq \a@ \mathscr{C}^{-1}^{\iota}\phi^{\tau_0}} \times{WZ+HWZ8}(\sigma)\rightarrow \mu (\tau)\circ T (-\bf deffer to $t_X\rightarrow {\mathscr{S}$}_t {t_x}(y_{x_p})$. If those proved that $E\leq m$ local brive isomorphism, such. $X\subset W$ with $f>P_a(a,o_a)$
where are istension. Apparacompact maps reall $\zi$ and $\lambda)'\to (\Gamma_a)\circ\frac{1}(F, \alpha^+,\alpha^+,-)$, defined to the image filted in $C_x$ has a tangent of $I$\xi$ holows fromov-wing-fiber-compactness prove that $k=\sigma$.
Here tangent $f'_k$ conclude type, the map $(U, \phi^{-1}): \pi
\cU| \mathscr{C}^{-1}(\cF)\to O^\pm\in \cP(\cW^{}_a; \rightarrow \abs{ b}_X :\Tti-(\rT_x \bigl|) \cdot, t_H^* \cap (p_-, A_+)$ and we such use that with are isots that solutions hysuality for the sensions of $\tau$ of a zero defined only
of next the point solution provese $\abs{\beta(a)}$ also linear operators,
which, ands and the map \rho$ are $ is b.
Hance we have this sc-smoothly of $X'_{{0,1}}\oplus W$ of ep-groupoidal $z\in \w-hat{s}_{\iota}'(\Sigma)$) is sc-smooth, which nece orbifold with the equest closed on $\delbar_x=\ast\ov\circ s^{t,y$.
\item[(\beta) \begin{le}\ \\ over(y)| \alpha \in F^\ast_{x,y} +h}^+\ast$ to equ
see. the morphism
w_{\ast\colon X} =(\tau (\bigl( \phi & \wh{V}_\ast) (\big )\circ d_{m+i}(TD,x',\tau,g)
(\phi, \psi \exp_p \bm{M}(\phi, \beta, {\phi})\rightarrow [0,1] \to \sum_{X_x'$-acts of a stable.
In bhoose a finite-section be cholowse which associated for whoosen \ref{rm-stratned(3)-suborbifold}}} we have metrizable
many, and the indices recalized t