WOLFGANG-GPT
Wolfgang-GPT is trained on a data set consisting mainly of my mathematics research papers. I used arxiv-latex-cleaner to clean up the tex files a bit; this mostly means the removal of all commented text.
In [15]:
!wget -q https://github.com/volfenstein1/quarto/raw/refs/heads/main/WOLFGANG_TRAINING.tex
with open('WOLFGANG_TRAINING.tex', 'r', encoding='utf-8') as f:
    text = f.read()
You can see the training set here: WOLFGANG_TRAINING.tex.
In [16]:
print("Length of WOLFGANG_TRAINING dataset: ", len(text))
Length of WOLFGANG_TRAINING dataset: 3750026
In [17]:
# We can look at the first 1200 characters of this dataset; it consists of plain latex code, and is easily readable on its own.
print(text[:1200])
The Steenrod problem for closed orientable manifolds was solved completely by Thom.
Following this approach, we solve the Steenrod problem for closed orientable orbifolds, proving that the rational homology groups of a closed orientable orbifold have a basis consisting of classes represented by suborbifolds whose normal bundles have fiberwise trivial isotropy action.
Polyfold theory, as developed by Hofer, Wysocki, and Zehnder, has yielded a well-defined Gromov--Witten invariant via the regularization of moduli spaces.
As an application, we demonstrate that the polyfold Gromov--Witten invariants, originally defined via branched integrals, may equivalently be defined as intersection numbers against a basis of representing suborbifolds.
\section{Introduction}
\subsection{The {S}teenrod problem}
The Steenrod problem was first presented in \cite{eilenberg1949problems} and asked the following question:
\textit{Can any homology class of a finite polyhedron be represented as an image of the fundamental class of some manifold?}
In \cite{thom1954quelques},\footnote{The reader should be advised that the commonly available English translation of this paper introduces a few errors which ar
In [18]:
chars = sorted(list(set(text)))
vocab_size = len(chars)
print('Unique characters: ', ''.join(chars))
print('Number of unique characters: ', vocab_size)
Unique characters:
!"#$%&'()*+,-./0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrstuvwxyz{|}~δ�
Number of unique characters: 99
Tokenizer
A tokenizer splits the training text into disjoint chunks and embeds the chunks into a vector space \(\mathbb{R}^n\).
We use a very rudimentary tokenizer, with chunks given by the individual characters: \[ \text{The Steenrod problem for...} \to \text{|T|h|e| |S|t|e|e|n|r|o|d| |p|r|o|b|l|e|m| |f|o|r|...} \]
Each unique character is encoded as a basis element of \(\mathbb{R}^n\) via a one-hot encoding: \[ \{ \text{unique characters} \} \to \mathbb{R}^{\#\{\text{unique characters}\}} \] and the decoder is the inverse.
We could obtain more sophistication by tokenizing on syllables and chunks of latex code; see, for example, the tokenizer used by the MathBERTa model.
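As an aside, the one-hot picture above can be made explicit in PyTorch. The following is only a minimal sketch for illustration, not part of the notebook's pipeline; the names toy_vocab, toy_stoi, and toy_indices are made up here, with toy_vocab standing in for the chars defined below.
# Sketch only: an explicit one-hot encoding of characters over a small toy vocabulary
import torch
import torch.nn.functional as F

toy_vocab = sorted(set("The Steenrod problem"))           # stand-in for `chars`
toy_stoi = {ch: i for i, ch in enumerate(toy_vocab)}      # character -> integer index
toy_indices = torch.tensor([toy_stoi[c] for c in "The"])  # integer encoding of "The"

# each index becomes a standard basis vector of R^{len(toy_vocab)}
one_hot = F.one_hot(toy_indices, num_classes=len(toy_vocab)).float()
print(one_hot.shape)                                            # (3, len(toy_vocab))
print(one_hot.argmax(dim=-1).tolist() == toy_indices.tolist())  # taking argmax recovers the indices
In the model below, this one-hot step is never materialized explicitly: nn.Embedding simply picks out the corresponding row of its weight matrix.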
In [19]:
string_to_integer = { char:idx for idx,char in enumerate(chars) }
integer_to_string = { idx:char for idx,char in enumerate(chars) }
encode = lambda s: [string_to_integer[c] for c in s]
decode = lambda l: ''.join([integer_to_string[i] for i in l])
print(encode("This sentence is a test. Here is some math $(M,\omega)$."))
print(decode(encode("This sentence is a test. Here is some math $(M,\omega)$.")))
[54, 74, 75, 85, 2, 85, 71, 80, 86, 71, 80, 69, 71, 2, 75, 85, 2, 67, 2, 86, 71, 85, 86, 16, 2, 42, 71, 84, 71, 2, 75, 85, 2, 85, 81, 79, 71, 2, 79, 67, 86, 74, 2, 6, 10, 47, 14, 62, 81, 79, 71, 73, 67, 11, 6, 16]
This sentence is a test. Here is some math $(M,\omega)$.
In [20]:
# Encode the entire dataset WOLFGANG_TRAINING.tex and store it as a pytorch tensor
import torch
data = torch.tensor(encode(text), dtype=torch.long)
print(data.shape, data.dtype)
print(data[:1200])
torch.Size([3750026]) torch.int64
tensor([54, 74, 71, ..., 2, 67, 84])
In [21]:
# Split the data into train and validation sets
n = int(0.9*len(data))
train_data = data[:n]
val_data = data[n:]
Neural Network Model
Given a sequence of tokens, we can train a neural network model to predict the most likely next token: \[ \text{|s|y|m|p|l|e|c|t|i|} \to \text{|c|}. \]
The input consists of a block of tokens up to size block_size, and the target is the next token.
In [22]:
block_size = 8
train_data[:block_size+1]
tensor([54, 74, 71, 2, 53, 86, 71, 71, 80])
In [23]:
x = train_data[:block_size]
y = train_data[1:block_size+1]
for t in range(block_size):
    context = x[:t+1]
    target = y[t]
    print("When input is", context, "the target is:", target)
When input is tensor([54]) the target is: tensor(74)
When input is tensor([54, 74]) the target is: tensor(71)
When input is tensor([54, 74, 71]) the target is: tensor(2)
When input is tensor([54, 74, 71, 2]) the target is: tensor(53)
When input is tensor([54, 74, 71, 2, 53]) the target is: tensor(86)
When input is tensor([54, 74, 71, 2, 53, 86]) the target is: tensor(71)
When input is tensor([54, 74, 71, 2, 53, 86, 71]) the target is: tensor(71)
When input is tensor([54, 74, 71, 2, 53, 86, 71, 71]) the target is: tensor(80)
In [24]:
torch.manual_seed(123412)
batch_size = 4 # how many independent sequences will we process in parallel?
block_size = 8 # what is the maximum context length for predictions?

def get_batch(split):
    data = train_data if split == 'train' else val_data
    # Randomly select #{batch_size} indices with 0 <= idx < len(data) - block_size
    idx_x = torch.randint(0, len(data) - block_size, (batch_size,))
    x = torch.stack([data[idx:idx+block_size] for idx in idx_x])
    y = torch.stack([data[idx+1:idx+block_size+1] for idx in idx_x])
    return x, y

xb, yb = get_batch('train')
print('inputs:')
print(xb.shape)
print(xb)
print('targets:')
print(yb.shape)
print(yb)
print('----')

for b in range(batch_size): # batch dimension
    for t in range(block_size): # time dimension
        context = xb[b, :t+1]
        target = yb[b,t]
        print(f"when input is {context.tolist()} the target: {target}")
inputs:
torch.Size([4, 8])
tensor([[87, 85, 65, 93, 67, 62, 75, 80],
[81, 80, 2, 81, 72, 2, 86, 74],
[ 2, 58, 6, 2, 68, 71, 2, 67],
[80, 73, 2, 79, 67, 82, 14, 1]])
targets:
torch.Size([4, 8])
tensor([[85, 65, 93, 67, 62, 75, 80, 2],
[80, 2, 81, 72, 2, 86, 74, 71],
[58, 6, 2, 68, 71, 2, 67, 2],
[73, 2, 79, 67, 82, 14, 1, 68]])
----
when input is [87] the target: 85
when input is [87, 85] the target: 65
when input is [87, 85, 65] the target: 93
when input is [87, 85, 65, 93] the target: 67
when input is [87, 85, 65, 93, 67] the target: 62
when input is [87, 85, 65, 93, 67, 62] the target: 75
when input is [87, 85, 65, 93, 67, 62, 75] the target: 80
when input is [87, 85, 65, 93, 67, 62, 75, 80] the target: 2
when input is [81] the target: 80
when input is [81, 80] the target: 2
when input is [81, 80, 2] the target: 81
when input is [81, 80, 2, 81] the target: 72
when input is [81, 80, 2, 81, 72] the target: 2
when input is [81, 80, 2, 81, 72, 2] the target: 86
when input is [81, 80, 2, 81, 72, 2, 86] the target: 74
when input is [81, 80, 2, 81, 72, 2, 86, 74] the target: 71
when input is [2] the target: 58
when input is [2, 58] the target: 6
when input is [2, 58, 6] the target: 2
when input is [2, 58, 6, 2] the target: 68
when input is [2, 58, 6, 2, 68] the target: 71
when input is [2, 58, 6, 2, 68, 71] the target: 2
when input is [2, 58, 6, 2, 68, 71, 2] the target: 67
when input is [2, 58, 6, 2, 68, 71, 2, 67] the target: 2
when input is [80] the target: 73
when input is [80, 73] the target: 2
when input is [80, 73, 2] the target: 79
when input is [80, 73, 2, 79] the target: 67
when input is [80, 73, 2, 79, 67] the target: 82
when input is [80, 73, 2, 79, 67, 82] the target: 14
when input is [80, 73, 2, 79, 67, 82, 14] the target: 1
when input is [80, 73, 2, 79, 67, 82, 14, 1] the target: 68
In [25]:
# For parallelizability, data gets passed to the transformer as batches of inputs.
# Example of a single batch of inputs:
print(xb)
tensor([[87, 85, 65, 93, 67, 62, 75, 80],
[81, 80, 2, 81, 72, 2, 86, 74],
[ 2, 58, 6, 2, 68, 71, 2, 67],
[80, 73, 2, 79, 67, 82, 14, 1]])
In [26]:
import torch
import torch.nn as nn
torch.manual_seed(123412)

class Wolfgang_Language_Model(nn.Module):

    def __init__(self, vocab_size):
        super().__init__()
        # nn.Embedding(m,n) can be interpreted simply as an m*n matrix;
        # in our case, it takes the index of the given token and embeds it into a vector space of dimension = vocab_size
        self.token_embedding_table = nn.Embedding(vocab_size, vocab_size)

    def forward(self, idx, targets=None):
        # B = batch_size
        # T = block_size
        # C = vocab_size
        # Input: an index idx, which is a (B x T)-tensor whose entries are encoded characters
        # Output: a (B x T x C)-tensor
        # But what is the interpretation of this output?
        # Given an encoded character, the logits are a vector of scores (unnormalized log-probabilities) for the next character
        # Since there are vocab_size characters, the tensor picks up another dimension
        # How does it work?
        # Mathematically we think of it as matrix multiplication by the matrix nn.Embedding(vocab_size, vocab_size)
        # Functionally, the index tells us which row of the matrix nn.Embedding(vocab_size, vocab_size) to pick out
        logits = self.token_embedding_table(idx) # (B,T,C)

        # For these predictions, return the loss via the cross_entropy loss function
        if targets is None:
            loss = None
        else:
            B, T, C = logits.shape
            logits = logits.view(B*T, C)
            targets = targets.view(B*T)
            loss = nn.functional.cross_entropy(logits, targets)

        return logits, loss

    def generate(self, idx, max_new_tokens):
        # Given an index idx, which is a (B x T)-tensor whose entries are encoded characters, predict the next #max_new_tokens characters
        # The result is a (B x (T + max_new_tokens))-tensor
        for _ in range(max_new_tokens):
            # get the predictions
            logits, loss = self(idx)
            # focus only on the last time step
            logits = logits[:, -1, :] # becomes (B, C)
            # apply softmax to get probabilities
            probs = nn.functional.softmax(logits, dim=-1) # (B, C)
            # sample from the distribution
            idx_next = torch.multinomial(probs, num_samples=1) # (B, 1)
            # append sampled index to the running sequence
            idx = torch.cat((idx, idx_next), dim=1) # (B, T+1)
        return idx

m = Wolfgang_Language_Model(vocab_size)
logits, loss = m(xb, yb)
print(logits.shape)
print(loss)
print(decode(m.generate(idx = torch.zeros((1, 1), dtype=torch.long), max_new_tokens=100)[0].tolist()))
torch.Size([32, 99])
tensor(4.7986, grad_fn=<NllLossBackward0>)
5's;n{eδzXsC019;*W1[s}^X:?%rhk'H] Ou(ISDj)*+a&L
WS')0!bwTpfIH[^+7)SD|p*gK:Sg/&[#?|CbmZ2.5 @ovδ1<Qsuk
In [27]:
# Create a pytorch optimizer
optimizer = torch.optim.AdamW(m.parameters(), lr=1e-3)
In [28]:
# Can minimize the loss directly via the pytorch library
batch_size = 64
for steps in range(10000):
    # sample a batch of data
    xb, yb = get_batch('train')

    # evaluate the loss
    logits, loss = m(xb, yb)
    optimizer.zero_grad(set_to_none=True)
    loss.backward()
    optimizer.step()

print(loss.item())
2.787367343902588
Having trained the model, we can see what it outputs starting from an empty input. At this point, it has picked up some basic patterns in the placement of vowels, consonants, and spaces, and we see some outputs that almost resemble words.
In [29]:
# Naive output for the model
print(decode(m.generate(idx = torch.zeros((1, 1), dtype=torch.long), max_new_tokens=500)[0].tolist()))
\ tor s U$z)$ *}_{s $. scet ima_nds,
\st on stalorba2n-sscolpoore p acthofinsihesedioo Gisuss d ge perineena atwetoly
$ $\rs oe t{d LQ) &= L_bet(S^{(n oucr{E$\p WZ}. tumalu'')=Ded Dr $mot ma' taruc{ioig $
\re ipowh_-it{N$ \e tis \woo too\tred}(Z\rureq_1,δ1\ti
Thio prat_hifiz_\ndunthin $$ abmbde the t h {
Co X}(T Zetbon atruca g $ asctin ad{glimabdela$ $\iquphas, bdond ocatithima, \, d{rissuct S}_A,je iof O$ mangin re $\ouc$, F_{Th N$, map'(�&Weri$ WZ, S\Lererovaves.
\sct, "Lan So}
Self-attention
The previous results were unintelligible, obviously. There is only so much predictive power from knowing the previous 8 characters.
Self-attention is defined by the initially opaque equation: \[ \text{Attention} (Q, K, V) = \text{softmax} \left( \frac{QK^T}{\sqrt{d_k}} \right) V. \] In what follows, we will decipher the meaning of this equation.
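Before unpacking it piece by piece, here is the equation transcribed literally into PyTorch. This is a standalone sketch with randomly generated Q, K, V tensors (not the Head module built later), and it omits the causal masking that the following cells introduce.
import torch
import torch.nn.functional as F

torch.manual_seed(0)
B, T, d_k = 1, 8, 16            # batch, time, key/query dimension
Q = torch.randn(B, T, d_k)      # queries
K = torch.randn(B, T, d_k)      # keys
V = torch.randn(B, T, d_k)      # values

scores = Q @ K.transpose(-2, -1) / d_k**0.5  # the QK^T / sqrt(d_k) term, shape (B, T, T)
weights = F.softmax(scores, dim=-1)          # each row becomes a probability distribution
attention = weights @ V                      # weighted average of the values, shape (B, T, d_k)
print(attention.shape)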
In [30]:
torch.manual_seed(42)
# torch.tril masks a matrix to produce a lower triangular matrix
# Helpful to enforce characters are influenced only by preceding characters
a = torch.tril(torch.ones(3, 3))
a = a / torch.sum(a, 1, keepdim=True)
b = torch.randint(0,10,(3,2)).float()
c = a @ b
print('a=')
print(a)
print('--')
print('b=')
print(b)
print('--')
print('c=')
print(c)
a=
tensor([[1.0000, 0.0000, 0.0000],
[0.5000, 0.5000, 0.0000],
[0.3333, 0.3333, 0.3333]])
--
b=
tensor([[2., 7.],
[6., 4.],
[6., 5.]])
--
c=
tensor([[2.0000, 7.0000],
[4.0000, 5.5000],
[4.6667, 5.3333]])
In [31]:
torch.manual_seed(1337)
B,T,C = 4,8,2 # batch, time, channels
x = torch.randn(B,T,C)
x.shape
torch.Size([4, 8, 2])
In [32]:
# We define xbow, the 'bag of words' of x, via the average of previous characters, i.e.,
# xbow[b,t] = avg{x[b,i]}_{i <= t}
xbow = torch.zeros((B,T,C))
for b in range(B):
    for t in range(T):
        xprev = x[b,:t+1] # (t,C)
        xbow[b,t] = torch.mean(xprev, 0)
In [33]:
# We can achieve the same but more efficiently using matrix multiplication
wei = torch.tril(torch.ones(T, T))
wei = wei / wei.sum(1, keepdim=True)
print(wei)
xbow2 = wei @ x # (B, T, T) @ (B, T, C) ----> (B, T, C)
torch.allclose(xbow, xbow2)
tensor([[1.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000],
[0.5000, 0.5000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000],
[0.3333, 0.3333, 0.3333, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000],
[0.2500, 0.2500, 0.2500, 0.2500, 0.0000, 0.0000, 0.0000, 0.0000],
[0.2000, 0.2000, 0.2000, 0.2000, 0.2000, 0.0000, 0.0000, 0.0000],
[0.1667, 0.1667, 0.1667, 0.1667, 0.1667, 0.1667, 0.0000, 0.0000],
[0.1429, 0.1429, 0.1429, 0.1429, 0.1429, 0.1429, 0.1429, 0.0000],
[0.1250, 0.1250, 0.1250, 0.1250, 0.1250, 0.1250, 0.1250, 0.1250]])
False
In [34]:
# version 3: use Softmax
tril = torch.tril(torch.ones(T, T))
wei = torch.zeros((T,T))
wei = wei.masked_fill(tril == 0, float('-inf'))
wei = nn.functional.softmax(wei, dim=-1)
xbow3 = wei @ x
torch.allclose(xbow, xbow3)
False
In [35]:
# version 4: self-attention!
torch.manual_seed(1337)
B,T,C = 4,8,32 # batch, time, channels
x = torch.randn(B,T,C)

# let's see a single Head perform self-attention
head_size = 16
key = nn.Linear(C, head_size, bias=False)
query = nn.Linear(C, head_size, bias=False)
value = nn.Linear(C, head_size, bias=False)
k = key(x)   # (B, T, 16)
q = query(x) # (B, T, 16)
wei = q @ k.transpose(-2, -1) # (B, T, 16) @ (B, 16, T) ---> (B, T, T)

tril = torch.tril(torch.ones(T, T))
#wei = torch.zeros((T,T))
wei = wei.masked_fill(tril == 0, float('-inf'))
wei = nn.functional.softmax(wei, dim=-1)

v = value(x)
out = wei @ v
#out = wei @ x

out.shape
torch.Size([4, 8, 16])
In [36]:
wei[0]
tensor([[1.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000],
[0.1574, 0.8426, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000],
[0.2088, 0.1646, 0.6266, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000],
[0.5792, 0.1187, 0.1889, 0.1131, 0.0000, 0.0000, 0.0000, 0.0000],
[0.0294, 0.1052, 0.0469, 0.0276, 0.7909, 0.0000, 0.0000, 0.0000],
[0.0176, 0.2689, 0.0215, 0.0089, 0.6812, 0.0019, 0.0000, 0.0000],
[0.1691, 0.4066, 0.0438, 0.0416, 0.1048, 0.2012, 0.0329, 0.0000],
[0.0210, 0.0843, 0.0555, 0.2297, 0.0573, 0.0709, 0.2423, 0.2391]],
grad_fn=<SelectBackward0>)
In [37]:
k = torch.randn(B,T,head_size)
q = torch.randn(B,T,head_size)
# scaling by head_size**-0.5 keeps the variance of wei near 1 (see the variance checks and the softmax example below)
wei = q @ k.transpose(-2, -1) * head_size**-0.5
In [38]:
k.var()
tensor(1.0449)
In [39]:
q.var()
tensor(1.0700)
In [40]:
wei.var()
tensor(1.0918)
In [41]:
torch.softmax(torch.tensor([0.1, -0.2, 0.3, -0.2, 0.5]), dim=-1)
tensor([0.1925, 0.1426, 0.2351, 0.1426, 0.2872])
In [42]:
torch.softmax(torch.tensor([0.1, -0.2, 0.3, -0.2, 0.5])*8, dim=-1) # gets too peaky, converges to one-hot
tensor([0.0326, 0.0030, 0.1615, 0.0030, 0.8000])
In [43]:
class LayerNorm1d: # (used to be BatchNorm1d)

    def __init__(self, dim, eps=1e-5, momentum=0.1):
        self.eps = eps
        self.gamma = torch.ones(dim)
        self.beta = torch.zeros(dim)

    def __call__(self, x):
        # calculate the forward pass
        xmean = x.mean(1, keepdim=True) # mean over the features of each input
        xvar = x.var(1, keepdim=True) # variance over the features of each input
        xhat = (x - xmean) / torch.sqrt(xvar + self.eps) # normalize to unit variance
        self.out = self.gamma * xhat + self.beta
        return self.out

    def parameters(self):
        return [self.gamma, self.beta]

torch.manual_seed(1337)
module = LayerNorm1d(100)
x = torch.randn(32, 100) # batch size 32 of 100-dimensional vectors
x = module(x)
x.shape
torch.Size([32, 100])
In [44]:
x[:,0].mean(), x[:,0].std() # mean,std of one feature across all batch inputs
(tensor(0.1469), tensor(0.8803))
In [45]:
x[0,:].mean(), x[0,:].std() # mean,std of a single input from the batch, of its features
(tensor(-9.5367e-09), tensor(1.0000))
Raw Code
Code all in one place. Can run this as a single cell if you want!
In [46]:
import torch
import torch.nn as nn
from torch.nn import functional as F
# hyperparameters
batch_size = 16 # how many independent sequences will we process in parallel?
block_size = 64 # what is the maximum context length for predictions?
max_iters = 10000
eval_interval = 100
learning_rate = 1e-3
device = 'cuda' if torch.cuda.is_available() else 'cpu'
eval_iters = 200
n_embd = 64
n_head = 4
n_layer = 4
dropout = 0.0
# ------------

torch.manual_seed(1337)

with open('WOLFGANG_TRAINING.tex', 'r', encoding='utf-8') as f:
    text = f.read()

# here are all the unique characters that occur in this text
chars = sorted(list(set(text)))
vocab_size = len(chars)
# create a mapping from characters to integers
stoi = { ch:i for i,ch in enumerate(chars) }
itos = { i:ch for i,ch in enumerate(chars) }
encode = lambda s: [stoi[c] for c in s] # encoder: take a string, output a list of integers
decode = lambda l: ''.join([itos[i] for i in l]) # decoder: take a list of integers, output a string

# Train and test splits
data = torch.tensor(encode(text), dtype=torch.long)
n = int(0.9*len(data)) # first 90% will be train, rest val
train_data = data[:n]
val_data = data[n:]

# data loading
def get_batch(split):
    # generate a small batch of data of inputs x and targets y
    data = train_data if split == 'train' else val_data
    ix = torch.randint(len(data) - block_size, (batch_size,))
    x = torch.stack([data[i:i+block_size] for i in ix])
    y = torch.stack([data[i+1:i+block_size+1] for i in ix])
    x, y = x.to(device), y.to(device)
    return x, y

@torch.no_grad()
def estimate_loss():
    out = {}
    model.eval()
    for split in ['train', 'val']:
        losses = torch.zeros(eval_iters)
        for k in range(eval_iters):
            X, Y = get_batch(split)
            logits, loss = model(X, Y)
            losses[k] = loss.item()
        out[split] = losses.mean()
    model.train()
    return out

class Head(nn.Module):
    """ one head of self-attention """

    def __init__(self, head_size):
        super().__init__()
        self.key = nn.Linear(n_embd, head_size, bias=False)
        self.query = nn.Linear(n_embd, head_size, bias=False)
        self.value = nn.Linear(n_embd, head_size, bias=False)
        self.register_buffer('tril', torch.tril(torch.ones(block_size, block_size)))
        self.dropout = nn.Dropout(dropout)

    def forward(self, x):
        B,T,C = x.shape
        k = self.key(x)   # (B,T,C)
        q = self.query(x) # (B,T,C)
        # compute attention scores ("affinities")
        wei = q @ k.transpose(-2,-1) * C**-0.5 # (B, T, C) @ (B, C, T) -> (B, T, T)
        wei = wei.masked_fill(self.tril[:T, :T] == 0, float('-inf')) # (B, T, T)
        wei = F.softmax(wei, dim=-1) # (B, T, T)
        wei = self.dropout(wei)
        # perform the weighted aggregation of the values
        v = self.value(x) # (B,T,C)
        out = wei @ v # (B, T, T) @ (B, T, C) -> (B, T, C)
        return out

class MultiHeadAttention(nn.Module):
    """ multiple heads of self-attention in parallel """

    def __init__(self, num_heads, head_size):
        super().__init__()
        self.heads = nn.ModuleList([Head(head_size) for _ in range(num_heads)])
        self.proj = nn.Linear(n_embd, n_embd)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x):
        out = torch.cat([h(x) for h in self.heads], dim=-1)
        out = self.dropout(self.proj(out))
        return out

class FeedFoward(nn.Module):
    """ a simple linear layer followed by a non-linearity """

    def __init__(self, n_embd):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_embd, 4 * n_embd),
            nn.ReLU(),
            nn.Linear(4 * n_embd, n_embd),
            nn.Dropout(dropout),
        )

    def forward(self, x):
        return self.net(x)

class Block(nn.Module):
    """ Transformer block: communication followed by computation """

    def __init__(self, n_embd, n_head):
        # n_embd: embedding dimension, n_head: the number of heads we'd like
        super().__init__()
        head_size = n_embd // n_head
        self.sa = MultiHeadAttention(n_head, head_size)
        self.ffwd = FeedFoward(n_embd)
        self.ln1 = nn.LayerNorm(n_embd)
        self.ln2 = nn.LayerNorm(n_embd)

    def forward(self, x):
        x = x + self.sa(self.ln1(x))
        x = x + self.ffwd(self.ln2(x))
        return x

# super simple bigram model
class wolfgang_LLM(nn.Module):

    def __init__(self):
        super().__init__()
        # each token directly reads off the logits for the next token from a lookup table
        self.token_embedding_table = nn.Embedding(vocab_size, n_embd)
        self.position_embedding_table = nn.Embedding(block_size, n_embd)
        self.blocks = nn.Sequential(*[Block(n_embd, n_head=n_head) for _ in range(n_layer)])
        self.ln_f = nn.LayerNorm(n_embd) # final layer norm
        self.lm_head = nn.Linear(n_embd, vocab_size)

    def forward(self, idx, targets=None):
        B, T = idx.shape

        # idx and targets are both (B,T) tensor of integers
        tok_emb = self.token_embedding_table(idx) # (B,T,C)
        pos_emb = self.position_embedding_table(torch.arange(T, device=device)) # (T,C)
        x = tok_emb + pos_emb # (B,T,C)
        x = self.blocks(x) # (B,T,C)
        x = self.ln_f(x) # (B,T,C)
        logits = self.lm_head(x) # (B,T,vocab_size)

        if targets is None:
            loss = None
        else:
            B, T, C = logits.shape
            logits = logits.view(B*T, C)
            targets = targets.view(B*T)
            loss = F.cross_entropy(logits, targets)

        return logits, loss

    def generate(self, idx, max_new_tokens):
        # idx is (B, T) array of indices in the current context
        for _ in range(max_new_tokens):
            # crop idx to the last block_size tokens
            idx_cond = idx[:, -block_size:]
            # get the predictions
            logits, loss = self(idx_cond)
            # focus only on the last time step
            logits = logits[:, -1, :] # becomes (B, C)
            # apply softmax to get probabilities
            probs = F.softmax(logits, dim=-1) # (B, C)
            # sample from the distribution
            idx_next = torch.multinomial(probs, num_samples=1) # (B, 1)
            # append sampled index to the running sequence
            idx = torch.cat((idx, idx_next), dim=1) # (B, T+1)
        return idx

model = wolfgang_LLM()
m = model.to(device)
# print the number of parameters in the model
print(sum(p.numel() for p in m.parameters())/1e6, 'M parameters')

# create a PyTorch optimizer
optimizer = torch.optim.AdamW(model.parameters(), lr=learning_rate)

for iter in range(max_iters):

    # every once in a while evaluate the loss on train and val sets
    if iter % eval_interval == 0 or iter == max_iters - 1:
        losses = estimate_loss()
        print(f"step {iter}: train loss {losses['train']:.4f}, val loss {losses['val']:.4f}")

    # sample a batch of data
    xb, yb = get_batch('train')

    # evaluate the loss
    logits, loss = model(xb, yb)
    optimizer.zero_grad(set_to_none=True)
    loss.backward()
    optimizer.step()

# generate from the model
context = torch.zeros((1, 1), dtype=torch.long, device=device)
print(decode(m.generate(context, max_new_tokens=2000)[0].tolist()))
0.216163 M parameters
step 0: train loss 4.8271, val loss 4.8334
step 100: train loss 3.0829, val loss 3.1061
step 200: train loss 2.8736, val loss 2.9270
step 300: train loss 2.7634, val loss 2.8190
step 400: train loss 2.6443, val loss 2.7482
step 500: train loss 2.5412, val loss 2.6754
step 600: train loss 2.4129, val loss 2.5889
step 700: train loss 2.2853, val loss 2.5031
step 800: train loss 2.1972, val loss 2.4321
step 900: train loss 2.1018, val loss 2.3783
step 1000: train loss 2.0110, val loss 2.3254
step 1100: train loss 1.9628, val loss 2.2730
step 1200: train loss 1.9165, val loss 2.2424
step 1300: train loss 1.8705, val loss 2.2132
step 1400: train loss 1.8289, val loss 2.1685
step 1500: train loss 1.7821, val loss 2.1482
step 1600: train loss 1.7548, val loss 2.1220
step 1700: train loss 1.7397, val loss 2.1116
step 1800: train loss 1.7028, val loss 2.0718
step 1900: train loss 1.6819, val loss 2.0624
step 2000: train loss 1.6528, val loss 2.0338
step 2100: train loss 1.6457, val loss 2.0340
step 2200: train loss 1.6332, val loss 2.0217
step 2300: train loss 1.6139, val loss 2.0074
step 2400: train loss 1.5900, val loss 2.0060
step 2500: train loss 1.5812, val loss 1.9844
step 2600: train loss 1.5582, val loss 1.9894
step 2700: train loss 1.5602, val loss 1.9646
step 2800: train loss 1.5339, val loss 1.9552
step 2900: train loss 1.5371, val loss 1.9606
step 3000: train loss 1.5207, val loss 1.9484
step 3100: train loss 1.5069, val loss 1.9419
step 3200: train loss 1.5063, val loss 1.9331
step 3300: train loss 1.4954, val loss 1.9216
step 3400: train loss 1.4831, val loss 1.9340
step 3500: train loss 1.4764, val loss 1.9227
step 3600: train loss 1.4586, val loss 1.9143
step 3700: train loss 1.4632, val loss 1.8919
step 3800: train loss 1.4521, val loss 1.9127
step 3900: train loss 1.4434, val loss 1.8909
step 4000: train loss 1.4355, val loss 1.8856
step 4100: train loss 1.4277, val loss 1.8673
step 4200: train loss 1.4235, val loss 1.8965
step 4300: train loss 1.4221, val loss 1.8726
step 4400: train loss 1.4249, val loss 1.8771
step 4500: train loss 1.4029, val loss 1.8744
step 4600: train loss 1.4124, val loss 1.8709
step 4700: train loss 1.3969, val loss 1.8486
step 4800: train loss 1.4056, val loss 1.8457
step 4900: train loss 1.3909, val loss 1.8585
step 5000: train loss 1.3914, val loss 1.8445
step 5100: train loss 1.3928, val loss 1.8413
step 5200: train loss 1.3803, val loss 1.8388
step 5300: train loss 1.3799, val loss 1.8704
step 5400: train loss 1.3802, val loss 1.8549
step 5500: train loss 1.3716, val loss 1.8478
step 5600: train loss 1.3795, val loss 1.8505
step 5700: train loss 1.3680, val loss 1.8476
step 5800: train loss 1.3554, val loss 1.8403
step 5900: train loss 1.3514, val loss 1.8214
step 6000: train loss 1.3598, val loss 1.8366
step 6100: train loss 1.3657, val loss 1.8369
step 6200: train loss 1.3599, val loss 1.8375
step 6300: train loss 1.3558, val loss 1.8100
step 6400: train loss 1.3452, val loss 1.7913
step 6500: train loss 1.3359, val loss 1.8332
step 6600: train loss 1.3323, val loss 1.8241
step 6700: train loss 1.3402, val loss 1.8393
step 6800: train loss 1.3321, val loss 1.8248
step 6900: train loss 1.3251, val loss 1.8137
step 7000: train loss 1.3207, val loss 1.8105
step 7100: train loss 1.3095, val loss 1.7979
step 7200: train loss 1.3389, val loss 1.8208
step 7300: train loss 1.3211, val loss 1.7852
step 7400: train loss 1.3193, val loss 1.8113
step 7500: train loss 1.3185, val loss 1.8143
step 7600: train loss 1.3173, val loss 1.8183
step 7700: train loss 1.3191, val loss 1.8131
step 7800: train loss 1.3206, val loss 1.7923
step 7900: train loss 1.3022, val loss 1.7816
step 8000: train loss 1.3022, val loss 1.7949
step 8100: train loss 1.2953, val loss 1.8006
step 8200: train loss 1.2989, val loss 1.8067
step 8300: train loss 1.2898, val loss 1.8000
step 8400: train loss 1.2928, val loss 1.8070
step 8500: train loss 1.2860, val loss 1.8001
step 8600: train loss 1.2986, val loss 1.8043
step 8700: train loss 1.2869, val loss 1.8100
step 8800: train loss 1.2833, val loss 1.7854
step 8900: train loss 1.2935, val loss 1.8020
step 9000: train loss 1.2798, val loss 1.8097
step 9100: train loss 1.2862, val loss 1.7941
step 9200: train loss 1.2801, val loss 1.7839
step 9300: train loss 1.2780, val loss 1.8103
step 9400: train loss 1.2906, val loss 1.8046
step 9500: train loss 1.2822, val loss 1.8033
step 9600: train loss 1.2734, val loss 1.7810
step 9700: train loss 1.2714, val loss 1.8044
step 9800: train loss 1.2748, val loss 1.7960
step 9900: train loss 1.2643, val loss 1.8026
step 9999: train loss 1.2674, val loss 1.7782
\; \bigcup_{i'\in U'}_{\lambda} (v_y\times W_{x',0,k}).
\item
Assume that $w=\Theta)
\rightarrow (a,v)=\{0,1\}$.
To bumpple, that this properts closed to a parameter a correspond $Z$\bm{W}$, where there $\Phi_x$.
\qed
\end{$(I-s_a,o}, D_j$.
\sum_{A}
\ssc^\infty\in W\setminus \oplus W\Q^+=\wh{e}^2&@<\arrow{\Gamma^\ast}|_{x_{a_1\wh{\neq_x^*}_2 (a,v,\tau(g'),l(o_i)\to \abs{\tau (0)} \leq \a@ \mathscr{C}^{-1}^{\iota}\phi^{\tau_0}} \times{WZ+HWZ8}(\sigma)\rightarrow \mu (\tau)\circ T (-\bf deffer to $t_X\rightarrow {\mathscr{S}$}_t {t_x}(y_{x_p})$. If those proved that $E\leq m$ local brive isomorphism, such. $X\subset W$ with $f>P_a(a,o_a)$
where are istension. Apparacompact maps reall $\zi$ and $\lambda)'\to (\Gamma_a)\circ\frac{1}(F, \alpha^+,\alpha^+,-)$, defined to the image filted in $C_x$ has a tangent of $I$\xi$ holows fromov-wing-fiber-compactness prove that $k=\sigma$.
\qed
\end{definition}
Here tangent $f'_k$ conclude type, the map $(U, \phi^{-1}): \pi
\cU| \mathscr{C}^{-1}(\cF)\to O^\pm\in \cP(\cW^{}_a; \rightarrow \abs{ b}_X :\Tti-(\rT_x \bigl|) \cdot, t_H^* \cap (p_-, A_+)$ and we such use that with are isots that solutions hysuality for the sensions of $\tau$ of a zero defined only
of next the point solution provese $\abs{\beta(a)}$ also linear operators,
which, ands and the map \rho$ are $ is b.
Hance we have this sc-smoothly of $X'_{{0,1}}\oplus W$ of ep-groupoidal $z\in \w-hat{s}_{\iota}'(\Sigma)$) is sc-smooth, which nece orbifold with the equest closed on $\delbar_x=\ast\ov\circ s^{t,y$.
\item[(\beta) \begin{le}\ \\ over(y)| \alpha \in F^\ast_{x,y} +h}^+\ast$ to equ
see. the morphism
$$
w_{\ast\colon X} =(\tau (\bigl( \phi & \wh{V}_\ast) (\big )\circ d_{m+i}(TD,x',\tau,g)
(\phi, \psi \exp_p \bm{M}(\phi, \beta, {\phi})\rightarrow [0,1] \to \sum_{X_x'$-acts of a stable.
In bhoose a finite-section be cholowse which associated for whoosen \ref{rm-stratned(3)-suborbifold}}} we have metrizable
many, and the indices recalized t