AI • Data • LLM
August 24
Tips on improving your LLM's training, with Pokémon, again.
Last week we talked about training an LLM and how to me it was similar to Pokémon. Well, we are back from the Safari Zone again, me and all my Pokémon and LLM knowledge are going for the championship. What do I mean by this?
1) I love analogies and Pokémon
2) We'll discuss best practices for
LLM documentation
If you are completely new to LLMs and just need to understand how they work on a surface level, I recommend my blog post about LLMs or even my case study on how I implemented one. Today's focus is nitty-gritty details. In Pokémon terms, we aren't playing just for fun, we are here to win competitive tournaments. That involves looking at stats and calculating Effort Values (EVs).
So, let's pretend we are training an LLM to talk about Pokémon. Ready to get very technical today?
On my way to catch some Pokémon.
At least 150 or more to see...
First some technical details.
The smallest piece of information LLMs handle
are called tokens. These can be characters,
words, or sentences. The way they are calculated
depends on the method the LLM is using to
process words and turn them into numbers that it can
more easily understand. You will see tools like
ChatGPT or Lovable having a limit of
how many tokens you can use unless you upgrade to a
paid plan.
Since LLMs can process only a limited amount of information at a time, the best thing to do is to break down your information into smaller pieces. If you want to learn more about this process, I recommend this article on Medium →.
The amount processed works for both ways, both from the user's query and on the files the LLM is drawing from to answer. For documentation purposes we just need to focus on the training information being digestible for our LLM.
In our case my LLM could only pull data from around 3 files, so I was very careful with my information architecture I didn't want to break down my data into a million pieces. To make sure my structure worked, a data engineer added a column that indicated the token count of my file plus a visual indicator if we were going over the limit. This way I could quickly scan all the files as I scrolled through the corpus. Ideally, I should stay around 1000 tokens.
Ok, now let's use an example to land these concepts in Pokémon terms. What if I wanted a single file that listed all the Pokémon names? Should we go over the token limit? (By the way, the total discovered Pokémon as of summer 2025 is 1,025. )
I would prefer to break them down into alphabetical entries, something like “Pokémon Names A.” That way, if we are looking for my boy Haunter, we can skip straight to H. Here’s how I would structure it:
[English name] [Pokémon type] [Generation] [Evolutions Y/N]
What other relevant information could we add? Maybe more details about evolutions or the Pokémon’s ID number. But since we want to keep our lists under the token limit, we need to balance the amount of information. It’s ok if a file ends up being smaller than others — we need to account for categories where we might have more data.
What happens if the file goes over the
token limit? It would cut off. The
LLM would only read up until the token limit,
so if crucial information was at the end of the
file, it wouldn’t show up. How would you know how to
get a Haunter in the game???
Better break down your data.
Image credit: How to get Haunter in Pokemon Gold/Silver/Crystal [#093] by Mr. Pokédex
Ok, you might notice a pattern — I love Ghost types.
A new Pokémon type has been discovered!
When Steel and Dark type Pokémon dropped during the second generation release, I was ecstatic. My beautiful Sneasel was added to the list of Pokémon, and I had to have her in my team!
But how do we add this new information to our documentation? Here is where planning ahead will save you. It’s easy to organize everything if your information pool won’t grow. Sadly, that’s almost never the case. We have to be flexible and be prepared for when new data comes our way.
Let’s think of a concrete example: what if I wanted to teach the LLM about Eevee and their evolutions? My strategy would be to create a single file per Pokémon that contained its Pokédex entry: name, evolution, brief description, type, move list, etc. This works much better than having a single file with ALL the information of ALL the Eevee evolutions. Even if a single entry feels easier to update, it would be way too long, and the LLM wouldn’t process it efficiently.
I could also create a specific entry that explained the Eevee evolutionary tree in case a user wanted to learn about this topic. That way, the LLM would prefer drawing from this file instead of opening every single file on Eevee, Flareon, Vaporeon, Jolteon... etc.
So, if we wanted to add Umbreon as a new Eevee evolution, our to-do list would look like this:
Image credit: Pokémon Database, Umbreon
Honestly I couldn't decided between Espeon and Umbreon but I had to go for the new type.
Pikachu, Pikachu, Pikachu
Growing up watching Pokémon in Spanish and playing the games in English, it never occurred to me Pokémon had different names in other regions. Bulbasaur was the same guy on both sides of the border, but across the ocean, in Germany, he is known as Bisasam. I was shocked too; sometimes when me and my favorite engineer talk about Pokémon we need to Google who we are talking about.
However, this is not the case for Pikachu. Pikachu is called Pikachu here in Europe, on the American continent, and even in its native Japan.
So what happens if we want to implement our LLM in other regions?
This puzzle took me a while to figure out— as I mentioned in my previous article (read it here), I wanted to avoid translating all the information into all the languages and then having to edit a cascade of files when we had to make edits. Can you imagine? The issues with version control, having to reach out to translators to edit all the relevant files in their language, the cost, the logistical nightmare that would land a critical hit?
So how did I go about it? I tried a two-part strategy; let’s use Charmander as an example.
In the documentation I would specify all the language variations, something like this:
[Pokédex entry in English: description, type, evolutions, move list, etc.]
In the pre-prompt I added the specification that the LLM should respond in the user's language and it should use the translation specified, thus avoiding that we would get a literal translation of Charmander— idk, you could end up with a Mexican Charmander being called Armando or something.
With this strategy a German user who wanted to know when Glumanda learns Flammenwurf, the LLM would pull the information in English (Charmander learns Flamethrower at level 24) and deliver a correct answer in their language. Toll!
Quick disclaimer: this strategy wasn't bulletproof; it worked in initial testing, but we had to make the pre-prompt very robust and the documentation very streamlined. Again, more testing is needed, but this strategy might put you on the right path, particularly if you are low on resources and end up being the single LLM Pokémon trainer.
Oh no, I look too sleepy from cycling all night to catch a Haunter...
Working with new tools is always exciting — you are discovering things, experimenting, and trying new approaches almost daily. It can be a bit lonely, if I’m honest. Not having anyone else working on the documentation and not finding any LLM training best practices at the time was rough. So I hope these PokeTips help future trainers.
The technology is still evolving day to day, and things will require a bit of trial and error. Let’s keep testing together and find new strategies. It can be exciting, no? Thinking you might discover a new Pokémon as you go. Maybe we can help each other become LLM champions.
What do you think? What should I focus on next?
Who is your favorite Pokémon? Did you ever train with EVs or went competitive? Are you also surprised at different Pokémon names? I can't get over that confusion...
Let me know—shoot me an email! 😊
📩
sifuentesanita@gmail.com