This is my first time using LLM embeddings for a program. The goal is to get magic cards similar to a card I’m searching for, which I do quite often.
Magic cards are quite similar. For example:
Llanowar Elves
Elvish Mystic
Fyndhorn Elder
I’d like to use an embedding to
Search for a card by name
Find cards similar to that
I used an off-the-shelf model (all-MiniLM-L6-v2), downloaded card data from scryfall, then encoded cards like so:
# format
name|type_line|mana_cost|oracle_text
# example
'Searslicer Goblin'|'Creature — Goblin Warrior'|'{1}{R}'|'Raid — At the beginning of your end step, if you attacked this turn, create a 1/1 red Goblin creature token.'
I used Chat-GPT to scaffold the data processing and encoding scripts.
In other words, the model will turn that string into a vector. Ideally, similar strings (i.e. cards) will be near eachother.
Here I set everything up. I wanted to try using duckdb to query the embeddings.
Code
import pandas as pdimport duckdbimport numpy as npimport jsonfrom pathlib import Pathfrom sentence_transformers import SentenceTransformer# Load filesdata_dir = Path('../../.data')embeddings = np.load(data_dir/"embeddings.npy")withopen(data_dir/"id_map.json") as f: id_map = json.load(f)# load modelmodel = SentenceTransformer("all-MiniLM-L6-v2")# data prepemb_dim = embeddings.shape[1]card_embedding = ( pd.DataFrame([ {"id": card["id"], "embedding": vec.tolist()}for card, vec inzip(id_map, embeddings) ]))parquet_file=data_dir /'oracle-cards-20250425090226.parquet'# databasedb = duckdb.connect(":memory:")db.execute(f""" INSTALL vss;LOAD vss;create table card_embedding as select id , embedding::DOUBLE[{emb_dim}] as embedding from card_embedding;create table card as select * from '{parquet_file}';""")# how many cards?db.sql('select count(*) as num_cards from card inner join card_embedding using(id)')
Llanowar Tribe│ {G}{G}{G} │ {T}: Add {G}{G}{G}. is the “top” result
Fyndhorn Elves │ {G} │ {T}: Add {G}. and other cards functionally identical to Llanowar Elves are rated lower.
I think this happens because I encode the name of the card in the embedding, which I chose to make search easier.
Perhaps I should make two embeddings if my use-case is scoped this narrowly.
The tradeoff is that I would need another API to do generalized search e.g. “{T}: Add {G}”
Extending
We can make the query more flexible to account for different contexts. With this we combine the power of traditional querying with vector search.
Code
def sql_array_distance_from_search(conn, model, search_term):id= fetch_top_search_id(conn, model, search_term) ref_vec = ( conn.sql(f"select embedding from card_embedding where id='{id}'") .fetchone()[0] ) emb_dim =len(ref_vec) ref_vec_str ="ARRAY["+",".join(f"CAST({x} AS DOUBLE)"for x in ref_vec) +"]"returnf"array_distance(embedding, {ref_vec_str}::DOUBLE[{emb_dim}])"db.sql(f""" select name , oracle_text , {sql_array_distance_from_search(db, model, 'when you cycle')} as distfrom card_embeddingleft join card using(id)where 1=1 and layout = 'normal' and cmc = 2order by dist asclimit 9""")
┌────────────────────────┬─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┬───────────────────────┐
│ name │ oracle_text │ dist │
│ varchar │ varchar │ double │
├────────────────────────┼─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┼───────────────────────┤
│ Drannith Healer │ Whenever you cycle another card, you gain 1 life.\nCycling {1} ({1}, Discard this card: Draw a card.) │ 4.804520544858501e-17 │
│ Drannith Stinger │ Whenever you cycle another card, this creature deals 1 damage to each opponent.\nCycling {1} ({1}, Discard this card: Draw a card.) │ 0.7654649610995645 │
│ Witherbloom Apprentice │ Magecraft — Whenever you cast or copy an instant or sorcery spell, each opponent loses 1 life and you gain 1 life. │ 0.9166975304148206 │
│ Blessed Wine │ You gain 1 life.\nDraw a card at the beginning of the next turn's upkeep. │ 0.9202103106272829 │
│ Revitalize │ You gain 3 life.\nDraw a card. │ 0.9227281467739591 │
│ Disciple of Law │ Protection from red\nCycling {2} ({2}, Discard this card: Draw a card.) │ 0.9279500413320395 │
│ Soul Shepherd │ {W}, Exile a creature card from your graveyard: You gain 1 life. │ 0.9443892064122495 │
│ Syndic of Tithes │ Extort (Whenever you cast a spell, you may pay {W/B}. If you do, each opponent loses 1 life and you gain that much life.) │ 0.950814787176584 │
│ Samite Herbalist │ Whenever this creature becomes tapped, you gain 1 life and scry 1. (Look at the top card of your library. You may put that card on the bottom.) │ 0.9514616126869583 │
└────────────────────────┴─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┴───────────────────────┘
What’s next?
I think I need better recommendations before I go any further. The results need to be “good enough,” but they’re not quite there. I’ll have to look into what it takes to refine the model for my use-case.
Making an app (probably a simple webapp) would be the most helpful interface. Even 5-10 recommendations would be helpful.