Digital Twin (my playground)

2024-02-25

Artificial Intelligence, Coding, Programming

I am embarking on a side project that involves memory and multimodal understanding for an assistant so I wanted to start exploring the technologies involved. My “Digital Twin” is one of my playgrounds for learning and practicing with relevant technologies such as Python, LLMs, Langchain, Ollama, etc. I also started using Swift but that’s a story for another post.

The idea behind the Digital Twin is simple. I asked myself whether I could build a chat bot that would answer career- or education-related questions as if it were me. It would use information from my LinkedIn profile and/or my web pages. Could I use available LLMs to build such a digital twin?

My goal was to learn the technologies involved, not to release a polished or even functional chatbot.

I am sharing some of these learnings from my short journey so far in case others find them useful. Also, I welcome any pointers to mistakes I made and suggestions for improvements or alternative technologies/approaches that I should consider.

First things first… everything I have done can be found on my Digital Twin github project.

The concept

The idea was for my digital twin chatbot to answer questions such as these:

human	digital twin
What is your last name?	Parastatidis
Where do you currently work and what is your role?	I currently work at Meta as a Software Engineer.
What’s the topic of your PhD	The title of my PhD thesis is “Run-time Support for Parallel Object-Oriented Computing”.

You get the idea.

The setup

I wanted to get some hands-on experience with langchain and try out some of the existing open source LLMs. I used Ollama so that I didn’t have to worry about hosting and interacting with those LLMs.

I had the sources of information that I wanted my digital twin to consider… the web pages on this web site (frontpage, about, CV, and contact) and my LinkedIn profile.

I thought of trying out two different langchain configurations (chains): one that utilized a vector database (data | vectordb | prompt | llm) and one that placed the documents directly into the LLM prompt as context (data | prompt | llm). Unfortunately, some of the LLMs had issues with the large context I was generating so I postponed the second chain for a later investigation.
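To make the difference between the two chain shapes concrete, here is a minimal sketch in plain Python. It is not the langchain code from my project: the word-overlap `retrieve` function is a toy stand-in for a real vector database, and `build_prompt`, `retrieval_chain`, `stuffing_chain`, and `echo_llm` are hypothetical names used only for illustration.

```python
from typing import Callable, List


def build_prompt(context: str, question: str) -> str:
    """Combine the context and the question into a single prompt string."""
    return f"Context:\n{context}\n\nQuestion: {question}"


def retrieve(documents: List[str], question: str, k: int = 2) -> List[str]:
    """Toy stand-in for a vector database: rank documents by shared words."""
    question_words = set(question.lower().split())
    ranked = sorted(
        documents,
        key=lambda doc: len(question_words & set(doc.lower().split())),
        reverse=True,
    )
    return ranked[:k]


def retrieval_chain(documents: List[str], question: str,
                    llm: Callable[[str], str], k: int = 2) -> str:
    """data | vectordb | prompt | llm: only the top-k documents reach the prompt."""
    context = "\n".join(retrieve(documents, question, k))
    return llm(build_prompt(context, question))


def stuffing_chain(documents: List[str], question: str,
                   llm: Callable[[str], str]) -> str:
    """data | prompt | llm: every document is placed in the prompt."""
    context = "\n".join(documents)
    return llm(build_prompt(context, question))


def echo_llm(prompt: str) -> str:
    """Placeholder for a real model call; it only reports the prompt size."""
    return f"[{len(prompt)} prompt characters]"


if __name__ == "__main__":
    docs = [
        "I work at Meta",
        "My PhD thesis was on parallel computing",
        "Contact me by email",
    ]
    print(retrieval_chain(docs, "Where do you work", echo_llm, k=1))
    print(stuffing_chain(docs, "Where do you work", echo_llm))
```

The second chain always produces the larger prompt, which is consistent with the context-size issues mentioned above.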

I created a dataset of 30 questions. I wrote 10 questions myself. I asked Microsoft’s Copilot (using GPT4) to generate 10 education-related and 10 career-related questions. This is the prompt I used for the former.

List 10 questions that you would ask someone about their education. Focus on information that is usually found in a CV.

You can find all 30 questions on github: questions.txt. In retrospect, I should have done a better job at curating the questions to ensure that the first experiment was easier on the chatbot (spoiler alert: it was a disaster).

The approach

I ran an experiment with all 30 questions under different configurations:

Context: Web pages only, LinkedIn information only, Web pages + LinkedIn information
LLMs: Llama2-7B, Llama2-13B, Gemma-2B, Gemma-7B, Mistral, Mixtral

I then evaluated the quality of every response (that’s 30 x 3 x 6 = 540 questions/answers). I used a scale of 0-4 to rate how accurate each answer was, from made up (0) to the right information (4). I also used the same scale to evaluate whether the LLM’s response could be believed as that of a digital twin.
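The arithmetic above is just the size of the full grid of configurations. As a small sketch, the grid can be enumerated with `itertools.product`. The question strings are placeholders (the real ones are in questions.txt), and `experiment_grid` and `validate_score` are hypothetical helper names.

```python
from itertools import product

CONTEXTS = ["web pages", "LinkedIn", "web pages + LinkedIn"]
MODELS = ["Llama2-7B", "Llama2-13B", "Gemma-2B", "Gemma-7B", "Mistral", "Mixtral"]
# Placeholder questions; the real ones live in questions.txt.
QUESTIONS = [f"question {i}" for i in range(1, 31)]


def experiment_grid(contexts, models, questions):
    """Return every (context, model, question) combination to run."""
    return list(product(contexts, models, questions))


def validate_score(score: int) -> int:
    """Reject annotation scores outside the 0-4 scale."""
    if not 0 <= score <= 4:
        raise ValueError(f"score must be between 0 and 4, got {score}")
    return score


runs = experiment_grid(CONTEXTS, MODELS, QUESTIONS)
print(len(runs))  # 540
```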

The implementation

I used the following as the prompt to all the LLMs.

You are an AI that answers questions as if you were the human called Savas Parastatidis. You always respond in the first person. Be brief. Use one sentence answers when possible. It is ok to say "I don't know" if you don't know the answer. Only use information that you find in the provided context.

The code uses Mistral for generating the embeddings to populate the vector database from the respective set of documents. The given Web pages are downloaded. I manually downloaded my LinkedIn Profile and converted the CSV files to JSON (which might have been a mistake). Then the code creates the langchain chain and every question is sent for each separate configuration.
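For reference, the CSV-to-JSON step can be done with the standard library alone. This is a minimal sketch, not the conversion I actually ran: the column names `company` and `title` are made up for illustration and are not the real LinkedIn export headers.

```python
import csv
import io
import json


def csv_to_json(csv_text: str) -> str:
    """Convert CSV text with a header row into a JSON array of objects."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    return json.dumps(rows, indent=2, ensure_ascii=False)


if __name__ == "__main__":
    # Hypothetical columns, used only to show the shape of the output.
    sample = "company,title\nMeta,Software Engineer\n"
    print(csv_to_json(sample))
```

Each CSV row becomes one JSON object keyed by the header names, so the result is structured data rather than prose, which may be part of why it worked poorly as retrieval context.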

Results

It was a disaster. Most of the answers were inaccurate. On many occasions they were hallucinated and they were long. The chatbot did get a few things right but only the easy ones. To be fair to the chatbot, some of the questions did invite hallucination since the information needed to answer them wasn’t present in the context. But then again, the LLMs were instructed not to make things up. I need to dive further into why my explicit instructions in the prompt weren’t followed (“Be brief. Use one sentence answers when possible. It is ok to say ‘I don’t know’ if you don’t know the answer. Only use information that you find in the provided context.”). I just need to spend a bit more time on that.

Here are some representative results.

What is your current job?	I am a Research Engineer at Meta’s Reality Lab Research.	✅
What is the highest degree you have received?	The highest degree I have received is a Ph.D. in Computer Science from the University of Sheffield, UK.	❌
What is your highest level of education and what degree or certificate did you obtain?	I have a PhD in Computer Science.	✅
How long have you been working in this position and what are your achievements?	I have been working as a Research Engineer at Meta‘s Reality Lab Research for an undisclosed period. Before that, I worked at Amazon as a Senior Principal Engineer in Alexa AI and was the architect and co-creator of Cortana. My achievements include building technologies that support the natural interaction between humans and AI-powered experiences, and working on language/conversational systems for about 11 years.	not that bad but still ❌

Model: Mixtral, Context: Web Pages only

I tried out all the questions under the different configurations on my MacBook Pro M1 Max (64GB RAM). Ollama can use the GPU to host the LLMs. I measured the time it took to answer each question.

I was impressed by how fast Gemma appeared to be but then I saw the answers it generated (they were short but bad). The average time for an answer for Mistral might be higher because of an outlier question. I just need to run the experiment multiple times and remove the outliers.
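One possible way to do that, sketched below, is to time each call with `time.perf_counter` and drop samples outside 1.5 times the interquartile range before averaging. The IQR rule is my choice for this sketch and only one of several reasonable options. The timing values are invented for illustration and are not measurements from my experiment, and `time_call` and `drop_outliers` are hypothetical helper names.

```python
import statistics
import time


def time_call(fn, *args):
    """Run fn(*args) and return (result, elapsed seconds)."""
    start = time.perf_counter()
    result = fn(*args)
    return result, time.perf_counter() - start


def drop_outliers(samples, factor=1.5):
    """Keep samples within `factor` interquartile ranges of the quartiles."""
    q1, _, q3 = statistics.quantiles(samples, n=4)
    iqr = q3 - q1
    low, high = q1 - factor * iqr, q3 + factor * iqr
    return [s for s in samples if low <= s <= high]


if __name__ == "__main__":
    # Invented timings in seconds, with one obvious outlier.
    timings = [1.0, 1.1, 0.9, 1.2, 1.0, 1.1, 0.95, 1.05, 30.0]
    cleaned = drop_outliers(timings)
    print(statistics.mean(timings), statistics.mean(cleaned))
```

Reporting the median alongside the mean would also make a single slow question less visible in the summary.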

Here’s what Gemma generated for many of the questions:

What is your current job?

The context does not provide any information about Savas Parastatidis’ current job, so I cannot answer this question from the provided context.

❌

Model: Gemma-2b, Context: Web pages only

I probably need to spend some more time with Gemma models in case they need different prompts or they require some additional configuration. It’s not that Gemma didn’t answer any of the questions:

What is the highest degree you have received?

Sure, here is the answer to the question:

The text states that Savas Parastatidis has received a Master’s degree and a PhD. Therefore, the answer to the question is “Master’s and PhD”.

Not that bad but still ❌

Model: Gemma-7b, Context: Web pages only

After I annotated all 540 answers, I was going to compare the models based on the quality of their generated answers under the different configurations. However, given how many of the answers missed the mark, I don’t think that such a comparison is going to be interesting at this stage.

Potential next steps

I have a few ideas about what to try next if I find time…

Get LLMs to produce natural language descriptions of the JSON files with my LinkedIn information and then import those into the vector DB.
Answer a set of 10-20 questions (perhaps generated by an LLM) manually. Then use the questions and answers as additional context into the vector DB. Actually, that’s a nice extension to the digital twin… the digital twin can ask those questions as a way of getting to know the human it represents.
Provide example question-answer pairs to the LLMs so that their responses aren’t all over the place. Just have a look at the Excel workbook with all the results.
Figure out how to fine-tune a model and whether that would be useful in this case.
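As a rough sketch of the example question-answer idea, a few-shot prompt could be assembled like this. I have not tried it yet, so treat it as an assumption about what might help. The layout (the "Examples:", "Context:", "Question:", and "Answer:" labels) and the `build_few_shot_prompt` name are hypothetical; the system prompt and example pairs are taken from earlier in this post.

```python
from typing import List, Tuple

SYSTEM_PROMPT = (
    "You are an AI that answers questions as if you were the human called "
    "Savas Parastatidis. You always respond in the first person. Be brief. "
    "Use one sentence answers when possible. It is ok to say \"I don't know\" "
    "if you don't know the answer. Only use information that you find in the "
    "provided context."
)

EXAMPLES: List[Tuple[str, str]] = [
    ("What is your last name?", "Parastatidis"),
    ("Where do you currently work and what is your role?",
     "I currently work at Meta as a Software Engineer."),
]


def build_few_shot_prompt(system: str, examples: List[Tuple[str, str]],
                          context: str, question: str) -> str:
    """Place worked examples before the context and the real question."""
    parts = [system, "", "Examples:"]
    for example_question, example_answer in examples:
        parts.append(f"Question: {example_question}")
        parts.append(f"Answer: {example_answer}")
    parts += ["", "Context:", context, "", f"Question: {question}", "Answer:"]
    return "\n".join(parts)


if __name__ == "__main__":
    print(build_few_shot_prompt(
        SYSTEM_PROMPT, EXAMPLES,
        context="(retrieved documents go here)",
        question="What is your current job?",
    ))
```

Ending the prompt with "Answer:" is meant to nudge the model to continue in the same short style as the examples.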

3 responses to “Digital Twin (my playground)”

Gary

2024-02-26

You’re harsh in grading the test results but my question is, who gets the bad grade, you or your twin . . . or both?
Brave of you to be so outward-facing in your research — that’s how science advances.
Richard Bender

2024-02-26

Hi Savas, as an outsider with no technical background, I still find this project and your approach very interesting. As an ordinary person, I find it somewhat frightening, but also great to see what ideas are being produced to advance our society.
The only help I can offer is that I admire your courage in doing this.
1. Savas Parastatidis
  
  2024-02-26
  
  Thanks, Richard. We are just playing around here in order to learn. I wish everyone all the best with it.