This is the first post, in what I think is going to be a series, about another side project of mine: BrainExpanded (yes, I also own the .com and .ai domains, just in case). As mentioned in my “Digital twin follow up” post, BrainExpanded is yet another way for me to learn new technologies and explore ideas.
The technological building blocks available today (and in the near future) make it possible for a single person to build assistant-like experiences relatively easily. So, that’s what I will try to do. Along the way, I will document system design and implementation learnings.
Humans, enhanced!
For a few years now, I, amongst many others in the field, have been talking about how AI assistants can “enhance human abilities”. While I worked on Cortana and Alexa, our goal was to build digital assistants that could help humans complete tasks or even take them over entirely. We wanted to deliver technology to our users that enhanced their memory, helped them be creative, and helped them focus or gain time back.
I like finding opportunities to link to “The Panda is Dancing”, my favorite poem about building technologies that enhance human experiences. Take a break to watch the video and then come back 🙂
I always thought that to accomplish the vision we had for Cortana and Alexa, the assistants needed access to more egocentric context, both physical and digital, always in a privacy-preserving manner. Humans acquire context via their senses: touch, hearing, vision, etc. At the time, we focused on building technologies for assistants so that they could hear and understand language. We were, of course, nowhere near what LLMs can do today on that front. I thought, along with many others, that the next step in context acquisition should be eyes: the assistants needed to see what the user sees. That was the primary reason I went to Amazon, but a camera-equipped glasses consumer product never happened during my time there. And that was the reason I moved to Meta, where smart and AR glasses are very much an active area of product investment as an interaction medium with Meta AI, Meta’s assistant. Check these out:
The boardwalk scene
There is a scene in the movie “Her” that inspired me in terms of the experiences I want to build. The protagonist walks along a boardwalk. There are many people all around. A lot is happening. A small device in his shirt pocket sticks out just enough for its camera to see the world. That camera is his digital assistant’s eyes. The human and the digital assistant are having a conversation about what is happening around them, as any two humans would. The digital assistant is seeing and understanding the world in the same way a human does.
Fun fact: the Cortana product and design team advised the producers of that movie, which is why they brought it to Bellevue, WA, for an early screening just for the Cortana team.
With the emergence of LLMs and the improvements in hardware for devices that capture egocentric context (e.g. smart or AR glasses), that experience is going to be possible very soon. We are already seeing examples:
- Ray-Ban Meta Glasses Add Live AI, Live Translation, & Shazam Support
- Project Astra: Exploring the future capabilities of a universal AI assistant
- OpenAI’s Day 11 (collaborative session with ChatGPT, which understands the app context of what the user is trying to do)
Continuous sensing and memory
Devices that continuously capture, understand, and stream context are almost here. That context will have to be stored in some form of memory in order for the user’s AI to access it. The big companies are making LLMs available with ever-increasing context sizes, but those will never be large enough to accommodate days, weeks, months, or years’ worth of egocentric context. And we need to be able to do that if we want to enable experiences like the ones we have seen in sci-fi movies and series; the “Entire History of You” episode of Black Mirror comes to mind.
Back in the Cortana days, we used “Do I need an umbrella?” as a way to test and showcase Cortana’s ability to understand the user’s request, retrieve the weather, reason over it, and then return a response. It became a running joke within the team. Today’s equivalent seems to be “Where are my keys?”
Google demonstrated such a prototype a few months ago (Project Astra: Our vision for the future of AI assistants… check out the part where the user asks “Do you remember where you saw my glasses?”). Granted, the demo wasn’t about keys, but you get the point… The context-capture device observes the user’s activities and environment. It sees what the user sees and it remembers it. Upon request, it can help the user recall information, even information that the user didn’t explicitly record.
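To make that recall loop concrete, here is a minimal, hypothetical Python sketch: the capture device keeps appending observations (say, captions produced by a vision model) to a long-term store, and a question like “Where are my keys?” is answered by ranking stored observations against it. The `Observation` and `MemoryStore` names are mine, and the naive word-overlap scoring stands in for the embedding-based retrieval a real system would use; this is not how any of the demos above are implemented, just a way to picture the shape of the problem.

```python
import re
from dataclasses import dataclass, field
from datetime import datetime


def words(text: str) -> set[str]:
    """Lowercased word set, ignoring punctuation."""
    return set(re.findall(r"\w+", text.lower()))


@dataclass
class Observation:
    """A single captured moment: what the device saw, and when."""
    timestamp: datetime
    description: str  # e.g. a caption produced by a vision model


@dataclass
class MemoryStore:
    """Toy long-term memory: stores every observation and answers recall queries.

    A real system would embed observations and query a vector index;
    naive word overlap is used here only to illustrate the flow.
    """
    observations: list[Observation] = field(default_factory=list)

    def remember(self, description: str) -> None:
        self.observations.append(Observation(datetime.now(), description))

    def recall(self, query: str, top_k: int = 3) -> list[Observation]:
        query_words = words(query)
        ranked = sorted(
            self.observations,
            key=lambda obs: len(query_words & words(obs.description)),
            reverse=True,
        )
        return ranked[:top_k]


memory = MemoryStore()
memory.remember("keys placed on the kitchen counter next to the coffee maker")
memory.remember("glasses left on the desk beside the red notebook")

for obs in memory.recall("Where are my keys?"):
    print(obs.timestamp.isoformat(), "-", obs.description)
```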
And this is the starting point for BrainExpanded. The next post will get us started on the journey.