If you think about it, Meta's apps, Facebook, Instagram, WhatsApp, Messenger, Oculus, are some of the busiest online destinations on the planet. They are how half of humanity connects with their loved ones, how people make a living, and where people turn to coordinate services when there are emergencies. And the magic in all of this, if you can call it magic, because I think of it as hard work, is that it just works. And when it works, which hopefully is all of the time, people don't even think about it.
Hey folks, my name is Santosh Janardhan, and I head up infrastructure here at Meta. I've been here a long time, 13 years to be exact, and I've been very lucky to be part of this journey. We're facing another inflection point in our infrastructure journey, and the reality is this: AI is no longer a minor workload, it is the workload. We are now AI first, and AI is central to every single thing we do in infrastructure at Meta.
It enables better personalization, it allows for safer and fairer products, and it delivers much richer experiences for people, while helping businesses reach the audiences they care about the most. AI is also different. It's a game changer, because it comes with great promise, but also with great challenges. The hardware and software needed to support and develop AI are profoundly different from the basic compute technologies that we have been familiar with for a couple of decades at this point.
So, we design and custom-build our data centers, we design and custom-build our hardware, we own the kernel, the silicon, and the software stack, and we know our workloads. We have PyTorch, which is the glue between all of this. We can vertically integrate a stack like few others can. This has inspired us to completely rethink what we are doing for AI, and to do it at an immense scale. And as we build, we look for ways to open source our breakthroughs so that the world can benefit, and people can innovate alongside us.
We'll share today how we are transforming everything in our infrastructure. Across the six presentations you'll hear today, we'll explore some of our latest innovations, including the latest research coming from the fully built-out RSC, which is one of the fastest AI supercomputers in the world. We'll share progress on our in-house, custom-built silicon investments for inference and video encoding, and, as I mentioned earlier, the data center designs for AI, which are absolutely critical.
So, you'll hear about how we are designing our facilities to support the future. We need liquid cooling for AI hardware, and we're doing this with shorter construction timelines and much lower cost. Now, we have goals to use AI internally as well as in our consumer products. So, we'll share details about Meta's generative AI-based coding assistant, and the longer-term vision to help developers across the software development life cycle, the SDLC, if you will. Finally, we'll talk about PyTorch, and the impact it has on our overall AI infrastructure vision.
To wrap up the day, we'll bring you the leaders across our infra organization in a panel. I'm super excited about this, to share their perspectives on the future of AI infrastructure. Moderating the discussion is Irene Kaufman. She's Meta's PM leader responsible for all our AI efforts that span many teams across the company. She and other infra leaders will be exploring the challenges and opportunities that lie ahead, and how Meta plans to focus on delivering the long-term value and impact that guide our vision here.
I want to thank all of you again for joining us. Throughout the event, if you have questions you'd like answered, or topics you'd like us to cover in future @Scale events or in our Meta technical blogs, simply visit our website at scaleconference.com, or scan the QR code right on your screen.
We'll kick off the presentations today with RSC, our AI supercomputer. Ensuring that our platforms work across many diverse cultures, languages, and perspectives is a significant challenge at our scale, as you can imagine, and it requires pretty intensive, large-scale AI models. The complexity of virtual reality and the metaverse further increases the challenge space. This requires much larger models, with a greater number of modalities and parameters, as you can imagine.
We anticipated these challenges and built out the Research SuperCluster (RSC) AI supercomputer, a dedicated, high-performance, state-of-the-art cluster to accelerate AI research. Technical program manager Scott Jeschonek and software engineer Kalyan Saladi will join us now. They will present the architectural choices that went into building the cluster. Enjoy, you're going to have a fun ride.
Hi everyone, and thank you for joining us today. We're excited to share an update about our Research SuperCluster, which we recently completed this past fall. We started the effort on it back in 2021, completed the first phase in 2022, and finalized everything in the October-November time frame. For now, I'm going to set the groundwork for why we built the Research SuperCluster in the first place.
Meta has been using AI in a lot of different ways for many years. Whether you're talking about flagging harmful, toxic, or biased content in our applications, or about translating a language instantaneously, if you've been in the Facebook app and you've clicked the Translate button, you know that it's machine learning doing that machine translation for you. This is a significant area of investment, and later in the presentation I'll talk about one of the projects in this space, No Language Left Behind. And then there's the advent of AR and VR; whether you're using an Oculus today or thinking about what it will be like in the future, the use of AI is essential to these platforms. Whether you're talking about the placement of you within your room, or how your hands are moving, these are all things driven by AI. And as these platforms grow in importance and scope, AI is going to play an even more important role. Being able to do this across billions of users in many different countries on all of our platforms is a very ambitious scope of work, and it requires research to underpin it.
In addition to that, we have to be mindful of our use of data to train our research models: being able to track that data, make sure it's logged and stored in an appropriate way, and that it's encrypted. All these things are important. But it's also important to make sure that access to our research platform is a controlled environment, and that unauthorized access is prevented by default. My peer Kalyan will be talking a little bit about some of the technology that we've put into RSC that helps in this regard.
In terms of how we approach research and how we continue to improve our AI functions at Meta, our research community is constantly looking at ways of increasing the amount of data, or the quality of the data, or the sources of the data used to improve our training. They also explore different modalities of data: whether you're training a language model with just text, or perhaps with text and images to enhance the richness of the output, that's one avenue. In addition, you may want to increase the complexity of the model, whether by adding parameters or by adding pre- and post-processing into the model workflow; those are areas of focus too. But to do all of these things, you have to learn from what you did, make improvements, tune, and iterate. That requires rapid iteration and rapid innovation, which takes time. And time has a direct correlation to how much resource you're able to run on for your investigation, or for your final output.
An example of how this has become important to us is in the large language model space. If you look at the chart here, this is the number of parameters used in large language models over the last five years. The advent of the transformer architecture allowed for a lot more parallelization and scale, which meant that you could add many, many more parameters. You can see the growth of parameter counts has been exponential since 2018; we're now approaching a trillion. The more parameters and the more data you add, the longer it's going to take to process if you don't have sufficient scale. We knew this because we've been using AI for a good while now, and we decided to make an investment in a large cluster.
You can't build these overnight, you have to plan. There have been some challenges in the past few years, as I'm sure we're all aware; COVID has impacted supply chains, and we had to factor all these things in. So this was a multi-year, very forward-looking effort, and we're very happy that we've come out on the other side with a fully functional cluster. And on that note, I'd like to hand off to my peer, Kalyan, who's going to talk about the details of the cluster and the lessons that we learned.
Thank you, Scott, for the introduction. Hello, everyone. My name is Kalyan. I'm a software engineer on the AI Research Infrastructure team at Meta. I worked on production ML training and large-scale distributed systems at Meta and VMware before joining the current team. Let's jump into the question: why build a custom super cluster instead of using the existing data center technology that Meta has deployed all over the globe? It really comes down to understanding and realizing the unique demands large-scale AI training places on the infrastructure. This translates into our need to control the physical parameters.
What are the physical parameters? I'll highlight three of them. Number one is cooling technology. Airflow-based cooling was not meeting the mark for large-scale AI training, so we had to go with liquid cooling, which was a departure from Meta's production data centers. Given the rack density and the number of GPUs we wanted to pack into the data center building, the power requirements also deviated significantly from the production setup. But there's one more important aspect: the specialized, flat backend network. This is a low-latency, high-bandwidth network with constraints on cable length, again constraining the physical parameters of how far you can spread these GPUs. When you put these three together, we had to make a choice: we needed a custom cluster.
Let's look at what the Research SuperCluster is at a high level. I want to quote one number: the aggregate compute power of RSC is up to five exaflops. That is one billion billion operations per second, times five. So this is an enormous scale. How do you get this scale? What does it take to support it? Let's drill into the building blocks of the cluster over the next few minutes.
First and foremost is the fast and flat network. What's fast about it? Each server has eight InfiniBand links; this is 2x more than the prior generation. Each link is 200 gigabits per second, again 2x faster. And most importantly, there is no oversubscription in the network. When you zoom out from the server level to the entire fabric, we believe this is one of the largest known flat InfiniBand fabrics in the world. To quote some numbers, the fabric has 48,000 links and approximately 2,000 switches, including leaf and spine. This is a lot of entities, right? I believe 20K nodes is what is reported by the InfiniBand network there.
We talked about the speed and the scale, but there is a very important qualitative aspect to the way we designed the network. We repeatedly emphasize its flat nature. What are the benefits of a flat network? The scheduler sees a homogeneous set of resources, which, from the researchers' perspective, means they are free from having to worry about what performance they get if they land on XYZ nodes versus ABC nodes. This is a degree of freedom that our researchers really appreciate. Workloads are free from topology awareness. As a result, there is no resource fragmentation in the cluster, and we get to train more jobs, and at a larger scale.
Let's put these design aspects together and see what bandwidth we get from this network. As an example, if we run a 4,096-GPU (that is, 4K GPU) NCCL AllReduce benchmark, we get 157 GB/s in-place bus bandwidth. That's already pretty good, right? And this is without optimization. As recently as a couple of weeks ago, we managed to optimize the network further with SHARP, and we are seeing close to 211 GB/s NCCL AllReduce bandwidth. That is tremendous performance that we are able to extract out of the network.
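To make the benchmark concrete, here is a minimal sketch of an all-reduce bandwidth probe using torch.distributed with the NCCL backend. It is illustrative only, not the exact benchmark run on RSC; the buffer size, iteration counts, and launch command are assumptions.

```python
import time
import torch
import torch.distributed as dist

# Launch with, e.g.: torchrun --nproc_per_node=8 allreduce_bench.py
def main():
    dist.init_process_group(backend="nccl")
    rank = dist.get_rank()
    world = dist.get_world_size()
    torch.cuda.set_device(rank % torch.cuda.device_count())

    numel = 1024 * 1024 * 1024 // 4            # a 1 GiB fp32 buffer
    x = torch.ones(numel, dtype=torch.float32, device="cuda")

    for _ in range(5):                         # warm-up
        dist.all_reduce(x)
    torch.cuda.synchronize()

    iters = 20
    start = time.time()
    for _ in range(iters):
        dist.all_reduce(x)
    torch.cuda.synchronize()
    elapsed = (time.time() - start) / iters

    size_bytes = numel * 4
    algbw = size_bytes / elapsed / 1e9         # algorithm bandwidth, GB/s
    busbw = algbw * 2 * (world - 1) / world    # NCCL's bus-bandwidth convention
    if rank == 0:
        print(f"algbw={algbw:.1f} GB/s  busbw={busbw:.1f} GB/s")

if __name__ == "__main__":
    main()
```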
Let's move on to the next building block, which is the powerful compute. Even if you have the fast and flat network, if the compute nodes are not good, we are not going to get the five exaflops that we promised, right? So what do we have as the compute nodes? We have 2,000 NVIDIA DGX A100 systems, which was the latest generation available at the time of the cluster build-out. Each server has eight A100 GPUs. Each GPU has 80 GB of memory, totaling 640 GB per server. We also have a front-end Ethernet network at 200 Gbps throughput. The goal was to fit as many of these GPUs as possible, which ties back to the need for a custom cluster.
Now that we have talked about the compute, let's move on to the next building block of the cluster: storage and data. AI training at this scale is nothing if we cannot supply data fast enough to the GPUs, right? With that in mind, let's see why we had to build a custom storage engine.
Early on, we realized that the special properties of training data, and the demands it places on storage systems, meant that we had to create a special-purpose storage service called AIRStore, the AI Research Store. AIRStore improves data loading performance and scalability. What happens in the background is that we have dozens of flash arrays and hundreds of cache nodes. We pre-process the training data and distribute the bundles across the flash arrays.
The cache nodes fetch the data and keep it ready for the GPUs to pull when they need a particular sample. AIRStore is a complex storage service that really requires a dedicated presentation of its own; I'm only highlighting the important aspects. When you orchestrate it this way, you are able to achieve a 10-200x reduction in disk seeks and RPCs. This is very important to keeping the hungry training nodes busy with data.
Let's look at the other aspect of storage, which is NFS. We have 10 petabytes of flash storage mounted and visible to every device in the cluster. This storage is used for intermediate checkpointing, code, logs, and other transient data that jobs produce. Why is this such an important aspect of a job's lifecycle? Jobs can start training, be interrupted, and resume on a different node from a previously taken checkpoint. This makes it easy to handle both failures of a job and stop/suspend/resume of a job at a different point in time, because the checkpoint is available across the cluster.
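Here is a minimal sketch of that checkpoint-and-resume pattern against a shared NFS mount. The path is hypothetical, and real jobs at this scale typically use sharded or distributed checkpoints; this only illustrates why a cluster-wide mount makes resuming on a different node straightforward.

```python
import os
import torch

CKPT_PATH = "/mnt/shared_nfs/my_job/checkpoint.pt"   # hypothetical path on the shared flash/NFS tier

def save_checkpoint(model, optimizer, step):
    # Write to a temp file and rename, so a crash never leaves a half-written checkpoint behind.
    tmp = CKPT_PATH + ".tmp"
    torch.save({"model": model.state_dict(),
                "optimizer": optimizer.state_dict(),
                "step": step}, tmp)
    os.replace(tmp, CKPT_PATH)

def maybe_resume(model, optimizer):
    # Every node sees the same path, so a rescheduled job picks up wherever the last one stopped.
    if os.path.exists(CKPT_PATH):
        state = torch.load(CKPT_PATH, map_location="cpu")
        model.load_state_dict(state["model"])
        optimizer.load_state_dict(state["optimizer"])
        return state["step"]
    return 0
```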
I want to take the opportunity to re-emphasize our commitment to privacy and security in the cluster. We encrypt data at rest and in flight. Samples are only decrypted when the GPU needs to consume them, and the data has a TTL, beyond which it is deleted and wiped from the cluster.
Now we have the compute, the network, and the storage systems. This makes for a pretty powerful compute infrastructure. How do you offer it to our researchers? Researchers need to be able to consume the resources in an easy manner. For that we built a purpose-built control plane, and we use an open source scheduler called Slurm. This is an HPC scheduler that makes job management fairly easy for researchers as well as for the cluster administrators.
Let me talk through the flow. A user simply submits a batch job, and it enters the job queue based on its priority, the resources available, and the resources required. Slurm picks up the job and places it on a set of nodes. Remember, our cluster is homogeneous, which means that Slurm is able to pick any subset of nodes and place the job. And as jobs finish and slots become available, new jobs can enter and land on those nodes.
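As an illustration of that flow from a researcher's point of view, here is a sketch using submitit, the open source Python library for submitting work to Slurm. The partition name, resource shapes, and training entry point are all hypothetical; it simply shows the submit-and-let-the-scheduler-place-it model described above.

```python
import submitit

def train(config_path: str):
    # Placeholder for the real training entry point.
    print(f"training with {config_path}")

# Folder where submitit writes Slurm logs and job state.
executor = submitit.AutoExecutor(folder="slurm_logs")

# Hypothetical request: 16 nodes x 8 GPUs on a partition named "research", one week max.
executor.update_parameters(
    nodes=16,
    gpus_per_node=8,
    tasks_per_node=8,
    cpus_per_task=12,
    timeout_min=7 * 24 * 60,
    slurm_partition="research",
)

job = executor.submit(train, "configs/llm_65b.yaml")
print(job.job_id)   # Slurm queues the job and places it on any free, homogeneous set of nodes
```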
Now that I have described the building blocks and the highlights of the various systems we built, let's move on to some of the lessons we learned while building and operating the cluster. I'll start with one significant aspect: this entire cluster of 16,000 GPUs was not built in a day, nor as a one-shot exercise. In fact, we did this in two phases, during COVID, and remotely.
More importantly, this phased build-out was done in a non-disruptive manner, where 40% of the capacity was made available a year and a half ago, and we continued to build and expand the cluster. This required some good planning upfront so that we had the ability to extend the cluster without disrupting workloads. Here are a few more lessons that we learned and incorporated into the second phase of the cluster. Number one is failure rates.
The hardware failure rates were higher than we anticipated, and this required us to build better detection and remediation mechanisms to ensure a stable and available cluster and offer a seamless experience to our researchers. Next up, as I mentioned before, our fabric is one of the largest flat InfiniBand fabrics in the world, especially at this scale. This forced us to do pioneering, groundbreaking work to find the performance bottlenecks, tune them, and sustain the performance of the cluster over a long period of time. These lessons were incorporated into how we brought up the rest of the cluster in phase two.
The third aspect: when you have a high-performance compute infrastructure, a lot of projects want to run their jobs on the cluster. How do you offer these resources in a manner that is controllable, that can prioritize, and that can implement the business priorities? We worked closely with Slurm's scheduling and priority primitives to incorporate resource quotas and priorities, so that the right jobs consume resources at the right time.
I want to cover a few more lessons that we picked up during the operational phase of the cluster, across both phase one and phase two. As mentioned before, GPUs fail in multiple ways, with both hard and soft failures. This requires different detection and remediation strategies. For soft failures, you can get away with tooling that can fix the state of the GPU, but sometimes you have to go toward a parts replacement and a longer remediation cycle.
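For a flavor of what a soft-failure sweep can look like, here is a hedged sketch using NVIDIA's NVML Python bindings. The thresholds and the remediation policy are made up for illustration; this is not Meta's internal tooling.

```python
import pynvml   # NVML bindings (pip install nvidia-ml-py)

def scan_gpus():
    # Flag GPUs reporting uncorrected ECC errors, running hot, or not responding to NVML at all.
    pynvml.nvmlInit()
    suspects = []
    for i in range(pynvml.nvmlDeviceGetCount()):
        handle = pynvml.nvmlDeviceGetHandleByIndex(i)
        try:
            ecc = pynvml.nvmlDeviceGetTotalEccErrors(
                handle,
                pynvml.NVML_MEMORY_ERROR_TYPE_UNCORRECTED,
                pynvml.NVML_VOLATILE_ECC,
            )
            temp = pynvml.nvmlDeviceGetTemperature(handle, pynvml.NVML_TEMPERATURE_GPU)
            if ecc > 0 or temp > 90:           # hypothetical thresholds
                suspects.append((i, f"ecc={ecc}, temp={temp}C"))
        except pynvml.NVMLError as err:
            # A device that no longer answers NVML queries is a candidate for hard remediation.
            suspects.append((i, f"nvml error: {err}"))
    pynvml.nvmlShutdown()
    return suspects

if __name__ == "__main__":
    for idx, reason in scan_gpus():
        print(f"GPU {idx} flagged: {reason}")
```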
The second part is network stabilization, which at this scale, when you're talking about 48,000 links, is a pretty hard problem. Every component becomes a suspect: the NIC attached to the server, the cables linking nodes to switches, the leaf and spine switches themselves, and the inter-switch links (ISLs). When you have these thousands of fiber cables, a lot of physical factors play a role, including the bend radius and the temperature; any degradation can affect both performance and functionality. Network stabilization is a hard problem, both during the build-out phase and in keeping the cluster stable afterwards.
One other key takeaway I want to emphasize is that in such a large, high-performance setup, when your workloads misbehave or fail, the symptoms are several layers removed from the root cause. This means you have to make very good observability and debuggability infrastructure investments, because bad nodes, bad memory, or faulty cables can all result in a completely opaque symptom experienced by the job.
Now that I've talked about the why and the what of the cluster, I want to hand it back to Scott, who will go over case studies of models trained on RSC. Thanks, Kalyan.
Now I'd like to talk to you about a couple of the projects, and then we'll wrap up. I'd like to start with the LLaMA project we recently released. The research team was trying to tackle a few different problems. One question was: let's say you're a researcher who wants to understand how large language models work, and you're in the general community or at a college, and you can't really run something like GPT because you don't have the scale of facility, the scale of infrastructure, to do that.
So one of the drivers was to release a set of foundation models with much smaller parameter counts that would be available to the research community, to help them better understand LLMs and potentially innovate beyond the foundation models themselves. That was one of the agenda items.
The other goal was to see whether a model trained for a longer period of time, with perhaps more data but smaller parameter counts, could yield results similar to some of the state-of-the-art language models in terms of quality of responses and accuracy. I encourage you to take a look at the research paper, which will give you all of the findings. They had some interesting results.
One of the keys is that they were able to use RSC to do rapid training. They sped up their training significantly compared with prior efforts in other environments, which helped them hit their deadline much faster. In addition, the stability of the cluster during their training epochs allowed them to run without interruption for multiple weeks at a time, which got them to their results quicker as well. So this is an excellent example of a project that was able to leverage the scale.
The largest model, the 65 billion parameter model, was trained on 2,048 GPUs, and they were able to dramatically decrease the amount of time it took to complete that. In addition, there's the No Language Left Behind project, NLLB, which I mentioned a little earlier in the presentation. It was designed to explore whether we could do machine translation for languages that don't have a lot of data on the internet.
And the team has done a wonderful job. They released an excellent research paper, which I encourage you to look at. The Research SuperCluster enabled this team to dramatically speed up their training times from where they were on the previous-generation cluster, which was around a month per epoch. Running on RSC, they were able to drop their training times down to around a week, which allowed them to improve the output and get to their research results in time.
So in conclusion, the Research SuperCluster offers two key ways to scale. One is the ability to scale the state of the art: can we take a single model and go big, with thousands of GPUs for a single model? That is a key aspect of RSC. The second is the ability to scale out and have many projects running. Because of the nature of the network and the architecture, our research community can run multiple simultaneous projects without interruption and without impinging on each other. This means we'll be able to take on and offboard projects quickly and scale our research efforts here at Meta.
In terms of infrastructure innovation, we will continue to look at how we build RSC, how we operationalize it, better engineering practices, how we can improve it, and what we can learn from our efforts that we could apply to any future endeavors. We'll also share our learnings, thoughts, and processes with our peers in production as we drive toward the future. And with that, I thank you all for attending our presentation. Thanks, Scott and Kalyan.
Having enough physical AI capacity is essential to the future of our company, and supporting all our AI workloads at scale requires a very different approach than what we do to scale our regular web services, right? Our new data center design will support the next generation of AI systems. We are building an increased level of flexibility into this design, which allows us to pivot in response to the shifts and changes we see in the AI space. Please welcome engineering director Alan Duong to take us inside the vision for Meta's next-generation, AI-optimized data center design. Take it away, Alan.
AI is the foundation of our discovery engine and our ads business, while also improving our user experience. AI-based ranking, computer vision, and language translation have drastically changed how we build and deploy services across all of our apps. So advancing AI, building it into every one of our products, and enabling many new products and services is critical to our future. And it's still evolving, so it's pretty complex for data centers.
Hi, I'm Alan. People call me AD. I'm a director of engineering for data centers. I've been with Meta for nine years, but I've dedicated my entire career to designing and constructing data centers. There's something about creating an idea, drawing it on paper, conceptualizing it, creating blueprints, and actually seeing it physically constructed and coming to life. And along the way, I get to work with thousands of amazing people in our industry to make it happen. Nothing better than that. So we're going to continue to make AI happen in our data centers. Newer AI technology calls for new hardware, and our new hardware needs a new home. That new home is our next-gen data center design. Our next-gen design will enable AI technology for today and for future generations. But more importantly, we need to plan for roughly 4x scale. So how do we do that? How do we plan for that scale? Scaling is not new for us; we've done it before.
As you can see in this graphic, since 2010 we've scaled our infrastructure by over 10x. It all started as a growth journey: we experienced exponential growth in users and engagement, and we had to rethink our approach to our infrastructure stack, meaning, how do we control our own destiny? We believe that innovation requires a full technical stack approach. What that means is we develop our own products and our own software, and we went on to develop our own hardware and network and build our own data centers. Resilient, portable software sitting on open, modular, disaggregated hardware was the approach that led to a super-efficient data center design.
The image you see on the left is a napkin sketch. This was an example of engineers from the hardware team and the data center team getting together to design a fully integrated power distribution system that led to a world-class, efficient data center design. In 2009, we completed our very first data center building in Prineville, Oregon. That build was 38% more efficient to build and 24% less expensive to run than any of our previous facilities.
In 2014, we quickly surpassed a billion users. This was right around the time I joined the company. Needless to say, we had to scale our infrastructure exponentially, by a factor of 10x; we called that the wall. Our foundation was strong: our design was simple and repeatable, with world-class energy efficiency. Basically, we were asked to just do it again; we were really good at it, just a lot more of it.
In 2018, a slight wrench was thrown into the mix. It was a new design, a new network design specifically to support disaggregated flash, that, to put it simply, broke our first-gen design. So we had to redesign the data center, but with jobs that were in flight in the middle of construction in mind. We had to rethink the design for our existing fleet and figure out how we could reconfigure it for future projects without breaking it all apart.
In the same theme of change, in 2020 we set some new goals, and those goals were to reach net zero emissions across our entire value chain by 2030. So again, more change that drove change into our current design. In 2022, additional change hit us, and that change was to be water positive, which means our company will restore more water than we consume for our global operations. All that change, and all that growth, resulted in 12-plus years of building world-class data centers.
We landed on scale and, more importantly, a homogeneous infrastructure. That uniformity gave us the ability to rapidly deploy and efficiently maintain our global fleet. So here we are today with a different set of requirements but, in my opinion, similar challenges. We're experiencing growth in AI tech, and we need to make sure that our data center can adapt to something that's still evolving. All that change, and we need to scale it by roughly 4x.
For the rest of the talk, I'm going to walk us through how we enable that 4x through innovation and efficiency, while meeting the goals and commitments we've made on sustainability. It all starts with innovation. AI is relatively new and still evolving, as I've said a couple of times now, and that makes innovating in data center design really complex.
I often ask myself: how do we balance our design around what we know today versus how much we should plan for, or how much we should future-proof? For example, over the next four years we may see 1.5 to 2 times growth in power consumption per accelerator, and roughly 1.5 times for high-bandwidth memory. This evolution is all happening as we're planning to break ground on our first next-gen data center this year. And if we're just starting construction on that data center now, by the time we're done, it might be obsolete.
In addition to that, depending on what our services and products need, we could see anything from smaller clusters of 1,000 accelerators to potentially 30,000-plus for much larger jobs. Each of these configurations, as well as the accelerator we utilize, will require a slightly different approach to hardware and network system design. So the data center will need to accommodate all of this.
So here's how we're thinking about it. In data center design innovation, we need to focus on flexibility for long-term compatibility and scale in deployment. It starts with the building design, as well as some power distribution components. We have to enable co-location of server and network hardware.
In the case of AI training, the servers, which are built around accelerators, and the network system operate as one when we scale up or down. So there's a dependency in the co-location of this equipment, and depending on our products and services, that could change. Sharing physical infrastructure for these two types of hardware is going to be really important for us if we're going to plan for flexibility, or fungibility.
This also enables efficiency in our fiber deployment, because we need a significant amount of fiber to interconnect the servers; co-locating them closer together allows us to gain some efficiencies there. And adding this flexibility within the white space itself still enables a homogeneous approach to data center deployment and operations, so it gives us flexibility from that perspective as well.
Next, server type flexibility. These servers are going to require different types of cooling. That means that as we think about our new design, we're developing cooling systems that will support 100% air cooling as well as a large percentage of liquid cooling. This allows us to continue to support traditional services today, like compute and storage.
It also allows us to support the first generation of Meta's AI-enabled hardware, as well as any future hardware. Delivering power infrastructure closer to the server rack will be simpler and more efficient with our new design; we're eliminating as much equipment as possible throughout our power distribution chain.
As you can see from the graphic, we're eliminating the low-voltage switchgear that creates what you might call a bottleneck of capacity. Eliminating that allows the server rack to grow in density in the future with only minor modifications to our infrastructure, and it continues to allow for greater power utilization.
We pride ourselves on world-class power utilization. Today, we utilize roughly 70-plus percent of the power that we deploy. What does that mean? It means we strand less power, and it eventually means we build fewer data centers, which is all good and makes us more efficient.
So, innovation to enable flexibility and scale. That's important, but efficiency in our design has always been core to our business. Ultra-flexibility and future-proofing requirements for both air and water cooling don't do us any favors in power efficiency, in reducing costs, or in deploying data centers faster. So we had to make some tradeoffs as we progressed through the design.
Over the last year we've had to make some tradeoffs, and there are a couple of examples I'll share with you. When you think about liquid-to-chip cooling, it's important for enabling future generations of AI, but deploying too much too early is inefficient. In our design, we've already made a wholesale shift to facility water and created these AI scaling units, as I previously shared. But for efficiency, we're only going to deploy a small percentage of liquid-to-chip cooling on day one, and we'll scale it up as needed. This means more complex upfront rack placement and planning.
We haven't had to do that in the past, so this is a much more complicated process for us, but it allows us to save capital and deploy faster; less equipment means we can build faster, and it limits unnecessary maintenance of equipment that would sit unused. We're also going to continue to lean into software resiliency, plus some hardware buffer, versus relying too heavily on physical resiliency like much of the industry does. This allows us to right-size our physical backup infrastructure, for example by using fewer diesel generators, saving time in deployment; less equipment and less time, and it reduces emissions and continues to make our operations more efficient.
The risk is that we're taking on some unknowns associated with relying on software resiliency for our AI workloads. We're still learning about that as we deploy at scale, and as we learn more, we might adjust our strategy. Another tradeoff is increased power usage to enable water neutrality in liquid cooling. Liquid cooling doesn't come for free: we can't simply open our windows and rely on free air cooling anymore, and we can't keep leveraging evaporation to reject heat, because that will continue to be a challenge as we go into regions that are water-constrained and as we continue to scale out our operations.
This means we'll be using a little bit more power to cool our equipment, but on the flip side we'll reduce our water consumption. So with all these tradeoffs, and there are many more that I don't have time to share, where do we land? We anticipate that our next-gen data center will be 31% more cost effective, and we'll be able to build it two times faster for a complete region, compared to our current-generation data center.
These are all tough tradeoffs that we've had to make, or continue to make, on a case-by-case basis given the constraints we have. This leads me to our continued commitment to sustainability. We've already committed to reaching net zero across our value chain by 2030, we continue to support 100% of our operations with renewable energy, and we've committed to reach water positivity by 2030. To date, we've restored 2.2 million cubic meters of water.
So how does our next-gen design contribute to this? Number one is to just use less material. Less is more; that's the easy button, and that's why we colored it green. Just press the easy button: use less, design smaller. Think about a region that is significantly smaller than what we have today. That means less equipment and less underground infrastructure. Number two is deeper supplier engagement.
That means driving for greater supply chain transparency and developing shared goals and shared emission targets with our suppliers. We've committed to net zero by 2030 across our entire supply chain; that means everything, every component that goes into our data center. And last but not least, switching to low-carbon alternative materials. For example, we see a ton of opportunity in concrete. Prior to our new design, our emissions footprint, measured in metric tons of carbon dioxide equivalent, was projected to grow by roughly 3x by 2030; just follow that dotted line to the right. Just by using less as a first step, our next-gen design is tracking roughly 75% less carbon intensive. So as we continue to progress our design, explore alternative materials, and engage more deeply with our suppliers, we're confident that the data center will do its part in helping us reach our goals.
So in closing: AI tech continues to evolve at a rapid pace. Flexibility in design is key for long-term success, and so is balancing tradeoffs between efficiency and compatibility with a continued commitment to our sustainability goals. Lastly, the journey is only 1% finished. We will continue to innovate and evolve our design and drive for greater efficiency while enabling future generations of AI technology. Thank you.
Thanks, Alan. After almost five years, we finally created the technologies that made it possible to compile any PyTorch model, resulting in a step-function change in PyTorch's approach to execution efficiency. We call it PyTorch 2.0. PyTorch 2.0 delivers significant performance improvements over a wide variety of models, often with a simple one-line change. Engineering manager Peng Wu joins us now to explore two important technologies that underlie PyTorch 2.0: TorchDynamo and TorchInductor. Take it away, Peng.
Welcome. My name is Peng Wu. I support the PyTorch compiler team. On March 15, 2023, we announced PyTorch 2.0, a step-function change to PyTorch performance via a new mode called graph mode. The most remarkable aspect of PyTorch 2.0 is that we are able to offer graph mode without sacrificing the ease-of-use UX that made PyTorch successful in the first place. I want you to bookmark two phrases: graph mode and ease of use.
It was long believed in the industry that machine learning frameworks cannot have both graph mode and ease of use. PyTorch 2.0 challenged that conventional wisdom. So in today's talk, I'm going to tell the story of PT2's unique graph mode, and essentially how we could have our cake and eat it too. But before we jump into 2.0, let's talk about PyTorch 1.0 first.
1.0 was announced about five years ago, and to give a little bit of historic context, at the time the whole industry of machine learning frameworks was mostly embracing and designing around graph mode. It was believed that graph mode allows compiler optimizations and so could potentially provide better performance. But the catch with graph mode is that it requires the developer to think in graphs, and this is really counterintuitive, hard to express, and even harder to debug. So 1.0 made a bold bet at the time: we decided to value ease of use above everything else, including graph mode.
So PyTorch 1.0 boldly chose to embrace non-graph mode, which we call eager mode, with the intention of quickly drawing adoption from researchers. This bet paid off. Two and a half years after the 1.0 release, PyTorch reached 50% adoption, making it the number one machine learning framework used by researchers. And since then, we have still seen healthy year-over-year growth. Today, PyTorch is the de facto training engine for most of the most advanced ML models out there.
If the story of 1.0 is about making a strategic bet, using ease of use to attract research adoption, then 2.0 came into being through pure technical innovation. In PyTorch 2.0, we introduced the torch.compile API as the primary graph-mode API. It's very easy to use: you program as if in eager mode, and you just add this one-line torch.compile call to your code, and then the graph engine kicks in behind the scenes. And this graph engine is able to offer an out-of-the-box performance boost of 30% to 70% on average over a wide range of OSS models.
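Here is a minimal sketch of that one-line change. The toy model and sizes are placeholders; actual speedups vary by model and hardware.

```python
import torch
import torch.nn as nn

# A toy model standing in for any eager-mode PyTorch model.
model = nn.Sequential(
    nn.Linear(1024, 4096),
    nn.GELU(),
    nn.Linear(4096, 1024),
).cuda()

compiled_model = torch.compile(model)    # the one-line opt-in to graph mode

x = torch.randn(64, 1024, device="cuda")
y = compiled_model(x)                    # first call triggers graph capture and compilation
loss = y.sum()
loss.backward()                          # training works as usual; the backward pass benefits too
```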
So essentially, we did have our cake and eat it too. You may wonder why we make such a fuss about PyTorch graph mode. It's because what made PyTorch so loved, its flexibility and dynamism, is exactly what made it hard to compile. So the moment we figured out how to have graph mode while maintaining the ease-of-use API, we knew a step-function change was happening in PyTorch. And that moment came with TorchDynamo.
TorchDynamo solved a long-standing problem in PyTorch: graph capture. In fact, this was not our first attempt; we had offered several generations of graph capture techniques for PyTorch, all of which required significant manual effort. The solutions ranged from, on one end of the spectrum, capturing correct graphs but requiring human intervention to make the graphs capturable, to, on the other end, always capturing a graph that might not be correct. Dynamo solved both issues.
To give you some intuition, there were two key insights. To make graphs always capturable, we let go of the requirement of always capturing the whole graph. Instead, we capture partial graphs: we stop graph capture if we encounter something Dynamo does not recognize, fall back to eager, and then resume capturing when we reach a region we do recognize. To solve the second problem, capturing a graph that is not correct for the execution, Dynamo introduces guards that are validated at runtime. And of course, if a guard fails, we have the ability to recapture graphs just in time. Ultimately, these three key designs, partial graphs, guarded graphs, and just-in-time recompilation, are what made TorchDynamo both sound and out-of-the-box.
Just to give you an example from the previous code fragment we showed: in this example, there is a deliberately introduced graph break in the if statement. On the right-hand side, we're printing out the graphs; there are actually three graphs, as highlighted in the color bars. This is by design, and it's exactly what makes TorchDynamo operate completely transparently for end users.
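The slide's code isn't reproduced in this transcript, but here is a hedged reconstruction of the kind of fragment being described: a data-dependent if statement forces a graph break, so Dynamo captures the code before the branch, runs the branch in eager, and resumes capture afterwards, with no changes needed from the user.

```python
import torch

def toy_fn(x):
    y = torch.sin(x) + torch.cos(x)   # captured in the first graph
    if y.sum() > 0:                   # depends on tensor data -> deliberate graph break
        y = y * 2                     # captured separately
    else:
        y = y - 1                     # or this branch, depending on the runtime value
    return torch.relu(y)              # capture resumes here

compiled = torch.compile(toy_fn)
out = compiled(torch.randn(8))        # runs correctly despite the break, with no user changes
```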
I just want to give you a glimpse of the magic behind TorchDynamo. On the left-hand side is the normal CPython interpreter; this is what happens when you execute in eager mode. On the right-hand side is the contraption built by TorchDynamo. A few things I want to highlight: number one, the Dynamo solution is built on top of standard Python, using a standard feature called PEP 523. And the second part I want to highlight is that the added boxes transparently handle the things we talked about before, such as graph capture, graph breaks, validation of guards and recapture, compiling code at runtime and executing it, and falling back to eager. A lot of this complexity is handled completely and seamlessly by the TorchDynamo execution engine.
So Dynamo solved the graph capture problem, but keep in mind that capturing graphs does not, by itself, improve performance. This is where TorchInductor comes into play. TorchInductor is a PyTorch-native optimizing compiler, and it's the magic behind the 2.0 performance. It is also one of the few training compilers out there for PyTorch, and by far the one that offers the best out-of-the-box performance and covers the most models.
I don't have enough time to go into the details of Inductor, but if there is only one thing I'm allowed to highlight, it is the unique IR design of TorchInductor. Inductor is designed to handle real models, which meant that from the very beginning we designed the IR to handle the very tricky cases of PyTorch semantics, including the large op surface, mutation semantics, and dynamic shapes. All of these contributed to Inductor being the best-performing training compiler for PyTorch, with the best coverage of PyTorch models as well.
To wrap up, this picture summarizes the journey from 1.0 to 2.0. Five years ago, 1.0 surprised the industry by fully embracing non-graph-mode execution, which we call eager, with the intention of quickly attracting adoption from researchers, and that led us to become the number one machine learning framework. Today, 2.0 surprises the industry again by introducing this special graph mode under the hood without sacrificing the ease-of-use UX that made 1.0 successful. Behind the 2.0 technology are two really cool innovations.
The first is TorchDynamo. This out-of-the-box graph capture opened the pathway from eager mode to graph mode, so that most PyTorch models can seamlessly transition to graph mode without any human effort. Once we opened that pathway, the second technology is the optimizing compiler, TorchInductor. Today, Inductor is the best-performing training compiler for PyTorch; it covers the most PyTorch models and handles all the tricky semantics of PyTorch.
Today, you can already use PyTorch 2.0 as it has been released, and since March we have seen countless user testimonies of 2.0: by adding this simple one-liner, torch.compile, they improve their performance by anywhere from 30% to 2x. We have also seen our partners and vendors embracing the PT2 stack by integrating their backend compilers into it. Going forward, the short-term focus of 2.0 is to continue to improve performance. We believe that the 2.0 release, and the impressive numbers we just showed, are really the starting point of the 2.0 journey in terms of PyTorch entering graph mode; there is still a lot on the table. The second thing we want to improve is interoperability between 2.0 and other core features of PyTorch. There are still some features that we haven't yet had the chance to make work with 2.0, and we don't want users to have to choose between those core features and 2.0. And keep in mind that graph mode opens up a lot of possibilities. So in terms of our longer-term goals, roughly at the one-year mark or beyond, we see two huge avenues we need to invest in.
Number one is a distributed compiler. Today's training workloads are increasingly large, and distributed execution is an indispensable aspect of training as well as inference. With a distributed compiler, we would be able to optimize both compute and communication. The second major feature is PyTorch export, which would speed up the transition from research to production and allow 2.0 to be used in many production use cases, in both training and inference. So if you are excited about the 2.0 story, I invite you to try out torch.compile and give us feedback. And if you are a PyTorch developer or a compiler developer, I invite you to participate in the community and help keep PyTorch the number one machine learning framework in the world. Thank you.
We have traditionally relied on CPU-based servers for running AI workloads, but the increasing compute and memory requirements of these huge AI models have pushed us toward specialized solutions such as GPUs and other specialty hardware accelerators. Please welcome engineering director Roman Levenstein, AI infra research scientist Amin Firoozshahian, software engineer Joel Coburn, and ASIC engineer Olivia Wu to share a first look at MTIA, the Meta Training and Inference Accelerator. MTIA is the very first silicon designed in-house specifically for our internal AI workloads and systems.
Roman, Amin, Joel, and Olivia are going to share details on the design and talk about the challenges and opportunities of developing custom silicon. Take it away, folks. We are very excited today to announce MTIA, Meta's first in-house accelerator for AI. With MTIA, we own the entire system design, from the silicon to the platform to the software stack to the application, and it allows us to customize for our unique recommendation workloads and really control our destiny in providing cutting-edge AI for our users. So let's talk about why this is such an important and exciting step for us.
At Meta, deep learning recommendation models, or DLRMs, are a key part of the company's business. They're at the heart of our family of applications, such as Facebook, Instagram, and WhatsApp. In this graph, we're looking at an important trend we've seen for models serving in production in the data center: significant growth over time in model size, that is, memory footprint, both for the embeddings stored on the device (the yellow line) and for the model as a whole (the blue line), and in complexity, the number of computations required per sample (the pink line). Keeping up with this model growth in AI requires that we deliver ML platform solutions that provide the expected ROI for our business.
I'm Joel Coburn, and I work on AI hardware-software co-design at Meta. This means I work on designing systems across the hardware-software boundary to help deliver platform solutions that will address these model demands. So how do we do this? Traditionally, CPUs were used for serving inference models in production in the data center, but they're not a cost-effective solution to keep up with this growth. Hardware acceleration can address the power and performance issues: it provides a much more efficient way to serve inference requests, and it also provides the compute headroom to scale to future models.
Take a look at this graph showing our server capacity increase over a two-year deployment period. You can see that the initial demand for increased capacity was met with the NNPI accelerator, so we're switching from the CPU, in blue, to NNPI, in pink. But the requirements for inference quickly outpaced NNPI's capabilities, and Meta pivoted to GPUs because they provided greater compute power to meet the growing demand.
But it turns out that while GPUs provide a lot of memory bandwidth and compute throughput, they were not designed with inference in mind. Their efficiency is low for real models, despite significant software optimizations, and this makes them challenging and expensive to deploy in practice. This is why we need MTIA. With our in-house accelerator design, we can directly address the requirements of DLRM workloads and adapt to model trends over time.
Let me give a brief overview of our approach with MTIA and describe what makes it successful. The goal of MTIA is to improve the user experience in Meta applications. That is, we want to provide more accurate and interesting predictions, increased watch time, higher click-through rates, all things that improve the user experience and are driven by better AI capabilities. We do this by providing better developer efficiency and better performance per TCO than existing solutions.
On developer efficiency: this means we can lower the effort to enable new models, write new kernels, and optimize performance, so we can get models into production quickly and with high efficiency. We do this by providing a development ecosystem built on popular and familiar libraries and infrastructure. We integrate with PyTorch for building models, we innovate in the area of DSLs for kernel authoring, and we integrate with emerging technologies like Triton and MLIR.
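As a flavor of the Python-embedded DSL style of kernel authoring being referred to, here is a minimal, generic Triton kernel. This is not an MTIA kernel (open source Triton targets GPUs); it only illustrates how a short, high-level description of an operator can replace hand-written low-level code.

```python
import torch
import triton
import triton.language as tl

@triton.jit
def add_kernel(x_ptr, y_ptr, out_ptr, n_elements, BLOCK_SIZE: tl.constexpr):
    # Each program instance handles one BLOCK_SIZE-wide slice of the tensors.
    pid = tl.program_id(axis=0)
    offsets = pid * BLOCK_SIZE + tl.arange(0, BLOCK_SIZE)
    mask = offsets < n_elements
    x = tl.load(x_ptr + offsets, mask=mask)
    y = tl.load(y_ptr + offsets, mask=mask)
    tl.store(out_ptr + offsets, x + y, mask=mask)

def add(x: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
    out = torch.empty_like(x)
    n = out.numel()
    grid = lambda meta: (triton.cdiv(n, meta["BLOCK_SIZE"]),)
    add_kernel[grid](x, y, out, n, BLOCK_SIZE=1024)
    return out

a = torch.randn(10_000, device="cuda")
b = torch.randn(10_000, device="cuda")
assert torch.allclose(add(a, b), a + b)
```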
For efficiency, performance per TCO, and time to production, we focus on doing the chip and system design with open source components and leveraging vendor partnerships. With this, we can take advantage of the RISC-V ecosystem, leveraging external IP, the open source ISA, and the LLVM compiler. All of this allows us to focus on the really critical part for our business, which is designing the custom ML acceleration logic that makes sparse and dense operations more efficient.
We'll now go into more detail on the MTIA design. Amin will present the architecture, Roman will follow with the software stack, and Olivia will describe trends in the design and the challenges of scaling to future models. And now I welcome Amin to talk about the architecture.
Thank you, Joel, for the great introduction and motivation. At this point in the presentation, I would like to review with you the architecture of the accelerator and the design of the systems used to deploy these accelerators in the data centers. But before we dive into that topic, let's briefly recap what the idea of acceleration means.
We typically run our workloads on the CPUs inside the servers, but CPUs are not equipped to handle high-demand workloads such as AI. Therefore, these workloads are offloaded and run on adjacent systems that are coupled with the CPU, called accelerators. Accelerators either provide a lot more compute power or specialize in performing specific forms of compute, such as processing graphics in GPUs. They are typically tightly coupled with the CPUs in the servers and are controlled and managed by the CPUs.
My name is Amin. I'm a research scientist in the infrastructure organization, and I have been active in the field of computer architecture for 15 years. With that brief overview, let's take a look at the first in-house silicon that Meta has built for its own workloads. In this photo, you can see the silicon die of the MTIA chip, which is fabricated in 7nm technology from TSMC.
It runs at 800 MHz and is about 317 mm². It has a tight power budget of 25 watts, and within that budget it provides 102 TOPS of INT8 computation or 51.2 TFLOPS of FP16 computation. The accelerator uses both on-chip and off-chip memories and can provide up to 800 GB per second of on-chip memory bandwidth, or 176 GB per second of off-chip DRAM bandwidth.
Now that you have seen the die photo, let's take a look at the high-level accelerator architecture shown in this slide. As you can see, the accelerator is organized as an 8x8 grid of processing elements (PEs) that are connected to each other via a mesh network. There are memory resources on the sides of the mesh that are connected to the PEs and can be used by them.
These on-chip memory resources, which total 128 MB, can either be used as addressable memory or be configured as a memory-side cache, in which case they are backed by 16 LPDDR5 channels that provide connectivity to the off-chip DRAM chips. There is a dedicated control subsystem, and a dedicated host interface unit, which you can see on the bottom right, that connects the accelerator to the CPU on the server.
Now let's zoom in and dive into the internals of a PE. This diagram shows the internal organization of a PE. As you can see, the PE is equipped with two processor cores, which are based on the RISC-V open instruction set architecture and are heavily customized to perform their tasks. One of the processor cores is also equipped with the RISC-V vector extension and can handle any form of general-purpose vector compute. On the right-hand side of the diagram, you can see fixed-function units that specialize in performing dedicated forms of compute, such as matrix multiplication, calculation of nonlinear functions, or specialized data movement within the PE and between the PE and external memory.
The PE has 128 KB of local memory that can be used by the processor cores or the fixed-function units. There is a central command processor connecting the processor cores to the fixed-function units; it receives the stream of commands from the processors and distributes and orchestrates their execution on the fixed-function units. On the left-hand side, you can see general-purpose components such as timers, interrupt controllers, and a very elaborate debug subsystem, which are required for the proper functionality of the PEs.
Now, after reviewing the architecture of the accelerator, let's take a look at the design of the systems used to deploy these accelerators. In this slide, you can see a picture of the test board for the MTIA accelerator chip, with the chip sitting right in the middle. It uses a dual M.2 form factor and has a power budget of 35 watts. It is connected to the host using eight lanes of PCIe Gen 4, for a total of 12.8 GB per second of bandwidth to the host. The small form factor and power budget allow us to deploy multiple of these accelerator cards within a given system.
In this slide, you can see the topology of the systems used to deploy the accelerators in the data center. Up to 12 accelerator cards can be housed inside a single system, and they are connected to the host CPU and to each other using a hierarchy of PCIe switches. This particular topology allows the accelerators to talk to the host CPU as well as to each other in a peer-to-peer manner that does not involve or interrupt the host CPU. The parameters of the system, which is based on the Yosemite V3 server specification from the Open Compute Project, are carefully chosen.
The amount of host CPU processing power, host DRAM, storage, network bandwidth, and accelerator compute power are all balanced so that they are optimal for our current and future workloads. When fully populated, the system consumes around 780 watts of power. But I should note that hardware is only half the story. For a successful deployment, you also need a very powerful and flexible software stack that can map the resources of the hardware to the needs of the application. And with that, I would like to turn it over to Roman to talk about our software stack.
Thank you, Amin, for the intro. I'm Roman Levenstein. I've been with Meta for over five years, and I'm leading the development of the MTIA software stack, which I'm going to talk about in my presentation. The MTIA software stack aims to provide developer efficiency and high performance. It is fully integrated with PyTorch to provide a familiar developer experience: using PyTorch with MTIA is as easy as using PyTorch with CPUs or GPUs.
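To make the "as easy as CPUs or GPUs" claim concrete, here is a minimal PyTorch sketch of device-based dispatch. The `mtia` device string and availability check are assumptions for illustration; on a machine without such a backend, the code simply falls back to GPU or CPU.

```python
# A minimal sketch of what "as easy as CPU or GPU" means in PyTorch terms.
# The "mtia" device probe is illustrative; without that backend we fall back.

import torch

def pick_device() -> torch.device:
    # Hypothetical backend probe; torch.cuda.is_available() is the real API
    # used for the GPU case, the MTIA check is illustrative only.
    if getattr(torch, "mtia", None) and torch.mtia.is_available():
        return torch.device("mtia")
    if torch.cuda.is_available():
        return torch.device("cuda")
    return torch.device("cpu")

device = pick_device()
model = torch.nn.Sequential(
    torch.nn.Linear(64, 128),
    torch.nn.ReLU(),
    torch.nn.Linear(128, 1),
).to(device)

x = torch.randn(32, 64, device=device)
with torch.no_grad():
    y = model(x)
print(y.shape, "on", device)
```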
The MTIA software stack benefits a lot from the flourishing PyTorch developer ecosystem and tooling. On the slide, you can see that the MTIA software stack consists of multiple logical layers. On the top, you can see the application layer, which represents, for example, the serving stack of a recommendation system. It operates on top of PyTorch and is mostly hardware-agnostic, supporting backend targets such as CPUs, GPUs, and MTIA. Below that is the PyTorch layer, which includes compilers and runtime.
Let's talk about compilers first. Compilers are responsible for converting PyTorch models into efficiently executable MTIA code. First, we have the model compiler, which uses the PyTorch FX intermediate representation for model-level graph transformations and optimizations. It's responsible for making sure that the compute and data are distributed across the processing element grid, and that the fixed-function units accelerating the compute are always kept busy. It gradually converts the PyTorch graph into lower-level representations, which are finally converted into LLVM intermediate representation.
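Since the model compiler is said to start from the PyTorch FX intermediate representation, here is a small, generic example of capturing a model as an FX graph and walking its nodes the way a compiler pass would. The "pass" shown, collecting candidate modules to offload, is a toy stand-in, not Meta's actual MTIA lowering.

```python
# Generic PyTorch FX example: capture a graph and walk its nodes.
# The tagging "pass" is a toy stand-in for real backend partitioning.

import torch
import torch.fx as fx

class TinyModel(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.fc = torch.nn.Linear(16, 16)

    def forward(self, x):
        return torch.relu(self.fc(x)) + 1.0

gm = fx.symbolic_trace(TinyModel())

# Walk the graph the way a compiler pass would, inspecting each node.
for node in gm.graph.nodes:
    print(node.op, node.target)

# A trivial "transformation": collect the call_module nodes a backend
# might offload to a fixed-function matmul engine.
offload_candidates = [n for n in gm.graph.nodes if n.op == "call_module"]
print("candidate ops to offload:", [str(n.target) for n in offload_candidates])
```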
Next, we have KNYFE, our domain-specific language. This is our own development, and it's responsible for automatic generation of efficient MTIA kernels from short, high-level descriptions of ML operators. The library of ML kernels is mostly developed using this domain-specific language, but some of the most performance-critical operators, like fully connected layers or EmbeddingBag, are developed by human experts using low-level C++ and hardware APIs to make full use of the available hardware resources. At the bottom of the compiler stack, we have the LLVM backend, which is based on the open-source LLVM compiler toolchain with MTIA extensions. It is responsible for the last level of optimizations, such as inlining, register allocation, and emission of RISC-V executable code for the device.
Below that, we have the PyTorch runtime. The PyTorch runtime for MTIA is responsible for multiple things. It provides abstractions such as MTIA tensors, memory allocation, and, most of all, CUDA-like streaming APIs, which are needed for streaming and scheduling operators on the device. It's important to mention that the PyTorch runtime for MTIA supports different modes of model execution, including eager mode and graph mode, which is full-model compilation to maximize performance on the device. It also supports running models partitioned across multiple cards, providing the necessary synchronization and communication channels between them.
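The MTIA runtime's streaming APIs are described as CUDA-like but are not public, so the sketch below shows the analogous pattern with stock PyTorch CUDA streams, which is the programming model the comparison points to.

```python
# Analogy only: stock PyTorch CUDA streams illustrate the "CUDA-like"
# stream-and-schedule model the MTIA runtime is compared to.

import torch

if torch.cuda.is_available():
    s1 = torch.cuda.Stream()
    s2 = torch.cuda.Stream()
    a = torch.randn(1024, 1024, device="cuda")
    b = torch.randn(1024, 1024, device="cuda")

    with torch.cuda.stream(s1):
        c = a @ a            # enqueued on stream 1
    with torch.cuda.stream(s2):
        d = b @ b            # enqueued on stream 2, may overlap with stream 1

    torch.cuda.synchronize() # wait for both streams before using results
    print(c.shape, d.shape)
else:
    print("No GPU available; the point is only to illustrate stream scheduling.")
```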
Below the PyTorch runtime is the host-side device driver, which is responsible for communication between the host and MTIA devices. And finally, at the bottom, we have the firmware running on the MTIA device, which accepts commands from the host-side runtime and driver and manages the execution of models on the device. It's worth mentioning that the MTIA software stack is still evolving, and that we are working on making the compilers and runtime even more powerful by integrating them with the recently released PyTorch 2.0, which was presented earlier today. We are also working on integrating the MTIA software stack with emerging technologies such as TorchDynamo and TorchInductor, and on extending the Triton domain-specific language to support MTIA ML accelerators. We are also looking into using the MLIR intermediate representation for more advanced compiler optimizations.
In the next slide, we are going to look at the performance and efficiency evaluation of MTIA. We have evaluated MTIA against NNPI and GPUs using a set of DLRM models representative of what we run in production. They are shown in the following table, covering low-complexity, medium-complexity, and high-complexity models. The models vary widely in model size, by up to 160 times, and in model complexity, by up to 32 times. MTIA must perform well across this whole range of models.
The pie chart on the right shows a typical breakdown of where the time is spent in a typical DLRM model. We can see that the majority of the time is actually spent on fully connected layers, followed by EmbeddingBag layers, and then trailed by long-tail operations like concat, transpose, quantize, and dequantize, among others. The breakdown also gives us insight into where and how MTIA is more efficient: MTIA can reach up to two times better perf per watt on fully connected layers compared to GPUs.
Now, let's look at MTIA efficiency across the set of models. Just as a note, the MTIA software stack is still evolving. It is a production software stack that must adapt to the latest environment and changes, like the move to PyTorch 2.0, but at the same time it must operate well across a range of models to provide stability, accuracy, performance, and usability. We can see on the slide that MTIA achieves near perf-per-watt parity with GPUs and exceeds the perf per watt of NNPI in all cases. Roofline modeling indicates that there is still much room for improvement.
MTIA achieves impressive gains of up to three times better perf per watt on low-complexity models, and trails behind GPUs on high-complexity models, which is an area we have not yet focused on optimizing in the software stack, but we are looking into it in the upcoming halves. More details about these results can be found in our upcoming paper in the ISCA conference industry track later this year.
Meta has been deploying off-the-shelf CPUs and GPUs in our data centers. In this slide, we have a graph that shows the scaling trend of compute, memory, and network bandwidth for CPUs and GPUs over the past 20 years. In this graph, compute is in orange, memory bandwidth in green, and interconnect bandwidth in blue. As you can observe, compute capability has been scaling at twice the pace of memory and interconnect bandwidth across multiple generations of CPUs and GPUs.
As we scale our systems to support much larger and more complex workloads, this imbalance has manifested itself as a bottleneck in our data centers. You can see in the lower-right graph that some of our workloads spend as much as 58% of their time on networking and data transfer.
By designing the AI silicon in-house, we are finally able to gain control over the full stack, from application to software, system, and silicon. This enables us to close the gap, optimize the full stack for our workloads, and control our own destiny. MTIA is our first ML accelerator developed in-house, and we learned a lot throughout this process.
As we develop our next generation of ML silicon, we will continue to optimize every aspect of our architecture, striving for a balance between compute capability, memory bandwidth, and network bandwidth to achieve optimal performance for our workloads.
One of the key advantages of designing in-house silicon is the ability to co-design the architecture with our software team. As Roman covered earlier, our software stack is fully integrated into our PyTorch ecosystem. With feedback from our co-design team, we are able to introduce new custom instructions and compute primitives for model innovation, create constructs that enable faster operator launch, memory allocation, and easy prototyping, and incorporate features that allow us to future-proof the silicon design and scale with future workloads.
The advancement in AI is going to provide a tremendous opportunity for us to innovate and push the boundaries of technology. Our in-house accelerator will allow us to optimize all the components of the silicon, system, and tooling to improve the cost efficiency of our infrastructure. It enables our software developers to create AI models that provide more relevant content and recommendations, and elevate the user experience to the next level.
Thanks, Amin, Roman, Joel, and Olivia. We will now take a short break. When we come back, we will share another look at our in-house silicon efforts, this time focusing specifically on video processing. See you shortly.
Infrastructure is fundamentally what I call the engine behind the growth of this company. We have been building data centers since 2010. We are talking about serving maybe half of humanity. Now, when we are talking about AI, you need a similar engine behind AI so that AI can achieve the potential that we are dreaming of.
AI workloads are growing at a pace of 1,000X every two years. I mean, just contemplate what that means for our systems, our silicon, our software stack. We're in the middle of a pivot to the next age of information. The models themselves are becoming hundreds or thousands of times larger and more complex, and what this is going to require is infrastructure at the exaflop level.
We are creating new hardware, including our own silicon. We're building out new kinds of network architectures. We're reimagining the software stack, like PyTorch. Thousands of engineers are innovating on this large-scale infrastructure that's built specifically for AI. The Meta Training and Inference Accelerator, MTIA, is Meta's first in-house silicon. It was designed with recommendation models in mind, and it is a piece that fits with the rest of the system, like the software ecosystem for writing applications and deploying ML models. And it's built in-house. By having it in-house, we are able to optimize every single nanometer of the chip, so we don't have any part of the architecture that is wasted, and that helps bring down the power.
The fundamental target in designing MTIA is to provide the highest performance at the lowest power, and in the process we achieved twice the efficiency compared to today's GPUs.
Another silicon product built by Meta is MSVP, the Meta Scalable Video Processor. People spend more and more time producing and sharing videos, which means more and more pixels will hit our data centers. MSVP processes these videos nine times faster than traditional software encoders, maintaining video quality on par with the software encoders at half the energy. MSVP can also be the final engine for generative AI: the content that people create eventually needs to be encoded, because it can never traverse the internet in its raw format. All of these requirements put together necessitated the design and manufacturing of MSVP.
So our new silicon hardware is going to need a new home. We're working on the next-generation data center design. We're moving to AI machines that can leverage GPUs or the custom silicon we're developing ourselves, and that's going to require a denser data center design. We're going to be leveraging higher-density racks. The servers themselves will be liquid-cooled to the chip. And with flexibility in mind, we're going to co-locate servers and network together to enable future generations of AI.
We believe our Research SuperCluster is one of the fastest AI supercomputers in the world. It is one of the unique places where you can run truly large-scale jobs. We have 16,000 GPUs interconnected with 1.6 terabits per second of InfiniBand network, producing approximately 5 exaflops of compute power. We also have almost half an exabyte of storage backing the compute and network.
The advancements in AI are coming at the right time, so RSC can be put to use to take advantage of the infrastructure and the data, and then move fast. We design, create, run, and operate all of our infrastructure. So it's a sweet spot of being able to work at scale, work at the cutting edge, and also do work that literally reaches billions.
Infrastructure at scale is what our long-term research requires, and innovation without it is impossible. I feel like I'm in the middle of a revolution. It's just an incredibly exciting time to be at Meta.
We also have a lot of work to do.
Welcome back. Our next presenters will introduce you to MSVP, Meta's Scalable Video Processor. It's the first-generation, server-grade video processing hardware accelerator of its kind that we have developed here at Meta. Technical lead manager Harikrishna and Video Infra research scientist Yiannis Katsavounidis will describe why we needed to build it, the architecture behind it, and some of the novel algorithms we developed to achieve high-quality video transcoding. They will also describe how the hardware accelerators are used in Meta's data centers to support processing and transcoding billions of videos every single day, and to provide premium video quality to end users, all while saving us power. Take it away, guys.
Hello and welcome. I'll be talking to you about how we process videos at Meta, and especially how we do that while maintaining the best quality and being very energy efficient, using MSVP, Meta's Scalable Video Processor. My name is Yiannis Katsavounidis, and I'm part of Video Infra.
Everybody is familiar with Meta's family of apps, Facebook, Messenger, Instagram, and WhatsApp, and our hardware products such as the Oculus Quest. There are more than 3.7 billion monthly active users and more than 2.9 billion daily active users. But you probably didn't know that video overall makes up more than 50% of the time spent on Facebook. Video is king, and that shows up in many of our products, such as our short-form video product, Reels, premium music videos, the unique social experience of watching together, and of course live video.
What makes video processing at Meta unique is the wide variety of content. We have everything, including video on demand, live and real-time processing, and that includes both user-generated and professional content. Here is how and why we process videos at Meta.
Everything starts with a video on your mobile phone that first gets uploaded to our data centers. There it gets transcoded into different formats and different resolutions. For example, one may need to deliver it to a mobile phone at 700 kilobits per second, to a tablet connected over Wi-Fi at 2 megabits per second, or to the browser on your computer at 20 megabits per second.
There are four basic processing steps when transcoding videos. After the video is uploaded, the first step is to decode it into frames, or pixels; then we resize it into smaller resolutions; the next step is to encode it into a more sophisticated codec, such as AV1; and last but not least, we calculate the quality of that transcode using standard quality metrics.
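As a point of reference for those four steps, here is what a purely software pipeline looks like when driven from Python with the ffmpeg CLI. This is a generic illustration, not MSVP's firmware; the filenames, target heights, and bitrates are placeholders echoing the phone/tablet/browser example above.

```python
# Decode, resize, re-encode, and measure quality with the ffmpeg CLI.
# Generic software pipeline for illustration; paths and bitrates are placeholders.

import subprocess

def transcode(src: str, dst: str, height: int, bitrate: str) -> None:
    # Decode, resize to the target height (keeping aspect ratio), re-encode.
    subprocess.run([
        "ffmpeg", "-y", "-i", src,
        "-vf", f"scale=-2:{height}",
        "-c:v", "libx264", "-b:v", bitrate,
        dst,
    ], check=True)

def measure_ssim(distorted: str, reference: str) -> None:
    # Full-reference metric; scale the distorted rendition back to the
    # reference's resolution first (SSIM requires matching dimensions).
    subprocess.run([
        "ffmpeg", "-i", distorted, "-i", reference,
        "-lavfi", "[0:v][1:v]scale2ref[d][r];[d][r]ssim",
        "-f", "null", "-",
    ], check=True)

# One upload, several delivery renditions (phone, tablet, desktop browser).
for height, bitrate in [(360, "700k"), (720, "2M"), (1080, "20M")]:
    transcode("upload.mp4", f"out_{height}p.mp4", height, bitrate)
    measure_ssim(f"out_{height}p.mp4", "upload.mp4")
```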
Now, there is a two-way trade-off that everybody is familiar with, and that's the trade-off between quality and bits spent. But at Meta we have a third component, and that's the amount of compute we spend to do all this processing. That trade-off means you cannot improve all three at the same time. For example, in order to keep quality constant while spending less compute, you pay for it by using more bits.
Now it's time for my friend Hari to explain how we do it using MSVP. Thank you, Yiannis, for the great introduction. Hello, I'm Harikrishna. Before we go through the MSVP architecture, let's look at the motivation behind building it. Meta's billion-scale video needs require an energy-efficient and low-latency video transcoding solution. We mostly process pre-encoded videos, which means the video quality is already degraded from the source, so we need an encoder that is on par with best-in-class software encoders to preserve the video quality. These stringent requirements led us to build MSVP.
As Yiannis mentioned, these are the key components of the transcoder pipeline. Every uploaded video first needs to be decoded to produce the original pixels. We support the H.264, H.265, VP9, and AV1 codec formats. These pixels are then sent through overlay composition, cropping, and rotation as required, before being resized to produce the various resolutions we need for encoding. The resized frames are then encoded into H.264 and VP9 formats. We also have a quality metric module to compute the similarity metrics for every encoded video.
In this pipeline, the pixels are mostly exchanged between these modules directly. If not, we also have a large on-chip cache to exchange pixels that need a slightly longer lifetime. And for those pixels that need much more than a frame's worth of lifetime, we send them through off-chip memory; before we do, we also compress the pixels to save energy. This pipeline is programmable to support various quality presets through the firmware running on the RISC-V controller.
Apart from this, we also have a JPEG image transcoder in this pipeline. The pipeline can be programmed either to operate as single-pass encoding, for applications that need very low latency, or as multi-pass encoding to produce high-quality videos. At its peak, this pipeline can support single-input, multiple-output transcoding at 1 billion pixels per second in less than 10 watts. Compared to software encoding, this is 9 times faster at half the energy. The pipeline can also support transcoding from 4K down to QCIF, with the supported frame rate varying with the video resolution.
Within the pre-processor, apart from overlay composition, the scaler is the key component. In our scaler, we use 25-tap 2D filters. These offer very high-precision filtering and far superior quality compared to conventional scalers used in the industry. We also need to support arbitrary frame sizes in our use cases, and we support scaling from 4K down to QCIF in a single step.
Within the transcoding pipeline, the encoder is the most compute-intensive module, and the architecture choices we make largely dictate the video quality of all the videos coming out of the transcoder. We use a three-stage motion search that is programmable and supports a very wide search range: plus or minus 512 pixels in the horizontal direction and plus or minus 160 pixels in the vertical direction, across multiple reference frames. We also support near-exhaustive mode decision using rate-distortion optimization in every decision. Rate-distortion optimization, or RDO, is one of the best-known practices in video compression for determining the optimal mode decision. The distortion calculation itself is very compute-intensive but parallelizable; the rate estimation, however, is very serial in nature. We use a novel rate estimation model in MSVP that allows us to use multiple of these RDO engines in parallel to get the speed we need. We also use many smart quantization techniques and other proprietary algorithms in our video pipe. And finally, we use three of these encoder pipes to process three consecutive macroblock or super-block rows in a wavefront-parallel manner.
Here we show the quality metric module. The way we implement and use quality metrics in our video traffic is very unique to MSVP. We support SSIM, multi-scale SSIM, VIF, PSNR, and no-reference metrics like blur in our quality metric module. The quality metric module is also quite compute-intensive: in a typical case, for every uploaded video we need to produce five different encoding resolutions, and for each of those encoding resolutions we need to compute similarity metrics at five different viewport resolutions, which means about 25 quality metric computations in total for every uploaded video.
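The bookkeeping behind that 5 x 5 grid can be sketched in a few lines. The example below uses scikit-image's SSIM on single frames, with placeholder resolutions and synthetic data, whereas the real module runs these metrics in hardware over entire videos.

```python
# Sketch of the 5 encodings x 5 viewports bookkeeping described above.
# Resolutions are placeholders; a real pipeline runs per frame over whole videos.

import numpy as np
from skimage.metrics import structural_similarity
from skimage.transform import resize

ENCODING_RESOLUTIONS = [(1080, 1920), (720, 1280), (480, 854), (360, 640), (240, 426)]
VIEWPORT_RESOLUTIONS = [(1080, 1920), (720, 1280), (480, 854), (360, 640), (240, 426)]

def frame_at(resolution, frame):
    h, w = resolution
    return resize(frame, (h, w), anti_aliasing=True)

reference = np.random.rand(1080, 1920)          # stand-in for a decoded source frame
scores = {}
for enc_res in ENCODING_RESOLUTIONS:
    encoded = frame_at(enc_res, reference)      # stand-in for a decoded rendition
    for view_res in VIEWPORT_RESOLUTIONS:
        ref_v = frame_at(view_res, reference)
        enc_v = frame_at(view_res, encoded)
        scores[(enc_res, view_res)] = structural_similarity(ref_v, enc_v, data_range=1.0)

print(len(scores), "quality computations")      # 5 x 5 = 25, as in the talk
```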
So much goodness here. So now it's my turn to show you how we use it in our data centers. The wide variety of videos and the distribution of popularity dictate different treatment for different videos. Videos with fewer views get the so-called basic treatment, while the more popular videos get our advanced family of encodings. This is all enabled by MSVP. In this example, you see how, by doing multiple encodings at different resolutions, we can find the optimal settings that give us the best quality at a given bitrate. But the best part is that after you do all these multiple encodings using MSVP, you can take the corresponding settings and do another pass using a software encoder to get even better quality. In this example, you see how the convex hull, which is the set of best quality settings, can translate from AVC to AV1, giving us an additional 65% bitrate savings. So in summary, we use MSVP because we have a wide variety of content, both premium and user-generated, VOD, live, and real time. We use those best practices, and that gives us the best end-to-end quality. MSVP allows us to do everything from basic encoding with the lowest amount of latency all the way to the advanced encodings that push quality to the max.
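The convex-hull selection can be illustrated with a small helper that keeps only the encodings on the upper convex hull of (bitrate, quality) points. The sample points below are made up; a production ladder would use real measured quality scores.

```python
# Keep only the (bitrate, quality) points on the upper convex hull,
# i.e. the best quality achievable at each bitrate. Sample data is synthetic.

def upper_convex_hull(points):
    """points: list of (bitrate_kbps, quality). Returns the hull, left to right."""
    pts = sorted(points)
    hull = []
    for p in pts:
        # Pop points that would break concavity of the upper hull (left turns).
        while len(hull) >= 2:
            (x1, y1), (x2, y2) = hull[-2], hull[-1]
            x3, y3 = p
            cross = (x2 - x1) * (y3 - y1) - (y2 - y1) * (x3 - x1)
            if cross >= 0:
                hull.pop()
            else:
                break
        hull.append(p)
    return hull

trial_encodings = [
    (400, 78.0), (700, 84.5), (700, 81.0),   # e.g. 360p at two settings
    (1500, 89.0), (1500, 86.0),              # e.g. 720p
    (4000, 93.5), (4000, 91.0),              # e.g. 1080p
]
print(upper_convex_hull(trial_encodings))
# -> [(400, 78.0), (700, 84.5), (1500, 89.0), (4000, 93.5)]
```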
Finally, here we show how all the transcode IPs we discussed earlier are packaged with the PCIe controller to connect to the host. We also have a memory controller to connect to the off-chip memory, where we store all the intermediate pixels, a secure boot processor to authenticate the firmware running on this ASIC, and many peripherals to help us with debug and diagnostics. On the right, we show the die shot. This is a 100 mm² chip in 7 nm, and the encoder takes more than 50% of the area. Finally, this SoC, along with LPDDR modules, is packaged onto an M.2 connector. We have two of these connectors, with heat sinks, going into a GPv3; the GPv3 is coupled with a host server into a sled, two of these sleds go into a cubby, and several of these cubbies go into the data center. I'll now hand it off to Yiannis, who will talk about how MSVP is used in our data centers. Thank you very much.
It also gives us many other potential opportunities, such as using it in live and broadcast scenarios, where even lower latency matters a lot, and for advanced pre-processing, both to improve quality and for video understanding. Our north star is to offload the majority of our stable and mature video processing, and to use software only as a thin layer to further boost quality wherever that matters.
So are we done yet? Well, we'd like to do many more of these by adding more video codecs, such as AV1. We're going to support more and better video quality metrics. We're going to further improve video and audio quality by doing pre- and post-processing, such as denoising, de-blocking, and artifact removal. We're going to push even more pixels through the same die, and we'd like to stay focused on Meta's most important use cases, such as reducing the compute and storage footprint, improving quality for our short-form videos, and enabling immersive 360, XR, and VR videos and other metaverse content. Thank you, and we welcome your collaboration.
Thanks, Hari and Yiannis. With thousands of developers across Meta working in multiple programming languages and writing billions of lines of code, software development is truly the core of our business. As we look to help our developers continue to be as productive as they possibly can be, we developed CodeCompose, a generative AI coding assistant. CodeCompose suggests code that a developer is likely to type next. It's intended to be quick and unobtrusive, and it helps accelerate the process of code authoring. Meta software engineer Michael Bolin joins us now with an inside look at CodeCompose, and at the longer-term vision to help developers across the whole SDLC, the software development lifecycle.
Take it away, Michael. Hi, my name is Michael Bolin, and I'm a software engineer at Meta. Today, I'm going to show you how we leverage generative AI for code authoring using an in-house tool we have developed called CodeCompose. CodeCompose is a code completion service that suggests code as you type in an editor such as VS Code. The underlying model is built on top of public research from FAIR that we have tuned for our internal use cases and code bases. On the product side, we are able to integrate CodeCompose into any surface where our developers or data scientists work with code.
As we will show, by taking ownership of the product from end to end, we've been able to build a compelling code completion service for developers at Meta, deployed at scale, which contributes to a meaningful portion of code authored at the company.
Today, we'll start by examining the generative AI model that powers CodeCompose. Next, we'll provide a brief look at the product architecture, followed by a demo where you can see it in action. Finally, we'll discuss the impact that CodeCompose has had at Meta.
Let's start by digging into the model. About a year ago, FAIR released InCoder, a generative AI model trained on code. This model comes in two variants: a smaller model with 1.3 billion parameters and a larger model with 6.7 billion parameters. For the CodeCompose service, we use both variants of InCoder, which, based on context, lets us optimize the trade-off between quality and latency. A key difference between InCoder and other large language models currently used for code generation is its ability to perform infilling. Infilling lets the model suggest text within existing text, which is an action that is important for code authoring and editing. While the public version of InCoder has been an invaluable starting point for CodeCompose, we have made a number of notable improvements along the way.
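For readers who want to try the public model, here is a minimal infilling sketch against the InCoder checkpoint on Hugging Face. The sentinel-token layout (`<|mask:0|>`, `<|endofmask|>`) follows the public release's documented usage and should be treated as an assumption to verify against the model card; the fine-tuned model CodeCompose uses internally is not the same.

```python
# Minimal infilling sketch with the public InCoder checkpoint.
# Sentinel-token layout is assumed from the public release; verify on the model card.

from transformers import AutoTokenizer, AutoModelForCausalLM

MODEL = "facebook/incoder-1B"   # the smaller of the two public variants

tokenizer = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(MODEL)

before = "def read_config(path):\n    "
after = "\n    return config\n"

# Left-to-right models only condition on `before`; infilling also sees `after`.
prompt = before + "<|mask:0|>" + after + "<|mask:0|>"

inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=48, do_sample=False)

# Decode only the newly generated tokens and cut at the end-of-mask sentinel.
new_tokens = outputs[0][inputs["input_ids"].shape[1]:]
infill = tokenizer.decode(new_tokens, skip_special_tokens=False)
infill = infill.split("<|endofmask|>")[0]

print(before + infill + after)
```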
For starters, we fine-tune the model on first-party code, exposing it to our internal libraries and frameworks so CodeCompose can incorporate them into its code suggestions. In particular, at Meta we are heavy users of Hack and Flow, which are programming languages that were not well represented when the original InCoder model was trained. Fine-tuning on our first-party code helps close that gap.
Further, in curating the code used for fine-tuning, we exclude files that contain patterns we want to discourage, such as deprecated APIs like React.createClass in Flow, or code containing errors suppressed via an HH_FIXME annotation in Hack. We also supplement our training data with code that does not live in source control, such as Jupyter notebooks. The net result of these investments in training data has been impressive.
To assess the impact of fine-tuning, we ran an offline evaluation of different versions of the model to measure its ability to infill an exact match on first-party code across various languages. For the experiment, we prepared three versions of the model. As a baseline, we took the original 1.3 billion parameter InCoder model and used it to infill a random sampling of masked snippets of first-party Python, Hack, and Flow code. As expected, the model performed best on Python, reproducing the original code over 20% of the time. Surprisingly, the model was also able to yield an exact match over 10% of the time for Hack and Flow, despite those languages not being well represented in the original training data.
Next, we took the original 1.3 billion parameter InCoder model, fine-tuned it exclusively on first-party Python code, and re-ran the infilling analysis. As expected, this improved the exact match rate for Python in our experiment, jumping from 23% to 36%. An unexpected result was that fine-tuning on Python also improved the scores for Hack and Flow, illustrating the potential benefits of transfer learning in large language models.
Finally, we did an additional round of fine-tuning that included first-party Hack and Flow code and re-ran the analysis once more. As expected, this further improved the model's ability to infill exact matches for Hack and Flow, though the score for Python saw a small decrease of less than 1%. As you can see, being able to fine-tune on first-party code is a significant advantage of building our own code completion service.
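The exact-match evaluation itself is simple to sketch: mask a span, ask the model to infill it, and count how often the infill reproduces the original span. The whitespace normalization and the placeholder predict() callable below are illustrative choices, not the exact evaluation harness.

```python
# Sketch of an exact-match infilling evaluation. The predict() callable is a
# placeholder for a real model call; stripping whitespace is an assumed choice.

def exact_match_rate(samples, predict):
    """samples: iterable of (before, target, after) triples."""
    hits = 0
    total = 0
    for before, target, after in samples:
        prediction = predict(before, after)
        hits += int(prediction.strip() == target.strip())
        total += 1
    return hits / total if total else 0.0

# Toy usage with a dummy "model" that always suggests `return None`.
samples = [
    ("def f(x):\n    ", "return x + 1", "\n"),
    ("def g():\n    ", "return None", "\n"),
]
print(exact_match_rate(samples, lambda before, after: "return None"))  # 0.5
```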
In addition to experimenting with the training data, we also made some changes to the model architecture itself. In our current CodeCompose integration in VS Code, we only request completions when the user's cursor is in an empty block or when it is at the end of a line after a hard-coded set of trigger characters, such as a period or parenthesis.
In practice, this means we provide completions not at arbitrary offsets within a document, but only in a limited set of circumstances. However, the original InCoder model was trained using its own causal masking objective, in which source code is tokenized using byte-pair encoding before the region to be masked is selected. Because mask boundaries were not guaranteed to align with trigger character boundaries in the original source code, we discovered that this led to a mismatch between training and inference that resulted in poor mid-word predictions.
To address this, we refined the training objective to something we call language causal masking, in which the code is partitioned on trigger character boundaries and the mask is selected from the resulting segments. Only after this is done do we tokenize the three segments individually, the code before, the code after, and the target code that was masked. This resulted in gains of up to 61% in offline evaluation of exact match.
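A rough sketch of the data-preparation idea behind language causal masking is shown below: partition source code on trigger-character boundaries, pick one resulting segment as the masked target, and keep the (before, target, after) triple for tokenization. The trigger set and the random selection policy are illustrative assumptions, not the exact CodeCompose recipe.

```python
# Sketch of building a "language causal masking" training example.
# The trigger-character set and selection policy are illustrative assumptions.

import random
import re

TRIGGER_CHARS = ".(,\n"   # e.g. period, parenthesis, plus comma/newline (assumed)

def split_on_triggers(code: str):
    # Keep the trigger character attached to the segment it ends.
    pattern = "([" + re.escape(TRIGGER_CHARS) + "])"
    pieces = re.split(pattern, code)
    segments, buf = [], ""
    for piece in pieces:
        buf += piece
        if piece and piece in TRIGGER_CHARS:
            segments.append(buf)
            buf = ""
    if buf:
        segments.append(buf)
    return segments

def make_training_example(code: str, rng: random.Random):
    segments = split_on_triggers(code)
    i = rng.randrange(len(segments))
    before = "".join(segments[:i])
    target = segments[i]
    after = "".join(segments[i + 1:])
    return before, target, after   # tokenized separately downstream

rng = random.Random(0)
code = "import os\nprint(os.getcwd())\n"
print(make_training_example(code, rng))
```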
So now we've talked a lot about the model, but let's see how we can leverage it to provide code suggestions to end users. At Meta, we have a tier of machines equipped with powerful GPUs, each with sufficient memory to host both the 1.3 billion and 6.7 billion parameter variants of the CodeCompose model. Clients can make requests to this tier via Thrift.
The caller specifies the code before and after the cursor, as well as the file path, language, and which model to use to process the request. The caller also decides whether to request a single line or multi-line code completion. In practice, clients request single line suggestions from the smaller model and request multi-line suggestions from the larger model.
To mediate requests between the client and server, we implemented a language server in Rust that we reuse across our various editor integrations. For editors such as VS Code that support LSP natively, we require relatively little glue code to create the CodeCompose extension. For editors such as Android Studio that do not have native LSP support, we built a small adapter to proxy requests to our LSP.
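The routing rule described above, single-line requests to the smaller model and multi-line requests to the larger one, can be sketched as follows. The field names and model identifiers are hypothetical; the real service exposes this over Thrift behind the language server rather than as local Python.

```python
# Hypothetical sketch of the request shape and model-routing rule described
# in the talk; names and identifiers are placeholders.

from dataclasses import dataclass

@dataclass
class CompletionRequest:
    code_before_cursor: str
    code_after_cursor: str
    file_path: str
    language: str
    multiline: bool

def choose_model(req: CompletionRequest) -> str:
    # Smaller model keeps latency low for inline, single-line suggestions;
    # the larger model is reserved for whole-block, multi-line suggestions.
    return "incoder-6.7b-finetuned" if req.multiline else "incoder-1.3b-finetuned"

req = CompletionRequest(
    code_before_cursor="def load_user(user_id):\n    ",
    code_after_cursor="",
    file_path="www/users.py",
    language="python",
    multiline=True,
)
print(choose_model(req))   # -> the larger model for a block completion
```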
Further, this architecture makes it straightforward to integrate CodeCompose with our in-house developer tools, such as Bento, a web-based Jupyter notebook UI created internally at Meta. Because Bento also supports LSP, it was easy to provide our data scientists with the same AI-powered code suggestions as developers working in VS Code. This ability to plug CodeCompose into any code editing surface internally is an advantage of owning the entire stack. Finally, in addition to reducing the amount of integration work required to support a new surface, making the LSP responsible for the bulk of the client-side logic ensures that metrics are recorded consistently across these surfaces. Implementing fine-grained telemetry to accurately capture the impact of CodeCompose is important for improving both the model and the product. This tight feedback loop has helped us make fast progress in this space.
We've talked a lot about CodeCompose, but now it's time to see it in action. We'll start in VS Code. CodeCompose makes suggestions as the user types, which can be accepted by pressing Tab. It shows single-line suggestions to complete a line, or multi-line ones to fill in an entire block of code. CodeCompose can take advantage of the surrounding code to provide better suggestions. Here we see that adding an import statement and refining the function signature helps CodeCompose produce a more specific implementation of the function. CodeCompose also uses prose comments as a signal when generating code: as shown here, updating the docstring to mention the use of ps causes CodeCompose to suggest a new implementation that does just that, and a comment specifying the use of ps aux refines things further. Because CodeCompose takes the code before and after the cursor into account, it can suggest things like annotations and import statements.
Now let's look at CodeCompose in Bento. Just like in VS Code, you can use comments to help CodeCompose generate suggestions. Given the interactive nature of notebooks, this results in a tight feedback loop for exploring data. As you can see, CodeCompose considers the code from the surrounding cells, allowing each step to build upon the previous steps.
Now that you've seen CodeCompose in the wild, let's see what sort of impact it has had at Meta. Anecdotally, we have received a lot of positive feedback about how CodeCompose has helped people write code better and faster. But let's look at some numbers as well. Python was the first language supported by CodeCompose, and usage has grown steadily as we continue to improve the service. Today, thousands of employees at Meta accept suggestions from CodeCompose every week. For suggestions visible to the user for at least 750 milliseconds, our acceptance rate is over 20% and climbing.
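That acceptance-rate figure can be reproduced from telemetry with very simple bookkeeping: count only suggestions that stayed visible for at least 750 milliseconds as shown, then divide accepted by shown. The event format below is hypothetical.

```python
# Sketch of the acceptance-rate bookkeeping; the telemetry event shape is made up.

MIN_VISIBLE_MS = 750

def acceptance_rate(events):
    """events: iterable of dicts like {"visible_ms": 900, "accepted": True}."""
    shown = [e for e in events if e["visible_ms"] >= MIN_VISIBLE_MS]
    if not shown:
        return 0.0
    accepted = sum(1 for e in shown if e["accepted"])
    return accepted / len(shown)

events = [
    {"visible_ms": 120, "accepted": False},   # flashed by too quickly; excluded
    {"visible_ms": 900, "accepted": True},
    {"visible_ms": 1500, "accepted": False},
    {"visible_ms": 800, "accepted": True},
    {"visible_ms": 760, "accepted": False},
]
print(f"{acceptance_rate(events):.0%}")       # 50% in this toy sample
```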
We are extremely excited about our progress on CodeCompose, and we believe that our developers are best served by bringing this work in-house. First, as we have shown today, having the flexibility to experiment with the model has made it possible to support languages like Hack and Flow, avoid undesirable code patterns, and perform all sorts of analyses and explorations to improve the service that would not have been possible if the model were a black box. Second, because we control the product from end to end, we can integrate CodeCompose anywhere across our vast array of internal tools, and our LSP architecture makes it economical to do so. Third, privacy and security are fundamental to all the work we do here at Meta. Building CodeCompose in-house ensures we do not share our code with third parties and that the telemetry we collect is subject to access controls. And there you have it.
We covered a lot today, but if you still want to learn more, we are publishing a blog post and a paper with more details about CodeCompose. That said, we are still early in our work in this space, so we look forward to sharing more about our progress going forward. Thanks for listening, and I hope you enjoyed it.
I also took on the role of head of product for our newly formed generative AI team. Today I'm going to dive into some of our AI infrastructure projects with this great group of people. I'll kick it off with some intros. Rachel, first.
Great, thanks, Irene. Hi, I'm Rachel Peterson, and I lead the data center strategy team, which looks at our data center capacity needs of today but also ensures that we're prepared for the future. Since 2010, Meta has been designing, building, and operating its own data centers, and we've always done this with a mindset of innovation, efficiency, and reliability. Over at least the last 18 or so months, our data center organization has been essentially focused on AI, the future of AI, and preparing us for it. So while our current data center designs have been supporting all of our AI workloads, we're really looking toward the future, and our teams have been focused on designing and deploying a next-generation data center that can support not only our current needs around AI and our current products and services, but also our needs of the future.
I'm going to pass it off to Alexis next. My name is Alexis Björlin, and I support our AI systems and accelerated platforms teams here at Meta. Our teams are responsible for designing, delivering, and supporting our compute and storage platforms at scale, in addition to building and delivering our compute, storage, and networking hardware all the way down to the silicon itself. One of the biggest missions and visions of our team is ultimately to deliver software-defined hardware: taking into account the needs of our evolving workloads, developer efficiency, and our entire software stack, so that we can deliver hardware that is optimized for the needs of our users. We're in a unique position here at Meta to do that because we've got the end-to-end ecosystem in-house. Kim, I know we work together a bunch.

Absolutely, yeah. Thanks so much, Irene. Hi, everybody. My name's Kim Hazelwood. Broadly, I lead the organization that provides the AI research infrastructure solutions for the cutting-edge AI research happening within our research organizations. We develop things like the AI Research SuperCluster, as well as the entire software stack that is used by the researchers in our organization. We're also doing some cutting-edge systems research, right at the intersection between systems and machine learning, and all of this is being used to unlock some of the latest breakthroughs in AI. And next, my partner in crime on the AI team?

Hi, my name is Aparna Ramani, and I am responsible for data, AI, and developer infrastructure. Collectively, my teams are responsible for all the components across the data and software lifecycles. We start with the teams that build the systems for prepping data and having it be ready for training. We have teams that build the training and inference stacks. And we also have a team of specialists, just incredible engineers, working on all of the developer experiences, everything from enabling machine learning engineers to software engineers who build the applications that serve the three billion users that we have. It's an exciting time to be in AI, both for us at Meta, because we have this incredibly ambitious roadmap ahead of us for AI, and also just looking at the transformative power of AI across our lives on a daily basis.
So, AI: it's a great time to be here. Many of you mentioned this transformation, the impact the evolution of AI has had. How has it impacted Meta's infrastructure as you see it today? Also, as the company evolves, we rely on infrastructure more and more, from our family of apps to the metaverse to the devices we produce. How are you able to adjust your infrastructure strategy as you do that? Rachel, do you want to kick us off?
Our current generation of data center designs is world-class in energy and power use efficiency. It has actually supported us through multiple generations of server, storage, and network hardware, and it serves our current AI workloads really well. But as we look toward the future, it's always been about planning for the future of AI hardware and systems, and how we can have the most performant systems in our fleet.
And to do this, we really had to redesign and rethink our data centers, to support the future all the way from how we look at the building design, to power and cooling, to how the network fits into the data center. For example, as we look toward the future, we see a future in which AI chips are expected to consume more than 5X the power of our typical CPU servers.
This has really caused us to rethink the cooling of the data center and to provide liquid cooling to the chips in order to support this level of power. And for that we had, of course, to redesign the system. But I'm really excited about what the future holds in our new data center designs. And as we do these things, we're always looking forward and adjusting, which comes right back to that question of how we adjust for the change before the change happens.
And Alexis, I know you've done a bunch of that with the team you lead. Yeah, absolutely. Building off of everything Rachel just shared, I think it should be very clear that we're designing end-to-end systems that start from the physical side, from the data center through the network that connects our global fleet, all the way into the data center to our AI training clusters and our distributed inference serving platforms.
Meta has a tremendous amount of experience in delivering distributed compute at scale. One of the pivots we had to make when we started building out our AI infrastructure a number of years ago was shifting some of our thinking to deliver more highly integrated, tightly coupled systems to train our workloads, and then to do the inference serving. So not only do we design end to end for our AI platforms on the physical side, but also all the way up the software stack, working with AI Infra, with next-generation research workloads, and with our product group teams, to make sure we achieve software-defined hardware systems, that our fleet can handle more heterogeneity, and that we build in the flexibility and capabilities that will service future workloads we don't even need to envision today.
Yeah, and you mentioned networking, hardware, software. And Aparna, we talk a lot about these shifts and being able to be nimble, but at the same time plan ahead. So how do you manage that, being able to plan ahead but still adjust as the market changes?
One of the things I want to say is that we've invested in AI for years at Meta. If you really think about it, News Feed launched in 2005, and News Feed is really a ranking algorithm. Then, as relevance improved and News Feed evolved, all of that was powered by AI. We've invested in helping our users stay safe with spam detection and with removing hate speech, for example; all of that is powered by AI. And then we have, I think, one of the world's best ads platforms, and all of that is powered by AI. So this is really an evolution for us in infra that goes back a very long time.
And what's shifting now, I think, is that the pace of innovation is rapidly increasing. Model architectures are changing, model sizes are changing, we're seeing really complex models evolving, and data volumes are growing. So, thinking about the software evolution we've had, we've done a few different things. Organizationally, we've now consolidated all of our software into the AI Infra organization, so we're really well positioned to handle all of these new changes and shifts. We've created whole new software stacks: PyTorch, the leading framework of choice for machine learning development, was built at Meta and open-sourced, and is now part of a foundation. We've invested in entirely new training stacks, and we've invested in inference, which is our ability to serve these models at scale.
And Kim, I know your work was mentioned by Alexis, and you're always looking many, many years ahead of the rest of us. How do you manage that from a research and foundational perspective? Yeah, in addition to all of the amazing product use cases that we just heard about, we're also evolving our infrastructure to deal with the research use case. What's great about this is that it gives us a preview of what's potentially to come, and we can evolve the hardware infrastructure, the compute, the networking, and the storage in anticipation of what's coming.
As we've evolved from recommender systems into, more recently, generative AI, we were able to anticipate: what does this mean from an infrastructure perspective? How are the demands different? What bottlenecks might we run into? And how can we evolve our designs, not just at the hardware level, but at the software level as well?
Because we also get a preview of the usability pain points that people will have. Aparna mentioned PyTorch; PyTorch originated out of the research organization, generally just to solve a problem. The problem was: I want to be able to focus on the model itself; I don't want to have to worry about the infrastructure.
So let me just quickly build some infrastructure that lets me focus on what I want to be focusing on. Basically, how do I hide the complexity and not have to spend too much time and energy to get what I need?
And you all mentioned AI transformations, and more recently GenAI. I think nothing has captured the hearts and minds of people as much as GenAI has; it's been a transformation. As we think about generative AI, and you mentioned the research side, we couldn't have done the product use cases without that research.
And how is infrastructure impacted when we make this massive change to generative AI? Alexis, do you want to talk a little bit about generative AI and how it's impacted your team? Absolutely. The first thing I would say is scale, pure scale. We're on our third purpose-built AI system with our internal hardware designs today.
And as we look to the future, the generative AI workloads and needs require much more: the models are much more complex, and they require much larger scales of GPUs. As an example, whereas traditional AI workloads may run on tens or hundreds of GPUs at a time, generative AI workloads are being run on thousands, if not more. So the actual requirements have changed as well.
Our recommender models were traditionally memory- or network-bound. Generative AI is incredibly compute-intensive, so the compute density needs to increase. What does this mean for us when we build? We are building much larger clusters. We need to be able to shard our workloads across many different GPUs. We're thinking about how to optimize these systems, and power consumption is also rising.
So we've done a lot of advance work figuring out how we cool these systems, and how we keep cooling them into the future. These are also incredibly capital-intensive systems. When we think about that, it's one of the main reasons we have the opportunity to innovate end to end, and it's one of the reasons we kicked off our internal custom silicon development many years ago.
That way we can optimize our specific clusters and our specific infrastructure to meet the performance, power, and capital efficiency required for the evolution of AI workloads. And Aparna, we've been talking about these product experiences since way before generative AI picked up.
So how do you manage planning across all these products, because it's not one size fits all? I mean, generative AI has been an interesting evolution, and I know Alexis talked a bunch about the systems impact and the data center impact. I'll maybe mention a couple of areas on the software side.
First, for training, these jobs run for weeks and months, not hours. So fundamentally, the way we think about checkpointing, scheduling, fault tolerance, and being able to find capacity that's contiguous, all of these, I think, are new challenges that surface because of the nature of generative AI.
Fortunately, in infra, we have an incredible core infra team that's been working on large-scale problems for a long time, so we're able to pull from that expertise and build on it to support these really large jobs. And it's a totally different ballgame running this in production versus smaller versions of it in research, and doing it predictably, I think, is really hard.
The other thing I'll say is on the inference side: we're finding that GenAI requests end up being about a thousand times more expensive than a ranking or recommendation request, just by the nature of the models themselves. And so there's just an incredible opportunity.
We have, I think, one of the best teams in the industry working on all of the performance optimizations here, everything from new runtimes to changing the languages of our runtimes to be more performant. We're working very closely with Alexis's team and your team, Irene, co-designing these models so that we can build more performance into the models themselves. And we're working on really innovative, creative ways of serving these GenAI models, with different tiers for different context windows and things like that. So it's very exciting all around.
One of our differentiators is our open science philosophy. When you think about open source, a lot of that comes from your team; how do you think about it? Yeah, I'm really proud of the work that we're doing with open source.
I mean, the research teams have open-sourced LLaMA, our GenAI models, and not just the models, but also the weights, and that's just unprecedented in the industry. I think people are finding that really valuable in the community. We've open-sourced PyTorch, which is the leading machine learning framework for all model development, and recently we also handed PyTorch over to the PyTorch Foundation, under the Linux Foundation, to enable an ecosystem.
And I think it's really important, right? As research continues to evolve at this crazy pace, you want all of these investments to be applicable to many people in the community so we can further the research together. We've also developed PyTorch 2.0, recently released, with just about double the performance. If you caught Peng's talk earlier today, she talked about TorchDynamo and TorchInductor, which are part of the PyTorch 2.0 release, really exciting technology. So overall, at Meta, we're committed to open source, we're committed to enabling research across the communities, and I feel really proud about it.
And Kim, I know your team was very good at predicting the generative AI trend before some of these launches. What other areas of AI do you see coming in the future?
Yeah, for sure the big boom in generative AI has been amazing for the AI research community. It has opened a lot of doors, but it's also exposed a lot of investments that we need to be making as we look forward. First of all, the models themselves are really only as good as the data used to train them, and therefore we're going to need fundamental investments in data and data infrastructure, in understanding how much data we need, and in pruning that data. That's a real key investment that we need to make and should make.
Second, when we look at generative AI today, there are a lot of single-modality efforts; basically, we have large language models. Going forward, we're really going to need to understand different modalities besides language, and then also multiple modalities at once: images with text, speech with video. So that's an investment we also need to make.
And when we look at generative AI from a use case perspective, it's rarely a one-shot deal. Usually there's some iterative process of generating and regenerating, prompting and tuning. That's going to open the door for a lot of tools we're going to need to build to facilitate that process. And finally, the sky's the limit in terms of the applications we're going to see coming out of generative AI, so I'm really excited to see what happens there.
So with AI, it's really easy to create silos, right? Whether you're working on networking, data center development, or software. How do we avoid that? Because the company is big, and we're doing a lot of different things.
Yeah, I've actually seen quite the opposite in practical experience here. Our teams have always worked closely together, but going down this AI path over the last many years has required us to work very, very tightly coupled across teams across the entire company.
Anywhere from software to hardware to data centers to networking, you name it. We're all moving super fast, and in order for us to deliver a very integrated stack that is really reliable, it requires us to work very, very closely.
And success here is very dependent on having a common vision together and being able to execute seamlessly across this space. So it's really been all about that.
Each and every day, I'm really privileged to work alongside these industry experts, as well as many others at the company, which has been great, and it's really enabled us to move very quickly together.
So I think, as we look forward, our opportunity, and our challenge as well, is to continue to execute in a very seamless fashion as we've been doing, but in the face of rapidly changing technology and rapidly changing use cases, both for what we're doing today and into the future.
Can I add to what Rachel said? This group is so amazing. I get to work with them every day. So I just wanted to shout out to all these incredible partners that I get to work with every day.
AI is one of those interesting things: we have to work together across the entire stack to make AI successful. But when you talk about silos, it occurred to me that it's not just about silos within the company or within infrastructure. We also really need to work hard to remove silos across the industry. So this group is building the future.
So as we wrap up this discussion, I want to hear from all of you what you envision for the next 10 years of infrastructure.
Rachel?
Sure. Well, from a data center perspective, I'm confident that the data center design that we're talking about, our next generation data center, will be able to support the business for years to come.
This is because we've really optimized for flexibility across power, cooling, hardware, and network, as well as fungibility to support various AI workloads, products, and services.
So I'm confident about that. But that said, this is a really rapidly evolving space. So we're going to continue to innovate on our design and really continue to think about how we can support the business as needs dictate.
The other thing I want to call out as we look towards the future: of course, sustainability is really important. Climate change is really important. As an industry, we're all working on how to bend that curve. And so that is also a very core focus for us here at Meta and our infrastructure.
And so we have our net-zero target: getting to net-zero emissions by 2030, across our entire value chain, is something we are focused on. That impacts the way we're looking at designing our data centers to reduce our emissions, as well as our hardware, our network, and so on.
So that's going to be a really big focus for our infrastructure as we move forward, as well as, of course, the future of AI.
And Alexis?
Yeah, well, I certainly hope that in 10 years we'll have AI systems that will be generating, building, and designing our next-generation silicon and systems. But before that arrives, I believe the next 10 years will see an incredible amount of customization: application-specific and purpose-built compute platforms.
We'll see tremendous evolution in the network and in high-performance networking at scale. We'll have to innovate and deliver breakthroughs in every area that's a bottleneck today in order to enable the compute systems of the future. So if that's not exciting, I'm not really sure what is. Kim?
Sure. So first of all, I'll just say that AI is definitely here to stay. So far, we've already seen two waves of AI.
Right. We saw the recommender system era, and more recently we've seen generative AI. This is not going to be the last wave; I envision probably at least two more waves in the next 10 years. So it's going to be really, really important that we stay ahead of that, understand when the trends are shifting, and respond accordingly with our infrastructure so that we're positioned to take full advantage of each of those waves.
So I think it's going to be a really exciting time.
And Aparna, how do you feel about these waves?
I mean, I think about AI for infra. I see infra evolving with AI over the next 10 years; we've already seen this. I don't know if you caught Michael Bolin's talk today, where he talked about CodeCompose. We're working on AI-augmented productivity within Meta, where we're actually helping engineers write code, augmenting their work, and giving them knowledge that would otherwise take a lot of effort to find. So I expect to see AI transforming various parts of infra. Alexis alluded to silicon designs being generated by AI, and I see scheduling systems having really smart algorithms that are powered by AI. I see lots of infra evolving with AI. It's going to be exciting.
And as we wrap up, I know this is the group that keeps our services and products stable for tons of users, and I'm also so inspired that you're building the future of what AI and its infrastructure will be. So I'm really honored to share this stage with all of you, and I know you'll be leading that future in building AI, not just at Meta, but across the industry at large. So thank you all for joining us. Thank you so much, Alexis, Aparna, Rachel, Kim, and of course, Irene. We're so grateful that you could join us today. That's a powerful group of women.
As you heard throughout the day today, folks, we're translating our AI and infra investments into new experiences across our whole family of apps, while ensuring that we have the ability to drive long-term research and technical innovation. Over the next decade, we can expect the needs for AI training and inference to ramp up pretty dramatically, and we will need to scale with it. We'll probably see increased specialization and customization in chip design. We will see purpose-built and workload-specific AI infra, and we'll see new systems and probably new tooling for deployment at scale, along with improved efficiency in product and design support. All of this will deliver increasingly sophisticated models, probably built on the latest research, which is moving pretty fast these days, and products that give people around the whole world access to new and much more relevant experiences in their day-to-day lives.
In the days, the months, and the years ahead, we'll continue to update you on our AI infra journey. Again, I want to thank all of you for joining us today. If you have any questions you'd like us to answer, or topics you'd like us to cover in future @Scale events or on Meta's technical blogs, just visit our website at scaleconference.com or scan the QR code you have right on your screen. Also, please make sure to check out our next upcoming events: Systems @Scale and Networking @Scale, both taking place in July. Registration is open right now, I believe, so sign up, and please join our mailing list for the latest news.
I've spent 13 years now here in infra, and I've seen how we are not only able to react to the present but also look forward towards the future and make that future a reality. We're always focused on the long term, and when I think about the next 5, 10, maybe 15 years, I'm pretty sure that when we look back at this point in time, it will be a pivotal moment for us as we continue to build on our position as a leading AI company in the next decade and beyond. And in infrastructure, the team we're building today, at scale, will bring a lot of this vision to life. Thanks again for joining us, and have a great day. Thank you for joining us.